Pattern Matching

I’ve just recently had a task where I had to do table lookups by a unique Code.
It was a pretty special sort of Code though. The first character has special meaning and the last two characters have special meaning.

I had the option of creating a complicated mapping table, but then I recalled my unix days… Regular Expressions!
Regular Expressions are great. They provide syntax for advanced string pattern matching.

In this example I want to find the Code which starts with a digit and ends with the letter J followed by the number 9.
The text in cell A3 is the pattern which matches what I’m looking for. In this case, that pattern matches the Code 9SGJ9.
I could have used ^(S|8)YRUP$ to match 8YRUP (or SYRUP if it were in the list of Codes).
The pattern matching syntax is sophisticated. The documentation is really worth a look.

The formula in cell B3 is:
=INDEX(B6:B15, RegMatch($A$6:$A$15, $A$3))

RegMatch is a User Defined Function in VBA for handling the pattern matching.
Regular Expression functions are already provided by Microsoft’s VBScript library so I just wrote a wrapper around it and called that RegMatch.

You’ll need to reference “Microsoft VBScript Regular Expressions”.
On my machine it’s version 5.5, but a later version is available for download from Microsoft’s website.

Dim reg As New RegExp
 
Public Function RegMatch(Source As Range, Pattern As String, Optional IgnoreCase As Boolean = True, Optional MultiLine As Boolean = True) As Long
    Dim rng As Range, i As Long, j As Long
 
    reg.IgnoreCase = IgnoreCase
    reg.MultiLine = MultiLine
    reg.Pattern = Pattern
 
    i = 0: j = 0
    For Each rng In Source
        i = i + 1
        If reg.test(rng.Value) Then
            j = i
            Exit For
        End If
    Next
    RegMatch = j
End Function

As I’m writing this post I’ve been looking on the Internet for other examples.
A great newsgroup post by Harlan Grove offers similar wrappers around the scripting library.

The documentation for VBScript Regular Expressions is available from Microsoft’s website. Personally, I prefer the offline version.

29 Comments

  1. That’s cool, Rob. Here’s another link

    http://www.tmehta.com/regexp/

    which I can only assume is Tushar Mehta.

  2. JWalk says:

    Very nice, Rob. I didn’t know about that. I’ve tried to do similar things using VBA’s Like operator, but this is definitely a better choice.

  3. brettdj says:

    Hi Rob,

    I’ve been mucking about with RegExp for a while and found it useful for a variety of Excel tasks from rearranging peoples names, replacing specific absolute cell references with relative references, to extacting the sheet names from a multi sheet RefEdit box range.

    As I had a number of requests for some examples, I pulled together a small file showing different RegExp methods (Test, Execute & Replace) and uses for RegExp with Excel

    Hopefully this link at VBAX is accessible
    http://www.vbaexpress.com/kb/getarticle.php?kb_id=68

    there is a discussion thread on this example @
    http://www.vbaexpress.com/forum/showthread.php?t=226
    but I think you need to be a VBAX member to see it

    Cheers

    Dave

  4. XL-Dennis says:

    I agree :)

    Here is a link to a workbook that contain 7 examples for regular expression from my friend Dave (aka brettdj):

    http://www.vbaexpress.com/kb/getarticle.php?kb_id=68

    And if You want to test expression or learn more then this free tool - The Regulator - may be of interest:

    http://regex.osherove.com/

    Kind regards,
    Dennis

  5. brettdj says:

    LOL :)

    Dennis, that’s the first time I’ve been browsing through here for a month or more, and we end up posting a minute apart.

    Thanks for the recommendation.

    I hope all is well with you and your family. All is well here with our newborn but I’m travelling a lot for work at the momement

    Cheers

    Dave

  6. XL-Dennis says:

    :)

    I will always associate Regular Expression with You ;)

    Kind regards,
    Dennis

  7. Hi.

    Some great links here - thanks.

    I too have a newborn - well, 7 months… some of the reason I’ve not been posting recently ;)

    In the above example I used it for Code lookups.

    My other use for RegExp in this particular task was for massaging the output of DOS program.
    The DOS program creates a fixed column width report, divided into sections - with Headers, Footers, Subnotes etc…
    I used Regular expressions to detect and split sections. Then detected headers and footers for removal. Removed lines with a blank code (subnotes).
    Removed empty lines.
    I was finally left with only data in fixed width column ready for transfer.
    In all, about 10 lines of code (each calling a generic RegReplace function)

    I dread to think what the non-RegEx equivalent would have been.

  8. Jon Peltier says:

    “I dread to think what the non-RegEx equivalent would have been.”

    I wrote the non-RegEx version of this a few years ago. I was parsing mainframe output in an antique QA application, for which we fortunately had a terminal emulator with a VBA object model (the in-house VB programmer couldn’t figure out how to work it, and wouldn’t listen to a mere engineer–me–explain it to him). I’d get the output, then line by line search for recognized phrases, then go to the branch of the code that parsed that particular line or section. It would break every second time I ran it, because it would encounter yet another different abbreviation in the output. RegEx would have helped immensely.

  9. Ben says:

    This is a great bit of code for RegExp functions that actually work!!
    Wish i’d seen this a long time ago when i first started my involvment with excel (after a long time working on unix).

  10. doco says:

    Here is an example from Microsoft’s site

    Function TestRegExp(myPattern As String, _
                        myString As String) As String
    Dim objRegExp           As RegExp
    Dim objMatch            As Match
    Dim colMatches          As MatchCollection
    Dim RetStr              As String
       
    '   Create a regular expression object.
       Set objRegExp = New RegExp

        With objRegExp
        '   Set the pattern by using the Pattern property.
           .Pattern = myPattern
       
        '   Set Case Insensitivity.
           .IgnoreCase = True
       
        '   Set global applicability.
           objRegExp.Global = True
       
        '   Test whether the String can be compared.
           If (.Test(myString) = True) Then
       
            '   Get the matches.
               Set colMatches = .Execute(myString)     '   Execute search.
       
                For Each objMatch In colMatches         '   Iterate Matches collection.
                   With objMatch
                        RetStr = RetStr & "Match found at position "
                        RetStr = RetStr & .FirstIndex & ". Match Value is '"
                        RetStr = RetStr & .Value & "'." & vbCrLf
                    End With
                Next
            Else
                 RetStr = "String Matching Failed"
            End If
        End With
       
       TestRegExp = RetStr
       
    End Function
  11. doco says:

    That did’nt work well. I think you can read it anyway. I can never remember what the tags are for code on this site. ‘[]‘ ?

  12. doco: There’s a paragraph above the comments to remind you. It’s [ vb ] without the spaces.

  13. flee01 says:

    Another solution is to use iserror(match(”*” & yourSearchCriteria & “*”,yourCell,0)) as a formula.

    If it returns true your criteria is not found.

  14. Emma says:

    Hi everyone

    Can someone give me some more pointers on how to make the “RegExp” Object available to me in Excel?

    I went to the Microsoft website, and installed “Windows Script 5.7″. Is that the right thing? If it is, what else do I have to do to make my VBA macros recognize this object? I tried to initialize both “VBScript_RegExp_57.RegExp” and just “RegExp”, but it does not recognize either.

    I’m not at all familiar with VBScript so I might be completely on the wrong track. But browsing the Microsoft website didn’t help me much…

    Thanks in advance.
    Emma

  15. Emma: From VBA’s menu, Tools > References. Tick “Microsoft VBScript Regular Expressions”

  16. [...] As Boolean = True, Optional MultiLine As Boolean = True) As Boolean ‘ Modified from http://www.dailydoseofexcel.com/archives/2005/08/13/pattern-matching/    Dim reg As New RegExp     reg.IgnoreCase = IgnoreCase    reg.MultiLine = MultiLine  [...]

  17. Ramesh Vasudevan says:

    All:

    I have following question on the parsing and using regular expressions.

    The following information need to be parsed and provide the results as follows:

    Input:
    123 Technology, Inc.
    Access Info, LLC
    First systems, llc
    Level 5, Ltd
    Level 3, LLC

    Expected Output:
    123 Technology
    Access Info
    First systems
    Level 5
    Level 3

    How do I get the results in Excel using regexp or existing formulas.

    Thank You in advance.
    Ramesh V.

  18. PBorralho says:

    Ramesh, the formula that you need is:

    =LEFT(A1,FIND(”,”,A1)-1)

    (assuming the input text is at cell A1)

  19. Bill Ghauri says:

    As a first time poster but long time reader of DDoE, many thanks to all who post wonderful solutions and ideas.

    Here is a problem that I have been grappling with and hope someone can provide me with some ideas that will lead me towards a solution.

    I have a model that processes real-time data and outputs to matrix table with either a “1†or “0.†The matrix table consists of eight letters across the top (“A†to “Hâ€) and ten rows down (1 to 10), i.e. a table with 80 cells. As the model runs, it populates these 80 cells with either a “1†or “0.†Now I concatenate all these 1’s and 0’s into a string that might look like something like “B7C3E8.†So as you can imagine this would imply a “1†in the “B†column, row 7, a “1†in the “C†column, row 3, and a “1†in the “E†column, row 8. The rest of the cells would have “0’s.â€

    I have a library of patterns that I need to see if they are contained in the model output string. So for example, I may have a pattern called “A3A7B3E9†and need to see if this exists within the model output string which might look like “A2A3A7B1B3B6C1C8D7D8E2E9F1H3H5.†In this specific example, there is a pattern match because “A3â€, “A7â€, “B3â€, and “E9†are contained within the model output string.

    I have been trying to use RegMatch function, but so far I have had limited success. I can get a match in some cases, but the example above will not be picked up by the function as a match. Is what I’m trying to do possible with the RegMatch function or is there a better approach?

    I should note that I am currently matching patterns by having two identical tables, one that is the model output and the other that is the desired pattern and then comparing the two as arrays. This actually works quite well, but as my library of search patterns has grown, I now need to loop through my library of patterns and identify the matches in a more streamlined fashion. Any help or ideas would be much appreciated.

    Best regards,
    Bill

  20. Rick Rothstein (MVP - Excel) says:

    @Bill Ghauri

    Are your patterns always ordered in the same way… all the A’s are listed first in numerical order, then the B’s are listed next in numerical order, followed by the C’s, D’s, etc. each numerical order (as your example shows)? If yes, and if the pattern is also listed in with this ordering, then you use this function which returns True if the pattern exists and False otherwise…

    Function IsPatternMatch(ModelOutputString As String, _
                          Pattern As String) As Variant
      Dim X As Long, WildcardPattern As String
      For X = 1 To Len(Pattern) Step 2
        WildcardPattern = WildcardPattern & "*" & Mid(Pattern, X, 2)
      Next
      IsPatternMatch = ModelOutputString Like WildcardPattern & "*"
    End Function

    And, using your posted pattern and model output string, you would use it like this…

    Sub Test()
      If IsPatternMatch("A2A3A7B1B3B6C1C8D7D8E2E9F1H3H5", "A3A7B3E9") Then
        MsgBox "Yes, the pattern matches."
      Else
        MsgBox "No, the pattern does not match."
      End If
    End Sub
  21. Bill Ghauri says:

    Rick,

    Yes, the patterns are always ordered the same way… in alphabetical and numerical order. Your function works like a champ! It is exactly what I needed. Thank you so much.

    Best regards,
    Bill Ghauri

  22. Rick Rothstein (MVP - Excel) says:

    @Bill Ghauri

    Sorry Bill, but there is a small problem with the function I posted… it will not work correctly if the Pattern contains any cells from the tenth row. Here is an amended function which should correct that problem…

    Function IsPatternMatch(ByVal ModelOutputString As String, _
                          ByVal Pattern As String) As Variant
      Dim X As Long, WildcardPattern As String
      Pattern = Replace(Pattern, "10", "X")
      ModelOutputString = Replace(ModelOutputString, "10", "X")
      For X = 1 To Len(Pattern) Step 2
        WildcardPattern = WildcardPattern & "*" & Mid(Pattern, X, 2)
      Next
      IsPatternMatch = ModelOutputString Like WildcardPattern & "*"
    End Function
  23. fzz says:

    My first thought is that this would be a very good situation in which to use Longre’s MOREFUNC.XLL add-in. Doing so, this problem could be reduced to formulas like

    =REGEX.COUNT(string,REGEX.SUBSTITUTE(”.*”&pattern,”([A-Z]+\d+)”,”[1].*”))

    This could be supplemented with some initial validity checking on string and pattern.

    =CHOOSE(1+REGEX.COMP(string,”^([A-Z]+\d+)+$”,1)+2*REGEX.COMP(pattern,”^([A-Z]+\d+)+$”,1),
    -3,-2,-1,REGEX.COUNT(string,REGEX.SUBSTITUTE(”.*”&pattern,”([A-Z]+\d+)”,”[1].*”)),-1)

    which returns -3 if both string and pattern are invalid, -2 if only pattern is invalid, -1 if only string is invalid, 0 if valid pattern isn’t found within string, and 1 if valid pattern is found within string.

    OTOH, if add-ins aren’t allowed but VBA is, more general is better than more particular. Rick’s revised UDF works given the specs, but if the specs change, e.g., range expands to row 11 or further and/or to column X or further, it fails. Myself, I’d use a state machine and have the udf perform validity checking. Something like

    Function matchpat(pat As String, str As String) As Long
      'first Const for case insensitive, second for case sensitive/only upper case valid
     Const LETTERS As String = "[A-Za-z]"
      'Const LETTERS As String = "[A-Z]"

      Dim k As Long, st As Long, ch As String * 1, apat As String

      For k = 1 To Len(pat)
        ch = Mid$(pat, k, 1)

        If st = 0 And ch Like LETTERS Then
          apat = apat & ch
        ElseIf st = 0 And ch Like "#" Then
          apat = apat & ch
          st = 1
        ElseIf st = 1 And ch Like "#" Then
          apat = apat & ch
        ElseIf st = 1 And ch Like LETTERS Then
          apat = apat & "*" & ch
          st = 0
        Else
          st = 0
          Exit For
        End If
     
      Next k

      matchpat = IIf(st = 1, 0, -2)
      st = 0

      For k = 1 To Len(str)
        ch = Mid$(str, k, 1)

        If st = 0 And ch Like LETTERS Then
          'nothing to do
       ElseIf st = 0 And ch Like "#" Then
          st = 1
        ElseIf st = 1 And ch Like "#" Then
          'nothing to do
       ElseIf st = 1 And ch Like LETTERS Then
          st = 0
        Else
          st = 0
          Exit For
        End If

      Next k

      matchpat = matchpat + IIf(st = 1, 0, -1)

      If matchpat = 0 Then matchpat = Abs(str Like "*" & apat & "*")

    End Function

    Quibble: argument ordering should be comparable to Excel’s built-in functions, e.g.,

    FIND(look_FOR,look_WITHIN[,...])

    SEARCH(look_FOR,look_WITHIN[,...])

    MATCH(look_FOR,look_WITHIN[,...])

    so

    matchpat(look_FOR,look_WITHIN)

    rather than

    IsPatternMatch(look_WITHIN,look_FOR)

  24. Rick Rothstein (MVP - Excel) says:

    @fzz and Bill Ghauri

    Responding to the comments made by fzz, this is how I would write the general case pattern search modeled after Bill’s setup; it should work for any address range in XL2003 or XL2007/2010 and beyond; that is, as long as the ordering is as Bill has defined it, it should work for any number of letters followed by any number of digits for each “address” in the pattern and model output strings.

    Function IsPatternMatch(ByVal ModelOutput As String, _
                            ByVal Pattern As String) As Boolean
      Dim X As Long
      Do
        X = X + 1
        If X <= Len(Pattern) Then
          If Mid(Pattern, X, 2) Like "#[a-zA-Z]" Then
            Pattern = Replace(Pattern, Mid(Pattern, X, 2), Mid(Pattern, _
                        X, 1) & "-" & "*" & "-" & Mid(Pattern, X + 1, 1))
          End If
        End If
        If X <= Len(ModelOutput) Then
          If Mid(ModelOutput, X, 2) Like "#[a-zA-Z]" Then
            ModelOutput = Replace(ModelOutput, Mid(ModelOutput, X, 2), _
                                  Mid(ModelOutput, X, 1) & "-*-" & _
                                  Mid(ModelOutput, X + 1, 1))
          End If
        Else
          Pattern = "*-" & Pattern & "-*"
          ModelOutput = "*-" & ModelOutput & "-*"
          Exit Do
        End If
      Loop
      IsPatternMatch = ModelOutput Like Pattern
    End Function
  25. Rick Rothstein (MVP - Excel) says:

    @fzz

    Your function does not work quite right. It will return True for this function call…

    matchpat(”A3A7B1234E10″, “A2A3A79B1234B3B6C1C8D7D8E10E9F123H3H5AB1234CC1ABC123″)

    Using your argument names, there is an A7 in the ‘pat’ argument, but no A7 in the ’str’ argument (it looks like is matched the first two characters from the A79 address).

  26. Bill Ghauri says:

    Rick and fzz,

    Thank you both for the solutions. Rick, in my initial testing, I did not notice the 10th row limitation at first, but I did later. Your third version of the function works very well. Since my actual model currently contains 80 rows (but it could easily be over 100 soon), I like the flexibility of your function to accommodate an expanded table range. Btw, to get the model output string in the proper order, I have a function that concatenates the columns first and then concatenates the column “totals” (i.e. top to bottom then left to right). This ensures the model output string is always in the proper order. Again, many thanks.

    Best regards,
    Bill Ghauri

  27. fzz says:

    @Rick- then better still to use MOREFUNC.XLL regular expressions.

    =REGEX.COUNT(REGEX.SUBSTITUTE(string,”(\d)([A-Z])”,”[1] [2]“),
    “^.*\b”&REGEX.SUBSTITUTE(pattern,”([A-Z]+\d+)”,”[1]\b.*”)&”$”)

    As for udfs without regular expressions, you’re right: delimiters between tokens are needed, but why

    … & “-” & “*” & “-” & …

    rather than

    … & “-*-” & …

    in the first Replace call?

  28. Rick Rothstein (MVP - Excel) says:

    @fzz…

    ==> then better still to use MOREFUNC.XLL regular expressions.

    I still like rolling my own code… and the function I came up with is relatively short… plus, in my opinion, will be easier to read next week than the regular expression pattern will be. [grin]

    ==> but why … & “-†& “*†& “-†& … rather than
    ==> … & “-*-†& … in the first Replace call?

    I originally had Chr$(1) instead of the dashes; then I did a Replace on the Chr$(1) and replaced them with “-”; then I noticed I could replace the “-”&”*”&”-” with “-*-”, which I did by hand… and, of course, I missed one of them.

  29. Bob Phillips says:

    Also, someone recently posted that MOREFUNC can fail in Excel 2007. Haven’t tried it myself, have never used MOREFUNC, but if that is correct it validates my decision.

Leave a Reply