Combine 2 regex matches

I have 2 regex expressions which are capturing both instances of what I need. However, I need to combine them so I only have 1 regex. These are complicated to me, and I'm not exactly sure how to combine them or how they are even matching what I'm looking for. I'm hoping that someone can tell me not only how to do it but kind of break it down so I can tell what is happening.

The 1st one captures fields separated by commas. It correctly ignores commas that sit between parentheses, but not commas between single quotes. The 2nd one handles commas between single quotes, but it only handles parentheses when the field starts with a parenthesis.

 /((?:[^(),]+ | ( \((?: [^()]+ | (?2) )*\) ))*)(?: ,\s* | $)/xg
 /(?:^|\s*,)\s*( '[^']*' | \([^)]*\) | [^,]*?(?=\s*(,|$)) )/xg

I asked a similar question before, but I wanted to rewrite it to be a little clearer. Here is the string I'm processing, followed by the output of the 1st regex, the output of the 2nd, and what I'd like to get.

String - "10507, 'KEY,CUST', NAME(FIRST,LAST), (FIRST,LAST)"

Example 1:  field 0 is 10507
Example 1:  field 1 is 'KEY
Example 1:  field 2 is CUST'
Example 1:  field 3 is NAME(FIRST,LAST)
Example 1:  field 4 is (FIRST,LAST)

Example 2:  field 0 is 10507
Example 2:  field 1 is 'KEY,CUST'
Example 2:  field 2 is NAME(FIRST
Example 2:  field 3 is LAST)
Example 2:  field 4 is (FIRST,LAST)

Expected    field 0 is 10507
            field 1 is 'KEY,CUST'
            field 2 is NAME(FIRST,LAST)
            field 3 is (FIRST,LAST)

brian d foy On BEST ANSWER

"So I'm hoping that someone can tell me not only how to do it but kind of break it down so I can tell what is happening."

You're using the /x regex flag, which makes literal (unescaped) whitespace insignificant. You can spread out the pattern and add comments so you can see what each part does, like this:

/
    (                       # start capture group 1
        (?:
            # start a field (can't be any of these)
            [^(),]+         
            
            | 

            # start capture group 2, matching literal paren groups
            (               
                \(    
                (?: 
                    [^()]+ 
                    | 
                    (?2) 
                )*
                \) 
            )
        )*
    )
    
    # handle the next field or the end of the string
    (?: 
        ,\s* 
        | 
        $
    )
/xg

When I first looked at this regex in your previous question, I already spotted problems, such as not handling escaped versions of characters. You could add more branches for each special case, but you end up with pages of code. Someone might come up with something manageable, but I'm not going to spend my time thinking about that.
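To show what "one more branch" looks like, here is a minimal sketch that adds a single-quoted alternative to the first pattern. It's an illustration under narrow assumptions, not a fix for the other problems: it assumes quotes are never escaped, parentheses are always balanced, and no field is empty. I also changed the outer `*` to `+` so the pattern can't match an empty field at the end of the string. The comments say "group 1" rather than using the variable name because Perl interpolates variables in a pattern even inside /x comments.

```perl
use strict;
use warnings;

my $string = "10507, 'KEY,CUST', NAME(FIRST,LAST), (FIRST,LAST)";

my $field = qr/
    (                           # capture group 1: one whole field
        (?:
            [^(),']+            # ordinary characters (quotes now excluded too)
            |
            '[^']*'             # added branch: a single-quoted run, commas and all
            |
            (                   # capture group 2: a balanced paren group
                \(
                (?: [^()]+ | (?2) )*
                \)
            )
        )+                      # at least one piece per field
    )
    (?: ,\s* | $ )              # separator or end of string
/x;

my @fields;
while ( $string =~ /$field/g ) {
    push @fields, $1;
}

print "field $_ is $fields[$_]\n" for 0 .. $#fields;
```

For the sample string this should give the four fields you listed as expected. Feed it an escaped quote or an unbalanced parenthesis and you are back to adding branches.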

Beyond that, there are some useful tools to watch a regex work. Regexp::Debugger lets you single step through a match. See my article Watch regexes with Regexp::Debugger, which includes an animation of it in action. It's actually quite magical; run this in a terminal:

use Regexp::Debugger;

my $pattern = qr/((?:[^(),]+ | ( \((?: [^()]+ | (?2) )*\) ))*)(?: ,\s* | $)/x;

my $string = "10507, 'KEY,CUST', NAME(FIRST,LAST), (FIRST,LAST)";

$string =~ m/$pattern/g;

If you want to keep going with the recursive solution, you might study Randal Schwartz's recursive regex to parse JSON. You can define grammars within the pattern. The pattern is interesting as an exercise (and I go into much more detail in Mastering Perl, 2nd Edition).

There are two general mistakes people make with regexes:

  1. They try to do the entire job in one pattern. We have a trick exercise in Learning Perl where it's much easier to accomplish and much easier to understand and to maintain as two separate patterns.
  2. They force regexes onto a problem past the point where regexes are useful, or they try to salvage or adjust patterns that are already misguided instead of starting over. Once you start struggling with special cases, a regex that solves the problem quickly becomes unmaintainable. I think that's where you are: you inherited some broken regexes and you are trying to salvage them.

I don't think there's much value in spending too much time on these patterns though since they are broken in other ways. That's why I broke it out into several much simpler patterns in my answer to your previous question.
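As a rough idea of what "several much simpler patterns" can look like, here is a small scanner that walks the string with \G and /gc and tracks the parenthesis depth in an ordinary variable. This is a sketch written for this page, not the code from the previous answer. The name `split_fields` is hypothetical, and it assumes single quotes are never escaped and that an unterminated quote should be taken literally.

```perl
use strict;
use warnings;

# split_fields is a hypothetical helper: it splits on commas that are
# outside single quotes and outside (possibly nested) parentheses.
sub split_fields {
    my ($string) = @_;
    my @fields;
    my $field = '';
    my $depth = 0;                      # current parenthesis nesting level

    pos($string) = 0;
    while (1) {
        if ( $string =~ /\G ( '[^']*' ) /gcx ) {        # complete quoted run
            $field .= $1;
        }
        elsif ( $string =~ /\G \( /gcx ) {              # opening paren
            $depth++;
            $field .= '(';
        }
        elsif ( $string =~ /\G \) /gcx ) {              # closing paren
            $depth-- if $depth > 0;
            $field .= ')';
        }
        elsif ( $depth == 0 and $string =~ /\G , \s* /gcx ) {
            push @fields, $field;                       # top-level separator
            $field = '';
        }
        elsif ( $string =~ /\G ( [^'(),]+ | . ) /gcxs ) {
            $field .= $1;                               # anything else
        }
        else {
            last;                                       # end of string
        }
    }
    push @fields, $field;
    return @fields;
}

my @got = split_fields("10507, 'KEY,CUST', NAME(FIRST,LAST), (FIRST,LAST)");
print "field $_ is $got[$_]\n" for 0 .. $#got;
```

Each branch is a pattern you can read at a glance, and a new special case becomes one more `elsif` instead of another alternation buried in a recursive pattern.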