Ruby regex for comma separated unique numbers using negative lookahead

102 Views Asked by At

Following is a regex which matches a comma-separated list of numbers and allows only unique numbers to be input:

^(?!.*\b(\d+)\b.*\b\1\b)(\d)(,(\d))*$

For e.g it allows 1,2,3 but disallows 1,1 or 2,1,1.

Can anybody please explain in simple language how that works?

Where I am confused is the negative lookahead assertion. In the explanations available on web about it shows the syntax as following

The syntax is: X(?!Y), it means "search X, but only if not followed by Y".

Ref: https://javascript.info/regexp-lookahead-lookbehind#negative-lookahead

But in my shown regex it doesn't follow the above syntax. Then how it works?

When it matches 1,2,3 what matching process happens?

When it doesn't match 2,1,1 what matching process happens?

When 1,2,3 or 2,1,1 are matched are they first matched against following the part of regex (\d)(,(\d))* and then its match result is asserted against the negative lookahead part (?!.*\b(\d+)\b.*\b\1\b) or first the negative lookahead part runs and then the remaining part?

Also if I remove .* before and after \b in negative lookahead part then it also starts matching 1,1 or 2,1,1. So what's the significance of the removed .* in negative lookahead part.

Note: I want to use the regex in my Ruby code in case it is important to inform about.

Source references for the regex are

https://stackoverflow.com/a/45946721/936494

https://stackoverflow.com/a/45944821/936494

Thanks.

3

There are 3 best solutions below

0
Cary Swoveland On

The negative lookahead

(?!.*\b(\d+)\b.*\b\1\b)

following a beginning-of-string anchor (^), asserts, "It is not the case ((?!...)) that, after skipping zero or more characters (other than line terminators) (.*), there exists a string comprised of one or more digits (\d+), saved to capture group 1 ((\d+)), that is preceded and followed by a word boundary (\b), and that is followed by one or more characters (other than line terminators) (.*) that is followed by the contents of capture group 1 (\1), preceded and followed by a word boundary".

Therefore, the negative lookahead fails for the string 'a 12, *34 *12' (because '12', preceded and followed by a word boundary, is repeated), whereas it succeeds for the string 'a 12, *34 512' (although '12' is repeated, the second '12' is not preceded by word boundary). If the lookahead fails there can be no match so the rest of the regex is not evaluated. If it succeeds the regex engine continues to evaluate the rest of the regex.

Since the regex begins with the start-of-string anchor (^), the lookahead, if satisfied, won't move the regex's string pointer from the start of the string, in which case the regex effectively becomes

^(\d)(,(\d))*$

This satisfies, for example, '1,2,3'. '1' would be saved to capture group 2 (capture group 1 was used in the lookahead), ',2' would be saved to capture group 3, '2' would be saved to capture group 4, ',3' would be saved to capture group 3 (overwriting ',2') and '3' would be saved to capture group 4 (overwriting '2'). Demo

In view of the part of the regex that follows the negative lookahead we see that the negative lookahead could be tightened as follows1:

^(?!(?:\d,)*(\d),(?:\d,)*\1(?=,|\z))(\d)(,(\d))*$

Demo

1. I've used the positive lookahead (?=,|$) (rather than (?=,|\z)) at the link in order to test the regex against multiple strings.

0
sln On

You have to be consistent with the exclusion (top)

^
(?!
   .* \b 
   ( \d+ )                       # (1)
   \b .* \b \1 \b 
)

with the bottom

( \d+ )                       # (2)
(                             # (3 start)
   ,
   ( \d+ )                       # (4)
)*                            # (3 end)
$  

Use \d+ throughout.
If you just want single digit \d, use that throughout.

^(?!.*\b(\d+)\b.*\b\1\b)(\d+)(,(\d+))*$

https://regex101.com/r/EnAPJW/1

0
Todd A. Jacobs On

A Pragmatic Solution Instead of a Regexp Explanation

I'm going to let other people address the underlying regexp portion of your question. I'm going to address the X/Y problem here, which is that while you can do certain things with regular expressions, it can create unnecessary complexity. In addition, other methods such as Array or String methods are often faster.

Splitting on String Literals Instead of Regexps

The rule of thumb for good code should be to extract a pattern as simply as possible and then use other methods to manipulate it. Consider the following examples, depending on whether you want to print out unique values, or only digits that have no duplicates in the String:

STR = "1,6,6,7,3,2,1,2,3,6"

# return only one of each number found, sorted
STR.split(?,).uniq.sort.map &:to_i
# => [1, 2, 3, 6, 7]

# return only Integer values not duplicated in
# the String
STR.split(?,).tally.select { _2.eql? 1 }.keys.map &:to_i
# => [7]

Either way, this is faster and easier to follow than a complex regexp.

Splitting on Regexp Literals

If you insist on a Regexp you could replace String#split with a Regexp literal or with String#scan depending on your needs. For example, consider the two roughly equivalent uses of #split and #scan with a Regexp instead a String literal:

# scan String for one or more consecutive
# digits
STR.scan(/\d+/).uniq.sort.map &:to_i
#=> [1, 2, 3, 6, 7]

# split String on commas surround by
# optional whitespace
STR.split(/\s*,\s*/).uniq.sort.map &:to_i
#=> [1, 2, 3, 6, 7]

This is a little more forgiving if your String isn't regular, e.g.:

STR = " 1,6,  6 , 7,3 ,2,1,\t2\t, 3 ,6"

although you could certainly just insert String#strip into your method chain to achieve similar results. Here's an example of splitting on , as a String literal even given the non-standardized input values above:

STR.split(?,).map(&:strip).uniq.sort.map &:to_i

General Considerations

The main points to note here are:

  1. That you don't need regular expressions at all to solve this particular problem.
  2. Even if you use regexps, make your code simpler and easier to read by using regexps with simple patterns, then do more complex manipulations of the result with other methods so you can better inspect your intermediate results.

When you come back to read your code in six months, method chains will be a lot easier to debug than a long, complex regular expression especially if you're using atoms or regexp constructs that you don't find easy to understand when you write them the first time. Your future self will thank you for the kindness!