regex: matching an entire line with an exact number of repeating tokens

176 Views Asked by At

I'm trying to write a regex that can match ONLY those lines where every white-space separated token occurs exactly twice, regardless of order.

For example, the entirety of the following lines should match:

1 1 2 2
100 10 10 100
A B B A 
HELLO HELLO

The following lines should NOT match:

hello hello hello
1 1 22
1001

Although I'm able to match individual groups of repetitions in a given line, using the regex (\d+)(?=.*(\1)), I am struggling to match entire lines using ^$. My guess is that when I use the lookahead, this is creating an infinite loop where we constantly look at every token (including the repetitions) and expect a repeat later in the string, although I am not sure how to resolve this. Any ideas? Thanks!

[EDIT]: To add some details, based on questions from the comments: Obviously it is fairly trivial to implement this as a function in most programming languages. However,I was originally looking to implement this as a regex, because I was trying to match certain records in a database. Thus, this regex was intended to be embedded as a CASE statement in a SQL query, which in my mind would've been a nice way make the selection.

Given the apparent complexity of such a regex, it seems that creating a function is the way to go, and thus pretty much any of the following answers would be fine solutions, circumstances depending.

5

There are 5 best solutions below

1
Nick On BEST ANSWER

You could create a function to split the string and count the occurrences of each word in it, returning true if all the words have a count of 2:

create function all_pairs(v text) returns bool as $$
  with counts as (
    select count(*) as c
    from unnest(string_to_array(v, ' ')) as vals(val)
    group by val
  ),
  arr as (
    select array_agg(c) as cc
    from counts
  )
  select 2 = all(cc)
  from arr
$$ language sql

Then you can simply call the function to test your string. For example:

with values as (
   select '1 1 2 2' as v union all
   select '100 10 10 100' union all
   select 'A B B A' union all 
   select 'HELLO HELLO' union all
   select 'hello hello hello' union all
   select '1 1 22' union all
   select '1001'
)
select * 
from values
where all_pairs(v)

Output:

v
1 1 2 2
100 10 10 100
A B B A
HELLO HELLO

Demo on dbfiddle.uk

3
Luuk On

In postgresql:


with recursive values as (
   select '1 1 2 2' as v union all
   select '100 10 10 100' as v union all
   select 'A B B A' as v union all 
   select 'HELLO HELLO' as v union all
   select 'hello hello hello' union all
   select '1 1 22' union all
   select '1001'
), nrs as (
   select 1 as x
   union all
   select x+1 from nrs where x<=10
)
select 
    v, 
    length(v)-length(replace(v,' ',''))+1 nrOfItems,
    split_part(v,' ',x) as s,
    count(*) as c
from values
cross join nrs
where nrs.x <= length(v)-length(replace(v,' ',''))+1
group by v,s
having not count(*)<>2
order by v
  • The cross join with nrs produces the parts after separation on space.
  • Then grouping on the text (v), and the part (s)
  • Having not a count unequal to 2

output:

v nrofitems s c
100 10 10 100 4 10 2
100 10 10 100 4 100 2
1 1 2 2 4 2 2
1 1 2 2 4 1 2
1 1 22 3 1 2
A B B A 4 A 2
A B B A 4 B 2
HELLO HELLO 2 HELLO 2

EDIT: Because 1 1 22 shows up, which I (sadly) overlooked.

The having not count(*)<>2 was something I did, but was not sure of, and it is wrong...

Knowing this, I change to:

with recursive values as (
   select '1 1 2 2' as v union all
   select '100 10 10 100' as v union all
   select 'A B B A' as v union all 
   select 'HELLO HELLO' as v union all
   select 'hello hello hello' union all
   select '1 1 22' union all
   select '1001'
), nrs as (
   select 1 as x
   union all
   select x+1 from nrs where x<=10
)
select 
    v
from (
select 
    v, 
    split_part(v,' ',x) as s,
    count(*) as c
from values
cross join nrs
where nrs.x <= length(v)-length(replace(v,' ',''))+1
group by v,s
)x
group by v
having case when min(c)=2 then 1 else 0 end=1

which has a correct output, see: https://dbfiddle.uk/DJEmfTIT

0
Luuk On

Or, a bash script:

#!/bin/bash

awk '{ split($0,a,FS) 
       for(i in a) b[a[i]]++ 
       for(i in b){ if(b[i]!=2) c++ } 
       if(c==0) print $0 
       delete a; delete b; c=0
      }' input.txt

When input.txt has:

1 1 2 2
100 10 10 100
A B B A
HELLO HELLO
hello hello hello
1 1 22
1001

output is:

1 1 2 2
100 10 10 100
A B B A
HELLO HELLO
5
Cary Swoveland On

A regular expression can be used to perform the required test provided the regex engine supports variable-length negative lookbehinds. That includes .NET's engine, Python's regex engine and some Javascript engines.

For simplicity, initially assume that all the non-whitespace strings are comprised of word characters. Later I will relax that assumption.

Strings have the desired property if and only if they match the following regular expression.

^(?!.*\b(\w+)\b.*\b\1\b.*\b\1\b)(?!.*\b(\w+)\b(?<!\b\2\b.+)(?!.*\b\2\b)) 

Demo

The regular expression is comprised of two negative lookaheads, both pinned to the beginning of the string. The first asserts that no string of word characters appears three (or more) more times in the string. The second asserts that no string of word characters appears just once in the string.

The expression can be broken down as follows.1

^            # match the beginning of the string
#--- assert no non-whitespace string appears more than twice ---
(?!          # begin a negative lookahead
  .*         # match >=0 characters other than line terminators
  \b(\w+)\b  # match >=1 word chars, with word boundaries fore and aft
             # and save to capture group 1
  .*         # match >=0 characters other than line terminators
  \b\1\b     # match the content of capture group 1 with word boundaries
  .*         # match >=0 characters other than line terminators
  \b\1\b     # match the content of capture group 1 with word boundaries
)            # end negative lookahead
#--- assert no non-whitespace string appears just once ---
(?!          # begin a negative lookahead
  .*         # match >=0 characters other than line terminators
  \b(\w+)\b  # match >=1 word chars, with word boundaries fore and aft
             # and save to capture group 2
  (?<!       # begin negative lookbehind
    \b\2\b   # match the content of capture group 2 with word boundaries
    .+       # match >=1 characters other than line terminators
             # matches the position of the string saved to the capture group
  )          # end negative lookbehind
  (?!        # begin inner negative lookahead
    .*       # match >=0 characters other than line terminators
    \b\2\b   # match the content of capture group 2 with word boundaries
  )          # end inner negative lookahead
)            # end outer negative lookahead

I have assumed that empty strings have the desired property. If that is not the case tack .{2,} on to the end of the regex.


If the non-whitespace strings are not necessarily comprised of word characters \b(\w+)\b can be replaced with (?<!\S)(\S+)(?!\S) and elsewhere "left" and "right" word boundaries (\b) and be replaced with (?<!\S) and (?!\S), respectively.


This is a case where the desired objective is easier to achieve in code. For example, that could be done in Ruby as follows.

def check_it(str)
  str.split.tally.values.uniq == 2
end
check_it "1 1 2 2"        #=> true
check_it "1 2 2 1"        #=> true
check_it "100 10 10 100"  #=> true
check_it "A B B A"        #=> true
check_it "HELLO HELLO"    #=> true
check_it "1 2 2"          #=> false 
check_it "1 2 1 2 2"      #=> false

The steps are as follows.

str = "A B    B A"
a = str.split    #=> ["A", "B", "B", "A"]
h = a.tally      #=> {"A"=>2, "B"=>2}
b = h.values     #=> [2, 2]
c = b.uniq       #=> [2]
c == [2]         #=> true

It should be straightforward to implement this algorithm in other languages.

1. Suppose the string were "cat dog cat". In the second negative lookahead, suppose \b(\w+)\b matched "dog", saving it to capture group 2. The regex engine's string pointer would then be located right after the "g" in "dog". The negative lookbehind then matches the contents of capture group 2, \b\2\b, followed by one or more characters (.+) to reach the regex engine's string pointer. This assures a second "dog" does no appear before the initial "dog". The latter would not work if it were (.*) as that would permit \b\2\b to match in the initial "dog".

1
Andrei Odegov On

Split, count and aggregate and it's all done.

with
  t0 as (
    select line, tok, count(*) as cnt
    from regexp_split_to_table($$1 1 2 2
100 10 10 100
A B B A
HELLO HELLO
hello hello hello
1 1 22
1001$$, '[\r\n]+') as line,
    regexp_split_to_table(line, '\s+') as tok
    group by line, tok
  ),
  t1 as (
    select
      t0.line,
      array_agg(cnt) as acc
    from t0
    group by line
  )
select line
from t1
where 2 = all(acc);

Result:

+---------------+
|     line      |
+---------------+
| 100 10 10 100 |
| A B B A       |
| HELLO HELLO   |
| 1 1 2 2       |
+---------------+

db<>fiddle.