Why is the regex quantifier {n,} more greedy than + (in Python)?

147 Views Asked by At

I tried to use regexes for finding max-length sequences formed from repeated doubled letters, like AABB in the string xAAABBBBy.

As described in the official documentation:

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible.

When I use the quantifier {n,}, I get a full substring, but + returns only parts:

import re

print(re.findall("((AA|BB){3,})", "xAAABBBBy"))
# [('AABBBB', 'BB')]
print(re.findall("((AA|BB)+)", "xAAABBBBy"))
# [('AA', 'AA'), ('BBBB', 'BB')]

Why is {n,} more greedy than +?

1

There are 1 best solutions below

0
Lover of Structure On

Both quantifiers {3,} and + are greedy in the same way.

First, let's simplify the output a little bit by changing the inner group into a non-capturing one:

import re

print(re.findall("((?:AA|BB){3,})", "xAAABBBBy"))
# ['AABBBB']
print(re.findall("((?:AA|BB)+)", "xAAABBBBy"))
# ['AA', 'BBBB']

The first pattern requires a repetition (with the total number of occurrences being at least 3 – let's call this multiplicity ≥3), so the only possible match starts at the second A:

x AA A B B B By

The second pattern requires only multiplicity ≥1. As the string is scanned left-to-right, the first (left-most) possible greedy match is formed by the first two As. For the remaining string ABBBBy, the first (left-most) possible greedy match is BBBB. After that, only y remains, which can't be matched.

xA AAB B B By