I tried to use regexes for finding max-length sequences formed from repeated doubled letters, like AABB in the string xAAABBBBy.
As described in the official documentation:
The
'*','+', and'?'quantifiers are all greedy; they match as much text as possible.
When I use the quantifier {n,}, I get a full substring, but + returns only parts:
import re
print(re.findall("((AA|BB){3,})", "xAAABBBBy"))
# [('AABBBB', 'BB')]
print(re.findall("((AA|BB)+)", "xAAABBBBy"))
# [('AA', 'AA'), ('BBBB', 'BB')]
Why is {n,} more greedy than +?
Both quantifiers
{3,}and+are greedy in the same way.First, let's simplify the output a little bit by changing the inner group into a non-capturing one:
The first pattern requires a repetition (with the total number of occurrences being at least 3 – let's call this multiplicity ≥3), so the only possible match starts at the second
A:The second pattern requires only multiplicity ≥1. As the string is scanned left-to-right, the first (left-most) possible greedy match is formed by the first two
As. For the remaining stringABBBBy, the first (left-most) possible greedy match isBBBB. After that, onlyyremains, which can't be matched.