About Regex word boundaries


I have this string:

s='''
D. JUAN:
¡Cálmate, pues, vida mía!
Reposa aquí; y un momento
olvida de tu convento
la triste cárcel sombría.
¡Ah! ¿No es cierto,
ángel de amor,
que en esta apartada orilla
más pura la luna brilla
y se respira mejor?
'''

If I want all the words starting with a vowel:

import re
print(re.findall(r'\b[aeiouAEIOU]\w*\b', s))

and the output is:

['aquí', 'un', 'olvida', 'Ah', 'es', 'amor', 'en', 'esta', 'apartada', 'orilla']

Now, I try to list all words that do not start with a vowel:

print(re.findall(r'\b[^aeiouAEIOU]\w*\b', s))

but my output also includes words that start with a vowel, some with a leading space or newline:

['D', 'JUAN', 'Cálmate', 'pues', 'vida', ' mía', 'Reposa', ' aquí', 'y', ' un', ' momento', '\nolvida', ' de', ' tu', ' convento', '\nla', ' triste', ' cárcel', ' sombría', 'No', ' es', ' cierto', 'ángel', ' de', ' amor', 'que', ' en', ' esta', ' apartada', ' orilla', '\nmás', ' pura', ' la', ' luna', ' brilla', '\ny', ' se', ' respira', ' mejor']

Answer by Wiktor Stribiżew:

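The negated character class [^aeiouAEIOU] matches any character other than a, e, i, o, u, A, E, I, O and U, not just consonant letters. A linefeed, a space, or a symbol such as § is therefore also matched whenever it is preceded by a word character (a letter, digit or underscore in most cases), because the \b before the class only requires a word boundary at that position. The \w* that follows then consumes the whole next word, even when it starts with a vowel, which is why items like ' aquí' and '\nolvida' appear in your output.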
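To make the effect easier to see, here is a minimal sketch that runs the pattern from the question on a short fragment of the poem. The `sample` and `faulty` names are only illustrative and are not part of the original code.

```python
import re

# Illustrative fragment of the poem (name chosen for this sketch only).
sample = "vida mía\nolvida"

# The negated class consumes exactly one character of ANY kind that is not
# one of the ten listed vowels, so a space or a newline qualifies too.
faulty = re.findall(r'\b[^aeiouAEIOU]\w*\b', sample)

print(faulty)  # ['vida', ' mía', '\nolvida']
```

The second and third items start at the word boundary right after the previous word, so the space or newline is taken by the character class and the next word by \w*.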

So, you need to use

re.findall(r'\b(?![aeiouAEIOU])\w+', s)

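where the negative lookahead (?![aeiouAEIOU]) makes sure that \w+ only matches one or more word chars whose first char is not one of the letters inside the character class. The lookahead does not consume anything, so the first character still has to be matched by \w, which rules out spaces, linefeeds and punctuation.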
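As a quick check, the lookahead version can be run on two lines of the poem. This is a small sketch; `sample` and `consonant_words` are names picked for the illustration.

```python
import re

# Two lines from the poem in the question (illustrative variable name).
sample = "Reposa aquí; y un momento\nolvida de tu convento"

# \b anchors at a word boundary, the lookahead rejects a leading vowel,
# and \w+ then matches the word itself (word characters only).
consonant_words = re.findall(r'\b(?![aeiouAEIOU])\w+', sample)

print(consonant_words)  # ['Reposa', 'y', 'momento', 'de', 'tu', 'convento']
```

The words 'aquí', 'un' and 'olvida' are skipped, and no match carries a leading space or newline.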

See the regex demo (note that you must select the right engine in the regex101 options).

Note that you do not need a \b after \w+: the greedy \w+ consumes every consecutive word character, so a word boundary is already implied at the position where it stops.
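If you want to convince yourself that the trailing \b changes nothing, a short comparison such as the following can help. The variable names are hypothetical, chosen for this sketch.

```python
import re

# Fragment of the poem used only for this comparison.
sample = "¡Ah! ¿No es cierto,\nángel de amor,"

without_trailing = re.findall(r'\b(?![aeiouAEIOU])\w+', sample)
with_trailing = re.findall(r'\b(?![aeiouAEIOU])\w+\b', sample)

# Both patterns return the same list, since \w+ always stops at a boundary.
print(without_trailing == with_trailing)  # True
```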