Regular expression to match up to the first space of a line not preceded by a comma


I'm editing a large dictionary file, and the term and definition pairs do not have a consistent format. Some entries are "simple" words; others combine the base term with a suffix that alters something like its gender, effectively stacking two terms into one entry:

abacora (definition)
abacorar  (definition)
abad, desa (definition)

This last term means "abad" and "abadesa" (feminine variant).

I've been trying to write the regular expression to capture this "peculiarity" but I can't seem to make it work. This matches the first part of the term fine, but fails to capture the second part:

^[^\s(?<!,)]+

It should return:

"abacora"
"abacorar"
"abad, desa"

Answer by Tim Biegeleisen:

I would use the following pattern, which should capture all leading words possibly including a CSV list:

^\w+(?:,\s*\w+)*

This pattern says to match:

  • ^ from the start of the line
  • \w+ match a word
  • (?:,\s*\w+)* optionally followed by a CSV list of other words
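
As a quick check, here is a minimal Python sketch of that pattern run against the three sample lines from the question. The choice of Python and of re.MULTILINE (so that ^ anchors at every line start) is my assumption; the question does not say which tool or regex flavor is in use.

```python
import re

# Sample lines from the question; "(definition)" stands in for the real text.
text = """abacora (definition)
abacorar  (definition)
abad, desa (definition)"""

# re.MULTILINE makes ^ match at the start of every line, not only the first.
pattern = re.compile(r"^\w+(?:,\s*\w+)*", re.MULTILINE)

# The only group is non-capturing, so findall returns the whole matches.
terms = pattern.findall(text)
print(terms)  # ['abacora', 'abacorar', 'abad, desa']
```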

Edit:

More generally, we can use [^,\s]+ to match a run of characters that are neither whitespace nor commas, which gives this pattern:

^[^,\s]+(?:,\s*[^,\s]+)*
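
To see where the two patterns differ, the sketch below runs both on a term containing a hyphen. The entry "anti-edad" is made up for illustration and does not come from the question's file; Python is likewise an assumption.

```python
import re

narrow = re.compile(r"^\w+(?:,\s*\w+)*", re.MULTILINE)
general = re.compile(r"^[^,\s]+(?:,\s*[^,\s]+)*", re.MULTILINE)

# "anti-edad" is a hypothetical entry, used only to show the difference.
text = "abad, desa (definition)\nanti-edad (definition)"

narrow_terms = narrow.findall(text)
general_terms = general.findall(text)

print(narrow_terms)   # ['abad, desa', 'anti']       \w stops at the hyphen
print(general_terms)  # ['abad, desa', 'anti-edad']  only commas and whitespace stop it
```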

Answer by Nick:

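Your regex is just a character class, which matches any character other than whitespace or one of the six characters (, ?, <, !, comma, or ). Lookbehind syntax has no special meaning inside square brackets, so the (?<!,) part is read as literal characters. What you need is to match up to a space that is not preceded by a comma, which you could do with this regex: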

^(?:, |[^ ])+

This matches:

  • (?:, |[^ ])+ : one or more of either:
    • , : a comma followed by a space; or
    • [^ ] : a character which is not a space
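
A small Python sketch comparing the pattern from the question with this one, applied line by line. Python and the helper name leading_term are assumptions made for the example, not something taken from the question.

```python
import re

# The pattern from the question: a negated character class, not a lookbehind.
ORIGINAL = re.compile(r"^[^\s(?<!,)]+")
# The alternation described above.
FIXED = re.compile(r"^(?:, |[^ ])+")


def leading_term(pattern, line):
    """Return the text matched at the start of line, or None if nothing matches."""
    m = pattern.match(line)
    return m.group(0) if m else None


lines = [
    "abacora (definition)",
    "abacorar  (definition)",
    "abad, desa (definition)",
]

original_terms = [leading_term(ORIGINAL, line) for line in lines]
fixed_terms = [leading_term(FIXED, line) for line in lines]

print(original_terms)  # ['abacora', 'abacorar', 'abad']
print(fixed_terms)     # ['abacora', 'abacorar', 'abad, desa']
```

Note that [^ ] also matches a newline, so if you run the pattern over the whole file with a multiline flag instead of line by line, a line containing no space at all would let the match run on into the next line.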
