Regex: substitute spaces after specific word


I’m trying (and failing) to write a regular expression (PCRE2) which will replace every space with a dash (-) after the first instance of a particular word (namely •VAN•, •VON• or •DE•) which itself must be surrounded by spaces.

For example:

HENRIETTA VON DER GRAAF
CAROLINE VAN OOSTEN DE WINKEL
MARC DE VRIES VAN JONG
ANNEKA VANHOVEN BAKKER
JOHN WILKINSON SMITH

would translate to:

HENRIETTA VON-DER-GRAAF
CAROLINE VAN-OOSTEN-DE-WINKEL
MARC DE-VRIES-VAN-JONG
ANNEKA VANHOVEN BAKKER (NB: Does not match VAN as not surrounded by spaces)
JOHN WILKINSON SMITH (NB: No substitution here as pattern not matched)

This is as far as I’ve got, but it’s not substituting all of the spaces following the match:

\b( VON| VAN| DE)+\s

https://regex101.com/r/s6BC1y/1

Any advice most appreciated!

There are 4 best solutions below

1
Richard On BEST ANSWER

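You can do the transformation in a SAS DATA step without a regex substitution: use prxmatch only to locate the first VAN, VON or DE that is followed by a space, then translate every space after that position into a dash.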

data have;
input text $CHAR50.;
datalines;
HENRIETTA VON DER GRAAF
CAROLINE VAN OOSTEN DE WINKEL
MARC DE VRIES VAN JONG
ANNEKA VANHOVEN BAKKER
JOHN WILKINSON SMITH
;

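data want;
  set have;
  /* p = position of the first VAN, VON or DE followed by a space (0 if none) */
  p = prxmatch('m/\b(VAN|VON|DE)( )/',text);
  /* replace every space after position p with a dash; trim() drops trailing padding */
  if 0 < p < length(text) then 
    substr(text,p+1) = translate(substr(trim(text),p+1),'-',' ');
run;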
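
For readers who do not use SAS, the same locate-then-translate idea can be sketched in Python. This is only an illustration under the same assumptions as the step above: the function name is hypothetical, and like the \b in the prxmatch pattern it also accepts a particle at the very start of the string.

```python
import re

# Same pattern idea as the prxmatch call: a particle followed by a space.
PARTICLE = re.compile(r"\b(?:VAN|VON|DE) ")

def dash_after_first_match(text: str) -> str:
    """Replace every space from the first particle onward with a dash."""
    m = PARTICLE.search(text)
    if m is None:
        return text  # no particle found: leave the name untouched
    head, tail = text[:m.start()], text[m.start():]
    # head keeps its spaces (including the one before the particle);
    # only the part starting at the particle is translated.
    return head + tail.replace(" ", "-")

for name in ["HENRIETTA VON DER GRAAF", "ANNEKA VANHOVEN BAKKER"]:
    print(dash_after_first_match(name))
```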

0
Gilles Quénot On

Using Perl:

perl -anE '
    # Assumes the particle is the second word: everything after the first word is joined.
    if (/\s(?:VON|VAN|DE)\s/) {
        @a = split /\s+/;
        say $a[0], " ", join "-", @a[1..$#a]
    } else {
        print;
    }
' file

HENRIETTA VON-DER-GRAAF
CAROLINE VAN-OOSTEN-DE-WINKEL
MARC DE-VRIES-VAN-JONG
ANNEKA VANHOVEN BAKKER
JOHN WILKINSON SMITH
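
The split-and-join idea can also be keyed on where the particle actually sits rather than on the first word. The following Python sketch is one possible way to do that; the function name is hypothetical, and it assumes "surrounded by spaces" means the particle is neither the first nor the last word of the line.

```python
PARTICLES = {"VAN", "VON", "DE"}

def dash_by_tokens(line: str) -> str:
    """Join the first interior particle and all following words with dashes."""
    words = line.split()
    # Only interior positions qualify, so the particle has a word on each side.
    for i in range(1, len(words) - 1):
        if words[i] in PARTICLES:
            return " ".join(words[:i]) + " " + "-".join(words[i:])
    return line  # no particle found: return the line unchanged

print(dash_by_tokens("ANNA MARIA DE JONG"))
```

Note that a matching line is rebuilt from its words, so runs of several spaces are collapsed; a non-matching line is returned exactly as it came in.
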
0
InSync On

This can be done with \G and \K:

(?:                # Match either
  (?<!\S)          #   (only if not preceded by a non-whitespace character)
  (?:VON|VAN|DE)   #   'VON', 'VAN' or 'DE'
|                  # or
  \G(?!\A)         #   the end of the last match (but not the start of the string),
  \S+              #   then a sequence of non-whitespace characters.
)                  # 
\K\x20             # Forfeit everything we just matched, then match a space.

Try it on regex101.com.

Because PCRE2 does not support unbounded-width lookbehinds, we can't use something like the following, which is arguably easier to understand:

(?<=               # Match a position preceded by
  (?:VON|VAN|DE)   # either of the three words
  (?:\x20\S+)*     # then 0 or more (space + word),
)                  # 
\x20               # and a space at that position.

Try it on regex101.com.

\G matches at the position where the last match ended or, when there has been no match yet, at the start of the entire string. Thanks to (?!\A), which rules out the start of the string, the second alternative (\G(?!\A)\S+) can only match after the first alternative, (?<!\S)(?:VON|VAN|DE), has matched.

A visual explanation:

MARC DE VRIES VAN JONG
     ^ Start matching `(?<!\S)(?:VON|VAN|DE)`
MARC DE VRIES VAN JONG
       ^ ...then `\x20`.
MARC DE VRIES VAN JONG
        ^ `(?<!\S)(?:VON|VAN|DE)` doesn't match here; switch to `\S+`
MARC DE VRIES VAN JONG
             ^ `\x20` is matched.
MARC DE VRIES VAN JONG
              ^ Back to step 1.
MARC DE VRIES VAN JONG
                  ^ Back to step 3.
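
Python's built-in re module has neither \G nor \K, but the walk-through above can be imitated with a loop, because Pattern.match(text, pos) only matches starting exactly at pos, much like \G. The sketch below is an illustration, not a translation of the regex: the function name is hypothetical, and it assumes words are separated by single spaces.

```python
import re

# First alternative: a particle not preceded by a non-space, then a space.
PARTICLE = re.compile(r"(?<!\S)(?:VON|VAN|DE) ")
# Second alternative: a word and the space after it, tried at the last match end.
WORD_THEN_SPACE = re.compile(r"\S+ ")

def dash_with_anchor_loop(text: str) -> str:
    first = PARTICLE.search(text)
    if first is None:
        return text
    # Keep everything up to the particle, swap the matched space for a dash.
    out = [text[:first.end() - 1], "-"]
    pos = first.end()
    while True:
        # match() with a start position plays the role of \G here.
        m = WORD_THEN_SPACE.match(text, pos)
        if m is None:
            break  # last word: no trailing space, so the chain ends
        out.append(text[pos:m.end() - 1] + "-")
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)

print(dash_with_anchor_loop("MARC DE VRIES VAN JONG"))
```
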
0
Nick On

You can achieve the result you want using this regex:

^(.*? (?:VAN|VON|DE)) |((?<!^)\G\w+) 

This matches either:

  • ^(.*? (?:VAN|VON|DE)) : some minimal number of characters after beginning of line, followed by a space, and one of VON, VAN or DE, all captured in group 1, then a space; or
  • ((?<!^)\G\w+) : some number of word characters starting at the end of the last successful match (but not at the beginning of string, which \G normally allows), captured in group 2, then a space

You can then replace the matches with $1$2- (only one of $1 or $2 will have any content).

Regex demo on regex101

Note that the regex can be simplified using \K to discard the first parts of the match and only match the space after the words:

^.*? (?:VAN|VON|DE)\K |(?<!^)\G\w+\K 

Then the substitution is simply -.

Regex demo on regex101
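
In engines without \G and \K, such as Python's built-in re module, a similar result can be obtained with two capture groups and a replacement callback. This is a minimal sketch rather than an equivalent of the patterns above; the function name is hypothetical, and it reuses only the ^.*? (?:VAN|VON|DE) prefix idea.

```python
import re

# Group 1: shortest prefix ending in " VAN", " VON" or " DE".
# Group 2: the rest of the line, which must start with a space.
PATTERN = re.compile(r"^(.*? (?:VAN|VON|DE))( .*)$")

def dash_with_groups(line: str) -> str:
    """Keep group 1 as is and turn every space in group 2 into a dash."""
    return PATTERN.sub(
        lambda m: m.group(1) + m.group(2).replace(" ", "-"),
        line,
    )

print(dash_with_groups("CAROLINE VAN OOSTEN DE WINKEL"))
```

Lines that do not match the pattern are returned unchanged, because sub only rewrites what the pattern matched.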