Remove sequential duplicates using regex (pipe delimited)

I have a pipe delimited list of phrases. I would like to remove sequential duplicates using a regex replace/substitution. For example:

dog|cat|cat woman|cat woman|dog|dog 
cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy

would be transformed into

dog|cat|cat woman|dog 
cat|catman|cat woman|dog|dogman|doggy

I am stuck. So far I have ((^|\|)([^\|]+))\1+ with a substitution of $1, but that clearly does not work, because the output is

dog|cat woman|cat woman|dog 
cat|catman|catman|cat woman|dogman|doggy

Thanks for your help

Answer by The fourth bird:

You can set boundaries on the left and the right to prevent partial matches when using the capture group and the backreference.
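To see the partial matches that the boundaries are meant to prevent, here is a minimal sketch that runs the attempt from the question through Python's `re` module (Python is an assumption; the question does not name a regex flavor, and `$1` becomes `\1` in Python's replacement syntax). The name `ATTEMPT` is only illustrative.

```python
import re

# The attempt from the question. Group 1 includes the leading "|",
# and nothing stops the backreference from matching only a prefix
# of the following field.
ATTEMPT = re.compile(r"((^|\|)([^\|]+))\1+")

line = "dog|cat|cat woman|cat woman|dog|dog"

# Show what each match consumed and what it is replaced with.
for m in ATTEMPT.finditer(line):
    print(repr(m.group(0)), "->", repr(m.group(1)))

result = ATTEMPT.sub(r"\1", line)
print(result)  # dog|cat woman|cat woman|dog
```

The first match is `|cat|cat`, where the second `cat` is really the start of `cat woman`. That is the partial match that swallows the standalone `cat` field.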

If a lookbehind assertion is supported:

(?<![^|\n])([^|\n]+)(?:\|\1)+(?![^|\n])

The pattern matches:

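  • (?<![^|\n]) Negative lookbehind: assert that the character directly to the left is not a character other than | or a newline, so the match can only begin at the start of the string or right after | or a newline
  • ([^|\n]+) Capture group 1: match 1 or more characters other than | or a newline (excluding the newline keeps the match from crossing lines)
  • (?:\|\1)+ Repeat 1 or more times: match | followed by the backreference to group 1
  • (?![^|\n]) Negative lookahead: assert that the character directly to the right is not a character other than | or a newline, so the match can only end at the end of the string or right before | or a newline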

In the replacement, use capture group 1 ($1 in the notation from the question).

Output

dog|cat|cat woman|dog
cat|catman|cat woman|dog|dogman|doggy
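As a hedged sketch, the same pattern can be applied with Python's `re` module, which supports the single-character lookbehind used here. Python is an assumption, since the question does not name a language; `DEDUP` and `remove_sequential_duplicates` are illustrative names, and the replacement is written `\1` rather than `$1`.

```python
import re

# Pattern from the answer: one field, followed by one or more
# "|<same field>" repetitions, bounded on both sides by the
# start/end of the text, a "|", or a newline.
DEDUP = re.compile(r"(?<![^|\n])([^|\n]+)(?:\|\1)+(?![^|\n])")


def remove_sequential_duplicates(text: str) -> str:
    """Collapse each run of identical adjacent fields into one field."""
    return DEDUP.sub(r"\1", text)


sample = (
    "dog|cat|cat woman|cat woman|dog|dog\n"
    "cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy"
)
print(remove_sequential_duplicates(sample))
```

Because the character classes exclude the newline, multi-line input can be passed in one call without any extra flags.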

With thanks to Casimir et Hippolyte for the great improvement.