RegEx code to find duplicate (parts of) sentences anywhere in LibreOffice Writer?


I have written a regex to find strings, or parts of strings (at least 5 consecutive words), that appear at least twice anywhere in the text. All of the text is inside tables.
/\b([\w]{1,}[\s]{1,}[\w]{1,}[\s]{1,}[\w]{1,}[\s]{1,}[\w]{1,}[\s]{1,}[\w]+)(?=.*\b\1{1,})/gm

In LibreOffice I entered only the pattern itself, since LibreOffice does not accept the surrounding /.../gm delimiters and flags:
\b([\w]{1,}[\s]{1,}[\w]{1,}[\s]{1,}[\w]{1,}[\s]{1,}[\w]{1,}[\s]{1,}[\w]+)(?=.*\b\1{1,})

The problem:
- the regex ONLY finds a repeat when both occurrences are IN THE SAME segment, not when they are in different segments, even though the whole text is in scope.
The text I underlined in red in the right segment should also be found, but it is not. In other words: I want to mark duplicates even if the second occurrence is somewhere else in the document, for example in another cell.

I have tried ChatGPT, but to no avail.
Please help. I also use MS Word, so a solution based on wildcards is fine too.

Answer by Jim K:

According to https://help.libreoffice.org/latest/en-US/text/swriter/guide/search_regexp.html:

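A search using a regular expression will work only within one paragraph.

Each table cell contains its own paragraphs, so a match can never extend from one cell into another. That is why duplicates in different cells are not found.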

But if you work with plain text, there's no need to limit yourself to LibreOffice. For example, you could use a text editor such as Vim, a command-line tool such as grep, or a programming language such as Perl (or a more modern language such as Python, which supports the same kind of regex at the cost of a bit more code).
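
As a rough illustration of the "bit more code" route, here is a minimal Python sketch. It does not use a backreference at all: it swaps in a sliding window over the words and counts every 5-word sequence. The function name `repeated_phrases` and the sample `cells` list are made up for this example, and it assumes the table text has already been copied out of the document as plain strings.

```python
import re
from collections import Counter

def repeated_phrases(text, n=5):
    """Return {phrase: count} for every run of n words seen at least twice.

    Hypothetical helper for illustration; matching is case-sensitive and
    punctuation between words is ignored.
    """
    words = re.findall(r"\w+", text)
    # Build every window of n consecutive words as a single string.
    windows = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(windows)
    return {phrase: c for phrase, c in counts.items() if c >= 2}

# Assumed sample data: one string per table cell.
cells = [
    "one two three four five six",
    "seven one two three four five eight",
]

# Joining the cells means a window can also straddle two cells.
result = repeated_phrases("\n".join(cells))
print(result)
```

Because the cells are simply joined, a reported phrase may start in one cell and end in the next; counting per cell and merging the counters would avoid that if it matters.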

For a solution that doesn't require anything in particular on your system, use the following web site (the example is included in the link): https://regex101.com/r/pF3EN3/1

In that example, I used the following regex:

/\b((?:[\w]{1,}[\s]{1,}){4}[\w]+)(?=.*\b\1{1,})/s

The important part is the /s flag at the end, meaning that the input will be treated as a single line so that . matches line breaks.
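
The effect of the flag can be checked outside regex101 as well. The sketch below uses Python, where the equivalent of /s is `re.DOTALL`; the sample text is invented for the demonstration, with the repeated phrase placed on two different lines.

```python
import re

# Same pattern as above, without the /.../s delimiters.
PATTERN = r"\b((?:[\w]{1,}[\s]{1,}){4}[\w]+)(?=.*\b\1{1,})"

# Assumed sample: the 5-word phrase repeats, but on a different line.
text = "one two three four five six\nseven one two three four five eight"

# Without the flag, "." stops at the line break, so the lookahead
# never reaches the second occurrence.
without_flag = re.search(PATTERN, text)

# With re.DOTALL, "." also matches "\n" and the repeat is found.
with_flag = re.search(PATTERN, text, flags=re.DOTALL)

print(without_flag)
print(with_flag.group(1))
```

In this sample the first search returns no match and the second captures the repeated phrase, which mirrors the difference between searching one paragraph at a time and searching the whole text.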