pcregrep or grep: searching with lookaheads not working


I am trying to search with a regex that uses a lookahead, but it's not working in pcregrep or grep.

I want to find sections

  • which may span multiple lines,
  • which start with PQXY at the beginning of a line,
  • which end with OFEJ at the end of a line, and
  • which do not contain either PQXY or OFEJ in between.

Generally I use the following in Sublime Text's find, and it works well:

(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)

Now I want to count such occurrences, so I am trying to use grep or pcregrep, but neither works.

pcregrep -c "(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)" file.txt
zsh: event not found: PQXY|OFEJ).)

and with grep

$ grep -c -zoP "(?s)(^PQXY(?:(?!PQXY|OFEJTRANS).)*OFEJTRANS\n)" CB_raw_testing_21_feb_CORRECTIONS_0002.txt
zsh: event not found: PQXY|OFEJTRANS).)

How can I do this?

Answer based on @paxdiablo and @anubha.

The main error was the quoting: the pattern needs single quotes, as pointed out by @paxdiablo.

$ pcregrep -c -M '(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt 
0

The regex solution is to add (?s), based on @anubha. Of course, \n also works instead of (\R|\z).

$ pcregrep -c -M '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt
11726

paxdiablo On BEST ANSWER

zsh: event not found: PQXY|OFEJ).)

Since this is zsh raising the error, it's almost certainly because it's trying to perform history expansion on the ! within the double quotes. To protect the pattern from that, you should use single quotes, such as:

pcregrep -c '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt

I don't have pcregrep installed but here's a transcript showing the problem with just echo:

pax> echo "(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)"
zsh: event not found: PQXY|OFEJ).)

pax> echo '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)'
(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)
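
A related sketch, not part of the transcript above: once the pattern has been stored in a variable using single quotes, it can be expanded inside double quotes safely, because history expansion acts on the command line as typed, not on the result of a variable expansion. The variable name pattern is purely illustrative.

```shell
# Single quotes stop the shell from interpreting ! (or anything else).
pattern='(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)'

# The command line itself contains no literal !, so expanding the
# variable inside double quotes does not trigger history expansion.
printf '%s\n' "$pattern"
```

In zsh, history expansion can also be switched off for the session with setopt no_bang_hist, at the cost of losing the !-style history shortcuts.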

In terms of solving the problem rather than using a specific tool, I would actually opt for awk(a) in this case. You can do something like:

awk '/^PQXY/     { s = $0; c = 1; next}
     /OFEJ$/     { if (c == 1) { print s""ORS""$0; c = 0 }; next }
     /OFEJ|PQXY/ { c = 0; next }
     c == 1      { s = s""ORS""$0 }' inputFile

This works by using a string (s) and a flag (c) to track the lines collected and the current state; awk initialises them to an empty string and zero.

Then, for each line:

  • If it starts with PQXY, store the line and set the collection flag, then go to next input line.
  • Otherwise, if it ends with OFEJ and you're collecting, output the collected section and stop collecting, then go to next input line.
  • Otherwise, if it has either of the strings in it, stop collecting, move to next input line.
  • Otherwise, if collecting, append current line and move (implicitly) to next input line.
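
Since the question asks for a count rather than the sections themselves, the same state machine can be reduced to a counter. This is only a sketch under the same assumptions as the script above; the sample data and the names sample, count and n are made up for illustration.

```shell
# Hypothetical sample input: two valid sections around one invalid one.
sample=$(mktemp)
printf 'PQXY 1\nabc\n2 OFEJ\nPQXY a\n  OFEJ \nb OFEJ\nPQXY 3\n4 OFEJ\n' > "$sample"

count=$(awk '
    /^PQXY/     { c = 1; next }                   # section start: begin collecting
    /OFEJ$/     { if (c == 1) n++; c = 0; next }  # valid end: count the section
    /OFEJ|PQXY/ { c = 0; next }                   # stray marker: abandon the section
    END         { print n + 0 }                   # n + 0 prints 0 if nothing matched
' "$sample")

echo "$count"
rm -f "$sample"
```

No string needs to be accumulated here, because only the number of completed sections matters.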

I've tested this with some limited test data and it seems to work okay. Here's the bash script(b) I used for testing; you can add as many test cases as you need to be comfortable it solves your problem.

for i in \
    "PQXY 1\nabc\n2 OFEJ\n" \
    "PQXY 1\nabc\n2 OFEJx\n" \
    "PQXY 1\nabc\n  PQXY \n2 OFEJ\n" \
    "PQXY 1\nabc\n  OFEJ \n2 OFEJ\n" \
    "PQXY 1\nabc\ndef\nPQXY 2\n2 OFEJ\n" \
; do
    echo "$i:"
    printf "$i" | awk '
        /^PQXY/     { s = $0; c = 1; next}
        /OFEJ$/     { if (c == 1) { print s""ORS""$0; c = 0 }; next }
        /OFEJ|PQXY/ { c = 0; next }
        c == 1      { s = s""ORS""$0 }' | sed 's/^/    /'
done

Here's the output so you can see it in action:

PQXY 1\nabc\n2 OFEJ\n:
    PQXY 1
    abc
    2 OFEJ
PQXY 1\nabc\n2 OFEJx\n:
PQXY 1\nabc\n  PQXY \n2 OFEJ\n:
PQXY 1\nabc\n  OFEJ \n2 OFEJ\n:
PQXY 1\nabc\ndef\nPQXY 2\n2 OFEJ\n:
    PQXY 2
    2 OFEJ

(a) In my experience, if you've tried three things with a grep-style regex without success, it's usually faster to move to a more advanced tool :-)


(b) Yes, I know it's written in bash rather than zsh but that's because:

  • it's a test program to show you that awk works, hence the language used is irrelevant; and
  • I'm far more comfortable with bash than zsh :-)
anubhava On

Using GNU grep:

grep -ozP '(?ms)^PQXY(?:(?!PQXY|OFEJ).)*OFEJ(\R|\z)' file
  • You must use the -z option so that input and output records are terminated by a zero byte rather than a newline; a text file is then read as a single record, which lets the pattern match across lines.

  • Make sure to use single quotes for your pattern so that the shell's history expansion doesn't attempt to process !.

  • Added (?m) (MULTILINE) modifier to allow use of ^ and $ in regex for each line
  • Used (\R|\z) to allow the pattern to match when the file ends without a trailing newline. \R matches any kind of line break, including Unicode line separators, and \z matches the end of input.
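
To get a count rather than the matches themselves, note that -c counts matching records, and with -z a text file is normally a single record, so -c would report at most 1. One possible workaround, assuming GNU grep built with PCRE support: with -o and -z every match is written followed by a zero byte, so counting those bytes counts the matches. The sample file and the variable names are made up for illustration.

```shell
# Hypothetical sample input: two valid sections around one invalid one.
sample=$(mktemp)
printf 'PQXY 1\nabc\n2 OFEJ\nPQXY a\n  OFEJ \nb OFEJ\nPQXY 3\n4 OFEJ\n' > "$sample"

# Keep only the zero bytes that terminate each match, then count them.
matches=$(grep -ozP '(?ms)^PQXY(?:(?!PQXY|OFEJ).)*OFEJ(\R|\z)' "$sample" | tr -cd '\0' | wc -c)

echo "$matches"
rm -f "$sample"
```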



Equivalent solution in pcregrep

pcregrep -M '(?s)^PQXY(?:(?!PQXY|OFEJ).)*OFEJ(\R|\z)' file

-M enables the multiline option in pcregrep.