Search a paragraph and return every line from the first matching line through the last matching line, using a string keyword for each


I'm trying to find a way to isolate a specific paragraph using a string as a starting point, where the string could be a word in any part of the line (not necessarily the end or the beginning).

So it will grab that entire line where the string occurs, and then it will grab until the line where it finds the secondary string. I've checked various questions and I'm not finding quite what I want. Here's an example input paragraph with the desired output paragraph:

Input:

JUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
NOTJUNK ABC NOTJUNK
DEF GHI JKL
MNO PQR STW
UVW XYZ NOTJUNK
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT

Objective: I want to get every LINE from ABC (including the words before and after ABC on the same line) until XYZ (including the words before and after XYZ). ABC and XYZ will each occur only once in the paragraph, and ABC will always occur before XYZ. The paragraphs in question are obtained from emails, and I'm currently using PhpMimeMailParser to parse the email.

start string search term: ABC

end string search term: XYZ

Desired Output:

NOTJUNK ABC NOTJUNK
DEF GHI JKL
MNO PQR STW
UVW XYZ NOTJUNK

There are 6 best solutions below

2
Computable On BEST ANSWER

Glad I could help. Here is the regex which evidently does as you prescribed:

 /.*(^.*ABC.*XYZ.*?[\r\n]).*/sm


Supporting info:

Options

The multi line option m is required since the capture needs to start at the beginning of a line and not the start of the string.

The single line option s is required so that the dot also matches newline characters.

Explanation

So with the options as context, the expression can be described:

Ignore all characters until a line is found with ABC anywhere within the line.
Begin to capture all characters starting at the line which contains the first ABC.
Continue the capture until XYZ is found in a line.
Stop the capture at the first newline found on the line with the XYZ.
Ignore the remaining characters in the string.

The lazy quantifier in .*? ensures the match stops at the first newline (following the XYZ). I removed the {1} from my original comment as it is unnecessary.
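As a minimal sketch of how this pattern might be applied in PHP: the sample $text and the variable names are assumptions made for illustration, not part of the original answer. Capture group 1 holds the wanted lines.

```php
<?php
// Hypothetical sample input that mirrors the question's example.
$text = "JUNKTEXTJUNKTEXT\n"
      . "NOTJUNK ABC NOTJUNK\n"
      . "DEF GHI JKL\n"
      . "MNO PQR STW\n"
      . "UVW XYZ NOTJUNK\n"
      . "JUNKTEXTJUNKTEXT\n";

$block = null;
if (preg_match('/.*(^.*ABC.*XYZ.*?[\r\n]).*/sm', $text, $m) === 1) {
    // Group 1 ends with the line break matched by [\r\n]; trim it off.
    $block = rtrim($m[1], "\r\n");
}

echo $block, PHP_EOL;
```

Note that the pattern requires a line break after the XYZ line, so it finds nothing when that line is the very last one in the string without a trailing newline.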

0
Piyush B On
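$startWord = "ABC";
$endWord = "XYZ";
$result = "";
// $para holds the text to search; walk through it line by line.
foreach (preg_split('/\R/', $para) as $line)
{
    // Start collecting at the line that contains $startWord, then keep every following line.
    if ($result !== "" || strpos($line, $startWord) !== false)
      $result .= $line . "\n";
    // Stop once the line that contains $endWord has been collected.
    if ($result !== "" && strpos($line, $endWord) !== false)
      break;
}
return $result;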

-- if there are multiple occurrences of the sequence then repeat above logic.
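To illustrate the note about multiple occurrences, here is one possible sketch that collects every start-to-end block in a single pass. The function name extractBlocks() and the sample text are hypothetical, chosen only for this example.

```php
<?php
/**
 * Hypothetical helper: return every block of whole lines that begins at a
 * line containing $start and ends at the next line containing $end.
 */
function extractBlocks(string $text, string $start, string $end): array
{
    $blocks = [];
    $current = null; // null means "not inside a block"

    foreach (preg_split('/\R/', $text) as $line) {
        if ($current === null && strpos($line, $start) !== false) {
            $current = []; // the start line opens a new block
        }
        if ($current !== null) {
            $current[] = $line;
            if (strpos($line, $end) !== false) {
                $blocks[] = implode("\n", $current); // the end line closes it
                $current = null;
            }
        }
    }

    // A block that is still open when the text ends is discarded.
    return $blocks;
}

// Assumed sample text with two occurrences of the ABC ... XYZ sequence.
$text = "junk\nx ABC y\nmiddle\nz XYZ w\njunk\nABC XYZ\njunk";
$blocks = extractBlocks($text, 'ABC', 'XYZ');
print_r($blocks);
```

With exactly one occurrence, as in the question, the returned array simply has a single element.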

1
lukas.j On

$data = "
JUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
ABC
DEF GHI JKL
MNO PQR STW
UVW XYZ
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT
";

$start = 'ABC';
$end = 'XYZ';

$startIndex = strpos($data, $start);
$endIndex = strpos($data, $end, $startIndex) + strlen($end);

$result = substr($data, $startIndex, $endIndex - $startIndex);

echo $result;

For case-insensitive search use stripos() instead of strpos().
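The substr() call above returns the text from ABC through XYZ, so words before ABC or after XYZ on the same lines would be cut off. One possible way to widen the range to whole lines is sketched below; the helper name extractWholeLines() and the sample data are assumptions for illustration, and the sketch assumes "\n" (optionally preceded by "\r") as the line break.

```php
<?php
/**
 * Hypothetical helper: return the whole lines from the line containing
 * $start through the line containing $end, or null if either is missing.
 */
function extractWholeLines(string $data, string $start, string $end): ?string
{
    $startIndex = strpos($data, $start);
    if ($startIndex === false) {
        return null;
    }
    $endIndex = strpos($data, $end, $startIndex);
    if ($endIndex === false) {
        return null;
    }

    // Step back to the character after the previous newline (or to offset 0).
    $prevNewline = strrpos(substr($data, 0, $startIndex), "\n");
    $lineStart = $prevNewline === false ? 0 : $prevNewline + 1;

    // Step forward to the next newline after the end term (or to the end of the string).
    $nextNewline = strpos($data, "\n", $endIndex);
    $lineEnd = $nextNewline === false ? strlen($data) : $nextNewline;

    // Drop a trailing carriage return left over from "\r\n" line endings.
    return rtrim(substr($data, $lineStart, $lineEnd - $lineStart), "\r");
}

$data = "JUNK\nNOTJUNK ABC NOTJUNK\nDEF GHI JKL\nUVW XYZ NOTJUNK\nJUNK\n";
$result = extractWholeLines($data, 'ABC', 'XYZ');
echo $result, PHP_EOL;
```

Checking both strpos() results against false also avoids silently treating a missing term as offset 0.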

1
lukas.j On
$data = "
JUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
ABC
DEF GHI JKL
MNO PQR STW
UVW XYZ
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT
";

$result = preg_replace('/(.*)(ABC.*XYZ)(.*)/s', '\2', $data);

echo $result;

For case-insensitive matching, add the regex modifier i after the pattern:

'/(.*)(ABC.*XYZ)(.*)/si'
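
If the surrounding words on the ABC and XYZ lines are needed as well, a variant anchored at line boundaries might look like the sketch below. The function name extractByRegex(), its $ignoreCase parameter, and the sample text are hypothetical; preg_quote() is used so that search terms containing regex metacharacters are matched literally.

```php
<?php
/**
 * Hypothetical helper: match from the start of the line containing $start
 * to the end of the line containing $end. Returns null when nothing matches.
 */
function extractByRegex(string $text, string $start, string $end, bool $ignoreCase = false): ?string
{
    $pattern = '/^[^\r\n]*' . preg_quote($start, '/')
             . '.*?' . preg_quote($end, '/') . '[^\r\n]*/sm'
             . ($ignoreCase ? 'i' : '');

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// Assumed sample text with lower-case search terms.
$text = "JUNK\nnotjunk abc notjunk\nDEF\nuvw xyz notjunk\nJUNK";
$found = extractByRegex($text, 'ABC', 'XYZ', true);
echo $found, PHP_EOL;
```

Here the m modifier lets ^ match at the start of each line, and [^\r\n]* keeps the first and last parts of the match on their own lines.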
0
fkrzski On

You can use this:

$data = "
JUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
ABC
DEF GHI JKL
MNO PQR STW
UVW XYZ
JUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT
JUNKTEXTJUNKTEXT
";

echo substr($data, strpos($data, "ABC"), strpos($data, "XYZ") - strpos($data, "ABC") + strlen("XYZ"));

You need to take the part of the string from index A to index B.

strpos($data, "ABC") is the index of the "ABC" string.
strpos($data, "XYZ") - strpos($data, "ABC") + strlen("XYZ") is the LENGTH of the string you want to take. To get this length, take the result of strpos($data, "XYZ"), subtract the first result, and add the length of the second search string. Why? Because strpos() returns the index where the searched value starts. To include the end term itself, you must add its length.

0
StuyvesantBlue On

This is the solution I've come up with. It's not very elegant but it works and I cannot find another way:

// Find the beginning of the line in which ABC is located, then use substr
// to remove every line before it.
preg_match('/[^\n]*ABC[^\n]*/', $text, $matches, PREG_OFFSET_CAPTURE);
$start = $matches[0][1];
$text = substr($text, $start);

// Find the line in which XYZ is located. That line ends at its offset plus
// its length, so another substr removes everything after that position.
preg_match('/[^\n]*XYZ[^\n]*/', $text, $matches2, PREG_OFFSET_CAPTURE);
$end = $matches2[0][1] + strlen($matches2[0][0]);
$text = substr($text, 0, $end);

echo $text;