I have newline-delimited logs that look like this:
Unimportant unimportant
Some THREAD-123 blah blah blah patternA blah blah blah
Unimportant unimportant
More THREAD-123 blah blah blah patternB blah blah blah
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive doctype tag
Unimportant unimportant
Outbound XML distinctive root opening-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive HEY-THIS-IS-MY-DATA tagset and innertext
Unimportant unimportant
Outbound XML distinctive root closing-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Yet more THREAD-123 blah blah blah patternC blah blah blah
Unimportant unimportant
Unimportant unimportant
Even more THREAD-123 blah blah blah patternD blah blah blah
Unimportant unimportant
Inbound XML distinctive snippet
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Just a bit more THREAD-123 blah blah blah patternE blah blah blah
Unimportant unimportant
Unimportant unimportant
And then THREAD-123 blah blah blah patternF blah blah blah
Unimportant unimportant
I've already come up with ^...$ regex patterns capable of recognizing every line you see here that isn't "Unimportant unimportant", with one caveat:
Sometimes, things that match one of these patterns will themselves be unimportant.
Like, there might be overlapping concurrent threads that both match this pattern.
So once I see a "Some THREAD-(\d+) blah blah blah patternA blah blah blah" I'll need to save off "(\d+)"'s value of "123" from "THREAD-(\d+)" into some sort of variable and use it as a literal in subsequent patternB-patternF (actually look for "THREAD-123").
Furthermore, I need to pass in a parameter to the whole thing where I've written "HEY-THIS-IS-MY-DATA."
In other words, I'm looking for "HEY-THIS-IS-MY-DATA" surrounded by a consistent "opening" and "closing" sequences of regexes in a log file.
Any tips on how I could approach this?
Extremely vanilla Python 3 (as delivered on 2021-era AWS EC2 RHLE instances), older (v5) PowerShell, or Linux shell flavors that come with standard 2021-era AWS EC2 RHLE instances would be my preferred programming languages, as I'll be passing this on for others to use as a unit test for validating whether certain behaviors against "HEY-THIS-IS-MY-DATA" in an interactive UI "show up correctly" in logs.
It's ugly, but it seems to work.
I realized that if I just keep whacking the beginning off the logs any time I find the first instance of a thing I'm looking for, and then keep looking for more of it, I should be all right.
First I throw away all lines of the log file that don't even match any of the 11 regexes. Meanwhile, I also cache the thread numbers involved in the matching regexes.
Then I loop through the remaining log lines. I start with a modified regex #0 (the first cached thread number in the place of
\d+), see if I find an instance of it, chop off everything before that, keep looking for modified regex #1 from there, repeat repeat repeat.Do that for as many variants on the regex-set as there are thread numbers in the cache.
Error out if I don't find all 11 regexes, in order, based on this find-and-chop method.
(Note: I just realized this code errors out prematurely if there's more than 1 thread number and the all-11 match isn't in the first thread number processed. I'll have to fix that. Should've tested against a bigger log. Oops.)