How do I parse multi-line logs when I have some regex for individual lines?

Question

136 Views Asked by k.. At 14 December 2022 at 16:23

I have newline-delimited logs that look like this:

Unimportant unimportant
Some THREAD-123 blah blah blah patternA blah blah blah
Unimportant unimportant
More THREAD-123 blah blah blah patternB blah blah blah
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive doctype tag
Unimportant unimportant
Outbound XML distinctive root opening-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Outbound XML distinctive HEY-THIS-IS-MY-DATA tagset and innertext
Unimportant unimportant
Outbound XML distinctive root closing-tag
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Yet more THREAD-123 blah blah blah patternC blah blah blah
Unimportant unimportant
Unimportant unimportant
Even more THREAD-123 blah blah blah patternD blah blah blah
Unimportant unimportant
Inbound XML distinctive snippet
Unimportant unimportant
Unimportant unimportant
Unimportant unimportant
Just a bit more THREAD-123 blah blah blah patternE blah blah blah
Unimportant unimportant
Unimportant unimportant
And then THREAD-123 blah blah blah patternF blah blah blah
Unimportant unimportant

I've already come up with ^...$ regex patterns capable of recognizing every line you see here that isn't "Unimportant unimportant", with one caveat:

Sometimes, things that match one of these patterns will themselves be unimportant.

Like, there might be overlapping concurrent threads that both match this pattern.

So once I see a line matching "Some THREAD-(\d+) blah blah blah patternA blah blah blah", I need to save the value captured by "(\d+)" (here "123") into some sort of variable and use it as a literal in the subsequent patterns B through F, so that they look specifically for "THREAD-123".
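A minimal sketch of that capture-then-reuse step, assuming heavily simplified stand-in patterns (`PATTERN_A` and `later_pattern` are hypothetical names; the real patterns would describe the full line format):

```python
import re

# Hypothetical, simplified stand-in for the real patternA regex.
PATTERN_A = re.compile(r'^.*THREAD-(\d+) .*patternA.*$')

def later_pattern(thread_id, keyword):
    """Build a later pattern that only matches the captured thread ID."""
    # re.escape keeps the captured value literal inside the new regex.
    return re.compile(
        r'^.*THREAD-' + re.escape(thread_id) + r' .*' + re.escape(keyword) + r'.*$'
    )

first = PATTERN_A.match('Some THREAD-123 blah blah blah patternA blah blah blah')
thread_id = first.group(1)  # the digits captured by (\d+)
pattern_b = later_pattern(thread_id, 'patternB')
```

With this, `pattern_b` accepts a patternB line from THREAD-123 and rejects the same line from any other thread.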

Furthermore, I need to pass in a parameter to the whole thing where I've written "HEY-THIS-IS-MY-DATA."

In other words, I'm looking for "HEY-THIS-IS-MY-DATA" surrounded by consistent "opening" and "closing" sequences of regex matches in a log file.

Any tips on how I could approach this?

Extremely vanilla Python 3 (as delivered on 2021-era AWS EC2 RHEL instances), older (v5) PowerShell, or the Linux shell flavors that come with standard 2021-era AWS EC2 RHEL instances would be my preferred programming languages. I'll be passing this on for others to use as a unit test that validates whether certain behaviors against "HEY-THIS-IS-MY-DATA" in an interactive UI "show up correctly" in the logs.

**k..** · Answer 1 · 2022-12-15T02:15:44.750000

It's ugly, but it seems to work.

I realized that if I just keep whacking the beginning off the logs any time I find the first instance of a thing I'm looking for, and then keep looking for more of it, I should be all right.

First I throw away all lines of the log file that don't even match any of the 11 regexes. Meanwhile, I also cache the thread numbers involved in the matching regexes.
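As a rough illustration of this filtering pass, here is a small sketch. The patterns and sample lines are simplified, hypothetical stand-ins for the real ones, and unlike the full code further down it skips thread numbers it has already cached:

```python
import re

# Hypothetical, simplified patterns; the first one captures the thread number.
PATTERNS = [
    re.compile(r'^.*\[Thread-(\d+)\] - patterna$'),
    re.compile(r'^.*\[Thread-(\d+)\] - patternb$'),
    re.compile(r'^<DataId>my_data</DataId>$'),
]

def filter_and_cache(lines, patterns):
    """Keep (line_number, line) pairs that match any pattern.

    Also collect the thread numbers captured by patterns[0].
    """
    kept = []
    thread_numbers = []
    for line_number, line in enumerate(lines):
        line = line.rstrip('\n')
        for index, pattern in enumerate(patterns):
            match = pattern.search(line)
            if match:
                kept.append((line_number, line))
                if index == 0 and match.group(1) not in thread_numbers:
                    thread_numbers.append(match.group(1))
                break  # one match is enough to keep the line
    return kept, thread_numbers

sample = [
    'noise\n',
    'x [Thread-7] - patterna\n',
    'noise\n',
    '<DataId>my_data</DataId>\n',
    'x [Thread-9] - patterna\n',
    'x [Thread-7] - patternb\n',
]
kept, thread_numbers = filter_and_cache(sample, PATTERNS)
```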

Then I loop through the remaining log lines. I start with a modified regex #0 (the first cached thread number in the place of \d+), see if I find an instance of it, chop off everything before that, keep looking for modified regex #1 from there, repeat repeat repeat.

Do that for as many variants on the regex-set as there are thread numbers in the cache.

Error out if I don't find all 11 regexes, in order, based on this find-and-chop method.
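The find-and-chop idea can also be written without recursion. The sketch below is one possible iterative version, not the code I actually ran; `match_in_order` and the three toy patterns are made up for illustration:

```python
import re

class LogDidNotMatchException(Exception):
    """Raised when the patterns are not all found, in order."""

def match_in_order(lines, patterns):
    """Return the index of the line matched by each pattern, in order."""
    found_at = []
    start = 0  # everything before this index has been "chopped off"
    for key, pattern in enumerate(patterns):
        for index in range(start, len(lines)):
            if pattern.search(lines[index]):
                found_at.append(index)
                start = index + 1  # keep looking only after this match
                break
        else:
            # The inner loop ran out of lines without finding this pattern.
            raise LogDidNotMatchException(f'Never found regex key {key}')
    return found_at

# Toy patterns standing in for the real 11 regexes.
patterns = [re.compile(p) for p in (r'^open$', r'^data$', r'^close$')]
positions = match_in_order(
    ['close', 'open', 'noise', 'data', 'open', 'close'], patterns
)
```

The leading `close` is skipped because nothing before the first `open` counts, which is the same effect as chopping the start off the log.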

(Note: I just realized this code errors out prematurely if there's more than 1 thread number and the all-11 match isn't in the first thread number processed. I'll have to fix that. Should've tested against a bigger log. Oops.)

from collections import OrderedDict
from itertools import islice
import re

hey_this_is_my_data = 'my_data'
filepath = 'c:\\example\\log.txt'

class LogDidNotMatchException(Exception):
    """Raised when the expected patterns are not all found, in order."""

logstart = re.compile(r'^start_of_every_log_line (.*)$')

def get_od(thread_number_pattern):
    returnme = OrderedDict()
    returnme[0] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO      \[Thread-(' + thread_number_pattern + r')\] - patterna$')
    returnme[1] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO      \[Thread-(' + thread_number_pattern + r')\] - patternb$')
    returnme[2] = re.compile(r'^start_of_every_log_line <!DOCTYPE root_type SYSTEM "[\.\w]+">$')
    returnme[3] = re.compile(r'^start_of_every_log_line <root_type>$')
    # re.escape keeps any regex metacharacters in the data ID literal.
    returnme[4] = re.compile(r'^start_of_every_log_line <DataId>' + re.escape(hey_this_is_my_data) + r'<\/DataId>$')
    returnme[5] = re.compile(r'^start_of_every_log_line </root_type>$')
    returnme[6] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO      \[Thread-(' + thread_number_pattern + r')\] - patternc$')
    returnme[7] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO      \[Thread-(' + thread_number_pattern + r')\] - patternd$')
    returnme[8] = re.compile(r'^start_of_every_log_line <response><Reply><Result status="success" \/><\/Reply><\/response>$')
    returnme[9] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} blahteeblah INFO      \[Thread-(' + thread_number_pattern + r')\] - patterne$')
    returnme[10] = re.compile(r'^start_of_every_log_line \[jibberjabber\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} dumdeedum INFO      \[Thread-(' + thread_number_pattern + r')\] - patternf$')
    return returnme

def filter_lines(enumerated_lines, regex_od):
    returnme_kept_lines = OrderedDict()
    returnme_thread_numbers = []
    for line_number, line in enumerated_lines.items():
        not_yet_kept = True
        for rgx_num, rgx in regex_od.items():
            if not_yet_kept and rgx.search(line):
                not_yet_kept = False
                returnme_kept_lines[line_number] = line
                if rgx_num == 0:
                    returnme_thread_numbers.append(rgx.match(line).group(1))
    return returnme_kept_lines, returnme_thread_numbers

with open(filepath, 'r') as f:
    lines = f.readlines()

first_od = get_od(r'\d+')
kept_lines, thread_numbers = filter_lines(OrderedDict(enumerate(lines)), first_od)

def find_first_regex_occurrence_in_linesod(the_lines_od, the_regex):
    line_number_found_regex_on = -1
    if the_lines_od is None or len(the_lines_od) == 0 or the_regex is None:
        return line_number_found_regex_on
    for i, (line_num, line) in enumerate(the_lines_od.items()):
        if line_number_found_regex_on == -1:
            #print('loopline', i, 'fileline:', line_num, 'lineslen:', len(the_lines_od))
            if the_regex.search(line):
                line_number_found_regex_on = i
                #print(f'found on {i}')
    return line_number_found_regex_on

def recursively_process_subset(collector, lines_od, rgx_od, curr_rgx_key):
    #print('\n', 'recursiongo', 'lineslen:', len(lines_od), 'currregexno:', curr_rgx_key)
    if curr_rgx_key >= len(rgx_od):
        return collector # Recursion base condition: every regex has been matched
    if len(lines_od) == 0:
        raise LogDidNotMatchException(f'Never got through regex key {curr_rgx_key}')
    line_number_found_currod_on = find_first_regex_occurrence_in_linesod(lines_od, rgx_od[curr_rgx_key])
    if line_number_found_currod_on == -1:
        raise LogDidNotMatchException(f'Short-circuited trying to find regex key {curr_rgx_key}')
    #print(f'recursion found for regex key {curr_rgx_key} on line {line_number_found_currod_on} of {len(lines_od)}-line logsubset')
    # Chop off everything up to and including the match, then look for the next regex.
    currodfound_linesod_new_param = OrderedDict(islice(lines_od.items(), line_number_found_currod_on+1, None))
    return recursively_process_subset(collector, currodfound_linesod_new_param, rgx_od, curr_rgx_key + 1)

try:
    for thread_number in thread_numbers:
        thread_number_based_od = get_od(str(thread_number))
        thread_number_kept_lines, thread_number_thread_numbers = filter_lines(kept_lines, thread_number_based_od)
        x = recursively_process_subset([], thread_number_kept_lines, thread_number_based_od, 0)
        #print('final', x, 'lenlines:', len(thread_number_kept_lines))
    print(f'Success:  All {len(first_od)} expected patterns were found in the log, in order, for ID {hey_this_is_my_data}.')
except LogDidNotMatchException:
    print(f'Failure:  Not all {len(first_od)} expected patterns were found as expected in the log for ID {hey_this_is_my_data}.  Below are lines that seemed close but were not quite enough:\n')
    for line_number, line in kept_lines.items():
        print(f'log line #{line_number}:  {line}')
    print(f'Failure:  Not all {len(first_od)} expected patterns were found as expected in the log for ID {hey_this_is_my_data}.  Above are lines that seemed close but were not quite enough.')
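One possible way around the premature-failure problem mentioned before the code is to catch the exception per thread number and fail only when no thread number produces a full match. This is an untested-against-real-logs sketch: `check_thread` is a hypothetical two-pattern stand-in for the full 11-regex check, and `first_matching_thread` is a made-up name.

```python
import re

class LogDidNotMatchException(Exception):
    """Raised when the patterns are not all found, in order."""

def check_thread(lines, thread_number):
    """Hypothetical stand-in for one thread's in-order check; raises on failure."""
    wanted = [
        re.compile(r'^\[Thread-' + re.escape(thread_number) + r'\] - pattern' + suffix + '$')
        for suffix in 'ab'
    ]
    remaining = iter(lines)  # shared iterator, so each search resumes after the last match
    for key, pattern in enumerate(wanted):
        if not any(pattern.search(line) for line in remaining):
            raise LogDidNotMatchException(
                f'Thread {thread_number}: never found regex key {key}'
            )

def first_matching_thread(lines, thread_numbers):
    """Try every cached thread number; fail only if none of them matches."""
    failures = []
    for thread_number in thread_numbers:
        try:
            check_thread(lines, thread_number)
            return thread_number
        except LogDidNotMatchException as error:
            failures.append(str(error))  # remember why, then try the next thread
    raise LogDidNotMatchException('; '.join(failures) or 'no thread numbers were cached')

log = ['[Thread-1] - patterna', '[Thread-2] - patterna', '[Thread-2] - patternb']
winner = first_matching_thread(log, ['1', '2'])
```

Here thread 1 never reaches patternb, but the check moves on to thread 2 instead of reporting failure straight away.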
