Use Perl/Regex to parse 'structured' dump

I was working with a piece of proprietary software and had to carry out a long, tedious analysis of the relationships among its components.

To improve my productivity on this task, I took the report the software generates and tried to parse it; my bet was that Perl would be perfectly suited to the job.

The report looks like this (after removing page numbers):

Category one: NAME1    - Some free form text description goes here
    Used by Cat2 Resources:
        CAT2_NAME - Anoter free form text description here (there are lots of them, but they are pretty much useless, since no one cares about them, probably could not be that long.)
        CAT2_NAME2 - And so on.
        CAT2_NAME4 - U guessed it!
    Uses Resource Cat4:
        CAT4_NAMED   - A meaningless description that where copied from an unrelated resource (Save as...)


Category one: NAME7    - Description
    Used by Cat2 Resources:
        CAT2_NAME - Text
        CAT2_NAME5 - And so on.

        CAT2_NAME4 - U guessed it!
    Uses Resource Cat4:
        CAT4_NAME_  - Some names don't make any sense.

Category TWO: NAME7    - Description of another Category
    Used by Cat3 Resources:
        CAT3_NAME - Text
        CAT3_NAME5 - And so on.
        CAT3_NAME4 - U guessed it!
    Uses Resource Cat4:
        CAT4_NAME_  - Some names don't make any sense.

To be completely clear:

  • All element names consist of CAPITAL LETTERS, digits and underscores ("_")
  • Almost all elements relate to one another
  • There are orphan elements
  • None of the names include their category (I added those in my example to make it more readable)
  • The indentation at the start of each line under a paragraph/sub-paragraph is one or two TABs
  • There are some random empty lines here and there; I plan to clean this file up a little better, but I am in a hurry right now.

I would like to be able to generate something like:

Out_CAT1_CAT2.csv
NAME1,CAT2_NAME
NAME1,CAT2_NAME2
NAME1,CAT2_NAME4
NAME7,CAT2_NAME
NAME7,CAT2_NAME5
NAME7,CAT2_NAME4
Out_CAT1_CAT4.csv
NAME1,CAT4_NAMED
NAME7,CAT2_NAME4
NAME7,CAT4_NAME_
Out_CAT2_CAT3.csv
NAME7,CAT3_NAME
NAME7,CAT3_NAME5
NAME7,CAT3_NAME4
Out_CAT2_CAT4.csv
NAME7,CAT4_NAME_

To parse this file, I first tried (and failed with) an approach that consisted of grabbing a complete paragraph, that is, everything from one top-level header line up to the next. (In the real report there is no literal 'Category' in that label, only the category name, such as database, processing model, etc.)

Approach 1

I tried to use a multi-line regex like /(<?=^Category one :)[A-Z0-9_]+.*+$(^\s.*$)+/m, intending to capture each complete paragraph into an array (or, better yet, one array per level-one category), but I tried a lot of combinations at https://regex101.com/ without any luck.

My aim was to create an array of such Cat1 paragraphs, which I would in turn parse with a subroutine. But I failed (I would appreciate some advice on this in the comments, please).

Then I switched to a completely different approach and wrote something along the lines of:

Approach 2

while (<>) {
    if (/^Category one: /) {
        $mode = 'CAT1PARSING';    # quoted string instead of a bareword
        # Used a regex to grab the name, which comes on this same line after the colon.
        ($cat1Name) = /regex/;    # list context, so the capture is assigned
    }
    elsif (/^Category TWO: /) {
        $mode = 'CAT2PARSING';
    }
    ...

    if ($mode eq 'CAT1PARSING') { # string comparison needs eq, not ==
        # Used some regex to capture the name and description as elements of an array
        push @cat1Array, ($el1, $el2) = $_ =~ (/regex/);
    }
    ...
}

# Here I do some formatting to dump the same info to a lot of CSV files one to each category/sub-category pair, with the appropriated headers

My real program uses approach 2 and sets two control variables, $mode and $subMode (I am lucky that there are only two such levels), but I am not satisfied with it.

I am not sure whether this is what is called a 'state machine'; can anyone confirm?

So of course I am asking not one question but several; nevertheless, my main question is:

Is there any way to implement this with regex, as described in approach 1? If so, how?

Answer by Sobrique

Rule one in 'parsing ad-hoc formats' is: Find the field separator.

In this case - You've got "Category" at start of line. If you had a proper 'blank line' then you could use paragraph mode, but you don't, so:

local $/ = "\nCategory"; 

That matches "Category" only at the start of a line, with no leading whitespace.
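As a rough illustration of how that record separator behaves, here is a minimal sketch that reads from an in-memory filehandle. The sample text and the helper name read_records are made up for the demonstration; the point is only that chomp strips the current value of $/, and that every record after the first loses its leading "Category" and needs it put back.

```perl
use strict;
use warnings;

# Hypothetical two-record sample; names and descriptions are illustrative only.
my $sample = "Category one: A - first\n\tUsed by Cat2 Resources:\n\t\tX - x\n"
           . "\nCategory one: B - second\n\tUses Resource Cat4:\n\t\tY - y\n";

# read_records is a hypothetical helper, not part of any library.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "\nCategory";    # each read stops right after the next header word
    my @records;
    while ( my $rec = <$fh> ) {
        chomp $rec;             # chomp removes a trailing $/ if present
        # Records after the first start just past the separator, so restore the word.
        $rec = "Category$rec" unless $rec =~ /^Category/;
        push @records, $rec;
    }
    return @records;
}

my @records = read_records($sample);
print scalar(@records), " records\n";
print "First line: ", ( split /\n/, $_ )[0], "\n" for @records;
```

The same loop works with while (<>) on a real file; the in-memory handle just keeps the example self-contained.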

Rule two - look for the structure in your records. It looks like you have 'Uses' and 'Used' segments, and within those - key-value pairs, indented with - separators. Do you need to maintain ordering of those? Because if not, then a hash would make sense, and if so, then you need an array.
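To make the hash-versus-array trade-off concrete, here is a small sketch that stores the same key-value lines both ways. The sample lines are lifted from the question's example; the variable names %by_name and @in_order are just illustrative.

```perl
use strict;
use warnings;

# Lines as they might appear under one "Used by" heading, including a stray blank line.
my @lines = (
    "\t\tCAT2_NAME - Text",
    "\t\tCAT2_NAME5 - And so on.",
    "",
    "\t\tCAT2_NAME4 - U guessed it!",
);

my ( %by_name, @in_order );
for my $line (@lines) {
    # Skip anything that is not an indented "NAME - description" line.
    next unless $line =~ /^\s+(\w+)\s*-\s*(.*)$/;
    $by_name{$1} = $2;             # hash: direct lookup by name, order not kept
    push @in_order, [ $1, $2 ];    # array of pairs: keeps the order of the report
}

print join( ',', map { $_->[0] } @in_order ), "\n";
print "CAT2_NAME5 => $by_name{CAT2_NAME5}\n";
```

If the CSV rows have to come out in report order, the array of pairs is the safer choice; if you only ever look things up by name, the hash is simpler.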

And then third question - what does the output need to look like? You mention CSV, but ... if your data is hierarchical (like it is) then something like JSON might do it better.

Anyway, parsing would look something like:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use JSON;

my @records;

local $/ = "\nCategory";
while (<>) {

   print "Record $. looks like: \n\n";
   print;
   print "\n\n----\n\n";


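   # \s+ also accepts the tab indentation described in the question;
   # records without both sections (e.g. orphan elements) are skipped.
   my ( $used_cat, $used_text, $uses_cat, $uses_text ) =
     m/^\s+Used by (\w+) Resources:(.*)Uses Resource (\w+):(.*)/ms
     or next;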


   print "::\n";
   print $used_text;
   print "\n:::\n";

   my $this_record;
   $this_record->{used_cat} = $used_cat;
   $this_record->{uses_cat} = $uses_cat;

   for ( split /\n/, $used_text ) {

      if ( my ( $key, $value ) = m/^\s+(\w+)\s*-\s*(.*)$/ ) {
         print "$key => $value\n";
         $this_record->{used}{$key} = $value;
      }
   }

   for ( split /\n/, $uses_text ) {

      if ( my ( $key, $value ) = m/^\s+(\w+)\s*-\s*(.*)$/ ) {
         print "$key => $value\n";
         $this_record->{uses}{$key} = $value;
      }
   }

   push @records, $this_record;
}

print Dumper \@records;
print "JSON Output:\n";
print to_json ( \@records, { pretty => 1 } );
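If CSV really is what you want, the pairs asked for in the question can be collected in much the same way. The following is only a sketch under a few assumptions: every top-level header starts in column one, the parent name follows the first colon, and the file-name scheme ("Out_<header>_<section>.csv") is my guess at the naming in the question. The helper pairs_by_file is hypothetical. It splits with a regex lookahead instead of $/, which is about as close to "approach 1" as I would go, and it also copes with records that have only one of the two sections.

```perl
use strict;
use warnings;

# Returns { "Out_<top>_<sub>.csv" => [ "PARENT,CHILD", ... ] }.
# pairs_by_file and the file-name scheme are assumptions for illustration.
sub pairs_by_file {
    my ($text) = @_;
    my %rows;
    # Split before every line that starts with a non-space character (a header line).
    for my $record ( split /^(?=\S)/m, $text ) {
        my ( $top, $parent ) = $record =~ /\A(.+?):\s*(\w+)/ or next;
        $top =~ s/\W+/_/g;    # make the header label safe for a file name
        my $sub;
        for my $line ( split /\n/, $record ) {
            if ( $line =~ /^\s+(?:Used by (\w+) Resources|Uses Resource (\w+)):/ ) {
                $sub = $1 // $2;    # remember which section we are in
            }
            elsif ( defined $sub and $line =~ /^\s+(\w+)\s*-/ ) {
                push @{ $rows{"Out_${top}_${sub}.csv"} }, "$parent,$1";
            }
        }
    }
    return \%rows;
}

# Shortened sample modelled on the report in the question (tab-indented).
my $report = <<"END";
Category one: NAME1    - Some description
\tUsed by Cat2 Resources:
\t\tCAT2_NAME - Text
\t\tCAT2_NAME2 - And so on.
\tUses Resource Cat4:
\t\tCAT4_NAMED   - Copied description

Category TWO: NAME7    - Another
\tUsed by Cat3 Resources:
\t\tCAT3_NAME - Text
END

my $rows = pairs_by_file($report);
for my $file ( sort keys %$rows ) {
    print "$file\n";
    print "$_\n" for @{ $rows->{$file} };
}
```

Writing each list to its own file is then just an open and a print per key. The inner loop is still a tiny state machine ($sub is the state), which is usually the most readable way to handle this kind of indented report.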