Optimal Data Structure for Storing Structured Text Data with Bounding Boxes


I have a structured dataset of text segments with associated bounding boxes. I need advice on an efficient data structure for storing it that can also be converted to a pandas DataFrame, since I want to perform operations on the data.

Each class may have multiple values, and each value may consist of multiple text segments, each with its associated bounding box. Here's a representation:

word    label           x1_pixel    y1_pixel    x3_pixel    y3_pixel
text_1  O                  1           2           3           4
text_2  O                  5           6           7           8
text_3  B-CLASS_NAME_1     9           10          11          12
text_4  I-CLASS_NAME_1     13          14          15          16
text_5  I-CLASS_NAME_1     17          18          19          20
text_6  O                  21          22          23          24
text_7  O                  25          26          27          28
text_8  B-CLASS_NAME_1     29          30          31          32
text_9  I-CLASS_NAME_1     33          34          35          36
text_10 I-CLASS_NAME_1     37          38          39          40
text_11 O                  41          42          43          44
text_12 O                  45          46          47          48

For now I have come up with this:

{
    "CLASS_NAME_1": [
        [
            ("text_3", "bounding_box"),
            ("text_4", "bounding_box"),
            ("text_5", "bounding_box"),
        ],
        [
            ("text_8", "bounding_box"),
            ("text_9", "bounding_box"),
            ("text_10", "bounding_box"),
        ]
    ],

    "CLASS_NAME_2" : [
        ...
    ],
    ...
}
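For conversion to pandas, a nested structure like this usually has to be flattened into one record per text segment. Below is a minimal sketch of that flattening, not a definitive design. The names `BBox`, `flatten`, `value_id` and `position` are hypothetical and introduced only for illustration, and the bounding boxes are assumed to be the four pixel columns from the table.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical bounding-box type built from the four pixel columns."""
    x1_pixel: int
    y1_pixel: int
    x3_pixel: int
    y3_pixel: int


# Same shape as the dictionary above, with real boxes instead of placeholders.
nested = {
    "CLASS_NAME_1": [
        [
            ("text_3", BBox(9, 10, 11, 12)),
            ("text_4", BBox(13, 14, 15, 16)),
            ("text_5", BBox(17, 18, 19, 20)),
        ],
        [
            ("text_8", BBox(29, 30, 31, 32)),
            ("text_9", BBox(33, 34, 35, 36)),
        ],
    ],
}


def flatten(nested):
    """Turn {class: [[(word, bbox), ...], ...]} into one dict per segment."""
    records = []
    for class_name, values in nested.items():
        for value_id, segments in enumerate(values):
            for position, (word, bbox) in enumerate(segments):
                records.append(
                    {
                        "class_name": class_name,
                        "value_id": value_id,   # which value of the class
                        "position": position,   # index of the segment in the value
                        "word": word,
                        **bbox._asdict(),
                    }
                )
    return records


records = flatten(nested)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`, which gives one row per segment with `class_name`, `value_id` and `position` as grouping columns.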

My initial thought was to create a class that would handle all of this for me: I would just pass in the class names, and it would either create a new entry or append to the existing one for that class name, like so:

class Key_Value_BB:
    class_names = {}
    prev_start = ""
    
    def __init__(self, class_name):
        self.class_name = class_name
        if class_name in self.class_names:
            self.append_to_existing(class_name)
        else:
            self.create_new(class_name)
    
    def append_to_existing(self, class_name):
        if(self.prev_start == 'I'):
            # append to the existing value
            pass
        self.prev_start = class_name[0]
        pass
    
    def create_new(self, class_name):
        pass

I would create only a single instance of the class and then append all the values to the class dictionary or a list, but I can't figure out how to keep appending to the same list while the rows belong to the same run of 'I' labels.
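To make the problem concrete, here is a rough sketch of the grouping step written as a plain function rather than a class. It only needs to remember which class the current run belongs to. The function name `group_spans` and the choice to treat an 'I' label with no preceding 'B' as the start of a new value are assumptions made for illustration.

```python
from collections import defaultdict

# Rows mirror part of the table above: (word, label, x1, y1, x3, y3).
rows = [
    ("text_1", "O", 1, 2, 3, 4),
    ("text_2", "O", 5, 6, 7, 8),
    ("text_3", "B-CLASS_NAME_1", 9, 10, 11, 12),
    ("text_4", "I-CLASS_NAME_1", 13, 14, 15, 16),
    ("text_5", "I-CLASS_NAME_1", 17, 18, 19, 20),
    ("text_6", "O", 21, 22, 23, 24),
    ("text_8", "B-CLASS_NAME_1", 29, 30, 31, 32),
    ("text_9", "I-CLASS_NAME_1", 33, 34, 35, 36),
]


def group_spans(rows):
    """Group BIO-tagged rows into {class_name: [[(word, bbox), ...], ...]}."""
    spans = defaultdict(list)
    current_class = None  # class of the run currently being extended, if any
    for word, label, *bbox in rows:
        prefix, _, name = label.partition("-")
        if prefix == "B" or (prefix == "I" and name != current_class):
            # "B" always opens a new value. An "I" that does not continue the
            # current run is treated like "B" (an assumption, not a rule).
            spans[name].append([(word, tuple(bbox))])
            current_class = name
        elif prefix == "I":
            # Continue the most recent value of this class.
            spans[name][-1].append((word, tuple(bbox)))
        else:
            # "O" (or anything else) ends the current run.
            current_class = None
    return dict(spans)


result = group_spans(rows)
```

Because each new value is a fresh inner list and 'I' rows always append to the last inner list of their class, values of different lengths (three segments versus two here) are stored without any special handling.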

I would like advice on a better data structure that stores these values properly and is easily convertible to a pandas DataFrame. I also need a way to handle discrepancies between values, for example when one value of a class has only 2 segments (B-CLASS_NAME_1, I-CLASS_NAME_1) instead of 3. How would we handle that optimally?

Ideally the goal is to replace text_3 with text_8, text_4 with text_9 and so on, along with the pixels.
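One way to picture this pairing is a long-format DataFrame in which every entity row carries a value number and a position inside that value. The sketch below works under stated assumptions and is not a definitive answer: the column names `prefix`, `class_name`, `value_id` and `position` are hypothetical, every value is assumed to start with a 'B' label, and the second value is deliberately one segment shorter to show the discrepancy case.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("text_2", "O", 5, 6, 7, 8),
        ("text_3", "B-CLASS_NAME_1", 9, 10, 11, 12),
        ("text_4", "I-CLASS_NAME_1", 13, 14, 15, 16),
        ("text_5", "I-CLASS_NAME_1", 17, 18, 19, 20),
        ("text_6", "O", 21, 22, 23, 24),
        ("text_8", "B-CLASS_NAME_1", 29, 30, 31, 32),
        ("text_9", "I-CLASS_NAME_1", 33, 34, 35, 36),
    ],
    columns=["word", "label", "x1_pixel", "y1_pixel", "x3_pixel", "y3_pixel"],
)

# Split "B-CLASS_NAME_1" into prefix "B" and class "CLASS_NAME_1".
# Rows labelled "O" have no class and are dropped below.
parts = df["label"].str.split("-", n=1, expand=True)
df["prefix"] = parts[0]
df["class_name"] = parts[1]
entities = df[df["class_name"].notna()].copy()

# value_id: running count of "B" rows within each class, starting at 0.
is_begin = entities["prefix"].eq("B").astype(int)
entities["value_id"] = is_begin.groupby(entities["class_name"]).cumsum() - 1

# position: index of the segment inside its value.
entities["position"] = entities.groupby(["class_name", "value_id"]).cumcount()

# Number of segments per value, useful for spotting length mismatches.
sizes = entities.groupby(["class_name", "value_id"]).size()

# One column per value, aligned by position; missing segments become NaN.
wide = entities.pivot(
    index=["class_name", "position"], columns="value_id", values="word"
)
```

In `wide`, each row lines up the segments that share a position (text_3 with text_8, text_4 with text_9), and the shorter value shows up as a missing entry in the last row, so a mismatch can be detected before any replacement is attempted. The same pivot can be repeated with a pixel column as `values` to align the coordinates.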

Thanks!
