I have a structured dataset containing text segments with associated bounding boxes. I need advice on the best data structure for storing this information efficiently, and on how to convert it into a pandas DataFrame so that I can perform operations on it.
Each class may have multiple values, and each value may have multiple text segments, each with its associated bounding box. Here's a representation:
word label x1_pixel y1_pixel x3_pixel y3_pixel
text_1 O 1 2 3 4
text_2 O 5 6 7 8
text_3 B-CLASS_NAME_1 9 10 11 12
text_4 I-CLASS_NAME_1 13 14 15 16
text_5 I-CLASS_NAME_1 17 18 19 20
text_6 O 21 22 23 24
text_7 O 25 26 27 28
text_8 B-CLASS_NAME_1 29 30 31 32
text_9 I-CLASS_NAME_1 33 34 35 36
text_10 I-CLASS_NAME_1 37 38 39 40
text_11 O 41 42 43 44
text_12 O 45 46 47 48
For now I have come up with this:
{
    "CLASS_NAME_1": [
        [
            ("text_3", "bounding_box"),
            ("text_4", "bounding_box"),
            ("text_5", "bounding_box"),
        ],
        [
            ("text_8", "bounding_box"),
            ("text_9", "bounding_box"),
            ("text_10", "bounding_box"),
        ]
    ],
    "CLASS_NAME_2": [
        ...
    ],
    ...
}
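To make the target structure concrete, here is a minimal sketch of how it could be built from the rows. It assumes each row is a plain tuple of (word, label, x1, y1, x3, y3), that a "B-" label always opens a new value, and that a stray "I-" label with no preceding "B-" should also open one. The function name `group_spans` is hypothetical, and the rows are only a subset of the table above.

```python
from collections import defaultdict

# Rows as (word, label, x1, y1, x3, y3); a subset of the table above.
rows = [
    ("text_2", "O", 5, 6, 7, 8),
    ("text_3", "B-CLASS_NAME_1", 9, 10, 11, 12),
    ("text_4", "I-CLASS_NAME_1", 13, 14, 15, 16),
    ("text_5", "I-CLASS_NAME_1", 17, 18, 19, 20),
    ("text_6", "O", 21, 22, 23, 24),
    ("text_8", "B-CLASS_NAME_1", 29, 30, 31, 32),
    ("text_9", "I-CLASS_NAME_1", 33, 34, 35, 36),
]


def group_spans(rows):
    """Group BIO-tagged rows into {class_name: [[(word, bbox), ...], ...]}."""
    spans = defaultdict(list)
    current_class = None  # class of the span currently being extended
    for word, label, *bbox in rows:
        prefix, _, class_name = label.partition("-")
        if prefix == "B" or (prefix == "I" and class_name != current_class):
            # "B" opens a new span; a stray "I" is treated the same way (assumption).
            spans[class_name].append([])
            current_class = class_name
        elif prefix != "I":
            # "O" (or any other tag) closes the open span and is not stored.
            current_class = None
            continue
        spans[class_name][-1].append((word, tuple(bbox)))
    return dict(spans)


grouped = group_spans(rows)
```

With these rows, `grouped["CLASS_NAME_1"]` holds two inner lists, one with three words and one with two, so values of different lengths do not need special handling at this stage.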
My initial thought was to write a class that would handle all of this for me: I would just pass it the class names, and it would either create a new entry or append to the existing entry for that class name, like so:
class Key_Value_BB:
    class_names = {}
    prev_start = ""

    def __init__(self, class_name):
        self.class_name = class_name
        if class_name in self.class_names:
            self.append_to_existing(class_name)
        else:
            self.create_new(class_name)

    def append_to_existing(self, class_name):
        if self.prev_start == 'I':
            # append to the existing value
            pass
        self.prev_start = class_name[0]

    def create_new(self, class_name):
        pass
I would create only a single instance of the class and then append all the values to the class-level dictionary or a list, but I can't figure out how to keep appending to the same inner list while consecutive 'I' rows belong to the same value.
I would like advice on a better data structure that stores these values properly and is also easy to convert to a pandas DataFrame. I would also like a way to handle length discrepancies between values, for example when one value consists of only B-CLASS_NAME_1 and I-CLASS_NAME_1 (2 segments of the class instead of 3). How should that be handled?
Ideally the goal is to replace text_3 with text_8, text_4 with text_9 and so on, along with the pixels.
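For illustration, here is a rough sketch of one flat layout that might fit this goal. It is not a definitive solution. It assumes the column names from the table above, that every value starts with a "B-" row, and that values are paired by their position within the span. The `tag`, `class_name`, `span_id`, and `pos` columns are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("text_2", "O", 5, 6, 7, 8),
        ("text_3", "B-CLASS_NAME_1", 9, 10, 11, 12),
        ("text_4", "I-CLASS_NAME_1", 13, 14, 15, 16),
        ("text_5", "I-CLASS_NAME_1", 17, 18, 19, 20),
        ("text_6", "O", 21, 22, 23, 24),
        ("text_8", "B-CLASS_NAME_1", 29, 30, 31, 32),
        ("text_9", "I-CLASS_NAME_1", 33, 34, 35, 36),
    ],
    columns=["word", "label", "x1_pixel", "y1_pixel", "x3_pixel", "y3_pixel"],
)

# Split "B-CLASS_NAME_1" into a tag ("B") and a class name; "O" rows get an empty class.
parts = df["label"].str.partition("-")
df["tag"] = parts[0]
df["class_name"] = parts[2]

entities = df[df["tag"] != "O"].copy()
# Every "B" row starts a new value, so a running count of "B" rows per class
# numbers the values 0, 1, 2, ... (assumes no "I" row appears before the first "B").
is_start = (entities["tag"] == "B").astype(int)
entities["span_id"] = is_start.groupby(entities["class_name"]).cumsum() - 1
# Position of each word inside its value: 0 for the "B" row, then 1, 2, ...
entities["pos"] = entities.groupby(["class_name", "span_id"]).cumcount()

# Pair value 0 with value 1 position by position; where the second value is
# shorter, the "_new" columns are left as NaN instead of raising an error.
first = entities[entities["span_id"] == 0]
second = entities[entities["span_id"] == 1]
paired = first.merge(second, on=["class_name", "pos"], how="left", suffixes=("", "_new"))
```

Here the second value has only two segments, so the third row of `paired` has missing values in `word_new` and the `*_new` pixel columns. That makes the discrepancy explicit and leaves the choice of how to fill or drop it to a later step.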
Thanks!