Python: extracting event information from a string


I don't have a specific error; rather, I don't know how to correctly extract information from inputs that look like these:

  • Roč. 4, č. 2-4, EUROSTEEL 2021, Sheffield – Steel‘s coming home (2021), s. 731-735 [online]
  • Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]
  • Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]
  • "Vol. 63, no. 3 (2006), p. 543-551"
  • "Vol. Volume 2019, no. Article ID 8148697(2019), p. 1-10"
  • P. 681-686
  • "Vol. 1, no. 2/2019(2019), p. 1-9"

As you can see, for some of them it is simple to extract the information and to be sure that it was extracted correctly.

But I'm having trouble extracting the event information, because it is free-form text that differs from record to record and can appear at different positions in the string.

I'm trying to extract the fields one by one into a JSON object and return it at the end.

This is the function that I currently use to extract the event information:

import re

def remove_long_string(modified_input, saved_copy):
    def extract_longest_substring(string):
        # Candidate runs: at least 10 characters with no dots, commas, or parentheses
        pattern = r'[^.,()]{10,}'
        matches = re.findall(pattern, string)
        longest_substring = ''
        for match in matches:
            # Keep the longest candidate that contains at least three words
            if len(match.split()) >= 3 and len(match) > len(longest_substring):
                longest_substring = match.strip()
        return longest_substring

    # Detect long string without commas, dots, or parentheses in the modified input
    pattern = r'[^.,()]{10,}'
    match = re.search(pattern, modified_input)
    if match:
        # The event text is taken from the untouched copy of the row
        substring = extract_longest_substring(saved_copy)
        modified_input = re.sub(re.escape(match.group()), '', modified_input)
        return modified_input, substring

    return modified_input, None

In this function I work with the unmodified string (saved_copy), while the other functions work with the modified string.

This is my main function, which calls the others and processes the input row by row.
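One alternative to picking the longest substring, sketched here only as an idea: remove every field that has a recognisable shape from the untouched string and treat whatever is left as the event text. The name extract_event and the patterns in FIELD_PATTERNS are assumptions made for this illustration. They are tuned to the sample rows above and would probably need extending for other inputs.

```python
import re

# Hypothetical patterns for the well-structured fields (assumed from the samples).
# Order matters: brackets, year and pages are removed before volume and issue.
FIELD_PATTERNS = [
    re.compile(r"\[[^\]]*\]"),                                # format, e.g. [print, online]
    re.compile(r"\(\d{4}\)"),                                 # year in parentheses, e.g. (2021)
    re.compile(r"\b(?:s|p|pp)\.\s*\d+\s*[-–]\s*\d+", re.I),   # page range, e.g. s. 731-735
    re.compile(r"\b(?:roč|vol)\.\s*[^,]*", re.I),             # volume, up to the next comma
    re.compile(r"\b(?:č|no)\.\s*[^,(]*", re.I),               # issue, up to the next comma or "("
]

def extract_event(row):
    """Return the text left after removing the known fields, or None if nothing is left."""
    leftover = row
    for pattern in FIELD_PATTERNS:
        leftover = pattern.sub("", leftover)
    # Collapse whitespace and drop the separators left behind by the removed fields
    leftover = re.sub(r"\s+", " ", leftover).strip(" ,")
    return leftover or None

print(extract_event("Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"))
print(extract_event("Vol. 63, no. 3 (2006), p. 543-551"))
```

Because this works on the original string, the event keeps its spaces, capitalisation and internal commas, which the longest-substring approach loses when the event itself contains a comma or a dot.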

import json

def parse_row(row):
    #print(row)
    pages_data = {}
    backup = row  # untouched copy, used later for the event text
    row, chapter = extract_chapter(row)
    row, format = extract_format(row)
    row = replace_roman_numerals(row)
    # Remove whitespace from the input
    row = row.replace(" ", "")
    row = row.lower()
    row = row.replace("s.", "p.")
    row = row.replace("roč.", "vol.")
    row = row.replace("č.", "no.")
    row = row.replace(".]", "]")
    row = row.replace(".(", "(")
    row = row.replace(".,", ",")
    row = row.replace("pp.", "p.")
    row = row.replace(",(", "(")
    row = row.replace("-s", "-")
    row = row.replace("-p", "-")
    row = row.replace("–", "-")
    row = row.replace("až", "-")
    row = row.replace("--", "-")
    row = row.replace("÷", "-")

    row = row.replace("vol.vol.", "vol.")
    row = row.replace("p.articlenumber", "articlenumber")
    row = row.replace("vol.n/a,","")
    row = row.replace("no..","")
    row = row.replace("no.no.","no.")

    row = convert_month(row)

    row, Articel_number = extract_art_no(row)
    row, volume = extract_volume(row)
    row, issue = extract_issue(row)
    row, month, year = extract_month_year(row)
    row, year = extract_year(row)
    row, Is_special_issue = remove_special_issue(row)
    row = clear_row(row)
    row, pages = extract_pages(row)

    row = remove_non_alnum(row)
    row = remove_commas(row)

    row, longest_substring = remove_long_string(row, backup)

    data = {
        "Volume": volume,
        "Issue": issue,
        "Month": month,
        "Year": year,
        "Is_special_issue": Is_special_issue,
        "Articel_number": Articel_number,
        "format": format,
        "Info": longest_substring,
        "pages": pages,
        "chapter": chapter
    }
    json_data = json.dumps(data)
    # Only return the result when every part of the row has been consumed
    if is_empty(row):
        return json_data

If you have an idea how to solve my problem with extracting the event information, that would be great. I'm also open to recommendations about what to change and how to do this more efficiently.
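On the efficiency side, one change that might be worth considering is turning the long chain of replace calls in parse_row into a single ordered table, so that rules can be added or reordered in one place. The sketch below is only an illustration: NORMALISATION_RULES and normalise are hypothetical names, the rules are copied from parse_row in the same order, and the other helpers (extract_chapter, convert_month and so on) are left out.

```python
# Ordered (old, new) pairs copied from the replace chain in parse_row.
# The order is significant: for example "roč." must be handled before "č.".
NORMALISATION_RULES = [
    ("s.", "p."),
    ("roč.", "vol."),
    ("č.", "no."),
    (".]", "]"),
    (".(", "("),
    (".,", ","),
    ("pp.", "p."),
    (",(", "("),
    ("-s", "-"),
    ("-p", "-"),
    ("–", "-"),
    ("až", "-"),
    ("--", "-"),
    ("÷", "-"),
    ("vol.vol.", "vol."),
    ("p.articlenumber", "articlenumber"),
    ("vol.n/a,", ""),
    ("no..", ""),
    ("no.no.", "no."),
]

def normalise(row):
    """Strip spaces, lower-case the row, then apply the rules in order."""
    row = row.replace(" ", "").lower()
    for old, new in NORMALISATION_RULES:
        row = row.replace(old, new)
    return row

print(normalise("Roč. 4, č. 2-4 (2021), s. 731–735"))
```

The behaviour is meant to match the original chain for these rules; the gain is only that the rules become data that can be listed, tested and extended without touching the function body.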
