Python - How to search for a word with multiple occurrences and get the table results from a PDF

I have the following PDF and want to search for the word 'Country' to get the country name, then read the 'Place to visit' list that follows it, and convert the result to a CSV file.

This is my analysis when you visit a country, which places you must see in any season. Also get the number of hotels/motels in that places that served vegan food.

Country: USA

The following places of USA you must visit. There are number of hotels that provided vegan food.

Place to visit: California 28 hotels Vegas 9 hotels New York 35 hotels

Country: Canada

The following places of Canada you must visit. There are number of hotels that provided vegan food.

Place to visit: Toronto 22 hotels Vancouver 13 hotels Ottawa 8 hotels

Desired result:

USA California 28 hotels
USA Vegas 9 hotels
USA New York 35 hotels
Canada Toronto 22 hotels
Canada Vancouver 13 hotels
Canada Ottawa 8 hotels

There are 2 best solutions below

Thermostatic

First, use pdf2image to convert each PDF page to an image. Then use an OCR engine such as Tesseract to convert each image to a string. Once you have those strings, build a list of the words whose occurrences you want to count, loop over that list, and call the count() string method on each string obtained from Tesseract to get the number of repetitions of each place.
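The steps above could be sketched roughly as follows. This is only an illustration: it assumes the pdf2image and pytesseract packages (plus the Poppler and Tesseract binaries they rely on) are installed, and the helper names ocr_pdf_to_text and count_terms are made up for this example. The OCR imports sit inside the function so the counting helper can be tried on any string without the OCR stack.

```python
def ocr_pdf_to_text(pdf_path, dpi=300):
    """Render every PDF page to an image and OCR it into one string.

    Assumes pdf2image and pytesseract are installed; they are imported
    here so that count_terms() below stays usable without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(img) for img in images)


def count_terms(text, terms):
    """Return how often each term occurs in text, using str.count()."""
    return {term: text.count(term) for term in terms}


# Hypothetical usage with the file name from the other answer:
#   text = ocr_pdf_to_text("mypdf.pdf")
#   print(count_terms(text, ["California", "Vegas", "Toronto"]))
```

Note that count() only reports how many times a term appears in the text; it does not read the hotel figure printed next to each place, so that part would still need separate parsing.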

Jorj McKie

If you have a PDF with regular text (and not a PDF with scanned / image pages), then simply do this with PyMuPDF:

import fitz  # PyMuPDF
doc = fitz.open("mypdf.pdf")
for page in doc:  # iterate over the PDF pages
    words = page.get_text("words", sort=True)  # list of word items, sorted in reading sequence
    for i, word in enumerate(words):
        # word[4] is the word string; look for the label "Country:"
        if word[4] == "Country:" and i + 1 < len(words):
            # the next word should be the country name
            country = words[i + 1][4]
            print(f"Now dealing with country {country}")
            # whatever else you want to do with country ...

The get_text variant shown returns a list of items, one item for each string that contains no spaces. The other item components include the position of the string on the page (its wrapping rectangle). So the word items look like this: (x0, y0, x1, y1, "wordstring", ...).
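To get from those word items to the rows the question asks for, one possible approach is a small state machine over the word strings. The sketch below is an assumption-laden illustration, not part of PyMuPDF: extract_rows and rows_to_csv are hypothetical helpers, and it assumes the words arrive in the order shown in the question, that each country name is a single word, and that every entry ends with a number followed by "hotels".

```python
import csv
import io


def extract_rows(word_strings):
    """Turn a reading-order list of word strings into (country, place, hotels) rows."""
    tokens = list(word_strings)
    rows, country, collecting, name_parts = [], None, False, []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "Country:" and i + 1 < len(tokens):
            # The word after the label is taken as the (single-word) country name
            country, collecting, name_parts = tokens[i + 1], False, []
            i += 2
            continue
        if tokens[i:i + 3] == ["Place", "to", "visit:"]:
            # Start collecting place entries after this label
            collecting, name_parts = True, []
            i += 3
            continue
        if collecting:
            if tok.isdigit() and i + 1 < len(tokens) and tokens[i + 1] == "hotels":
                # A number followed by "hotels" closes the current place entry
                rows.append((country, " ".join(name_parts), f"{tok} hotels"))
                name_parts = []
                i += 2
                continue
            name_parts.append(tok)  # part of a (possibly multi-word) place name
        i += 1
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["country", "place", "hotels"])
    writer.writerows(rows)
    return buf.getvalue()
```

With PyMuPDF, the input could be built per page as [w[4] for w in page.get_text("words", sort=True)]. If an entry may continue on the next page, collect the strings of all pages into one list before calling extract_rows.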

If your PDF does contain scanned pages, however, you must OCR it first and then do the above. This is also possible with PyMuPDF inside the same script, because it contains an integrated interface to Tesseract-OCR.
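A minimal sketch of that OCR route, assuming Tesseract and its language data are installed where PyMuPDF can find them: the hypothetical helper below first tries the regular text layer and only falls back to Page.get_textpage_ocr when a page yields no words. Treating "no words" as "scanned page" is a simplifying assumption; a scan that carries a few real text items would not trigger the fallback.

```python
def words_with_ocr_fallback(page, language="eng", dpi=300):
    """Return word items for a PyMuPDF page, using OCR if it has no text layer."""
    words = page.get_text("words", sort=True)
    if words:
        return words  # regular text page, no OCR needed
    # No extractable text: OCR the whole page via PyMuPDF's Tesseract interface
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text("words", textpage=textpage, sort=True)


# Hypothetical usage:
#   import fitz  # PyMuPDF
#   doc = fitz.open("mypdf.pdf")
#   for page in doc:
#       words = words_with_ocr_fallback(page)
```

The returned items have the same shape as before, so the "Country:" lookup from the loop above can be applied to them unchanged.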