I have the following PDF. I want to search for the word 'Country' to get the country name, then read the 'Place to visit' list that follows it, and convert the result to a CSV file.
This is my analysis when you visit a country, which places you must see in any season. Also get the number of hotels/motels in that places that served vegan food.
Country: USA
The following places of USA you must visit. There are number of hotels that provided vegan food.
Place to visit: California 28 hotels Vegas 9 hotels New York 35 hotels
Country: Canada
The following places of Canada you must visit. There are number of hotels that provided vegan food.
Place to visit: Toronto 22 hotels Vancouver 13 hotels Ottawa 8 hotels
Desired result:
USA California 28 hotels
USA Vegas 9 hotels
USA New York 35 hotels
Canada Toronto 22 hotels
Canada Vancouver 13 hotels
Canada Ottawa 8 hotels
First, use pdf2image to convert each PDF page to an image. Then use an OCR tool such as Tesseract to convert each image to a string. Once you have that text, build a list of the words whose occurrences you want to count, loop over that list, and call the string method count() on each string returned by Tesseract to get the number of times each place appears.
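Here is a minimal sketch of that pipeline, under a few assumptions. It uses pdf2image's convert_from_path and pytesseract's image_to_string for the OCR step (both need Poppler and Tesseract installed). For the parsing step it swaps count() for a regular expression, because count() only gives the number of occurrences, while the desired result needs each country paired with its places. The helper names (ocr_pdf, extract_rows, rows_to_csv) and the CSV column names are made up for illustration, and the parser assumes the OCR output keeps each "Country:" and "Place to visit:" entry on a single line, as in the sample text.

```python
import csv
import io
import re

# Assumed layout: "Country: <name>" on one line, and later a line starting
# with "Place to visit:" followed by "<place> <n> hotels" repeated.
COUNTRY_RE = re.compile(r"Country:\s*(.+)")
PLACE_RE = re.compile(r"([A-Za-z][A-Za-z .'-]*?)\s+(\d+)\s+hotels?")
MARKER = "Place to visit:"


def ocr_pdf(pdf_path):
    """OCR every page of the PDF and return the combined text."""
    # Imported here so the parsing code below works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_rows(text):
    """Return (country, place, hotel_count) tuples found in the text."""
    rows = []
    country = None
    for line in text.splitlines():
        line = line.strip()
        match = COUNTRY_RE.match(line)
        if match:
            # Remember the most recent country for the places that follow.
            country = match.group(1).strip()
        elif country and line.startswith(MARKER):
            for place, count in PLACE_RE.findall(line[len(MARKER):]):
                rows.append((country, place.strip(), int(count)))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["country", "place", "hotels"])
    writer.writerows(rows)
    return buf.getvalue()


# The sample text from the question, standing in for real OCR output.
SAMPLE = """Country: USA
The following places of USA you must visit. There are number of hotels that provided vegan food.
Place to visit: California 28 hotels Vegas 9 hotels New York 35 hotels
Country: Canada
The following places of Canada you must visit. There are number of hotels that provided vegan food.
Place to visit: Toronto 22 hotels Vancouver 13 hotels Ottawa 8 hotels
"""

if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE)))
```

With a real file you would call extract_rows(ocr_pdf("your_file.pdf")) instead of using SAMPLE, and write the returned CSV text to a file opened with newline="". OCR output is rarely perfect, so expect to adjust the patterns to whatever Tesseract actually returns for your pages.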