Python-based way to extract text from a scientific/academic paper for a language model

I am looking for a method to extract only the core text of a scientific paper. The papers are structured in paragraphs, and I only want to keep the body text, without any email addresses, websites, tables, or pictures. My goal is to create a clean txt file for a language model.

Which methods are available to filter the data (e.g. by font size, by searching for keywords, or by including spaCy, etc.)?
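I have also been wondering whether filtering text spans by font size could work, since headlines, footnotes, and captions usually differ in size from the body text. Below is a minimal sketch of that idea with PyMuPDF (`fitz`); the function name and the size thresholds are only my guesses and would need tuning per journal layout:

```python
import fitz  # PyMuPDF

def extract_body_text(pdf_path, min_size=9.0, max_size=11.5):
    """Keep only text spans whose font size looks like body text.

    The thresholds are assumptions; inspect span["size"] for your
    own papers first and adjust them to the journal's template.
    """
    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                if block["type"] != 0:  # 0 = text block, 1 = image block
                    continue
                for line in block["lines"]:
                    for span in line["spans"]:
                        if min_size <= span["size"] <= max_size:
                            parts.append(span["text"])
    return " ".join(parts)
```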

Thank you in advance!

This is what I have so far; it simply dumps the raw text of every PDF into one file:

```python
from langchain.document_loaders import PyPDFLoader  # for loading the PDFs
import glob
import os

# Path to the folder containing the PDFs
folder_path = "C:/Users/faenkaya/Desktop/Language Models/documents/Scientific Data eng"

# Path to the output file
output_file = "C:/Users/faenkaya/Desktop/Language Models/documents/Scientific Data eng/Full_text.txt"

# Write everything into one txt file
with open(output_file, "w", encoding="utf-8") as file:
    # Loop over each PDF in the folder
    for file_path in glob.glob(os.path.join(folder_path, "*.pdf")):
        # PyPDFLoader takes the path directly, no need to open the file first
        loader = PyPDFLoader(file_path)
        pages = loader.load_and_split()
        text = "\n".join(page.page_content for page in pages)
        print(file_path)
        # Append this paper's text to the output file
        file.write(text)
        file.write("\n")
```
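For the email addresses and websites, I could also imagine a simple regex pass over `text` before writing it out. A rough sketch (the helper is hypothetical and the patterns are naive, so they will miss edge cases):

```python
import re

def strip_contacts(text: str) -> str:
    """Remove email addresses and URLs, then collapse leftover whitespace."""
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # email addresses
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)  # URLs
    return re.sub(r"[ \t]+", " ", text).strip()
```

Would combining something like this with spaCy (e.g. sentence segmentation to drop fragments left over from tables) be a sensible approach, or is there a more established pipeline for this?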