I am working on a PDF summarisation app using LangChain, which will use the open-source LLM Mistral-7B-Instruct-v0.2 for the summarisation task. Since I am new to this, I would like help from this community with the issues I am facing. First of all, the PDFs I am dealing with are large, roughly 500 to over 1000 pages, and relate to SEC filings. What I want to do is:
- Accurately extract text and metadata from PDFs
- Do some basic segmentation (e.g. by headings and sections); a rough sketch of what I have in mind is below.
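For the heading/section segmentation, this is roughly what I have in mind. It is only a rough sketch: I read font sizes from PyMuPDF's `get_text("dict")` output and treat any span above a guessed size threshold as a section heading, which probably needs tuning per filing; `split_by_headings` and the threshold value are my own placeholder names.

```python
import fitz  # PyMuPDF

def split_by_headings(pdf_file, heading_size=14.0):
    """Rough section segmentation: spans whose font size is at least
    `heading_size` are treated as section headings (threshold is a guess)."""
    doc = fitz.open(pdf_file)
    sections = []
    current = {"heading": None, "text": [], "pages": []}
    for page_num, page in enumerate(doc):
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no "lines"
                for span in line["spans"]:
                    text = span["text"].strip()
                    if not text:
                        continue
                    if span["size"] >= heading_size:
                        # Start a new section at each large-font span
                        if current["heading"] or current["text"]:
                            sections.append(current)
                        current = {"heading": text, "text": [], "pages": [page_num]}
                    else:
                        current["text"].append(text)
                        if page_num not in current["pages"]:
                            current["pages"].append(page_num)
    if current["heading"] or current["text"]:
        sections.append(current)
    # doc.metadata gives title/author/etc. from the PDF itself
    return doc.metadata, sections
```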
Based on some references, I have written code that I am not sure will do the job. I am attaching the code blocks below; please tell me whether the procedure I am following is correct.
```python
import re

def clean_up_text(self, content: str) -> str:
    # Re-join words hyphenated across a line break, e.g. "consoli-\ndated" -> "consolidated"
    content = re.sub(r'(\w+)-\n(\w+)', r'\1\2', content)
    # Remove unwanted patterns: newlines, dash runs, escaped-unicode sequences
    # appearing as literal text, and private-use bullet glyphs
    unwanted_patterns = [
        "\\n", " —", "——————————", "—————————", "—————",
        r'\\u[\dA-Fa-f]{4}', r'\uf075', r'\uf0b7'
    ]
    for pattern in unwanted_patterns:
        content = re.sub(pattern, "", content)
    # Fix improperly spaced hyphenated words and normalize whitespace
    content = re.sub(r'(\w)\s*-\s*(\w)', r'\1-\2', content)
    content = re.sub(r'\s+', ' ', content)
    print(len(content))  # debug: length of the cleaned text in characters
    return content
```
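To show the kind of input/output I expect from this method (`PDFLoader` is just a placeholder name for the class these methods live on, and the sample string is made up):

```python
loader = PDFLoader()  # placeholder name for the class the methods belong to
raw = "The Company reported consoli-\ndated revenues of \n$1.2 billion — up 5%."
cleaned = loader.clean_up_text(raw)
# The method first prints the cleaned length (debug print), then returns:
print(cleaned)
# "The Company reported consolidated revenues of $1.2 billion up 5%."
```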
```python
import fitz  # PyMuPDF
from llama_index.core.node_parser import SentenceSplitter  # note: from llama_index, not langchain

def load_documents(self, pdf_file):
    doc = fitz.open(pdf_file)
    text_chunks = []
    doc_idxs = []
    # First pass splits the raw page text; second pass re-splits the cleaned text
    raw_parser = SentenceSplitter(
        chunk_size=1024,
        chunk_overlap=200,
    )
    clean_parser = SentenceSplitter(
        chunk_size=1024,
        chunk_overlap=200,
        separator="\n",
    )
    for doc_idx, page in enumerate(doc):
        page_text = page.get_text("text")
        print(len(page_text))  # debug: raw page length in characters
        cur_text_chunks = raw_parser.split_text(page_text)
        page_text = "\n".join(self.clean_up_text(content) for content in cur_text_chunks)
        new_text_chunks = clean_parser.split_text(page_text)
        text_chunks.extend(new_text_chunks)
        # Remember which page each chunk came from
        doc_idxs.extend([doc_idx] * len(new_text_chunks))
        print(len(page_text))  # debug: cleaned page length in characters
        print(page_text)       # debug: cleaned page text
    return text_chunks, doc_idxs
```
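And this is how I am calling `load_documents` at the moment (again, `PDFLoader` and the file name are placeholders):

```python
loader = PDFLoader()  # placeholder name for the class
chunks, page_ids = loader.load_documents("filing.pdf")  # example path

print(f"{len(chunks)} chunks from {max(page_ids) + 1} pages")
# Spot-check a few chunks and the page index each one came from
for chunk, page in list(zip(chunks, page_ids))[:3]:
    print(page, len(chunk), chunk[:80])
```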
I also don't think the chunks are actually 1024 tokens, because when I print the length of specific chunks (with `len()`) it comes out larger than that.
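For reference, `len()` counts characters, so I also tried counting tokens with the Mistral tokenizer. This assumes the tokenizer below is a reasonable proxy; the splitter may well count tokens with a different tokenizer internally, so the numbers are only indicative:

```python
from transformers import AutoTokenizer

# Assumption: the Mistral tokenizer is only a proxy; the splitter may use a
# different tokenizer internally, so these counts are indicative only.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

for chunk in chunks[:5]:
    n_chars = len(chunk)
    n_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
    print(f"{n_chars} chars -> {n_tokens} tokens")
```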