Chunking large PDFs for a summarisation app using an open-source LLM


I am working on a PDF summarisation app using LangChain that will use the open-source LLM Mistral-7B-Instruct-v0.2 for the summarisation task. Since I am new to this, I would appreciate help from the community with the issues I am facing. First of all, the PDFs I am dealing with are large, roughly 500 to 1,000+ pages, and relate to SEC filings. What I want to do is:

  1. Accurately extract text and metadata from PDFs
  2. Do some basic segmentation (e.g. by headings and sections); a rough sketch of what I mean follows this list.
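
For point 2, here is a minimal sketch of the kind of heading-based segmentation I have in mind, using PyMuPDF's span-level font metadata. The 14.0 pt heading threshold and the function name are placeholders I made up, not tuned values:

    import fitz  # PyMuPDF

    def split_by_headings(pdf_path: str, heading_size: float = 14.0):
        """Group text into sections that start at a large-font line."""
        doc = fitz.open(pdf_path)
        sections = []                      # (heading, body_text) pairs
        heading, body = "(front matter)", []
        for page in doc:
            # "dict" extraction exposes the font size of each text span
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):
                    spans = line.get("spans", [])
                    text = "".join(s["text"] for s in spans).strip()
                    if not text:
                        continue
                    if max(s["size"] for s in spans) >= heading_size:
                        # a large-font line starts a new section
                        sections.append((heading, "\n".join(body)))
                        heading, body = text, []
                    else:
                        body.append(text)
        sections.append((heading, "\n".join(body)))
        return sections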

I have taken some references and written code, but I am not sure it will do the job. I am attaching the code blocks below; please let me know whether the procedure I am following is correct.

    import re

    def clean_up_text(self, content: str) -> str:
        # Re-join words hyphenated across a line break ("exam-\nple" -> "example")
        content = re.sub(r'(\w+)-\n(\w+)', r'\1\2', content)

        # Remove unwanted patterns: newlines, em-dash runs, literal "\uXXXX"
        # escape sequences, and two private-use bullet characters
        unwanted_patterns = [
            "\\n", "  —", "——————————", "—————————", "—————",
            r'\\u[\dA-Fa-f]{4}', r'\uf075', r'\uf0b7'
        ]
        for pattern in unwanted_patterns:
            content = re.sub(pattern, "", content)

        # Fix improperly spaced hyphenated words and normalize whitespace
        content = re.sub(r'(\w)\s*-\s*(\w)', r'\1-\2', content)
        content = re.sub(r'\s+', ' ', content)
        print(len(content))  # debug: cleaned length in characters
        return content
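
To sanity-check the cleaner on its own, I run it on a small invented string like this (PDFLoader is a placeholder for the class these methods live in):

    # self is unused, so the unbound method can be called directly
    sample = "The company re-\nported strong reve - nue growth.\uf0b7  See note 4."
    print(PDFLoader.clean_up_text(None, sample))
    # prints the character count, then:
    # The company reported strong reve-nue growth. See note 4.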

    import fitz  # PyMuPDF
    from llama_index.core.node_parser import SentenceSplitter

    def load_documents(self, pdf_file):
        doc = fitz.open(pdf_file)
        text_chunks = []
        doc_idxs = []  # page index each chunk came from
        for doc_idx, page in enumerate(doc):
            page_text = page.get_text("text")
            print(len(page_text))  # debug: raw page length in characters

            # First pass: split the raw page text into chunks
            text_parser = SentenceSplitter(
                chunk_size=1024,
                chunk_overlap=200,
            )
            cur_text_chunks = text_parser.split_text(page_text)

            # Clean each chunk and re-join; clean_up_text collapses all
            # whitespace to single spaces, so these "\n" separators are
            # the only newlines left afterwards
            page_text = "\n".join(self.clean_up_text(content) for content in cur_text_chunks)

            # Second pass: re-split the cleaned text on those newlines
            text_parser = SentenceSplitter(
                chunk_size=1024,
                chunk_overlap=200,
                separator="\n"
            )
            new_text_chunks = text_parser.split_text(page_text)
            text_chunks.extend(new_text_chunks)
            doc_idxs.extend([doc_idx] * len(new_text_chunks))
            print(len(page_text))  # debug: cleaned page length in characters
            print(page_text)
        return text_chunks, doc_idxs
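
And this is roughly how I call it (the file name is just an example, and PDFLoader is again a placeholder for my class):

    loader = PDFLoader()
    text_chunks, doc_idxs = loader.load_documents("sec_10k_filing.pdf")
    print(f"{len(text_chunks)} chunks from {max(doc_idxs) + 1} pages")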

I am also not convinced the chunks are 1024 tokens each: when I print the length of specific chunks, it is larger than that (though I may be comparing characters to tokens, since len() counts characters).
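
To check the actual token counts, I tried something along these lines with the Mistral tokenizer from Hugging Face (note that SentenceSplitter's internal tokenizer is not necessarily the same one, so the counts are only indicative):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
    for chunk in text_chunks[:5]:
        # chunk_size in SentenceSplitter is measured in tokens, not characters
        print(len(chunk), "chars /", len(tok.encode(chunk)), "tokens")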
