Can I convert a Streamlit uploaded PDF file into a LangChain document?


I am building a Streamlit app where a user can upload a PDF file and the app generates questions based on it. The problem is that the upload arrives as an UploadedFile object, while the LangChain PDF loader takes the path of a locally stored file as input, which it then loads and splits into small document chunks with a text splitter. Is there a way to convert the uploaded file into a LangChain document before processing, or is there another suitable method?

InsertCheesyLine On

Let us say you have a Streamlit app with st.file_uploader:

import streamlit as st

uploaded_file = st.file_uploader("Upload file")

Once a file is uploaded, uploaded_file holds the file data. You cannot pass it directly to PyPDFLoader, because the loader expects a file path and uploaded_file is an in-memory, BytesIO-like object.

So we first need to save the file locally:

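if uploaded_file is not None:  # uploaded_file is None until the user uploads something
    with open(uploaded_file.name, mode='wb') as w:
        w.write(uploaded_file.getvalue())  # write the raw bytes to disk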

and then pass its file path to the loader:

from langchain_community.document_loaders import PyPDFLoader

if uploaded_file is not None:  # only run once a file has been uploaded
    loader = PyPDFLoader(uploaded_file.name)
    pages = loader.load_and_split()
    print(pages[0])

pages should now be a list of LangChain Document objects.
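
One caveat with writing to uploaded_file.name is that it lands in the app's working directory and can overwrite an existing file with the same name. A minimal sketch of an alternative, assuming you are fine with a temporary file: save_upload_to_temp and load_pdf_pages are hypothetical helper names of my own, not part of Streamlit or LangChain.

```python
import os
import tempfile


def save_upload_to_temp(data: bytes, suffix: str = ".pdf") -> str:
    """Hypothetical helper: write raw bytes to a uniquely named temp file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so a path-based loader can reopen it afterwards.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


def load_pdf_pages(data: bytes):
    """Hypothetical helper: turn PDF bytes into a list of LangChain Documents."""
    # Imported here so save_upload_to_temp can be used without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader

    path = save_upload_to_temp(data)
    try:
        return PyPDFLoader(path).load()  # one Document per PDF page
    finally:
        os.remove(path)  # remove the temp file once the pages are in memory
```

In the app this would be called as pages = load_pdf_pages(uploaded_file.getvalue()), inside the same "if uploaded_file is not None" check. Note that load() returns one Document per page; if you want smaller chunks, you would still run a text splitter over the result.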