I am trying to expand a text corpus that was made available to me. The corpus is stored as an .RDS file, and I need to add the text from 20 PDF documents to it, with each PDF becoming its own document entry in the corpus.
The packages I am using in the project are:
- readr
- tidyverse
- tidytext
- quanteda
- tm
These are the paths of all the PDFs I am trying to convert to text and add to the corpus:
pdf_paths <- c("NGODocuments/1234567_EPIC_NGO.pdf",
"NGODocuments/F2662175_Allied-Startups_NGO.pdf",
"NGODocuments/F2662292_Civil-Liberties_NGO.pdf",
"NGODocuments/F2662654_PGEU_NGO.pdf",
"NGODocuments/F2663061_Not-for-profit-law_NGO.pdf",
"NGODocuments/F2663127_Eurocities_NGO.pdf",
"NGODocuments/F2663268_European-Disability_NGO.pdf",
"NGODocuments/F2663380_Information-Accountability_NGO.pdf",
"NGODocuments/F2665208_Hospital-Pharmacy_NGO.pdf",
"NGODocuments/F2665222_European-Radiology_NGO.pdf",
"BusinessDocs/123_DeepMind_Business.pdf",
"BusinessDocs/1234_LinedIn_Business.pdf",
"BusinessDocs/12345_AVAAZ_Business.pdf",
"BusinessDocs/F2488672_SAZKA_Business.pdf",
"BusinessDocs/F2662492_Google_Business.pdf",
"BusinessDocs/F2662771_SICK_Business.pdf",
"BusinessDocs/F2662846_sanofi_Business.pdf",
"BusinessDocs/F2662935_EnBV_Business.pdf",
"BusinessDocs/F2662941_Siemens_Business.pdf",
"BusinessDocs/F2662944_BlackBerry_Business.pdf")
This is the code I use to try to extract the text and then expand the corpus:
pdf_text <- lapply(pdf_paths, read_file)
corpus <- tm::Corpus(VectorSource(pdf_text))
prev_corpus <- readRDS("data_corpus_aiact.RDS")
new_corpus <- c(prev_corpus, corpus)
writeCorpus(new_corpus, filenames = pdf_paths)
However, when I run this code, the line that creates new_corpus fails with this error:
Error: as.corpus() only works on corpus objects.
I have searched all over the web for a solution, but nothing I have found works. I did try the pdftools package once, but converting the PDFs to text produced an error saying that a document had an illegal font weight, which is why I switched to readr.
The goal is to generate a new corpus that contains the content of the old corpus plus the new documents, and to save it as a new .RDS file.
Here's how I would do it, with only quanteda and readtext.
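A minimal sketch of the reading step, assuming the PDFs sit in the NGODocuments and BusinessDocs folders from the question. The helper name read_pdf_corpus and the object name new_corp are made up for illustration. Note that readr::read_file() only returns a file's raw contents and does not parse PDFs, and that a tm corpus is not a quanteda corpus, which appears to be what the as.corpus() error is complaining about. As far as I know, readtext relies on pdftools for PDFs, so that package needs to be installed.

```r
library(quanteda)
library(readtext)

# Collect the PDF paths from the two folders named in the question.
pdf_paths <- list.files(c("NGODocuments", "BusinessDocs"),
                        pattern = "\\.pdf$", full.names = TRUE)

# Hypothetical helper: parse the PDFs into text and wrap them in a quanteda corpus.
read_pdf_corpus <- function(paths) {
  rt <- readtext(paths)  # data frame with doc_id and text columns
  corpus(rt)             # one corpus document per PDF file
}

# Only run when the files are actually present.
if (length(pdf_paths) > 0) {
  new_corp <- read_pdf_corpus(pdf_paths)
  print(summary(new_corp, n = 5))
}
```

Printing a few documents with as.character(new_corp)[1:2] is a quick way to check how well the conversion went.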
You have some oddities in the pdf files, but this is not uncommon. You should consider inspecting the texts to see if readtext::readtext() converted them correctly.

Now we can change the document names to match what was in your RDS file:
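The renaming step might look like the sketch below. The naming scheme inside the RDS file is not shown in the question, so stripping the .pdf extension is only an assumed pattern; check head(docnames(old_corp)) on the real corpus first. The two-document corpus here is a toy stand-in for the one built from the PDFs.

```r
library(quanteda)

# Toy stand-in: readtext uses the file names as document names.
new_corp <- corpus(c("1234567_EPIC_NGO.pdf" = "Some text.",
                     "F2662492_Google_Business.pdf" = "Other text."))

# Assumed pattern: drop the ".pdf" suffix so the names resemble the old corpus.
docnames(new_corp) <- sub("\\.pdf$", "", docnames(new_corp))

docnames(new_corp)
```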
Some of those will collide with old docnames, and in quanteda, these should be unique. So:
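One way to resolve the collisions, sketched with toy corpora: the "_new" suffix is an arbitrary choice for illustration, and dropping the clashing documents instead may be more appropriate if they are true duplicates of what is already in the old corpus.

```r
library(quanteda)

# Toy stand-ins for the old (RDS) corpus and the new (PDF) corpus.
old_corp <- corpus(c("F2662492_Google_Business" = "Old text.",
                     "F0000001_Other" = "More old text."))
new_corp <- corpus(c("F2662492_Google_Business" = "New text.",
                     "1234567_EPIC_NGO" = "More new text."))

# Find new document names that already exist in the old corpus.
new_names <- docnames(new_corp)
clash <- new_names %in% docnames(old_corp)

# Rename only the clashing documents (the suffix is a hypothetical choice).
new_names[clash] <- paste0(new_names[clash], "_new")
docnames(new_corp) <- new_names

docnames(new_corp)
```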
Now we can simply combine them, and the + operator will automatically match up the docvar columns.

Created on 2023-05-15 with reprex v2.0.2
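For completeness, the combining and saving steps could be sketched as follows, again with toy corpora. The docvar name type and the output file name are assumptions, and the file is written to a temporary directory here so the example does not touch the project folder.

```r
library(quanteda)

# Toy stand-ins with unique document names; only the old corpus has a docvar.
old_corp <- corpus(c(doc_old = "Old text."),
                   docvars = data.frame(type = "NGO"))
new_corp <- corpus(c(doc_new = "New text."))

# Combine the two quanteda corpora.
combined <- old_corp + new_corp

# Save under a new (hypothetical) file name and read it back as a check.
out_file <- file.path(tempdir(), "data_corpus_aiact_expanded.RDS")
saveRDS(combined, out_file)
reloaded <- readRDS(out_file)

ndoc(reloaded)
```

In the real project, old_corp would come from readRDS("data_corpus_aiact.RDS") and new_corp from the PDFs.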