How to expand an .RDS text corpus using R and the readr package


I am attempting to expand a text corpus that was made available to me. The corpus is stored as an .RDS file, and I need to expand it with the text from 20 different PDF documents, where each PDF file becomes its own document entry in the corpus.

The packages I am using in the project are:

  • readr
  • tidyverse
  • tidytext
  • quanteda
  • tm

These are the paths of all the PDFs I am trying to convert to text and add to the corpus:

pdf_paths <- c("NGODocuments/1234567_EPIC_NGO.pdf",
           "NGODocuments/F2662175_Allied-Startups_NGO.pdf",
           "NGODocuments/F2662292_Civil-Liberties_NGO.pdf",
           "NGODocuments/F2662654_PGEU_NGO.pdf",
           "NGODocuments/F2663061_Not-for-profit-law_NGO.pdf",
           "NGODocuments/F2663127_Eurocities_NGO.pdf",
           "NGODocuments/F2663268_European-Disability_NGO.pdf",
           "NGODocuments/F2663380_Information-Accountability_NGO.pdf",
           "NGODocuments/F2665208_Hospital-Pharmacy_NGO.pdf",
           "NGODocuments/F2665222_European-Radiology_NGO.pdf",
           "BusinessDocs/123_DeepMind_Business.pdf",
           "BusinessDocs/1234_LinedIn_Business.pdf",
           "BusinessDocs/12345_AVAAZ_Business.pdf",
           "BusinessDocs/F2488672_SAZKA_Business.pdf",
           "BusinessDocs/F2662492_Google_Business.pdf",
           "BusinessDocs/F2662771_SICK_Business.pdf",
           "BusinessDocs/F2662846_sanofi_Business.pdf",
           "BusinessDocs/F2662935_EnBV_Business.pdf", 
           "BusinessDocs/F2662941_Siemens_Business.pdf",
           "BusinessDocs/F2662944_BlackBerry_Business.pdf")

This is the code I use to try to extract the text and then expand the corpus:

pdf_text <- lapply(pdf_paths, read_file)
corpus <- tm::Corpus(VectorSource(pdf_text))

prev_corpus <- readRDS("data_corpus_aiact.RDS")
new_corpus <- c(prev_corpus, corpus)
writeCorpus(new_corpus, filenames = pdf_paths)

However, when I run this code, I get an error on the line that creates new_corpus:

Error: as.corpus() only works on corpus objects.

I have searched all over the web for a solution, but nothing I have found seems to work. I did try once with a package called pdftools, but I got an error when converting the PDFs to text, saying that the document had an illegal font weight, which is why I switched to readr.

The goal is to generate a new corpus that contains the content of the old corpus plus the new documents, and to save it as a new .RDS file.

Best answer, by Ken Benoit:

Here's how I would do it, with only quanteda and readtext.

library("quanteda")
#> Package version: 3.3.0.9001
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

prev_corpus <- readRDS("~/Downloads/pdf documents/data_corpus_aiact.rds")
pdfpath <- "~/Downloads/pdf documents/PDF documents/NGODocuments/*.pdf"

new_corpus <- readtext::readtext(pdfpath, 
                                 docvarsfrom = "filenames",
                                 docvarnames = c("id", "actor", "type_actor")) |>
    corpus()
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight

You have some oddities in the PDF files, but this is not uncommon. You should consider inspecting the texts to see if readtext::readtext() converted them correctly.
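One rough way to do that inspection is to look at how many characters each converted document contains and flag the ones that look too short. The sketch below uses a toy corpus as a stand-in for the real one, and the document names, texts and the threshold of 20 characters are all assumptions made for illustration, not values from the original data.

```r
library("quanteda")

# Toy stand-in for the corpus built from readtext() output (hypothetical texts)
toy_corpus <- corpus(c(
    doc_ok    = "Feedback text that was extracted from the PDF without problems.",
    doc_short = "1"
))

# Number of characters per document, named by document
n_chars <- nchar(as.character(toy_corpus))

# Flag documents whose extracted text is suspiciously short
# (the cut-off of 20 characters is an arbitrary assumption)
suspicious <- names(n_chars)[n_chars < 20]
suspicious
```

Documents flagged this way are only candidates for a closer look; reading the start of each text is still the more reliable check.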

Now we can change the document names to match what was in your RDS file:

docnames(new_corpus) <- with(docvars(new_corpus),
                             paste0(actor, " (", type_actor, ")"))
print(new_corpus, 2)
#> Corpus consisting of 40 documents and 3 docvars.
#> EPIC (NGO) :
#> "          FEEDBACK OF THE ELECTRONIC PRIVACY INFORMATION CEN..."
#> 
#> Allied-Startups (NGO) :
#> "Feedback reference F2662175 Submitted on 13 July 2021 Submit..."
#> 
#> [ reached max_ndoc ... 38 more documents ]
head(docvars(new_corpus))
#>         id              actor type_actor
#> 1  1234567               EPIC        NGO
#> 2 F2662175    Allied-Startups        NGO
#> 3 F2662292    Civil-Liberties        NGO
#> 4 F2662654               PGEU        NGO
#> 5 F2663061 Not-for-profit-law        NGO
#> 6 F2663127         Eurocities        NGO
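The id, actor and type_actor columns come from the file names. As far as I understand it, docvarsfrom = "filenames" splits each file name on a separator (an underscore by default, controlled by readtext's dvsep argument) and assigns the pieces to the names given in docvarnames. The base R sketch below imitates that splitting on two paths taken from the question; it is an illustration of the idea, not readtext's actual implementation.

```r
# Two file paths copied from the question
files <- c("NGODocuments/1234567_EPIC_NGO.pdf",
           "BusinessDocs/F2662492_Google_Business.pdf")

# Drop the directory and the .pdf extension, then split on underscores
stems <- tools::file_path_sans_ext(basename(files))
parts <- strsplit(stems, "_", fixed = TRUE)

# One row per file, one column per piece of the file name
docvars_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvars_df) <- c("id", "actor", "type_actor")
docvars_df
```

This only works cleanly because every file name has exactly three underscore-separated parts; hyphens inside a part, as in Not-for-profit-law, are left alone.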

Some of the new document names will collide with document names in the old corpus, and in quanteda these should be unique. So:

# to avoid duplicated docids
duplicated_index <- which(docnames(new_corpus) %in% docnames(prev_corpus))
docnames(new_corpus)[duplicated_index] <- 
    paste(docnames(new_corpus)[duplicated_index], "new")
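As an alternative to the "new" suffix, base R's make.unique() can generate non-colliding names by appending a running number. This is a different technique from the one used above, shown here on plain character vectors with made-up document names rather than on the real corpora.

```r
# Hypothetical document names standing in for docnames(prev_corpus)
# and docnames(new_corpus)
old_names <- c("EPIC (NGO)", "Access Now (NGO)")
new_names <- c("EPIC (NGO)", "PGEU (NGO)")

# make.unique() leaves the first occurrence untouched and numbers later repeats
all_names <- make.unique(c(old_names, new_names), sep = " ")

# The last length(new_names) entries are the de-duplicated new names
new_unique <- tail(all_names, length(new_names))
new_unique
#> [1] "EPIC (NGO) 1" "PGEU (NGO)"
```

The result could then be assigned with docnames(new_corpus) <- new_unique. Because the old names come first in the combined vector, only the new names are ever changed.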

Now we can simply combine them, and the + operator will automatically match up the docvar columns.


# combine the two
new_corpus <- prev_corpus + new_corpus
print(new_corpus, 0, 0)
#> Corpus consisting of 60 documents and 3 docvars.
head(docvars(new_corpus))
#>                                 actor type_actor   id
#> 1                          Access Now        NGO <NA>
#> 2                                 ACM        NGO <NA>
#> 3                      AlgorithmWatch        NGO <NA>
#> 4                               AVAAZ        NGO <NA>
#> 5                     Bits of Freedom        NGO <NA>
#> 6 Centre for Democracy and Technology        NGO <NA>
tail(docvars(new_corpus))
#>                  actor type_actor       id
#> 55           Impact-AI        NGO F2665589
#> 56         Croation-AI        NGO F2665590
#> 57               GLEIF        NGO F2665591
#> 58 Fraud-Corruption-AI        NGO F2665605
#> 59      Future-Society        NGO F2665611
#> 60   Climate-Change-AI        NGO F2665623

Created on 2023-05-15 with reprex v2.0.2
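The question also asks for the combined corpus to be saved as a new .RDS file, which the code above does not show. A minimal sketch of that last step with base R's saveRDS() follows; the toy corpus and the output file name are assumptions for illustration, and in practice new_corpus would be the combined corpus and the file would go wherever you keep your data.

```r
library("quanteda")

# Toy stand-in for the combined corpus (hypothetical names and texts)
new_corpus <- corpus(c("Access Now (NGO)" = "text from the old corpus",
                       "EPIC (NGO) new"   = "text from a newly added PDF"))

# Hypothetical output file; written to a temporary directory here
out_file <- file.path(tempdir(), "data_corpus_aiact_expanded.rds")
saveRDS(new_corpus, out_file)

# Read it back to confirm that the documents survived the round trip
reloaded <- readRDS(out_file)
ndoc(reloaded)
identical(docnames(reloaded), docnames(new_corpus))
```

Saving under a new file name rather than overwriting data_corpus_aiact.RDS keeps the original corpus available if the PDF conversion later turns out to need fixing.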