I'm trying to extract a line of text from the first page of each multi-page PDF file in a list of PDFs. I'm trying to get the text into a dataframe so I can extract the author of each PDF, which is on the first page and the same word precedes the author in every single document.
I found the resource below by Packt Publishing that gets very close to what I'm trying to do, but when I run its for loop (copied and pasted, with my own object names plugged in), R throws the error shown after the loop:
For loop:
text_df <- data.frame(matrix(ncol=2, nrow=0))
colnames(text_df) <- c("pdf title", "text")
for (i in 1:length(vector)){
  print(i)
  pdf_text(paste("folder/", vector[i],sep = "")) %>%
    strsplit("\n")-> document_text
  data.frame("pdf title" = gsub(x =vector[i],pattern = ".pdf", replacement = ""),
             "text" = document_text, stringsAsFactors = FALSE) -> document
  colnames(document) <- c("pdf title", "text")
  text_df <- rbind(text_df,document)
}
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 50, 60, 11
Could someone help me understand what this error means? Could someone direct me to other resources that accomplish what I'm trying to do? Thank you in advance!
Resource: https://www.r-bloggers.com/2018/01/how-to-extract-data-from-a-pdf-file-with-r/
Here's an example based on a few PDFs from arXiv; those are also used in the pdftools intro. The "keyword" here for finding the author is \n\n, the two line breaks between title and author. Searching for a string preceding arXiv would have worked too. When working with pdf_text() output, watch out for all the whitespace and line breaks in the resulting strings.
Created on 2023-06-04 with reprex v2.0.2
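For the case described in the question, where the same word precedes the author on the first page of every document, a minimal sketch could look like the following. It is not the original reprex code. `extract_author()` and `authors_from_pdfs()` are hypothetical helper names, "Author:" is an assumed keyword, and the demo string is made up. The only pdftools call used is `pdf_text()`, which returns one string per page.

```r
# Hypothetical helper: return the text that follows `keyword` on the first
# line of `first_page` that contains it, or NA if the keyword is absent.
extract_author <- function(first_page, keyword) {
  # Split the page into lines and drop the layout whitespace around each line
  lines <- trimws(strsplit(first_page, "\n", fixed = TRUE)[[1]])
  hit <- grep(keyword, lines, fixed = TRUE)
  if (length(hit) == 0) return(NA_character_)
  line <- lines[hit[1]]
  pos <- regexpr(keyword, line, fixed = TRUE)
  # Keep everything after the end of the keyword
  trimws(substring(line, pos + attr(pos, "match.length")))
}

# Hypothetical wrapper: one row per PDF, reading only page 1 of each file.
# Assumes the pdftools package is installed; `files` holds full file paths.
authors_from_pdfs <- function(files, keyword) {
  first_pages <- vapply(files, function(f) pdftools::pdf_text(f)[1],
                        character(1), USE.NAMES = FALSE)
  data.frame(
    pdf_title = sub("\\.pdf$", "", basename(files)),
    author = vapply(first_pages, extract_author, character(1),
                    keyword = keyword, USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up first-page text standing in for pdf_text(file)[1]
demo_page <- "A Made-Up Title\n\nAuthor: Jane Doe\nSome abstract text"
extract_author(demo_page, "Author:")
# [1] "Jane Doe"
```

With real files this would be called as something like `authors_from_pdfs(list.files("folder", pattern = "\\.pdf$", full.names = TRUE), "Author:")`. If the author sits on the line after the keyword rather than on the same line, the helper would need to return `lines[hit[1] + 1]` instead.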
(I would not use the linked R-bloggers post as a base.)
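As for the error itself: pdf_text() returns one string per page, so strsplit() on its output gives a list with one element per page, each holding a different number of lines (50, 60 and 11 for the three-page file in the question). data.frame() then tries to turn every list element into its own column and fails because the columns differ in length. A small sketch that reproduces this without any PDF, using a made-up `pages` vector in place of real pdf_text() output:

```r
# Stand-in for pdf_text() output: one string per page (made-up content)
pages <- c("line 1\nline 2\nline 3", "line 1\nline 2")

# strsplit() returns a list with one character vector per page
document_text <- strsplit(pages, "\n")
lengths(document_text)
# [1] 3 2

# Passing that list to data.frame() fails with
# "arguments imply differing number of rows: 3, 2"
res <- try(data.frame(text = document_text, stringsAsFactors = FALSE),
           silent = TRUE)
inherits(res, "try-error")
# [1] TRUE

# Keeping only the first page gives a single vector of lines,
# which fits into one column without trouble
first_page_lines <- strsplit(pages[1], "\n", fixed = TRUE)[[1]]
document <- data.frame(pdf_title = "example",
                       text = first_page_lines,
                       stringsAsFactors = FALSE)
nrow(document)
# [1] 3
```

Since only the first page is needed here, subsetting with `[1]` before splitting avoids the problem entirely.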