I'm trying to scrape a couple of PDFs in R. PDF1 has 9 pages and PDF2 has 12 pages. When I run the code below, it scrapes both PDFs, but only up to page 6 and nothing after that. Is there a reason for this? Is something missing in my code?
library(tm)
read <- readPDF(engine = "xpdf", control = list(text = "-layout"))
document <- Corpus(URISource("C:\\Users\\Goku\\Documents\\Python Scripts\\PDF Scraping\\123.pdf"), readerControl = list(reader = read))
doc <- content(document[[1]])
head(doc)
You can find the pdf at: https://www.scribd.com/document/396797318/123
I can't replicate your issue. Using your document, I get 12 pages when reading the text both ways. Checking whether the two results are identical also returns TRUE.
tm with reader pdftools:
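A minimal sketch of this approach, assuming the PDF has been saved locally as `123.pdf` in the working directory (a hypothetical path) and that the `tm` and `pdftools` packages are installed. The helper name `read_with_tm` is only for illustration. With the `pdftools` engine the content is expected to hold one element per page, so its length can be read as a page count.

```r
library(tm)

# Read a PDF through tm with the pdftools engine and return its text content.
# `path` is a hypothetical local copy of the PDF.
read_with_tm <- function(path) {
  reader <- readPDF(engine = "pdftools")
  # mode = "" tells URISource not to read the file itself; the reader does that
  corpus <- Corpus(URISource(path, mode = ""),
                   readerControl = list(reader = reader))
  content(corpus[[1]])
}

pdf_path <- "123.pdf"  # assumption: local copy in the working directory
if (file.exists(pdf_path)) {
  doc <- read_with_tm(pdf_path)
  print(length(doc))   # number of elements, expected to be one per page
}
```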
Using pdftools directly:
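A minimal sketch, again assuming a local copy saved as `123.pdf` (a hypothetical path). `pdf_text()` returns a character vector with one string per page, so `length()` gives the number of pages read.

```r
library(pdftools)

pdf_path <- "123.pdf"  # assumption: local copy in the working directory
if (file.exists(pdf_path)) {
  pdf_doc <- pdf_text(pdf_path)  # one character string per page
  print(length(pdf_doc))         # number of pages read
}
```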
Check if they are identical:
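A minimal, self-contained sketch of the comparison, assuming the same hypothetical local file `123.pdf`. The helper name `read_both` is only for illustration. It reads the file once through `tm` and once through `pdftools`, then compares the two character vectors with `identical()`.

```r
library(tm)
library(pdftools)

# Read the same PDF via tm (pdftools engine) and via pdftools directly.
# `path` is a hypothetical local copy of the PDF.
read_both <- function(path) {
  reader <- readPDF(engine = "pdftools")
  corpus <- Corpus(URISource(path, mode = ""),
                   readerControl = list(reader = reader))
  list(tm = content(corpus[[1]]), pdftools = pdf_text(path))
}

pdf_path <- "123.pdf"  # assumption: local copy in the working directory
if (file.exists(pdf_path)) {
  both <- read_both(pdf_path)
  print(identical(both$tm, both$pdftools))
}
```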