Get only the file names of many PDFs and build a data frame

I need to obtain the names of a large set of PDF files (36000 files), but only the names, without loading the contents of the files. Finally, I want to build a data frame like this:

[image of the desired data frame]

The link of 21 example files: https://drive.google.com/drive/folders/1zUKyVJFICq4Q69zs48wqFNq1UPDvCgbf?usp=sharing

I am using this code:

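# set the working directory to the folder that holds the PDFs first
library(pdftools)
library(tm)

files = list.files(pattern = "pdf$")
files

# this reads the full text of every PDF, which is slow for 36000 files
all = lapply(files, pdf_text)
lapply(all, length)
x = Corpus(URISource(files), readerControl = list(reader = readPDF))
x

class(x) # a tm corpus, not a character vector

DAT_FINAL <- data.frame(text = sapply(x, as.character), stringsAsFactors = T)
DAT_FINAL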

The idea is to have a data frame because I need to compare the numeric names with an Excel file to find the numbers that are missing among the documents.

Answer by PaulS (best answer)

A possible solution (instead of /tmp/PDFS/, use the path to the directory where your PDFs are stored):

library(tidyverse)

data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
  mutate(number = str_extract(pdfs, "^\\d+"), .before = pdfs)

#>    number   pdfs
#> 1       1  1.pdf
#> 2      10 10.pdf
#> 3      12 12.pdf
#> 4      13 13.pdf
#> 5      14 14.pdf
#> 6      15 15.pdf
#> 7      16 16.pdf
#> 8      17 17.pdf
#> 9      18 18.pdf
#> 10     19 19.pdf
#> 11      2  2.pdf
#> 12     20 20.pdf
#> 13     21 21.pdf
#> 14     22 22.pdf
#> 15     23 23.pdf
#> 16      3  3.pdf
#> 17      4  4.pdf
#> 18      5  5.pdf
#> 19      6  6.pdf
#> 20      8  8.pdf
#> 21      9  9.pdf

Or using tidyr::extract:

data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
  extract(pdfs, into = "number", "(\\d+)\\.pdf", remove = FALSE, convert = TRUE) %>% 
  select(number, pdfs)
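
If loading the tidyverse is not wanted, roughly the same table can be built in base R. The following is only a sketch: `pdf_name_table` is a hypothetical helper name, and it assumes the numeric part sits at the start of each file name. Since `list.files()` only reads directory entries, no PDF content is opened, which should keep it fast even for 36000 files. It also sorts by the numeric value, so 2.pdf comes before 10.pdf.

```r
# Hypothetical helper: builds the table from file names only.
pdf_name_table <- function(pdf_dir) {
  # list.files() reads directory entries; no PDF content is opened.
  pdfs <- list.files(pdf_dir, pattern = "\\.pdf$", ignore.case = TRUE)

  # Leading digits as an integer; NA when a name does not start with a digit.
  has_num <- grepl("^[0-9]+", pdfs)
  number <- rep(NA_integer_, length(pdfs))
  number[has_num] <- as.integer(sub("^([0-9]+).*$", "\\1", pdfs[has_num]))

  out <- data.frame(number = number, pdfs = pdfs, stringsAsFactors = FALSE)

  # Sort by numeric value rather than alphabetically.
  out[order(out$number, out$pdfs), , drop = FALSE]
}
```

It would be called as pdf_name_table("/tmp/PDFS/"), with the path replaced by your own directory.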

EDIT

To answer a follow-up question from the OP (file names that also contain letters, such as 10A.pdf):

library(tidyverse)

data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
  mutate(number = str_extract(pdfs, ".*(?=\\.pdf)"), .before = pdfs)

#>    number      pdfs
#> 1       1     1.pdf
#> 2      10    10.pdf
#> 3     10A   10A.pdf
#> 4      12    12.pdf
#> 5      13    13.pdf
#> 6      14    14.pdf
#> 7      15    15.pdf
#> 8      16    16.pdf
#> 9      17    17.pdf
#> 10    17A   17A.pdf
#> 11     18    18.pdf
#> 12     19    19.pdf
#> 13      2     2.pdf
#> 14     20    20.pdf
#> 15     21    21.pdf
#> 16  21ABV 21ABV.pdf
#> 17     22    22.pdf
#> 18     23    23.pdf
#> 19      3     3.pdf
#> 20      4     4.pdf
#> 21      5     5.pdf
#> 22      6     6.pdf
#> 23      8     8.pdf
#> 24      9     9.pdf
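
For the original goal of finding the missing numbers, a set difference between the expected numbers and the numbers found in the file names may be enough. This is a minimal sketch: `missing_numbers` is a hypothetical helper, and the expected range 1:23 is an assumption made for illustration. In practice the expected numbers would come from the Excel file, for example via readxl::read_excel(), with a file and column name of your own.

```r
# Hypothetical helper: which expected numbers have no matching PDF?
missing_numbers <- function(expected, found) {
  sort(setdiff(expected, found))
}

# Illustration with the 21 numeric file names from the first output above;
# the expected range 1:23 is assumed, not taken from the Excel file.
found <- c(1:6, 8:10, 12:23)
missing_numbers(1:23, found)
#> [1]  7 11
```

With the real data, `found` would be the number column of the data frame, converted to integer when it was extracted as text.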