Combining PDFS of slightly different formats using Power Query

42 Views Asked by At

I have these PDFs files and I want to extract the values in red and arrange them in a tabular form as shown below. I have attached the 3 sample PDFs in the link below. My problem is that sometimes the format of the PDFs add spaces between the tables and the DATE needs to be extracted and put in a column. The DATE is currently not in a table.

Is there a way I can combine the 3 sample PDFs in a single file using PowerQuery? Thanks for the anticipated help.

https://gofile.io/d/iNtyNc

enter image description here

1

There are 1 best solutions below

0
davidebacci On BEST ANSWER

enter image description here

let
    Source = Folder.Files("C:\Users\Dav\Downloads\test"),
    #"Filtered Rows" = Table.SelectRows(Source, each [Extension] = ".pdf"),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Custom", each Pdf.Tables([Content], [Implementation="1.3"])),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each [Custom]{[Id="Page001"]}[Data]),
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom1",{"Custom.1"}),
    #"Expanded Custom.1" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom.1", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8"}),
    #"Added Custom2" = Table.AddColumn(#"Expanded Custom.1", "Custom", each let a = Record.ToList(_),
b = if [Column1] = null then List.Transform(a, each try DateTime.FromText(_) otherwise null) else {},
c = List.First( List.RemoveNulls(b))
in c),
    #"Filled Down" = Table.FillDown(#"Added Custom2",{"Custom"}),
    #"Filtered Rows1" = Table.SelectRows(#"Filled Down", each [Column4] <> null),
    #"Filtered Rows2" = Table.SelectRows(#"Filtered Rows1", each [Column1] <> "Data"),
    #"Filtered Rows3" = Table.SelectRows(#"Filtered Rows2", each [Column1] <> "Data2"),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows3",{"Column7", "Column8"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Column1", "Data"}, {"Column2", "Test"}, {"Column3", "Test_YTD"}, {"Column4", "Test2"}, {"Column5", "Pai"}, {"Column6", "Temp"}, {"Custom", "Date"}})
in
    #"Renamed Columns"