Parsing data from pdftotext .txt file output using VBA

138 Views Asked by At

I am trying to implement a parsing function that will grab data from parts of a .txt file created using pdftotext. I hate PDFs! Essentially, I use pdftotext on a PDF file using the -raw option and I get a file like this:

SPORTS FANZ Order #62659
June 24, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Virginia Tech Hokies Basketball Cassell Coliseum Panoramic
Picture
Virginia Tech Hokies Panoramic Picture Select
VAT5M
1 of 1
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States

Example with phone number and quantity of 2:

SPORTS FANZ Order #12345
June 24, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
(123) 123-4567
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Virginia Tech Hokies Basketball Cassell Coliseum Panoramic
Picture
Virginia Tech Hokies Panoramic Picture Select
VAT5M
2 of 2
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States

Example with phone number (different format) and two SKUs:

SPORTS FANZ Order #58083
January 6, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
+12345678900
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Nebraska Cornhuskers Women's Volleyball Devaney Center Panoramic Picture
Nebraska Cornhuskers Panoramic Picture Select Frame
UNE11M
1 of 1
Kansas City Chiefs Super Bowl 54 Champions Panoramic Picture
Kansas City Chiefs SB 54 Champions Panoramic Picture Unframed
NFLSBC20CHF
1 of 1
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States

I've implemented the following code already to grab some of the data:

Function for grabbing text between two strings

Public Function SuperMid(ByVal strMain As String, str1 As String, str2 As String, Optional reverse As Boolean) As String
'DESCRIPTION: Extract the portion of a string between the two substrings defined in str1 and str2.
'DEVELOPER: Ryan Wells (wellsr.com)
'HOW TO USE: - Pass the argument your main string and the 2 strings you want to find in the main string.
' - This function will extract the values between the end of your first string and the beginning
' of your next string.
' - If the optional boolean "reverse" is true, an InStrRev search will occur to find the last
' instance of the substrings in your main string.
Dim i As Integer, j As Integer, temp As Variant
On Error GoTo errhandler:
If reverse = True Then
    i = InStrRev(strMain, str1)
    j = InStrRev(strMain, str2)
    If Abs(j - i) < Len(str1) Then j = InStrRev(strMain, str2, i)
    If i = j Then 'try to search 2nd half of string for unique match
        j = InStrRev(strMain, str2, i - 1)
    End If
Else
    i = InStr(1, strMain, str1)
    j = InStr(1, strMain, str2)
    If Abs(j - i) < Len(str1) Then j = InStr(i + Len(str1), strMain, str2)
    If i = j Then 'try to search 2nd half of string for unique match
        j = InStr(i + 1, strMain, str2)
    End If
End If
If i = 0 And j = 0 Then GoTo errhandler:
If j = 0 Then j = Len(strMain) + Len(str2) 'just to make it arbitrarily large
If i = 0 Then i = Len(strMain) + Len(str1) 'just to make it arbitrarily large
If i > j And j <> 0 Then 'swap order
    temp = j
    j = i
    i = temp
    temp = str2
    str2 = str1
    str1 = temp
End If
i = i + Len(str1)
SuperMid = Mid(strMain, i, j - i)
Exit Function
errhandler:
MsgBox "Error extracting strings. Check your input" & vbNewLine & vbNewLine & "Aborting", , "Strings not found"
End
End Function

Extraction Sub

Sub extractPDF()
    
    Dim phoneNumber, shippingInfo, shippingAddress, itemInfo, poNumber As String
    Dim iTxtFile As Integer
    Dim strFile As String
    Dim strFileText As String
    strFile = "C:\blah\blah\blah\#62875.txt"
    iTxtFile = FreeFile
    Open strFile For Input As FreeFile
    strFileText = Input(LOF(iTxtFile), iTxtFile)
    Close iTxtFile
    
    Dim regexPattern As String
    Dim regex As Object
    Dim matches As Object
    Dim match As Object
    
    ' Regular expression pattern
    regexPattern = "Order #\d{5}"
    
    ' Create a regular expression object
    Set regex = CreateObject("VBScript.RegExp")
    
    ' Set the pattern and ignore case
    With regex
        .Pattern = regexPattern
        .IgnoreCase = True
    End With
    
    ' Perform the search
    Set matches = regex.Execute(strFileText)
    
    ' Loop through the matches
    For Each match In matches
        ' Print the matched value
        poNumber = Right(match, 5)
    Next match
    
    shippingInfo = SuperMid(strFileText, "SHIP TO", "BILL TO")
    shippingAddress = SuperMid(shippingInfo, "", "United States")
    phoneNumber = Application.WorksheetFunction.Clean(SuperMid(shippingInfo, "United States", "BILL TO"))
    itemInfo = SuperMid(strFileText, "ITEMS QUANTITY", "Thank you for shopping with us!")
    Debug.Print "PO #: " & poNumber
    Debug.Print "Phone Number: " & phoneNumber
    Debug.Print shippingAddress
    Debug.Print itemInfo

End Sub

This gets me the shipping info, which I further break down into shipping address and phone number (if applicable), PO #, and the block of text containing the item information. What I'm struggling with is how to extract SKU and quantity data from the itemInfo block. Based on previous PDFs, the SKU line is always followed by the quantity line. So, in this example, SKU is VAR5M and quantity is 1 (if it was 2 it would say 2 of 2). Any ideas on the best way to implement what I need? Is there a better way to implement my needs than what I've already designed? Thanks for your help.

2

There are 2 best solutions below

3
taller On BEST ANSWER

If you have a text string stored in cell A1 and you would get SKU and Quantity with following code.

Sub Demo()
    Dim objRegExp As Object
    Dim objMatches As Object
    Set objRegExp = CreateObject("vbscript.regexp")
    With objRegExp
        .IgnoreCase = True
        .Global = True
        .Pattern = "([A-Z0-9]+)\s*(\d+) of \d+"
        If .Test([a1]) Then
            Set objMatches = objRegExp.Execute([a1])
            For Each objMtch In objMatches
                With objMtch.submatches
                    If .Count = 2 Then
                        SKU = .Item(0)
                        QTY = .Item(1)
                        Debug.Print "SKU:" & SKU & vbNewLine _
                            & "Quantity:" & QTY
                    End If
                End With
            Next
        End If
    End With
    Set objMatches = Nothing
    Set objRegExp = Nothing
End Sub
7
FaneDuru On

Please, try the next function. It uses arrays and should be fast enough:

Function ExtractDat(arrTxt) As Variant
    Dim arrFin, mtch, arrH, arr, i As Long, k As Long
    '                                                           0                       1                        2                3             4       5       6          7             8                9                      10        11      12     13        14         15              16             17             18
    Const header As String = "Order number, (Ship To) Name, Address1, Address2, City, State, Zip, Country, Phone, (Bill To) Name, Address, City, State, Zip, Country, SKU1, Value SKU1, SKU2, Value SKU2"
    arrH = Split(header, ",")
    
    ReDim arrFin(UBound(arrH))
    
    With CreateObject("Vbscript.RegExp")
        .Pattern = "\d{5}"
        .Global = False
        arrFin(0) = .Execute(arrTxt(0))(0) 'order number
    End With
    
    arrFin(1) = arrTxt(3)             'Send To Name
    arrFin(2) = arrTxt(4)             'Send To Address
    arrFin(3) = ""                    'No second Address (assumption...)
    
    mtch = Application.match("BILL TO", arrTxt, 0)
    If IsError(mtch) Then MsgBox """BILL TO" & " could not be found in the analyzed data...", vbInformation, "BILL TO missing": Exit Function
    If mtch = 8 Then 'no Phone number existing, no second Address, too...
        arr = Split(arrTxt(5), ", ") 'split City from State and Zip
        arrFin(4) = arr(0)            'Send To City
        arrFin(5) = Split(arr(1))(0)  'Send To State
        arrFin(6) = Split(arr(1))(1)  'Send To Zip
        arrFin(7) = arrTxt(6)         'Country
        arrFin(8) = ""                'No Phone number
    ElseIf mtch = 9 Then
        If InStr(arrTxt(5), ",") = 0 Then 'no comma in string (second address...)
            arrFin(3) = arrTxt(5)         'second Address
            arr = Split(arrTxt(6), ", ")  'split City from State and Zip
            arrFin(4) = arr(0)            'Send To City
            arrFin(5) = Split(arr(1))(0)  'Send To State
            arrFin(6) = Split(arr(1))(1)  'Send To Zip
            arrFin(7) = arrTxt(6)         'Country
            arrFin(8) = ""                'No Phone number
        Else    'No second address
            arr = Split(arrTxt(5), ", ")   'split City from State and Zip
            arrFin(4) = arr(0)            'Send To City
            arrFin(5) = Split(arr(1))(0)  'Send To State
            arrFin(6) = Split(arr(1))(1)  'Send To Zip
            arrFin(7) = arrTxt(6)         'Country
            arrFin(8) = arrTxt(7)         'Phone number
        End If
    ElseIf mtch = 10 Then 'second Address and Phone number exist
        arrFin(3) = arrTxt(5)         'second Address
        arr = Split(arrTxt(6), ", ")  'split City from State and Zip
        arrFin(4) = arr(0)            'Send To City
        arrFin(5) = Split(arr(0))(0)  'Send To State
        arrFin(6) = Split(arr(0))(1)  'Send To Zip
        arrFin(7) = arrTxt(7)         'Country
        arrFin(8) = arrTxt(8)         'Phone number
    End If
    arrFin(9) = arrTxt(mtch)        'Bill To Name
    arrFin(10) = arrTxt(mtch + 1)   'Bill To Address
    arr = Split(arrTxt(mtch + 2), ", ")
    arrFin(11) = arr(0)             'Bill To City
    arrFin(12) = Split(arr(1))(0)   'Bill To State
    arrFin(13) = Split(arr(1))(1)   'Bill To Zip
    arrFin(14) = arrTxt(mtch + 3)    'Bill To Country
    'extract SCUs and their values:
    For i = 0 To UBound(arrTxt)
        If arrTxt(i) Like "#* of #*" Then
            arrFin(15 + k) = arrTxt(i - 1)
            arrFin(16 + k) = Split(arrTxt(i))(0)
            k = k + 2
        End If
    Next i
    ExtractDat = Array(arrH, arrFin)
End Function
```
It can be used to return in the active sheet with such a code. It processes a text file, placing its content in an array and returns on the first two rows of the active sheet:
```
Sub UseExtractDat()
   Dim strFile As String, arrT, retArr
   
    strFile = "C:\blah\blah\blah\#62875.txt"
    'Place the content of the text file in an array (splitting by end of line)
    arrT = Split(CreateObject("Scripting.FileSystemObject").OpenTextFile(strFile, 1).ReadAll, vbCrLf)

    retArr = ExtractDat(arrT)
    
    Range("A1").Resize(1, UBound(retArr(0)) + 1).Value2 = retArr(0)
    Range("A2").Resize(1, UBound(retArr(1)) + 1).Value2 = retArr(1)
End Sub
```
Please, send some feedback after testing it.