I am working with IRS (U.S. Internal Revenue Service) forms, which are PDFs built as either XFA or AcroForm forms. I want to extract not only the field names but also the label shown for each field, i.e. the on-page text that tells the user what to enter there.
I understand that libraries such as PyPDF2 and Aspose-PDF can be used to extract form field details in Python. However, these libraries seem to only provide the field names (like "f1_01"), and I haven't found a way to extract the corresponding field labels (the text displayed to the user on the form, such as "First Name") using these libraries.
For instance, in a form with a field labeled "First Name" that corresponds to "f1_01", I want to map "First Name" to "f1_01".
Could anyone suggest a method, or a different Python library, that can extract this information from an IRS form? I would greatly appreciate any pointers in the right direction. Note that I cannot use iText due to license constraints.
Link to IRS form: https://www.irs.gov/pub/irs-pdf/f1065sk3.pdf
Thank you!
Here is the PyPDF2 code that I tried:
Code1:
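import PyPDF2 as pypdf

def findInDict(needle, haystack):
    # Depth-first search through nested dictionaries for the first matching key
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None

with open("input.pdf", 'rb') as pdfobject:
    pdf = pypdf.PdfReader(pdfobject)
    xfa = findInDict('/XFA', pdf.resolved_objects)
    # /XFA is an array of alternating packet names and streams; index 7 is the fourth packet's stream
    xml = xfa[7].get_object().get_data()
with open('output.xml', 'w', encoding='utf-8') as f:
    f.write(xml.decode('utf-8'))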
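A possible next step from this dump, as an unverified sketch: if the extracted packet is the XFA template, each field element may carry its label either as a caption or as assist/speak text. Both layouts are assumptions about the IRS file, and the helper names (local, find_child, field_labels) and the sample XML are made up for illustration, using only the standard library:

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the code works across XFA template versions."""
    return tag.rsplit('}', 1)[-1]


def find_child(elem, name):
    """Return the first direct child with the given local tag name, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child
    return None


def field_labels(xml_text):
    """Map each <field name=...> to its caption or speak text, when present."""
    labels = {}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local(elem.tag) != 'field' or 'name' not in elem.attrib:
            continue
        label = None
        # Assumed layout 1: <caption><value><text>Label</text></value></caption>
        caption = find_child(elem, 'caption')
        if caption is not None:
            label = ''.join(caption.itertext()).strip() or None
        # Assumed layout 2: <assist><speak>Label</speak></assist>
        if label is None:
            assist = find_child(elem, 'assist')
            speak = find_child(assist, 'speak') if assist is not None else None
            if speak is not None and speak.text:
                label = speak.text.strip()
        labels[elem.attrib['name']] = label
    return labels


# Hypothetical sample, not taken from a real IRS form
sample = """<template xmlns="http://www.xfa.org/schema/xfa-template/3.0/">
  <subform name="form1">
    <field name="f1_01"><caption><value><text>First Name</text></value></caption></field>
    <field name="f1_02"><assist><speak>Last Name</speak></assist></field>
  </subform>
</template>"""
print(field_labels(sample))
```

Fields with neither element map to None. In that case the visible label may live in a separate draw element near the field, which this sketch does not try to match.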
Code2:
from PyPDF2 import PdfReader

def scan_fields(path):
    pdf = PdfReader(path)
    # get_fields() returns None when the PDF has no AcroForm fields
    fields = pdf.get_fields() or {}
    for key in fields:
        print(key)

scan_fields('input.pdf')
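For the AcroForm side, one candidate for label text is the /TU entry of each field dictionary, which the PDF specification defines as the alternate field name shown in the user interface (typically the tooltip). The sketch below assumes that each value returned by get_fields() behaves like a dictionary that may contain '/TU'. Whether the IRS forms populate that entry is an assumption, and labels_from_fields, scan_labels, and the fake data are illustrative names only:

```python
def labels_from_fields(fields):
    """Map each field name to its /TU (alternate name / tooltip) entry, if any."""
    labels = {}
    for name, field in (fields or {}).items():
        tooltip = field.get('/TU')
        labels[name] = str(tooltip) if tooltip is not None else None
    return labels


def scan_labels(path):
    """Read a PDF with PyPDF2 and return its name-to-tooltip mapping."""
    from PyPDF2 import PdfReader  # imported lazily so the helper above has no dependency
    return labels_from_fields(PdfReader(path).get_fields())


# Hypothetical stand-in for get_fields() output, not real IRS data
fake_fields = {
    'f1_01': {'/T': 'f1_01', '/FT': '/Tx', '/TU': 'First Name'},
    'f1_02': {'/T': 'f1_02', '/FT': '/Tx'},
}
print(labels_from_fields(fake_fields))
```

Calling scan_labels('input.pdf') would apply the same mapping to a real file. The keys are whatever names get_fields() reports, and a None value means the field has no /TU entry.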
Here is the Aspose-PDF code that I tried:
import aspose.pdf as ap

license = ap.License()
license.set_license("Aspose.TotalProductFamily.lic")
pdfDocument = ap.Document("input.pdf")
# Get names and values from all form fields
for formField in pdfDocument.form.fields:
    # Print the partial name, full name, and value of each field
    print(f"Partial Field Name : {formField.partial_name}, Full Field Name : {formField.full_name}, Value : {str(formField.value)}")