Get text from Font Patterns (FNG) from AFP file


Can anyone help with obtaining text from the "Font Patterns (FNG)" field of an AFP file? Is there a library (preferably Java) that can be used for this task?

Thank you,


There are 2 best solutions below

0
Yan Hackl-Feldbusch On

You can try afplib. It has some sample code that dumps all structured fields (org.afplib.samples.DumpAFP). It produces output like this:

...
FNG  number:47,offset:49787,id:13889161,length:8201,rawData:null,charset:null,PatData:[B@4e3958e7,
FNG  number:48,offset:57988,id:13889161,length:8201,rawData:null,charset:null,PatData:[B@77f80c04,
FNG  number:49,offset:66189,id:13889161,length:8201,rawData:null,charset:null,PatData:[B@1dac5ef,
FNG  number:50,offset:74390,id:13889161,length:6991,rawData:null,charset:null,PatData:[B@5c90e579,
EFN  number:51,offset:81381,id:13871497,length:17,rawData:null,charset:null,RSName:C0EX0480,

In the dump, PatData shows up as [B@... because it is a Java byte array printed with its default toString(). You can read that array to get the raw font pattern like this:

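    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // afplib package names as I recall them; adjust if your version differs
    import org.afplib.afplib.FNG;
    import org.afplib.base.SF;
    import org.afplib.io.AfpInputStream;

    // inside a method such as main(String[] args):
    try (AfpInputStream in = new AfpInputStream(
        new BufferedInputStream(new FileInputStream(args[0])))) {

        SF sf;
        while ((sf = in.readStructuredField()) != null) {
            if (sf instanceof FNG) {
                // raw bytes of this FNG field's pattern data
                byte[] pattern = ((FNG) sf).getPatData();
                // process the pattern here
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }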
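
Note that the pattern bytes are not text. As far as I understand the FOCA font layout, FNG carries the raster bitmaps of the glyphs, and the data of one font can be spread over several consecutive FNG fields (as in the dump above). Below is a minimal sketch of what you could do with those bytes: join the chunks and print one glyph as ASCII art. The class FontPatternSketch and its methods are hypothetical helpers, not part of afplib. The bit layout (1 bit per pixel, most significant bit first, each row padded to a whole byte) is an assumption, and the glyph's offset, width and height would have to come from the font's other structured fields.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of afplib.
final class FontPatternSketch {

    // Joins the PatData chunks of consecutive FNG fields into one buffer.
    static byte[] concat(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    // Renders one glyph as ASCII art ('#' = set pixel, '.' = clear pixel).
    // ASSUMPTION: 1 bit per pixel, most significant bit first,
    // each row padded to a whole number of bytes.
    static String toAscii(byte[] data, int offset, int width, int height) {
        int bytesPerRow = (width + 7) / 8;
        StringBuilder sb = new StringBuilder();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int b = data[offset + y * bytesPerRow + x / 8] & 0xFF;
                boolean set = (b & (0x80 >> (x % 8))) != 0;
                sb.append(set ? '#' : '.');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two made-up one-byte chunks standing in for FNG PatData arrays.
        byte[] data = concat(Arrays.asList(
            new byte[] {(byte) 0b10100000},
            new byte[] {(byte) 0b01000000}));
        // Treat the buffer as a single 3x2 glyph starting at offset 0.
        System.out.print(toAscii(data, 0, 3, 2));
    }
}
```

To get actual text you would still need to map each bitmap back to a character, either through the font's index data or by running OCR on the rendered glyphs.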
1
WeightedWaffle On

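I'm doing this with Python and Pytesseract (OCR). Convert each page to a JPG first, then run OCR on the JPG and write the result to a text file. Note that the function below starts from a PDF, so the AFP has to be converted to PDF beforehand.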

    import io
    import os

    import pytesseract
    from pdf2image import convert_from_path


    def convert_pdf_to_txt(pdf_file_nm):
        # pdf_file_nm is expected to be a bare file name inside ./pdf/

        # If you need to point pytesseract at the tesseract binary
        # pytesseract.pytesseract.tesseract_cmd = r'C:\Users\xxx\AppData\Local\Tesseract-OCR\tesseract.exe'

        base_name = pdf_file_nm.replace('.pdf', '')
        pdf_path = './pdf/' + pdf_file_nm
        output_filename = base_name + '.txt'
        output_path = './text/' + output_filename
        sub_dir = 'images/' + base_name + '/'
        # The original referenced an undefined "devider"; a plain rule is assumed here
        divider = '=' * 120 + '\n'

        # Make sure the output directories exist
        os.makedirs(sub_dir, exist_ok=True)
        os.makedirs('./text/', exist_ok=True)

        # Render every PDF page to a PIL image
        pages = convert_from_path(pdf_path)

        with io.open(output_path, 'a+', encoding='utf8') as f:
            for pg_cntr, page in enumerate(pages, start=1):
                # Save the page as a JPG, then OCR that file
                filename = 'pg_' + str(pg_cntr) + '_' + base_name + '.jpg'
                page.save(sub_dir + filename)

                f.write("======================================================== PAGE " + str(pg_cntr) + " ========================================================\n")
                f.write(pytesseract.image_to_string(sub_dir + filename) + "\n")
                f.write(divider)

        print('1. Process to convert PDF to image completed successfully.\n')

        return output_filename