Impact of Using PDF Training Data and JPG Test Data on Document AI Model Performance

I'm currently working on a Document AI project (with a Custom Extractor) and have run into a scenario I'm unsure how to handle. My training dataset of shipping instruction documents consists entirely of PDFs, which are rich in both text and formatting details. For the testing phase, however, I'm considering using JPG images of shipping instruction documents. These JPGs are essentially snapshots or scans of similar documents, just in image format.

My concern revolves around the potential impact this difference in data format (PDFs for training vs. JPGs for testing) might have on the model's performance and accuracy.

I understand that the ideal scenario would involve matching training and testing data formats closely, but due to constraints and later deployment, I'm exploring how to best work within this limitation.

I tried switching to the same format for both sets, but the F1 score still doesn't reach 0.8. I only have 80 training samples and 20 test samples, and even after trying several data augmentation methods the results haven't improved.
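For context on the F1 figure, here is a minimal sketch of how an F1 score is computed from entity-level counts. The counts are hypothetical and are not results from this project, and the helper name `f1_score` is only illustrative. It is meant to show that with a small test set, a single extra error shifts the score visibly.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp: correctly extracted entities
    fp: extracted entities that are wrong
    fn: expected entities that were missed
    """
    if tp == 0:
        # No correct extractions: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a small test set (not real project data).
baseline = f1_score(tp=70, fp=20, fn=20)
# Same test set, but one previously correct entity is now missed.
one_more_miss = f1_score(tp=69, fp=20, fn=21)

print(f"baseline F1:      {baseline:.3f}")
print(f"one more miss F1: {one_more_miss:.3f}")
```

Under these assumed counts, both scores sit just below 0.8, and one additional miss lowers the score further, so a gap of a few hundredths on a 20-document test set may reflect only a handful of entities.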
