In the Tags panel image, the first six tags are paragraphs, followed by a header and then three more paragraph tags. The Content panel shows the same thing: the first six and last three paragraph tags, the header, and an annotation as well. In the Reading Order panel image, however, the first six paragraphs appear as a single reading-order item, and the last three paragraphs also appear as a single item. This is the main problem I am facing.
Likewise, if a page contains ten paragraphs and nothing else, they are all treated as a single reading-order item. In short, whenever a page has consecutive structure elements of the same type, they are read as a single structure element in the reading order. (Structure elements may be headers, lists, or paragraphs.)
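To make the symptom concrete, here is a small stand-alone Java sketch with no PDFBox involved. The class ReadingOrderRuns and its method collapseRuns are hypothetical names made up for illustration. It only models what I observe in the Reading Order panel (one entry per run of consecutive same-type tags) and is not a claim about how the viewer actually computes reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: it only models the observed symptom.
final class ReadingOrderRuns {

    // Collapses consecutive identical tag names into one entry per run,
    // e.g. [P, P, H, P] becomes [P x2, H x1, P x1].
    static List<String> collapseRuns(List<String> tags) {
        List<String> runs = new ArrayList<>();
        int i = 0;
        while (i < tags.size()) {
            String tag = tags.get(i);
            int j = i;
            // Advance j to the end of the run of equal tags starting at i.
            while (j < tags.size() && tags.get(j).equals(tag)) {
                j++;
            }
            runs.add(tag + " x" + (j - i));
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        // Tag sequence from the Tags panel: six paragraphs, a header, three paragraphs.
        List<String> tags = Arrays.asList("P", "P", "P", "P", "P", "P", "H", "P", "P", "P");
        System.out.println("Tags panel entries: " + tags.size());
        System.out.println("Reading order entries: " + collapseRuns(tags));
    }
}
```

For the page described above, this prints 10 tag entries but only three reading-order entries, [P x6, H x1, P x3], whereas I expect ten separate entries in both panels.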
Below is the code that draws text into the content stream of the current page.
The TextPositionsInfo list holds the text-position information for all the text of one structure element (i.e. a paragraph, header, or list), collected after extracting the text with the PDFTextStripper class.
public class TextPositionsInfo {
    public String unicode;
    public float fontSize;
    public String fontName;
    public float x;
    public float y;
    public float width;
    public float height;
    public Matrix textMatrix;
}
private PDPageContentStream currentContentStream;
private COSDictionary currentMarkedContentDictionary;
private int mcid = 1;

private PDStructureElement addTextCharByChar(List<TextPositionsInfo> textinfoList, String elementType, PDPage currentPage,
        PDStructureElement Parent) throws IOException {
    PDResources res = currentPage.getResources();
    PDStructureElement currParent = null;
    currentContentStream.beginText();
    if (elementType.toLowerCase().equals("h2")) {
        beginMarkedConent(COSName.H);
        for (TextPositionsInfo textInfo : textinfoList) {
            PDFont font = getFonts(res, textInfo.fontName);
            if (font != null) {
                currentContentStream.setFont(font, 1);
                Matrix _tm = textInfo.textMatrix;
                currentContentStream.setTextMatrix(_tm);
                currentContentStream.showText(textInfo.unicode);
            }
        }
        currentContentStream.endMarkedContent();
        currParent = addStructEleToStructEleTree(elementType,
                Parent, currentPage, COSName.H);
    } else if (elementType.toLowerCase().equals("p")) {
        beginMarkedConent(COSName.P);
        for (TextPositionsInfo textInfo : textinfoList) {
            PDFont font = getFonts(res, textInfo.fontName);
            if (font != null) {
                currentContentStream.setFont(font, 1);
                currentContentStream.setTextMatrix(textInfo.textMatrix);
                currentContentStream.showText(textInfo.unicode);
            }
        }
        currParent = addStructEleToStructEleTree(elementType,
                Parent, currentPage, COSName.P);
        currentContentStream.endMarkedContent();
    }
    currentContentStream.endText();
    return currParent;
}

private PDStructureElement addStructEleToStructEleTree(String elementtype,
        PDStructureElement Parent, PDPage currentPage, COSName name) {
    PDStructureElement StructEle = new PDStructureElement(elementtype, Parent);
    StructEle.setPage(currentPage);
    PDMarkedContent markedContent = new PDMarkedContent(name, currentMarkedContentDictionary);
    StructEle.appendKid(markedContent);
    Parent.appendKid(StructEle);
    return StructEle;
}

private COSDictionary beginMarkedConent(COSName name) throws IOException {
    currentMarkedContentDictionary = new COSDictionary();
    currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
    mcid++;
    currentContentStream.beginMarkedContent(name,
            PDPropertyList.create(currentMarkedContentDictionary));
    return currentMarkedContentDictionary;
}
So please help me understand where things are going wrong.