How to get font name set for TessBaseAPI read

Question

37 Views Asked by SarahLy At 02 October 2023 at 09:30

I am using the standard Tesseract API to recognize text:

cv::Mat image1 = cv::imread("/home/user/Desktop/src.png");
// imread returns BGR data, so convert with COLOR_BGR2GRAY, then binarize.
cv::cvtColor(image1, image1, cv::COLOR_BGR2GRAY);
cv::threshold(image1, image1, 125, 255, cv::THRESH_BINARY);
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "eng")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
// Arguments: pixel data, width, height, bytes per pixel, bytes per line.
api->SetImage((uchar*)image1.data, image1.size().width, image1.size().height, image1.channels(), image1.step1());
char *outText = api->GetUTF8Text();
cout << "outText:" << outText << endl;
delete[] outText;  // the caller owns the buffer returned by GetUTF8Text

I need to train Tesseract to recognize some symbols more accurately.

I am following the guide below:

jTessBox Editor: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

Step 1: Make box files for the images that we want to train on.
Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
Eg: tesseract train.my.exp0.tif train.my.exp0 batch.nochop makebox
{*Note: After making the box files we have to correct any wrongly identified characters in them.}

Step 2: Create the .tr file (combines the image file and the box file).
Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
Eg: tesseract train.my.exp0.tif train.my.exp0 box.train

Step 3: Extract the charset from the box files (the output of this command is the unicharset file).
Syntax: unicharset_extractor [langname].[fontname].[expN].box
Eg: unicharset_extractor train.my.exp0.box

Step 4: Create a font_properties file based on our needs.
Syntax: echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
Eg: echo "arial 0 0 1 0 0" > font_properties

Step 5: Train on the data.
Syntax: mftraining -F font_properties -U unicharset -O [langname].unicharset [langname].[fontname].[expN].tr
Eg: mftraining -F font_properties -U unicharset -O train.unicharset train.my.exp0.tr

Step 6: Run cntraining.
Syntax: cntraining [langname].[fontname].[expN].tr
Eg: cntraining train.my.exp0.tr
{*Note: After steps 5 and 6, four files have been created (shapetable, inttemp, pffmtable, normproto).}

Step 7: Rename the four files (shapetable, inttemp, pffmtable, normproto) to ([langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto).
Syntax: rename filename1 filename2
Eg:
rename shapetable train.shapetable
rename inttemp train.inttemp
rename pffmtable train.pffmtable
rename normproto train.normproto

Step 8: Create the .traineddata file.
Syntax: combine_tessdata [langname].
Eg: combine_tessdata train.
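Every command in the steps above relies on the file naming pattern [langname].[fontname].[expN].[file-extension]. The sketch below only illustrates how such a name breaks down into its parts, so that the [fontname] slot is easy to spot (`my` in `train.my.exp0.tif`). `TrainingFileName` and `parse_training_file_name` are hypothetical names made up for this illustration; they are not part of Tesseract.

```cpp
#include <cstddef>
#include <string>

// Parts of a training file name such as "train.my.exp0.tif".
struct TrainingFileName {
    std::string lang;        // [langname], e.g. "train"
    std::string font;        // [fontname], e.g. "my"
    std::string experiment;  // [expN], e.g. "exp0"
    std::string extension;   // [file-extension], e.g. "tif"
};

// Hypothetical helper: splits `name` on '.' and succeeds only when it has
// exactly four non-empty parts. On failure `out` is left unchanged.
bool parse_training_file_name(const std::string& name, TrainingFileName& out) {
    std::string parts[4];
    std::size_t start = 0;
    for (int i = 0; i < 4; ++i) {
        const std::size_t dot = name.find('.', start);
        const bool is_last = (i == 3);
        // The first three parts must end at a dot; the last must not contain one.
        if (is_last != (dot == std::string::npos)) return false;
        parts[i] = is_last ? name.substr(start) : name.substr(start, dot - start);
        if (parts[i].empty()) return false;
        if (!is_last) start = dot + 1;
    }
    out.lang = parts[0];
    out.font = parts[1];
    out.experiment = parts[2];
    out.extension = parts[3];
    return true;
}
```

Calling `parse_training_file_name("train.my.exp0.tif", f)` fills `f.font` with `my`, which is the label chosen when the training image was named.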

Move the .traineddata file to Tesseract's tessdata directory: C:\Program Files\Tesseract-OCR\tessdata

Run Tesseract with the trained font:

tesseract Test2.png stdout -l train

I'm confused by the font name in step 4, as I don't know the font name.

Step 4: Create a font_properties file based on our needs.
Syntax: echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
Eg: echo "arial 0 0 1 0 0" > font_properties

How do I get the font name set for a TessBaseAPI read, with C++ and the command line?
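For reference, the font_properties entry quoted from step 4 is a font name followed by five 0/1 flags. A minimal sketch of building such a line in C++ is shown here. `FontFlags` and `font_properties_line` are hypothetical names, not Tesseract functions. The idea that the name is the same label used in the [fontname] slot of the training file names is an assumption based on the guide's syntax, not something the guide states outright.

```cpp
#include <string>

// The five 0/1 flags of a font_properties entry, in the order used in step 4.
struct FontFlags {
    bool italic;
    bool bold;
    bool monospace;
    bool serif;
    bool fraktur;
};

// Hypothetical helper: returns "<fontname> <italic> <bold> <monospace> <serif> <fraktur>".
// font_name is assumed to contain no spaces.
std::string font_properties_line(const std::string& font_name, const FontFlags& flags) {
    const bool values[] = {flags.italic, flags.bold, flags.monospace,
                           flags.serif, flags.fraktur};
    std::string line = font_name;
    for (bool value : values) {
        line += ' ';
        line += value ? '1' : '0';
    }
    return line;
}
```

With the guide's example values, `font_properties_line("arial", FontFlags{false, false, true, false, false})` produces the same text that the echo command writes to font_properties.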

Original Q&A

There is 1 solution below

**SarahLy** · Answer 1 · 2023-10-03T11:20:00.863000

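// api is the initialized TessBaseAPI declared in the question; recognition
// (GetUTF8Text() or Recognize()) must have run before GetIterator().
tesseract::ResultIterator* res_it = api->GetIterator();

const char* word = res_it->GetUTF8Text(tesseract::RIL_WORD);
const char *font_name;

bool bold, italic, underlined, monospace, serif, smallcaps;
int pointsize, font_id;
font_name = res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace, &serif, &smallcaps, &pointsize, &font_id);

// font_name is null when the word has no font information, so never pass it to %s unchecked.
printf("%s \t=> fontname: %s, size: %d, font_id: %d, bold: %d,"
       " italic: %d, underlined: %d, monospace: %d, serif: %d,"
       " smallcap: %d\n", word ? word : "", font_name ? font_name : "(none)", pointsize, font_id,
       bold, italic, underlined, monospace, serif, smallcaps);
delete[] word;  // the caller owns the string returned by GetUTF8Text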

This does the trick, but a Tesseract representative mentioned somewhere that it might not be that reliable.
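Since the per-word font name may be missing or wrong, one possible mitigation is to collect the names reported for all words and keep the most frequent one. This is only a sketch of that idea, not something the Tesseract API provides. `most_common_font` is a hypothetical helper, and it assumes the names have already been gathered into a vector, with an empty string wherever no font name was returned.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns the font name that occurs most often in `names`.
// Empty strings (words without font information) are ignored. Ties go to the
// lexicographically smallest name. Returns "" if no name is available.
std::string most_common_font(const std::vector<std::string>& names) {
    std::map<std::string, int> counts;
    for (const std::string& name : names) {
        if (!name.empty()) ++counts[name];
    }
    std::string best;
    int best_count = 0;
    for (const auto& entry : counts) {
        // std::map iterates in sorted key order, so '>' keeps the first of any tie.
        if (entry.second > best_count) {
            best = entry.first;
            best_count = entry.second;
        }
    }
    return best;
}
```

A majority vote like this only helps when the page is set in a single dominant font; for mixed-font pages the per-word values would have to be kept as they are.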
