I want to fine-tune Tesseract (version 4.1.3) on a text containing the symbol §. I have the TIFF files for my project and made the box files using tesseract file.tiff file lstm.train. Every § in the text was recognized incorrectly in the generated box file, so I corrected the box file by hand. Now, when running the fine-tuning, I get this error message:
Encoding of string failed! Failure bytes: ffffffc2 ffffffa7 20 31 35 20 41 73 79 6c 67 65 73 65 74 7a 20 7a 75 72 20 4d 69 74 77 69 72 6b 75 6e 67 20 76 65 72 70 66 6c 69 63 68 74 65 74 2e 20 53 6f 20 6d ffffffc3 ffffffbc 73 73 65 6e 20 53 69 65 20 62 65 69 73 70 69 65 6c 73 77 65 69 73 65 20 6d ffffffc3 ffffffbc 6e 64 2d
Can't encode transcription: 'bla § blub'
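To check which character the trainer is choking on, the start of the failure bytes can be decoded by hand. This is only a minimal sketch: it assumes the leading "ffffff" on some bytes is just sign extension of values above 0x7f, and the variable names are my own, not anything from Tesseract.

```shell
#!/bin/sh
# First 16 bytes copied from the "Failure bytes" line of the error message.
failure_bytes="ffffffc2 ffffffa7 20 31 35 20 41 73 79 6c 67 65 73 65 74 7a"

decoded=""
for b in $failure_bytes; do
  # Assumption: "ffffff" is sign-extension noise, so strip it to get the raw byte.
  hex=${b#ffffff}
  # Convert hex to a three-digit octal value, since POSIX printf
  # guarantees \ooo escapes in the format string but not \x.
  oct=$(printf '%03o' "0x$hex")
  # Emit the byte and append it to the result.
  decoded="$decoded$(printf "\\$oct")"
done

printf '%s\n' "$decoded"
```

In a UTF-8 terminal this prints "§ 15 Asylgesetz", so the string that fails to encode begins exactly with the § I put into the box file (c2 a7 is the UTF-8 encoding of §).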
I ran the fine-tuning with a shell script containing:
/usr/local/bin/lstmtraining \
--model_output output/fine_tuned \
--continue_from lstm_model/deu.lstm \
--traineddata tesseract/tessdata/best/deu.traineddata \
--train_listfile train/deu.training_files.txt \
--eval_listfile eval/deu.training_files.txt \
--max_iterations 400
I did not expect an encoding error, because there was no issue when creating the lstmf files. I assume there is a way to give Tesseract a list of additional characters so that the above error no longer occurs and § is recognized correctly. Does anyone know how to give Tesseract such a list, or another method to make it recognize §?