How can I prevent MeCab from outputting strange characters before and after EOS when a file exceeds the input buffer size, without increasing the buffer size?
When I run MeCab on files that exceed the input buffer size, it automatically splits the output. This is usually fine, except that unrecognizable characters like the ones below appear before and after EOS.
� � � � 補助記号-一般
��\uFFFD (character code)
Is there a setting that prevents MeCab from outputting these strange characters? I need the file splitting to ensure morphemes are grouped properly. Going through the entire output and removing them by hand isn't a good option, especially when the MeCab output runs to tens of thousands of lines (because there are many files).
I installed MeCab via Homebrew, with UniDic as the dictionary.
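One possible workaround, sketched here under assumptions rather than confirmed MeCab behavior: the � characters look like what you get when a multi-byte UTF-8 character is cut in half at the buffer boundary. If that is the cause, pre-splitting long lines at character boundaries before they reach MeCab should avoid it without touching the buffer size. The function name `split_long_line`, the 8000-byte budget (chosen to stay under what I believe is the 8192-byte default buffer) and the set of sentence-ending characters are all illustrative choices, not anything defined by MeCab.

```python
def split_long_line(line, max_bytes=8000, encoding="utf-8", breakers="。！？"):
    """Split one line into chunks of at most max_bytes encoded bytes.

    Cuts fall only between characters, so no multi-byte character is
    ever divided. When possible, a cut is placed right after the most
    recent sentence-ending character (hypothetical default: 。！？).
    """
    chunks = []
    current = []      # characters of the chunk being built
    size = 0          # encoded size of `current` in bytes
    last_break = -1   # index in `current` just after the last breaker

    for ch in line:
        n = len(ch.encode(encoding))
        # Flush until the next character fits into the budget.
        while size + n > max_bytes and current:
            cut = last_break if last_break > 0 else len(current)
            chunks.append("".join(current[:cut]))
            current = current[cut:]
            size = sum(len(c.encode(encoding)) for c in current)
            last_break = -1  # the remainder contains no breaker
        current.append(ch)
        size += n
        if ch in breakers:
            last_break = len(current)

    if current:
        chunks.append("".join(current))
    return chunks
```

Each returned chunk can then be written on its own line of a temporary file and passed to MeCab as usual. Note that every chunk becomes a separate sentence with its own EOS in the output, so this only helps if splitting at sentence ends is acceptable for the morpheme grouping.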