MeCab outputs strange characters when automatically splitting large files

72 Views Asked by wanna_coder101 At 23 January 2022 at 04:36

How can I prevent MeCab from outputting strange characters before and after EOS when the input exceeds the input buffer size, without increasing the input buffer size?

When I run MeCab on files that exceed the input buffer size, it automatically splits the output. This is usually fine, except that the unrecognizable characters below appear before and after EOS.

�   �   �   �   補助記号-一般
� is U+FFFD (\uFFFD), the Unicode replacement character

Are there any settings that prevent MeCab from outputting these strange characters? I need the file splitting to ensure morphemes are grouped properly. Going through the entire output and removing them manually isn't a good option, especially since I have tens of thousands of lines of MeCab output (from many files).

I installed MeCab via Homebrew, with UniDic as the dictionary.
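
One possible workaround, offered as a sketch rather than a confirmed fix: the symptom is consistent with the split landing in the middle of a multi-byte UTF-8 character, which would leave invalid bytes on both sides of the EOS. If that is the cause, pre-splitting long lines at character boundaries before they reach MeCab should avoid it. The helper name `split_utf8_safe`, the 8000-byte limit, and the choice of sentence-ending punctuation are all assumptions for illustration; set the limit a little below the buffer size your MeCab build actually uses.

```python
def split_utf8_safe(line: str, max_bytes: int = 8000, breaks: str = "。！？") -> list[str]:
    """Split `line` into chunks whose UTF-8 encoding is at most `max_bytes` bytes.

    Chunks never end inside a character. When possible, a chunk ends right
    after one of the characters in `breaks` (assumed sentence terminators).
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (longest UTF-8 character)")

    chunks: list[str] = []
    current: list[str] = []   # characters of the chunk being built
    size = 0                  # UTF-8 byte length of `current`
    last_break = -1           # index in `current` just after the last break character

    for ch in line:
        n = len(ch.encode("utf-8"))
        if current and size + n > max_bytes:
            # Prefer cutting after the last sentence terminator seen so far.
            cut = last_break if last_break > 0 else len(current)
            chunks.append("".join(current[:cut]))
            current = current[cut:]
            size = sum(len(c.encode("utf-8")) for c in current)
            last_break = -1
            # The leftover tail may still be too long once `ch` is added.
            if current and size + n > max_bytes:
                chunks.append("".join(current))
                current, size = [], 0
        current.append(ch)
        size += n
        if ch in breaks:
            last_break = len(current)

    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    import sys

    # Read text on stdin, write re-split lines on stdout for MeCab to consume.
    for raw in sys.stdin:
        for chunk in split_utf8_safe(raw.rstrip("\n")):
            print(chunk)
```

Each chunk is written on its own line, so MeCab will emit an EOS after every chunk. Because the cut points are chosen after sentence-ending punctuation where possible, morphemes should rarely be split, but a very long sentence with no punctuation can still be cut between two characters of the same word.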
