R - how to create DocumentTermMatrix for Korean words

I hope the text mining experts here, including those who don't read Korean, can help me with a very specific question.

I'm currently trying to create a Document Term Matrix (DTM) from a free-text variable that contains a mix of English and Korean words.

First, I used the cld3::detect_language function to remove the observations that are not in Korean from the data.

Second, I used the KoNLP package to extract only the nouns from the filtered data (Korean text only).

Third, I know that the tm package makes it fairly easy to create a DTM.
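The filtering and noun-extraction steps above could look roughly like the sketch below. The helper names `keep_korean` and `nouns_per_doc` are made up for illustration, and the sketch assumes that the cld3 and KoNLP packages are installed (KoNLP also needs a working Java setup) and that KoNLP has been loaded with `library(KoNLP)` before the second helper is called.

```r
# Sketch only: keep_korean() and nouns_per_doc() are hypothetical helper names.

# Keep only the texts that cld3 identifies as Korean (language code "ko").
keep_korean <- function(texts) {
  lang <- cld3::detect_language(texts)
  # detect_language() returns NA when it cannot decide; drop those too.
  texts[!is.na(lang) & lang == "ko"]
}

# Extract nouns from each text separately, so the result stays aligned
# with the documents: one character vector of nouns per text.
nouns_per_doc <- function(texts) {
  lapply(texts, function(txt) KoNLP::extractNoun(txt))
}
```

Keeping the nouns as a list with one element per observation matters later, because a DTM needs to know which nouns belong to which document.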

The issue is that when I use the tm package to create a DTM, I can't get it to count only the nouns. This is not a problem with English words, but Korean is a different story. For example, if I use KoNLP to extract nouns only, I can extract "훌륭" from "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc. The tm package doesn't recognize this shared noun and treats all of these forms as separate terms when creating a DTM.

Is there any way I can create a DTM based on the nouns that were extracted with the KoNLP package?
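One direction that might work is sketched below: join the nouns of each document back into a single space-separated string, build a corpus from those strings, and tell tm to split on whitespace only. The `nouns` list here is a made-up stand-in for real KoNLP output, and `space_tokenizer` is a hypothetical helper name. The sketch uses `VCorpus` because, as far as I know, tm ignores custom tokenizer functions for `SimpleCorpus` objects.

```r
library(tm)

# Illustrative stand-in for KoNLP output: one character vector of nouns per
# document. These values are made up for the example.
nouns <- list(
  c("훌륭", "학생"),
  c("훌륭", "선생", "학생"),
  c("학교")
)

# Join each document's nouns with spaces so every noun becomes one token.
noun_docs <- vapply(nouns, paste, character(1), collapse = " ")

# Build a volatile corpus with one document per observation.
corpus <- VCorpus(VectorSource(noun_docs))

# Split on whitespace only, so tm does not re-tokenize the extracted nouns.
space_tokenizer <- function(x) {
  tokens <- unlist(strsplit(as.character(x), "[[:space:]]+"))
  tokens[nzchar(tokens)]
}

dtm <- DocumentTermMatrix(
  corpus,
  control = list(
    tokenize = space_tokenizer,
    # tm's default minimum term length is 3 characters, which would
    # silently drop two-character nouns such as "훌륭".
    wordLengths = c(1, Inf)
  )
)

inspect(dtm)
```

The `wordLengths` setting is easy to miss: with the default of `c(3, Inf)`, short Korean nouns never reach the matrix even when the tokenization itself is correct.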

I realize that if you don't read Korean, my question may be hard to follow. I'm hoping someone can point me in the right direction.

Much appreciated in advance.
