How to reduce semantically similar words?


I have a large corpus of words extracted from documents. Some of the words in the corpus may mean the same thing. For example, "command" and "order" mean the same, whereas "apple" and "apply" do not.

I would like to merge similar words, e.g. map both "command" and "order" to "command". I tried word2vec, but it doesn't seem to check the semantic similarity of words (it outputs a high similarity for "apple" and "apply", since four characters in the two words are the same). When I try wup (Wu-Palmer) similarity, it gives a good score only if the words have matching synonyms, and the results are not that impressive.

What could be the best approach to reduce semantically similar words to get rid of redundant data and merge similar data?

Denis Gordeev On

I believe one option here is WordNet. It gives you a list of synonyms for a word, so you can merge them, provided you know the word's part of speech.
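To make the merging step concrete, here is a minimal sketch in Python. The helper names (merge_synonyms, wordnet_synonyms) and the toy synonym table are my own illustration, not part of any library. The WordNet lookup assumes NLTK is installed and its "wordnet" corpus has been downloaded; the merging itself only needs some function that returns a set of synonyms for a word.

```python
from typing import Callable, Dict, Iterable, List, Set


def wordnet_synonyms(word: str, pos: str = "n") -> Set[str]:
    """Collect lemma names from every WordNet synset of `word` for one POS.

    Assumes NLTK plus its WordNet data (nltk.download("wordnet")).
    """
    from nltk.corpus import wordnet as wn  # imported lazily on purpose

    return {
        lemma.name().lower()
        for synset in wn.synsets(word, pos=pos)
        for lemma in synset.lemmas()
    }


def merge_synonyms(
    words: Iterable[str],
    synonyms_of: Callable[[str], Set[str]],
) -> Dict[str, str]:
    """Map every word to a representative; the first word seen in a group wins."""
    representatives: List[str] = []
    mapping: Dict[str, str] = {}
    for word in words:
        if word in mapping:
            continue
        synonyms = synonyms_of(word)
        # Reuse an earlier representative if it is listed as a synonym.
        match = next((rep for rep in representatives if rep in synonyms), None)
        if match is None:
            representatives.append(word)
            mapping[word] = word
        else:
            mapping[word] = match
    return mapping


# Hypothetical synonym table standing in for a real lookup such as
# wordnet_synonyms; the entries are made up for illustration only.
toy_synonyms = {
    "command": {"command", "order"},
    "order": {"order", "command"},
    "apple": {"apple"},
    "apply": {"apply", "use"},
}
mapping = merge_synonyms(
    ["command", "order", "apple", "apply"],
    lambda w: toy_synonyms.get(w, {w}),
)
print(mapping)
```

With a real corpus you would pass wordnet_synonyms (with the right part of speech) instead of the toy table. Because the first word of each group becomes the representative, the order of the input list decides which spelling survives.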

However, I'd like to point out that "order" and "command" are not the same: you do not command food in a restaurant, for example. This kind of ambiguity affects a great many words.

I'd also like to point out that Word2vec ignores spelling entirely; the algorithm only considers which words occur together. You may be confusing it with FastText, which does use character n-grams. Either way, there is probably something wrong with your model, because in a standard set of embeddings the distance between these two concepts should be large. With the MUSE FastText embeddings, the similarity between "apple" and "apply" is only 0.15, which is quite low.

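I computed this with Gensim's similarity function (here model is a KeyedVectors object; with a full Word2Vec model, call model.wv.similarity instead):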

model.similarity("apply", "apple")

So you might need to tune your training parameters or simply use a pretrained model.
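Once the embeddings give sensible scores, merging by similarity could look roughly like the sketch below. It is only an illustration under stated assumptions: the three-dimensional vectors are made up, the 0.7 threshold is arbitrary, and merge_by_similarity is a hypothetical helper, not a Gensim function. In practice the vectors would come from your trained or pretrained model.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_by_similarity(
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # arbitrary cut-off, tune it on your own data
) -> Dict[str, str]:
    """Greedy single pass: attach each word to its most similar earlier
    representative if the similarity reaches the threshold."""
    representatives: List[str] = []
    mapping: Dict[str, str] = {}
    for word, vector in vectors.items():
        best: Optional[str] = None
        best_score = threshold
        for rep in representatives:
            score = cosine_similarity(vector, vectors[rep])
            if score >= best_score:
                best, best_score = rep, score
        if best is None:
            representatives.append(word)
            mapping[word] = word
        else:
            mapping[word] = best
    return mapping


# Made-up toy vectors, for illustration only (not real embeddings).
toy_vectors = {
    "command": [1.0, 0.1, 0.0],
    "order": [0.9, 0.2, 0.0],
    "apple": [0.0, 0.1, 1.0],
    "apply": [0.1, 1.0, 0.1],
}
merged = merge_by_similarity(toy_vectors)
print(merged)
```

The greedy pass is simple but order-dependent. Keep in mind that embedding similarity reflects shared contexts rather than strict synonymy, so a high threshold and a manual check of the merged groups are advisable.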