spaCy lemmatizer gives different results on repeated words


The lemmatization of the following sentence

  • "Not finished! Not finished! A gem for our adopted daughter, Kiri, - - born of Grace's avatar, - - and whose conception was a complete mystery."

produced a somewhat confusing result:

['not', 'finished', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

The problem is that the first "finished" and the second "finished" were assigned different POS tags:

['PART', 'AUX', 'PART', 'VERB', 'DET', 'NOUN', 'ADP', 'PRON', 'VERB', 'NOUN', 'PROPN', 'VERB', 'ADP', 'PROPN', 'NOUN', 'CCONJ', 'DET', 'NOUN', 'AUX', 'DET', 'ADJ', 'NOUN']

The first one was tagged as an auxiliary (AUX) and left as 'finished', while the second one was tagged as a verb (VERB) and lemmatized to 'finish', as expected.
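To spot this kind of inconsistency without reading the lists by eye, the tokens can be grouped by surface form and checked for more than one (POS, lemma) analysis. This is only a minimal sketch: `find_inconsistent` is a hypothetical helper name, and the sample data is just the first four tokens with the tags and lemmas reported above.

```python
from collections import defaultdict


def find_inconsistent(tokens):
    """Return {surface form: set of (pos, lemma)} for forms analysed in more than one way.

    `tokens` is an iterable of (text, pos, lemma) triples; forms are compared
    case-insensitively.
    """
    seen = defaultdict(set)
    for text, pos, lemma in tokens:
        seen[text.lower()].add((pos, lemma))
    # Keep only the surface forms that received more than one distinct analysis.
    return {form: analyses for form, analyses in seen.items() if len(analyses) > 1}


# First four tokens of the sentence, with the tags and lemmas from the output above.
sample = [
    ("Not", "PART", "not"),
    ("finished", "AUX", "finished"),
    ("Not", "PART", "not"),
    ("finished", "VERB", "finish"),
]

print(find_inconsistent(sample))
```

With a spaCy `Doc` named `doc`, the same helper could presumably be fed with `((t.text, t.pos_, t.lemma_) for t in doc if not t.is_punct)`.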

Adding a third "Not finished!" gave the following result:

['not', 'finish', 'not', 'finish', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

Removing one of them, so that only a single "Not finished!" remains, gave:

['not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

Even four repetitions give the expected result:

['not', 'finish', 'not', 'finish', 'not', 'finish', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

I am struggling to find both a logical explanation and a workaround that solves the problem in general, not just for this specific example.
