I am stemming words with the Porter and Lancaster stemmers, and I made these observations:
Input: replied
Porter: repli
Lancaster: reply

Input: twice
Porter: twice
Lancaster: twic

Input: came
Porter: came
Lancaster: cam

Input: In
Porter: In
Lancaster: in

My questions are:
- Lancaster was supposed to be the "aggressive" stemmer, but it worked properly with `replied`. Why?
- The word `In` remained the same in Porter, with the uppercase `In`. Why?
- Notice that Lancaster is removing the `e` from words ending with `e`. Why?

I am not able to understand these concepts. Could you please help?

Q: Lancaster was supposed to be "aggressive" stemmer but it worked properly with `replied`. Why?

It's because the Lancaster stemmer implementation was improved in https://github.com/nltk/nltk/pull/1654

If we take a look at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L62, there's a suffix rule that changes `-ied > -y`.

The feature allows users to input new rules, and if no additional rules are added, it'll use the `self.default_rule_tuple` in `parseRules`, where the `rule_tuple` will be applied: https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L196

The `default_rule_tuple` actually comes from the whoosh implementation of the Paice-Husk stemmer, which is also known as the Lancaster stemmer: https://github.com/nltk/nltk/pull/1661 =)

Q: The word In remained the same in Porter with uppercase In, Why?

This is super interesting! And most probably a bug.

If we look at the code, the first thing that `PorterStemmer.stem()` does is to lowercase the word: https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L651

Almost every return path returns `stem`, which is lowercased, but there are two if clauses that return some form of the original `word`, which hasn't been lowercased!

The first if clause checks whether the word is inside `self.pool`, which contains the irregular words and their stems.

The second checks whether `len(word) <= 2` and, if so, returns the word in its original form. For "In", the second if clause evaluates to True, so the original non-lowercased form is returned.

Q: Notice that the Lancaster is removing words ending with
`e` in "came", Why?

Not surprisingly, this also comes from the `default_rule_tuple`: at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L67 there's a rule that changes `-e > -` =)

Q: How do I disable the
`-e > -` rule from `default_rule_tuple`?

(Un-)fortunately, the `LancasterStemmer._rule_tuple` object is an immutable tuple, so we can't simply remove one item from it, but we can override it =)
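
A minimal sketch of one way to do the override, assuming an NLTK version that includes the pull request above, so that the `LancasterStemmer` constructor accepts a `rule_tuple` argument. It also assumes the `-e > -` rule is stored in `default_rule_tuple` as the string `"e1>"`; please check the source of your installed version. The names `RULE_TO_DROP` and `custom_rules` are only for illustration.

```python
from nltk.stem import LancasterStemmer

# Assumption: the "-e > -" rule is spelled "e1>" in default_rule_tuple.
RULE_TO_DROP = "e1>"

# Tuples are immutable, so build a new tuple without the unwanted rule.
custom_rules = tuple(
    rule for rule in LancasterStemmer.default_rule_tuple
    if rule != RULE_TO_DROP
)

# Pass the filtered rules to the constructor instead of mutating the default.
default_stemmer = LancasterStemmer()
custom_stemmer = LancasterStemmer(rule_tuple=custom_rules)

for word in ("came", "twice", "replied"):
    print(word, default_stemmer.stem(word), custom_stemmer.stem(word))
```

Building a new tuple leaves the class-level defaults untouched, so other `LancasterStemmer` instances keep their usual behaviour.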