In NLP, when we use the Laplace (add-one) smoothing technique, we assume that every word is seen one more time than its actual count, and the formula looks like this:
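P(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + |V|)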
where V is the size of the vocabulary. My question is: why do we add V when we are only considering the count of the previous word?
I only have a rough idea that every word's count is incremented by one, so we have to normalize by V, but I still don't understand it properly. As I said, we are only considering the count of the previous word, so why not just add 1 to it?
I also saw that if we add V, then the probabilities of all the bigrams sum to 1, which is what they should do. But is there any other explanation for why V?

The |V| variable that we see in the denominator of the additive smoothing function is not actually a direct part of the probabilistic definition of the n-gram estimate. It is derived as follows: first, we start with the naive idea that if we add 1 to the numerator, we should also add 1 to the denominator to avoid division-by-zero errors.

But the denominator is really a sum over the whole vocabulary, so we are adding +1 to every term in that sum; instead of doing that term by term, we can simply add the size of the vocabulary once. That is why you see sum(c(wi-1, w)) + |V| in the denominator instead of sum(c(wi-1, w) + 1); note the scope of the "sum" function.

More details
Sometimes I find it easier to see the math in code. Consider this ngram estimate without Laplace smoothing:
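Here is a minimal sketch using collections.Counter over a toy corpus (the corpus and the function names are just for illustration):

```
from collections import Counter

# Toy corpus, just for illustration.
corpus = "the cat sat on the mat the cat ate the fish".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # c(wi-1, wi)
unigrams = Counter(corpus)                  # c(wi-1)

def mle(prev_word, word):
    # P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(mle("the", "cat"))  # 2 / 4 = 0.5
```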
Now consider Laplace smoothing with the incremental +1 on the numerator and denominator:
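Continuing the same sketch (here the vocabulary is just the set of word types in the toy corpus):

```
vocab = set(corpus)  # word types, |V| = 7 for the toy corpus

def laplace(prev_word, word):
    numerator = bigrams[(prev_word, word)] + 1
    # Add +1 to every continuation count c(wi-1, w), one for each word in the vocab.
    denominator = sum(bigrams[(prev_word, w)] + 1 for w in vocab)
    return numerator / denominator

print(laplace("the", "cat"))  # (2 + 1) / (4 + 7) = 3/11
print(laplace("the", "sat"))  # unseen bigram: (0 + 1) / 11
```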
Note that the +1 in the numerator simply adds 1 to the wi-1, wi count. The +1 inside the denominator's sum adds 1 for every possible ngram of the form wi-1, *, i.e., one for each word in the vocabulary that could follow wi-1. Since we are adding +1 to every one of these in the same fashion, we can just add the number of such ngrams, which is the size of the vocabulary, e.g.
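Again with the toy sketch from above:

```
# Pull the +1 out of the sum: the denominator becomes c(wi-1) + |V|.
def laplace_v(prev_word, word):
    numerator = bigrams[(prev_word, word)] + 1
    denominator = sum(bigrams[(prev_word, w)] for w in vocab) + len(vocab)
    return numerator / denominator

print(laplace_v("the", "cat"))  # also 3/11
assert laplace("the", "cat") == laplace_v("the", "cat")
```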
This is a good proof of concept for how the +1 in the inner sum becomes a +|V| when you carry the +1 out of the summation, e.g.
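sum(c(wi-1, w) + 1) = sum(c(wi-1, w)) + sum(1) = c(wi-1) + |V|

where both sums run over all |V| words w in the vocabulary, so the inner +1 contributes exactly |V| in total.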