I'm trying to write a method that counts the number of words in Chinese and Japanese text. The count should exclude special characters, punctuation, and whitespace.
I tried creating a regex for each locale and finding words based on it. I also looked for existing regexes online, but none of them seem to work. My approach:
function countWords(text, locale) {
  let wordCount = 0;
  // Set the word boundary based on the locale
  let wordBoundary = '\\b';
  if (locale === 'ja') {
    // Japanese word boundary
    wordBoundary = '[\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Han}ー]+';
  } else if (locale === 'zh') {
    // Chinese word boundary
    wordBoundary = '[\\p{Script=Han}]+';
  }
  const regex = new RegExp(wordBoundary, 'gu');
  const matches = text.matchAll(regex);
  for (const match of matches) {
    wordCount++;
  }
  return wordCount;
}
I thought this would work, but when I compare the result with the word count in MS Word, the two numbers differ.
A possible word-count approach could be based on an array of text segments built from the result of calling the segment method of an Intl.Segmenter instance. Each segment item features properties such as ...
... thus, in order to get the total word count, one could reduce the array of segment items by checking each item's isWordLike value ...
Note: at the time of writing, Firefox still does not support/implement Intl.Segmenter.
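The approach described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper name countWordLike and the sample strings are assumptions made for this example. The counts for Chinese and Japanese depend on the segmentation data shipped with the JavaScript runtime, so they are not guaranteed to match the number MS Word reports.

```javascript
// Hypothetical helper (name chosen for this sketch): counts the segments
// that Intl.Segmenter flags as word-like for the given locale.
function countWordLike(text, locale) {
  // granularity 'word' makes each segment carry an isWordLike flag;
  // punctuation and whitespace segments have isWordLike === false.
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });

  // segment() returns an iterable, so spread it into an array before reducing.
  return [...segmenter.segment(text)].reduce(
    (count, segment) => (segment.isWordLike ? count + 1 : count),
    0
  );
}

// Sample inputs (assumed for illustration only).
console.log(countWordLike('Hello, world!', 'en')); // 2
console.log(countWordLike('吾輩は猫である。', 'ja')); // runtime-dependent
console.log(countWordLike('我喜欢学习中文。', 'zh')); // runtime-dependent
```

Because the segmenter already separates punctuation and whitespace into their own non-word-like segments, no per-locale regex is needed; only the locale argument changes.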