Are there any opportunities to tokenize hashtags into multi-word tokens?


I am currently analyzing Instagram postings which often have hashtags containing more than one word (e.g. #pictureoftheday).

However, tokenizing them with the R package tidytext yields only a single token. Instead, I would like to get multiple tokens, such as "picture" "of" "the" "day". Unfortunately, I have not found a package capable of doing so. Do you know of any R package that allows this?

Thanks in advance!

2 Answers

Answer by help-info.de:

As far as I know, you can't split joined words without knowing that they are in fact separate words. If the hashtags were separated by a delimiter it would be easy; without one it becomes quite complex. You need a language-dependent dictionary.

You probably have to process your data separately. Creating your own dictionary-based method is often a good solution, but it is very time intensive.
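To make the idea concrete, here is a minimal sketch (in Python, to match the other answer) of a dictionary-based segmenter that splits a hashtag by dynamic programming over a word list. The WORDS set and the segment_hashtag function are hypothetical placeholders; in practice you would load a proper language-dependent dictionary.

    # Minimal sketch of a dictionary-based segmenter.
    # WORDS is a hypothetical word list; replace it with a real dictionary.
    WORDS = {"picture", "of", "the", "day"}

    def segment_hashtag(tag):
        """Split a hashtag body into dictionary words via dynamic programming."""
        tag = tag.lstrip("#").lower()
        # best[i] holds a valid segmentation of tag[:i], or None if none exists.
        best = [None] * (len(tag) + 1)
        best[0] = []
        for i in range(1, len(tag) + 1):
            for j in range(i):
                if best[j] is not None and tag[j:i] in WORDS:
                    best[i] = best[j] + [tag[j:i]]
                    break
        return best[len(tag)]

    print(segment_hashtag("#pictureoftheday"))  # ['picture', 'of', 'the', 'day']

This sketch keeps the first segmentation found for each prefix; a real implementation would also score candidate splits, for example by word frequency.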

See also:

Among the most basic forms of quantitative text analysis are word-counting techniques and dictionary-based methods. This tutorial will cover both of these topics, as well as sentiment analysis, which is a form of dictionary-based text analysis.

Answer by Haorui He:

Try the Python package ekphrasis:


    from ekphrasis.classes.segmenter import Segmenter

    # segmenter using word statistics from English Wikipedia
    seg = Segmenter(corpus="english")
    print(seg.segment("smallandinsignificant"))

output:


    > small and insignificant
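
Applied to the hashtag from the question, the same segmenter should give something like the following (the exact split depends on the word statistics of the corpus ekphrasis loads):

    print(seg.segment("pictureoftheday"))
    # expected to print something like: picture of the day

The segmented string could then be passed back to tidytext's unnest_tokens() in R to obtain the individual tokens.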