Are there any opportunities to tokenize hashtags into multi-word tokens?


I am currently analyzing Instagram postings which often have hashtags containing more than one word (e.g. #pictureoftheday).

However, tokenizing them with the R package tidytext yields only a single token. Instead, I would like to get multiple tokens, such as "picture" "of" "the" "day". Unfortunately, I have not found a package capable of doing so. Do you know of any R package that allows this?

Thanks in advance!

2 Answers

Answer by help-info.de:

As far as I know, you can't split joined words without knowing that they are in fact separate words. If the hashtags were separated by a delimiter it would be easy; without one it becomes quite complex. You need a language-dependent dictionary.

You probably have to process your data separately. Creating your own dictionary-based method is often a good solution, but it is very time intensive.
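To make the idea concrete, here is a minimal sketch (in Python, to match the other answer) of a dictionary-based segmenter that splits a hashtag by dynamic programming over a word list. The WORDS set and the segment_hashtag function are hypothetical placeholders; in practice you would load a proper language-dependent dictionary.

    # Minimal sketch of a dictionary-based segmenter.
    # WORDS is a hypothetical word list; replace it with a real dictionary.
    WORDS = {"picture", "of", "the", "day"}

    def segment_hashtag(tag):
        """Split a hashtag body into dictionary words via dynamic programming."""
        tag = tag.lstrip("#").lower()
        # best[i] holds a valid segmentation of tag[:i], or None if none exists.
        best = [None] * (len(tag) + 1)
        best[0] = []
        for i in range(1, len(tag) + 1):
            for j in range(i):
                if best[j] is not None and tag[j:i] in WORDS:
                    best[i] = best[j] + [tag[j:i]]
                    break
        return best[len(tag)]

    print(segment_hashtag("#pictureoftheday"))  # ['picture', 'of', 'the', 'day']

This sketch keeps the first segmentation found for each prefix; a real implementation would also score candidate splits, for example by word frequency.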

See also:

Among the most basic forms of quantitative text analysis are word-counting techniques and dictionary-based methods. This tutorial will cover both of these topics, as well as sentiment analysis, which is a form of dictionary-based text analysis.

Answer by Haorui He:

Try the Python package ekphrasis:


    from ekphrasis.classes.segmenter import Segmenter

    # segmenter using word statistics from English Wikipedia
    seg = Segmenter(corpus="english")
    print(seg.segment("smallandinsignificant"))

output:


    > small and insignificant
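
Applied to the hashtag from the question, the same segmenter should give something like the following (the exact split depends on the word statistics of the corpus ekphrasis loads):

    print(seg.segment("pictureoftheday"))
    # expected to print something like: picture of the day

The segmented string could then be passed back to tidytext's unnest_tokens() in R to obtain the individual tokens.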