I have a corpus of 500 research articles, and I want to extract the top 4-grams based not simply on raw frequency but on relevance to the research article genre in general (that is, the 4-grams characteristic of this genre).
TF-IDF was recommended, and using Scikit-learn, I get a list of 4-grams based on TF-IDF score.
Question: As I understand it, a high TF-IDF score means that the 4-gram has appeared in fewer articles. How can such 4-grams be representative of the research article genre if they appear in only a few articles? Is there another approach you would recommend?
Thanks.