My task is to assign a broad and a fine category to each text in a pandas DataFrame.
My df is something like this:
Text
I like this pen
this is the worst light bulb ever
these pants fit me just fine
Desired output:
Text | Broad_cat | Fine_cat
I like this pen | Stationery | Pen
this is the worst light bulb ever | Electrical | Light Bulb
these pants fit me just fine | Clothing | Pants
The text could be from any category, so I can't use a prepared dictionary. These are reviews that I can get from any source. I was hoping that there is an open-source Python package that can help with the specific task of categorizing a comment. I have already tried YAKE, RAKE, Summa and KeyBERT, and while each of them gives me keywords, the keywords don't always turn out to be the category. Is this even possible? Any help in this regard is much appreciated.
I presume you have a list of allowed categories?
This is a multiclass classification problem.
A fiddly approach is to embed the sentences into some sort of vector space, use something like the softmax function to select the class, and train the model on labelled data. This post discusses this.
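As a rough illustration of that embed-then-softmax idea, here is a minimal sketch using scikit-learn. It assumes you have some labelled examples; the training sentences and labels below are made up, and TF-IDF is only a simple stand-in for whatever sentence embedding you would actually use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled data; a real model needs far more examples per class.
train_texts = [
    "I like this pen",
    "this notebook has nice paper",
    "this is the worst light bulb ever",
    "the lamp stopped working after a week",
    "these pants fit me just fine",
    "the shirt shrank in the wash",
]
train_labels = [
    "Stationery", "Stationery",
    "Electrical", "Electrical",
    "Clothing", "Clothing",
]

# TF-IDF plays the role of the embedding step (an assumption, not the only
# choice). With more than two classes, LogisticRegression with its default
# solver fits a multinomial model, i.e. a softmax over the classes.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

new_texts = ["I like this pen"]
predicted = clf.predict(new_texts)
probabilities = clf.predict_proba(new_texts)  # one row per text, one column per class

print(predicted[0])
print(dict(zip(clf.classes_, probabilities[0].round(3))))
```

This only ever predicts categories it saw during training, which is why it needs a fixed list of allowed categories.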
I think you might be more interested in zero-shot text classification. Hugging Face has a pipeline (a way of using models for certain tasks) for this, which accepts a candidate_labels argument. So you should be able to use it with an appropriate model and specify candidate labels. The underlying model has to support this in some way, but presumably some do; cross-encoder/nli-distilroberta-base appears to.
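A sketch of how the zero-shot pipeline could fill in the two columns, assuming you can write down candidate labels for the broad and the fine level. The helper names (build_classifier, top_label, add_categories) are made up for illustration, and the code assumes the text column is called Text as in the question.

```python
import pandas as pd


def build_classifier(model_name="cross-encoder/nli-distilroberta-base"):
    """Create a Hugging Face zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily because it is a heavy dependency
    return pipeline("zero-shot-classification", model=model_name)


def top_label(classify, text, labels):
    """Return the best-scoring candidate label for one text."""
    # For a single string the pipeline returns a dict whose "labels" list
    # is sorted by descending score.
    return classify(text, candidate_labels=labels)["labels"][0]


def add_categories(df, classify, broad_labels, fine_labels):
    """Return a copy of df with Broad_cat and Fine_cat columns added."""
    out = df.copy()
    out["Broad_cat"] = [top_label(classify, t, broad_labels) for t in out["Text"]]
    out["Fine_cat"] = [top_label(classify, t, fine_labels) for t in out["Text"]]
    return out
```

Usage would look like add_categories(df, build_classifier(), ["Stationery", "Electrical", "Clothing"], ["Pen", "Light Bulb", "Pants"]). The label lists are my assumption; the predictions depend on the model and on how the labels are worded, so check them on a sample of your reviews.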