Text clustering based on “stance” rather than the distribution of embeddings


I'm conducting public opinion analysis, and I want to cluster texts by "stance" rather than by the distribution of their sentence embeddings. I've reviewed the text clustering literature, but most studies use some embedding model (such as BERT or an RNN) to generate numerical vectors and then apply a classical clustering algorithm (such as K-Means) to those vectors. While effective, this approach doesn't meet my requirements. Consider the following example:

Some individuals like dogs; they may simply write short reviews like "I like dogs" on social media. Conversely, other dog enthusiasts might compose longer reviews such as "Dogs are our best friends, and my little dog has alleviated my depression." Despite the difference in length, their views (or stances) are essentially identical. On the other hand, individuals who dislike dogs may express sentiments like "I don't like dogs" or provide more detailed explanations like "Some may argue that dogs are our best friends, but I once suffered a dog bite."

I've run experiments with the methods above, and the results show that the algorithm tends to put "I like dogs" and "I don't like dogs" in the same cluster merely because their embeddings are "similar." I've also tried other metrics, such as cosine similarity, but none of them worked. The core problem is that the model clusters these vectors by the distribution or "shape" of the embeddings, whereas I want it to cluster them by the underlying "stance" or "opinion" of the sentences. Is there any way to achieve this?
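
The effect can be reproduced in miniature. The sketch below is illustrative only and is not the code from the experiments: it assumes plain bag-of-words count vectors as a stand-in for the neural embeddings, and the helper names `bag_of_words` and `cosine_similarity` are hypothetical. It only shows how a surface-level similarity measure can place opposite stances closer together than two sentences that share a stance.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens (apostrophes are kept, so "don't" is one token)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Example sentences taken from the question.
short_pro = "I like dogs"
short_con = "I don't like dogs"
long_pro = "Dogs are our best friends, and my little dog has alleviated my depression."

# Opposite stances, nearly identical wording.
sim_opposite = cosine_similarity(bag_of_words(short_pro), bag_of_words(short_con))
# Same stance, very different wording and length.
sim_same = cosine_similarity(bag_of_words(short_pro), bag_of_words(long_pro))

print(f"opposite stance: {sim_opposite:.3f}")  # 0.866
print(f"same stance:     {sim_same:.3f}")      # 0.149
```

With this stand-in representation, the two opposite-stance sentences score far higher than the two same-stance sentences, so any distance-based clustering on these vectors would group by wording rather than by stance. Neural sentence embeddings are much richer than word counts, so the numbers will differ, but this is the kind of behaviour described above.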
