I've developed a specialized model for identifying significant phrases within texts, using a BERT+CRF framework, framed as a variant of Named Entity Recognition (NER). Instead of traditional named entities, this model classifies phrases into predefined categories (e.g., in "can you check with the recruiter and set up a call", 'set up a call' would get the 'meeting' label).
Despite extensive optimization attempts, my model's performance has plateaued. To address this, I'm adding a semantic signal: I embed a curated list of keywords and phrases representative of each category, and enrich the CRF's input with a similarity score alongside the token embeddings from BERT.
Concretely, I feed the CRF layer the output embeddings of the BERT model along with similarity scores between each token in the text and the keyphrases in my list. I compute the score like this: for each keyphrase under a label, I average the embeddings of its words (if the keyphrase has more than one word); I then average across all keyphrases for that label, giving me one embedding vector per label (let's call it the average-label-embedding); finally, I take the cosine similarity between that vector and each token in the text to be labeled. This way, for every token the CRF gets one cosine score per label, along with that token's embedding. Are there any other architectures I should look into? Any reference papers would also be helpful.
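In case it helps to see it concretely, here is a minimal sketch of the feature computation I described. The embedding function is a random placeholder (in my actual setup these are BERT's last-hidden-state vectors), and the label list is just the 'meeting' example from above:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_tokens(words):
    """Placeholder for per-token BERT embeddings (last hidden state).
    Here: random 768-d vectors, purely to make the sketch runnable."""
    return rng.standard_normal((len(words), 768))

def label_embedding(keyphrases):
    """Average word embeddings within each keyphrase, then average
    across all keyphrases for the label -> the average-label-embedding."""
    phrase_vecs = [embed_tokens(p.split()).mean(axis=0) for p in keyphrases]
    return np.mean(phrase_vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One cosine feature per label, for each token in the sentence:
labels = {"meeting": ["set up a call", "schedule a meeting"]}
label_vecs = {lab: label_embedding(kps) for lab, kps in labels.items()}

sentence = "can you check with the recruiter and set up a call".split()
token_vecs = embed_tokens(sentence)
features = [[cosine(tok, vec) for vec in label_vecs.values()]
            for tok in token_vecs]
```

These per-token feature rows are what I concatenate with the BERT embeddings before the CRF.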
When I ran the above experiment, performance on the NER task did not improve much. Looking closely, the per-token similarity scoring is not behaving the way I thought it would. For example, similarity scores between even two closely related keyphrases within the same label class are all over the place, and the cosine similarity between the average-label-embedding and a token from the text that is clearly related to that label is just as noisy. Any insights or suggestions on a better way to compute keyphrase embeddings, aggregate them into per-label embeddings, or score similarity would really help.
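This is the kind of sanity check I'm running, shown here with random placeholder vectors in place of the real mean-pooled BERT keyphrase embeddings (for calibration: two independent random 768-d vectors should score near zero, so "related" keyphrases ought to land well above that):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Hypothetical mean-pooled embeddings of two closely related keyphrases
# from the same label class (placeholders for real BERT outputs).
kp_a = rng.standard_normal(768)
kp_b = rng.standard_normal(768)
score = cosine(kp_a, kp_b)  # for independent random vectors: near 0
```

With my real embeddings, scores for same-label keyphrase pairs don't sit consistently above this random baseline, which is what makes me think the averaging step is the problem.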
Thanks in advance!