I've been playing around with analogy queries over some publicly available word embeddings, in particular using the following:
- numberbatch-en-19.08 from https://github.com/commonsense/conceptnet-numberbatch
- glove.42B.300d from https://nlp.stanford.edu/projects/glove/
- glove.840B.300d from https://nlp.stanford.edu/projects/glove/
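For reference, here is a minimal sketch of how I load the vectors, assuming GloVe's plain-text format (one token per line followed by its space-separated float components; Numberbatch's text release additionally has a word2vec-style count/dimension header line that would need to be skipped). The tiny in-memory sample stands in for the real multi-gigabyte file:

```python
import numpy as np
from io import StringIO

def load_glove(lines):
    """Parse GloVe's plain-text format into a dict of token -> numpy vector.
    Each line is: <token> <float> <float> ... <float>"""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Tiny stand-in for glove.42B.300d.txt (the real vectors are 300-dimensional).
sample = StringIO("king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vecs = load_glove(sample)
```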
I'm doing some basic queries that include (where queryTarget is what I am looking for):
baseSource:baseTarget :: querySource:queryTarget
e.g. man:woman :: king:queen
- maximize cosine_similarity(baseTarget-baseSource, queryTarget-querySource)
- maximize cosine_similarity(baseTarget-baseSource, queryTarget-querySource) * cosine_similarity(baseTarget-queryTarget, baseSource-querySource)
- minimize L2norm(baseTarget-baseSource+querySource, queryTarget)
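Concretely, the three strategies above can be sketched with plain numpy (variable names mirror the notation; in practice the vectors come from the loaded embedding, and the candidate queryTarget that maximizes/minimizes each score is searched over the whole vocabulary):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a.b / (||a|| ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy_scores(base_source, base_target, query_source, query_target):
    """Score one candidate queryTarget for
    baseSource:baseTarget :: querySource:queryTarget
    under the three strategies. Returns (s1, s2, s3); s1 and s2 are
    maximized over candidates, s3 (a distance) is minimized."""
    offset = base_target - base_source
    # Strategy 1: the two relation offsets should point the same way.
    s1 = cosine_similarity(offset, query_target - query_source)
    # Strategy 2: additionally require the cross offsets to align.
    s2 = s1 * cosine_similarity(base_target - query_target,
                                base_source - query_source)
    # Strategy 3: 3CosAdd-style distance to the offset prediction.
    s3 = float(np.linalg.norm(base_target - base_source
                              + query_source - query_target))
    return s1, s2, s3

# Toy vectors where the analogy holds exactly: queen = king + (woman - man).
man, woman = np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])
king, queen = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0])
s1, s2, s3 = analogy_scores(man, woman, king, queen)
```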
For the query:
man:woman :: king:?
The GloVe data gives me the expected queen, lady, princess results across the various matching strategies. However, ConceptNet gives female_person, adult_female, king_david's_harp as the top 3, which I would not expect (queen is not even in the top 20). More generally, unexpected results regularly displace the expected answers that the GloVe embeddings do return.
Does the ConceptNet embedding require some sort of additional preprocessing before I can use it this way? Or is it simply not tailored/suited for English analogy queries?