How can I recover the likelihood of a certain word appearing in a given context from word embeddings?

I know that some methods of generating word embeddings (e.g. CBOW) are based on predicting the likelihood of a given word appearing in a given context. I'm working with Polish, a language whose segmentation is sometimes ambiguous: 'Coś', for example, can be treated either as one word or as two words that have been conjoined ('Co' + '-ś'), depending on the context. What I want to do is create a context-sensitive tokenizer. Assuming that I have the vector representation of the preceding context, and all possible segmentations, could I somehow calculate or approximate the likelihood of particular words appearing in this context?
This very much depends on how you got your embeddings. The CBOW model has two parameter matrices: the input embedding matrix, denoted v, and the output projection matrix v'. If you want to recover the probabilities that the CBOW model uses at training time, you need v' as well: the probability of a word given a context is a softmax over the dot products of the rows of v' with the averaged context embedding. See equation (2) in the word2vec paper. Tools for pre-computing word embeddings usually don't expose v', so you would need to modify them yourself.
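A minimal sketch of what that recovery would look like, assuming you have extracted both matrices as numpy arrays. The names `V` (input embeddings v), `Vp` (output projection v'), and `word2id` (vocabulary mapping) are placeholders, not part of any library's API:

```python
import numpy as np

# Sketch: recover P(word | context) from CBOW parameters.
# Assumed (placeholder) inputs:
#   V       - input embedding matrix v,   shape (vocab_size, dim)
#   Vp      - output projection matrix v', shape (vocab_size, dim)
#   word2id - dict mapping a token string to its row index
def cbow_word_probability(word, context_words, V, Vp, word2id):
    """Softmax probability of `word` given `context_words`, as in CBOW."""
    # CBOW's hidden state is the average of the context words' input vectors
    h = np.mean([V[word2id[w]] for w in context_words], axis=0)
    # dot every output vector v'_w with the context representation
    scores = Vp @ h                # shape (vocab_size,)
    scores -= scores.max()         # subtract the max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return probs[word2id[word]]
```

Note that gensim's Word2Vec, for instance, does keep output weights around (in `model.syn1neg` when trained with negative sampling), but those weights were optimized for a sampled objective rather than the full softmax, so the probabilities computed this way would only be an approximation.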
Anyway, if you want to compute the probability of a word given a context, you should consider using a (neural) language model rather than a table of word embeddings. If you search the Internet, I am sure you will find something that suits your needs.
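For instance, a pretrained causal language model can score each candidate segmentation directly by the log-likelihood it assigns to it after the preceding context. A hedged sketch using the Hugging Face transformers API; the checkpoint name is a placeholder, so substitute whatever Polish causal LM you have access to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name: substitute a real Polish causal LM.
MODEL_NAME = "your-polish-causal-lm"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def segmentation_log_likelihood(context, candidate):
    """Log-likelihood the LM assigns to `candidate` following `context`.

    Assumes the tokenization of `context` is a prefix of the tokenization
    of the concatenation, which holds for typical BPE tokenizers when a
    space separates the two strings.
    """
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + candidate,
                         return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits        # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Sum log-probabilities of the candidate's tokens only; the logits at
    # position pos - 1 predict the token at position pos.
    total = 0.0
    for pos in range(context_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total
```

For the 'Coś' example, you would score each candidate segmentation after the same left context and pick the reading with the higher log-likelihood.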