Glove text pre-processing

Asked by Abhishek Bhatia on 11 May 2020 at 15:34

I noticed that in some pre-processing techniques, people convert URLs, numbers, and dates in the text to placeholder tokens. Does the GloVe dataset have embeddings trained for these placeholders? Can I feed them directly into the dataset?

gojomo On 12 May 2020 at 00:33

You can feed any tokens you want into a word2vec/GloVe training session.

However, tokens with a lot of internal variety but little or diffuse semantic meaning (or too few examples of each individual variant) are often either elided or coalesced into a synthetic replacement token.

For example, every number might become '__NUM__' (or be sorted into ranged buckets like '__1DIGITNUM__', '__2DIGITNUM__', etc.). And dates might become '__DATE__' (or be bucketed like '__1700s__', '__1990s__', etc.).
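The replacements described above can be sketched roughly as follows. This is only an illustration: the placeholder names ('__URL__', the digit-count buckets, the decade buckets), the regular expressions, and the helper names `canonicalize` and `preprocess` are my own assumptions, not anything a particular pre-trained set is known to use.

```python
import re

# Hypothetical patterns and placeholder names, chosen for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
YEAR_RE = re.compile(r"\d{4}")
NUM_RE = re.compile(r"\d+")


def canonicalize(token):
    """Map a single token to a synthetic placeholder, or return it unchanged."""
    if URL_RE.fullmatch(token):
        return "__URL__"
    # Treat four-digit values in a plausible range as years, bucketed by decade.
    if YEAR_RE.fullmatch(token) and 1000 <= int(token) <= 2099:
        return "__{}0s__".format(token[:3])
    # Any other all-digit token is bucketed by its number of digits.
    if NUM_RE.fullmatch(token):
        return "__{}DIGITNUM__".format(len(token))
    return token


def preprocess(text):
    """Whitespace-tokenize the text and canonicalize each token."""
    return [canonicalize(tok) for tok in text.split()]


print(preprocess("In 1994 she paid 42 dollars at https://example.com"))
```

Whatever scheme you pick, the same function has to be applied both to the training corpus and to any text you later look up, otherwise the placeholders will never match.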

What any particular pre-trained model did needs to be checked directly with the model's creators, or by probing the tokens in the model. You should, of course, apply the same canonicalization to any entities/tokens you intend to look up against a pre-trained vector set.
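One way to probe a vector set is to read its vocabulary and test whether your candidate placeholders are present. The sketch below assumes the plain-text layout in which each line holds a token followed by its space-separated vector values; the helper names `load_vocab` and `probe`, the candidate tokens, and the tiny in-memory sample are all made up for illustration and say nothing about what any real set contains.

```python
import io


def load_vocab(lines):
    """Collect the first space-separated field (the token) of each line."""
    vocab = set()
    for line in lines:
        token = line.rstrip("\n").split(" ", 1)[0]
        if token:
            vocab.add(token)
    return vocab


def probe(vocab, candidates):
    """Report, for each candidate token, whether it appears in the vocabulary."""
    return {cand: cand in vocab for cand in candidates}


# Made-up sample standing in for a real vectors file.
sample = io.StringIO("the 0.1 0.2\n<url> 0.3 0.4\n")
vocab = load_vocab(sample)
print(probe(vocab, ["<url>", "__NUM__"]))
```

For a real file you would pass an open file handle instead of the sample, for instance `open(path, encoding="utf-8")` with `path` pointing at wherever you saved the vectors.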

So what your set does is completely up to you if you are doing your own training, or up to the prior decisions made by a specific project, in which case it is only answerable once a specific project/dataset/codebase is identified.
