Creating a language model from scratch in spaCy with a POS-tagged corpus and word embeddings

Asked by jlrl on 25 April 2023 at 15:40 (73 views)

I am trying to define and train a new language in spaCy from scratch, but I am unsure how to configure spaCy for the initial training. The resources I currently have:

I already have word embeddings from a corpus of around 150 million tokens.
I already have a large POS-tagged corpus for this language.
I already have a dependency grammar to decompose parts of speech.
The language does not currently have a treebank large enough to train the model on.
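
To make the POS-tagged corpus concrete, here is a minimal sketch of writing tagged sentences out as CoNLL-U text, a format that spaCy's `convert` command can read. It assumes the corpus is already available as lists of (token, tag) pairs and that the tags are Universal Dependencies UPOS labels; the name `to_conllu` is hypothetical. Because there is no treebank, the HEAD and DEPREL columns are left as `_`.

```python
from typing import Iterable, List, Tuple

# Assumption: each sentence is a list of (token, UPOS tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def to_conllu(sentences: Iterable[TaggedSentence]) -> str:
    """Render tagged sentences as CoNLL-U text, leaving unknown columns as '_'."""
    blocks = []
    for sentence in sentences:
        rows = []
        for index, (form, tag) in enumerate(sentence, start=1):
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            columns = [str(index), form, "_", tag, "_", "_", "_", "_", "_", "_"]
            rows.append("\t".join(columns))
        blocks.append("\n".join(rows))
    if not blocks:
        return ""
    # CoNLL-U puts a blank line after every sentence, including the last one.
    return "\n\n".join(blocks) + "\n\n"
```

The resulting file could then be passed to `python -m spacy convert corpus.conllu ./corpus --converter conllu` to produce spaCy's binary training format. Whether the converter accepts rows without head and relation values should be checked against the installed spaCy version.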

What would be the workflow to combine these resources into a new spaCy "language"?
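
For the embeddings, one step that seems likely to be part of any such workflow is exporting the vectors to the plain-text word2vec layout (a header line with the row count and dimension, then one word per line followed by its values), which is what `spacy init vectors` expects as input. The sketch below assumes the embeddings are held in memory as a dict from word to a sequence of floats; `write_word2vec_text` is a hypothetical helper, not a spaCy function.

```python
import io
from typing import Dict, Sequence, TextIO


def write_word2vec_text(vectors: Dict[str, Sequence[float]], out: TextIO) -> None:
    """Write vectors in word2vec text format: a 'count dim' header, then rows."""
    if not vectors:
        raise ValueError("no vectors to write")
    widths = {len(values) for values in vectors.values()}
    if len(widths) != 1:
        raise ValueError("all vectors must have the same dimension")
    width = widths.pop()
    # Header line: number of rows, then vector dimension.
    out.write(f"{len(vectors)} {width}\n")
    for word, values in vectors.items():
        # Assumption: words contain no whitespace, since fields are space-separated.
        out.write(word + " " + " ".join(f"{x:.6f}" for x in values) + "\n")


# Tiny usage example with made-up values.
buffer = io.StringIO()
write_word2vec_text({"a": [1.0, 2.0], "b": [0.5, 0.25]}, buffer)
```

A file written this way could then be loaded with something like `python -m spacy init vectors xx vectors.txt ./vectors_model`, where `xx` is spaCy's generic multi-language code, used here as a stand-in until a custom language class is registered.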

I have seen lots of guides online for how to initialize a language model with one of spaCy's built-in languages, but relatively little (if anything) about setting up a new one from scratch.
