Creating a language model from scratch in spaCy using a POS-tagged corpus and word embeddings


I am trying to build and train a new language in spaCy from scratch, but I am struggling with how to configure spaCy for the initial training. Some notes on current resources:

  • I already have word embeddings from a corpus of around 150 million tokens.
  • I already have a large POS-tagged corpus for this language.
  • I already have a dependency grammar to decompose parts of speech.
  • The language does not currently have a usefully large treebank to train the model on.

What would be the workflow to combine these resources into a new spaCy "language"?

I have seen lots of guides online for how to initialize a language model with one of spaCy's built-in languages, but relatively little (if anything) about setting up a new one from scratch.
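
To make the first step of such a workflow concrete, here is a minimal sketch of getting the POS-tagged corpus into a form spaCy's tooling can read. It assumes the corpus is available as lists of (word, tag) pairs and that the tags fit the CoNLL-U UPOS column; a language-specific tagset may belong in the XPOS column instead. The function name `to_conllu` and the sample sentences are hypothetical, and the syntax columns are left as `_` because no treebank exists yet.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (word form, POS tag) pairs. This shape is an
# assumption about how the tagged corpus has been loaded.
TaggedSentence = List[Tuple[str, str]]


def to_conllu(sentences: Iterable[TaggedSentence]) -> str:
    """Serialise POS-tagged sentences as CoNLL-U text with empty syntax columns."""
    blocks = []
    for sentence in sentences:
        rows = []
        for index, (form, tag) in enumerate(sentence, start=1):
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            # Only ID, FORM and UPOS are filled; everything else is "_".
            columns = [str(index), form, "_", tag, "_", "_", "_", "_", "_", "_"]
            rows.append("\t".join(columns))
        blocks.append("\n".join(rows))
    if not blocks:
        return ""
    # CoNLL-U separates sentences with a blank line and ends with one.
    return "\n\n".join(blocks) + "\n\n"


if __name__ == "__main__":
    # Hypothetical two-sentence sample, for illustration only.
    sample = [[("la", "DET"), ("hundo", "NOUN")], [("kuras", "VERB")]]
    print(to_conllu(sample), end="")
```

A file written this way could then be tried with `python -m spacy convert` to produce spaCy's binary training format, and the existing embeddings could be loaded with `python -m spacy init vectors`, using either the generic `xx` language code or a custom `Language` subclass. Whether the converter accepts placeholder heads and relations for this data would need checking, and only the tagger could be trained this way until a treebank is available for the parser.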
