I am trying to build and train a model for a new language in spaCy from scratch, but I am unsure how to configure spaCy for the initial training. Some notes on the resources I currently have:
- I already have word embeddings from a corpus of around 150 million tokens.
- I already have a large POS-tagged corpus for this language.
- I already have a dependency grammar to decompose parts of speech.
- The language does not currently have a treebank large enough to train a model on.

What would be the workflow to combine these resources into a new spaCy "language"?
I have seen lots of guides online on how to initialize a language model with one of spaCy's built-in languages, but little (if anything) about setting up a new language from scratch.