I want to train on a large dataset in spaCy v3.0+.
The data is about 8,000,000 tokens in total. I split it into chunks of roughly 1,000,000 tokens each and then tried to merge them with DocBin in Python, but I get an error:
import os
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
from spacy.util import filter_spans

merged_doc_bin = DocBin()
files = [
    "G:\\success-demo\\product_ner\\test\\train3.spacy",  # 3000000 tokens
    "G:\\success-demo\\product_ner\\test\\train1.spacy",  # 3000000 tokens
    "G:\\success-demo\\product_ner\\test\\train2.spacy",  # 2000000 tokens
]

for filename in files:
    doc_bin = DocBin().from_disk(filename)
    merged_doc_bin.merge(doc_bin)

merged_doc_bin.to_disk("G:\\success-demo\\product_ner\\test\\final\\murge.spacy")
I'm unable to save the merged data with to_disk. Is there any other way to do this?

This is a limitation due to msgpack, where a single file can't be larger than 2GB.
You don't need to merge these files, though. You can provide a directory rather than a single .spacy file for spacy train: it will recursively load all .spacy files in the directory.
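For example, you can point the training path at the directory that already holds train1.spacy, train2.spacy and train3.spacy, e.g. python -m spacy train config.cfg --paths.train G:\success-demo\product_ner\test --paths.dev <your dev.spacy>. As a minimal sketch of the same idea from Python (paths and the blank "en" pipeline below are just assumptions for illustration), spacy.training.Corpus, which backs the default spacy.Corpus.v1 reader used by spacy train, streams every .spacy file it finds under a directory:

import spacy
from spacy.training import Corpus

nlp = spacy.blank("en")  # use the same language as your training data

# Point Corpus at the directory, not at a single file; it walks the
# directory recursively and reads every .spacy file it finds there.
corpus = Corpus("G:\\success-demo\\product_ner\\test")

n_examples = 0
for example in corpus(nlp):  # yields spacy.training.Example objects lazily
    n_examples += 1
print("loaded", n_examples, "examples")

For training itself, just set paths.train (in the config, or with --paths.train on the command line) to that directory, and keep only your training chunks under it, since every .spacy file beneath it will be picked up.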