spaCy v3 DocBin unable to save train.spacy: "bytes object is too large"


I want to train on a large dataset in spaCy v3.0+.

My data contains about 8,000,000 tokens in total. I split it into chunks and then merged them via DocBin with the Python code below, but I am getting an error:

from spacy.tokens import DocBin

merged_doc_bin = DocBin()

files = [
    "G:\\success-demo\\product_ner\\test\\train3.spacy",  # 3,000,000 tokens
    "G:\\success-demo\\product_ner\\test\\train1.spacy",  # 3,000,000 tokens
    "G:\\success-demo\\product_ner\\test\\train2.spacy",  # 2,000,000 tokens
]

# Load each chunk from disk and merge it into a single DocBin
for filename in files:
    doc_bin = DocBin().from_disk(filename)
    merged_doc_bin.merge(doc_bin)

# This call fails with "bytes object is too large"
merged_doc_bin.to_disk("G:\\success-demo\\product_ner\\test\\final\\murge.spacy")

The to_disk call fails with the error "bytes object is too large".

Is there any other way to do this?



1 Answer

Answered by aab:

This is a limitation of msgpack: a single serialized file can't be larger than 2 GB.
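If you ever do need each file to stay under that limit, one workaround is to rewrite an oversized DocBin as several smaller ones. This is a minimal sketch, not part of the original answer; the blank "en" pipeline, the chunk size of 100,000 docs, and the output filenames are assumptions:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # any pipeline works; only its vocab is used to deserialize

big = DocBin().from_disk("G:\\success-demo\\product_ner\\test\\train1.spacy")
chunk = DocBin()
part = 0
for i, doc in enumerate(big.get_docs(nlp.vocab), start=1):
    chunk.add(doc)
    if i % 100000 == 0:  # flush every 100,000 docs to keep each file small
        chunk.to_disk(f"train_part{part}.spacy")
        part += 1
        chunk = DocBin()
if len(chunk):  # write any remaining docs
    chunk.to_disk(f"train_part{part}.spacy")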

You don't need to merge these files, though. You can provide a directory rather than a single .spacy file for spacy train:

spacy train config.cfg --paths.train train/ --paths.dev dev/
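With the files from the question, that means placing train1.spacy, train2.spacy, and train3.spacy in a train/ directory (and your development data in dev/) instead of merging them.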

It will recursively load all .spacy files in the directory.
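To sanity-check that a directory of .spacy files loads as expected, you can read it back with spacy.training.Corpus. This is a minimal sketch; the blank "en" pipeline and the directory path are assumptions:

import spacy
from spacy.training import Corpus

nlp = spacy.blank("en")  # vocab used to deserialize the stored docs
corpus = Corpus("G:\\success-demo\\product_ner\\test\\train\\")  # reads all .spacy files under this directory
examples = list(corpus(nlp))
print(f"Loaded {len(examples)} examples")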