I'm getting a numpy.core._exceptions._ArrayMemoryError when I try to use scikit-learn's DictVectorizer on my data.
I'm using Python 3.9 in PyCharm on Windows 10, and my system has 64 GB of RAM.
I'm pre-processing text data for training a Keras POS-tagger. The data starts in this format, with lists of tokens for each sentence:
sentences = [['Eorum', 'fines', 'Nervii', 'attingebant'], ['ait', 'enim'], ['scriptum', 'est', 'enim', 'in', 'lege', 'Mosi'], ...]
I then use the following function to extract useful features from the dataset:
def get_word_features(words, word_index):
    """Return a dictionary of important word features for an individual word in the context of its sentence"""
    word = words[word_index]
    return {
        'word': word,
        'sent_len': len(words),
        'word_len': len(word),
        'first_word': word_index == 0,
        'last_word': word_index == len(words) - 1,
        'start_letter': word[0],
        'start_letters-2': word[:2],
        'start_letters-3': word[:3],
        'end_letter': word[-1],
        'end_letters-2': word[-2:],
        'end_letters-3': word[-3:],
        'previous_word': '' if word_index == 0 else words[word_index - 1],
        'following_word': '' if word_index == len(words) - 1 else words[word_index + 1]
    }

word_dicts = list()
for sentence in sentences:
    for index, token in enumerate(sentence):
        word_dicts.append(get_word_features(sentence, index))
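To make the feature layout concrete, here's the dict this produces for the first token of the second sentence above:

get_word_features(['ait', 'enim'], 0)
# {'word': 'ait', 'sent_len': 2, 'word_len': 3,
#  'first_word': True, 'last_word': False,
#  'start_letter': 'a', 'start_letters-2': 'ai', 'start_letters-3': 'ait',
#  'end_letter': 't', 'end_letters-2': 'it', 'end_letters-3': 'ait',
#  'previous_word': '', 'following_word': 'enim'}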
In this format the data isn't very large; it's only about 3.3 MB.
Next I set up DictVectorizer, fit it to the data, and try to transform the data with it:
from sklearn.feature_extraction import DictVectorizer
dict_vectoriser = DictVectorizer(sparse=False)
dict_vectoriser.fit(word_dicts)
X_train = dict_vectoriser.transform(word_dicts)
At this point I'm getting this error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 499. GiB for an array with shape (334043, 200643) and data type float64
This seems to suggest that DictVectorizer has massively increased the size of the data, to nearly 500 GB. Is this normal? Should the output really take up this much memory, or am I doing something wrong?
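The figure in the error does add up for a dense float64 array of that shape:

334043 * 200643 * 8 / 1024 ** 3   # rows * columns * 8 bytes per float64 ≈ 499.4 GiB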
I looked for solutions, and in this thread someone suggested allocating more virtual memory: go into Windows settings via SystemPropertiesAdvanced, uncheck 'Automatically manage paging file size for all drives', then manually set the paging file size to a sufficiently large amount. That would be fine if the task needed around 100 GB, but I don't have enough storage to allocate 500 GB to it.
Is there any solution for this? Or do I just need to buy a larger drive to have a big enough pagefile? That seems impractical, especially when the initial dataset wasn't even particularly large.
I worked out a solution. In case it's useful to anybody, here it is. I had been using a data generator later in my workflow, just to feed data to the GPU for processing in batches.
Based on the comments I got here, I originally tried updating the output here to return batch_x.todense() and changing my code above so that dict_vectoriser = DictVectorizer(sparse=True). As I mentioned in the comments, though, this didn't seem to work.

I've now changed the generator so that, once the dict_vectoriser is created and fitted to the data, it's passed as an argument to the data generator, and it isn't called to transform the data until the generator is being used. To call it you need to set the batch_size and provide labels, so below y_train is some encoded list of labels corresponding to the x_train data.
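In outline, the generator looks something like this minimal sketch (the class name FeatureBatchGenerator and the batch size of 128 are just illustrative, not my exact code): it subclasses keras.utils.Sequence and only calls dict_vectoriser.transform on one batch of feature dicts at a time, so the full dense matrix never has to exist in memory.

import numpy as np
from tensorflow.keras.utils import Sequence

class FeatureBatchGenerator(Sequence):
    """Vectorise one batch of word-feature dicts at a time instead of the whole dataset."""

    def __init__(self, word_dicts, labels, dict_vectoriser, batch_size=128):
        super().__init__()
        self.word_dicts = word_dicts            # raw feature dicts (the x_train data)
        self.labels = labels                    # encoded labels (y_train)
        self.dict_vectoriser = dict_vectoriser  # already fitted, passed in as an argument
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.word_dicts) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        # only this slice is vectorised; with sparse=False the result is a small dense array
        batch_x = self.dict_vectoriser.transform(self.word_dicts[start:end])
        batch_y = np.asarray(self.labels[start:end])
        return batch_x, batch_y

It's then used along these lines:

train_generator = FeatureBatchGenerator(word_dicts, y_train, dict_vectoriser, batch_size=128)
model.fit(train_generator)   # model being the Keras POS-tagger, not shown here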