I have a dataset that contains 8 columns with 1,482,531 rows in every column.
I am trying to build a content-based recommendation system by
computing cosine similarities with linear_kernel in Python,
but after half an hour it gives me a memory error.
Is this due to the size of the dataset, and if so, is there a solution to this issue?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split
dataset = pd.read_csv('C:/data2/train.tsv', sep='\t', low_memory=False)
dataset['item_description'] = dataset['item_description'].fillna('')
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(dataset['item_description'])
tfidf_matrix.shape
(1482535, 13831759)
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
The MemoryError is expected: linear_kernel returns a dense n × n array, and for n = 1,482,535 rows that is roughly 1.48M² × 8 bytes ≈ 17 TB, far beyond any machine's RAM. If your system has enough computational power, you can try the following approach: divide the data into chunks, compute the similarities chunk by chunk, write each chunk's results to a CSV file (or a database), and later use that file for prediction. Here is a small example, assuming you have, say, 100,000 records.
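A minimal sketch of the chunked approach, using a hypothetical toy corpus in place of `dataset['item_description']`: each iteration computes similarities for only `chunk_size` rows against the full matrix (a small dense block), keeps just the top-k neighbours per item, and discards the rest, so the full n × n matrix is never materialised.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical small corpus standing in for dataset['item_description'].
docs = ["red cotton shirt", "blue cotton shirt", "leather wallet",
        "red leather bag", "blue denim jeans", "cotton denim jacket"]

tf = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf_matrix = tf.fit_transform(docs)

def top_k_similar(tfidf_matrix, k=2, chunk_size=2):
    """For each item, return the indices of its k most similar items,
    computed chunk by chunk to bound memory use."""
    n = tfidf_matrix.shape[0]
    results = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Dense block of shape (chunk_size, n) -- small enough to hold.
        sims = linear_kernel(tfidf_matrix[start:stop], tfidf_matrix)
        for i, row in enumerate(sims):
            row[start + i] = -1.0            # exclude the item itself
            top = np.argsort(row)[::-1][:k]  # indices of the k best matches
            results.append(list(top))
    return results

neighbours = top_k_similar(tfidf_matrix, k=2)
```

Each chunk's `results` rows could be appended to a CSV file (e.g. with `csv.writer`) instead of accumulated in a list, which keeps memory flat regardless of n.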