I have a CSV of around 5 million rows with two columns: 'client' and 'product'. Each client is a node ID in my Neo4j graph, and the same goes for each product. Duplicates of any kind can occur, and that is fine. I have the following code in Python:
j = 0
batch_size = 500
# ceiling division, so the last partial batch is counted
nbatch = (len(client_product) + batch_size - 1) // batch_size
for i in range(0, len(client_product), batch_size):
    j += 1
    print(f"{j}/{nbatch}")  # progress: current batch / total batches
    batch = client_product.iloc[i:i + batch_size]
    my_dict = batch.to_dict('records')
    my_dict_str = dictToPropertiesSingleValue(str(my_dict))
    tq = f"""UNWIND {my_dict_str} AS row
    MATCH (n:client), (k:product)
    WHERE ID(n) = row.client_ID AND ID(k) = row.product_ID
    CREATE (n)-[r:client_product]->(k)
    RETURN count(r)"""
    runq(tq)
The part in the middle,

    my_dict = batch.to_dict('records')
    my_dict_str = dictToPropertiesSingleValue(str(my_dict))

formats each batch as a list of maps that Neo4j can read (a sketch of such a helper follows the example), like:
[ {client_ID: 918906,product_ID: 40320},
{client_ID: 918920,product_ID: 40320},
{client_ID: 918921,product_ID: 40320},
{client_ID: 918953,product_ID: 40321},
{client_ID: 918955,product_ID: 40322},
{client_ID: 918960,product_ID: 40323},
{client_ID: 919018,product_ID: 40323},
{client_ID: 919517,product_ID: 40323},
{client_ID: 919702,product_ID: 40323},
{client_ID: 919703,product_ID: 40324},
{client_ID: 919743,product_ID: 40324},
{client_ID: 919744,product_ID: 40324},
{client_ID: 919745,product_ID: 40324},
{client_ID: 919764,product_ID: 40325}]
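For reference, the helper essentially rewrites the Python repr of the records, whose keys are quoted, into a Cypher map literal with unquoted keys. A minimal sketch of such a helper (it may differ from my actual implementation):

    import re

    def dictToPropertiesSingleValue(records_str):
        # Rewrite "[{'client_ID': 918906, 'product_ID': 40320}, ...]" (Python repr)
        # as "[{client_ID: 918906, product_ID: 40320}, ...]" (Cypher map literal)
        # by stripping the quotes around the keys.
        return re.sub(r"'(\w+)':", r"\1:", records_str)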
I am really struggling to find a faster way to push these. It makes sense that pushing a few million relationships should take a few minutes, but not the 4 hours it takes for me. Any suggestions?
There is no need to write special code that makes 10K individual Cypher queries (which is why it takes so long). You can instead make a single query using the LOAD CSV clause, which is specifically designed to import data directly from CSV files. You can also use CALL subqueries in transactions to import large amounts of data in batches.
For example (I use a batch size of 5000, but you can try different values to see which works best for you):
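A sketch of the query; the file name client_product.csv and its location in the server's import directory are assumptions, and LOAD CSV reads every field as a string, hence the toInteger() calls:

    LOAD CSV WITH HEADERS FROM 'file:///client_product.csv' AS row
    CALL {
        WITH row
        MATCH (n:client), (k:product)
        WHERE ID(n) = toInteger(row.client) AND ID(k) = toInteger(row.product)
        CREATE (n)-[:client_product]->(k)
    } IN TRANSACTIONS OF 5000 ROWS

In the Neo4j Browser, prefix the query with :auto, since CALL {...} IN TRANSACTIONS must run in an implicit (auto-commit) transaction.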
Notes:
- LOAD CSV can also load from an http URL if the CSV file is available remotely.
- If you want to avoid creating duplicate relationships, you can use MERGE instead of CREATE (but it will be a bit slower).
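If you trigger this from Python rather than the Browser, it likewise has to run as an auto-commit query. A sketch using the official neo4j driver (the URI and credentials are placeholders):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    LOAD CSV WITH HEADERS FROM 'file:///client_product.csv' AS row
    CALL {
        WITH row
        MATCH (n:client), (k:product)
        WHERE ID(n) = toInteger(row.client) AND ID(k) = toInteger(row.product)
        CREATE (n)-[:client_product]->(k)
    } IN TRANSACTIONS OF 5000 ROWS
    """

    with driver.session() as session:
        # session.run() executes as an auto-commit transaction, which
        # CALL {...} IN TRANSACTIONS requires.
        session.run(query).consume()

    driver.close()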