Adding relationships efficiently in Neo4j

I have a CSV of around 5 million rows and two columns: 'client' and 'product'. Each client corresponds to a node ID in my Neo4j graph, and the same goes for each product. Duplicates of any kind can occur, and that is fine. I have the following code in Python:

j = 0
batch_size = 500
nbatch = (len(client_product) + batch_size - 1) // batch_size  # ceiling division
for i in range(0, len(client_product), batch_size):
    j += 1
    print(f"{j}/{nbatch}")
    batch = client_product.iloc[i:i + batch_size]
    # Convert the batch to a list of dicts, then format it as a Cypher list literal
    my_dict = batch.to_dict('records')
    my_dict_str = dictToPropertiesSingleValue(str(my_dict))
    tq = f"""UNWIND {my_dict_str} AS row
    MATCH (n:client), (k:product)
    WHERE ID(n) = row.client_ID AND ID(k) = row.product_ID
    CREATE (n)-[r:client_product]->(k)
    RETURN count(r)"""
    runq(tq)

The part in the middle

my_dict = batch.to_dict('records')
my_dict_str = dictToPropertiesSingleValue(str(my_dict))

formats the list of dictionaries into a list literal that Neo4j can read, like:

[ {client_ID: 918906,product_ID: 40320},
 {client_ID: 918920,product_ID: 40320},
 {client_ID: 918921,product_ID: 40320},
 {client_ID: 918953,product_ID: 40321},
 {client_ID: 918955,product_ID: 40322},
 {client_ID: 918960,product_ID: 40323},
 {client_ID: 919018,product_ID: 40323},
 {client_ID: 919517,product_ID: 40323},
 {client_ID: 919702,product_ID: 40323},
 {client_ID: 919703,product_ID: 40324},
 {client_ID: 919743,product_ID: 40324},
 {client_ID: 919744,product_ID: 40324},
 {client_ID: 919745,product_ID: 40324},
 {client_ID: 919764,product_ID: 40325}]
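
For reference, a helper along these lines would produce that format. The original dictToPropertiesSingleValue is not shown, so this is only a plausible reconstruction: it strips the quotes that Python's str() puts around dict keys, which Cypher map literals do not use.

import re

def dictToPropertiesSingleValue(s):
    # Hypothetical reconstruction of the asker's helper: turn
    # "[{'client_ID': 918906, 'product_ID': 40320}]" into the Cypher
    # map-literal form "[{client_ID: 918906, product_ID: 40320}]"
    # by removing the quotes around the keys.
    return re.sub(r"'(\w+)':", r"\1:", s)

print(dictToPropertiesSingleValue(str([{'client_ID': 918906, 'product_ID': 40320}])))
# [{client_ID: 918906, product_ID: 40320}]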

I am really struggling to find a faster way to push these. It makes sense that pushing a few million relationships should take a few minutes, but not the 4 hours it currently takes for me. Any suggestions?

1 Answer

Answered by cybersam:

There is no need to write special code that issues 10K individual Cypher queries (which is why your import takes so long). You can instead run a single query that uses the LOAD CSV clause, which is specifically designed to import CSV data directly. You can also use CALL subqueries IN TRANSACTIONS to batch large imports.

For example (I use a batch size of 5000 below, but you can try different values to see which works fastest for you):

LOAD CSV WITH HEADERS FROM 'file:///client_product.csv' AS r
CALL {
  WITH r
  MATCH (n:client), (k:product)
  WHERE ID(n) = toInteger(r.client_ID) AND ID(k) = toInteger(r.product_ID)
  CREATE (n)-[:client_product]->(k)
} IN TRANSACTIONS OF 5000 ROWS

Notes:

  1. The above query assumes the CSV file starts with a header row (e.g., "client_ID,product_ID").
  2. LOAD CSV reads every value as a string, which is why the IDs are wrapped in toInteger() before being compared to ID(n) and ID(k).
  3. If loading from a local file, see the "Configuration settings for file URLs" section of the docs.
  4. LOAD CSV can also load from an HTTP URL if the CSV file is available remotely.
  5. If you need to avoid creating duplicate relationships, use MERGE instead of CREATE (but it will be a bit slower).
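
If you would rather keep the import in Python, the same batching idea also works through query parameters instead of string formatting. Here is a minimal sketch using the official neo4j Python driver (the URI, credentials, and the client_product DataFrame are assumptions); passing each batch as a parameter lets Neo4j compile the query plan once and reuse it for every batch, instead of re-parsing 10K distinct query strings:

from neo4j import GraphDatabase

# Assumed connection details -- replace with your own.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# A single parameterized query; $rows is bound per batch at run time.
query = """
UNWIND $rows AS row
MATCH (n:client), (k:product)
WHERE ID(n) = row.client_ID AND ID(k) = row.product_ID
CREATE (n)-[:client_product]->(k)
"""

batch_size = 5000
with driver.session() as session:
    for i in range(0, len(client_product), batch_size):
        rows = client_product.iloc[i:i + batch_size].to_dict("records")
        session.run(query, rows=rows)  # the batch travels as a parameter, not as query text
driver.close()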