I'm trying to create a graph with edges only between nodes (record indices in a dataframe) that have the same values in any 2 or more columns.
What I'm doing: I create a list of all possible pairs of column names and go through them looking for duplicates; for each group of duplicates I extract the row indices and create edges.
The problem is that for huge datasets (millions of records) this solution is too slow and requires too much memory.
What I do:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [1, 1, 1, 1, 2],
    'C': [1, 1, 2, 3, 3],
    'D': [2, 7, 9, 8, 4]})
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 | 7 |
| 2 | 3 | 1 | 2 | 9 |
| 3 | 4 | 1 | 3 | 8 |
| 4 | 5 | 2 | 3 | 4 |
Here, rows 0 and 1 have the same values in 2 columns, B and C.
So, for nodes 0, 1, 2, 3, 4 I need to create only the edge 0-1. Every other pair of records has at most 1 matching field.
import numpy as np
import networkit as nk

num_nodes = len(df)                       # one node per row index in the dataframe
column_names = np.array(df.columns)

graph = nk.Graph(num_nodes, directed=False, weighted=False)

# Get the indices of all unique pairs
indices = np.triu_indices(len(column_names), k=1)
# Get the unique pairs of column names
unique_pairs = np.column_stack((column_names[indices[0]], column_names[indices[1]]))

for col1, col2 in unique_pairs:
    # Filter the dataframe directly
    duplicated_rows = df[[col1, col2]].dropna()
    duplicated_rows = duplicated_rows[duplicated_rows.duplicated(subset=[col1, col2], keep=False)]
    for _, group in duplicated_rows.groupby([col1, col2]):
        tb_ids = group.index.tolist()
        # Connect every pair of row indices within this duplicate group
        for i in range(len(tb_ids)):
            for j in range(i + 1, len(tb_ids)):
                graph.addEdge(tb_ids[i], tb_ids[j])
Main question: how can I speed up / improve this solution? I was thinking about parallelizing by column combination, but in that case I can't figure out how to create the edges in the graph properly.
I'd appreciate any help.
Memory Problem
Your multi-million record input generates so many pairs that they cannot all be kept in memory.
You will have to give up storing everything in memory and instead keep the data in a highly optimized database; I suggest SQLite. Bring input data into memory as required and store the pairs to the database as they are found. If you properly optimize your use of SQLite, the performance hit will be minimal and you will not run out of memory.
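A minimal sketch of that setup, using Python's built-in sqlite3 since the question is in Python (the same idea applies from C++ via the SQLite C API); the `pair` table and `store_pair` helper are illustrative names, not a required schema:

```python
import sqlite3

con = sqlite3.connect("pairs.db")
con.execute("CREATE TABLE IF NOT EXISTS pair (row_a INTEGER, row_b INTEGER)")

def store_pair(row_a, row_b):
    # Called each time the search finds two records that match in 2+ columns.
    con.execute("INSERT INTO pair VALUES (?, ?)", (row_a, row_b))

# ... run the pair search, calling store_pair() whenever a match is found ...
con.commit()
con.close()
```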
Performance Problem
Storing pairs to a database will slow the performance slightly.
You will need to optimize how you use the database. The two most important optimizations are:
Transaction grouping. Initially, keep the pairs in memory as they are found. When the pair count reaches a specified number, write them all to the database in one transaction.
Asynchronous writes. Once you have handed the writes off to the database engine, do not wait for confirmation that they succeeded; just blaze ahead with the pair search.
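Both optimizations can be sketched on top of the storage example above, again in Python's sqlite3 for illustration; the batch size and PRAGMA settings below are assumptions to tune, not prescriptions:

```python
import sqlite3

con = sqlite3.connect("pairs.db")
con.execute("PRAGMA synchronous = OFF")      # asynchronous write: don't wait for each write to be confirmed on disk
con.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal off the disk
con.execute("CREATE TABLE IF NOT EXISTS pair (row_a INTEGER, row_b INTEGER)")

BATCH_SIZE = 100_000   # illustrative; tune to your memory budget
buffer = []

def store_pair(row_a, row_b):
    # Transaction grouping: collect pairs in memory, flush them as one transaction.
    buffer.append((row_a, row_b))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    con.executemany("INSERT INTO pair VALUES (?, ?)", buffer)
    con.commit()       # one transaction per batch instead of one per pair
    buffer.clear()

# ... pair search calls store_pair(); call flush() once more when the search ends ...
```

Grouping the inserts this way means you pay one journal sync per batch rather than one per pair, which is where most of the database overhead goes.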
You forgot to state your performance requirement! Whatever it might be, I will assume you need to squeeze out a significant improvement.
I see that you are using Python. It is an interpreted language, so performance will be sluggish. Switching to a compiled language will give you a significant performance boost; for example, well-coded C++ can give an improvement of up to 50 times.
Algorithm
Example:
Here is an implementation of these ideas in C++ that finds ~6,000,000 pairs in 100,000 records in 40 seconds on a modest laptop.
Output from a test run
The complete application, with documentation, is in the github repo: https://github.com/JamesBremner/RecordMatcher
Multithreading
It is straightforward to split the data to be searched into two parts and search each part in its own thread. As often happens with multithreaded applications, the performance results were disappointing at first. However, by tuning the configuration parameters, I have achieved what seems like a worthwhile improvement.
It now finds ~6,000,000 pairs in 100,000 records in 30 seconds on a modest laptop.
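For illustration, here is a rough Python analogue of that splitting idea using multiprocessing rather than the C++ threads the repo uses; the chunk-by-anchor-row scheme and the `find_pairs_chunk` helper are my own sketch, not code from the repo:

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

def find_pairs_chunk(args):
    # Each worker compares its share of "anchor" rows against all later rows and
    # reports the pairs that match in 2 or more columns.
    values, start, stop = args
    pairs = []
    for i in range(start, stop):
        matches = (values[i + 1:] == values[i]).sum(axis=1)   # matching-column count per later row
        for offset in np.nonzero(matches >= 2)[0]:
            pairs.append((i, i + 1 + int(offset)))
    return pairs

if __name__ == "__main__":
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                       'B': [1, 1, 1, 1, 2],
                       'C': [1, 1, 2, 3, 3],
                       'D': [2, 7, 9, 8, 4]})
    values = df.to_numpy()
    n_workers = 2
    bounds = np.linspace(0, len(values), n_workers + 1, dtype=int)
    chunks = [(values, bounds[k], bounds[k + 1]) for k in range(n_workers)]
    with Pool(n_workers) as pool:
        all_pairs = [p for chunk in pool.map(find_pairs_chunk, chunks) for p in chunk]
    print(all_pairs)    # [(0, 1)] for the sample dataframe
```

Splitting by anchor row keeps every pair in exactly one chunk, so the per-worker results can simply be concatenated without deduplication.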