I have two CSV files. The first one can be anywhere from 1,000 to 2 million records. The second one is huge: around 20 million records. I join them on an ID that exists in both files.
I initially did this as follows. The code works and prints several records.
import dask.dataframe as dd
from datetime import datetime
import pandas as pd
print("\nStart time of process: ", datetime.now())
csv_filename = 'second_csv_huge.csv'
csv_tick_filename = r'C:\dir\first_csv_file_with_ids.csv'
df_id_filtered = pd.read_csv(csv_tick_filename)
valoren_ids = df_id_filtered['TDF_VALOREN'].unique().tolist()
valoren_ids = [str(valoren) for valoren in valoren_ids]
#debug: first 10 items in the list
print(valoren_ids[:10])
print("No. of relevant ids: ", len(valoren_ids))
column_names = ["TDF_VALOREN", "BOURSE_CODE", "LAST_TRADE_PRICE", "TRADE_VOLUME", "TRADE_TIME",
"BID_PRICE", "BID_VOLUME", "BID_TIME", "ASK_PRICE", "ASK_VOLUME",
"ASK_TIME", "CUMULATED_VOLUME", "TRADE_STYLE", "TRADE_DATE",
"BID_STYLE", "BID_DATE", "ASK_STYLE", "ASK_DATE", "TRADE_MICROSECS",
"BID_MICROSECS", "ASK_MICROSECS", "CUMULATED_VOLUME_TIME", "LOGICAL_SESSION_DATE"]
#assume the file is in the directory you are working in
df = dd.read_csv(csv_filename, header=None, names=column_names, dtype='str')
df_top_filtered = df.dropna(subset=["LAST_TRADE_PRICE"])
df_top_filtered = df_top_filtered[df_top_filtered.TDF_VALOREN.isin(valoren_ids)]
print(df_top_filtered.head(10))
print("\nEnd time of process: ", datetime.now())
Then I tried to do it without using pandas at all. In the first version above, I used pandas only to get the unique list of IDs; everything else already used Dask.
Here's the code with no pandas at all.
import dask.dataframe as dd
from datetime import datetime
print("\nStart time of process: ", datetime.now())
csv_filename = 'second_csv_huge.csv'
csv_tick_filename = r'C:\dir\first_csv_file_with_ids.csv'
df_id_filtered = dd.read_csv(csv_tick_filename)
# Get unique values of 'TDF_VALOREN' without converting to pandas
valoren_ids = df_id_filtered['TDF_VALOREN'].unique().compute()
# Convert to list
valoren_ids = valoren_ids.tolist()
valoren_ids = [str(valoren) for valoren in valoren_ids]
#debug: first 10 items in the list
print(valoren_ids[:10])
print("No. of relevant ids: ", len(valoren_ids))
column_names = ["TDF_VALOREN", "BOURSE_CODE", "LAST_TRADE_PRICE", "TRADE_VOLUME", "TRADE_TIME",
"BID_PRICE", "BID_VOLUME", "BID_TIME", "ASK_PRICE", "ASK_VOLUME",
"ASK_TIME", "CUMULATED_VOLUME", "TRADE_STYLE", "TRADE_DATE",
"BID_STYLE", "BID_DATE", "ASK_STYLE", "ASK_DATE", "TRADE_MICROSECS",
"BID_MICROSECS", "ASK_MICROSECS", "CUMULATED_VOLUME_TIME", "LOGICAL_SESSION_DATE"]
#assume the file is in the directory you are working in
df = dd.read_csv(csv_filename, header=None, names=column_names, dtype='str')
df_top_filtered = df.dropna(subset=["LAST_TRADE_PRICE"])
print(df_top_filtered.head(10))
df_top_filtered = df_top_filtered[df_top_filtered.TDF_VALOREN.isin(valoren_ids)]
#debug: why is it not printing any records?
print(df_top_filtered.head(10))
print("\nEnd time of process: ", datetime.now())
In this code, when I try to print the first 10 rows after filtering on the IDs, it returns no records.
Something about the list created here does not match any IDs in the second file. I've checked the types and they are strings in both versions of the code; you can also see in my code that I convert them all to strings first.
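For what it's worth, the membership check I ran was along these lines (a sketch with placeholder values standing in for valoren_ids and the head of df_top_filtered; these are not my real IDs):

import pandas as pd

# Sketch of the spot check, with placeholder values in place of my real data.
valoren_ids = ['1001', '1002', '1003']   # stand-in for the list built above
sample = ['1001', '1002.0']              # stand-in for df_top_filtered['TDF_VALOREN'].head()

for v in sample:
    # repr() shows the exact text, including any stray '.0' or whitespace
    print(repr(v), type(v).__name__, v in valoren_ids)

Both sides look like plain strings to me, which is why the empty result is confusing.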
Any ideas?
I tried comparing values in the lists and checking the data types.
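Concretely, one mismatch mode I was trying to rule out (a self-contained sketch, not my real data): if the ID column gets inferred as float64 on one side, for example because of a missing value, then str() produces '123.0' instead of '123', and the isin() filter would silently match nothing.

import pandas as pd

# A numeric column with no missing values is inferred as int64 ...
ints = pd.Series([123, 456])
print([str(v) for v in ints.tolist()])             # -> ['123', '456']

# ... but a single missing value forces float64, and str() then adds '.0'.
floats = pd.Series([123, None])
print([str(v) for v in floats.dropna().tolist()])  # -> ['123.0']

I haven't been able to confirm this is actually what happens in my Dask version, though.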