I have two CSV files. The first one can be anywhere from 1,000 to 2 million records. The second one is huge: around 20 million records. I join them on an ID that exists in both files.
I initially did this as follows. The code works and prints several records.
import dask.dataframe as dd
from datetime import datetime
import pandas as pd
print("\nStart time of process: ", datetime.now())
csv_filename = 'second_csv_huge.csv'
csv_tick_filename = r'C:\dir\first_csv_file_with_ids.csv'
df_id_filtered = pd.read_csv(csv_tick_filename)
valoren_ids = df_id_filtered['TDF_VALOREN'].unique().tolist()
valoren_ids = [str(valoren) for valoren in valoren_ids]
#debug: first 10 items in the list
print(valoren_ids[:10])
print("No. of relevant ids: ", len(valoren_ids))
column_names = ["TDF_VALOREN", "BOURSE_CODE", "LAST_TRADE_PRICE", "TRADE_VOLUME", "TRADE_TIME",
"BID_PRICE", "BID_VOLUME", "BID_TIME", "ASK_PRICE", "ASK_VOLUME",
"ASK_TIME", "CUMULATED_VOLUME", "TRADE_STYLE", "TRADE_DATE",
"BID_STYLE", "BID_DATE", "ASK_STYLE", "ASK_DATE", "TRADE_MICROSECS",
"BID_MICROSECS", "ASK_MICROSECS", "CUMULATED_VOLUME_TIME", "LOGICAL_SESSION_DATE"]
#assume the file is in the directory you are working in
df = dd.read_csv(csv_filename, header=None, names=column_names, dtype='str')
df_top_filtered = df.dropna(subset=["LAST_TRADE_PRICE"])
df_top_filtered = df_top_filtered[df_top_filtered.TDF_VALOREN.isin(valoren_ids)]
print(df_top_filtered.head(10))
print("\nEnd time of process: ", datetime.now())
Then I tried to do it without using pandas at all. In the first version above, I used pandas only to get the unique list of IDs; everything else already used Dask.
Here's the code with no pandas at all.
import dask.dataframe as dd
from datetime import datetime
print("\nStart time of process: ", datetime.now())
csv_filename = 'second_csv_huge.csv'
csv_tick_filename = r'C:\dir\first_csv_file_with_ids.csv'
df_id_filtered = dd.read_csv(csv_tick_filename)
# Get unique values of 'TDF_VALOREN' without converting to pandas
valoren_ids = df_id_filtered['TDF_VALOREN'].unique().compute()
# Convert to list
valoren_ids = valoren_ids.tolist()
valoren_ids = [str(valoren) for valoren in valoren_ids]
#debug: first 10 items in the list
print(valoren_ids[:10])
print("No. of relevant ids: ", len(valoren_ids))
column_names = ["TDF_VALOREN", "BOURSE_CODE", "LAST_TRADE_PRICE", "TRADE_VOLUME", "TRADE_TIME",
"BID_PRICE", "BID_VOLUME", "BID_TIME", "ASK_PRICE", "ASK_VOLUME",
"ASK_TIME", "CUMULATED_VOLUME", "TRADE_STYLE", "TRADE_DATE",
"BID_STYLE", "BID_DATE", "ASK_STYLE", "ASK_DATE", "TRADE_MICROSECS",
"BID_MICROSECS", "ASK_MICROSECS", "CUMULATED_VOLUME_TIME", "LOGICAL_SESSION_DATE"]
#assume the file is in the directory you are working in
df = dd.read_csv(csv_filename, header=None, names=column_names, dtype='str')
df_top_filtered = df.dropna(subset=["LAST_TRADE_PRICE"])
print(df_top_filtered.head(10))
df_top_filtered = df_top_filtered[df_top_filtered.TDF_VALOREN.isin(valoren_ids)]
#debug: why is it not printing any records?
print(df_top_filtered.head(10))
print("\nEnd time of process: ", datetime.now())
In this code, when I try to print the first 10 rows after filtering on the IDs, it returns no records.
Something about the list created here does not match any IDs in the second file. I've checked the types and they are strings in both versions of the code; you can also see in my code that I convert them all to strings first.
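For what it's worth, the membership check I ran was along these lines (a sketch with placeholder values standing in for valoren_ids and the head of df_top_filtered; these are not my real IDs):

import pandas as pd

# Sketch of the spot check, with placeholder values in place of my real data.
valoren_ids = ['1001', '1002', '1003']   # stand-in for the list built above
sample = ['1001', '1002.0']              # stand-in for df_top_filtered['TDF_VALOREN'].head()

for v in sample:
    # repr() shows the exact text, including any stray '.0' or whitespace
    print(repr(v), type(v).__name__, v in valoren_ids)

Both sides look like plain strings to me, which is why the empty result is confusing.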
Any ideas?
I tried comparing values in the lists and checking the data types.
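Concretely, one mismatch mode I was trying to rule out (a self-contained sketch, not my real data): if the ID column gets inferred as float64 on one side, for example because of a missing value, then str() produces '123.0' instead of '123', and the isin() filter would silently match nothing.

import pandas as pd

# A numeric column with no missing values is inferred as int64 ...
ints = pd.Series([123, 456])
print([str(v) for v in ints.tolist()])             # -> ['123', '456']

# ... but a single missing value forces float64, and str() then adds '.0'.
floats = pd.Series([123, None])
print([str(v) for v in floats.dropna().tolist()])  # -> ['123.0']

I haven't been able to confirm this is actually what happens in my Dask version, though.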