Optimizing Partial CSV Data Upload to MySQL (pymysql): Seeking Best Practices for Selective Column Insertion


I'm facing performance issues while uploading data to an AWS RDS (MySQL) database. Every day I process a CSV file of about 6,000 rows with Python and the 'pymysql' package, reading each row and selectively uploading values based on column indices. Only about 10 values are written per row, yet the whole run takes over 10 hours to complete.

Could anyone suggest ways to optimize this process? My current code is attached below for reference:

import pandas as pd
import pymysql
from datetime import datetime
import time


# Connect to the database and select the schema
conn = pymysql.connect(host='hostURL', user='user', password='pw')
cursor = conn.cursor()
cursor.execute("USE DB")

# Read your data
date_org = datetime(2023,11,6).strftime("%Y%m%d")
path = 'C://local_code_run//data//'
resPath = path + 'RandomForest_output//' + '20231106_preds.csv'
data = pd.read_csv(resPath)
data = data.where(pd.notnull(data), None)

# Fill remaining NaNs in the individual-station prediction column with 0
data['Individual_Station_Model_Pred'] = data['Individual_Station_Model_Pred'].fillna(0)

# SQL
sql = '''
            UPDATE PM25_Predictions
            SET Lucas_ML_All = %s, Lucas_ML_One = %s
            WHERE stationid = %s
            AND YEAR(UTC) = %s AND MONTH(UTC) = %s AND DAY(UTC) = %s
            AND YEAR(Forecast) = %s AND MONTH(Forecast) = %s AND DAY(Forecast) = %s AND HOUR(Forecast) = %s
          '''

# Upload: run one UPDATE per CSV row and commit each row individually
for row in data.itertuples():
    utc_year, utc_month, utc_day = date_org[:4], date_org[4:6], date_org[6:8]
    forecast_year, forecast_month, forecast_day = str(row.UTC_DATE)[:4], str(row.UTC_DATE)[4:6], str(row.UTC_DATE)[6:8]
    forecast_hour = str(row.UTC_TIME)[:-2]
    params = (
            row.All_Station_Model_Pred,
            row.Individual_Station_Model_Pred,
            row.Station,
            utc_year, utc_month, utc_day,
            forecast_year, forecast_month, forecast_day, forecast_hour
        )

    try:
        cursor.execute(sql, params)
    except pymysql.MySQLError as e:
        print("Error while updating record:", e)
        conn.rollback()  # Rollback in case of error
    else:
        conn.commit()  # Commit the transaction

Thanks in advance for your help!

I also attempted to compile the data into a single batch and upload it all at once, but that approach was similarly slow (roughly as sketched below). I would appreciate hearing about more efficient approaches from anyone who has handled a similar workload.
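For reference, the batch attempt was along the lines of the simplified sketch below, not the exact script I ran: it reuses the same SQL and column names as the loop above, collects all parameter tuples up front, and then sends them through a single executemany() call with one commit at the end.

# Simplified sketch of the batch attempt: build every parameter tuple first,
# then issue one executemany() and a single commit.
params_list = []
for row in data.itertuples():
    utc_year, utc_month, utc_day = date_org[:4], date_org[4:6], date_org[6:8]
    forecast_year, forecast_month, forecast_day = str(row.UTC_DATE)[:4], str(row.UTC_DATE)[4:6], str(row.UTC_DATE)[6:8]
    forecast_hour = str(row.UTC_TIME)[:-2]
    params_list.append((
        row.All_Station_Model_Pred,
        row.Individual_Station_Model_Pred,
        row.Station,
        utc_year, utc_month, utc_day,
        forecast_year, forecast_month, forecast_day, forecast_hour,
    ))

try:
    # Send all parameter sets in one call instead of committing per row
    cursor.executemany(sql, params_list)
except pymysql.MySQLError as e:
    print("Error while updating records:", e)
    conn.rollback()
else:
    conn.commit()  # single commit for the whole batch

As far as I can tell, pymysql's executemany() only rewrites multi-row INSERT/REPLACE statements into a single query; for an UPDATE like this it still executes one statement per parameter set, which may be why the batch attempt was not much faster.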
