I am trying to learn more about Pandas/Matplotlib/Seaborn, by working with datasets at work. I would like to get to automating reports with Seaborn graphs, granted, that is probably overkill for basic graphs...but I like the way they look. I am struggling to get a simple graph to work. We pull data weekly, and its 4500-5000 datapoints. I want to work with pH and datetime to better understand working with large datasets.
Here is a snapshot of how I have initially set the data up. The csv file had a bunch of crap in top rows, so I set custom names for the columns I want to work with and removed the first 7 so that it starts with my data:
import pandas as pd
import seaborn as sns
import os
from datetime import datetime
df = pd.read_csv(r"Datasets\Effluent data export 12.28.2023 through 01.03.2024.csv")
df = df.drop(df.index[0:8]).rename(columns={df.columns[0]: 'Date', df.columns[1]: 'Level (in)', df.columns[2]: 'Flow Rate (gpm)', df.columns[3]: 'Total Flow (Gal)', df.columns[12]: 'pH', df.columns[13]: 'Temperature (C)'})
df = df.dropna(subset=['pH'])
I added the dropna for pH because using df.info() I noticed that there was one row of NA data and I don't know how to use that so I removed it. I am unsure how to work with NA as far as replacing it with something to maintain true data integrity, but I can figure that out later.
My graph is coming out with black lines all over the X and Y axis, which is probably due to it trying to show ALL of the ticks at once. I tried to solve this by changing the date from an object to datetime using:
df['Date'] = pd.to_datetime(df['Date'])
That got my date column to go from an object to a datetime64 type, and I have equal non-null counts for both of these columns now.
My issue now is that I don't know how to make a graph with this much data. I know that I need to go back and better understand Numpy, Matplotlib, and Pandas...but I am hoping someone can help me understand what it is that I actually need to learn/focus on for how to work with the graphs to show this data. Currently, I have tried these two things, and the result is ultimately the same in both cases with a line that doesn't look right, and no real graph:
sns.lineplot(data=df, x="Date", y="pH")
sns.scatterplot(data=df, x="Date", y="pH")
Here is an example of the scatterplot graph, which due to the pH being fairly tight, should look more like a parallel line and not one with a negative slope as shown:
My ultimate goal is to create graphs for weeks, months, and years, for our facility data. But I don't know how to work with sets this large, or what graphs would be good. I want to look at water level, flow rate, pH, and temperature (see my .rename() for the actual names but I need to figure out how to make a meaningful graph of datetime vs pH for now so I can see how it works.
I suggest trying grouping by period, consider following simple example
output
Explanation: I create dataframe with dates from 2024-1-1 to 2024-1-30 with values from 100 to 129 (
rangeis inclusive-exclusive) then I convert it to datetime and I use groupby to aggregate data, in this case into weeks (W, see Period aliases for allowed values) and takingmedianof it.