Add another plot on top of a histogram plot


I am using Plotly Express to plot two histograms. The first shows the distribution of car ads by the number of miles the cars have run. The second uses the same mileage bins and shows the sum of the prices of the cars in each bin. On the second plot I want to add a data point in the middle of each bin that represents the sum of all the prices in that bin divided by the number of cars in the same bin, i.e. the average price.

[Images: Histogram 1 with the count of ads per mileage bin, and Histogram 2 with the sum of all the car prices per mileage bin]

For example, in the pictures shown above, I want a data point at 48.81B/8186 = 5.963 million. My code is below:

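import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Build mil_counts: number of ads per distinct mileage value (df is the ads DataFrame)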
mil_counts = df.groupby(['mileage']).size().sort_values(ascending=False).reset_index(name='count')

fig = make_subplots(rows=1, cols=2)

# Create a Plotly Express histogram trace for mileage
mileage_histogram_trace = px.histogram(mil_counts, x="mileage", y="count", title="Mileage", nbins=20)

# Add the mileage histogram trace to the first column
fig.add_trace(go.Histogram(histfunc="sum", x=mileage_histogram_trace.data[0]['x'], y=mileage_histogram_trace.data[0]['y'], 
                            name="Mileage", nbinsx=20), row=1, col=1)

# Create a Plotly Express histogram trace for price
price_histogram_trace = px.histogram(df, x="mileage", y="price", title="Price", nbins=20)

# Add the price histogram trace to the second column
fig.add_trace(go.Histogram(histfunc="sum", x=price_histogram_trace.data[0]['x'], y=price_histogram_trace.data[0]['y'], 
                            name="Price", nbinsx=20), row=1, col=2)

# Calculate the average price of cars in each mileage category
mileage_x = mileage_histogram_trace.data[0]['x']
avg_price = [48813640000/8186 , ]

for x_value in mileage_x:
    indices = np.where(price_histogram_trace.data[0]['x'] == x_value)[0]
    if len(indices) > 0:
        avg_price.append(np.mean(price_histogram_trace.data[0]['y'][indices]))
    else:
        avg_price.append(0)

# Add the line trace (superimposed on the bar) with a secondary y-axis
fig.add_trace(go.Scatter(x=mileage_x, y=avg_price, mode='lines', name="Average Price (Line)", yaxis="y2"), row=1, col=2)

# Update the layout if needed
fig.update_layout(
    title_text="Mileage and Price Histograms",
    xaxis=dict(title="Mileage", domain=[0, 0.4]),
    yaxis=dict(title="Sum of Counts"),
    xaxis2=dict(title="Mileage", domain=[0.6, 0.9]),
    yaxis2=dict(title="Average Price", side="right"),
    xaxis3=dict(title="Mileage", domain=[0.95, 1.0]),
    yaxis3=dict(title="Average Price (Line)", side="right"),
)


fig.show()

I just want to draw a line (from a scatter trace) with 20 data points, each placed at the middle of a histogram bin and showing the average of the prices in that bin. For example, in the pictures shown above, I want a data point at 48.81B/8186 = 5.963 million. Instead, the current code adds far more points than that: price_histogram_trace.data[0]['x'] holds about 70k values while mileage_histogram_trace.data[0]['x'] holds about 7k, so the loop appends a mean price for each distinct mileage value in the DataFrame rather than one per bin.

Answer by russhoppa

If you want to plot an average line for binned data on top of the histogram then you'll have to calculate the average for each bin. Here's an example:

import plotly.graph_objects as go
import numpy as np

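# Synthetic data: 100 evenly spaced x values and a smooth y curve
m = 200000
x = np.linspace(1, m, 100, dtype=int)
y = np.sin((x+m)/(m/2))*m + m
n_bins = 20
# Split the sorted data into n_bins equal-sized chunks and average each chunk
chunk_size = len(y)//n_bins
y_avg = [sum(y[i*chunk_size:(i*chunk_size)+chunk_size])/chunk_size for i in range(n_bins)]

fig = go.Figure(data=[
    # Bars show the sum of y in each bin
    go.Histogram(x=x, y=y, nbinsx=n_bins, histfunc='sum', name='histogram'),
    # Line shows the per-chunk average, placed roughly at the centre of each bin
    go.Scatter(x=x[::len(x)//n_bins]+x[chunk_size]/2, y=y_avg, name='average of data')
])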

fig.show()

[Image: the resulting Plotly graph]
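
The chunking in the example above relies on the x values being evenly spaced. For real data such as the question's mileage and price columns, one possible approach (a sketch, not the answerer's code) is to fix the bin edges explicitly and divide the per-bin sum by the per-bin count. The helper name `bin_means` and the choice of equal-width bins spanning the data range are assumptions made for illustration.

```python
import numpy as np

def bin_means(mileage, price, n_bins=20):
    """Return (bin centres, mean price per bin, bin edges) for equal-width mileage bins."""
    mileage = np.asarray(mileage, dtype=float)
    price = np.asarray(price, dtype=float)

    # Assumption: equal-width bins covering the full mileage range
    edges = np.linspace(mileage.min(), mileage.max(), n_bins + 1)

    # Number of cars and sum of prices in each bin
    counts, _ = np.histogram(mileage, bins=edges)
    sums, _ = np.histogram(mileage, bins=edges, weights=price)

    # Mean = sum / count; empty bins stay NaN so they show up as gaps in a line plot
    means = np.full(n_bins, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)

    # Midpoint of each bin, used as the x position of the data point
    centres = (edges[:-1] + edges[1:]) / 2
    return centres, means, edges
```

With the question's data this might be called as `centres, means, edges = bin_means(df["mileage"], df["price"])`, and the line drawn with `go.Scatter(x=centres, y=means)`. Note that `nbinsx` in `go.Histogram` is only an upper limit on the number of bins, so the histogram's own bins may not match these edges unless `xbins` (`start`, `end`, `size`) is set as well.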

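Keep in mind, though, that the per-bin average can be very small compared with the bar heights, depending on how many data points fall in each bin and which aggregation function (histfunc) the histogram uses. With histfunc='sum' and thousands of cars per bin, as in the question, the bars are thousands of times taller than the averages, so on a shared y-axis the line may look almost flat.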