I'm probably making a stupid mistake, but here's how to reproduce:
- Generate random samples from a log-normal distribution
- Fit a log-normal distribution to the synthetic data
- Compute the probability density function (PDF) using the fitted parameters
- Plot histogram of synthetic variables overlaid with the PDF
- They don't match!
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import lognorm
mu = 25
samples = lognorm.rvs(s=1, loc=0, scale=np.log(mu), size=10000)
shape, loc, scale = lognorm.fit(samples)
print(shape, loc, scale)
fig, ax = plt.subplots()
sns.histplot(samples, bins=50, stat="density", log_scale=True, ax=ax)
xs = np.linspace(0.1, 100, 10000)
ys = lognorm.pdf(xs, s=shape, loc=loc, scale=scale)
ax.plot(xs, ys, "r-")
I don't think the problem is with scipy.stats. Plotting with matplotlib, I see good agreement.

You can also plot the histogram of the log of the sample against the corresponding normal distribution with Seaborn.
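The agreement can also be checked numerically, without any plotting library. The following is a minimal sketch rather than the code behind the plot above: it compares the fraction of samples in each bin with the fraction the fitted distribution predicts. The fixed seed, the 50 log-spaced bins, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import lognorm

# Same setup as the question; the seed is an assumption, added for reproducibility.
samples = lognorm.rvs(s=1, loc=0, scale=np.log(25), size=10000, random_state=0)
shape, loc, scale = lognorm.fit(samples)

# 50 log-spaced bins spanning the sample range.
edges = np.logspace(np.log10(samples.min()), np.log10(samples.max()), 51)

# Fraction of samples observed in each bin.
counts, _ = np.histogram(samples, bins=edges)
observed = counts / samples.size

# Fraction predicted by the fitted distribution: CDF difference across each bin.
expected = np.diff(lognorm.cdf(edges, s=shape, loc=loc, scale=scale))

max_gap = np.max(np.abs(observed - expected))
print(f"largest per-bin discrepancy: {max_gap:.4f}")
```

A small discrepancy here would support the view that the fit itself is fine and the mismatch comes from how the histogram is normalized.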
I suspect there's something about the interaction between 'density' and log_scale that is not correct, possibly in our understanding of seaborn.

Update: see https://github.com/mwaskom/seaborn/issues/3579 for an explanation of what is going on. Apparently the density normalization is performed on the log-transformed data, even though the data is displayed with the original magnitudes on a log-scaled axis.
If the shape of the seaborn histogram were preserved but the log-scaled axis were replaced by a linear-scaled axis with log-transformed labels, then the area under the curve would be 1.0.
In other words, it can be thought of as a histogram of the log-transformed data but displayed with the original data magnitudes on a log-scaled axis.
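If that explanation holds, the curve that matches such a histogram is the density of the log-transformed variable, not the original PDF. Here is a rough sketch of the change of variables, assuming the transform is a base-10 logarithm (which I believe is what log_scale=True uses, but treat that as an assumption). For u = log10(x), the density is f_U(u) = f_X(x) * x * ln(10). The seed and variable names are illustrative.

```python
import numpy as np
from scipy.stats import lognorm

s, scale = 1.0, np.log(25)  # parameters from the question, loc left at 0
samples = lognorm.rvs(s=s, scale=scale, size=10000, random_state=0)

# Histogram normalized in log10 space, as the linked issue describes.
u = np.log10(samples)
heights, edges = np.histogram(u, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Change of variables u = log10(x): f_U(u) = f_X(x) * x * ln(10).
x = 10.0 ** centers
log_space_pdf = lognorm.pdf(x, s=s, scale=scale) * x * np.log(10)
height_gap = np.max(np.abs(heights - log_space_pdf))

# The transformed curve should integrate to about 1 over u (trapezoidal rule).
grid = np.linspace(-3.0, 4.0, 20001)
curve = lognorm.pdf(10.0 ** grid, s=s, scale=scale) * 10.0 ** grid * np.log(10)
area = np.sum(0.5 * (curve[1:] + curve[:-1]) * np.diff(grid))

print(f"area under transformed curve: {area:.4f}")
print(f"largest height discrepancy: {height_gap:.3f}")
```

Under the same base-10 assumption, overlaying lognorm.pdf(xs, s=shape, loc=loc, scale=scale) * xs * np.log(10) on the seaborn axes, instead of the raw PDF, should line up with the histogram from the question.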
This doesn't account for the problem, but I noticed that you defined mu=25 and passed scale=np.log(mu) to lognorm. Double-check against the documentation of lognorm that this is what you mean to do.
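For reference, in scipy's parameterization, if log(X) is normal with mean mu and standard deviation sigma, then s = sigma and scale = exp(mu), so with loc=0 the scale is the median of X. A short sketch of what each choice implies follows; which one was intended is a guess, and the variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import lognorm

mu = 25

# As written in the question: the median of the samples is log(25), about 3.22.
as_written = lognorm(s=1, loc=0, scale=np.log(mu))
print(as_written.median())

# If 25 was meant as the median of the samples, pass it directly as scale.
median_25 = lognorm(s=1, loc=0, scale=mu)
print(median_25.median())

# If 25 was meant as the mean of log(samples), scale would be exp(25).
log_mean_25 = lognorm(s=1, loc=0, scale=np.exp(mu))
print(np.log(log_mean_25.median()))
```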