I have a dataset of experimental data (y_observed, t), where y_observed represents the measured values and t the time in seconds since the start of the measurement. I fit a Gaussian to this data, but I want to assess whether fitting a Gaussian makes sense or whether the observed experimental data is non-Gaussian.
I've tried to implement the two-sample Kolmogorov-Smirnov goodness-of-fit test in Python, but I'm not sure whether I'm implementing and interpreting this test correctly. A small example can be found below: y_observed does not look like a Gaussian (see attached figure), but according to the two-sample K-S test y_observed does follow a Gaussian distribution. Can someone please explain to me what I'm doing wrong? Thanks in advance!
from scipy.stats import ks_2samp
import numpy as np
y_observed = [
1.93291, 9.84869, 29.5713, 20.6346, -1.51537,
0.396292, 50.8895, 68.855, 63.9291, 55.6863,
37.6503, 46.87, 33.6637, 25.0395, 18.3027,
38.0947, 27.9305, 10.2012, -0.704519, 18.6656,
20.2873, 8.78955, 4.39672
]
t = [
519, 522, 525, 528, 531,
534, 537, 540, 543, 546,
549, 552, 555, 558, 561,
564, 567, 570, 573, 576,
579, 582, 585
]
y_fit = [
5.60417, 8.74951, 12.9907, 18.3426, 24.63,
31.4518, 38.1947, 44.1102, 48.4454, 50.5991,
50.2586, 47.4739, 42.6458, 36.4314, 29.5973,
22.8668, 16.8011, 11.7394, 7.80065, 4.92939,
2.96233, 1.69298, 0.920122
]
# Binned empirical CDF of the observed data
count, bins_count = np.histogram(y_observed, bins=5)
pdf = count / sum(count)
cdf = np.cumsum(pdf)

# Binned empirical CDF of the fitted values
count_2, bins_count_2 = np.histogram(y_fit, bins=5)
pdf_2 = count_2 / sum(count_2)
cdf_2 = np.cumsum(pdf_2)
KS_stat, P_val = ks_2samp(cdf, cdf_2)

# Critical value for the two-sample KS test at alpha = 0.05
n_1, n_2 = len(y_observed), len(y_fit)
critical_val = np.sqrt((n_1 + n_2) / (n_1 * n_2)) * 1.36

if P_val < 0.05 and KS_stat > critical_val:
    print("Observed data does not follow a Gaussian distribution")
else:
    print("Observed data follows a Gaussian distribution")

scipy.stats.ks_2samp accepts two samples and performs a test of the null hypothesis that the two samples were drawn from the same (unknown) continuous distribution. So the first problem is that the code is not passing in two samples - it appears to be passing in coarse-grained empirical CDFs of two arrays, one of which is your sample. (I don't know what y_fit is.) If you wanted to assess whether y_observed and y_fit were drawn from the same distribution (which doesn't look meaningful if you're after whether y_observed is drawn from a normal distribution), you would simply call:
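from scipy.stats import ks_2samp

# Pass the raw samples themselves, not binned CDFs.
res = ks_2samp(y_observed, y_fit)
print(res.statistic, res.pvalue)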
The second problem is that this would not be the appropriate test to assess whether the data is non-Gaussian. If it must be a KS test, you would want the one-sample KS test, scipy.stats.kstest. A minimal sketch:
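from scipy.stats import kstest

# One-sample KS test of y_observed against a normal distribution.
res = kstest(y_observed, 'norm')
print(res.statistic, res.pvalue)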
The problem is that, as written, this tests the null hypothesis that the data were drawn from the standard normal (loc=0, scale=1). If you know the parameters loc and scale of the hypothesized normal distribution, you can pass them in:
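from scipy.stats import kstest

# loc and scale are hypothetical placeholders here: they must be known
# ahead of time, not fitted to y_observed.
loc, scale = 0.0, 1.0
res = kstest(y_observed, 'norm', args=(loc, scale))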
If you don't know the parameters (even if you can fit them to the data), then this test is not appropriate. There are several other normality tests in SciPy that you could use instead. All you need to pass them is the sample to be tested.
scipy.stats.shapiro is often preferred; a minimal sketch follows at the end of this answer. scipy.stats.anderson would be another good option, but it does not produce a p-value (it only returns a table of critical values), so it's not as easy to use. scipy.stats.goodness_of_fit is the most flexible (and could produce a p-value for the Anderson-Darling test), but you don't need it for a normality test.

Note that whatever you choose, a frequentist null hypothesis test cannot possibly conclude that the data were drawn from a Gaussian distribution. It can only provide evidence that the data were not drawn from a Gaussian distribution; otherwise it is inconclusive.

A minimal sketch with shapiro:
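from scipy.stats import shapiro

# Null hypothesis: y_observed was drawn from some normal distribution
# (no parameters need to be specified).
stat, p_val = shapiro(y_observed)
if p_val < 0.05:
    print("Evidence that the data were not drawn from a Gaussian distribution")
else:
    print("Inconclusive")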