Different results of SPSS and Python KS-test to assess normality

47 Views Asked by fffff At 14 February 2024 at 09:22

Suppose that I have a series of data:

age;height
8;120
8;123
8;130
8;125
10;160
9;158
8;120
7;126
6;98
5;97
7;115
7;120
7;118
8;117
6;97
6;99
9;123
10;157
10;155
9;155
9;153
5;96
7;115
6;94
6;94
5;87
8;117
6;96
5;97
6;91
6;88
9;149
6;94
8;117
10;156
10;160
6;90
6;90
7;116
5;89
6;90
7;118
10;162

I would like to assess normality with the Kolmogorov-Smirnov test in both SPSS and Python. SPSS gives:

variables	statistics	sig
age	0.190	0.000
height	0.173	0.002

I tried to compare using Python with this code:

import numpy as np
import pandas as pd
from scipy.stats import kstest
from scipy.stats import norm
data = pd.DataFrame([[8, 120], [8, 123], [8, 130], [8, 125], [10, 160], [9, 158], [8, 120], [7, 126], [6, 98], [5, 97], [7, 115], [7, 120], [7, 118], [8, 117], [6, 97], [6, 99], [9, 123], [10, 157], [10, 155], [9, 155], [9, 153], [5, 96], [7, 115], [6, 94], [6, 94], [5, 87], [8, 117], [6, 96], [5, 97], [6, 91], [6, 88], [9, 149], [6, 94], [8, 117], [10, 156], [10, 160], [6, 90], [6, 90], [7, 116], [5, 89], [6, 90], [7, 118], [10, 162]], columns=['age', 'height'])
x = np.log(data.age)
n = norm(loc=0,scale=1)
kstest(x, n.cdf)

which gives:

KstestResult(statistic=0.9462396895483368, pvalue=5.139087762288979e-55)

Even if I don't log-transform the data, the result is still different:

kstest(data.age, n.cdf)

which gives:

KstestResult(statistic=0.9999997133484281, pvalue=9.27397852188504e-282)


There is 1 best solution below

Matt Haberland On 14 February 2024 at 15:38

The SciPy calculation is correct given your input: the KS-test statistic is the maximum difference between the empirical CDF and the provided CDF evaluated at the data.

import numpy as np
from scipy import stats

dist = stats.norm(loc=0, scale=1)

data = np.asarray([[8, 120], [8, 123], [8, 130], [8, 125], [10, 160], [9, 158], [8, 120], [7, 126], [6, 98], [5, 97], [7, 115], [7, 120], [7, 118], [8, 117], [6, 97], [6, 99], [9, 123], [10, 157], [10, 155], [9, 155], [9, 153], [5, 96], [7, 115], [6, 94], [6, 94], [5, 87], [8, 117], [6, 96], [5, 97], [6, 91], [6, 88], [9, 149], [6, 94], [8, 117], [10, 156], [10, 160], [6, 90], [6, 90], [7, 116], [5, 89], [6, 90], [7, 118], [10, 162]])
logage = np.log(data[:, 0])

x = np.sort(logage)
cdfvals = dist.cdf(x)
n = len(cdfvals)
dminus = (cdfvals - np.arange(0.0, n)/n)
dminus.max() # 0.9462396895483368
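The snippet above computes only the one-sided distance where the reference CDF lies above the empirical CDF. As a minimal sketch (the names `d_plus`, `d_minus`, and `d` are illustrative, not part of SciPy's API), the two-sided statistic can be taken as the larger of the two one-sided distances and cross-checked against `stats.kstest`:

```python
import numpy as np
from scipy import stats

# Age column from the question
age = np.array([8, 8, 8, 8, 10, 9, 8, 7, 6, 5, 7, 7, 7, 8, 6, 6, 9, 10, 10,
                9, 9, 5, 7, 6, 6, 5, 8, 6, 5, 6, 6, 9, 6, 8, 10, 10, 6, 6, 7,
                5, 6, 7, 10], dtype=float)

x = np.sort(np.log(age))
n = len(x)
dist = stats.norm(loc=0, scale=1)  # standard normal, as in the question
cdfvals = dist.cdf(x)

# Empirical CDF above the reference CDF
d_plus = (np.arange(1.0, n + 1) / n - cdfvals).max()
# Reference CDF above the empirical CDF
d_minus = (cdfvals - np.arange(0.0, n) / n).max()
# Two-sided KS statistic
d = max(d_plus, d_minus)

ks = stats.kstest(x, dist.cdf)
print(d, ks.statistic)
```

Here the log-ages are all far to the right of a standard normal, so `d_minus` dominates and the two printed values should agree.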

The SPSS code is not provided, so I cannot say for certain why the results differ. Perhaps SPSS is not testing the null hypothesis that the data follow the standard normal distribution, a hypothesis that is clearly false for these data. Instead, it may be performing Lilliefors' test, which uses the KS statistic to test the null hypothesis that the data follow a normal distribution whose parameters loc and scale are unknown and estimated from the sample.

res = stats.goodness_of_fit(stats.norm, logage, statistic='ks')
res.statistic  # 0.1821555634826541
res.pvalue  # 0.001
# p-value computed using Monte Carlo simulation, so results may vary.
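To check the Lilliefors hypothesis against the SPSS table, one option is to compute the KS statistic for the untransformed age column against a normal distribution whose mean and standard deviation are estimated from the sample. This is only a sketch: the use of the sample standard deviation (`ddof=1`) is an assumption about what SPSS does, and the names `mu`, `sigma`, and `ks_stat` are illustrative.

```python
import numpy as np
from scipy import stats

# Age column from the question, without the log transform
age = np.array([8, 8, 8, 8, 10, 9, 8, 7, 6, 5, 7, 7, 7, 8, 6, 6, 9, 10, 10,
                9, 9, 5, 7, 6, 6, 5, 8, 6, 5, 6, 6, 9, 6, 8, 10, 10, 6, 6, 7,
                5, 6, 7, 10], dtype=float)

# Estimate the normal parameters from the data
# (assumption: sample standard deviation, ddof=1)
mu = age.mean()
sigma = age.std(ddof=1)

# KS distance between the data and the fitted normal
fitted = stats.norm(loc=mu, scale=sigma)
ks_stat = stats.kstest(age, fitted.cdf).statistic
print(ks_stat)
```

The statistic should come out at roughly 0.19, in line with the 0.190 that SPSS reports for age. The p-value returned by `kstest` in this setup should not be used, because it does not account for the parameters having been estimated from the same data; that correction is what Lilliefors' test, or the Monte Carlo approach in `goodness_of_fit`, provides.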

If you want to perform such a test, there are many more powerful options available. Consider the Shapiro-Wilk test.
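A minimal sketch of the Shapiro-Wilk suggestion, applied to the untransformed age column with `scipy.stats.shapiro` (the variable name `sw` is illustrative):

```python
import numpy as np
from scipy import stats

# Age column from the question
age = np.array([8, 8, 8, 8, 10, 9, 8, 7, 6, 5, 7, 7, 7, 8, 6, 6, 9, 10, 10,
                9, 9, 5, 7, 6, 6, 5, 8, 6, 5, 6, 6, 9, 6, 8, 10, 10, 6, 6, 7,
                5, 6, 7, 10], dtype=float)

# Shapiro-Wilk tests the null hypothesis that the sample comes from
# a normal distribution with unspecified mean and variance
sw = stats.shapiro(age)
print(sw.statistic, sw.pvalue)
```

A small p-value is evidence against normality. Note that ages recorded as whole years are heavily tied, which by itself pulls the sample away from any continuous distribution.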
