Suppose A and B are two datasets, each with perhaps 100 features. How do I perform hypothesis testing on these independent datasets to check whether they differ significantly?
I tried writing code in Python. I have preprocessed both datasets and tried Student's t-test, given that the columns are normalized. The datasets are tabular with continuous values, and I have one-hot encoded the categorical features. I ran a t-test on one numerical column from both datasets, but I can't figure out how to apply it to the entire dataset. I used the scipy.stats library.
The Kolmogorov-Smirnov (KS) test is a non-parametric statistical test that can be used to determine whether two samples come from the same distribution.
One approach is to run a KS test on each feature (column) of datasets A and B to check whether the two samples come from the same distribution, using the scipy.stats.ks_2samp() function.
Consider an example with a couple of 2-column datasets, A and B. The first feature (column) of A and B is sampled from the same (standard normal) distribution, but the second feature is sampled from different (normal) distributions (with different parameters).
Plotting histograms of the features for the two datasets makes the difference visible.
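A minimal sketch of such a two-column setup, assuming a sample size of 1000, an arbitrary seed, and N(2, 3) as the "different" distribution for the second feature of B (none of these values come from the original example). To avoid depending on a plotting library, it prints histogram counts over shared bins; the same counts could be passed to any plotting tool.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 1000                        # assumed sample size

# Column 0: standard normal in both datasets.
# Column 1: standard normal in A, a different normal (mean 2, sd 3) in B.
A = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, n)])
B = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(2.0, 3.0, n)])

for j in range(A.shape[1]):
    # Use the same bin edges for both datasets so the counts are comparable.
    lo = min(A[:, j].min(), B[:, j].min())
    hi = max(A[:, j].max(), B[:, j].max())
    bins = np.linspace(lo, hi, 11)
    counts_a, _ = np.histogram(A[:, j], bins=bins)
    counts_b, _ = np.histogram(B[:, j], bins=bins)
    print(f"feature {j}: A counts {counts_a.tolist()}")
    print(f"feature {j}: B counts {counts_b.tolist()}")
```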
Clearly, the second feature is highly likely to have been drawn from different distributions. Let's verify with the KS test.
Using the KS test, we cannot reject the null hypothesis (at the 5% level of significance) that the first feature of datasets A and B came from the same distribution, since the p-value is high (0.368 > 0.05), but we can reject the null hypothesis that the second feature of A and B came from the same distribution, since the p-value is almost 0.
You can use the same approach on your 100-column datasets by comparing them pairwise, column by column.
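The column-by-column comparison could be sketched as follows. The helper name compare_columns, the synthetic 100-column data, and the 0.05 threshold are illustrative assumptions; only scipy.stats.ks_2samp is the library call referred to above.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_columns(A, B, alpha=0.05):
    """Run a two-sample KS test on each pair of matching columns.

    Hypothetical helper: returns a list of
    (column index, KS statistic, p-value, reject_null) tuples.
    """
    A, B = np.asarray(A), np.asarray(B)
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    results = []
    for j in range(A.shape[1]):
        res = ks_2samp(A[:, j], B[:, j])
        results.append((j, res.statistic, res.pvalue, res.pvalue < alpha))
    return results

# Synthetic stand-in for the real data: 100 columns, where the last 50
# columns of B are drawn from a different normal distribution.
rng = np.random.default_rng(42)
A = rng.normal(0.0, 1.0, size=(500, 100))
B = rng.normal(0.0, 1.0, size=(500, 100))
B[:, 50:] = rng.normal(1.5, 2.0, size=(500, 50))

results = compare_columns(A, B)
flagged = [j for j, _, _, reject in results if reject]
print(f"{len(flagged)} of {len(results)} columns differ at the 5% level")
```

The two datasets need the same number of columns but not the same number of rows. With 100 separate tests at the 5% level, a handful of columns may be flagged by chance even when the underlying distributions are identical, so isolated rejections are worth a second look.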