I am very unexperienced with statistics, so I am looking to you for guidance. I have a dataset, and the dataset is fully ranked. It includes category and numerical columns. I am now trying to explore, if rows, that fall high in the ranking, have values in the category columns, that do not appear often.
So I tried calculating the correlation between category sizes and ranks of the rows. But then I thought it might be more reasonable, to correlate it with the average rank of the category, and I tried it aswell. I got results like these:
Results for AEOutput_with_input.csv:
+--------------------+-----------------------+---------------------------+
| Categorical Column | Correlation Size-Rank | Correlation Size-Avg_Rank |
+--------------------+-----------------------+---------------------------+
| MONTH | 0.001 | 0.007 |
| DAY | 0.034 | 0.246 |
| WEEKDAY | 0.085 | 0.985 |
| PEH | 0.488 | 1.000 |
+--------------------+-----------------------+---------------------------+
Results for AEOutput_with_original.csv:
...
+--------------------+-----------------------+---------------------------+
| NIELSEN_BEZ | 0.344 | 0.483 |
| ARTIKEL_BEZ | 0.082 | 0.093 |
+--------------------+-----------------------+---------------------------+
I was surprised, by how much the correlation differs for example for WEEKDAY. So now my questions: Was it correct of me to assume that correlating with average rank in this case is more meaningfull? How can I interpret the big discrepencies like for WEEKDAY, while for other colums the values are far closer together? Might the descrepencies be a sign of me making a programming error? The code I used is the following.
categorical_columns = [['MONTH', 'DAY', 'WEEKDAY', 'PEH'],['NIELSEN_BEZ', 'ARTIKEL_BEZ']]
rank_column = 'RANK'
def calculate_correlations(df, categorical_column, rank_column):
category_stats = df.groupby(categorical_column).agg(Category_Size=('RANK', 'size'),
Category_Avg_Rank=('RANK', 'mean')).reset_index()
df_merged = pd.merge(df, category_stats, on=categorical_column)
correlation_size_rank = df_merged['Category_Size'].corr(df_merged[rank_column])
correlation_size_avg_rank = df_merged['Category_Size'].corr(df_merged['Category_Avg_Rank'])
return correlation_size_rank, correlation_size_avg_rank
for i, (filename, df) in enumerate(file_df_tuples):
# Create a PrettyTable for the results
result_table = PrettyTable()
result_table.field_names = ["Categorical Column", "Correlation Size-Rank", "Correlation Size-Avg_Rank"]
print(f"Results for {filename}:")
for categorical_column in categorical_columns[i]:
correlation_size_rank, correlation_size_avg_rank = calculate_correlations(df, categorical_column, rank_column)
result_table.add_row([f"{categorical_column}", f"{correlation_size_rank:.3f}", f"{correlation_size_avg_rank:.3f}"])
print(result_table)
# Save the table to a txt file
output_file_path = os.path.join(output_folder, f'correlation_results_{filename}.txt')
with open(output_file_path, 'w') as output_file:
output_file.write(str(result_table))
Thank you for your time.