Is it better to correlate group size in a category with individual ranks or with the average ranking of the group?

22 Views Asked by At

I am very unexperienced with statistics, so I am looking to you for guidance. I have a dataset, and the dataset is fully ranked. It includes category and numerical columns. I am now trying to explore, if rows, that fall high in the ranking, have values in the category columns, that do not appear often.

So I tried calculating the correlation between category sizes and ranks of the rows. But then I thought it might be more reasonable, to correlate it with the average rank of the category, and I tried it aswell. I got results like these:

Results for AEOutput_with_input.csv:
+--------------------+-----------------------+---------------------------+
| Categorical Column | Correlation Size-Rank | Correlation Size-Avg_Rank |
+--------------------+-----------------------+---------------------------+
|       MONTH        |         0.001         |           0.007           |
|        DAY         |         0.034         |           0.246           |
|      WEEKDAY       |         0.085         |           0.985           |
|        PEH         |         0.488         |           1.000           |
+--------------------+-----------------------+---------------------------+
Results for AEOutput_with_original.csv:
...
+--------------------+-----------------------+---------------------------+
|    NIELSEN_BEZ     |         0.344         |           0.483           |
|    ARTIKEL_BEZ     |         0.082         |           0.093           |
+--------------------+-----------------------+---------------------------+

I was surprised, by how much the correlation differs for example for WEEKDAY. So now my questions: Was it correct of me to assume that correlating with average rank in this case is more meaningfull? How can I interpret the big discrepencies like for WEEKDAY, while for other colums the values are far closer together? Might the descrepencies be a sign of me making a programming error? The code I used is the following.

categorical_columns = [['MONTH', 'DAY', 'WEEKDAY', 'PEH'],['NIELSEN_BEZ', 'ARTIKEL_BEZ']]
rank_column = 'RANK'  

def calculate_correlations(df, categorical_column, rank_column):
    category_stats = df.groupby(categorical_column).agg(Category_Size=('RANK', 'size'),
                                                        Category_Avg_Rank=('RANK', 'mean')).reset_index()

    df_merged = pd.merge(df, category_stats, on=categorical_column)

    correlation_size_rank = df_merged['Category_Size'].corr(df_merged[rank_column])
    correlation_size_avg_rank = df_merged['Category_Size'].corr(df_merged['Category_Avg_Rank'])

    return correlation_size_rank, correlation_size_avg_rank

for i, (filename, df) in enumerate(file_df_tuples):
    # Create a PrettyTable for the results
    result_table = PrettyTable()
    result_table.field_names = ["Categorical Column", "Correlation Size-Rank", "Correlation Size-Avg_Rank"]

    print(f"Results for {filename}:")

    for categorical_column in categorical_columns[i]:
        correlation_size_rank, correlation_size_avg_rank = calculate_correlations(df, categorical_column, rank_column)
        result_table.add_row([f"{categorical_column}", f"{correlation_size_rank:.3f}", f"{correlation_size_avg_rank:.3f}"])

    print(result_table)

    # Save the table to a txt file
    output_file_path = os.path.join(output_folder, f'correlation_results_{filename}.txt')
    with open(output_file_path, 'w') as output_file:
        output_file.write(str(result_table))

Thank you for your time.

0

There are 0 best solutions below