Optimizing Code for Computing Products from Correlation Matrix

I have Python code that computes products based on combinations of keys from a correlation matrix. It works well when the dataframe has a small number of columns (e.g., fewer than 95), but performance degrades significantly as the column count grows (e.g., more than 95 columns). Even for small datasets I struggle to compute products over more than 4 keys. I suspect there is room for improvement in both time complexity and memory efficiency. Below is the code:

import numpy as np
import pandas as pd
from itertools import combinations

# synthetic data generated
# Set seed for reproducibility
np.random.seed(42)

# Generate column names
column_names = ['test_' + str(i) for i in range(1, 1195)]

# Generate row names
row_names = [f'ROW_{i}' for i in range(0, 151)]

# Create a DataFrame with random integers between 0 and 15
data = np.random.randint(0, 16, size=(len(row_names), len(column_names)))
df = pd.DataFrame(data, index=row_names, columns=column_names)


correlation_matrix = df.corr()

def compute_products(correlation_matrix):
    out = {}

    keys = correlation_matrix.index

    for r in range(2, 5):  # combinations of 2, 3 and 4 keys
        for combo in combinations(keys, r):
            # Product of the squared correlations over every pair within the combination
            prod = 1
            for i in range(len(combo)):
                for j in range(i + 1, len(combo)):
                    prod *= correlation_matrix.loc[combo[i], combo[j]] ** 2
            out[str(combo)] = {
                'names': list(combo),
                'prod': prod
            }

    return out

bb = compute_products(correlation_matrix)
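
For scale: with 2-, 3- and 4-key combinations the number of dictionary entries grows roughly as n^4 / 24 in the number of columns n, which I believe is the main reason the code falls over once n goes past ~95. A quick way to see the counts for the synthetic data above (math.comb is from the standard library):

from math import comb

n_cols = len(column_names)  # 1194 columns in the synthetic data
for r in range(2, 5):
    print(r, comb(n_cols, r))
print('total:', sum(comb(n_cols, r) for r in range(2, 5)))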

Specific Questions:

  • What optimisations can be applied to improve the time complexity and memory efficiency of the code, especially the compute_products function?

  • Are there alternative approaches or algorithms that achieve the same results with better scalability? (A rough sketch of the kind of vectorisation I have in mind is below.)
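
To make that second question concrete, here is a minimal, untested sketch of the direction I suspect an answer might take, shown only for the 2-key case (corr, cols, i, j and pairs_out are just illustrative names, not part of my real code, and it relies on the imports above). For a pair the product is simply that pair's squared correlation, so all of them can be read straight off the upper triangle of the matrix instead of looping:

corr = correlation_matrix.to_numpy()
cols = correlation_matrix.columns.tolist()

# Row/column indices of every unordered pair (upper triangle, excluding the diagonal)
i, j = np.triu_indices(len(cols), k=1)
pair_prods = corr[i, j] ** 2  # squared correlation per pair, in one vectorised step

pairs_out = {
    str((cols[a], cols[b])): {'names': [cols[a], cols[b]], 'prod': p}
    for a, b, p in zip(i, j, pair_prods)
}

I have not worked out how (or whether) this extends cleanly to 3- and 4-key combinations, which is really what I am asking.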

Additional Information:

  • I am using Python with pandas and numpy. I'm most comfortable with Python, but I don't mind answers in another language.

  • The code and a brief explanation of its purpose are provided above.

  • The generated dataset is 151 rows by more than 100 columns, but the size varies depending on the problem I'm working on.

I would appreciate any insights, suggestions, or improvements that can be made to enhance the efficiency of this code.
