Python code is taking too long for conversion of adjacency list to matrix and vice versa


I am working with the Reddit dataset to train a graph ML model, and I need to create a training adjacency matrix from the full-graph adjacency list that comes with the dataset. The process involves converting the adjacency list to an adjacency matrix, filtering it with the training mask, and then converting the filtered matrix back to an adjacency list (the downloaded data provides the adjacency list, not the matrix). My current implementation works, but it takes an excessive amount of time to run. I am looking for advice on how to optimize this part of my code.

from torch_geometric.datasets import Reddit
import torch
from torch_geometric.data import Data
import numpy as np
from scipy.sparse import lil_matrix


Reddit_data = Reddit(root = 'Reddit') # Download the dataset
Reddit_data_object = Reddit_data[0]

adj = Reddit_data_object.edge_index  # adjacency list as an edge index of shape [2, num_edges]
features = Reddit_data_object.x  # node features

train_mask = Reddit_data_object.train_mask
test_mask = Reddit_data_object.test_mask
train_features = features[train_mask]
y_train = Reddit_data_object.y[train_mask]
test_features = features[test_mask]
y_test = Reddit_data_object.y[test_mask]
test_index = torch.arange(Reddit_data_object.num_nodes)[test_mask]

true_indices = torch.nonzero(train_mask).squeeze()
num_nodes = len(Reddit_data_object.y)

Create matrix:

#adjacency_matrix = np.zeros((num_nodes, num_nodes), dtype=int)
adjacency_matrix = lil_matrix((num_nodes, num_nodes), dtype=np.uint)
adj_array = adj.numpy() # Convert tensor adj to array format

# Add connections to the adjacency matrix, one edge at a time
for i in range(adj_array.shape[1]):
    source_node = int(adj_array[0, i])
    target_node = int(adj_array[1, i])
    adjacency_matrix[source_node, target_node] = 1
    if i % 5000000 == 0:  # To check the progress speed
        print(i)
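
One possible way to avoid the per-edge Python loop above is to hand the whole edge list to SciPy in a single call. This is only a sketch on a tiny made-up graph: `edge_index_to_csr` and `toy_edges` are hypothetical names, and `toy_edges` stands in for `adj.numpy()`.

```python
import numpy as np
from scipy.sparse import coo_matrix

def edge_index_to_csr(edge_index, num_nodes):
    """Build a sparse adjacency matrix from a (2, num_edges) integer array."""
    rows, cols = edge_index[0], edge_index[1]
    data = np.ones(rows.shape[0], dtype=np.uint8)
    matrix = coo_matrix((data, (rows, cols)), shape=(num_nodes, num_nodes)).tocsr()
    # Duplicate edges are summed during the conversion to CSR, so reset
    # every stored value to 1 to match what the loop version produces.
    matrix.data[:] = 1
    return matrix

# Tiny hypothetical graph (note the duplicated edge 2 -> 3)
toy_edges = np.array([[0, 1, 2, 2],
                      [1, 0, 3, 3]])
toy_matrix = edge_index_to_csr(toy_edges, num_nodes=4)
print(toy_matrix.toarray())
```

CSR is chosen here because row slicing, which the next step relies on, is generally much cheaper on CSR than on LIL.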

Filter the adjacency matrix, keeping only the rows selected by train_mask (only these nodes will be used for training):

adj_matrix = adjacency_matrix[train_mask] 

Convert the filtered adjacency matrix back to an adjacency list:

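source_node = []
target_node = []
# Visit every cell of the filtered matrix and record the non-zero ones
for i in range(adj_matrix.shape[0]):
    for j in range(adj_matrix.shape[1]):
        if adj_matrix[i, j] == 1:
            source_node.append(i)
            target_node.append(j)
    if i % 100 == 0:  # To check the progress speed
        print(i)

# Collect the result in the same [sources, targets] layout as edge_index
adj_list = [source_node, target_node]
print(adj_list)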
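
The nested loop above touches every cell of the filtered matrix, including all the zeros. A sparse matrix already stores the coordinates of its non-zero entries, so a sketch like the following could read them out directly. `masked_rows_to_edge_list` and the toy data are hypothetical, and the function assumes a SciPy sparse matrix in CSR format.

```python
import numpy as np
from scipy.sparse import csr_matrix

def masked_rows_to_edge_list(matrix, mask):
    """Keep the rows where mask is True and return (sources, targets) arrays.

    Sources are renumbered 0..k-1 in the order of the selected rows,
    exactly as row slicing does; targets keep their original node ids.
    """
    selected = matrix[np.flatnonzero(mask)].tocoo()
    return selected.row, selected.col

# Tiny hypothetical adjacency matrix and training mask
toy_matrix = csr_matrix(np.array([[0, 1, 0, 0],
                                  [1, 0, 0, 0],
                                  [0, 0, 0, 1],
                                  [0, 0, 0, 0]]))
toy_mask = np.array([True, False, True, False])
sources, targets = masked_rows_to_edge_list(toy_matrix, toy_mask)
print(np.stack([sources, targets]))
```

With real data the mask would be something like `train_mask.numpy()`, assuming the mask is a boolean tensor on the CPU.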

To train my model, I need adj, features, adj_train, train_features, y_train, y_test, and test_index. I can extract all of these from Reddit_data_object except adj_train.
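
Since adj_train is the only missing piece, it may also be possible to skip the matrix entirely and filter the edge list itself. The sketch below reproduces what row slicing by the mask does (keep edges whose source is a training node, renumber the sources); `filter_edges_by_source` and the toy arrays are hypothetical, and whether targets should also be restricted to training nodes is an assumption left to the model's requirements.

```python
import numpy as np

def filter_edges_by_source(edge_index, mask):
    """Keep edges whose source node is selected by mask.

    Sources are renumbered to their position among the selected nodes,
    mirroring adjacency_matrix[mask]; targets keep their original ids.
    """
    mask = np.asarray(mask, dtype=bool)
    keep = mask[edge_index[0]]
    # new_id[v] is the position of node v among the selected nodes
    new_id = np.cumsum(mask) - 1
    sources = new_id[edge_index[0, keep]]
    targets = edge_index[1, keep]
    return np.stack([sources, targets])

# Tiny hypothetical edge list and training mask
toy_edges = np.array([[0, 1, 2, 2],
                      [1, 0, 3, 3]])
toy_mask = np.array([True, False, True, False])
toy_adj_train = filter_edges_by_source(toy_edges, toy_mask)
print(toy_adj_train)
```

Unlike the matrix route, this keeps duplicate edges as they are. If both endpoints have to be training nodes, `torch_geometric.utils.subgraph` may be a closer fit, since it extracts the induced subgraph for a node subset.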

I am running this code on a machine with 16 GB of RAM.
