byte size information using sys and pympler


I am using the pandas library to import a dataset. I split the dataset into X and y variables, and preprocessed X through a pipeline to obtain the X_data variable. Later, I train XGBoost and RF models on this data.

I used the sys and pympler libraries to see how much memory each variable in my workspace takes up.

First issue: I am getting different values from sys.getsizeof() and asizeof.asizeof(). I have read on other forums that sys.getsizeof() does not report the full size for certain objects (e.g., sets), but I am getting a huge difference even for an ordinary pandas Series. I am not sure which value to use for 'y'.

Second issue: I am using asizeof.asizeof() to find the sizes of the XGBoost and RF models. For the fitted RF model I get about 261KB, but for the fitted XGBoost model I get only around 5.2KB. I am not sure whether these values are correct for either model. Maybe I am missing something?

Below are the code and outputs.

For Issue #1

from pympler import asizeof
import sys

###
# Code for Preprocessing and model training not shown
###

type(y)
output: pandas.core.series.Series

asizeof.asizeof(y)
output: 6706024

sys.getsizeof(y)
output: 121104
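
As a cross-check on y, I also looked at what pandas itself reports (a minimal sketch, assuming y is the same Series as above):

y.memory_usage(index=True, deep=True)  # values + index, following object references
y.nbytes                               # size of the values buffer only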

_____________
type(X_data)
output: scipy.sparse._csr.csr_matrix

asizeof.asizeof(X_data)
output: 2301736

sys.getsizeof(X_data)
output: 56  # I discarded this value; from what I have read, this is just the size of the Python wrapper object
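
For the sparse matrix, the raw payload can also be added up directly from its three underlying arrays (a sketch, assuming X_data is the csr_matrix above):

# total bytes held by the CSR matrix's data, column-index and row-pointer arrays
X_data.data.nbytes + X_data.indices.nbytes + X_data.indptr.nbytes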

Issue #2: I also tried pickle and joblib.dump, but I am getting different values for the models.

import sys
import joblib
import pickle
from pympler import asizeof
from sklearn.ensemble import RandomForestClassifier

rf_classifier1 = RandomForestClassifier(**param, random_state=42)

## before training
print(asizeof.asizeof(rf_classifier1))
print(sys.getsizeof(rf_classifier1))

joblib.dump(rf_classifier1, 'rf_classifier1.joblib')
print(sys.getsizeof('rf_classifier1.joblib'))
print(asizeof.asizeof('rf_classifier1.joblib'))

print("----------")
rf_classifier1.fit(X_data, y)

## after training

# checking size directly using sys and asizeof commands
print(asizeof.asizeof(rf_classifier1))
print(sys.getsizeof(rf_classifier1))

print("----------")

# checking with joblib
joblib.dump(rf_classifier1, 'rf_classifier1.joblib')
print(sys.getsizeof('rf_classifier1.joblib'))
print(asizeof.asizeof('rf_classifier1.joblib'))

print("----------")

# checking with pickle.dumps
p = pickle.dumps(rf_classifier1)
print(sys.getsizeof(p))
print(asizeof.asizeof(p))

Output ---

2712
56
70
72
----------
361064
56
----------
70
72
----------
17719262
17719264

The last few outputs do not make sense to me: the model size was about 361K with asizeof, whereas pickle.dumps reported 17719262 bytes, and the joblib.dump numbers did not change at all after fitting.
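
To compare against the serialized sizes, here is what I plan to check next (a sketch; it assumes rf_classifier1.joblib was written to the current working directory):

import os
import pickle

# size of the file that joblib.dump actually wrote, in bytes
print(os.path.getsize('rf_classifier1.joblib'))

# length of the in-memory pickle payload, in bytes
print(len(pickle.dumps(rf_classifier1)))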
