I am trying to do time series forecasting with a Gaussian process model. For this I use GPyTorch, a Gaussian process library built on PyTorch that can spread the computation over multiple GPUs (via DataParallel). I am following the example at https://docs.gpytorch.ai/en/latest/examples/02_Scalable_Exact_GPs/Simple_MultiGPU_GP_Regression.html, which shows how to train a Gaussian process model on multiple GPUs. Instead of the data provided in the example, I use my own data, which has around 100,000 data points and one input feature; it is already scaled and normalized. I run my code on my university's supercomputer, so I have 8 GPUs with 32 GB of memory each. According to the paper linked in the example, this should be more than enough to fit the Gaussian process model. However, when I run my code I always get a CUDA out of memory error, more specifically on GPU 0.
After the error message appeared, I watched the memory usage while running my code via nvidia-smi -l. I noticed that the memory is not distributed equally over all GPUs, which then results in a CUDA out of memory error because GPU 0 is full even though the other GPUs still have free capacity. The error message looks similar to this:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.09 GiB (GPU 0; 39.44 GiB total capacity; 28.12 GiB already allocated; 10.77 GiB free; 28.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
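For reference, here is a minimal sketch of how the same per-GPU numbers can be printed from inside the process (using torch.cuda.memory_allocated and torch.cuda.memory_reserved; the GPU indices are whatever torch sees on the node):

import torch

def print_gpu_memory():
    # Report how much memory PyTorch has allocated and reserved on each visible GPU.
    # This mirrors what nvidia-smi -l shows, but from inside the training process.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024 ** 3
        reserved = torch.cuda.memory_reserved(i) / 1024 ** 3
        print(f'GPU {i}: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved')

In my case this confirms the same picture as nvidia-smi: GPU 0 fills up while the others stay mostly empty.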
(Screenshot: example of imbalanced memory usage with 4 GPUs and a smaller data set.)
According to the example, the code should distribute the memory over several GPUs and be able to handle up to 1,000,000 data points. So I don't understand why my code fills up all the memory of the first GPU while there is still free space on the other ones. According to the example, the output should look something like this (a rough estimate of the full kernel size follows after the log):
Number of devices: 2 -- Kernel partition size: 0
RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 1; 10.73 GiB total capacity; 7.48 GiB already allocated; 2.46 GiB free; 21.49 MiB cached)
Number of devices: 2 -- Kernel partition size: 18292
RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 10.73 GiB total capacity; 6.37 GiB already allocated; 448.94 MiB free; 1.30 GiB cached)
Number of devices: 2 -- Kernel partition size: 9146
Iter 1/1 - Loss: 0.893 lengthscale: 0.486 noise: 0.248
Finished training on 36584 data points using 2 GPUs.
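For scale, here is my back-of-the-envelope estimate of the full (unpartitioned) kernel matrix for my data, assuming float32 and taking the point count as exactly 100,000 (both are approximations), which already exceeds the memory of any single one of my cards:

n = 100_000          # approximate number of training points (assumption)
bytes_per_value = 4  # float32
kernel_gib = n * n * bytes_per_value / 1024 ** 3
print(f'Full {n} x {n} kernel matrix: ~{kernel_gib:.1f} GiB')  # ~37.3 GiB

So I understand why partitioning is needed at all; what I don't understand is why the work is not spread over the GPUs as in the example.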
My code looks like this:
# Module-level imports at the top of my script; FullBatchLBFGS comes from the
# LBFGS.py helper that the GPyTorch multi-GPU example uses.
import gc
import numpy as np
import torch
import gpytorch
from LBFGS import FullBatchLBFGS


def Exact_gp(self, x_train, y_train, x_test, y_test):
    # All tensors are created on GPU 0, which also acts as the output device.
    output_device = torch.device('cuda:0')
    x_train = torch.tensor(x_train, device=output_device)
    y_train = torch.tensor(y_train, device=output_device)
    x_test = torch.tensor(x_test, device=output_device)
    y_test = torch.tensor(y_test, device=output_device)
    train_x, train_y = torch.flatten(x_train.contiguous()), torch.flatten(y_train.contiguous())
    test_x, test_y = torch.flatten(x_test.contiguous()), torch.flatten(y_test.contiguous())

    n_devices = torch.cuda.device_count()
    print('Planning to run on {} GPUs.'.format(n_devices))

    class ExactGPModel(gpytorch.models.ExactGP):
        def __init__(self, train_x, train_y, likelihood, n_devices):
            super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
            self.mean_module = gpytorch.means.ConstantMean()
            base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
            # MultiDeviceKernel should spread the kernel computation over all
            # available GPUs and gather the result on output_device.
            self.covar_module = gpytorch.kernels.MultiDeviceKernel(
                base_covar_module, device_ids=list(range(n_devices)),
                output_device=output_device)

        def forward(self, x):
            mean_x = self.mean_module(x)
            covar_x = self.covar_module(x)
            return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

    def train(train_x, train_y, n_devices, output_device, checkpoint_size,
              preconditioner_size, n_training_iter):
        likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
        model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
        print('model built.')

        model.train()
        likelihood.train()
        print('model in training mode')

        optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)
        print('optimizer is calculated.')

        # "Loss" for GPs - the marginal log likelihood
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
        print('mll is calculated.')

        with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \
                gpytorch.settings.max_preconditioner_size(preconditioner_size):

            def closure():
                optimizer.zero_grad()
                print('set optimizer to zero grad.')
                output = model(train_x)
                print('made predictions')
                loss = -mll(output, train_y)
                print('loss was calculated')
                return loss

            print('start with loss calculation')
            loss = closure()
            loss.backward(torch.ones_like(loss))
            print('loss backward was calculated')

            for i in range(n_training_iter):
                options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}
                loss, _, _, _, _, _, _, fail = optimizer.step(options)

                print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
                    i + 1, n_training_iter, loss.mean().item(),
                    model.covar_module.module.base_kernel.lengthscale.item(),
                    model.likelihood.noise.item()
                ))

                if fail:
                    print('Convergence reached!')
                    break

        print(f"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.")
        return model, likelihood

    def find_best_gpu_setting(train_x,
                              train_y,
                              n_devices,
                              output_device,
                              preconditioner_size):
        N = train_x.size(0)

        # Find the optimum partition/checkpoint size by decreasing in powers of 2
        # Start with no partitioning (size = 0)
        settings = [0] + [int(n) for n in np.ceil(N / 2 ** np.arange(1, np.floor(np.log2(N))))]

        for checkpoint_size in settings:
            print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))
            try:
                # Try a full forward and backward pass with this setting to check memory usage
                _, _ = train(train_x, train_y,
                             n_devices=n_devices, output_device=output_device,
                             checkpoint_size=checkpoint_size,
                             preconditioner_size=preconditioner_size, n_training_iter=1)
                # when successful, break out of for-loop and jump to finally block
                break
            except RuntimeError as e:
                print('RuntimeError: {}'.format(e))
                gc.collect()
                torch.cuda.empty_cache()
            except AttributeError as e:
                print('AttributeError: {}'.format(e))
            finally:
                # handle CUDA OOM error
                gc.collect()
                torch.cuda.empty_cache()
                print('emptied cache')
        return checkpoint_size

    # Set a large enough preconditioner size to reduce the number of CG iterations run
    preconditioner_size = 100
    checkpoint_size = find_best_gpu_setting(train_x, train_y,
                                            n_devices=n_devices,
                                            output_device=output_device,
                                            preconditioner_size=preconditioner_size)
Thanks for helping me!