Shifting from a single GPU to multiple GPUs throws an error: TypeError: '<' not supported between instances of 'list' and 'int'


I have shifted from using a single GPU to multiple GPUs. The code throws this error:

    epoch       main/loss   validation/main/loss  elapsed_time
    Exception in main training loop: '<' not supported between instances of 'list' and 'int'
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/trainer.py", line 318, in run
        entry.extension(self)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 157, in __call__
        result = self.evaluate()
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 206, in evaluate
        in_arrays = self.converter(batch, self.device)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 150, in concat_examples
        return to_device(device, _concat_arrays(batch, padding))
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 35, in to_device
        elif device < 0:
    Will finalize trainer extensions and updater before reraising the exception.
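The last frame of the traceback shows exactly where it breaks: `to_device` expects `device` to be a single int, so if a list of GPU ids ends up there, the `device < 0` check fails. A minimal, chainer-free reproduction of that comparison (the list value here is hypothetical, standing in for whatever was passed as the device):

    # Minimal reproduction of the failing check inside
    # chainer.dataset.convert.to_device: `device` must be a single int,
    # so a list of GPU ids breaks the `device < 0` comparison.
    device = [0, 1, 2]  # hypothetical: a list passed where an int belongs

    try:
        device < 0  # Python 3 refuses to order a list against an int
    except TypeError as err:
        print(err)  # '<' not supported between instances of 'list' and 'int'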

Without a GPU it worked fine, but with a single GPU I got an out-of-memory error, so I moved to a p2.8xlarge instance, and now it throws the above error. Where is the problem and how do I solve it?

Changes made to use 8 GPUs:

    num_gpus = 8
    chainer.cuda.get_device_from_id(0).use()

3. Updater:

    if num_gpus > 0:
        updater = training.updater.ParallelUpdater(
            train_iter,
            optimizer,
            devices={('main' if device == 0 else str(device)): device
                     for device in range(num_gpus)},
        )
    else:
        updater = training.updater.StandardUpdater(train_iter, optimizer,
                                                   device=args.gpus)

4. ...and so on.

5. Training:

       trainer.run()

Output:

    epoch       main/loss   validation/main/loss  elapsed_time
    Exception in main training loop: '<' not supported between instances of 'list' and 'int'

I expected output like:

    epoch       main/loss   validation/main/loss  elapsed_time
    1           ...
    2           ...
    3           ... (and so on until it converges)

1 Answer

Answered by hvy:

This looks like an error caused by the Evaluator extension when it transfers data to the specified device. How are you specifying the device to Evaluator.__init__? Note that it should be a single device. This example may be a useful reference: https://github.com/chainer/chainer/blob/master/examples/mnist/train_mnist_data_parallel.py
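To illustrate the asymmetry, here is a minimal sketch assuming the 8-GPU setup from the question: ParallelUpdater takes a dict mapping replica names to device ids (with the key 'main' required for the master model), while Evaluator takes one device id as a plain int.

    num_gpus = 8

    # ParallelUpdater wants a dict {name: device_id}; the model
    # on GPU 0 must be registered under the key 'main'.
    devices = {('main' if d == 0 else str(d)): d for d in range(num_gpus)}

    # Evaluator, by contrast, wants one device id (a plain int).
    # Passing the dict -- or a list of ids -- produces the
    # `device < 0` TypeError seen in the traceback.
    eval_device = 0

    print(devices['main'], eval_device)  # 0 0

The evaluator would then be registered along the lines of `trainer.extend(extensions.Evaluator(test_iter, model, device=eval_device))`, as in the linked MNIST data-parallel example (`test_iter` and `model` here are placeholders for the question's own objects).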