What is the reason for K80 versus Pascal performance differences in this program that adds two arrays?

I followed the example on this page to get started with CUDA programming. It illustrates the basics by adding two arrays of a million elements each, run under different execution configurations.
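
For reference, the program in question is essentially the following (my reconstruction of the article's listing, so names and minor details may differ):

    #include <cuda_runtime.h>

    // Grid-stride loop: each thread processes elements index, index+stride, ...
    // so the kernel produces correct results for any grid/block configuration.
    __global__ void add(int n, float *x, float *y)
    {
        int index  = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }

    int main(void)
    {
        int N = 1 << 20;                            // ~1 million elements

        // Unified (managed) memory, accessible from host and device
        float *x, *y;
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));

        for (int i = 0; i < N; i++) {               // initialise on the host
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        // Execution configuration: <<<1, 256>>> for the single-block runs,
        // <<<numBlocks, 256>>> (i.e. <<<4096, 256>>>) for the multi-block runs.
        int blockSize = 256;
        int numBlocks = (N + blockSize - 1) / blockSize;  // 4096 for N = 1 << 20
        add<<<numBlocks, blockSize>>>(N, x, y);

        cudaDeviceSynchronize();                    // wait for the kernel to finish

        cudaFree(x);
        cudaFree(y);
        return 0;
    }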

I ran the code on a Tesla P100 (Pascal architecture) through Google Colaboratory, whereas the article uses a K80. Here are the nvprof metrics from executing the same code on both GPUs.
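
The profiling was done along these lines (add_cuda is just a placeholder for whatever the compiled binary is called):

    nvcc add.cu -o add_cuda
    nvprof ./add_cuda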

+---------------------+-------------------------+----------+
| GPU                 | Execution configuration | Time     |
+---------------------+-------------------------+----------+
| K80                 | <<<1, 256>>>            | 2.7107ms |
+---------------------+-------------------------+----------+
| Tesla P100 (Pascal) | <<<1, 256>>>            | 4.4293ms |
+---------------------+-------------------------+----------+
| K80                 | <<<4096, 256>>>         | 94.015us |
+---------------------+-------------------------+----------+
| Tesla P100 (Pascal) | <<<4096, 256>>>         | 3.6076ms |
+---------------------+-------------------------+----------+

After reading this article, I was under the impression that the Pascal architecture would outperform the K80, but the results above show two things:

  1. The K80 is faster than the P100 in the single-block configuration.
  2. Using 4096 blocks instead of 1 gives a large speedup on the K80 (~28x), but only a marginal one on the P100 (~1.2x).

Is this expected? And what would explain observation (2)?

Please let me know if I am missing something here.

Thank you for reading.
