OpenACC - What does -ta in pgcc compiler mean?


I am struggling with the "-ta" flag of the PGI compiler, which I need in order to use GPU acceleration with OpenACC, and I have not found a comprehensive answer. Yes, I know it stands for "target accelerator" and lets the compiler use information about the target hardware. So what -ta should I set if my GPU hardware is:

weugene@landau:~$ sudo lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104GL [10de:1bb1] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation GP104GL [Quadro P4000] [10de:11a3]
    Physical Slot: 4
    Flags: bus master, fast devsel, latency 0, IRQ 46, NUMA node 0
    Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Memory at d0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at e000 [size=128]
    [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Legacy Endpoint, MSI 00
    Capabilities: [100] Virtual Channel

CUDA versions for pgi compiler (/opt/pgi/linux86-64/2019/cuda) are: 9.2, 10.0, 10.1

Answer by Mat Colgrove:

As you note, "-ta" stands for "target accelerator" and is a way for you to override the default target device when using "-acc" ("-acc" tells the compiler to use OpenACC, and using just "-ta" implies "-acc"). PGI currently supports two targets: "multicore" to target a multi-core CPU, and "tesla" to target an NVIDIA Tesla device. Other NVIDIA products such as Quadro and GeForce will also work under the "tesla" flag provided they share the same architecture as a Tesla product.
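For illustration, here is roughly what the compile lines look like; "saxpy.c" is just a placeholder source file, and I'm adding "-Minfo=accel" so the compiler reports what it did with the OpenACC regions:

% pgcc -acc -ta=multicore -Minfo=accel -o saxpy saxpy.c
% pgcc -acc -ta=tesla -Minfo=accel -o saxpy saxpy.c

The first line runs the OpenACC regions across the host CPU cores, the second offloads them to an NVIDIA GPU.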

By default when using "-ta=tesla", the PGI compiler will create a unified binary supporting multiple NVIDIA architectures. The exact set of architectures depends on the compiler version and the CUDA device driver on the build system. For example, with PGI 19.4 on a system with a CUDA 9.2 driver, the compiler will target the Kepler (cc35), Maxwell (cc50), Pascal (cc60), and Volta (cc70) architectures. "cc" stands for compute capability. Note that if no CUDA driver can be found on the system, the 19.4 compiler defaults to using CUDA 10.0.
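If I remember the sub-option syntax correctly, you can reproduce (or trim) that default set yourself by listing compute capabilities as comma-separated sub-options; "saxpy.c" is again a placeholder:

% pgcc -ta=tesla:cc35,cc50,cc60,cc70 -Minfo=accel -o saxpy saxpy.c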

In your case, a Quadro P4000 uses the Pascal architecture (cc60), so it would be targeted by default. If you wanted the compiler to target only your device, as opposed to creating a unified binary, you'd use the option "-ta=tesla:cc60".
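For example, with "saxpy.c" standing in for your own source:

% pgcc -ta=tesla:cc60 -Minfo=accel -o saxpy saxpy.c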

You can also override which CUDA version to use as a sub-option, for example "-ta=tesla:cuda10.1". For a complete list of sub-options, run "pgcc -help -ta" from the command line or consult PGI's documentation.
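Sub-options can be combined, so something along these lines should work, though double-check the exact spelling against "pgcc -help -ta" on your install:

% pgcc -ta=tesla:cc60,cuda10.1 -Minfo=accel -o saxpy saxpy.c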

If you don't know the compute capability of the device, run the PGI utility "pgaccelinfo" which will give you this information. For example, here's the output for my system which has a V100:

% pgaccelinfo

CUDA Driver Version:           10010
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  418.67  Sat Apr  6 03:07:24 CDT 2019

Device Number:                 0
Device Name:                   Tesla V100-PCIE-16GB
Device Revision Number:        7.0
Global Memory Size:            16914055168
Number of Multiprocessors:     80
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1380 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             877 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 6291456 bytes
Max Threads Per SMP:           2048
Async Engines:                 7
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
  Multi-Device:                Yes
PGI Default Target:            -ta=tesla:cc70

Hope this helps!