How to interpret NVIDIA Visual Profiler analysis/recommendations?


I'm relatively new to CUDA and am currently working on a project to accelerate computer vision applications on embedded systems with GPUs attached (an NVIDIA TX1). What I'm trying to do is choose between two libraries: OpenCV and VisionWorks (which includes OpenVX).

Currently, I have written test code that runs the Canny edge detection algorithm with both libraries, and the two showed different execution times (the VisionWorks implementation takes about 30~40% less time).
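For reference, the kind of per-call wall-clock timing behind a figure like "30~40% less time" can be sketched as below; `run_canny` stands in for either library's call and is a placeholder, not an actual API:

```python
import time

def time_it(fn, iterations=100):
    """Average wall-clock seconds per call of fn over several iterations."""
    fn()  # warm up once so one-time CUDA context/initialization cost is excluded
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations
```

Usage would be something like `avg = time_it(lambda: run_canny(frame))` for each implementation, comparing the two averages.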

So I wondered what the reason might be, and profiled the kernel that takes the most time in each implementation: 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra, which takes up 37.2% of the entire application, and 'edgesHysteresisLocal' from VisionWorks.

I followed the profiler's 'guided analysis', and it suggested that both applications are latency-bound. Below are the captures of 'edgesHysteresisLocal' from VisionWorks and 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra.

OpenCV4Tegra - canny::edgesHysteresisLocalKernel

VisionWorks - edgesHysteresisLocal

So, my questions are:

  • From the analysis, what can I tell about the causes of the performance difference?

  • Moreover, when profiling CUDA applications in general, what is a good place to start? I mean, there are a bunch of metrics, and it's very hard to tell what to look at.

  • Are there any educational materials on profiling CUDA applications in general? (I have looked at many slides from NVIDIA, and I think they just give the definitions of the metrics, not where to start in general.)

-- By the way, as far as I know, NVIDIA doesn't provide the source code of VisionWorks or OpenCV4Tegra. Correct me if I'm wrong.

Thank you in advance for your answers.

There is 1 answer below.

Answer from X3liF:

1/ The shared memory usage differs between the two libraries; this is probably the cause of the performance divergence.


2/ Generally, I use three metrics to tell whether my algorithm is well coded for CUDA devices:

  • the memory usage of the kernels (bandwidth)
  • the number of registers used: is there register spilling or not?
  • the number of shared memory bank conflicts
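All three metric families can be collected with nvprof on the TX1. A minimal sketch that assembles such a command line (the binary name `./canny_test` is a placeholder, and metric availability can vary by GPU architecture):

```python
# Metric names from nvprof's metric set; each maps to one of the three
# checks above. "./canny_test" is a hypothetical test executable.
metrics = [
    "gld_throughput",                        # global load bandwidth
    "gst_throughput",                        # global store bandwidth
    "local_load_transactions",               # nonzero suggests register spilling
    "shared_load_transactions_per_request",  # values > 1.0 hint at bank conflicts
]
cmd = "nvprof --metrics " + ",".join(metrics) + " ./canny_test"
print(cmd)
```

Register counts can also be seen at compile time via `nvcc -Xptxas -v`, though that only applies to code you compile yourself.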

3/ I think there are many resources on the internet...


Another thing:

If you just want to rate one library against the other in order to select the best, why do you need to understand each implementation? (It's interesting, but not a prerequisite, is it?)

Why don't you measure algorithm performance by cycle time and by the quality of the produced results according to a metric (false positives, average error on a set of known reference results, ...)?
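Such a quality comparison takes only a few lines. A sketch, where edge maps are flat 0/1 lists and `false_positive_rate` is a helper introduced here for illustration (real edge maps would come from each library's Canny output):

```python
def false_positive_rate(result, reference):
    """Fraction of non-edge reference pixels that the result marks as edge."""
    fp = sum(1 for r, g in zip(result, reference) if r == 1 and g == 0)
    negatives = sum(1 for g in reference if g == 0)
    return fp / negatives if negatives else 0.0

reference  = [0, 0, 1, 1, 0, 0, 1, 0]   # ground-truth edge map
candidate  = [0, 1, 1, 1, 0, 0, 1, 0]   # one spurious edge pixel at index 1
print(false_positive_rate(candidate, reference))  # prints 0.2 (1 of 5 negatives)
```

Running the same metric on both libraries' outputs against a common reference, alongside the timing numbers, gives a selection criterion without needing to understand either implementation.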