How to interpret NVIDIA Visual Profiler analysis/recommendations?


I'm relatively new to CUDA and am currently working on a project to accelerate computer vision applications on embedded systems with GPUs attached (an NVIDIA TX1). What I'm trying to do is choose between two libraries: OpenCV and VisionWorks (which includes OpenVX).

Currently, I have written test code that runs the Canny edge detection algorithm with both libraries, and the two showed different execution times (the VisionWorks implementation takes about 30~40% less time).
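For reference, the kind of per-call wall-clock timing behind a figure like "30~40% less time" can be sketched as below; `run_canny` stands in for either library's call and is a placeholder, not an actual API:

```python
import time

def time_it(fn, iterations=100):
    """Average wall-clock seconds per call of fn over several iterations."""
    fn()  # warm up once so one-time CUDA context/initialization cost is excluded
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations
```

Usage would be something like `avg = time_it(lambda: run_canny(frame))` for each implementation, comparing the two averages.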

So I wondered what the reason might be, and profiled the kernel that takes the most time in each implementation: 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra, which takes up 37.2% of the entire application, and 'edgesHysteresisLocal' from VisionWorks.

I followed the profiler's 'guided analysis', and it suggested that both applications are latency-bound. Below are the captures of 'edgesHysteresisLocal' from VisionWorks and 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra.

OpenCV4Tegra - canny::edgesHysteresisLocalKernel

VisionWorks - edgesHysteresisLocal

So, my questions are:

  • From the analysis, what can I tell about the causes of the performance difference?

  • Moreover, when profiling CUDA applications in general, what is a good place to start? I mean, there are a bunch of metrics, and it's very hard to tell what to look at.

  • Are there any educational materials on profiling CUDA applications in general? (I have looked at many slides from NVIDIA, and I think they just give the definitions of the metrics, not where to start in general.)

-- By the way, as far as I know, NVIDIA doesn't provide the source code of VisionWorks or OpenCV4Tegra. Correct me if I'm wrong.

Thank you in advance for your answers.

There is 1 answer below.

Answer from X3liF:

1/ The shared memory usage differs between the two libraries; this is probably the cause of the performance divergence.


2/ Generally, I use three metrics to tell whether my algorithm is well coded for CUDA devices:

  • the memory usage of the kernels (bandwidth)
  • the number of registers used: is there register spilling or not?
  • the number of shared memory bank conflicts
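All three metric families can be collected with nvprof on the TX1. A minimal sketch that assembles such a command line (the binary name `./canny_test` is a placeholder, and metric availability can vary by GPU architecture):

```python
# Metric names from nvprof's metric set; each maps to one of the three
# checks above. "./canny_test" is a hypothetical test executable.
metrics = [
    "gld_throughput",                        # global load bandwidth
    "gst_throughput",                        # global store bandwidth
    "local_load_transactions",               # nonzero suggests register spilling
    "shared_load_transactions_per_request",  # values > 1.0 hint at bank conflicts
]
cmd = "nvprof --metrics " + ",".join(metrics) + " ./canny_test"
print(cmd)
```

Register counts can also be seen at compile time via `nvcc -Xptxas -v`, though that only applies to code you compile yourself.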

3/ I think there are many resources on the internet...


Another thing:

If you just want to rate one library against the other in order to select the best, why do you need to understand each implementation? (It's interesting, but not a prerequisite, is it?)

Why don't you measure algorithm performance by cycle time and by the quality of the produced results according to a metric (false positives, average error on a set of known reference results, ...)?
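Such a quality comparison takes only a few lines. A sketch, where edge maps are flat 0/1 lists and `false_positive_rate` is a helper introduced here for illustration (real edge maps would come from each library's Canny output):

```python
def false_positive_rate(result, reference):
    """Fraction of non-edge reference pixels that the result marks as edge."""
    fp = sum(1 for r, g in zip(result, reference) if r == 1 and g == 0)
    negatives = sum(1 for g in reference if g == 0)
    return fp / negatives if negatives else 0.0

reference  = [0, 0, 1, 1, 0, 0, 1, 0]   # ground-truth edge map
candidate  = [0, 1, 1, 1, 0, 0, 1, 0]   # one spurious edge pixel at index 1
print(false_positive_rate(candidate, reference))  # prints 0.2 (1 of 5 negatives)
```

Running the same metric on both libraries' outputs against a common reference, alongside the timing numbers, gives a selection criterion without needing to understand either implementation.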