I'm relatively new to CUDA and am currently working on a project to accelerate computer vision applications on embedded systems with attached GPUs (NVIDIA TX1). What I'm trying to do is choose between two libraries: OpenCV and VisionWorks (which includes OpenVX).
So far, I have written test code that runs the Canny edge detection algorithm with each library, and the two implementations show different execution times (the VisionWorks implementation takes about 30~40% less time).
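For concreteness, here is a minimal sketch of how such a timing test can be set up on the OpenCV side (assuming the OpenCV 2.4-style gpu module that OpenCV4Tegra is based on; the image path and Canny thresholds are placeholder values, not the ones from my actual tests):

    // Time one cv::gpu::Canny call with CUDA events.
    #include <opencv2/opencv.hpp>
    #include <opencv2/gpu/gpu.hpp>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
        cv::gpu::GpuMat d_img(img), d_edges;

        cv::gpu::Canny(d_img, d_edges, 50.0, 100.0);   // warm-up run

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cv::gpu::Canny(d_img, d_edges, 50.0, 100.0);   // timed run
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("Canny took %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }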
So I wondered what the reason might be, and profiled the kernels that take the most time: 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra, which takes up 37.2% of the entire application (which runs both the OpenCV and the VisionWorks implementations), and 'edgesHysteresisLocal' from VisionWorks.
I followed the 'guided analysis', and the profiler suggested that both applications are latency-bound. Below are the captures of 'edgesHysteresisLocal' from VisionWorks and 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra.
OpenCV4Tegra - canny::edgesHysteresisLocalKernel
VisionWorks - edgesHysteresisLocal
So, my questions are:
From this analysis, what can I tell about the causes of the performance difference?
Moreover, when profiling CUDA applications in general, where is a good place to start? There are a great many metrics, and it's very hard to tell which ones to look at.
Are there any educational materials on profiling CUDA applications in general? (I have looked at many slides from NVIDIA, but they mostly just give the definitions of the metrics, not where to start in general.)
-- By the way, as far as I know, NVIDIA doesn't provide the source code of VisionWorks or OpenCV4Tegra. Correct me if I'm wrong.
Thank you in advance for your answers.
1/ The shared memory usage differs between the two libraries; this is probably the cause of the performance divergence.
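To see why that matters: for a fixed block size, the more shared memory a kernel needs per block, the fewer blocks can be resident on each multiprocessor, which lowers occupancy and the GPU's ability to hide latency. A minimal sketch with the CUDA occupancy API (the kernel is just a placeholder, not code from either library):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Placeholder kernel that statically allocates 4 KB of shared memory.
    __global__ void dummyKernel(float* out)
    {
        __shared__ float tile[32 * 32];
        tile[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[blockIdx.x] = tile[0];
    }

    int main()
    {
        const int blockSize = 256;

        // How many blocks of dummyKernel can be resident on one SM,
        // given 0 bytes of extra dynamic shared memory per block?
        int numBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, dummyKernel, blockSize, 0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        double occupancy = double(numBlocks * blockSize)
                         / prop.maxThreadsPerMultiProcessor;
        std::printf("resident blocks/SM: %d, theoretical occupancy: %.0f%%\n",
                    numBlocks, 100.0 * occupancy);
        return 0;
    }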
2/ Generally, I use three metrics to tell whether my algorithm is well coded for CUDA devices: the achieved occupancy, the global memory load/store efficiency, and the amount of branch divergence.
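These can be collected directly with nvprof on the device, for example (standard nvprof metric names; './my_app' is a placeholder for your binary):

    nvprof --metrics achieved_occupancy,gld_efficiency,gst_efficiency,branch_efficiency ./my_app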
3/ I think there are many resources on this on the internet (the CUDA Best Practices Guide from NVIDIA is one place to start).
Another thing:
If you just want to qualify one library against the other in order to select the best one, why do you need to understand each implementation? (It's interesting, but not a prerequisite, is it?)
Why don't you measure algorithm performance by the cycle time and by the quality of the produced results according to a metric (false positives, average error on a set of known reference results, ...)?
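To make the quality side concrete, here is a minimal sketch of one such metric in C++ with OpenCV: the fraction of pixels on which two binary edge maps disagree. (The two CPU cv::Canny calls are stand-ins so the sketch compiles on its own; in the real test, the two edge maps would come from the OpenCV4Tegra and VisionWorks implementations.)

    #include <opencv2/opencv.hpp>
    #include <cstdio>

    // Fraction of pixels where two binary edge maps disagree.
    static double edgeMismatchRate(const cv::Mat& a, const cv::Mat& b)
    {
        cv::Mat diff;
        cv::absdiff(a, b, diff);   // nonzero wherever the maps differ
        return double(cv::countNonZero(diff)) / double(diff.total());
    }

    int main()
    {
        cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
        if (img.empty()) return 1;

        // Stand-ins for the two implementations under test.
        cv::Mat edgesA, edgesB;
        cv::Canny(img, edgesA, 50, 150);
        cv::Canny(img, edgesB, 50, 150);

        std::printf("mismatch rate: %.4f%%\n",
                    100.0 * edgeMismatchRate(edgesA, edgesB));
        return 0;
    }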