OMAP3530: Loop runs slower on DSP than on ARM

The OMAP3530 implements an ARM processor and a C64x+ DSP. I have a test loop that I expect to run faster on the DSP than on the ARM, but this is not the case.

Loop:

#define DIM 4
#define LIM 1000
#define MASK 3

int i, j;
uint32 arr[DIM][DIM] = {0};
uint32 rand[DIM][DIM] = {1, 5, 2, 7,
                         5, 4, 3, 8,
                         1, 2, 9, 3,
                         6, 6, 8, 4};

for (i = 0; i < LIM; i++)
    for (j = 0; j < LIM; j++)
        arr[i & MASK][j & MASK] += rand[i & MASK][j & MASK];

Benchmarks:

  • ARM: 5ms

  • DSP: 25ms

The point of the DSP is to handle simple arithmetic operations like this, so I would have expected it to be faster. I haven't done much configuration of the DSP, since I'm pretty inexperienced with it. I believe the cache is not configured, so I am looking into that, but I would welcome any other suggestions.

Could anybody advise on a possible solution?

EDIT - Changed the LIM value to 5000 to increase the # of iterations. New benchmarks:

  • ARM: 120ms

  • DSP: 530ms

Answer by Marcus Müller:

I've seen this happen before. Using the DSP pays off only in very specific scenarios. A million additions is surely not one of them: the ARM Cortex-A8 is not bad at adding numbers, so you're taking code that would be highly efficient on the ARM and running it on a slower coprocessor. That simply won't speed things up.

The specific OMAP you're looking at has an ARM Cortex-A8 core with NEON, which means it has single-instruction-multiple-data (SIMD) multiply/accumulate instructions. In my experience, using those should be even faster than letting the compiler implement your loop as efficiently as it can. Mileage might vary, though; I'm assuming that somewhere down the line you're doing multiplications, too.
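
To make the NEON point concrete, here is a minimal sketch of how one pass over the 4x4 matrices from the question could be written with NEON intrinsics. It uses a plain vector add (vaddq_u32) rather than multiply/accumulate, since the loop in question only adds. The helper name add_matrix is made up for illustration, uint32_t stands in for the question's uint32, and the scalar fallback is only there so the code also builds on non-ARM hosts. I have not benchmarked this on an OMAP3530.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>   /* NEON intrinsics, available when building for NEON */
#endif

#define DIM 4

/* Hypothetical helper (not from the question): dst[r][c] += src[r][c]
 * for every element of a DIM x DIM matrix of 32-bit unsigned integers. */
void add_matrix(uint32_t dst[DIM][DIM], uint32_t src[DIM][DIM])
{
    size_t r;

    assert(dst != NULL && src != NULL);

    for (r = 0; r < DIM; r++) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* One row is four 32-bit lanes, i.e. exactly one 128-bit Q register,
         * so the whole row is added with a single vector instruction. */
        uint32x4_t d = vld1q_u32(dst[r]);
        uint32x4_t s = vld1q_u32(src[r]);
        vst1q_u32(dst[r], vaddq_u32(d, s));
#else
        /* Scalar fallback so the sketch still compiles without NEON. */
        size_t c;
        for (c = 0; c < DIM; c++)
            dst[r][c] += src[r][c];
#endif
    }
}
```

With GCC for the Cortex-A8 this would typically be built with something like -mfpu=neon; whether it actually beats the compiler's own output for your loop is something you would have to measure.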

If you want to unleash the power of hand-optimized, intrinsics-rich, platform-specific code, have a look at VOLK, a spin-off from the GNU Radio project. It provides a Vector Optimized Library of Kernels, with a generic implementation, x86 (MMX/SSE2/AVX) implementations for most of the kernels, and a NEON implementation for some of them. Of specific interest to your problem might be the 16i_x5_add_quad_16i_x4 kernel.

In conclusion: unless you're sure the C64x+ has a lot of advantages over the rather capable Cortex-A8, I wouldn't use it. You mention that this is part of a larger loop on the DSP, but you don't yet have the means to count the cycles your algorithm takes on the DSP. I'd recommend getting your development setup into a state where it's easy to judge how good your implementation is. The general-purpose timers on the ARM surely aren't a good benchmark.
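
As a rough sketch of what counting cycles on the DSP could look like: the C64x+ has a free-running time-stamp counter that TI's compiler exposes as the TSCL register through c6x.h. The helper names measure_ticks and run_test_loop are made up for illustration, the array rand is renamed incr to avoid clashing with the C library function, and uint32_t stands in for the question's uint32. The clock() branch is an assumption added only so the sketch builds on a host machine; it reports clock ticks, not cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI compiler header declaring the TSCL/TSCH registers */
#endif

#define DIM 4
#define LIM 1000
#define MASK 3

uint32_t arr[DIM][DIM];
uint32_t incr[DIM][DIM] = {{1, 5, 2, 7},
                           {5, 4, 3, 8},
                           {1, 2, 9, 3},
                           {6, 6, 8, 4}};

/* The test loop from the question, wrapped in a function. */
void run_test_loop(void)
{
    int i, j;

    for (i = 0; i < LIM; i++)
        for (j = 0; j < LIM; j++)
            arr[i & MASK][j & MASK] += incr[i & MASK][j & MASK];
}

/* Hypothetical helper: runs fn once and returns the elapsed ticks.
 * On the C64x+ these are CPU cycles (low 32 bits of the counter only,
 * so very long runs wrap); elsewhere they are clock() ticks. */
uint64_t measure_ticks(void (*fn)(void))
{
#ifdef _TMS320C6X
    uint32_t start, end;
#else
    clock_t start, end;
#endif

    assert(fn != 0);

#ifdef _TMS320C6X
    TSCL = 0;            /* any write starts the counter if it is stopped */
    start = TSCL;
    fn();
    end = TSCL;
    return (uint64_t)(uint32_t)(end - start);
#else
    start = clock();
    fn();
    end = clock();
    return (uint64_t)(end - start);
#endif
}
```

Calling measure_ticks(run_test_loop) with and without the cache enabled, and at different optimization levels, would give numbers that are comparable from run to run without involving the ARM-side timers.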