For example, have a look at this snippet of code generated by gcc.

.L11:
        vpand   ymm0, ymm1, YMMWORD PTR [rax]
        add     rax, 224
        vmovdqa YMMWORD PTR [rax-224], ymm0
        vpand   ymm0, ymm2, YMMWORD PTR [rax-192]
        vmovdqa YMMWORD PTR [rax-192], ymm0
        vpand   ymm0, ymm3, YMMWORD PTR [rax-160]
        vmovdqa YMMWORD PTR [rax-160], ymm0
        vpand   ymm0, ymm4, YMMWORD PTR [rax-128]
        vmovdqa YMMWORD PTR [rax-128], ymm0
        vpand   ymm0, ymm5, YMMWORD PTR [rax-96]
        vmovdqa YMMWORD PTR [rax-96], ymm0
        vpand   ymm0, ymm6, YMMWORD PTR [rax-64]
        vmovdqa YMMWORD PTR [rax-64], ymm0
        vpand   ymm0, ymm7, YMMWORD PTR [rax-32]
        vmovdqa YMMWORD PTR [rax-32], ymm0
        cmp     rax, rsi
        jb      .L11

It does what I want, but one thing I notice is that every result goes through ymm0 before being stored to memory. I know that vpand cannot write its result directly to memory, but wouldn't this, for example, be more efficient?

.L11:
        vpand   ymm8, ymm1, YMMWORD PTR [rax]
        add     rax, 224
        vmovdqa YMMWORD PTR [rax-224], ymm8
        vpand   ymm9, ymm2, YMMWORD PTR [rax-192]
        vmovdqa YMMWORD PTR [rax-192], ymm9
        vpand   ymm10, ymm3, YMMWORD PTR [rax-160]
        vmovdqa YMMWORD PTR [rax-160], ymm10
        vpand   ymm11, ymm4, YMMWORD PTR [rax-128]
        vmovdqa YMMWORD PTR [rax-128], ymm11
        vpand   ymm12, ymm5, YMMWORD PTR [rax-96]
        vmovdqa YMMWORD PTR [rax-96], ymm12
        vpand   ymm13, ymm6, YMMWORD PTR [rax-64]
        vmovdqa YMMWORD PTR [rax-64], ymm13
        vpand   ymm0, ymm7, YMMWORD PTR [rax-32]
        vmovdqa YMMWORD PTR [rax-32], ymm0
        cmp     rax, rsi
        jb      .L11

This way, I think more operations can be done in parallel because no dependency is carried through ymm0.

It uses more registers, but the function as a whole uses all 16 ymm registers anyway, and after this part new values are loaded into each register, so using more registers here is not really a problem.

I checked clang, and it produces the same code with just different register numbers.

Can I expect a measurable speedup if I write the assembly by hand using the second register allocation?

I could test this and see, but writing assembly by hand is not easy for me, and there are other parts to work on. So I'm asking this question first, to better understand whether using different registers for consecutive memory stores can actually improve performance.
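For reference, here is a minimal scalar sketch of what the loop above computes, reconstructed from the assembly only: each 224-byte group is seven 256-bit blocks, and block b is ANDed in place with a fixed mask (ymm1 to ymm7 in the listing). The names `and_masks`, `BLOCKS` and `LANES` are made up for illustration, and this is not the actual source. Unlike the assembly, which tests its end pointer at the bottom of the loop, this version also handles zero groups.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: 4 x 64-bit lanes = one 256-bit ymm register,
   7 blocks = one 224-byte group, matching "add rax, 224". */
enum { LANES = 4, BLOCKS = 7 };

/* AND every 224-byte group of buf in place with 7 fixed 256-bit masks.
   masks is a flat array of BLOCKS * LANES words (an assumed layout). */
void and_masks(uint64_t *buf, size_t groups, const uint64_t *masks)
{
    for (size_t g = 0; g < groups; g++) {
        uint64_t *p = buf + g * BLOCKS * LANES;
        for (int b = 0; b < BLOCKS; b++)        /* one vpand + vmovdqa pair each */
            for (int l = 0; l < LANES; l++)
                p[b * LANES + l] &= masks[b * LANES + l];
    }
}
```

The scalar form makes no register choices at all, so it says nothing about the ymm0 question by itself. Its use would be as a baseline: the output of any hand-written variant of the loop can be compared against it before timing the two register allocations.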
