Loading an xmm register with two UInt64s that are in a pointed-to array

I'm trying to load a 128-bit xmm register with two UInt64 integers in Delphi (XE6).

Background

An XMM register is 128 bits wide, and can be loaded with multiple independent integers. You can then have the CPU add those integers all in parallel.

For example you can load up xmm0 and xmm1 with four UInt32s each, and then have the CPU add all four pairs simultaneously.

xmm0: $00001000 $00000100 $00000010 $00000001
          +         +         +         +      
xmm1: $00002000 $00000200 $00000020 $00000002
          =         =         =         =
xmm0: $00003000 $00000300 $00000030 $00000003

After loading xmm0 and xmm1, you perform the add of the four pairs using:

paddd xmm0, xmm1    //Add packed 32-bit integers (i.e. xmm0 := xmm0 + xmm1)

You could also do it using 8 x 16-bit integers:

xmm0: $001F $0013 $000C $0007 $0005 $0003 $0002 $0001
        +     +     +     +     +     +     +     + 
xmm1: $0032 $001F $0013 $000C $0007 $0005 $0003 $0002
        =     =     =     =     =     =     =     = 
xmm0: $0051 $0032 $001F $0013 $000C $0008 $0005 $0003

With the instruction

paddw xmm0, xmm1  //Add packed 16-bit integers
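To make the lane-by-lane behaviour concrete, here is a minimal scalar sketch in plain Pascal of what paddd computes, using the four 32-bit values from the first diagram. The names TLanes32, PackedAdd32, X0 and X1 are made up for this illustration; the CPU does all four additions in one instruction, while this model simply loops over the lanes.

```pascal
program PackedAddSketch;

{$APPTYPE CONSOLE}
{$Q-} // overflow checking off: each lane wraps around, as paddd does
{$R-} // range checking off for the same reason

uses
  System.SysUtils;

type
  TLanes32 = array[0..3] of UInt32; // hypothetical name: four 32-bit lanes

// Scalar model of "paddd a, b": every lane is added independently.
function PackedAdd32(const a, b: TLanes32): TLanes32;
var
  i: Integer;
begin
  for i := 0 to 3 do
    Result[i] := a[i] + b[i]; // no carry crosses from one lane into the next
end;

const
  // Lane 0 is the lowest 32 bits, i.e. the rightmost value in the diagrams.
  X0: TLanes32 = ($00000001, $00000010, $00000100, $00001000);
  X1: TLanes32 = ($00000002, $00000020, $00000200, $00002000);

var
  r: TLanes32;
  i: Integer;
begin
  r := PackedAdd32(X0, X1);
  // Print the highest lane first so the line matches the diagram layout.
  for i := 3 downto 0 do
    Write('$', IntToHex(Int64(r[i]), 8), ' ');
  Writeln; // $00003000 $00000300 $00000030 $00000003
end.
```

The same model applies to paddw (eight 16-bit lanes) and paddq (two 64-bit lanes); only the lane width and count change.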

Now for 64-bit integers

To load two 64-bit integers into an xmm register, you have to use either:

  • movdqu: Move double-quadword (unaligned)
  • movdqa: Move double-quadword (aligned)

In this simple example we won't worry about our UInt64s being aligned, and we'll simply use the unaligned version (movdqu).

The first thing that we have to deal with is that the Delphi compiler knows that movdqu needs a 128-bit something to load - it's loading double quadwords.

For this we will create a 128-bit structure, which also nicely lets us address the two 64-bit values:

TDoubleQuadword = packed record
   v1: UInt64; //value 1
   v2: UInt64; //value 2
end;

And now we can use this type in a test console app:

procedure Main;
var
    x, y: TDoubleQuadword;
begin
    //[1,5] + [2,7] = ?
    x.v1 := $0000000000000001;
    x.v2 := $0000000000000005;

    y.v1 := $0000000000000002;
    y.v2 := $0000000000000007;

    asm
        movdqu xmm0, x      //move unaligned double quadwords (xmm0 := x)
        movdqu xmm1, y      //move unaligned double quadwords (xmm1 := y)

        paddq  xmm0, xmm1   //add packed quadword integers    (xmm0 := xmm0 + xmm1)

        movdqu x, xmm0      //move unaligned double quadwords (x := xmm0)

    end;

    WriteLn(IntToStr(x.v1)+', '+IntToStr(x.v2));
end;

And this works, printing out:

3, 12

Eye on the prize

With an eye towards the goal of having x and y aligned (not a necessary part of my question), let's say we have a pointer to a TDoubleQuadword structure:

TDoubleQuadword = packed record
   v1: UInt64; //value 1
   v2: UInt64; //value 2
end;
PDoubleQuadword = ^TDoubleQuadword;

we now change up our hypothetical test function to use PDoubleQuadword:

procedure AlignedStuff;
var
    x, y: PDoubleQuadword;
begin
    x := GetMemory(sizeof(TDoubleQuadword));
    x.v1 := $0000000000000001;
    x.v2 := $0000000000000005;

    y := GetMemory(sizeof(TDoubleQuadword));
    y.v1 := $0000000000000002;
    y.v2 := $0000000000000007;

    asm
        movdqu xmm0, x      //move unaligned double quadwords (xmm0 := x)
        movdqu xmm1, y      //move unaligned double quadwords (xmm1 := y)

        paddq  xmm0, xmm1       //add packed quadword integers    (xmm0 := xmm0 + xmm1)
        movdqu x, xmm0         //move unaligned double quadwords (x := xmm0)
    end;

    WriteLn(IntToStr(x.v1)+', '+IntToStr(x.v2));
end;

Now this doesn't compile, and it makes sense why:

movdqu xmm0, x      //E2107 Operand size mismatch

That makes sense. The x argument must be 128 bits, and the compiler knows that x is really only a (32-bit) pointer.
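The size mismatch is easy to confirm from Pascal itself. This is a small sketch, assuming a 32-bit (Win32) target as in the rest of the question; it only prints the two sizes the assembler is comparing.

```pascal
program OperandSizes;

{$APPTYPE CONSOLE}

type
  TDoubleQuadword = packed record
    v1: UInt64; //value 1
    v2: UInt64; //value 2
  end;
  PDoubleQuadword = ^TDoubleQuadword;

begin
  // The pointer variable: 4 bytes when compiled for Win32.
  Writeln('SizeOf(PDoubleQuadword) = ', SizeOf(PDoubleQuadword));
  // The record it points to: 16 bytes, which is what movdqu wants.
  Writeln('SizeOf(TDoubleQuadword) = ', SizeOf(TDoubleQuadword));
end.
```

So the operand that movdqu needs is the 16-byte record behind the pointer, not the 4-byte variable that holds the address.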

But what should it be?

Now we come to my question: what should it be? I've randomly mashed various things on my keyboard, hoping that the compiler gods would just accept what i obviously mean. But nothing works.

//Don't try to pass the 32-bit pointer itself, pass the thing it points to:
movdqu xmm0, x^     //E2107 Operand size mismatch    

//Try casting it
movdqu xmm0, TDoubleQuadword(x^) //E2105 Inline assembler error

//i've seen people using square brackets to mean "contents of":
movdqu xmm0, [x]     //E2107 Operand size mismatch    

And now we give up on rational thought

movdqu xmm0, Pointer(x)
movdqu xmm0, Addr(x^)
movdqu xmm0, [Addr(x^)]
movdqu xmm0, [Pointer(TDoubleQuadword(x))^]

I did get one thing to compile:

movdqu xmm0, TDoubleQuadword(x)

But of course that loads the 16 bytes stored at the variable x itself (the pointer value plus whatever follows it on the stack), rather than the values x points to.

So i give up.

Complete Minimal Example

program Project3;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils;

type
     TDoubleQuadword = packed record
         v1: UInt64; //value 1
         v2: UInt64; //value 2
     end;
     PDoubleQuadword = ^TDoubleQuadword;

    TVectorUInt64 = array[0..15] of UInt64;
    PVectorUInt64 = ^TVectorUInt64;

 procedure AlignedStuff;
 var
    x, y: PVectorUInt64;
 begin
    x := GetMemory(sizeof(TVectorUInt64));
    //x[0] := ...
    //x[1] := ...
    // ...
    //x[3] := ...
    x[4] := $0000000000000001;
    x[5] := $0000000000000005;

    y := GetMemory(sizeof(TVectorUInt64));
    //y[0] := ...
    //y[1] := ...
    // ...
    //y[3] := ...
    y[4] := $0000000000000002;
    y[5] := $0000000000000007;

    asm
        movdqu xmm0, TDoubleQuadword(x[4])      //move unaligned double quadwords (xmm0 := x)
        movdqu xmm1, TDoubleQuadword(y[4])      //move unaligned double quadwords (xmm1 := y)

        paddq  xmm0, xmm1       //add packed quadword integers    (xmm0 := xmm0 + xmm1)
        movdqu TDoubleQuadword(x[4]), xmm0         //move unaligned double quadwords (v1 := xmm0)
    end;

    WriteLn(IntToStr(x[4])+', '+IntToStr(x[5]));
 end;

begin
  try
        AlignedStuff;
        Writeln('Press enter to close...');
        Readln;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

Pointer?

The reason the question is asking about pointers is because:

  • you cannot use stack variables (Delphi doesn't guarantee alignment of stack variables)
  • you could copy them into a register (e.g. EAX), but then you're doing a wasted copy and function call
  • i already have the data aligned in memory

If i give an example of the code that just involves adding UInt64s:

TVectorUInt64 = array[0..15] of UInt64;
PVectorUInt64 = ^TVectorUInt64;

var
   v: PVectorUInt64;
   i: Integer;
begin
   v := GetMemoryAligned(sizeof(TVectorUInt64), 64); //64-byte alignment

   //v is initialized

   for i := 0 to 15 do
   begin
      v[0] := v[0] + v[4];
      v[1] := v[1] + v[5];
      v[2] := v[2] + v[6];
      v[3] := v[3] + v[7];

      //..and some more changes to v0..v3
      //..and some more changes to v12..v15

      v[8]  := v[8]  + v[12];
      v[9]  := v[9]  + v[13];
      v[10] := v[10] + v[14];
      v[11] := v[11] + v[15];

      //...and some more changes to v4..v7

      v[0] := v[0] + v[4];
      v[1] := v[1] + v[5];
      v[2] := v[2] + v[6];
      v[3] := v[3] + v[7];

      //...and some more changes to v0..v3
      //...and some more changes to v12..v15

      v[8]  := v[8]  + v[12];
      v[9]  := v[9]  + v[13];
      v[10] := v[10] + v[14];
      v[11] := v[11] + v[15];

      //...and some more changes to v4..v7

      v[0] := v[0] + v[4];
      v[1] := v[1] + v[5];
      v[2] := v[2] + v[6];
      v[3] := v[3] + v[7];

      //..and some more changes to v0..v3
      //..and some more changes to v12..v15

      v[8]  := v[8]  + v[12];
      v[9]  := v[9]  + v[13];
      v[10] := v[10] + v[14];
      v[11] := v[11] + v[15];

      //...and some more changes to v4..v7

      v[0] := v[0] + v[4];
      v[1] := v[1] + v[5];
      v[2] := v[2] + v[6];
      v[3] := v[3] + v[7];

      //...and some more changes to v0..v3
      //...and some more changes to v12..v15

      v[8]  := v[8]  + v[12];
      v[9]  := v[9]  + v[13];
      v[10] := v[10] + v[14];
      v[11] := v[11] + v[15];

      //...and some more changes to v4..v7
   end;

It is conceptually very easy to change the code to:

      //v[0] := v[0] + v[4];
      //v[1] := v[1] + v[5];
      asm
         movdqu xmm0, v[0]
         movdqu xmm1, v[4]
         paddq xmm0, xmm1
         movdqu v[0], xmm0
      end
      //v[2] := v[2] + v[6];
      //v[3] := v[3] + v[7];
      asm
         movdqu xmm0, v[2]
         movdqu xmm1, v[6]
         paddq xmm0, xmm1
         movdqu v[2], xmm0
      end

      //v[8]  := v[8]  + v[12];
      //v[9]  := v[9]  + v[13];
      asm
         movdqu xmm0, v[8]
         movdqu xmm1, v[12]
         paddq xmm0, xmm1
         movdqu v[8], xmm0
      end
      //v[10] := v[10] + v[14];
      //v[11] := v[11] + v[15];
      asm
         movdqu xmm0, v[10]
         movdqu xmm1, v[14]
         paddq xmm0, xmm1
         movdqu v[10], xmm0
      end

The trick is getting the Delphi compiler to accept it.

  • it works for immediate data
  • it fails for pointer to data
  • and you would think [contentsOfSquareBrackets] would work

Bonus Chatter

Using David's solution (with its function calling overhead) leads to a performance "improvement" of about -8% (90 MB/s -> 83 MB/s of algorithm throughput).

It seems like, in the XE6 compiler, it is valid to conceptually call:

movdqu xmm0, TPackedQuadword

but the compiler just doesn't have the brains to let you perform the conceptual call:

movdqu xmm0, PPackedQuadword^

or its moral equivalent.

If that's the answer, don't be afraid of it. Embrace it, and put it in the form of an answer:

"The compiler does not support dereferencing a pointer inside an asm block, whether you try it with a caret (^) or with square brackets ([...]). It just cannot be done."

If that's the answer: answer it.

If it's not the case, and the compiler can support pointers in an asm block, then post the answer.

There are 2 best solutions below

24
MBo

Working code:

   asm
        mov eax, x
        mov edx, y
        movdqu xmm0, DQWORD PTR [eax]   //move unaligned double quadwords (xmm0 := x^)
        movdqu xmm1, DQWORD PTR [edx]   //move unaligned double quadwords (xmm1 := y^)

        paddq  xmm0, xmm1               //add packed quadword integers    (xmm0 := xmm0 + xmm1)
        movdqu DQWORD PTR [eax], xmm0   //move unaligned double quadwords (x^ := xmm0)
    end;

WriteLn(IntToStr(x.v1)+', '+IntToStr(x.v2)); prints 3, 12
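The same register-indirect approach can be extended to the array case from the question's complete example by adding a byte offset to the base register. The following is only a sketch for a 32-bit (Win32) target: the procedure name AddElements4And5 is made up, and the offset 32 is simply 4 elements times 8 bytes per UInt64.

```pascal
program ArrayOffsetSketch;

{$APPTYPE CONSOLE}

// Win32 only: uses eax/edx and an asm block inside Pascal code.

type
  TVectorUInt64 = array[0..15] of UInt64;
  PVectorUInt64 = ^TVectorUInt64;

procedure AddElements4And5; // hypothetical name
var
  x, y: PVectorUInt64;
begin
  x := GetMemory(SizeOf(TVectorUInt64));
  y := GetMemory(SizeOf(TVectorUInt64));
  x[4] := 1;  x[5] := 5;
  y[4] := 2;  y[5] := 7;

  asm
    mov    eax, x              // eax := address held in x
    mov    edx, y              // edx := address held in y
    movdqu xmm0, [eax + 32]    // elements 4 and 5 start at byte offset 4 * 8 = 32
    movdqu xmm1, [edx + 32]
    paddq  xmm0, xmm1          // x[4] + y[4] and x[5] + y[5]
    movdqu [eax + 32], xmm0    // store both sums back into x[4] and x[5]
  end;

  Writeln(x[4], ', ', x[5]);
  FreeMemory(x);
  FreeMemory(y);
end;

begin
  AddElements4And5;
end.
```

With the values above this performs the same two additions as the earlier examples (1 + 2 and 5 + 7). If the memory is known to be 16-byte aligned, movdqa could be used in place of movdqu.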

0
David Heffernan

The documentation for inline assembler in Delphi isn't as comprehensive as it should be and a lot of the functionality is simply not documented. So I can't be sure of this, but to the best of my knowledge there is simply no support for the assembler statement that you are trying to write, where one operand is a local variable of pointer type.

I would strongly urge you to avoid mixing Pascal code and assembler code in the same function. It makes it very hard to produce efficient code, and very hard to manage register usage as you move between Pascal code and assembler code in the same function.

I personally make it a rule never to mix Pascal and inline assembler. Always write pure assembler functions. For instance, for 32 bit code you would write a complete program like this:

{$APPTYPE CONSOLE}

type
  PDoubleQuadword = ^TDoubleQuadword;
  TDoubleQuadword = record
    v1: UInt64;
    v2: UInt64;
  end;

function AddDoubleQuadword(const dqw1, dqw2: TDoubleQuadword): TDoubleQuadword;
asm
  movdqu xmm0, [eax]
  movdqu xmm1, [edx]
  paddq  xmm0, xmm1
  movdqu [ecx], xmm0
end;

procedure AlignedStuff;
var
  x, y: PDoubleQuadword;
begin
  New(x);
  x.v1 := $0000000000000001;
  x.v2 := $0000000000000005;

  New(y);
  y.v1 := $0000000000000002;
  y.v2 := $0000000000000007;

  x^ := AddDoubleQuadword(x^, y^);

  Writeln(x.v1, ', ', x.v2);
end;

begin
  AlignedStuff;
  Readln;
end.

This program outputs:

3, 12

Or you could use a record with operators:

type
  PDoubleQuadword = ^TDoubleQuadword;
  TDoubleQuadword = record
    v1: UInt64;
    v2: UInt64;
    class operator Add(const dqw1, dqw2: TDoubleQuadword): TDoubleQuadword;
  end;

class operator TDoubleQuadword.Add(const dqw1, dqw2: TDoubleQuadword): TDoubleQuadword;
asm
  movdqu xmm0, [eax]
  movdqu xmm1, [edx]
  paddq  xmm0, xmm1
  movdqu [ecx], xmm0
end;

And then at the call site you have:

x^ := x^ + y^;
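For the array from the question, a variation on the pure assembler function is a procedure that takes two pointers and adds in place, so no record is copied through a function result. This is only a sketch under assumptions: AddInPlace is a hypothetical helper, the code targets 32-bit Windows, and it relies on the default register calling convention passing the first two parameters in eax and edx. It still pays for a function call per pair of additions.

```pascal
program InPlaceAddSketch;

{$APPTYPE CONSOLE}

// Win32 only; relies on the default register calling convention,
// which passes the first two parameters in eax and edx.

type
  PDoubleQuadword = ^TDoubleQuadword;
  TDoubleQuadword = record
    v1: UInt64;
    v2: UInt64;
  end;
  TVectorUInt64 = array[0..15] of UInt64;
  PVectorUInt64 = ^TVectorUInt64;

// Hypothetical helper: dst^ := dst^ + src^ (two packed 64-bit adds).
procedure AddInPlace(dst, src: PDoubleQuadword);
asm
  movdqu xmm0, [eax]   // eax = dst
  movdqu xmm1, [edx]   // edx = src
  paddq  xmm0, xmm1
  movdqu [eax], xmm0   // write both sums back to dst^
end;

var
  v: PVectorUInt64;
  i: Integer;
begin
  New(v);
  for i := 0 to 15 do
    v[i] := i;                 // v[0] = 0, v[1] = 1, ... v[15] = 15

  AddInPlace(@v[0], @v[4]);    // v[0] := v[0] + v[4]; v[1] := v[1] + v[5]
  AddInPlace(@v[2], @v[6]);    // v[2] := v[2] + v[6]; v[3] := v[3] + v[7]

  Writeln(v[0], ', ', v[1], ', ', v[2], ', ', v[3]); // 4, 6, 8, 10
  Dispose(v);
end.
```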