I'm trying to load a 128-bit xmm register with two UInt64 integers in Delphi (XE6).
Background
An XMM register is 128 bits wide, and can be loaded with multiple independent integers. You can then have the CPU add those integers in parallel.
For example you can load up xmm0 and xmm1 with four UInt32s each, and then have the CPU add all four pairs simultaneously.
xmm0: $00001000 $00000100 $00000010 $00000001
          +         +         +         +
xmm1: $00002000 $00000200 $00000020 $00000002
          =         =         =         =
xmm0: $00003000 $00000300 $00000030 $00000003
After loading xmm0 and xmm1, you perform the add of the four pairs using:
paddd xmm0, xmm1 //Add packed 32-bit integers (i.e. xmm0 := xmm0 + xmm1)
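To make the lane-by-lane behaviour concrete, here is a plain Pascal model of what paddd computes. It is only a sketch: the names TLanes32 and AddPacked32 are made up for illustration, and it uses no SSE at all, just four independent 32-bit additions with no carry between lanes.

```pascal
program PackedAddModel;
{$APPTYPE CONSOLE}
{$Q-} // overflow checks off: each lane wraps modulo 2^32, as paddd does

uses
  System.SysUtils;

type
  // Hypothetical type: four independent 32-bit lanes, index 0 = lowest lane
  TLanes32 = array[0..3] of UInt32;

// Scalar model of "paddd dst, src": lane i of the result depends only on
// lane i of the inputs; nothing carries over into the neighbouring lane.
function AddPacked32(const a, b: TLanes32): TLanes32;
var
  i: Integer;
begin
  for i := 0 to 3 do
    Result[i] := a[i] + b[i];
end;

const
  X0: TLanes32 = ($00000001, $00000010, $00000100, $00001000);
  X1: TLanes32 = ($00000002, $00000020, $00000200, $00002000);
var
  r: TLanes32;
  i: Integer;
begin
  r := AddPacked32(X0, X1);
  // Print from the highest lane down, matching the diagram above
  for i := 3 downto 0 do
    Write('$', IntToHex(r[i], 8), ' ');
  WriteLn; // $00003000 $00000300 $00000030 $00000003
end.
```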
You could also do it using 8 x 16-bit integers:
xmm0: $001F $0013 $000C $0007 $0005 $0003 $0002 $0001
        +     +     +     +     +     +     +     +
xmm1: $0032 $001F $0013 $000C $0007 $0005 $0003 $0002
        =     =     =     =     =     =     =     =
xmm0: $0051 $0032 $001F $0013 $000C $0008 $0005 $0003
With the instruction
paddw xmm0, xmm1 //Add packed 16-bit integers
Now for 64-bit integers
To load two 64-bit integers into an xmm register, you have to use either:
- movdqu: Move double-quadword (unaligned)
- movdqa: Move double-quadword (aligned)
In this simple example we won't worry about our UInt64s being aligned, and we'll simply use the unaligned version (movdqu).
The first thing we have to deal with is that the Delphi compiler knows movdqu needs a 128-bit operand, since it loads a double quadword.
For this we will create a 128-bit structure, which also nicely lets us address the two 64-bit values:
TDoubleQuadword = packed record
  v1: UInt64; //value 1
  v2: UInt64; //value 2
end;
And now we can use this type in a test console app:
procedure Main;
var
  x, y: TDoubleQuadword;
begin
  //[1,5] + [2,7] = ?
  x.v1 := $0000000000000001;
  x.v2 := $0000000000000005;
  y.v1 := $0000000000000002;
  y.v2 := $0000000000000007;
  asm
    movdqu xmm0, x    //move unaligned double quadwords (xmm0 := x)
    movdqu xmm1, y    //move unaligned double quadwords (xmm1 := y)
    paddq xmm0, xmm1  //add packed quadword integers (xmm0 := xmm0 + xmm1)
    movdqu x, xmm0    //move unaligned double quadwords (x := xmm0)
  end;
  WriteLn(IntToStr(x.v1)+', '+IntToStr(x.v2));
end;
And this works, printing out:
3, 12
Eye on the prize
With an eye towards the goal of having x and y aligned (though that is not a necessary part of my question), let's say we have a pointer to a TDoubleQuadword structure:
TDoubleQuadword = packed record
  v1: UInt64; //value 1
  v2: UInt64; //value 2
end;
PDoubleQuadword = ^TDoubleQuadword;
We now change our hypothetical test function to use PDoubleQuadword:
procedure AlignedStuff;
var
  x, y: PDoubleQuadword;
begin
  x := GetMemory(sizeof(TDoubleQuadword));
  x.v1 := $0000000000000001;
  x.v2 := $0000000000000005;
  y := GetMemory(sizeof(TDoubleQuadword));
  y.v1 := $0000000000000002;
  y.v2 := $0000000000000007;
  asm
    movdqu xmm0, x    //move unaligned double quadwords (xmm0 := x)
    movdqu xmm1, y    //move unaligned double quadwords (xmm1 := y)
    paddq xmm0, xmm1  //add packed quadword integers (xmm0 := xmm0 + xmm1)
    movdqu x, xmm0    //move unaligned double quadwords (x := xmm0)
  end;
  WriteLn(IntToStr(x.v1)+', '+IntToStr(x.v2));
end;
Now this doesn't compile, and it makes sense why:
movdqu xmm0, x //E2107 Operand size mismatch
The operand must be 128 bits wide, and the compiler knows that x is really only a (32-bit) pointer.
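The size mismatch can be seen from ordinary Pascal too. This small sketch just prints the two sizes involved; it assumes nothing beyond the types already declared above.

```pascal
program OperandSizes;
{$APPTYPE CONSOLE}

type
  TDoubleQuadword = packed record
    v1: UInt64; //value 1
    v2: UInt64; //value 2
  end;
  PDoubleQuadword = ^TDoubleQuadword;

begin
  // The record is what movdqu wants: 16 bytes = 128 bits
  WriteLn('SizeOf(TDoubleQuadword) = ', SizeOf(TDoubleQuadword));
  // The pointer is only as wide as an address: 4 bytes when compiling
  // for 32-bit, 8 bytes when compiling for 64-bit
  WriteLn('SizeOf(PDoubleQuadword) = ', SizeOf(PDoubleQuadword));
end.
```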
But what should it be?
Now we come to my question: what should it be? I've randomly mashed various things on my keyboard, hoping that the compiler gods would just accept what i obviously mean. But nothing works.
//Don't try to pass the 32-bit pointer itself, pass the thing it points to:
movdqu xmm0, x^ //E2107 Operand size mismatch
//Try casting it
movdqu xmm0, TDoubleQuadword(x^) //E2105 Inline assembler error
//i've seen people using square brackets to mean "contents of":
movdqu xmm0, [x] //E2107 Operand size mismatch
And now we give up on rational thought:
movdqu xmm0, Pointer(x)
movdqu xmm0, Addr(x^)
movdqu xmm0, [Addr(x^)]
movdqu xmm0, [Pointer(TDoubleQuadword(x))^]
I did get one thing to compile:
movdqu xmm0, TDoubleQuadword(x)
But of course that loads the pointer value stored in x (plus whatever bytes happen to follow it) into the register, rather than the values x points to.
So i give up.
Complete Minimal Example
program Project3;
{$APPTYPE CONSOLE}
{$R *.res}
uses
  System.SysUtils;
type
  TDoubleQuadword = packed record
    v1: UInt64; //value 1
    v2: UInt64; //value 2
  end;
  PDoubleQuadword = ^TDoubleQuadword;
  TVectorUInt64 = array[0..15] of UInt64;
  PVectorUInt64 = ^TVectorUInt64;

procedure AlignedStuff;
var
  x, y: PVectorUInt64;
begin
  x := GetMemory(sizeof(TVectorUInt64));
  //x[0] := ...
  //x[1] := ...
  // ...
  //x[3] := ...
  x[4] := $0000000000000001;
  x[5] := $0000000000000005;
  y := GetMemory(sizeof(TVectorUInt64));
  //y[0] := ...
  //y[1] := ...
  // ...
  //y[3] := ...
  y[4] := $0000000000000002;
  y[5] := $0000000000000007;
  asm
    movdqu xmm0, TDoubleQuadword(x[4])  //move unaligned double quadwords (xmm0 := x[4..5])
    movdqu xmm1, TDoubleQuadword(y[4])  //move unaligned double quadwords (xmm1 := y[4..5])
    paddq xmm0, xmm1                    //add packed quadword integers (xmm0 := xmm0 + xmm1)
    movdqu TDoubleQuadword(x[4]), xmm0  //move unaligned double quadwords (x[4..5] := xmm0)
  end;
  WriteLn(IntToStr(x[4])+', '+IntToStr(x[5]));
end;

begin
  try
    AlignedStuff;
    Writeln('Press enter to close...');
    Readln;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
Pointer?
The question asks about pointers because:
- you cannot use stack variables (Delphi doesn't guarantee alignment of stack variables)
- you could copy them into a register (e.g. EAX), but then you're doing a wasted copy and function call
- i already have the data aligned in memory
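As an illustration of what "aligned in memory" means here, the sketch below checks an address with a mask and shows one common way to obtain aligned memory (over-allocate, then round the address up). The helper names IsAligned and AllocAligned are hypothetical and are not the allocator used in this question; the alignment is assumed to be a power of two.

```pascal
program AlignedAllocSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: True when P lies on an Alignment-byte boundary.
// Alignment must be a power of two (16 for movdqa, 64 for a cache line).
function IsAligned(P: Pointer; Alignment: NativeUInt): Boolean;
begin
  Result := (NativeUInt(P) and (Alignment - 1)) = 0;
end;

// Hypothetical helper: allocates Size + Alignment - 1 bytes and rounds the
// address up to the next boundary. Raw receives the original pointer,
// which is the one that must later be passed to FreeMem.
function AllocAligned(Size: Integer; Alignment: NativeUInt;
  out Raw: Pointer): Pointer;
begin
  GetMem(Raw, Size + Integer(Alignment) - 1);
  Result := Pointer((NativeUInt(Raw) + Alignment - 1) and not (Alignment - 1));
end;

var
  Raw, P: Pointer;
begin
  P := AllocAligned(16 * SizeOf(UInt64), 64, Raw);
  WriteLn(IsAligned(P, 64)); // TRUE by construction
  FreeMem(Raw);              // free the raw block, not the rounded pointer
end.
```

Stack variables give no such guarantee, which is why the data lives behind a pointer in the first place.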
Here is an example of the code, reduced to just the UInt64 additions:
TVectorUInt64 = array[0..15] of UInt64;
PVectorUInt64 = ^TVectorUInt64;
var
v: PVectorUInt64;
begin
v := GetMemoryAligned(sizeof(TVectorUInt64), 64); //64-byte alignment
//v is initialized
for i := 0 to 15 do
begin
v[0] := v[0] + v[4];
v[1] := v[1] + v[5];
v[2] := v[2] + v[6];
v[3] := v[3] + v[7];
//..and some more changes to v0..v3
//..and some more changes to v12..v15
v[8] := v[8] + v[12];
v[9] := v[9] + v[13];
v[10] := v[10] + v[14];
v[11] := v[11] + v[15];
//...and some more changes to v4..v7
v[0] := v[0] + v[4];
v[1] := v[1] + v[5];
v[2] := v[2] + v[6];
v[3] := v[3] + v[7];
//...and some more changes to v0..v3
//...and some more changes to v12..v15
v[8] := v[8] + v[12];
v[9] := v[9] + v[13];
v[10] := v[10] + v[14];
v[11] := v[11] + v[15];
//...and some more changes to v4..v7
v[0] := v[0] + v[4];
v[1] := v[1] + v[5];
v[2] := v[2] + v[6];
v[3] := v[3] + v[7];
//..and some more changes to v0..v3
//..and some more changes to v12..v15
v[8] := v[8] + v[12];
v[9] := v[9] + v[13];
v[10] := v[10] + v[14];
v[11] := v[11] + v[15];
//...and some more changes to v4..v7
v[0] := v[0] + v[4];
v[1] := v[1] + v[5];
v[2] := v[2] + v[6];
v[3] := v[3] + v[7];
//...and some more changes to v0..v3
//...and some more changes to v12..v15
v[8] := v[8] + v[12];
v[9] := v[9] + v[13];
v[10] := v[10] + v[14];
v[11] := v[11] + v[15];
//...and some more changes to v4..v7
end;
It is conceptually very easy to change the code to:
//v[0] := v[0] + v[4];
//v[1] := v[1] + v[5];
asm
movdqu xmm0, v[0]
movdqu xmm1, v[4]
paddq xmm0, xmm1
movdqu v[0], xmm0
end
//v[2] := v[2] + v[6];
//v[3] := v[3] + v[7];
asm
movdqu xmm0, v[2]
movdqu xmm1, v[6]
paddq xmm0, xmm1
movdqu v[2], xmm0
end
//v[8] := v[8] + v[12];
//v[9] := v[9] + v[13];
asm
movdqu xmm0, v[8]
movdqu xmm1, v[12]
paddq xmm0, xmm1
movdqu v[8], xmm0
end
//v[10] := v[10] + v[14];
//v[11] := v[11] + v[15];
asm
movdqu xmm0, v[10]
movdqu xmm1, v[14]
paddq xmm0, xmm1
movdqu v[10], xmm0
end
The trick is getting the Delphi compiler to accept it.
- it works for immediate data
- it fails for pointer to data
- and you would think [contentsOfSquareBrackets] would work
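Whatever form the asm ends up taking, its result can be checked against a scalar reference. The sketch below models one movdqu/movdqu/paddq/movdqu group in plain Pascal; AddPairScalar is a hypothetical name introduced only for this illustration.

```pascal
program ScalarReference;
{$APPTYPE CONSOLE}
{$Q-} // overflow checks off so each lane wraps modulo 2^64, like paddq

type
  TVectorUInt64 = array[0..15] of UInt64;
  PVectorUInt64 = ^TVectorUInt64;

// Hypothetical helper: scalar equivalent of
//   movdqu xmm0, v[dst]; movdqu xmm1, v[src]; paddq xmm0, xmm1; movdqu v[dst], xmm0
// i.e. two adjacent 64-bit lanes are added independently.
procedure AddPairScalar(v: PVectorUInt64; dst, src: Integer);
begin
  v[dst]     := v[dst]     + v[src];
  v[dst + 1] := v[dst + 1] + v[src + 1];
end;

var
  v: PVectorUInt64;
begin
  New(v);
  FillChar(v^, SizeOf(v^), 0);
  v[0] := 1; v[1] := 5;
  v[4] := 2; v[5] := 7;
  AddPairScalar(v, 0, 4);     // v[0] := v[0] + v[4]; v[1] := v[1] + v[5]
  WriteLn(v[0], ', ', v[1]);  // 3, 12
  Dispose(v);
end.
```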
Bonus Chatter
Using David's solution (which incurs function-call overhead) leads to a performance change of -7% (90 MB/s -> 83 MB/s of algorithm throughput).
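For readers who have not seen that style, a function-based approach generally looks something like the sketch below. This is not necessarily David's exact code; the name AddDoubleQuadword is hypothetical, and the sketch assumes the 32-bit compiler with the default register calling convention, where the first two pointer parameters arrive in EAX and EDX.

```pascal
program AddViaFunction;
{$APPTYPE CONSOLE}
{$IFNDEF CPU386}
  {$MESSAGE FATAL 'This sketch assumes the 32-bit x86 compiler'}
{$ENDIF}

type
  TDoubleQuadword = packed record
    v1: UInt64; //value 1
    v2: UInt64; //value 2
  end;
  PDoubleQuadword = ^TDoubleQuadword;

// Hypothetical helper, written as a pure asm routine.
// Assumption: register calling convention, so x is in EAX and y is in EDX.
// The pointers are already in registers, so square brackets dereference them.
procedure AddDoubleQuadword(x, y: PDoubleQuadword);
asm
  movdqu xmm0, [eax]   // xmm0 := x^
  movdqu xmm1, [edx]   // xmm1 := y^
  paddq  xmm0, xmm1    // add the two 64-bit lanes independently
  movdqu [eax], xmm0   // x^ := xmm0
end;

var
  a, b: TDoubleQuadword;
begin
  a.v1 := 1; a.v2 := 5;
  b.v1 := 2; b.v2 := 7;
  AddDoubleQuadword(@a, @b);
  WriteLn(a.v1, ', ', a.v2);
end.
```

The cost is one call per 128-bit add, which is where the overhead mentioned above would come from.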
It seems like, in the XE6 compiler, it is valid to conceptually call:
movdqu xmm0, TPackedQuadword
but the compiler just doesn't have the brains to let you perform the conceptual call:
movdqu xmm0, PPackedQuadword^
or its moral equivalent.
If that's the answer, don't be afraid of it. Embrace it, and put it in the form of an answer:
"The compiler does not support dereferencing a pointer inside an asm block, whether you try it with a caret (^) or square brackets ([...]). It just cannot be done."
If that's the answer: answer it.
If it's not the case, and the compiler can support pointers in an asm block, then post the answer.
Working code: