I have recently been working on a soft-body physics simulation based on the following paper. The implementation uses points and springs, and it involves calculating the volume of the shape, which is then used to calculate the pressure to be applied to each point.
On my MacBook Pro (2018, 13") I used the following code to calculate the volume of each soft body in the simulation, since all of the physics for the springs and mass points was being handled by a separate threadgroup:
// Gauss's theorem
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);

// No memory fence is applied, and threadgroup_barrier
// acts only as an execution barrier.
threadgroup_barrier(mem_flags::mem_none);

threadgroup float volume = 0;

// Only do this calculation once, on the first thread in the threadgroup.
if (threadIndexInThreadgroup == 0) {
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}

// mem_none is probably all that is necessary here.
threadgroup_barrier(mem_flags::mem_none);

// Do calculations that depend on volume.
With shared_memory being passed to the kernel function as a threadgroup buffer:
threadgroup float* shared_memory [[ threadgroup(0) ]]
This worked well until, much later on, I ran the code on an iPhone and an M1 MacBook, where the simulation broke down completely, with the soft bodies disappearing shortly after the application started.
The solution to this problem was to store the result of the volume sum in a threadgroup buffer, threadgroup float* volume [[ threadgroup(2) ]], and to do the volume calculation as follows:
// -*- Volume calculation -*-
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);

threadgroup_barrier(mem_flags::mem_none);

if (threadIndexInThreadgroup == 0) {
    auto sum = shared_memory[0];
    for (uint i = 1; i < threadsPerThreadgroup; ++i) {
        sum += shared_memory[i];
    }
    *volume = sum;
}

threadgroup_barrier(mem_flags::mem_none);

float epsilon = 0.000001;
float pressurev = rAB * pressure * divide(1.0, *volume + epsilon);
My question is: why would the initial method work on my MacBook but not on other hardware, and is this now the correct way of doing this? If it is wrong to allocate a float in the threadgroup address space like this, then what is the point of being able to do so?
As a side note, I am using mem_flags::mem_none since it seems unnecessary to ensure the correct ordering of memory operations to threadgroup memory in this case: I just want to make sure that each thread has written to shared_memory at this point, but the order in which they do so shouldn't matter. Is this assumption correct?
You should use mem_flags::mem_threadgroup. But I think the main problem is that Metal can't initialize threadgroup memory to zero like that (threadgroup float volume = 0;); the spec is unclear about this.
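With mem_flags::mem_none, threadgroup_barrier is an execution barrier only: it does not guarantee that one thread's writes to threadgroup memory are visible to the other threads. Your MacBook's GPU presumably happened to make the writes visible anyway, while the iPhone and M1 did not. A sketch of the reduction with explicit threadgroup-memory fences, reusing the variable names from your question (this is the fenced version of your second snippet, not a complete kernel), would look something like:

```metal
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);

// Execution barrier *and* threadgroup-memory fence: after this point,
// every thread's write to shared_memory is visible to all other
// threads in the threadgroup.
threadgroup_barrier(mem_flags::mem_threadgroup);

if (threadIndexInThreadgroup == 0) {
    float sum = shared_memory[0];
    for (uint i = 1; i < threadsPerThreadgroup; ++i) {
        sum += shared_memory[i];
    }
    *volume = sum;
}

// Fence again so that every thread sees the final value of *volume
// before using it in the pressure calculation.
threadgroup_barrier(mem_flags::mem_threadgroup);
```

Storing the result in a threadgroup buffer as you did is fine; the important part is that both barriers use mem_threadgroup so the writes are actually published to the rest of the threadgroup.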