My Python program has multiple threads accessing the same Numpy array instance. Since Numpy releases the GIL, these threads can end up accessing the array simultaneously.
If the threads are concurrently accessing the same array element, this can clearly cause a race condition where the result depends on the specific order in which the threads happen to execute. However, in languages such as C++, concurrent conflicting memory access by multiple threads may cause a data race that results in completely undefined behaviour.
I would like to understand what semantics are guaranteed by Numpy in case of concurrent array access. Are there rules I must follow to ensure that my program has sequential consistency? What happens if I break those rules?
- If the threads simultaneously access the same array but never simultaneously access the same array element, is there any guarantee that this cannot cause a data race?
- If one thread writes an array element that another thread is simultaneously reading, can this cause the write action to fail or the written data to become corrupted?
- Is there any guarantee that the consequences of concurrent conflicting array access will be limited to the contents of the array, or can it also lead to undefined behaviour in other parts of the program or maybe crash the Python interpreter?
- Do the answers to these questions depend on the underlying machine architecture, such as x86 vs arm?
I really hope to understand what the precise rules are in these cases.
I found a similar question, but the answer only confirms that the threads can cause conflicting access; it does not explain Numpy's semantics in such cases:
Is python numpy array operation += thread safe?
Another similar question without answers:
Are ndarray assingments to different indexes threadsafe?
# Example of a program that performs simultaneous array access.
import threading
import numpy as np
a = np.zeros(100000, dtype=np.int16)
def countup():
    for i in range(10000):
        a[:] += 1

def countdown():
    for i in range(10000):
        a[:] -= 1
t1 = threading.Thread(target=countup)
t2 = threading.Thread(target=countdown)
t1.start()
t2.start()
t1.join()
t2.join()
# Some elements of the array will be non-zero.
print(np.amin(a), np.amax(a), np.sum(a != 0))
Yes, race conditions can occur on Numpy arrays when the target Numpy function releases the GIL and multiple threads access the same array with at least one of them writing to it. Note that what matters is access to the internal Numpy data buffer, which can be shared by multiple array views. Besides, AFAIK most Numpy functions release the GIL.
As long as there is a synchronization mechanism (e.g. locks or atomics, or more specifically the memory barriers they imply) enforcing that, this is fine: multiple threads can access different parts of the internal buffer. The cache coherence protocol ensures that cache lines stay coherent between the L1 caches of the different cores, so coherence between software threads is guaranteed.
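For example, here is a minimal sketch (not from the original post) in which two threads write disjoint halves of the same buffer; the join() calls act as the synchronization points, so the main thread reliably observes both writes afterwards:

```python
import threading
import numpy as np

a = np.zeros(1000, dtype=np.int64)

def fill(view, value):
    # Each thread writes only its own disjoint slice of the shared buffer.
    view[:] = value

t1 = threading.Thread(target=fill, args=(a[:500], 1))
t2 = threading.Thread(target=fill, args=(a[500:], 2))
t1.start(); t2.start()
# join() is the synchronization point: after it, the main thread is
# guaranteed to see the writes performed by the joined thread.
t1.join(); t2.join()

print(a[0], a[-1])  # 1 2
```

Because the slices are disjoint views of the same underlying buffer and the threads are joined before reading, the result is deterministic.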
Technically, in C and C++ a data race is undefined behaviour, and Numpy inherits this behaviour because it is written mostly in C.
Indeed, when there is no synchronization mechanism, a thread on a core may operate on dirty data that has been invalidated, and a race condition can occur because of that. This often happens because threads temporarily store items in (SIMD) registers while a cache line can be invalidated in the meantime. Read-modify-write x86 instructions are not atomic unless a lock prefix is explicitly used. Numpy never uses atomic instructions for basic array operations because they are generally (far) slower (and they would not solve all kinds of race conditions anyway).
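As a sketch of such a synchronization mechanism, the countup/countdown example from the question becomes deterministic if the non-atomic read-modify-write is serialized with a threading.Lock (used here purely for illustration; it also removes any parallelism):

```python
import threading
import numpy as np

a = np.zeros(100000, dtype=np.int16)
lock = threading.Lock()

def countup():
    for _ in range(1000):
        with lock:       # a[:] += 1 is a non-atomic read-modify-write
            a[:] += 1

def countdown():
    for _ in range(1000):
        with lock:
            a[:] -= 1

t1 = threading.Thread(target=countup)
t2 = threading.Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()

# With the lock, increments and decrements cancel out exactly.
print(np.amin(a), np.amax(a))  # 0 0
```

Each whole-array update now happens atomically with respect to the other thread, so the equal number of +1 and -1 passes leaves every element at zero.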
AFAIK, on x86 writes never fail, but threads can still operate on corrupted data (and then indirectly write corrupted data). Indeed, unaligned writes, for example, are not guaranteed to be done atomically, so a thread can read a partially updated item. This happens for Numpy arrays containing strings, for example (and possibly arrays with the complex type, which may not be aligned internally). If you play with low-level Numpy views (e.g. ndarray.view), then I think you can get arrays containing items that are not aligned. On other platforms, storing an item is not guaranteed to be done atomically at all (e.g. a CPU can perform multiple memory requests for a single item). You should really not rely on such behaviour for the sake of portability, especially in a Python program using Numpy.

Note that when multiple threads access exclusive items of the same cache line with at least one of them writing, the cache coherence protocol ensures the accesses are coherent, but this coherence mechanism is particularly expensive (internal low-level synchronization between cores significantly increases the latency of memory operations). This effect is called false sharing.
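Returning to alignment: here is a minimal sketch (using np.frombuffer with a byte offset, chosen only for illustration) that builds an array whose items are not naturally aligned, which Numpy reports via the ALIGNED flag:

```python
import numpy as np

# A normally allocated array: Numpy guarantees its items are aligned.
aligned = np.zeros(2, dtype=np.int64)

# View a writable buffer starting 1 byte in: the int64 items now start
# at an odd address, so they are misaligned (and may straddle cache lines).
buf = bytearray(17)
unaligned = np.frombuffer(buf, dtype=np.int64, offset=1)

print(aligned.flags['ALIGNED'])    # True
print(unaligned.flags['ALIGNED'])  # False
```

Stores to such a misaligned array are exactly the kind of writes that are not guaranteed to be atomic, so a concurrent reader could observe a partially updated item.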
Yes, as long as the threads do not concurrently work on views sharing the same internal Numpy data buffer (or a part of it). The structures of the CPython interpreter are protected by the GIL. AFAIK, Numpy releases the GIL only when a low-level C computation is performed and that computation does not access interpreter structures; when it does, it must not release the GIL (otherwise, it would be a bug).
Overall, the observed effects can differ, but the presence of a race condition (as specified above, or as specified in the C/C++ languages) is independent of the architecture. Consequently, a program with a race condition can behave correctly on x86 but incorrectly on ARM, for example (data corruption or a crash). One reason is the atomicity of loads/stores; another is that x86 has a stronger memory-ordering model than ARM (and most other architectures). See this article for more information.