For some time now I have been failing to call `df.to_numpy(allow_copy=False)`.
Is there a procedure that transforms any given dataset into "zero_copy suitable" one?
For list-like values I tried
```python
import numpy as np
import polars as pl

N = 100  # example row count

data = pl.DataFrame(
    dict(
        points=np.random.sample((N, 3)),
        color=np.random.sample((N, 4)),
    ),
    schema=dict(
        points=pl.Array(pl.Float64, 3),
        color=pl.Array(pl.Float64, 4),
    ),
)
```
or simply `expr.cast(pl.Array(pl.Float32, 4))` as suggested here. It works for one of my datasets, but fails for a different one with a slightly different build.
Calling `rechunk()`, having no null values, and/or specifying `order="c"` or `"fortran"` also seems to have no effect.
This is a generalization of my previous question that was perhaps too specific to get a real answer.
No, that operation would itself copy. Numpy matrices are contiguously allocated in a single allocation, whereas Polars allocates (mostly) contiguously per column. That means that if the memory is allocated by Polars, it cannot be transferred zero-copy into a 2D numpy matrix.
Polars columns and Polars Array types can be moved zero-copy to numpy.
Moving data from numpy to Polars and back.
A 2D matrix from numpy in fortran order, can be moved zero-copy to Polars and again zero-copy back to numpy. This works because the original numpy allocation will not be changed. All Polars columns point into the numpy contiguous array memory.
Why only F-order?
Because Polars is a columnar query engine, we only store F-contiguous data. Otherwise we would need to skip rows on column traversal. That would duplicate all code, or add an indirection for traversal, which would be much slower. Aside from that, you would then have row values in your cache line, making all columnar algorithms much, much slower.

C-contiguous

In C-contiguous memory, row values are back to back. If we need `column_value` we must skip `n` `row_x` slots, where `n` is proportional to the number of columns.

F-contiguous

In F-contiguous memory, column slots are back to back, leading to fast, cache-efficient reading.

So zero-copy C-contiguous DataFrames come with a performance cost for most operations (which are columnar) and might even trigger a copy if the underlying columnar algorithms expect slices with data back to back (which isn't far-fetched).

TLDR;

DataFrame implementations that support zero-copy of C-order data pay a price in cache thrashing, or make implicit copies if the algorithms expect a slice.
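The layout difference described above can be seen directly in numpy's strides (pure numpy, no Polars involved):

```python
import numpy as np

# Same data, two memory layouts.
c = np.arange(12, dtype=np.float64).reshape(4, 3)  # C-order (row-major)
f = np.asfortranarray(c)                           # F-order (column-major)

# In C-order, consecutive elements of a *row* are adjacent (one itemsize apart);
# stepping down a column skips a whole row (n_cols * itemsize bytes).
assert c.strides == (24, 8)   # 3 cols * 8 bytes, 8 bytes

# In F-order, consecutive elements of a *column* are adjacent,
# which is exactly what a columnar engine wants to traverse.
assert f.strides == (8, 32)   # 8 bytes, 4 rows * 8 bytes
```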