Note: I am using awkward version 1.10.3.
I have a set of data stored in awkward arrays, and I want to pass it to a simple feedforward PyTorch model. As far as I know, PyTorch doesn't natively handle awkward arrays, so I plan to convert the data to either torch tensors or numpy arrays before passing it to the model. Note that although the data is stored in awkward arrays, at this point it is not jagged.
Here is an example of the input data and of what I am looking for:
import awkward as ak
import numpy as np
import torch
arr = ak.Array({"MET_pt": [0.0, 100.0, 20.0, 30.0, 4.0],
                "MET_phi": [0, 0.1, 0.2, 0.3, 0.4],
                "class": [0, 1, 0, 1, 0]})
# These are my input features
x = arr[['MET_pt', 'MET_phi']]
# These are my class labels
y = arr['class']
#
## Here would be the code converting to torch tensors
#
x_torch = torch.tensor([[0, 0], [100, 0.1], [20, 0.2], [30, 0.3], [4, 0.4]])
y_torch = torch.tensor([0, 1, 0, 1, 0])
However, I cannot find an easy way to convert x from an awkward array to a torch tensor. I can easily convert y to a torch tensor by simply doing:
torch.tensor(y)
> tensor([0, 1, 0, 1, 0])
But I am unable to do this for the x array:
torch.tensor(x)
> TypeError: object of type 'Record' has no len()
This led me to the idea of converting to numpy arrays first:
torch.tensor(ak.to_numpy(x))
> TypeError: can't convert np.ndarray of type numpy.void. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool
But as you can see this doesn't work either.
I think the problem is that ak.to_numpy() converts the x array to a numpy structured array, with the field names stored in the dtype:
ak.to_numpy(x)
> array([( 0., 0. ), (100., 0.1), ( 20., 0.2), ( 30., 0.3), ( 4., 0.4)],
dtype=[('MET_pt', '<f8'), ('MET_phi', '<f8')])
whereas I want it to produce:
ak.to_numpy(x)
> [[0, 0], [100, 0.1], [20, 0.2], [30, 0.3], [4, 0.4]]
Is there any way of converting an N-dim non-jagged awkward array such as x into the format shown immediately above? Or is there a smarter way to convert directly to torch tensors?
Sorry if this is a stupid question! Thanks!
One approach is to convert it to a list of dictionaries using to_list(), and then read out the numerical values. Converting it directly using to_numpy() seems to result in the keys being tied up in the dtypes, which is why I opted for to_list().
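
As a rough sketch of the to_list() idea: the helper name records_to_rows below is made up for illustration, and the records list is written out by hand to mirror what x.to_list() should return for the example in the question, so the snippet runs without awkward or torch installed. The last comment lines show how I would expect it to plug into the original code.

```python
def records_to_rows(records, fields):
    """Turn a list of dicts into a list of rows.

    Each row holds one value per field, in the order given by `fields`.
    (Hypothetical helper, not part of awkward or torch.)
    """
    return [[float(rec[name]) for name in fields] for rec in records]


# Hand-written stand-in for x.to_list() on the example array:
# one dict per entry, keyed by field name.
records = [
    {"MET_pt": 0.0, "MET_phi": 0.0},
    {"MET_pt": 100.0, "MET_phi": 0.1},
    {"MET_pt": 20.0, "MET_phi": 0.2},
    {"MET_pt": 30.0, "MET_phi": 0.3},
    {"MET_pt": 4.0, "MET_phi": 0.4},
]

rows = records_to_rows(records, ["MET_pt", "MET_phi"])
print(rows)

# With awkward and torch available, the same idea would read:
#   rows = records_to_rows(x.to_list(), ["MET_pt", "MET_phi"])
#   x_torch = torch.tensor(rows)
```

Passing the field names explicitly fixes the column order, so it does not depend on the key order of the dictionaries. Going through Python lists is probably slow for large arrays, so this is best suited to small inputs like the one above.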