How to do a cartesian join of a numpy array and a pandas series?

69 Views Asked by At

I have a numpy array ids (unique) and a pandas series dates.

I want to create a pandas dataframe that is a cartesian product of ids and dates with columns id and date grouped by date.

e.g., ids = [1, 2] and dates = [10032023, 10042023], with resulting dataframe:

id     date
1      10032023
2      10032023
1      10042023
2      10042023

I can't seem to figure out how to do this using the existing vectorized operations in pandas. My currently approach is just to iterate over both and assign the rows individually.

3

There are 3 best solutions below

0
darren lefcoe On

You can use the product method from the built-in itertools module to achieve this.

The cartesian product is done like this:

from itertools import product

array1 = [1, 2, 3]
array2 = ['a', 'b']

result = list(product(array1, array2))
print(result)

which results in this:

[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')]

The docs are here: https://docs.python.org/3/library/itertools.html#itertools.product

The docs point out that this is not vectorised, but in fact this:

Roughly equivalent to nested for-loops in a generator expression. For example, product(A, B) returns the same as ((x,y) for x in A for y in B).

4
F-said On

pd.merge should work in this case, we should just specify how=cross to do cross product. We also need to convert the initial ndarray to a Series and ensure both are named. If any series is unnamed we could just handle this with series.name = "hoopla".

Here's a demo, assuming you already have a series and numpy array created, we would only need the last 2 lines:

import pandas as pd
import numpy as np

arr = np.array([1,2])

df2 = pd.Series([10032023, 10042023], name="df2")

df1 = pd.Series(arr, name="df1")
cross_prod = pd.merge(df1, df2, how="cross")

Hope it helps!

0
Panda Kim On

use MultiIndex.from_product

ids = [1, 2]
dates = [10032023, 10042023]
out = pd.MultiIndex.from_product([dates, ids]).to_frame(index=False)\
        .reindex([1, 0], axis=1).set_axis(['id', 'date'], axis=1)

out

    id  date
0   1   10032023
1   2   10032023
2   1   10042023
3   2   10042023