Aggregrating muliple atrributes in Apache pig script

13 Views Asked by At

I have some grouped data where each group looks like this

(2023-02-27,
{(2023-02-27T00:00:00.000Z,2023-02-27,6736,358,2,887,92477,6736,9307,19.44),
(2023-02-27T00:00:00.000Z,2023-02-27,8765,348,3,27,93970,8765,5156,3.4),
(2023-02-27T00:00:00.000Z,2023-02-27,8765,348,3,27,85001,8765,4841,0.34),
(2023-02-27T00:00:00.000Z,2023-02-27,9010,497,2,26,86009,9010,318,13.11),
(2023-02-27T00:00:00.000Z,2023-02-27,1348,819,3,833,87511,1348,10020,17.57),
(2023-02-27T00:00:00.000Z,2023-02-27,9642,261,2,221,95144,9642,5983,1.61),
(2023-02-27T00:00:00.000Z,2023-02-27,9642,261,2,221,84270,9642,5175,0.3),
(2023-02-27T00:00:00.000Z,2023-02-27,9642,261,2,221,92821,9642,8775,12.17),
(2023-02-27T00:00:00.000Z,2023-02-27,8231,26,3,860,81634,8231,5376,10.69),
(2023-02-27T00:00:00.000Z,2023-02-27,2267,1361,2,86,88583,2267,9738,0.27),
(2023-02-27T00:00:00.000Z,2023-02-27,6736,358,2,887,85082,6736,3350,3.53),
(2023-02-27T00:00:00.000Z,2023-02-27,312,184,2,793,93376,312,3098,16.92),
(2023-02-27T00:00:00.000Z,2023-02-27,312,184,2,793,83578,312,1441,2.24)})

--The description of the above data
grouped_data: {group: chararray,joined_data: {(
formatted_order_data::order_date: datetime,
formatted_order_data::order_day: chararray,
formatted_order_data::order_id: int,
formatted_order_data::customer_id: int,
formatted_order_data::shipping_method_id: int,
formatted_order_data::dest_address_id: int,
order_line_data::line_id: chararray,
order_line_data::order_id: int,
order_line_data::book_id: int,
order_line_data::price: float)}}

I'd like to count the number of book_id, count the distinct number of order_id and sum the price for each group. How to acheive this?

The expected result would be:

order_day    num_orders  num_books    total_price
(2023-02-27,   8,          13,         101.59)

I've tried it like this but it doesn't work. It seems there is something wrong with referencing to oined_data.order_id and joined_data.price. How should I modify the script?

result = FOREACH grouped_data GENERATE
    group AS order_day,
    COUNT(joined_data) AS num_books,  -- Count the rows in the joined_data bag
    COUNT(DISTINCT joined_data.order_id) AS num_orders,
    SUM(joined_data.price) AS total_price;

0

There are 0 best solutions below