How to get combination result by one column history sql snowflake

102 Views Asked by At

I have a table history task of user, seems like this. 1 user have 9 task.

user_id     task    IS_COMPLETED updated_at
123       Task 1    1            2024-01-01
123       Task 2    1            2024-01-01
123       Task 3    0            2024-01-01
123       Task 4    0            2024-01-01
123       Task 5    0            2024-01-01
123       Task 6    1            2024-01-01
123       Task 7    1            2024-01-01
123       Task 8    1            2024-01-01
123       Task 9    1            2024-01-01

I want to grouping with possible combination of the task, to become table like this. So I know what kind of task combinations are quickly completed by users.

combination     total_user_completed
task1_task2         20
task1_task3         15 
task1_task_4        15
task1_task_5        14
:                    : (and so on for 2 combination) 
task1_task2_task3   10
task4_task5_task_6  11
:                   : (and so on for 3 combination) 

(for more detail about combination example: from A,B,C i got combination between A-B,B-C,A-C, and A-B-C. No need to be sequential)

I have tried with recursive sql but it didn't work well

"Recursive Join ran out of memory. Please re-run this query on a larger warehouse"

Then I also try to make it with case when, but it's not ideal i need to define one by one like this.

    with all_task as (
            select 
                distinct user_id,
                task_name as task, 
                is_completed,
                updated_at
            from task
            where 
                is_completed = 1
            group by all
          ),
        agg as (
            select 
                distinct user_id, 
                activated_at as period,
                max(case when task in ('Task 1') and IS_COMPLETED = 1 then date(updated_at) end) as task_1,
                max(case when task in ('Task 2') and IS_COMPLETED = 1 then date(updated_at) end) as task_2,
                max(case when task in ('Task 3') and IS_COMPLETED = 1 then date(updated_at) end) as task_3,
                max(case when task in ('Task 4') and IS_COMPLETED = 1 then date(updated_at) end) as task_4,
                max(case when task in ('Task 5') and IS_COMPLETED = 1 then date(updated_at) end) as task_5,
                max(case when task in ('Task 6') and IS_COMPLETED = 1 then date(updated_at) end) as task_6,
                max(case when task in ('Task 7') and IS_COMPLETED = 1 then date(updated_at) end) as task_7,
                max(case when task in ('Task 8') and IS_COMPLETED = 1 then date(updated_at) end) as task_8,
                max(case when task in ('Task 9') and IS_COMPLETED = 1 then date(updated_at) end) as task_9
            from all_task 
            group by all
            ),
            task_group as (
            select user_id, 
                case when task_1 is not null and task_2 is not null then user_id end as task_1_2
                case when task_2 not null and task_3 is not null then user_id end as task_2_3
                .........
            from agg
             )
            select 'Task1_task2' as combination, 
            count(distinct task_1_2) as total
            from task_group

            union all 

            select 'task2_task3' as combination, 
            count(distinct task_2_3) as total
            from task_group
.... (and so on)

Is there any suggestion to solve this? Thanks a lot!

*Note for now I need at least 2/3 combination of task. Thank you

3

There are 3 best solutions below

0
Simeon Pilgrim On BEST ANSWER

so with a CTE for test data:

with data(user_id, task, is_completed, updated_at) as (
    select * from values
    (123, 'Task 1', 1, '2024-01-01'::date),
    (123, 'Task 2', 1, '2024-01-01'::date),
    (123, 'Task 3', 0, '2024-01-01'::date),
    (123, 'Task 4', 0, '2024-01-01'::date),
    (123, 'Task 5', 0, '2024-01-01'::date),
    (123, 'Task 6', 1, '2024-01-01'::date),
    (123, 'Task 7', 1, '2024-01-01'::date),
    (123, 'Task 8', 1, '2024-01-01'::date),
    (123, 'Task 9', 1, '2024-01-01'::date)
)

and a CTE for the combinations of 2:

), combinations_2 as (
    select
        a.user_id, 
        a.task || '_' || b.task as combi
    from data as a
    join data as b
        on a.user_id = b.user_id 
        --and a.updated_at = b.updated_at
        and a.is_completed = 1 and b.is_completed = 1
        and a.task < b.task
),

enter image description here

and a CTE for the 3's:

), combinations_3 as (
    select
        a.user_id, 
        a.task || '_' || b.task || '_' || c.task as combi
    from data as a
    join data as b
        on a.user_id = b.user_id 
        --and a.updated_at = b.updated_at
        and a.is_completed = 1 and b.is_completed = 1
        and a.task < b.task
    join data as c
        on a.user_id = c.user_id 
        --and a.updated_at = c.updated_at
        and a.is_completed = 1 and c.is_completed = 1
        and b.task < c.task
),

enter image description here

union those together:

), merged as (
    select * from combinations_2
    union all 
    select * from combinations_3
)

then count:

select 
    combi, 
    count(user_id) 
from merged
group by 1
order by 1;

enter image description here

Now you might want to use a CTE to filter the is_completed = 0 rows, but that might also cause sync between the uses, and cause performance loss. Another method would be to build the 2 combi's and and also keep the b.task and then do the third round join against that, that might perform better.

I also assumed the updated_at are meaningless, and that there are not duplicate dates for the same ID/Task, thus again it might be better to preprocess the data to remove those. Thus:

with data(user_id, task, is_completed, updated_at) as (
    select * from values
    (123, 'Task 1', 1, '2024-01-01'::date),
    (123, 'Task 2', 1, '2024-01-01'::date),
    (123, 'Task 3', 0, '2024-01-01'::date),
    (123, 'Task 4', 0, '2024-01-01'::date),
    (123, 'Task 5', 0, '2024-01-01'::date),
    (123, 'Task 6', 1, '2024-01-01'::date),
    (123, 'Task 7', 1, '2024-01-01'::date),
    (123, 'Task 8', 1, '2024-01-01'::date),
    (123, 'Task 9', 1, '2024-01-01'::date)
), cleaned as (
    select distinct user_id, task
    from data
    where is_completed = 1
), combinations_2 as (
    select
        a.user_id, 
        a.task || '_' || b.task as combi
    from cleaned as a
    join cleaned as b
        on a.user_id = b.user_id 
        and a.task < b.task
), combinations_3 as (
    select
        a.user_id, 
        a.task || '_' || b.task || '_' || c.task as combi
    from cleaned as a
    join cleaned as b
        on a.user_id = b.user_id 
        and a.task < b.task
    join cleaned as c
        on a.user_id = c.user_id 
        and b.task < c.task
), merged as (
    select * from combinations_2
    union all 
    select * from combinations_3
)
select 
    combi, 
    count(user_id) 
from merged
group by 1
order by 1;
0
Teun Kruijer On

For MS SQL Server 2016: (Should work for Snowflake)

I think I have a way to get what you want. It requires manually specifying amount of permutations and will not create all of the possible ones. This would also create a pretty large table increasing exponentially as the number of tasks increases.

Here I have a working solution for 3 permutations of the task column. You can add more by adding to the code.

It works by getting the DISTINCT tasks and CROSS JOINING these on each other while also checking that a permutations doesn't exist more than once. Then we get the number of completed tasks and LEFT JOIN these on the permutations to get the completed amount of tasks per permutation

-- Create a temporary table
CREATE TABLE #TempTable
(
    user_id INT,
    task VARCHAR(50),
    IS_COMPLETED BIT,
    updated_at DATE
);

-- Insert data into the temporary table
INSERT INTO #TempTable
(
    user_id,
    task,
    IS_COMPLETED,
    updated_at
)
VALUES
(123, 'Task 1', 1, '2024-01-01'),
(123, 'Task 2', 1, '2024-01-01'),
(123, 'Task 3', 0, '2024-01-01'),
(123, 'Task 4', 0, '2024-01-01'),
(123, 'Task 5', 0, '2024-01-01'),
(123, 'Task 6', 1, '2024-01-01'),
(123, 'Task 7', 1, '2024-01-01'),
(123, 'Task 8', 1, '2024-01-01'),
(123, 'Task 9', 1, '2024-01-01');

;with cteAllColumns
as (select DISTINCT
        Task as col
    from #TempTable
   )
-- See commented code for adding more permutations

select c1.col as 'c1.task',
       c2.col as 'c2.task',
       c3.col as 'c3.task',
       --c4.col as 'c4.task',
       ISNULL(SUM(result.completed), 0) as 'combination completed'
from cteAllColumns c1
    cross join cteAllColumns c2
    cross join cteAllColumns c3

    --cross join cteAllColumns c4

    LEFT JOIN
    (
        SELECT task,
               COUNT(task) as 'completed'
        FROM #TempTable
        WHERE IS_COMPLETED = 1
        GROUP BY task
    ) result
        ON result.task = c1.col
           OR result.task = c2.col
           OR result.task = c3.col
--OR result.task = c4.col


where c1.col < c2.col
      AND c2.col < c3.col
--AND c3.col < c4.col

GROUP BY c1.col,
         c2.col,
         c3.col --,c4.col

ORDER BY c1.col,
         c2.col,
         c3.col --,c4.col


DROP TABLE #TempTable;
0
Andrei Odegov On

First, let's create a table called "tasks" and insert data into it.

create table "tasks"(
  user_id int not null,
  "task" varchar(10) not null,
  IS_COMPLETED numeric(1) not null,
  updated_at date not null,
  primary key(user_id, "task")
);

insert into "tasks" values
  (42, 'Task 1', 1, '2023-12-29'),
  (42, 'Task 7', 1, '2023-12-29'),
  (42, 'Task 9', 0, '2023-12-29'),
  (123, 'Task 1', 1, '2024-01-01'),
  (123, 'Task 2', 1, '2024-01-01'),
  (123, 'Task 3', 0, '2024-01-01'),
  (123, 'Task 4', 0, '2024-01-01'),
  (123, 'Task 5', 0, '2024-01-01'),
  (123, 'Task 6', 1, '2024-01-01'),
  (123, 'Task 7', 1, '2024-01-01'),
  (123, 'Task 8', 1, '2024-01-01'),
  (123, 'Task 9', 1, '2024-01-01'),
  (170, 'Task 7', 1, '2024-01-09'),
  (170, 'Task 8', 1, '2024-01-09');

The result set of the b CTE contains rows for completed tasks. The pow() function raises 2 to the power of the dense_rank() window function for further using in the bitand() function.

with recursive
  b as (
    select
      user_id, updated_at, "task",
      pow(2, dense_rank() over(partition by user_id, updated_at
                               order by "task")-1) as pow2
    from "tasks"
    where is_completed = 1
  ),

The u CTE combains rows of the b CTE with rows of the b CTE for all values of the user_id column with of the b CTE rows for all values of the updated_at column with rows of the b CTE for all values of the "task" column. For all rows, the pow() function raises 2 to the power of the window function dense_rank() for later use in the bitand() function.

  u as (
    select user_id, updated_at, "task", pow2 from b
    union all
    select
      0, updated_at, "task",
      pow(2, dense_rank() over(partition by updated_at
                               order by "task")-1) as pow2
    from b
    union all
    select
      user_id, '0'::date, "task",
      pow(2, dense_rank() over(partition by user_id
                               order by "task")-1) as pow2
    from b
    union all
    select 0, '0'::date, "task",
      pow(2, dense_rank() over(order by "task")-1) as pow2
    from b
  ),

In the rec CTE, for all combinations of values in the user_id and update_at columns, numbers are recursively generated from 0 to 2 in the power of the number of different tasks minus 1.

  rec as (
    select
      user_id, updated_at,
      pow(2, count(distinct "task"))-1 as msk2
    from u
    group by user_id, updated_at
    union all
    select user_id, updated_at, msk2-1
    from rec
    where msk2 > 0
  ),

In the res CTE, the u CTE is joined with the rec CTE and the u CTE rows whose u .pow2 bit is not set to r .msk2 are eliminated.

  res as (
    select
      u.user_id, u.updated_at, u."task", r.msk2
    from u
    join rec as r
    on r.user_id = u.user_id and
       r.updated_at = u.updated_at
    where bitand(u.pow2, r.msk2) != 0
  )

Finally, using the listagg() function allows us to obtain chains of completed tasks for various combinations of user_id and updated_at column values.

select
  user_id, updated_at,
  listagg(distinct "task", '|') within group(order by "task") as combination,
  count(*) as total
from res
where user_id = 123 and updated_at != '0'::date
group by user_id, updated_at, msk2;

Results.

+---------+------------+-------------------------------------------+-------+
| USER_ID | UPDATED_AT |                COMBINATION                | TOTAL |
+---------+------------+-------------------------------------------+-------+
|     123 | 2024-01-01 | Task 1|Task 2|Task 6                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 8                      |     3 |
|     123 | 2024-01-01 | Task 2|Task 6|Task 7                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 6|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 6|Task 8               |     4 |
|     123 | 2024-01-01 | Task 1                                    |     1 |
|     123 | 2024-01-01 | Task 1|Task 6|Task 7                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 6|Task 7               |     4 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 6|Task 7|Task 8        |     5 |
|     123 | 2024-01-01 | Task 2|Task 8                             |     2 |
|     123 | 2024-01-01 | Task 2|Task 7|Task 8                      |     3 |
|     123 | 2024-01-01 | Task 2|Task 6|Task 8|Task 9               |     4 |
|     123 | 2024-01-01 | Task 2|Task 6|Task 8                      |     3 |
|     123 | 2024-01-01 | Task 2|Task 6|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 6                                    |     1 |
|     123 | 2024-01-01 | Task 1|Task 7|Task 8|Task 9               |     4 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 6|Task 8|Task 9        |     5 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 7                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 6                             |     2 |
|     123 | 2024-01-01 | Task 1|Task 6|Task 8|Task 9               |     4 |
|     123 | 2024-01-01 | Task 2|Task 7|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 2|Task 6                             |     2 |
|     123 | 2024-01-01 | Task 6|Task 8|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 7                                    |     1 |
|     123 | 2024-01-01 | Task 1|Task 6|Task 7|Task 9               |     4 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 7|Task 9               |     4 |
|     123 | 2024-01-01 | Task 1|Task 7|Task 8                      |     3 |
|     123 | 2024-01-01 | Task 2|Task 6|Task 7|Task 8               |     4 |
|     123 | 2024-01-01 | Task 2|Task 6|Task 7|Task 8|Task 9        |     5 |
|     123 | 2024-01-01 | Task 6|Task 7                             |     2 |
|     123 | 2024-01-01 | Task 7|Task 8                             |     2 |
|     123 | 2024-01-01 | Task 8|Task 9                             |     2 |
|     123 | 2024-01-01 | Task 6|Task 9                             |     2 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 6|Task 7|Task 9        |     5 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 8|Task 9               |     4 |
|     123 | 2024-01-01 | Task 2|Task 9                             |     2 |
|     123 | 2024-01-01 | Task 2|Task 7                             |     2 |
|     123 | 2024-01-01 | Task 7|Task 9                             |     2 |
|     123 | 2024-01-01 | Task 9                                    |     1 |
|     123 | 2024-01-01 | Task 1|Task 7|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 6|Task 7|Task 8               |     4 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 6|Task 9               |     4 |
|     123 | 2024-01-01 | Task 1|Task 2                             |     2 |
|     123 | 2024-01-01 | Task 2|Task 6|Task 7|Task 9               |     4 |
|     123 | 2024-01-01 | Task 6|Task 8                             |     2 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 8|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 9                             |     2 |
|     123 | 2024-01-01 | Task 1|Task 6|Task 7|Task 8|Task 9        |     5 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 7|Task 8|Task 9        |     5 |
|     123 | 2024-01-01 | Task 2|Task 7|Task 8|Task 9               |     4 |
|     123 | 2024-01-01 | Task 2                                    |     1 |
|     123 | 2024-01-01 | Task 6|Task 7|Task 8                      |     3 |
|     123 | 2024-01-01 | Task 6|Task 7|Task 8|Task 9               |     4 |
|     123 | 2024-01-01 | Task 7|Task 8|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 1|Task 8                             |     2 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 7|Task 8               |     4 |
|     123 | 2024-01-01 | Task 1|Task 7                             |     2 |
|     123 | 2024-01-01 | Task 1|Task 2|Task 6|Task 7|Task 8|Task 9 |     6 |
|     123 | 2024-01-01 | Task 1|Task 6|Task 8                      |     3 |
|     123 | 2024-01-01 | Task 2|Task 8|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 6|Task 7|Task 9                      |     3 |
|     123 | 2024-01-01 | Task 8                                    |     1 |
+---------+------------+-------------------------------------------+-------+