I have two tables which consist of the following columns:
Table1:
ID
N which describes an ID who did something, e.g. for the first time, then a second time, etc.
START_DATE is related to column N. It describes the date when an ID did something.
STOP_DATE 1 day before column GROUPS changes from 0 to 1.
GROUPS is a grouping variable.
Table2:
ID
EVENT_HAPPENED is a date on which an event happened.
Table1:
| ID | N | START_DATE | STOP_DATE | GROUPS |
|---|---|---|---|---|
| 1 | 1 | 2023-01-26 | 2023-04-13 | 0 |
| 1 | 2 | 2023-02-05 | 2023-04-13 | 0 |
| 1 | 3 | 2023-02-27 | 2023-04-13 | 0 |
| 1 | 4 | 2023-04-14 | 2023-09-13 | 1 |
| 1 | 5 | 2023-04-14 | 2023-09-13 | 1 |
| 1 | 6 | 2023-04-14 | 2023-09-13 | 1 |
Table2:
| ID | EVENT_HAPPENED |
|---|---|
| 1 | 2023-03-12 |
| 1 | 2023-03-18 |
| 1 | 2023-04-07 |
| 1 | 2023-05-31 |
So what I need to do is to check the minimum date of EVENT_HAPPENED. In this case, it is 2023-03-12 and then I reference it to table1 to create a column called MARKER which indicates with 1 that for all subsequent data entries for this ID, the event has already happened.
I also need to associate the smallest date on which the EVENT_HAPPENED occurred with the corresponding interval between START_DATE and STOP_DATE as a new column called HAPPENED_FIRST_TIME with the corresponding date as column HAPPENED_ON.
What I expect
| ID | N | START_DATE | STOP_DATE | GROUPS | MARKER | HAPPENED_FIRST_TIME | HAPPENED_ON |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 0 | null |
| 1 | 2 | 2023-02-05 | 2023-04-13 | 0 | 0 | 0 | null |
| 1 | 3 | 2023-02-27 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
| 1 | 4 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
| 1 | 5 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
| 1 | 6 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
What I tried to do is the following:
WITH EventDates AS (
SELECT
ID,
MIN(EVENT_HAPPENED) AS MIN_EVENT_DATE
FROM
Table2
GROUP BY
ID
),
IntervalDates AS (
SELECT
t1.ID,
t1.N,
t1.START_DATE,
t1.STOP_DATE,
t1.GROUPS,
ed.MIN_EVENT_DATE AS HAPPENED_FIRST_TIME
FROM
Table1 t1
LEFT JOIN
EventDates ed ON t1.ID = ed.ID
)
SELECT
ID,
N,
START_DATE,
STOP_DATE,
GROUPS,
CASE
WHEN HAPPENED_FIRST_TIME <= START_DATE THEN 1
ELSE 0
END AS MARKER,
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN 1
ELSE 0
END AS HAPPENED_FIRST_TIME,
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN HAPPENED_FIRST_TIME
ELSE NULL
END AS HAPPENED_ON
FROM
IntervalDates;
The problem occurs here because the smallest EVENT_HAPPENED date is actually between START_DATE and STOP_DATE for the first 3 data entries. However, it happened after START_DATE 2023-02-27:
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN 1
ELSE 0
END AS HAPPENED_FIRST_TIME,
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN HAPPENED_FIRST_TIME
ELSE NULL
END AS HAPPENED_ON
which gives the following result:
| ID | N | START_DATE | STOP_DATE | GROUPS | MARKER | HAPPENED_FIRST_TIME | HAPPENED_ON |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
| 1 | 2 | 2023-02-05 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
| 1 | 3 | 2023-02-27 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
| 1 | 4 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
| 1 | 5 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
| 1 | 6 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
How can I adjust my code?
Thank you
------------------EDIT
When adding IDs who ever belong to GROUPS = 0 the MARKER column works fine, but the HAPPENED_FIRST_TIME indicator does not work nor the corresponding HAPPENED_ON.
Do I need to split IDs into those who belong to GROUPS = 1 and those who never do to run the RN?
Expected output:
| ID | N | START_DATE | STOP_DATE | GROUPS | MARKER | HAPPENED_FIRST_TIME | HAPPENED_ON |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 0 | null |
| 1 | 2 | 2023-02-05 | 2023-04-13 | 0 | 0 | 0 | null |
| 1 | 3 | 2023-02-27 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
| 1 | 4 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
| 1 | 5 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
| 1 | 6 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
| 3 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 1 | 2023-01-31 |
| 3 | 2 | 2023-02-05 | 2023-04-13 | 0 | 1 | 0 | null |
| 3 | 3 | 2023-02-27 | 2023-04-13 | 0 | 1 | 0 | null |
| 3 | 4 | 2023-04-14 | 2023-04-13 | 0 | 1 | 0 | null |
I have slightly updated your query to retrieve the correct HAPPENED_FIRST_TIME, I used the window function
row_number()to pinpoint the nearest START_DATE relative to the event date :Results :
Demo here
Based on the recent simple data you provided in the comments, it appears that the question has transformed into a "gaps and islands" problem . This can be resolved using the difference between two row_numbers to identify the existing groups :
Demo here