Databrick SQL - Fillng the gaps between min and max Date

30 Views Asked by At

I'm working with this example data on Azure Databricks Sql where there are gaps between dates. The data in the based_date column reflects when it was stored in the system. However, for the expected output, I need the dates to consistently represent the first day of the month, filling in any missing dates with corresponding agencies. How can I achieve this?

Currently this is my code I'm working on but it is not giving me my expected recursive output:

CREATE TEMPORARY VIEW xTableA
AS
SELECT 111 AS ACCOUNTID, '2022-02-05' AS BASED_DATE, 'XYZ' AS AGENCY
UNION ALL
SELECT 111, '2022-02-05', 'ABC'
UNION ALL
SELECT 111, '2022-05-25', 'BGG'
UNION ALL
SELECT 111, '2022-07-13', 'DXA'
UNION ALL
SELECT 111, '2023-02-22', 'VGQ'
UNION ALL
SELECT 114, '2022-08-09', 'QYD'
UNION ALL
SELECT 114, '2022-12-26', 'OMG'
UNION ALL
SELECT 114, '2023-03-12', 'TNK';

WITH xBased_Date AS (
SELECT 
    ACCOUNTID,
    AGENCY,
    CAST(date_trunc('MONTH', BASED_DATE) AS DATE) AS StartDate,
    LEAD(CAST(date_trunc('MONTH', BASED_DATE) AS DATE),1) OVER (ORDER BY CAST(date_trunc('MONTH', BASED_DATE) AS DATE)) AS EndDate
FROM xTableA
),
xRecursive AS (
    SELECT
    ACCOUNTID,
    add_months(StartDate, 1) AS xDate
    FROM xBased_Date
    WHERE add_months(StartDate, 1) <= EndDate
)

SELECT
a.ACCOUNTID,
ADD_MONTHS(a.StartDate, 1) AS BASED_DATE,
a.AGENCY
FROM xBased_Date a
CROSS JOIN xRecursive b
WHERE ADD_MONTHS(a.StartDate, 1) <= a.EndDate
ORDER BY 1, 2

Sample Data:

| ACCOUNTID | BASED_DATE | AGENCY |
| --------- | ---------- | -------|
| 111       | 2022-02-05 | XYZ    |
| 111       | 2022-02-05 | ABC    |
| 111       | 2022-05-25 | BGG    |
| 111       | 2022-07-13 | DXA    |
| 111       | 2023-02-22 | VGQ    |
| 114       | 2022-08-09 | QYD    |
| 114       | 2022-12-26 | OMG    |
| 114       | 2023-03-12 | TNK    |

Expected output: where i need to convert the date always first day of the month and filling up the value of last agency on the account

| ACCOUNTID | BASED_DATE | AGENCY |
| --------- | ---------- | -------|
| 111       | 2022-02-01 | XYZ    |
| 111       | 2022-02-01 | ABC    |
| 111       | 2022-03-01 | ABC    |
| 111       | 2022-04-01 | ABC    |
| 111       | 2022-05-01 | ABC    |
| 111       | 2022-05-01 | BGG    |
| 111       | 2022-06-01 | BGG    |
| 111       | 2022-07-01 | BGG    |
| 111       | 2022-07-01 | DXA    |
| 111       | 2022-08-01 | DXA    |
| 111       | 2023-02-01 | VGQ    |
| 111       | 2023-03-01 | VGQ    |
| 114       | 2022-08-01 | QYD    |
| 114       | 2022-09-01 | QYD    |
| 114       | 2022-10-01 | QYD    |
| 114       | 2022-11-01 | QYD    |
| 114       | 2022-12-01 | QYD    |
| 114       | 2022-12-01 | OMG    |
| 114       | 2023-01-01 | OMG    |
| 114       | 2023-02-01 | OMG    |
| 114       | 2023-03-01 | TNK    |

I would if appreciate if you can give me any idea or a similar code on any other sql syntax and I will translate it myself to databricks

1

There are 1 best solutions below

0
Let Soo Gas On BEST ANSWER

I solved my problem by searching and gathering the answers here on the community with almost the same scenario. Get the logic and apply some modification to solved my problem and get the expected output I want:

CREATE TEMPORARY VIEW xTableA
AS
SELECT 111 AS ACCOUNTID, '2022-02-05' AS BASED_DATE, 'XYZ' AS AGENCY
UNION ALL
SELECT 111, '2022-02-05', 'ABC'
UNION ALL
SELECT 111, '2022-05-25', 'BGG'
UNION ALL
SELECT 111, '2022-07-13', 'DXA'
UNION ALL
SELECT 111, '2023-02-22', 'VGQ'
UNION ALL
SELECT 114, '2022-08-09', 'QYD'
UNION ALL
SELECT 114, '2022-12-26', 'OMG'
UNION ALL
SELECT 114, '2023-03-12', 'TNK';

WITH xBased_Date AS (
SELECT 
    ACCOUNTID,
    AGENCY,
    CAST(date_trunc('MONTH', BASED_DATE) AS DATE) AS StartDate,
    LEAD(CAST(date_trunc('MONTH', BASED_DATE) AS DATE),1) OVER (PARTITION BY ACCOUNTID ORDER BY CAST(date_trunc('MONTH', BASED_DATE) AS DATE)) AS EndDate
FROM xTableA

)
,xRecursive AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS num
    FROM 
        (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5) AS t1
    CROSS JOIN 
        (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5) AS t2
)

SELECT 
    m.ACCOUNTID,
    ADD_MONTHS(m.StartDate, n.num) AS BASED_DATE,
    m.AGENCY
FROM xBased_Date m
CROSS JOIN xRecursive n
WHERE ADD_MONTHS(m.StartDate, n.num) <= m.EndDate
ORDER BY m.ACCOUNTID, BASED_DATE;

If you have better solution on my xRecursive CTE and make it more dynamic that will be helpful for me to use.