Group by columns when date ranges are contiguous

147 Views Asked by At

I have data in a format like this:

DECLARE @WidgetPrice TABLE (WidgetPriceId BIGINT IDENTITY(1,1), WidgitId INT, Price MONEY, 
    StartEffectiveWhen DATE, EndEffectiveWhen DATE)

INSERT INTO @WidgetPrice(WidgitId, Price, StartEffectiveWhen, EndEffectiveWhen)
VALUES
(100,      21.48, '2020-1-1',         '2021-8-5'),
(100,      19.34, '2021-8-6',         '2021-12-31'),
(100,      19.34, '2022-1-1',         '2022-12-31'),
(100,      19.34, '2023-1-1',         '2023-1-31'),
-- There is a date gap here (No price from 2023-1-31 to 2023-3-5)
(100,      19.34, '2023-3-5',         '2023-12-31'),
(100,      12.87, '2024-1-1',         '2024-1-31'),
(100,      12.87, '2024-2-1',         '2100-12-31'),
-- Next Widget's prices          
(200,      728.25, '2020-1-1',         '2021-12-31'),
(200,      728.25, '2022-1-1',         '2022-12-31'),
(200,      861.58, '2023-1-1',         '2024-5-21'),
(200,      601.19, '2024-5-22',        '2100-12-31')

I need to group by the WidgetId and Price but only when the dates are contiguous.

So, in my example data, there is a gap in the data between 2023-1-31 and 2023-3-5. Because there is a gap there, I need to have two entries for the price 19.34.

Here is an image of the data as I would hope to get it:

Desired Data

The key rows in this output are rows 2 and 3. It has the same price listed twice because there is a gap in the dates.

I had thought to make a recursive CTE that could look at the LAG value for the StartEffeciveWhen and EndEffectiveWhen, but I could not figure it out.

Any ideas on how I can build a query that does this?

NOTE: My actual data is over 113,000,000 rows with a lot more columns. I have just presented a simplified version for this question.

NOTE 2: I am running Microsoft SQL Server 2017

2

There are 2 best solutions below

8
Abraham On BEST ANSWER

SQLFiddle: I put together a SQLFiddle so that you can run this query and adjust it to see different results. Very useful site: SQL Fiddle With Answer

Strategy: Instead of focusing on chaining multiple rows togethers, you can focus on finding the edges. Then you can group by the running total of starting edges. I'm sure you could shorten this, but this gets the idea across:

  1. Use Lag Windowed Function on Price and Date to find changes.
  2. Use Case expression to define edges.
  3. Group by Sum Windowed function (Running total) of Starts to combine for min and max dates.
--Find interesting edges
SELECT 
    WidgitId
    , Price
    , MIN(StartEffectiveWhen) AS StartEffectiveWhen
    , MAX(EndEffectiveWhen) AS EndEffectiveWhen
FROM (
    SELECT 
        SUM(i.StartingEdge) OVER (
            PARTITION BY WidgitId ORDER BY StartEffectiveWhen
            ) AS LeadingCount
        , WidgitId
        , Price
        , StartEffectiveWhen
        , EndEffectiveWhen
    FROM (
        SELECT 
            CASE 
                WHEN LagPrice != Price
                    OR LagDate != DATEADD(day, - 1, StartEffectiveWhen)
                    THEN 1
                ELSE 0
                END AS StartingEdge
            , WidgitId
            , Price
            , StartEffectiveWhen
            , EndEffectiveWhen
        FROM (
            SELECT 
                LAG(price) OVER (
                    PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                    ) AS LagPrice
                , LAG(EndEffectiveWhen) OVER (
                    PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                    ) AS LagDate
                , *
            FROM @WidgetPrice
        ) p
    ) i
) g
GROUP BY g.Price,g.WidgitID,LeadingCount

Output from Query:

WidgitId Price StartEffectiveWhen EndEffectiveWhen
100 12.87 1/1/2024 12/31/2100
100 19.34 8/6/2021 1/31/2023
100 19.34 3/5/2023 12/31/2023
100 21.48 1/1/2020 8/5/2021
200 601.19 5/22/2024 12/31/2100
200 728.25 1/1/2020 12/31/2022
200 861.58 1/1/2023 5/21/2024

Alternative sub-query for those who might want to reduce the number of subqueries:

--Find interesting edges
SELECT WidgitId
    ,Price
    ,MIN(StartEffectiveWhen) AS StartEffectiveWhen
    ,MAX(EndEffectiveWhen) AS EndEffectiveWhen
FROM (
    SELECT 
        SUM(i.StartingEdge) OVER (
            PARTITION BY WidgitId ORDER BY StartEffectiveWhen
            ) AS LeadingCount
        , WidgitId
        , Price
        , StartEffectiveWhen
        , EndEffectiveWhen
    FROM (
        SELECT 
            CASE 
                WHEN LAG(price) OVER (
                            PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                            ) != Price
                    OR LAG(EndEffectiveWhen) OVER (
                            PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                            ) != DATEADD(day, - 1, StartEffectiveWhen)
                    THEN 1
                ELSE 0
                END AS StartingEdge
            , WidgitId
            , Price
            , StartEffectiveWhen
            , EndEffectiveWhen
        FROM @WidgetPrice
    ) i
) g
GROUP BY g.Price, g.WidgitID, LeadingCount

And here's a SQL Fiddle for the query with 1 less subquery

0
Vaccano On

This is a copy of Abraham's answer, but re-written using CTEs (I find them easier to read).

The actual awesomeness comes from Abraham's answer, this is just another way to view it. (Meaning you should up vote Abraham's answer before this.)

;WITH ComparableDatesAndPrices AS
(
    -- Make a CTE where we can compare the previous row's price and EndEffecitveWhen with the current row.  
    -- The goal is to be able to see if the price changed, or if there was a gap in dates.
    SELECT  LAG(price) OVER (PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC) AS LagPrice, 
            LAG(EndEffectiveWhen) OVER (PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC) AS LagEndEffectiveWhen, 
            WidgetPriceId, WidgitId, Price, StartEffectiveWhen, EndEffectiveWhen
    FROM @WidgetPrice
),
EdgeIdentification AS
(
    -- Create a CTE that will indicate if we have started a new "Group".  The end goal is to flaten down rows with contigous dates.  
    -- This will allow us to reduce the number of rows.
    SELECT  CASE 
                WHEN ISNULL(LagPrice, - 1) != Price OR ISNULL(LagEndEffectiveWhen, '1900-01-01') < DATEADD(day, - 1, StartEffectiveWhen)
                    THEN 1
                ELSE 0
            END AS StartingEdge, 
            WidgitId, Price, StartEffectiveWhen, EndEffectiveWhen, LagPrice, LagEndEffectiveWhen
    FROM  ComparableDatesAndPrices
),
EdgeCountsForGrouping AS
(
    -- Make a CTE that sums the Starting Edges.  The goal is to use this to group by.  If the starting edge does
    -- not change, then we are in a common group.
    SELECT  SUM(StartingEdge) OVER (PARTITION BY WidgitId ORDER BY StartEffectiveWhen) AS LeadingCount,
            WidgitId, Price, StartEffectiveWhen, EndEffectiveWhen, StartingEdge, LagPrice, LagEndEffectiveWhen
    FROM    EdgeIdentification  
)
SELECT  WidgitId, Price, MIN(StartEffectiveWhen) AS StartEffectiveWhen, MAX(EndEffectiveWhen) AS EndEffectiveWhen 
FROM    EdgeCountsForGrouping
GROUP BY Price, WidgitID, LeadingCount