SQL to Extract Continuous Timestamp Ranges


Table Device_Status

Id Status Timestamp
1 Active 2023-01-13T18.00.01.0187528
2 Active 2023-01-13T18.00.01.0187529
1 Failed 2023-01-13T18.00.01.0187530
3 Active 2023-01-13T18.00.01.0187531
1 Failed 2023-01-13T18.00.01.0187532
1 Active 2023-01-13T18.00.01.0187533
3 Active 2023-01-13T18.00.01.0187534
1 Failed 2023-01-13T18.00.01.0187535
4 Failed 2023-01-13T18.00.01.0187536
1 Active 2023-01-13T18.00.01.0187537

Expected Output (Need SQL query to generate):

ID Fail_Begin Fail_End
1 2023-01-13T18.00.01.0187530 2023-01-13T18.00.01.0187532
1 2023-01-13T18.00.01.0187535 2023-01-13T18.00.01.0187535
4 2023-01-13T18.00.01.0187536 2023-01-13T18.00.01.0187536

Basically, for each ID, get min(Timestamp) and max(Timestamp), but only over continuous runs of 'Failed' records for that ID. If a run has only one record, then min = max, as is the case with the second and third rows in the sample result set.

I have tried this (and various subquery variants thereof)

SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status
GROUP BY Id

but I need to group only over continuously occurring records.

So perhaps adding a notion of status change first would help? Start with zero as the initial change code and increment it by 1 every time a record's status differs from the previous record for that ID, thereby generating an intermediate result like so...

Table Device_Status_With_Change_Column

Id Status Change Timestamp
1 Active 0 2023-01-13T18.00.01.0187528
2 Active 0 2023-01-13T18.00.01.0187529
1 Failed 1 2023-01-13T18.00.01.0187530
3 Active 0 2023-01-13T18.00.01.0187531
1 Failed 1 2023-01-13T18.00.01.0187532
1 Active 2 2023-01-13T18.00.01.0187533
3 Active 0 2023-01-13T18.00.01.0187534
1 Failed 3 2023-01-13T18.00.01.0187535
4 Failed 0 2023-01-13T18.00.01.0187536
1 Active 4 2023-01-13T18.00.01.0187537

And then doing

SELECT Id, Change, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
WHERE Status = 'Failed'
GROUP BY Id, Change

Other than looping over the result set in a programming language, I do not yet see a direct SQL statement that would do this in one fell swoop without the intermediate table, and I do not see how to compute the Change column in SQL.


There are 3 best solutions below

Net Dawg (Best Answer)

The solution by user @Ajax1234 generates the result in the original post. However, when stress-tested with several tandem 'Failed' records, it splits those longer runs into pairs instead of producing exactly one additional record for them, as required.
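
To illustrate, here is a minimal, standalone sketch of that behavior (a hedged example, not from the original answer: it uses plain text statuses, a column named ts, and three of the tandem 'Failed' timestamps that the DDL below adds):

-- Three consecutive 'Failed' rows for one device; the odd/even pairing
-- produces two ranges instead of the single expected range
WITH demo(id, status, ts) AS
(
  VALUES (1, 'Failed', timestamp '2023-01-13 10:00:01.538'),
         (1, 'Failed', timestamp '2023-01-13 10:00:01.539'),
         (1, 'Failed', timestamp '2023-01-13 10:00:01.540')
),
cte AS
(
  SELECT row_number() OVER (PARTITION BY id ORDER BY ts) r, d.*
  FROM demo d WHERE status = 'Failed'
)
SELECT c.id, c.ts AS fail_start, coalesce(c1.ts, c.ts) AS fail_end
FROM cte c LEFT JOIN cte c1 ON c.id = c1.id AND c.r + 1 = c1.r
WHERE c.r % 2 = 1;
-- returns (.538, .539) and (.540, .540) rather than the single range (.538, .540)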

The DDL/schema and initial sample data, tested on PostgreSQL version 16, with the timestamps changed/trimmed so the parse works, and several tandem 'Failed' records added:

CREATE TYPE status AS ENUM ('Active', 'Failed');
DROP TABLE IF EXISTS Device_Status;
CREATE TABLE Device_Status
(
  ID integer,
  Status status,
  temptime text
);
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.528');
INSERT INTO Device_Status VALUES (2,'Active','2023-01-13 10.00.01.529');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.530');
INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.531');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.532');
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.533');
INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.534');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.535');
INSERT INTO Device_Status VALUES (4,'Failed','2023-01-13 10.00.01.536');
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.537');
insert into device_status values (1,'Failed','2023-01-13 10:00:01.538'), 
                                 (1,'Failed','2023-01-13 10:00:01.539'), 
                                 (1,'Failed','2023-01-13 10:00:01.540'), 
                                 (1,'Failed','2023-01-13 10:00:01.541'), 
                                 (1,'Failed','2023-01-13 10:00:01.542'), 
                                 (1,'Failed','2023-01-13 10:00:01.543'), 
                                 (1,'Failed','2023-01-13 10:00:01.544'),
                                 (1,'Failed','2023-01-13 10:00:01.545'), 
                                 (1,'Failed','2023-01-13 10:00:01.546'),
                                 (1,'Failed','2023-01-13 10:00:01.547');
ALTER TABLE Device_Status ADD timestamp TIMESTAMP;
UPDATE Device_Status set timestamp = 
      to_timestamp(temptime,'YYYY-MM-DD HH:MI:SS.MS');
ALTER TABLE Device_Status DROP COLUMN temptime;
WITH Device_Status_With_Change_Column AS
(
  WITH flag AS
  (
    WITH lag AS
    (
      SELECT *, 
             LAG(status) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Status,
             LAG(Id,1,0) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Id
      FROM Device_Status
    )
    SELECT *, 
           CASE
           WHEN lag.last_status IS NULL THEN 0
           WHEN lag.last_status IS DISTINCT FROM lag.status 
            AND lag.last_id = lag.id  
            THEN 1
           ELSE 0
         END Change_Flag
     FROM lag
  )
  SELECT *, 
         SUM(change_flag) OVER (PARTITION BY Id ORDER BY TimeStamp) Change
  FROM flag
)
SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
WHERE status = 'Failed'
GROUP BY Id, Change
ORDER BY Id

Result (fiddle):

id fail_begin fail_end
1 2023-01-13 10:00:01.53 2023-01-13 10:00:01.532
1 2023-01-13 10:00:01.535 2023-01-13 10:00:01.535
1 2023-01-13 10:00:01.538 2023-01-13 10:00:01.547
4 2023-01-13 10:00:01.536 2023-01-13 10:00:01.536

The innermost CTE lag uses the SQL LAG function to add columns holding the previous row's values (exactly one previous row of the corresponding column, for id and status; for id, with a default of zero). This is projected into flag, which enters a 1 whenever the status changes from 'Active' to 'Failed' or vice versa for the same id. The first row of any id gets an unconditional zero, as do rows where the status does not change relative to the previous recorded log entry for that device.

To create the CTE Device_Status_With_Change_Column, the change flags are added up with the cumulative SUM window function, which records the sequence number of each status change per device Id when sorted by timestamp. The zeros in the change flag ensure that SUM assigns the same integer to all consecutive 'Failed' (or 'Active') records of the same run. In other words, to get the 'Active' image of the result set we just need to change the WHERE clause, and any number of further statuses, say 'PAUSED' or 'OFFLINE', can be handled simply by adding the corresponding word to the DDL and the query.
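
For readers who find the nested CTEs hard to follow, the same change-flag plus running-sum idea can also be written with chained CTEs. A minimal sketch, assuming the Device_Status table created by the DDL above (a restructuring for readability, not the query used in the timings below; the CTE names lagged, flagged, numbered are illustrative):

-- Same LAG + CASE + cumulative SUM technique, written as chained CTEs
WITH lagged AS
(
  SELECT *,
         LAG(status) OVER (PARTITION BY id ORDER BY timestamp) AS last_status
  FROM Device_Status
),
flagged AS
(
  SELECT *,
         CASE WHEN last_status IS NOT NULL
               AND last_status IS DISTINCT FROM status
              THEN 1 ELSE 0
         END AS change_flag
  FROM lagged
),
numbered AS
(
  SELECT *,
         SUM(change_flag) OVER (PARTITION BY id ORDER BY timestamp) AS change
  FROM flagged
)
SELECT id, min(timestamp) AS fail_begin, max(timestamp) AS fail_end
FROM numbered
WHERE status = 'Failed'
GROUP BY id, change
ORDER BY id, fail_begin;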

The full CTE Device_Status_With_Change_Column (fiddle):

SELECT * 
FROM Device_Status_With_Change_Column

I tested the join-based query by @Ajax1234 on a database with 13 million records, generating a result set of approximately 20,000 rows, and it is faster than my CTE-based solution (24 seconds versus 42 seconds).

SQL is far more powerful than looping in application code! Look into CTEs using LAG, SUM, and the CASE statement. Join-based operations over products of relations are more performant, if a bit harder to read and understand, and therefore need testing.

Ajax1234

The solution below first uses a CTE to associate a row number with each 'Failed' record, partitioned by id. Then, the CTE is joined to itself:

with cte as (
   select row_number() over (partition by t.id order by t.timestamp) r, t.* 
   from device_status t where t.status = 'Failed'
)
select c.id, c.timestamp fail_start, coalesce(c1.timestamp, c.timestamp) fail_end
from cte c left join cte c1 on c.id = c1.id and c.r + 1 = c1.r
where c.r % 2 = 1

See fiddle

Net Dawg

Another answer to my own question.

Fiddle

WITH cte AS -- the common table expression (CTE)
(
   -- table projection with all leading and lagging columns to then inform begin and end of failed states 
    SELECT 
            LAG(id) OVER (PARTITION BY id ORDER BY timestamp) previous_id, 
            LAG(status) OVER (PARTITION BY id ORDER BY timestamp) previous_status, 
            LAG(timestamp) OVER (PARTITION BY id ORDER BY timestamp) previous_timestamp,
            id, 
            status,
            timestamp,
            LEAD(id) OVER (PARTITION BY id ORDER BY timestamp) next_id, 
            LEAD(status) OVER (PARTITION BY id ORDER BY timestamp) next_status, 
            LEAD(timestamp) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp
  
    FROM device_status
),
-- AND WITH (comment added for readability, since multiple queries are introduced on the CTE);
-- the fail start table is defined as fs,
fs AS 
(
  -- the order of fail starts (and, later similarly, fail ends) is preserved in row_number
  SELECT ROW_NUMBER() OVER (ORDER BY id, fail_start) r, x.* FROM 
  ( 
    -- if next_status changes to 'Failed' for the same id, that next timestamp is the start of the run
    SELECT id, next_timestamp AS fail_start FROM cte
    WHERE next_status = 'Failed' AND next_status IS DISTINCT FROM status AND next_id = id
     UNION
    -- exception: the first record for an id is already 'Failed' (it has no preceding row)
    SELECT id, timestamp AS fail_start FROM cte
    WHERE previous_status IS NULL AND status = 'Failed'
   ) x
),
-- AND WITH
-- fail end table FE, mirror image of FS
fe AS 
(
  SELECT ROW_NUMBER() OVER (ORDER BY id, fail_end) r, y.* FROM 
  (   
    SELECT id, previous_timestamp AS fail_end FROM cte
    -- previous_id (not next_id) here, mirroring fs, so the last row of each id's partition stays eligible
    WHERE previous_status = 'Failed' AND previous_status IS DISTINCT FROM status AND previous_id = id

      UNION

    SELECT id, timestamp AS fail_end FROM cte
    WHERE next_status IS NULL AND status = 'Failed'
  ) y
)
-- simple join of row numbers of FS and FE to generate the required result
SELECT fs.id, fs.fail_start, fe.fail_end FROM fs JOIN fe ON fs.r = fe.r

In this solution, instead of trying to compute a change column, I use the LEAD and LAG functions symmetrically to first line up each record with its previous and next record, partitioned by ID and ordered by timestamp (an alternative of ordering by ID, timestamp proved less useful), and then search that projection for the beginning and end of each 'Failed' run, since it is built specifically to reveal this pattern in a guaranteed manner. This is probably best understood through the intermediate results of the component subqueries in the Fiddle.

Previous-Next (LAG/LEAD View)

Then, I translate into SQL the logic that any 'Failed' begin timestamp will be preceded by a non-'Failed' status, with an exception for a lone failure or one that is the first occurrence in the timestamped log.

Fail Start

The mirror image, showing where each 'Failed' run ends in time, looks like so:

Fail End

The final answer simply joins the fail start and fail end queries by row number, which is guaranteed to line up because of the symmetric projections; as a minor point, it also sorts by ID so the result completely matches the original post and the accepted answer.

Specifically, the big difference here is that this does not bother generating the proposed Change column, cumulative sum, and GROUP BY as originally envisioned in the post (working backward from the required result).

Hence, I will keep the previous answer as the accepted one.

The results are identical. Performance was noted as "too close to call" on a couple of datasets with millions of status records each, with several enumerated statuses (not just the binary 'Failed' and 'Active' of the toy dataset).

Further, in favor of the accepted answer: I still like it because its query is much easier for me, and possibly others, to read, and therefore to understand and maintain as code.