Nested SQL query: how to return one sample from each log

88 Views Asked by At

I'm trying to implement a nested query to pull no more than one sample per log, and I think I know how to implement its components separately:

  1. Query a set of logs that contain data relevant to my analysis:
    SELECT  
      runs.object_type as object_type,  
      runs.name as log_name,  
    from project.runs.latest_runs  
    WHERE object_type = "ROCKET"  
    group by object_type, log_name

This results in a list of log names, e.g. "log_name_2021_09_01", "log_name_2021_09_03" etc.

  1. Query no more than one event with a specific condition from a single known log:
    SELECT  
      object.path_meters as pos,  
      object_speed as speed,  
      log.run as run_name,  
    FROM project.events.last30days  
    WHERE log.run = "log_name_2021_10_01"  
      AND object.speed > 0.0  
    LIMIT 1  

The above query returns no more than one sample for the specified log.

How can I combine these queries to pull samples from a set of logs that is returned by Query 1, and at the same time there should be no more than one sample per log?

Update:
Let's say a DB contains three logs:

  1. log_name_2021_09_01. The associated object_type to the log is ROCKET. The log contains 100k data samples: 90k of them have object.speed = 0.0, 10k of them have speed > 0.0.
  2. log_name_2021_09_02. The associated object_type to the log is CAR. The log also contains 100k samples with a similar proportion to log 1.
  3. log_name_2021_09_03. The associated object_type to the log is ROCKET. The log also contains 100k samples with a similar proportion to log 1.

I'm only interested in logs with the object type ROCKET. Two logs correspond to this condition: log_name_2021_09_01 and log_name_2021_09_03. These log names can be obtained by query 1 depicted above. I'd like to pull only one sample point (with speed > 0) from each of the two logs. That is, in the end I'd like to have a query that returns two samples: one from log_name_2021_09_01 and one from log_name_2021_09_03.

1

There are 1 best solutions below

3
MatBailie On

You question omits actual example data, so we're forced to infer much from your description. It is strongly recommended that your questions include sample data and the results you'd want from that sample data. This allows us both a concrete example to base our understanding on, and provides a test-set for us to use when developing an answer.

  • Would you trust any code you've written without having run it against test data?

That said, the following should be something like what you're looking for... (For each log, it only selects the one row with the highest pos.)

WITH
  rocket_logs AS
(
  SELECT DISTINCT
    runs.object_type AS object_type,
    runs.name        AS log_name
  FROM
    project.runs.latest_runs  
  WHERE
    object_type = "ROCKET"  
),
  sorted_logs AS
(
  SELECT  
    log.run              AS run_name,
    object.path_meters   AS pos,
    object_speed         AS speed,
    ROW_NUMBER()
      OVER (
        PARTITION BY log.run
            ORDER BY object.path_meters DESC
      )
                         AS seq_num
  FROM
    project.events.last30days  
  WHERE
    object.speed > 0.0
)
SELECT
  *
FROM
  rocket_logs   r
INNER JOIN
  sorted_logs   s
    ON s.run_name = r.log_name
WHERE
  s.seq_num = 1

For a more exact answer, please give:

  • example data, for both tables
  • example results for that data
  • both being sufficient to demonstrate all necessary behaviours