How to use values from previous column definition for Athena SQL query?


My data looks like this:

idx,year,month,day,metadata,not_impt,metricx
123,2022,12,02,"blah blah","lah lah",-123.94
123,2022,11,05,"blah blah asd","lah lah",62.4
123,2022,12,03,"blah blah asd","lah lah",39.512
123,2022,12,09,"blah blah","lah lah",12.412
123,2022,11,19,"blah blah","lah lah",24.43
123,2022,11,26,"blah blahac ","lah lah",94.94
987,2022,12,12,"blah blah","lah lah",-23.94
987,2022,11,15,"blah blahvs","lah lah",42.4
987,2022,11,03,"blah blah","lah lah",32.512
987,2022,12,04,"blah blah kams","lahada lah",19.412
987,2022,12,19,"blah blah","lah lah",21.43
987,2022,11,26,"blah blah","lah lah",74.94

The data is already loaded into an Athena view, tablex. There's a real-valued column, metricx, with values ranging over [-500, +500], for which I'd like to compute some statistics in a new view.

The goal is to create a new view.

I've tried the following and it works, but I'm not reusing the values I've already aggregated when computing the error bar columns:

CREATE VIEW AS
SELECT idx,
  concat(cast(year as varchar), '-', cast(month as varchar)) as date,
  count(*) as num_rows,
  AVG(metricx) as avg_metric,
  ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) as norm_metric,

  (
    ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) -
    ((STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
  ) as errorbar_bottom_metric,

  (
    ((AVG(metricx) - MIN(metricx)) / (MAX(metricx) - MIN(metricx))) +
    ((STDDEV_POP(metricx) / SQRT(count(*))) * 1.96)
  ) as errorbar_top_metric

FROM tablex
GROUP BY idx, year, month

Although the SQL works, there are a few repetitions because I'm not reusing the values computed for the previous columns.

Given that the data size is not that big (< 100,000 rows), is there a cleaner way to write the SQL without copying and pasting the computations from the previous columns?

Maybe, nested SELECT query?

1 answer

Answered by ValNik:

Readability depends on who the reader is. I think a nested query will make this more understandable for most readers: working through the calculations step by step makes the purpose and result of each one clear.

SELECT idx,
  concat(cast(year as varchar), '-', cast(month as varchar)) as date,
  cnt as num_rows,
  avg_metricx as avg_metric,
  avg_min_metricx / max_min_metricx as norm_metric,

  avg_min_metricx / max_min_metricx - std_err as errorbar_bottom_metric,

  avg_min_metricx / max_min_metricx + std_err as errorbar_top_metric
FROM (
  select *
     ,(stddev_metricx / SQRT(cnt)) * 1.96 as std_err
     ,avg_metricx - min_metricx as avg_min_metricx
     ,max_metricx - min_metricx as max_min_metricx
  FROM (
    select idx, year, month
       ,count(*) as cnt
       ,AVG(metricx) as avg_metricx
       ,MIN(metricx) as min_metricx
       ,MAX(metricx) as max_metricx
       ,STDDEV_POP(metricx) as stddev_metricx
    from tablex
    GROUP BY idx, year, month
  ) x
) y;
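
An equivalent way to name and reuse the intermediate results is a WITH clause (CTE), which Athena's Presto/Trino engine supports. This is a sketch of the same pattern with the steps flattened into named stages; the view name metric_stats is just an example, since the original CREATE VIEW omitted a name:

CREATE VIEW metric_stats AS
WITH agg AS (
  -- one pass of aggregation per (idx, year, month) group
  SELECT idx, year, month,
         count(*)            AS cnt,
         AVG(metricx)        AS avg_metricx,
         MIN(metricx)        AS min_metricx,
         MAX(metricx)        AS max_metricx,
         STDDEV_POP(metricx) AS stddev_metricx
  FROM tablex
  GROUP BY idx, year, month
),
derived AS (
  -- values reused by several output columns, computed once
  SELECT *,
         (avg_metricx - min_metricx) / (max_metricx - min_metricx) AS norm_metricx,
         (stddev_metricx / SQRT(cnt)) * 1.96                       AS std_err
  FROM agg
)
SELECT idx,
       concat(cast(year AS varchar), '-', cast(month AS varchar)) AS date,
       cnt                    AS num_rows,
       avg_metricx            AS avg_metric,
       norm_metricx           AS norm_metric,
       norm_metricx - std_err AS errorbar_bottom_metric,
       norm_metricx + std_err AS errorbar_top_metric
FROM derived;

This is the same query plan as the nested subselects, only with each stage given a name, which tends to read top-down rather than inside-out.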

Execution plan for MySQL 8.0, with create index ix_id_dt on tablex (idx, year, month);
There is nothing more to say here without knowing the structure and size of the real data.

EXPLAIN output for the nested query above:

EXPLAIN
-> Table scan on x  (cost=0.10..2.88 rows=30) (actual time=0.319..0.321 rows=10 loops=1)
    -> Materialize  (cost=9.35..12.12 rows=30) (actual time=0.318..0.318 rows=10 loops=1)
        -> Group aggregate: count(0), avg(tablex.metricx), min(tablex.metricx)
              , max(tablex.metricx), std(tablex.metricx)
              (cost=6.25 rows=30) (actual time=0.073..0.148 rows=10 loops=1)
            -> Index scan on tablex using ix_id_dt  
               (cost=3.25 rows=30) (actual time=0.034..0.105 rows=30 loops=1)

EXPLAIN output for your original query:

EXPLAIN
-> Group aggregate: count(0), std(tablex.metricx), min(tablex.metricx), max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx), count(0)
      , std(tablex.metricx), min(tablex.metricx), max(tablex.metricx)
      , min(tablex.metricx), avg(tablex.metricx), min(tablex.metricx)
      , max(tablex.metricx), min(tablex.metricx), avg(tablex.metricx)
      , count(0), avg(tablex.metricx)
      (cost=6.25 rows=30) (actual time=0.027..0.142 rows=10 loops=1)
    -> Index scan on tablex using ix_id_dt  
        (cost=3.25 rows=30) (actual time=0.013..0.105 rows=30 loops=1)