How to perform SQL join with a table and another table having an aggregate value of one of the column of first table?

108 Views Asked by At

I am trying to subset flights table without the use of nested queries using sqldf library in R. I have calculated median and average values using one query and stored it into different tables. In the end, I am trying to apply WHERE condition to solve but I am not able to understand why the JOIN function is behaving the way it is. I have two questions.

  1. How is it able to perform INNER JOIN without two keys specified after ON argument?
  2. It is basically putting median and average value in all the rows after JOINS. Why?

Can anyone explain what is going on here?

library(sqldf)

average = sqldf('SELECT AVG(distance) AS avg FROM flights')

median = sqldf('SELECT MEDIAN(distance) AS med FROM flights')

result = sqldf(
    "SELECT flights.* 
     FROM flights 
     INNER JOIN average_q1 ON flights.distance 
     INNER JOIN median_q1 ON flights.distance 
     WHERE distance BETWEEN median_q1.med AND average_q1.avg"
)
1

There are 1 best solutions below

0
G. Grothendieck On BEST ANSWER

1) The code in the question is not reproducible (see posting instructions at the top of the tag page) so we will use the following:

library(sqldf)

# test inputs
DF <- data.frame(a = c(1, 1, 1, 2, 2, 2), b = 1:6)
DFsum <- sqldf("select a, sum(b) as sum from DF group by a")

sqldf("select *
  from DF A
  left join DFsum B using(a)")

giving:

  a b sum
1 1 1   6
2 1 2   6
3 1 3   6
4 2 4  15
5 2 5  15
6 2 6  15

2) This can be done using a window function which avoids the second table:

sqldf("select *, sum(b) over (partition by a) as sum from DF")

3) The aggregation can be combined right into the SQL statement like this:

sqldf("select *
  from DF A
  left join (select a, sum(b) as sum 
             from DF 
             group by a) using(a)")

4) This alternative is a different way of combining the SQL statements using a with clause, also known as a Common Table Expression (CTE)

sqldf("with B as (
  select a, sum(b) as sum 
    from DF 
    group by a
)
select *
  from DF A
  left join B using(a)")