Here is my original R/sqldf code:
First I create the data:
table_a <- data.frame(name = c('john', 'john', 'john', 'alex', 'alex', 'tim', 'tim', 'joe', 'joe', 'jessica', 'jessica'),
year = c(2010, 2011, 2012, 2020, 2021, 2015, 2016, 2010, 2011, 2000, 2001),
var = c('a', 'a', 'c', 'b', 'c', NA, NA, NA, NA, NA, NA))
table_b <- data.frame(name = c('sara', 'sara', 'tim', 'tim', 'tim', 'jessica'),
year = c(2001, 2002, 2005, 2006, 2021, 2020),
var = c('a', 'b', 'c', 'd', 'f', 'z'))
Next, I run the code:
library(sqldf)
sqldf("WITH min_year AS (
SELECT name
, MIN(year) AS min_year
FROM table_a
GROUP BY name
)
, b_filtered AS (
SELECT b.name
, MAX(b.year) AS max_year
, b.var
FROM table_b AS b
INNER JOIN min_year AS m
ON b.name = m.name
AND b.year < m.min_year
GROUP BY b.name
)
SELECT a.name
, a.year
, CASE WHEN a.var IS NULL AND b.name IS NOT NULL THEN b.var
ELSE a.var
END AS var_mod
FROM table_a AS a
LEFT JOIN b_filtered b
ON a.name = b.name")
Is it possible to combine the data creation step and the sql into the same piece of code? For example:
sqldf("WITH
table_a (name, year, var) AS
(
VALUES
('john', 2010, 'a' )
, ('john', 2011, 'a' )
, ('john', 2012, 'c' )
, ('alex', 2020, 'b' )
, ('alex', 2021, 'c' )
, ('tim', 2015, NULL)
, ('tim', 2016, NULL)
, ('joe', 2010, NULL)
, ('joe', 2011, NULL)
, ('jessica', 2000, NULL)
, ('jessica', 2001, NULL)
)
, table_b (name, year, var) AS
(
VALUES
('sara', 2001, 'a')
, ('sara', 2002, 'b')
, ('tim', 2005, 'c')
, ('tim', 2006, 'd')
, ('tim', 2021, 'f')
, ('jessica', 2020, 'z')
)
WITH min_year AS (
SELECT name
, MIN(year) AS min_year
FROM table_a
GROUP BY name
)
, b_filtered AS (
SELECT b.name
, MAX(b.year) AS max_year
, b.var
FROM table_b AS b
INNER JOIN min_year AS m
ON b.name = m.name
AND b.year < m.min_year
GROUP BY b.name
)
SELECT a.name
, a.year
, CASE WHEN a.var IS NULL AND b.name IS NOT NULL THEN b.var
ELSE a.var
END AS var_mod
FROM table_a AS a
LEFT JOIN b_filtered b
ON a.name = b.name")
While creating the data outside of the sqldf statement and then running the sqldf code works just fine, I was just wondering if it possible to combine both of them into a single piece of code.
This makes things much easier for testing and debugging procedures.
Is it possible
The problem is that the last
sqldfstatement in the question has a syntax error. Replace the secondwithin that statement with a comma.