I have a dataset with some missing data (in reality <1%). We are going to use data from other sites to fill in the missing data. We have a data.table named dt1, which contains the data. We have another data.table named dt2 and this is an index of the sites that can be used to replace (rsite) the missing data from a site. In dt1 I would like to create a column in dt1 named UsedSite that records which site the data we end up using are from.
Create some data to use
dt1<-data.table(site=c("s1","s1","s2","s2"), x=c(NA, 1, NA,1))
dt2<-data.table(site=c("s1","s1","s2","s2"), rsite=c("s3","s3","s4","s4"))
Since only a small fraction of data are missing, start by setting the values of UsedSite to site
dt1[,UsedSite:=site]
The following is where I can't figure out how to do this. I want to just replace the UsedSite values with rsite where x is NA. Not at all surprisingly, the line below replaces everything - I don't want that. I don't understand how to instruct it to consider only lines where a condition (is.na(x)) is being met.
dt1[dt2,UsedSite:=rsite, on=.(site)]
My natural impulse is to try dt1[dt2&is.na(x),UsedSite:=rsite, on=.(site)], but that doesn't work. Thoughts? I would like to stick with the data.table paradigm in this.
We may use
fcoalescehere to replace the NA values with the corresponding values of 'rsite' If it is a non-NA, it won't get replaced asfcoalescewill return the first non-NA for each row-output
Update
-output