I am trying to merge two relatively large datasets. I am merging by SiteID - which is a unique indicator of location, and date/time, which are comprised of Year, Month=Mo, Day, and Hour=Hr.
The problem is that the merge is dropping data somewhere. Minimum, Maximum, Mean, and Median values all change, when they should be the same data, simply merged. I have made the data into characters and checked that the character strings match, yet I still lose data. I have tried left_join as well, but that doesn't seem to help. See below for more details.
EDIT: Merge is dropping data because data do not exist for every ("SiteID", "Year","Mo","Day", "Hr"). So, I needed to interpolate missing values from dB before I could merge (see answer below).
END EDIT
see link at the bottom of the page to reproduce this example.
PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)
dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
# confirm that data are stored as characters
str(PC17)
str(dB)
Now to compare my SiteID values, I use unique to see what character strings I have, and setdiff to see if R recognizes any as missing. One siteID is missing from each, but this is okay, because it is truly missing in the data (not a character string issue).
sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))
setdiff(PC17$SiteID, dB$SiteID) ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID) ## FI7D is the only one missing, this is ok
Now when I look at the data (summarize by SiteID), it looks like a nice, full dataframe - meaning I have data for every site that I should have.
library(dplyr)
dB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 35.3 57.3 47.0 47.6
2 CU1M 33.7 66.8 58.6 60.8
3 CU1U 31.4 55.9 43.1 43.3
4 CU2D 40 58.3 45.3 45.2
5 CU2M 32.4 55.8 41.6 41.3
6 CU2U 31.4 58.1 43.9 42.6
7 CU3D 40.6 59.5 48.4 48.5
8 CU3M 35.8 75.5 65.9 69.3
9 CU3U 40.9 59.2 46.6 46.2
10 CU4D 36.6 49.1 43.6 43.4
# ... with 49 more rows
Here, I merge the two data sets PC17 and dB by "SiteID", "Year","Mo","Day", "Hr" - keeping all PC17 values (even if they don't have dB values to go with it; all.x=TRUE).
However, when I look at the summary of this data, now all of the SiteID have different values, and some sites are missing completely such as "CU3D" and "CU4D".
PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))
PCdB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 47.2 54 52.3 54
2 CU1M 35.4 63 49.2 49.2
3 CU1U 35.3 35.3 35.3 35.3
4 CU2D 42.3 42.3 42.3 42.3
5 CU2M 43.1 43.2 43.1 43.1
6 CU2U 43.7 43.7 43.7 43.7
7 CU3D Inf -Inf NaN NA
8 CU3M 44.1 71.2 57.6 57.6
9 CU3U 45 45 45 45
10 CU4D Inf -Inf NaN NA
# ... with 49 more rows
I set everything to characters with as.character() in the first lines. Additionally, I have checked Year, Day, Mo, and Hr with setdiff and unique just as I did above with SiteID, and there don't appear to be any issues with those character strings not matching.
I have also tried dplyr function left_join to merge the datasets, and it hasn't made a difference.
In the end, I answered this question with a better understanding of the data. The merge function itself was not dropping any values, since it was only doing exactly as one tells it. However, since datasets were merged by
SiteID, Year, Mo, Day, Hrthe result wasInf, NaN, and NAvalues for a fewSiteID.The reason for this is that dB is not a fully continuous dataset to merge with. Thus,
Inf, NaN, and NAvalues for someSiteIDwere returned because data did not overlap in all variables (SiteID, Year, Mo, Day, Hr).So I solved this problem with interpolation. That is, I filled the missing values in based on values from dates on either side of the missing values. The package
imputeTSwas valuable here.So I first interpolated the missing values in between the dates with data, and then I re-merged the datasets.
Because I did the above interpolation of NAs based on
jDay, I am missing the Month (Mo),Day, andYearinformation for those rows.Now when you do the merge with dB and PC17, the interpolated values (that were missing NAs before) should be included. It will look something like this: