I have a dataset of universities that I need to merge with a second dataset containing metadata on universities. The challenge is that my data and the metadata do not name every university in exactly the same way.
# Example data: each institution appears twice
mydata <- data.frame(
  id = c("harvard university",
         "harvard university",
         "university of cambridge school of clinical medicine",
         "university of cambridge school of clinical medicine",
         "stony brook university",
         "stony brook university",
         "medical research council laboratory of molecular biology",
         "medical research council laboratory of molecular biology",
         "netherlands cancer institute",
         "netherlands cancer institute"),
  stringsAsFactors = FALSE
)
# Metadata: one row per institution, named slightly differently
metadata <- data.frame(
  id = c("havard university",
         "university of cambridge",
         "stony brook university, the state university of new york",
         "mrc laboratory of molecular biology",
         "the netherlands cancer institute"),
  coolinfo = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
Cases where the metadata name is longer because something follows the name used in my data can be merged successfully like this:
#First part of my data matches metadata
metadata$n <- seq.int(nrow(metadata))
mydata$n <- charmatch(mydata$id, metadata$id)
merged <- merge(mydata, metadata, by="n")
Cases where the metadata name is longer because something precedes the name used in my data can be merged successfully like this:
#Last part of my data matches metadata
metadata$n <- seq.int(nrow(metadata))
metadata$id_revs <- sapply(strsplit(metadata$id, "\\s+"), function(x) paste(rev(x), collapse=" "))
mydata$id_revs <- sapply(strsplit(mydata$id, "\\s+"), function(x) paste(rev(x), collapse=" "))
mydata$n <- charmatch(mydata$id_revs, metadata$id_revs)
merged <- merge(mydata, metadata, by="n")
However, these two methods do not successfully merge the opposite cases, i.e., where my data has a longer name than the metadata, because:
#First part of metadata matches my data
mydata$n <- seq.int(nrow(mydata))
metadata$n <- charmatch(metadata$id, mydata$id)
...results in 0 for "university of cambridge", because charmatch returns 0 when a string has several partial matches, and this one matches more than one row in my data. The same would happen in the case where the last part of the metadata name matches my data.
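To make the return values concrete, here is a minimal sketch using a made-up lookup vector (`tbl` is hypothetical and only for illustration). It relies on the documented behaviour of `charmatch`: the index for a unique partial match, 0 for an ambiguous one, and `NA` when nothing matches. The last call is only an assumption about one possible workaround, not a full solution.

```r
# Hypothetical lookup vector, for illustration only
tbl <- c("university of cambridge school of clinical medicine",
         "university of cambridge school of clinical medicine",
         "stony brook university")

unique_hit <- charmatch("stony brook", tbl)              # one candidate: its index
ambiguous  <- charmatch("university of cambridge", tbl)  # two candidates: 0
no_hit     <- charmatch("harvard university", tbl)       # no candidate: NA

# Assumption: when the ambiguity comes only from duplicated rows,
# matching against the de-duplicated values avoids the 0
dedup_hit <- charmatch("university of cambridge", unique(tbl))

print(c(unique_hit, ambiguous, no_hit, dedup_hit))
```

De-duplicating only removes the ambiguity caused by repeated rows; it would not help if two different long names in my data started with the same words.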
Can anyone suggest a solution that merges my data with the metadata and takes advantage of the fact that there is a match in the first or last part of the string?
I know that in my example data I could simply remove every "the" from both id variables, but this is not viable for my real data: there are too many cases where a single word distinguishes the id in my data from the id in the metadata to write them all out manually.