I have two sets of bibliographic data. One contains records from PubMed, one from Scopus:
data_PubMed <- data.frame(
Authors = c("AB, CD.", "XX, XY.", "FG, HJ.", "KL, MN.", "OP, QR."),
Year = c(2020, 2019, 2018, 2017, 2016),
Title = c("Title 1", "Title 2", "Title 3", "Title 4", "Title 5"),
`Source title` = c("Journal A", "Journal B", "Journal C", "Journal D", "Journal E"),
DOI = c("DOI1", "DOI2", NA, "DOI4", NA),
Database = c("PubMed", "PubMed", "PubMed", "PubMed", "PubMed")
)
data_Scopus <- data.frame(
Authors = c("F.G.; H.J.", "K.L.; M.N.", "O.P.; Q.R.", "S.T.; U.V.", "C.D.; E.F."),
Year = c(2018, 2017, 2016, 2015, 2014),
Title = c("Title: 3.", "Title 4", "Title - 5.", "Title 6", "Title 7"),
`Source title` = c("Journal C", "Journal D", "Journal E", "Journal F", "Journal G"),
DOI = c(NA, "doi4", "dOi5", NA, NA),
Database = c("Scopus", "Scopus", "Scopus", "Scopus", "Scopus")
)
I want to create a combined data frame which only contains distinct records. In case of a duplicate, I want to retain the record from PubMed, discarding the one from Scopus.
The tricky part is that actual duplicates don't necessarily appear as such due to differently applied conventions when it comes to spelling authors and titles. I tried to reflect these pitfalls in my example code.
So the obvious first choice is to resort to the DOI. Sometimes the same DOI is written lowercase, sometimes uppercase, but this is easy to deal with, using tolower(DOI). So if all records had DOIs, that task would be simple. Just arrange by $Database to have "PubMed" first, then group_by(DOI) and apply a filter that keeps the PubMed record of each group (DOI):
data_Combined <- rbind(data_PubMed, data_Scopus) |>
mutate(DOI = tolower(DOI)) |>
arrange(Database) |>
group_by(DOI) |>
filter(Database == "PubMed" | (Database == "Scopus" & !any(Database == "PubMed"))) |>
ungroup()
> data_Combined
# A tibble: 6 × 6
Authors Year Title Source.title DOI Database
<chr> <dbl> <chr> <chr> <chr> <chr>
1 AB, CD. 2020 Title 1 Journal A doi1 PubMed
2 XX, XY. 2019 Title 2 Journal B doi2 PubMed
3 FG, HJ. 2018 Title 3 Journal C NA PubMed
4 KL, MN. 2017 Title 4 Journal D doi4 PubMed
5 OP, QR. 2016 Title 5 Journal E NA PubMed
6 O.P.; Q.R. 2016 Title - 5. Journal E doi5 Scopus
So with my inconsistent data, this approach is obviously wrong for a couple of reasons. "Title 6" and "Title 7" have been eliminated entirely and "Title 5" is actually a duplicate with slight differences that I would have liked to detect.
I know that there is the stringdist package that can identify "close enough" strings as matches, but I tried to implement this in my approach to no avail.