I have a vector of strings (y) and a single string (x). I want to compare them and find which element of y fits x best when only deletions are considered.
x = "PCOR1"
y = c("PCor", "TCor", "TMMON", "INTMAX")
What I tried so far is to use adist but it leads to strange results:
adist(x,y,costs=c(substitutions = 0, insertions = 0, deletions = 1), ignore.case=TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    1    0    0
I can take a closer look at this with:
drop(attr(adist(x,y,costs=c(substitutions = 0, insertions = 0, deletions = 1), ignore.case=TRUE, counts=TRUE),"counts"))
     ins del sub
[1,]   0   1   0
[2,]   0   1   1
[3,]   0   0   5
[4,]   1   0   5
If I read this correctly, it tells me that I need one deletion to get from "PCOR1" to "PCor", one deletion and one substitution to get from "PCOR1" to "TCor", and so on.
Why does adist return this if I set insertions and substitutions to 0? Is there a way to only use deletions?
I would expect something like:
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
It seems you want a match only when an element of y is contained in the original string x. In this case, grepl() should suffice.
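The code that originally followed this answer is not preserved here, so the following is only a minimal sketch of how grepl() could be applied, not the answerer's exact code. It assumes case should be ignored (as in the adist call in the question) and that each element of y should be treated as a literal string. The names is_substring and is_subsequence are hypothetical. Since "only deletions" strictly means a subsequence rather than a contiguous substring, a small subsequence check is included as well.

```r
x <- "PCOR1"
y <- c("PCor", "TCor", "TMMON", "INTMAX")

# grepl() is not vectorised over `pattern`, so loop over the candidates.
# fixed = TRUE treats each candidate as a literal string; case is
# normalised with toupper() because ignore.case is not used with fixed = TRUE.
is_substring <- vapply(
  y,
  function(p) grepl(toupper(p), toupper(x), fixed = TRUE),
  logical(1),
  USE.NAMES = FALSE
)
as.integer(is_substring)

# Hypothetical helper: TRUE if `p` can be obtained from `s` by deletions
# only, i.e. the characters of `p` occur in `s` in order (not necessarily
# next to each other). Case is ignored.
is_subsequence <- function(p, s) {
  p_chars <- strsplit(toupper(p), "", fixed = TRUE)[[1]]
  s_chars <- strsplit(toupper(s), "", fixed = TRUE)[[1]]
  i <- 1
  for (ch in s_chars) {
    if (i <= length(p_chars) && ch == p_chars[i]) i <- i + 1
  }
  i > length(p_chars)
}
as.integer(vapply(y, is_subsequence, logical(1), s = x, USE.NAMES = FALSE))
```

For the example data both checks flag only "PCor". They differ for inputs such as "PCR", which is a subsequence of "PCOR1" but not a contiguous substring.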