I'm working on my own project, preparing a KNN method for each type of problem (regression, multiclass classification, binary classification). I want to use the Gower distance for problems with mixed data.
I implemented it in R, and my code gives exactly the same results as:
library(StatMatch)
gower.dist()
The results are equal, so the Gower definition should be implemented correctly. My code is below:
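# Gower distance between two single-row data frames x1 and x2.
# info$type and info$range hold the type and range of each column.
gowerDist <- function(info, x1, x2) {
  total <- 0  # renamed from `sum` to avoid masking base::sum
  for (i in 1:ncol(x1)) {
    if (info$type[i] %in% c("categorial", "binary")) {
      # Categorical/binary: contributes 1 if the values differ
      if (x1[, i] != x2[, i]) {
        total <- total + 1
      }
    } else if (info$type[i] %in% c("ordered", "numeric")) {
      # Ordered/numeric: absolute difference scaled by the column range
      if (info$range[i] != 0) {
        total <- total + abs((as.numeric(x2[, i]) - as.numeric(x1[, i])) /
                               info$range[i])
      }
    }
  }
  return(total / ncol(x1))
}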
# Distance matrix with one row per test observation and one column per
# training observation.
gowerDistance <- function(dataTrain, dataTest) {
  data_imported <- rbind.data.frame(dataTrain, dataTest)
  information <- list(range = c(), type = c())
  # Determine the type and range of each column over train and test together
  for (i in 1:ncol(data_imported)) {
    information$type[i] <- check_variable_type(data_imported[, i])
    if (information$type[i] == "numeric") {
      information$range[i] <- as.numeric(max(data_imported[, i])) -
        as.numeric(min(data_imported[, i]))
    } else if (information$type[i] == "ordered") {
      numeric_values <- as.numeric(data_imported[, i])
      information$range[i] <- diff(range(numeric_values))
    } else {
      information$range[i] <- NA
    }
  }
  distances <- matrix(0, nrow(dataTest), nrow(dataTrain))
  # One gowerDist() call per (test, train) pair
  for (i in 1:nrow(dataTrain)) {
    for (j in 1:nrow(dataTest)) {
      distances[j, i] <- gowerDist(information, dataTest[j, ], dataTrain[i, ])
    }
  }
  return(distances)
}
The problem I'm having is the running time of this code: with around 300 observations it already executes slowly, and with more cases it can be stuck for hours. I want to cross-validate models using the Gower distance, so I'd like to speed up the process, but I don't want to implement complex structures. Is it possible to speed up this code?
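One direction that might help, given as a minimal sketch rather than a tested replacement: loop over columns only and fill the whole test-by-train block for each column with `outer()`, instead of calling a function once per pair of rows. The name `gowerDistanceVec` and the `types` argument are hypothetical. `types` is assumed to be a character vector with one label per column, using the same labels as the code above (for example, built once with `check_variable_type`). The sketch also assumes there are no missing values and that ordered factors share the same levels in both data frames.

```r
# Hypothetical vectorized variant of gowerDistance(): one pass per column.
# `types` is an assumed per-column label vector ("categorial", "binary",
# "ordered" or "numeric"); it is not part of the original code.
gowerDistanceVec <- function(dataTrain, dataTest, types) {
  p <- ncol(dataTrain)
  distances <- matrix(0, nrow(dataTest), nrow(dataTrain))
  for (k in seq_len(p)) {
    trainCol <- dataTrain[[k]]
    testCol <- dataTest[[k]]
    if (types[k] %in% c("categorial", "binary")) {
      # TRUE (counted as 1) where the two values differ, FALSE (0) otherwise
      distances <- distances +
        outer(as.character(testCol), as.character(trainCol), "!=")
    } else if (types[k] %in% c("ordered", "numeric")) {
      trainNum <- as.numeric(trainCol)
      testNum <- as.numeric(testCol)
      # Range over train and test together, as in the original code
      rng <- diff(range(c(trainNum, testNum)))
      if (rng != 0) {
        distances <- distances + abs(outer(testNum, trainNum, "-")) / rng
      }
    }
  }
  # Average over all columns; rows are test observations, columns are training
  distances / p
}

# Small usage example with made-up data
train <- data.frame(a = c(1, 3, 5), b = c("x", "y", "x"), stringsAsFactors = FALSE)
test <- data.frame(a = c(2, 5), b = c("x", "z"), stringsAsFactors = FALSE)
d <- gowerDistanceVec(train, test, types = c("numeric", "categorial"))
```

The arithmetic is meant to match the pairwise version above, but the work per column is done by vectorized operations on whole matrices, which is usually much faster in R than indexing single data frame rows inside a double loop.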