This question is perhaps in an uncanny valley between CrossValidated and StackOverflow, as I'm trying to understand the methodology of functions in an R package, in the context of executing them properly.
The data is gene expression, with thousands of variables. The outcome is a binary variable.
I have managed to get a logistic lasso from glmnet but I've been asked to do it again with the knockoff package: https://cran.r-project.org/web/packages/knockoff/index.html
The problem is, if I've understood the vignettes correctly, the choices in the package are (a) assume the response variable is normal (not true lol) or (b) pre-specify the distribution, mu, and sigma of the predictors. Perhaps it is my inexperience showing, but I don't feel confident this dataset can work for either of those things. Those who have tasked me with this cryptically insist the package works for logistic lasso, though.
Am I missing something? How would one go about doing a logistic lasso on a massive dataset using knockoff?
I found this quite tricky to work out, but I think the approach below should work with a logistic lasso model. I found the `stat.lasso_coefdiff_bin` function in the `knockoff` package, which I think is internally building a logistic lasso model, then computing knockoff statistics by comparing coefficients of original and knockoff predictors.

With my example data I could only get this to actually select some variables by making the false discovery rate (`fdr`) quite high. I'm not sure what an appropriate value might be. It will probably depend on your data, so I suggest trying it out with something lower first, maybe 0.1. You would also want to check what function to use to generate your knockoffs - I've used `create.fixed` but you can also use `create.gaussian`.

Hope that helps
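For reference, here is a minimal sketch of how the pieces described above might fit together. It runs on simulated data, not your gene expression data: the sizes `n` and `p`, the coefficient vector `beta`, and the wrapper name `logistic_stat` are all made up for illustration. I've used `create.second_order` for the knockoffs here as an assumption on my part, because as far as I can tell `create.fixed` needs more observations than variables, which would not hold with thousands of genes. Swap in whichever knockoff constructor suits your data.

```r
library(knockoff)

set.seed(1)

# Simulated stand-in data (hypothetical sizes, not the real dataset)
n <- 500
p <- 50
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Only the first 5 predictors truly affect the binary outcome
beta <- c(rep(2, 5), rep(0, p - 5))
prob <- 1 / (1 + exp(-(X %*% beta)))
y <- rbinom(n, size = 1, prob = prob)

# Wrapper around the logistic lasso statistic; cores = 1 avoids
# relying on an optional parallel backend
logistic_stat <- function(X, X_k, y) {
  stat.lasso_coefdiff_bin(X, X_k, y, cores = 1)
}

# Run the knockoff filter with a logistic lasso statistic
result <- knockoff.filter(
  X, y,
  knockoffs = create.second_order,  # assumption: model-X knockoffs
  statistic = logistic_stat,
  fdr = 0.1
)

# Indices of the selected predictors (may be empty at a low fdr)
print(result$selected)

# One knockoff statistic per predictor, and the threshold that was applied
print(head(result$statistic))
print(result$threshold)
```

If nothing is selected, looking at `result$statistic` against `result$threshold` should show how far the statistics are from the cutoff before you decide whether to relax `fdr`.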