I am trying to perform an Ordinal Logistic Regression in Python calling R's mass.polr function with rpy2 (Python interface for the R language). However, I run into trouble when there are some collinear or almost collinear columns in my predictors: mass.polr automatically discards some of those columns during fitting, which causes an error when I try to get predictions on the training data.
Here is a minimum example:
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr
pandas2ri.activate()
mass = importr("MASS")
# dataframe with two collinear predictors (x1 and x2)
df = pd.DataFrame(columns = ['target', 'x1', 'x2', 'x3'],
data = [[ 0 , 0 , 0 , 1 ],
[ 1 , 1 , 1 , 0 ],
[ 2 , 1 , 1 , 1 ]])
model = mass.polr('as.factor(target) ~ .', df, Hess = True) # gives warning below
'''
Warning message:
In polr(as.factor(target) ~ ., data = df, Hess = TRUE) :
design appears to be rank-deficient, so dropping some coefs
'''
r.predict(model, df, type = "class").__array__() # gives error below
'''
Error in X %*% object$coefficients : non-conformable arguments
'''
The same error actually happens in R as well, but there I can at least see which columns have been discarded by looking at summary(model).
Instead, in Python, r.summary(model).rx2('coefficients') (which should display the same output as summary(model) in R) does not display coefficient names, but just bare values:
array([[4.57292582e+01, 8.25605929e+02, 5.53887231e-02],
[2.11604944e+01, 2.85721885e+02, 7.40597606e-02],
[3.19476895e+01, 3.60605165e+02, 8.85946531e-02],
[5.66312792e+01, 8.93862000e+02, 6.33557296e-02]])
Does anyone know a way to retrieve the coefficient names in Python? Or is there any other workaround?
Even without
pandas2ri.activate(), the FloatMatrix that gets returned fromr.summary(model).rx2('coefficients')does not include the variable names. However, we are able to extract those names with R'sdimnamesfunction. Full example below:returns
['x1', 'x3', '0|1', '1|2'], showing x2 has been dropped.Alternatively, you can print the full model output with
r.print(r.summary(model))