Using GLM to reproduce built-in regression models in statsmodels

1.8k Views Asked by At

I am currently trying to reproduce a regression model eq. (3) (edit: fixed link) in python using statsmodels. As this model is no part of the standard models provided by statsmodels I clearly have to write it myself using the provided formula api.

Since I have never worked with the formula api (or patsy for that matter), I wanted to start and verify my approach by reproducing standard models with the formula api and a generalized linear model. My code and the results for a poisson regression are given below at the end of my question.

You will see that it predicts the parameters beta = (2, -3, 1) for all three models with good accuracy. However, I have a couple of questions:

  1. How do I explicitly add covariates to the glm model with a regression coefficient equal to 1?

From what I understand, a poisson regression in general has the shape ln(counts) = exp(intercept + beta * x + log(exposure)), i.e. the exposure is added through a fixed constant of value 1. I would like to reproduce this behaviour in my glm model, i.e. I want something like ln(counts) = exp(intercept + beta * x + k * log(exposure)) where k is a fixed constant as a formula.

Simply using formula = "1 + x1 + x2 + x3 + np.log(exposure)" returns a perfect seperation error (why?). I can bypass that by adding some random noise to y, but in that case np.log(exposure) has a non-unity regression coefficient, i.e. it is treated as a normal regression covariate.

  1. Apparently both built-in models 1 and 2 have no intercept, eventhough I tried to explicitly add one in model 2. Or is there a hidden intercept that is simply not reported? In either case, how do I fix that?

Any help would be greatly appreciated, so thanks in advance!

import numpy as np
import pandas as pd

np.random.seed(1+8+2022)

# Number of random samples
n = 4000

# Intercept
alpha = 1

# Regression Coefficients
beta = np.asarray([2.0,-3.0,1.0])

# Random Data
data =  {
            "x1" :  np.random.normal(1.00,0.10, size = n),
            "x2" :  np.random.normal(1.50,0.15, size = n),
            "x3" :  np.random.normal(-2.0,0.20, size = n),
            "exposure":  np.random.poisson(14, size = n),
        }

# Calculate the response
x = np.asarray([data["x1"], data["x2"] , data["x3"]]).T
t = np.asarray(data["exposure"])

# Add response to random data
data["y"] = np.exp(alpha + np.dot(x,beta) + np.log(t))

# Convert dict to df
data = pd.DataFrame(data)

print(data)

#-----------------------------------------------------
#   My Model
#-----------------------------------------------------

import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = "y ~ x1 + x2 + x3"
model = smf.glm(formula=formula, data=data, family=sm.families.Poisson()).fit()

print(model.summary())

#-----------------------------------------------------
#   statsmodels.discrete.discrete_model.Poisson 1
#-----------------------------------------------------

import statsmodels.api as sm

data["offset"] = np.ones(n)

model = sm.Poisson( endog = data["y"],
                    exog = data[["x1", "x2", "x3"]],
                    exposure = data["exposure"],
                    offset = data["offset"]).fit()

print(model.summary())

#-----------------------------------------------------
#   statsmodels.discrete.discrete_model.Poisson 2
#-----------------------------------------------------

import statsmodels.api as sm

data["x1"] = sm.add_constant(data["x1"])

model = sm.Poisson( endog = data["y"],
                    exog = data[["x1", "x2", "x3"]],
                    exposure = data["exposure"]).fit()

print(model.summary())

RESULTS:

            x1        x2        x3  exposure         y
0     1.151771  1.577677 -1.811903        13  0.508422
1     0.897012  1.678311 -2.327583        22  0.228219
2     1.040250  1.471962 -1.705458        13  0.621328
3     0.866195  1.512472 -1.766108        17  0.478107
4     0.925470  1.399320 -1.886349        13  0.512518
...        ...       ...       ...       ...       ...
3995  1.073945  1.365260 -1.755071        12  0.804081
3996  0.855000  1.251951 -2.173843        11  0.439639
3997  0.892066  1.710856 -2.183085        10  0.107643
3998  0.763777  1.538938 -2.013619        22  0.363551
3999  1.056958  1.413922 -1.722252        19  1.098932

[4000 rows x 5 columns]
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                            GLM   Df Residuals:                     3996
Model Family:                 Poisson   Df Model:                            3
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2743.7
Date:                Sat, 08 Jan 2022   Deviance:                       141.11
Time:                        09:32:32   Pearson chi2:                     140.
No. Iterations:                     4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.6857      0.378      9.755      0.000       2.945       4.426
x1             2.0020      0.227      8.800      0.000       1.556       2.448
x2            -3.0393      0.148    -20.604      0.000      -3.328      -2.750
x3             0.9937      0.114      8.719      0.000       0.770       1.217
==============================================================================
Optimization terminated successfully.
         Current function value: 0.668293
         Iterations 10
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                        Poisson   Df Residuals:                     3997
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Jan 2022   Pseudo R-squ.:                 0.09462
Time:                        09:32:32   Log-Likelihood:                -2673.2
converged:                       True   LL-Null:                       -2952.6
Covariance Type:            nonrobust   LLR p-value:                4.619e-122
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.0000      0.184     10.875      0.000       1.640       2.360
x2            -3.0000      0.124    -24.160      0.000      -3.243      -2.757
x3             1.0000      0.094     10.667      0.000       0.816       1.184
==============================================================================
Optimization terminated successfully.
         Current function value: 0.677893
         Iterations 5
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 4000
Model:                        Poisson   Df Residuals:                     3997
Method:                           MLE   Df Model:                            2
Date:                Sat, 08 Jan 2022   Pseudo R-squ.:                 0.08162
Time:                        09:32:32   Log-Likelihood:                -2711.6
converged:                       True   LL-Null:                       -2952.6
Covariance Type:            nonrobust   LLR p-value:                2.196e-105
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.9516      0.304      9.711      0.000       2.356       3.547
x2            -2.9801      0.147    -20.275      0.000      -3.268      -2.692
x3             0.9807      0.113      8.655      0.000       0.759       1.203
==============================================================================

Process return 0 (0x0)
Press Enter to continue...
0

There are 0 best solutions below