statsmodels patsy hypothesis testing categorical variable in constraint 'C()'

1.5k Views Asked by At

Hi am running the following model with statsmodel and it works fine.

from statsmodels.formula.api import ols
from statsmodels.iolib.summary2 import summary_col #for summary stats of large tables
time_FE_str = ' + C(hour_of_day) + C(day_of_week) + C(week_of_year)'
weather_2_str = ' +  C(weather_index) + rain + extreme_temperature + wind_speed'
model = ols("activity_count ~ C(city_id)"+weather_2_str+time_FE_str, data=df)
results = model.fit()
print summary_col(results).tables

print 'F-TEST:'
hypotheses = '(C(weather_index) = 0), (rain=0), (extreme_temperature=0), (wind_speed=0)'
f_test = results.f_test(hypotheses)

However, I do not know how to formulate the hypthosis for the F-test if I want to include the categorical variable C(weather_index). I tried all for me imaginable versions but I always get an error.

Did someone face this issue before?

Any ideas?

F-TEST:
Traceback (most recent call last):
  File "C:/VK/scripts_python/predict_activity.py", line 95, in <module>
    f_test = results.f_test(hypotheses)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\statsmodels\base\model.py", line 1375, in f_test
    invcov=invcov, use_f=True)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\statsmodels\base\model.py", line 1437, in wald_test
    LC = DesignInfo(names).linear_constraint(r_matrix)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\design_info.py", line 536, in linear_constraint
    return linear_constraint(constraint_likes, self.column_names)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 391, in linear_constraint
    tree = parse_constraint(code, variable_names)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 225, in parse_constraint
    return infix_parse(_tokenize_constraint(string, variable_names),
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 184, in _tokenize_constraint
    Origin(string, offset, offset + 1))
patsy.PatsyError: unrecognized token in constraint
   (C(weather_index) = 0), (rain=0), (extreme_temperature=0), (wind_speed=0)
    ^
2

There are 2 best solutions below

3
Josef On BEST ANSWER

The methods t_test, wald_test and f_test are for hypothesis test on the parameters directly and not for a entire categorical or composite effect.

Results.summary() shows the parameter names that patsy created for the categorical variables. Those can be used to create contrast or restrictions for the categorical effects.

As alternative anova_lm directly computes the hypothesis test that a term,e.g. A categorical variable has no effect.

10
Bill Bell On

Leave out the C()!

I tried making an analysis of these data.

Area    Clover_yield    Yarrow_stems
A   19.0    220
A   76.7    20
A   11.4    510
A   25.1    40
A   32.2    120
A   19.5    300
A   89.9    60
A   38.8    10
A   45.3    70
A   39.7    290
B   16.5    460
B   1.8 320
B   82.4    0
B   54.2    80
B   27.4    0
B   25.8    450
B   69.3    30
B   28.7    250
B   52.6    20
B   34.5    100
C   49.7    0
C   23.3    220
C   38.9    160
C   79.4    0
C   53.2    120
C   30.1    150
C   4.0 450
C   20.7    240
C   29.8    250
C   68.5    0

When I used the linear model in the first call to ols shown in the code I eventually ran into the snag you experienced. However, when I omitted mention of the fact that Area assumes discrete levels I was able to calculate F-tests on contrasts.

import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv('clover.csv', sep='\s+')
model = ols('Clover_yield ~ C(Area) + Yarrow_stems', data=df)
model = ols('Clover_yield ~ Area + Yarrow_stems', data=df)

results = model.fit()
print (results.summary())

print (results.f_test(['Area[T.B] = Area[T.C], Yarrow_stems=150']))

Here is the output.

Notice that, the summary indicates the names that can be used in formulating contrasts for factors, in our case Area[T.B] and Area[T.C].

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           Clover_yield   R-squared:                       0.529
Model:                            OLS   Adj. R-squared:                  0.474
Method:                 Least Squares   F-statistic:                     9.726
Date:                Thu, 04 Jan 2018   Prob (F-statistic):           0.000177
Time:                        17:26:03   Log-Likelihood:                -125.61
No. Observations:                  30   AIC:                             259.2
Df Residuals:                      26   BIC:                             264.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       57.5772      6.337      9.086      0.000      44.551      70.603
Area[T.B]        0.3205      7.653      0.042      0.967     -15.411      16.052
Area[T.C]       -0.5432      7.653     -0.071      0.944     -16.274      15.187
Yarrow_stems    -0.1086      0.020     -5.401      0.000      -0.150      -0.067
==============================================================================
Omnibus:                        0.459   Durbin-Watson:                   2.312
Prob(Omnibus):                  0.795   Jarque-Bera (JB):                0.449
Skew:                           0.260   Prob(JB):                        0.799
Kurtosis:                       2.702   Cond. No.                         766.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<F test: F=array([[ 27873807.59795523]]), p=4.939796675253845e-83, df_denom=26, df_num=2>

As mentioned in my comment, I'm not clear what you intended to test.

Edit: Prompted by user333700's comment, I've tried running this code again, with two slighly different statements, model = ols('Clover_yield ~ C(Area) + Yarrow_stems', data=df) and print (results.f_test(['C(Area)[T.B] = C(Area)[T.C], Yarrow_stems=150'])). 'C(Area)[T.B]' and 'C(Area)[T.C]' came from the summary with the changed model. Thus, it doesn't matter, for this type of analysis, whether you declare with C() or not. You must simply remember to use the appropriate form for the dummy variables, as mentioned in the summary.