I am using Vowpal Wabbit in a contextual bandit setting, but I am stuck on a strange issue: it generates the same PMF regardless of the context. Ideally, it should generate different PMFs for action selection depending on the context. Here is the sample data I am using:
shared |Context t1=a_c t2:5 t3=a_b t4:2 t5:10
|Action arm=a1
|Action arm=a2
|Action arm=a3
|Action arm=a4
0:-5:0.09 |Action arm=a5
|Action arm=a6
|Action arm=a7
|Action arm=a8
|Action arm=a9
|Action arm=a10
|Action arm=a11
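For reference, a multi-line example like the one above could be assembled programmatically along these lines. This is only a sketch: `format_feature` and `build_adf_example` are hypothetical helper names, not part of Vowpal Wabbit, and the label layout simply mirrors the sample data (the label sits on the line of the chosen action).

```python
def format_feature(name, value):
    """Numeric values become name:value; anything else becomes the token name=value."""
    if isinstance(value, (int, float)):
        return f"{name}:{value}"
    return f"{name}={value}"


def build_adf_example(context, arms, chosen=None, cost=None, prob=None):
    """Return one multi-line example: a shared context line plus one line per arm.

    `chosen` is the zero-based position of the arm that was played; its line
    gets the `0:cost:prob` label prefix, as in the sample data.
    """
    features = " ".join(format_feature(k, v) for k, v in context.items())
    lines = [f"shared |Context {features}"]
    for i, arm in enumerate(arms):
        label = f"0:{cost}:{prob} " if i == chosen else ""
        lines.append(f"{label}|Action arm={arm}")
    return "\n".join(lines)


# Rebuild the sample shown above.
context = {"t1": "a_c", "t2": 5, "t3": "a_b", "t4": 2, "t5": 10}
arms = [f"a{i}" for i in range(1, 12)]
example = build_adf_example(context, arms, chosen=4, cost=-5, prob=0.09)
print(example)
```

Note that there is no space between the pipe and the namespace name, so every action line lands in the Action namespace.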
I initialized Vowpal Wabbit with the following settings:
--cb_explore_adf --cb_type mtr --epsilon 0.05
Here is the action distribution, which comes out the same regardless of the context in the data:
[Figure: Action Dist. of Contextual Bandit]
I am wondering what could cause Vowpal Wabbit to saturate like this. Is it something to do with the hyperparameters provided?
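Adding -q CA, which creates quadratic interaction features between the Context and Action namespaces, worked for me:
--cb_explore_adf --cb_type mtr -q CA --epsilon 0.05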
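A likely reason the interaction term matters: in a purely linear model the shared Context features add the same amount to every action's predicted cost, so they cannot change which action looks best, and an epsilon-greedy PMF then comes out identical for every context. The toy sketch below illustrates this with made-up weights. It is not Vowpal Wabbit's actual learner; `predicted_costs` and `epsilon_greedy_pmf` are hypothetical names, and the "pair" weights only stand in for the features that -q CA would generate.

```python
def predicted_costs(context, arms, weights, interactions):
    """Linear cost per arm over one-hot context and arm features.

    With interactions=True, a weight per (context, arm) pair is added,
    standing in for the quadratic features that -q CA would create.
    """
    costs = []
    for arm in arms:
        cost = weights.get(("ctx", context), 0.0) + weights.get(("arm", arm), 0.0)
        if interactions:
            cost += weights.get(("pair", context, arm), 0.0)
        costs.append(cost)
    return costs


def epsilon_greedy_pmf(costs, epsilon):
    """Spread epsilon uniformly; give the remaining mass to the lowest-cost arm."""
    k = len(costs)
    best = min(range(k), key=lambda i: costs[i])
    return [epsilon / k + (1.0 - epsilon if i == best else 0.0) for i in range(k)]


# Illustrative weights only (assumed, not learned from the data in the question).
weights = {
    ("ctx", "c1"): 0.3, ("ctx", "c2"): -0.7,
    ("arm", "a1"): -1.0, ("arm", "a2"): -0.5, ("arm", "a3"): 0.0,
    ("pair", "c2", "a3"): -2.0,
}
arms = ["a1", "a2", "a3"]

for interactions in (False, True):
    for context in ("c1", "c2"):
        costs = predicted_costs(context, arms, weights, interactions)
        print(interactions, context, epsilon_greedy_pmf(costs, 0.05))
```

Without the pair weights, both contexts put the bulk of the probability on a1; with them, context c2 shifts it to a3 while c1 stays on a1.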