Inspection of contextual bandit prediction results


I have trained a contextual bandit model on some production policy data with

--cb <nb_labels> --cb_type ips -b 25 -d /tmp/data/train.vw -f candidate-model.vw -c

and got the following log output:

final_regressor = candidate-model.vw
using cache_file = /tmp/data/train.vw.cache
ignoring text input in favor of cache input
num sources = 1
Num weight bits = 25
learning rate = 0.1
initial_t = 0
power_t = 0.5
cb_type = ips
Enabled learners: gd, scorer-identity, csoaa_ldf-rank, cb_adf, shared_feature_merger, cb_to_cbadf
Input label = CB
Output pred = MULTICLASS
average  since         example        example        current        current  current
loss     last          counter         weight          label        predict features
0.000000 0.000000            1            1.0       38:1:0.9            0:0      988
0.000000 0.000000            2            2.0       38:1:0.9            0:0     1824
0.000000 0.000000            4            4.0      28:1:0.81            0:0      380
0.000000 0.000000            8            8.0       38:1:0.9            0:0      380
0.000000 0.000000           16           16.0        0:1:0.1       21:-0.09      988
0.205863 0.411726           32           32.0      50:1:0.15          50:-0      836
0.258559 0.311255           64           64.0    24:1:0.0076        33:-0.1      380
0.129279 0.000000          128          128.0      21:1:0.59       13:-0.18      912
0.130960 0.132640          256          256.0      21:1:0.59       15:-0.21      380
0.098640 0.066320          512          512.0      21:1:0.59       36:-0.55      380
0.309922 0.521204         1024         1024.0      28:1:0.81        35:-0.2      380
0.286929 0.263936         2048         2048.0       38:1:0.9        74:-1.3     1520
0.286034 0.285139         4096         4096.0      21:1:0.59       17:-0.03      380
0.177546 0.069058         8192         8192.0      28:1:0.81            2:0      380
0.144739 0.111932        16384        16384.0      21:1:0.59       51:-0.02      380
0.080961 0.017183        32768        32768.0      21:1:0.59       51:-0.01      380
0.113071 0.145181        65536        65536.0      50:1:0.15       66:-0.13      988
0.082870 0.052669       131072       131072.0       38:1:0.9            2:0      304
0.080738 0.078605       262144       262144.0       38:1:0.9       66:-0.08      304
0.096056 0.111375       524288       524288.0       21:1:0.6          27:-0      380
0.075760 0.055464      1048576      1048576.0      38:1:0.89       44:-0.11      380
0.135795 0.195831      2097152      2097152.0       6:1:0.79       25:-0.12      988

finished run
number of examples = 2472547
weighted example sum = 2472547.000000
weighted label sum = 0.000000
average loss = 0.125687
total feature number = 1735872072
  • I don't understand why I get negative scores in the current predict column. Does anyone have a possible explanation? I can't find any documentation about these scores, so any help is much appreciated.
  • Rows of the training and test data contain the explicitly available actions, e.g. 19 21:1:0.8088 33 40 60 72 |User feat1=<feat1> feat2=<feat2> ..., but the predictions contain labels outside that set. Is this expected behavior? If I want to restrict predictions strictly to the given available actions, do I need to switch to cb_adf (even when I have no rich features associated with the actions)? See the ADF sketch after this list.
  • On the test set I get quite a low average loss, which suggests the optimization worked well (the average loss is significantly lower than the sum of the costs divided by the length of the test set, i.e. the candidate policy outperforms the logged one) despite the negative scores. This puzzles me, since I can't yet properly evaluate the quality of the new optimized policy.
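
My understanding is that --cb_adf takes the candidate actions as explicit lines per example, so predictions are constrained to the listed actions even when the per-action features are trivial. A rough sketch of one logged example rewritten in that format (the shared/Action namespaces, the action-id features and the file names are placeholders, not my real data):

shared |User feat1=<feat1> feat2=<feat2>
|Action id=19
0:1:0.8088 |Action id=21
|Action id=33
|Action id=40
|Action id=60
|Action id=72

Here the chosen action (21, with cost 1 and logged probability 0.8088) carries the cost label on its own line, and training would be invoked with something like

vw --cb_adf --cb_type ips -b 25 -d /tmp/data/train_adf.vw -f candidate-model-adf.vw -c

Is that the right way to restrict the predictions to the available actions?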

Note: training on the same data with cb_explore runs normally, with no issues and no negative scores/probabilities.
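
To be concrete about the test-set numbers in the last bullet, the kind of run I mean is loading the saved model in test-only mode and writing the raw predictions to a file for inspection; a minimal sketch (the test path and the prediction file are placeholders):

vw -t -i candidate-model.vw -d /tmp/data/test.vw -p /tmp/data/preds.txt

The -p file then contains the model's prediction for each test example, which is what I am trying to cross-check against the available actions.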
