It was my understanding that the final prediction of an XGBoost model (in this particular case an XGBRegressor) is obtained by summing the values of the predicted leaves [1] [2]. Yet when I sum those values myself, the result does not match the model's prediction. Here is an MRE:
import json
from collections import deque
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import xgboost as xgb
def leafs_vector(tree):
    """Yield one value per node in breadth-first order: the leaf value for leaves, 0 otherwise."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if "leaf" in node:
            yield node["leaf"]
        else:
            yield 0
            for child in node["children"]:
                queue.append(child)
# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the XGBoost regressor model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror',
                          max_depth=5,
                          n_estimators=10)
# Train the model
xg_reg.fit(X_train, y_train)
# Compute the original predictions
y_pred = xg_reg.predict(X_test)
# get the index of each predicted leaf
predicted_leafs_indices = xg_reg.get_booster().predict(xgb.DMatrix(X_test), pred_leaf=True).astype(np.int32)
# get the trees
trees = xg_reg.get_booster().get_dump(dump_format="json")
trees = [json.loads(tree) for tree in trees]
# get a vector of nodes (ordered by node id)
leafs = [list(leafs_vector(tree)) for tree in trees]
l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)))
assert np.allclose(np.array(l_pred), y_pred, atol=0.5) # fails
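The lookup above assumes that a node's position in the breadth-first walk equals its node id. That may not hold for every tree (for example, it might break under a different grow policy), so here is a minimal sketch that keys leaf values on the nodeid field of the JSON dump instead. The names leaf_values_by_nodeid and toy_tree are hypothetical, the toy values are made up, and the split fields of a real dump are omitted for brevity.

```python
def leaf_values_by_nodeid(tree):
    """Map nodeid -> leaf value for one tree parsed from a JSON dump."""
    values = {}
    pending = [tree]
    while pending:
        node = pending.pop()
        if "leaf" in node:
            # Leaves carry their own id, so no ordering assumption is needed
            values[node["nodeid"]] = node["leaf"]
        else:
            pending.extend(node["children"])
    return values


# Hand-written toy tree (made-up values). Node 1's children have ids 5 and 6,
# so the breadth-first position of a leaf differs from its node id here.
toy_tree = {
    "nodeid": 0,
    "children": [
        {"nodeid": 1, "children": [
            {"nodeid": 5, "leaf": 0.5},
            {"nodeid": 6, "leaf": 2.0},
        ]},
        {"nodeid": 2, "children": [
            {"nodeid": 3, "leaf": -1.25},
            {"nodeid": 4, "leaf": 0.25},
        ]},
    ],
}

leaf_map = leaf_values_by_nodeid(toy_tree)
```

With one such dictionary per tree, the value for a predicted leaf index p is simply leaf_map[p].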
I also tried adding the default value of base_score (0.5, as written here) to the total sum, but that didn't work either.
l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)) + 0.5)
The problem is that even if the base_score parameter of the model is None, the trained model can still have a base_score that differs from the default one [1].
The value actually used can be read back from the trained model; this works in version 2.0.3 of XGBoost.
Adding that base_score to the total sum makes it match the predicted value.
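As a minimal sketch of that fix, the snippet below parses base_score out of a JSON configuration string and adds it to the per-tree leaf sums. It assumes the configuration has the layout learner -> learner_model_param -> base_score with the value stored as a string, which is what I would expect from Booster.save_config() in 2.0.x; other versions may store it differently. The helper names and all toy values are hypothetical.

```python
import json


def base_score_from_config(config_json):
    """Read base_score from a JSON config string (layout is an assumption, see above)."""
    config = json.loads(config_json)
    # The value is assumed to be stored as a string such as "1.5E2"
    return float(config["learner"]["learner_model_param"]["base_score"])


def predict_from_leaves(leaf_values, leaf_ids, base_score):
    """For each sample, sum the selected leaf value of every tree and add base_score."""
    return [
        base_score + sum(values[i] for values, i in zip(leaf_values, row))
        for row in leaf_ids
    ]


# Toy stand-in for the string returned by xg_reg.get_booster().save_config()
toy_config = json.dumps(
    {"learner": {"learner_model_param": {"base_score": "1.5E2"}}}
)
base_score = base_score_from_config(toy_config)

# Two toy trees (leaf id -> leaf value) and the leaf ids hit by two samples
leaf_values = [{1: -1.25, 2: 0.5}, {1: 2.0, 2: -0.75}]
leaf_ids = [[1, 2], [2, 1]]

preds = predict_from_leaves(leaf_values, leaf_ids, base_score)
```

On the real model, config_json would come from xg_reg.get_booster().save_config(), and leaf_ids from the pred_leaf=True call in the question. The plain sum matches predict() here because reg:squarederror applies no transformation to the summed margin.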