I'm working on a search problem: I have a dataset of queries and urls. Each (query, url) pair has a relevance score (the target), a float that should preserve the ordering of the urls for a given query. I would like to perform cross-validation for my lightgbm.LGBMRanker model, with NDCG as the objective.
I went through the documentation and saw that it is important to keep the instances of a group together, because an instance is actually a query with all of its associated urls.
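For example (illustrative values, not my real data), with two queries where the first has 3 urls and the second has 2, the group information is just the list of group sizes in row order:

import numpy as np

# rows 0-2 belong to the first query, rows 3-4 to the second (made-up values)
X = np.array([[0.1], [0.4], [0.2], [0.9], [0.3]])  # one feature per (query, url) row
y = np.array([2.0, 0.0, 1.0, 3.0, 1.0])            # relevance targets
group = [3, 2]                                      # number of urls per query, in row order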
However, I am running into an issue with this, as I get the following error:
ValueError: Computing NDCG is only meaningful when there is more than 1 document. Got 1 instead.
I used the debugger, and while no group in my dataset has a size smaller than 2, I do see smaller groups inside the _feval function, meaning the cv() function did not actually keep the groups together.
In lightgbm.cv I see no sign of the group argument that is used by LGBMRanker.
However, the documentation of lightgbm.cv states that "Values passed through params take precedence over those supplied via arguments". My understanding was that this value would be passed down to the underlying model used by the cv function.
Here is the code that I have so far:
def eval_model(
    self,
    model: lightgbm.LGBMRanker,
    k_fold: int = 3,
    seed: int = 42,
):
    """Evaluates with NDCG"""

    def _feval(y_pred: np.ndarray, lgb_dataset: lightgbm.basic.Dataset):
        # Custom eval: NDCG@10 computed per query group, averaged over the fold
        y_true = lgb_dataset.get_label()
        serp_sizes = lgb_dataset.get_group()
        ndcg_values = []
        start = 0
        for size in serp_sizes:
            end = start + size
            y_true_serp, y_pred_serp = y_true[start:end], y_pred[start:end]
            ndcg_serp = sklearn.metrics.ndcg_score(
                [y_true_serp], [y_pred_serp], k=10
            )
            ndcg_values.append(ndcg_serp)
            start = end
        eval_name = "my-ndcg"
        eval_result = np.mean(ndcg_values)
        greater_is_better = True
        return eval_name, eval_result, greater_is_better

    lgb_dataset = lightgbm.Dataset(data=self.X, label=self.y, group=self.serp_sizes)
    cv_results = lightgbm.cv(
        params={**model.get_params(), "group": self.serp_sizes},
        train_set=lgb_dataset,
        num_boost_round=1_000,
        nfold=k_fold,
        stratified=False,
        seed=seed,
        feval=_feval,
    )
    ndcg = np.mean(cv_results["my-ndcg"])
    return ndcg
Where is my mistake/misunderstanding?
Is there a simple workaround to perform cross-validation with a lightgbm.LGBMRanker while keeping the groups together?
As of lightgbm==4.1.0 (the latest version as of this writing), lightgbm.sklearn.LGBMRanker cannot be used with scikit-learn's cross-validation APIs. It also cannot be passed to lightgbm.cv().
As described in LightGBM's documentation (link), lightgbm.cv() expects to be passed a lightgbm.Dataset object. group is an attribute of the Dataset object.
To perform cross-validation of a LightGBM learning-to-rank model, use lightgbm.cv() instead of lightgbm.sklearn.LGBMRanker(). Here's a minimal, reproducible example using Python 3.11.7 and lightgbm==4.1.0. lightgbm.cv() will correctly preserve query groups when creating cross-validation folds.
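The following is a minimal sketch along those lines, using synthetic data (the feature values, labels, and query sizes below are made up for illustration):

import numpy as np
import lightgbm

# Synthetic ranking data: 20 queries with 10 documents each (illustrative values)
rng = np.random.default_rng(42)
n_queries, docs_per_query = 20, 10
X = rng.normal(size=(n_queries * docs_per_query, 5))
y = rng.integers(0, 4, size=n_queries * docs_per_query)  # graded relevance labels in [0, 3]
group = [docs_per_query] * n_queries  # one entry per query: its number of documents

# group is attached to the Dataset, not passed through params
dataset = lightgbm.Dataset(data=X, label=y, group=group)

cv_results = lightgbm.cv(
    params={
        "objective": "lambdarank",
        "metric": "ndcg",
        "eval_at": [10],
        "verbosity": -1,
    },
    train_set=dataset,
    num_boost_round=50,
    nfold=3,
    stratified=False,  # stratification does not apply to ranking data
    seed=42,
)

# cv_results maps metric names to lists of per-iteration means and standard deviations
print({name: values[-1] for name, values in cv_results.items()})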
The statement you've quoted does not apply to data like group, init_score, and label, and those things should not be passed through the params keyword argument in any of LightGBM's interfaces.