I'm working on a search problem: I have a dataset of queries and urls. Each (query, url) pair has a relevance score (the target), a float that should preserve the ordering of the urls for a given query. I would like to perform cross-validation for my lightgbm.LGBMRanker model, with NDCG as the objective.
I went through the documentation and saw that it is important to keep the instances of a group together, because an instance is actually a query with all of its associated urls.
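For example (illustrative values, not my real data), with two queries where the first has 3 urls and the second has 2, the group information is just the list of group sizes in row order:

import numpy as np

# rows 0-2 belong to the first query, rows 3-4 to the second (made-up values)
X = np.array([[0.1], [0.4], [0.2], [0.9], [0.3]])  # one feature per (query, url) row
y = np.array([2.0, 0.0, 1.0, 3.0, 1.0])            # relevance targets
group = [3, 2]                                      # number of urls per query, in row order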
However, I am running into an issue with this, as I get the following error:
ValueError: Computing NDCG is only meaningful when there is more than 1 document. Got 1 instead.
I used the debugger, and while no group in my dataset has a size smaller than 2, I do see smaller groups inside the _feval function, meaning the cv() function did not actually keep the groups together.
In lightgbm.cv I see no sign of the group argument that is used by LGBMRanker.
However, the documentation of lightgbm.cv states that "Values passed through params take precedence over those supplied via arguments". My understanding was that this value would be passed down to the underlying model used by the cv function.
Here is the code that I have so far:
def eval_model(
    self,
    model: lightgbm.LGBMRanker,
    k_fold: int = 3,
    seed: int = 42,
):
    """Evaluates with NDCG"""

    def _feval(y_pred: np.ndarray, lgb_dataset: lightgbm.basic.Dataset):
        # Custom eval: NDCG@10 computed per query group, averaged over the fold
        y_true = lgb_dataset.get_label()
        serp_sizes = lgb_dataset.get_group()
        ndcg_values = []
        start = 0
        for size in serp_sizes:
            end = start + size
            y_true_serp, y_pred_serp = y_true[start:end], y_pred[start:end]
            ndcg_serp = sklearn.metrics.ndcg_score(
                [y_true_serp], [y_pred_serp], k=10
            )
            ndcg_values.append(ndcg_serp)
            start = end
        eval_name = "my-ndcg"
        eval_result = np.mean(ndcg_values)
        greater_is_better = True
        return eval_name, eval_result, greater_is_better

    lgb_dataset = lightgbm.Dataset(data=self.X, label=self.y, group=self.serp_sizes)
    cv_results = lightgbm.cv(
        params={**model.get_params(), "group": self.serp_sizes},
        train_set=lgb_dataset,
        num_boost_round=1_000,
        nfold=k_fold,
        stratified=False,
        seed=seed,
        feval=_feval,
    )
    ndcg = np.mean(cv_results["my-ndcg"])
    return ndcg
Where is my mistake/misunderstanding?
Is there a simple workaround to perform cross-validation with a lightgbm.LGBMRanker while keeping the groups together?
As of lightgbm==4.1.0 (the latest version as of this writing), lightgbm.sklearn.LGBMRanker cannot be used with scikit-learn's cross-validation APIs. It also cannot be passed to lightgbm.cv().
As described in LightGBM's documentation (link), lightgbm.cv() expects to be passed a lightgbm.Dataset object. group is an attribute of the Dataset object.
To perform cross-validation of a LightGBM learning-to-rank model, use lightgbm.cv() instead of lightgbm.sklearn.LGBMRanker(). Here's a minimal, reproducible example using Python 3.11.7 and lightgbm==4.1.0. lightgbm.cv() will correctly preserve query groups when creating cross-validation folds.
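The following is a minimal sketch along those lines, using synthetic data (the feature values, labels, and query sizes below are made up for illustration):

import numpy as np
import lightgbm

# Synthetic ranking data: 20 queries with 10 documents each (illustrative values)
rng = np.random.default_rng(42)
n_queries, docs_per_query = 20, 10
X = rng.normal(size=(n_queries * docs_per_query, 5))
y = rng.integers(0, 4, size=n_queries * docs_per_query)  # graded relevance labels in [0, 3]
group = [docs_per_query] * n_queries  # one entry per query: its number of documents

# group is attached to the Dataset, not passed through params
dataset = lightgbm.Dataset(data=X, label=y, group=group)

cv_results = lightgbm.cv(
    params={
        "objective": "lambdarank",
        "metric": "ndcg",
        "eval_at": [10],
        "verbosity": -1,
    },
    train_set=dataset,
    num_boost_round=50,
    nfold=3,
    stratified=False,  # stratification does not apply to ranking data
    seed=42,
)

# cv_results maps metric names to lists of per-iteration means and standard deviations
print({name: values[-1] for name, values in cv_results.items()})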
The statement you've quoted does not apply to data like group, init_score, and label, and those things should not be passed through the params keyword argument in any of LightGBM's interfaces.