Need help implementing a custom loss function in LightGBM (zero-inflated lognormal loss)


I'm trying to implement the zero-inflated lognormal (ZILN) loss from this paper (https://arxiv.org/pdf/1912.07753.pdf, page 5) as a custom loss function in LightGBM, but admittedly I just don't know how. I don't understand how to derive the gradient and Hessian of this loss in order to implement it in LightGBM, and I've never had to implement a custom loss function before.
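For reference, my reading of the loss from page 5 of the paper (and from the authors' TensorFlow code, linked below), so please correct me if I have this wrong: the model produces three raw outputs per sample, (p_logit, loc, scale_logit), and the per-sample loss is a classification term plus a regression term that only applies when the label y is positive:

p = sigmoid(p_logit)
scale = softplus(scale_logit)
loss = -( 1[y > 0] * log(p) + 1[y == 0] * log(1 - p) )           # cross-entropy on "is y positive"
       + 1[y > 0] * ( log(y) + log(scale) + 0.5 * log(2 * pi)    # negative log-likelihood of a
                      + (log(y) - loc)**2 / (2 * scale**2) )     # lognormal evaluated at positive y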

The authors of this paper have open-sourced their code, and the loss is available in TensorFlow (https://github.com/google/lifetime_value/blob/master/lifetime_value/zero_inflated_lognormal.py), but I'm unable to translate it into the form required for a custom loss function in LightGBM. As an example of how LightGBM accepts custom loss functions, a log-likelihood loss would be written as:

def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess

Similarly, I would need to define a custom eval metric to accompany it, such as:

def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False

Both examples above are taken from this file in the LightGBM repository:

https://github.com/microsoft/LightGBM/blob/e83042f20633d7f74dda0d18624721447a610c8b/examples/python-guide/advanced_example.py#L136

I would appreciate any help with this, and especially detailed guidance that helps me learn how to do this on my own.

According to the LightGBM documentation for custom objective functions:

It should have the signature objective(y_true, y_pred) -> grad, hess or objective(y_true, y_pred, group) -> grad, hess:

y_true: numpy 1-D array of shape = [n_samples]
The target values.

y_pred: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.

group: numpy 1-D array
Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

grad: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The value of the first order derivative (gradient) of the loss with respect to the elements of y_pred for each sample point.

hess: numpy 1-D array of shape = [n_samples] or numpy 2-D array of shape = [n_samples, n_classes] (for multi-class task)
The value of the second order derivative (Hessian) of the loss with respect to the elements of y_pred for each sample point.
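To make the target concrete, this is the shape of the function I think I need to fill in (just a sketch of the signature, assuming the raw scores arrive as a 2-D [n_samples, 3] array as in the multi-class case above; the ZILN gradient and Hessian that go inside are exactly what I don't know how to derive):

import numpy as np

def ziln_objective(y_true, y_pred):
    # y_pred: raw scores, assumed to be an [n_samples, 3] array with one column
    # per ZILN parameter (classification logit, loc, raw scale)
    grad = np.zeros_like(y_pred)  # first derivative of the loss w.r.t. each raw score
    hess = np.zeros_like(y_pred)  # second derivative of the loss w.r.t. each raw score
    # ... this is the part I don't know how to fill in ...
    return grad, hess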
1 Answer

Answer from CAPSLOCK:

This is the "translation", as you put it, of the TensorFlow implementation. Most of the work is just defining the helper functions yourself (i.e. sigmoid, softplus, cross-entropy, etc.).

The mean absolute percentage error (computed over label and prediction deciles, as in the linked paper) is used as the eval metric below; I'm not sure if that is the metric you want to use.

import numpy as np

epsilon = 1e-7

def sigmoid(x):
    # vectorized logistic function
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x, beta=1):
    # vectorized softplus: log(1 + exp(beta * x)) / beta
    return np.log1p(np.exp(beta * x)) / beta

def BinaryCrossEntropy(y_true, y_pred):
    # y_pred must be probabilities (apply sigmoid to raw logits first)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    term_0 = (1 - y_true) * np.log(1 - y_pred + epsilon)
    term_1 = y_true * np.log(y_pred + epsilon)
    return -np.mean(term_0 + term_1, axis=0)

def zero_inflated_lognormal_pred(logits):
    # expected value of the ZILN distribution:
    # P(y > 0) * E[lognormal(loc, scale)]
    positive_probs = sigmoid(logits[..., :1])
    loc = logits[..., 1:2]
    scale = softplus(logits[..., 2:])
    preds = positive_probs * np.exp(loc + 0.5 * np.square(scale))
    return preds

def mean_abs_pct_error(preds, train_data):
    # decile-level MAPE, as in the paper; preds should be the point predictions
    # from zero_inflated_lognormal_pred, not the raw logits
    labels = train_data.get_label()
    decile_labels = np.percentile(labels, np.linspace(10, 100, 10))
    decile_preds = np.percentile(preds, np.linspace(10, 100, 10))
    MAPE = np.sum(np.absolute(decile_preds - decile_labels) / decile_labels)
    return 'error', MAPE, False

def zero_inflated_lognormal_loss(train_data, logits):
    # scalar ZILN loss = binary cross-entropy on "y > 0"
    #                    + lognormal negative log-likelihood on the positive samples
    # logits has three columns per sample: (p_logit, loc, scale_logit)
    labels = train_data.get_label().astype(float)
    positive = (labels > 0).astype(float)

    positive_probs = sigmoid(logits[..., 0])
    classification_loss = BinaryCrossEntropy(
        y_true=positive, y_pred=positive_probs)

    loc = logits[..., 1]
    scale = np.maximum(
        softplus(logits[..., 2]),
        np.sqrt(epsilon))
    safe_labels = positive * labels + (1 - positive) * np.ones(labels.shape)

    # log-density of a lognormal with parameters (loc, scale), evaluated at safe_labels
    log_prob = (
        -np.log(safe_labels) - np.log(scale) - 0.5 * np.log(2 * np.pi)
        - np.square(np.log(safe_labels) - loc) / (2 * np.square(scale)))
    regression_loss = -np.mean(positive * log_prob)

    return classification_loss + regression_loss
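
One caveat: zero_inflated_lognormal_loss above returns the loss value itself, whereas a LightGBM custom objective has to return the gradient and Hessian of the loss with respect to the raw scores (see the signature quoted in the question). Below is a sketch of those derivatives, reusing the sigmoid, softplus and epsilon defined above. This is my own derivation from the loss formula, not something taken from the paper's code, so verify it before relying on it; in particular, the second derivative for the scale column is just a crude positive placeholder.

def ziln_grad_hess(labels, logits):
    # Sketch: per-sample gradient / Hessian of the ZILN loss w.r.t. the three raw
    # outputs (p_logit, loc, scale_logit). logits is assumed to be an [n_samples, 3]
    # array; depending on your LightGBM version, the raw scores it hands you may
    # need reshaping into that layout first.
    labels = labels.astype(float)
    positive = (labels > 0).astype(float)
    safe_log_y = np.log(np.where(labels > 0, labels, 1.0))  # log(y), only used where y > 0

    p = sigmoid(logits[:, 0])
    loc = logits[:, 1]
    s = logits[:, 2]
    scale = np.maximum(softplus(s), np.sqrt(epsilon))

    grad = np.zeros_like(logits, dtype=float)
    hess = np.zeros_like(logits, dtype=float)

    # classification column: standard sigmoid cross-entropy derivatives
    grad[:, 0] = p - positive
    hess[:, 0] = p * (1.0 - p)

    # loc column: from (log(y) - loc)^2 / (2 * scale^2), positive samples only
    grad[:, 1] = positive * (loc - safe_log_y) / scale ** 2
    hess[:, 1] = positive / scale ** 2 + epsilon  # small floor so it is never exactly zero

    # scale column: chain rule through softplus (d softplus(s) / d s = sigmoid(s));
    # the exact second derivative can be negative, so a constant is used as a crude
    # positive placeholder here
    dloss_dscale = positive * (1.0 / scale - (safe_log_y - loc) ** 2 / scale ** 3)
    grad[:, 2] = dloss_dscale * sigmoid(s)
    hess[:, 2] = np.ones_like(s)

    return grad, hess

A function like this, wrapped to match whichever signature your LightGBM version expects (objective(y_true, y_pred) as in the documentation quoted above, or fobj(preds, train_data) as in the loglikelihood example), is what goes in as the custom objective, while mean_abs_pct_error stays as the eval metric.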