Suppose I let a classification model classify a single object multiple times, under varying circumstances. Ideally it would predict the same class every time, but in reality its class predictions may vary.

So given a sequence of class predictions for the single object, I'd like to measure how consistent the sequence is. To be clear, this is not about comparing predictions against some ground truth. This is about consistency within the prediction sequence itself.

  • For instance, a perfectly consistent prediction sequence like class_a, class_a, class_a, class_a should get a perfect score.
  • A less consistent sequence like class_a, class_b, class_a, class_c should get a lower score.
  • And a completely inconsistent sequence like class_a, class_b, class_c, class_d should get the lowest score possible.

The goal is to find the objects on which the classification model may need further training. If the model is not very consistent in its predictions for a certain object, we might want to add that object to a dataset for further training.

Preferably the measure works for any number of possible classes and also takes prediction confidences into account. The sequence class_a (0.9), class_b (0.9), class_a (0.9), class_c (0.9) should give a lower score than class_a (0.9), class_b (0.2), class_a (0.8), class_c (0.3), as it's no good when the predictions are inconsistent with high confidences.
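
To make this concrete, here is a rough sketch of the kind of score I have in mind, based on the normalized entropy of a confidence-weighted class distribution (it uses scipy.stats.entropy; the confidence weighting itself is just my own idea, not a standard method):

import numpy as np
from scipy.stats import entropy

def entropy_consistency(predictions, confidences):
    # Accumulate each prediction's confidence as a weight for its class
    classes = sorted(set(predictions))
    weights = np.zeros(len(classes))
    for pred, conf in zip(predictions, confidences):
        weights[classes.index(pred)] += conf
    probs = weights / weights.sum()

    # Normalize by the entropy of a uniform spread over all predictions,
    # so 1 means "always the same class" and 0 means "every prediction differs"
    max_entropy = np.log(len(predictions))
    if max_entropy == 0:  # a single prediction is trivially consistent
        return 1.0
    return 1.0 - entropy(probs) / max_entropy

print(entropy_consistency(["a", "a", "a", "a"], [0.9] * 4))            # 1.0
print(entropy_consistency(["a", "b", "a", "c"], [0.9] * 4))            # 0.25
print(entropy_consistency(["a", "b", "c", "d"], [0.9] * 4))            # ~0.0
print(entropy_consistency(["a", "b", "a", "c"], [0.9, 0.2, 0.8, 0.3])) # ~0.5

That satisfies the three cases above and ranks high-confidence disagreements below low-confidence ones, but I'd rather use something standard if it exists.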

I could build something myself, but I'd like to know if there's a standard sklearn or scipy (or similar) function for this? Thanks in advance!

A comment on this question suggests Spearman's correlation coefficient or Kendall's correlation coefficient. I'll look into those as well.
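
For reference, both are available in scipy (scipy.stats.spearmanr and scipy.stats.kendalltau), but they measure rank correlation between two paired sequences rather than the internal consistency of a single sequence of class labels, so it's not obvious what I would correlate the predictions against:

from scipy.stats import kendalltau, spearmanr

# Both expect two paired numeric/ordinal sequences, e.g. two rankings of the
# same four items; each call returns a correlation statistic and a p-value.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(kendalltau(x, y))
print(spearmanr(x, y))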


There is 1 answer below.

DataSciRookie:

Not sure if it's what you are looking for:

from collections import Counter

def consistency_score(predictions, confidences):
    """
    Calculate a consistency score for a sequence of predictions.

    predictions -- predicted class labels for the same object
    confidences -- prediction confidences, aligned with predictions

    Returns a value in [0, 1]: the relative frequency of the most common
    class, reduced by the summed confidences of all predictions that
    disagree with it (divided by the sequence length), so confident
    disagreements lower the score the most.
    """
    # Base consistency: relative frequency of the most common class
    most_common_class, most_common_freq = Counter(predictions).most_common(1)[0]
    base_consistency = most_common_freq / len(predictions)

    # Penalize deviations from the most common class, weighted by confidence
    penalty = sum(conf for pred, conf in zip(predictions, confidences)
                  if pred != most_common_class) / len(predictions)
    adjusted_consistency = max(0, base_consistency - penalty)

    return adjusted_consistency
  • Example:

      predictions = ["class_a", "class_b", "class_a", "class_c"]
      confidences = [0.9, 0.9, 0.9, 0.9]
      score_high_confidence = consistency_score(predictions, confidences)
    
      predictions_low_confidence = ["class_a", "class_b", "class_a", "class_c"]
      confidences_low_confidence = [0.9, 0.2, 0.8, 0.3]
      score_low_confidence = consistency_score(predictions_low_confidence, confidences_low_confidence)
    
      print(f"High confidence inconsistencies score: {score_high_confidence}")
      print(f"Lower confidence inconsistencies score: {score_low_confidence}")