[EXAMTOPIC] Evaluation Metric - Accuracy, Precision, Recall, Classification Threshold

To evaluate a model's quality, commonly used metrics are : loss, accuracy, precision & recall, and area under the ROC curve (AUC).

  • Accuracy
    The fraction of classification predictions produced by the model that were correct.
    • # of correct predictions / Total # of predictions : (TP+TN) / (TP+TN+FP+FN)
    • For class-imbalanced datasets, use PRECISION & RECALL instead (see the sketch below).
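A minimal sketch of the formula above in Python. The 30-example counts match the spam example later in this section; the imbalanced counts are made up for illustration:

```python
def accuracy(tp, tn, fp, fn):
    # Accuracy = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

# The 30-example spam confusion matrix used below.
print(accuracy(tp=8, tn=17, fp=2, fn=3))     # 25/30 ≈ 0.83

# Class-imbalanced case (hypothetical counts): a model that always predicts
# "negative" still scores 99.99% accuracy when only 1 row in 10,000 is
# actually positive, which is why precision & recall are preferred here.
print(accuracy(tp=0, tn=9_999, fp=0, fn=1))  # 0.9999
```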

[EXAMTOPIC] Difference between Precision and Recall for binary classification

  1. Precision : of the cases the model predicted positive, the proportion that are actually positive

    • Precision = TP / (TP+FP)
    • What proportion of positive identifications was actually correct?
    • The fraction of positive predictions produced by the model that were correct. (Positive predictions : the true positives & the false positives combined.)
  2. Recall = True Positive Rate = Sensitivity

    • Recall = TP / (TP+FN)
    • What proportion of actual positives was identified correctly?
    • The fraction of rows with this label that the model correctly predicted.
    • Trades off against precision : improving one typically reduces the other.
    • Mnemonic : recall is like a person who never wants to be left out of a positive decision.
  • False positive rate
    • FPR = FP / (FP+TN) : the fraction of actual negatives that the model incorrectly predicted to be the target label (false positives).
  • F1 score
    • The harmonic mean of precision and recall : F1 = 2 × (Precision × Recall) / (Precision + Recall). (See the sketch after this list.)
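A minimal sketch of these formulas, using the baseline spam-example counts from the threshold section below (TP=8, FP=2, FN=3):

```python
def precision(tp, fp):
    # Precision = TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Recall = TP / (TP + FN)
    return tp / (tp + fn)

def f1(p, r):
    # F1 = harmonic mean of precision and recall
    return 2 * p * r / (p + r)

p = precision(tp=8, fp=2)        # 0.80
r = recall(tp=8, fn=3)           # ≈ 0.73
print(p, r, round(f1(p, r), 2))  # F1 ≈ 0.76
```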

Precision and Recall: A Tug of War

To evaluate the effectiveness of a classification model, you must examine both precision and recall. They are in tension : a trade-off!
Improving precision typically reduces recall, and vice versa.

Binary classification model : spam or not-spam, evaluated on 30 examples

Effect of threshold changes on false positives, false negatives, precision, and recall :
  • Baseline
  • Raise ⇑ : FP ⇓, FN ⇑ → Precision ⇑, Recall ⇓
  • Lower ⇓ : FP ⇑, FN ⇓ → Precision ⇓, Recall ⇑
  • Changes in Precision & Recall depending on the Threshold

    Baseline threshold
    TP : 8        FP : 2        Precision : 8 / (8+2) = 0.80
    FN : 3        TN : 17       Recall : 8 / (8+3) = 0.73
    Threshold ⇑ : Precision ⇑, Recall ⇓
    TP : 7        FP : 1 ⇓      Precision : 7 / (7+1) = 0.88 ⇑
    FN : 4 ⇑      TN : 18       Recall : 7 / (7+4) = 0.64 ⇓
    Threshold ⇓ : Precision ⇓, Recall ⇑
    TP : 9        FP : 3 ⇑      Precision : 9 / (9+3) = 0.75 ⇓
    FN : 2 ⇓      TN : 16       Recall : 9 / (9+2) = 0.82 ⇑
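A minimal sketch of this trade-off as a threshold sweep. The labels and scores below are made up for illustration (not the 30-example dataset above); raising the threshold pushes precision up and recall down:

```python
import numpy as np

# Hypothetical ground-truth labels (1 = spam) and model scores.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Expected trend: precision 0.50 -> 0.60 -> 0.67, recall 0.80 -> 0.60 -> 0.40
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```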

Check Your Understanding : Accuracy, Precision, Recall

[Accuracy]

In which of the following scenarios would a high accuracy value suggest that the ML model is doing a good job?

  • ⭕ In the game of roulette, a ball is dropped on a spinning wheel and eventually lands in one of 38 slots. Using visual features (the spin of the ball, the position of the wheel when the ball was dropped, the height of the ball over the wheel), an ML model can predict the slot that the ball will land in with an accuracy of 4%.
    The ML model is making predictions far better than chance, and the benefits of success far outweigh the disadvantages of failure.

    • A random guess would be correct 1/38 of the time—yielding an accuracy of 2.6%.
    • Model's accuracy : 4%
  • ❌ A deadly, but curable, medical condition afflicts 0.01% of the population. An ML model uses symptoms as features and predicts this affliction with an accuracy of 99.99%.
    Accuracy is not a useful metric for a CLASS-IMBALANCED DATASET like this one.

    • Even a "dumb" model that always predicts "not sick" would still be 99.99% accurate. Mistakenly predicting "not sick" for a person who actually is sick could be deadly.
  • ❌ An expensive robotic chicken crosses a very busy road a thousand times per day. An ML model evaluates traffic patterns and predicts when this chicken can safely cross the street with an accuracy of 99.99%.
    An accuracy value of 99.99% on a very busy road strongly suggests that the ML model is far better than chance.

    • However, in some settings, the cost of making even a small number of mistakes is still too high. (e.g., 99.99% accuracy means that the expensive chicken will need to be replaced, on average, every 10 days. The chicken might also cause extensive damage to cars that it hits.)
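The baseline arithmetic behind the three scenarios above, as a quick sketch (numbers taken from the text):

```python
# Roulette: a random guess over 38 slots.
print(1 / 38)            # ≈ 0.026 → 2.6% baseline vs. the model's 4%

# Rare disease: a "dumb" model that always predicts "not sick".
prevalence = 0.0001      # 0.01% of the population
print(1 - prevalence)    # 0.9999 → 99.99% accuracy with zero recall

# Robotic chicken: expected days between failed crossings.
crossings_per_day = 1000
error_rate = 1 - 0.9999  # 0.01% of crossings go wrong
print(1 / (crossings_per_day * error_rate))  # ≈ 10 days per destroyed chicken
```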
[Precision]

Consider a classification model that separates email into two categories : "spam" or "not spam." If you raise the classification threshold, what will happen to precision?

Threshold ⇑ → FP ⇓ (or same) → PRECISION WILL INCREASE ⇑ (or same)

In general, raising the classification threshold reduces false positives, thus raising precision. However, precision is not guaranteed to increase monotonically as we raise the threshold.

  • ❌ Definitely decrease.
  • ⭕ Probably increase.
  • ❌ Definitely increase.
  • ❌ Probably decrease.
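A minimal sketch of why the answer is only "probably increase" : with these made-up scores, raising the threshold from 0.3 to 0.5 removes a true positive while keeping a false positive, so precision dips before rising again.

```python
# Made-up example: three emails with model scores and true labels (1 = spam).
examples = [(0.9, 1), (0.6, 0), (0.4, 1)]

def precision_at(threshold):
    tp = sum(1 for score, label in examples if score >= threshold and label == 1)
    fp = sum(1 for score, label in examples if score >= threshold and label == 0)
    return tp / (tp + fp)

# Expected output: 0.3 -> 0.67, 0.5 -> 0.5, 0.7 -> 1.0
# (not monotonic, hence "probably increase" rather than "definitely").
for t in (0.3, 0.5, 0.7):
    print(t, round(precision_at(t), 2))
```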
[Recall]

Consider a classification model that separates email into two categories: "spam" or "not spam." If you raise the classification threshold, what will happen to recall?

Threshold ⇑ → TP ⇓ (or same) & FN ⇑ (or same) → RECALL WILL DECREASE ⇓ (or stay the same)

Raising our classification threshold will cause the number of true positives to decrease or stay the same and will cause the number of false negatives to increase or stay the same. Thus, recall will either stay constant or decrease.

  • ❌ Always stay constant.
  • ❌ Always increase.
  • ⭕ Always decrease or stay the same.
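A minimal sketch of why recall can never go up as the threshold rises : its denominator (TP + FN = all actual positives) does not depend on the threshold, so only the numerator can shrink. The scores and labels are made up:

```python
# Made-up scores and labels (1 = positive).
examples = [(0.9, 1), (0.8, 0), (0.6, 1), (0.4, 1), (0.3, 0), (0.1, 1)]
total_positives = sum(label for _, label in examples)  # fixed denominator

for t in (0.0, 0.2, 0.5, 0.7, 0.95):
    tp = sum(1 for score, label in examples if score >= t and label == 1)
    print(t, tp / total_positives)  # 1.0, 0.75, 0.5, 0.25, 0.0 -- never increases
```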
[Precision and Recall]

Consider two models—A and B—that each evaluate the same dataset. Which one of the following statements is true?

  • ⭕ If model A has better precision and better recall than model B, then model A is probably better.

    • In general, a model that outperforms another model on both precision and recall is likely the better model.
    • We still need to make sure the comparison is being done at a precision / recall point that is useful in practice for it to be meaningful.
    • e.g., suppose a spam detection model needs to have at least 90% precision to be useful and avoid unnecessary false alarms :
      (1) a model at {20% precision, 99% recall}
      (2) a model at {15% precision, 98% recall}
      ⇒ Comparing these two is not particularly instructive, as neither model meets the 90% precision requirement. With that caveat in mind, though, this is a good way to think about comparing models when using precision and recall.
  • ❌ If model A has better recall than model B, then model A is better.

    • While better recall is good, it might be coming at the expense of a large reduction in precision.
    • In general, we need to look at both precision and recall together, or use summary metrics like AUC.
  • ❌ If model A has better precision than model B, then model A is better.

    • Likewise, better precision might be coming at the expense of a large reduction in recall.
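A minimal sketch of that selection rule with hypothetical candidate models and metrics : enforce the precision floor first, then compare the survivors on recall.

```python
# Hypothetical candidate models with their measured precision and recall.
candidates = {
    "model_1": {"precision": 0.20, "recall": 0.99},
    "model_2": {"precision": 0.15, "recall": 0.98},
    "model_3": {"precision": 0.92, "recall": 0.61},
    "model_4": {"precision": 0.95, "recall": 0.55},
}

PRECISION_FLOOR = 0.90  # e.g., required to avoid unnecessary false alarms

# Keep only models that meet the precision requirement, then prefer higher recall.
usable = {name: m for name, m in candidates.items()
          if m["precision"] >= PRECISION_FLOOR}
best = max(usable, key=lambda name: usable[name]["recall"])
print(best)  # model_3: highest recall among models with >= 90% precision
```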