[EXAMTOPIC] Data Prep - Imbalanced data

Imbalanced Data

A classification data set with skewed class proportions has majority classes and minority classes.
  • When the training data is concentrated in a particular class, the model spends most of its training on that class and does not learn enough about the minority class (e.g., a fraud detection model).
    • The balancing issue corresponds to the difference in the number of samples across classes. With a greater imbalance ratio, the decision function favours the class with the larger number of samples (the majority class).
      Solution for Imbalanced data : SAMPLING METHOD.
  • First try training on the true distribution ! : If a model trained on the original data distribution works well and generalizes, there is no need to use sampling techniques.
  • A model trained on imbalanced data dominated by the majority class can display very high training accuracy (simply by predicting the majority class).
    • For imbalanced data, proper measures give better insight: Balanced Accuracy, Precision-Recall curves, AUC, F1-score (rather than plain accuracy)
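As a sketch of why plain accuracy misleads here (pure-Python toy example; the 99:1 split and the always-majority classifier are illustrative assumptions, not from the source):

```python
# Toy labels: 990 negatives, 10 positives (99:1 imbalance, illustrative).
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall, then balanced accuracy = mean of per-class recalls.
recall_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 10
recall_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / 990
balanced_accuracy = (recall_pos + recall_neg) / 2

print(accuracy)           # 0.99 -- looks great
print(balanced_accuracy)  # 0.5  -- no better than chance on the minority class
```

The 99% accuracy says nothing about fraud detection ability; balanced accuracy exposes that the minority class is never predicted.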

Downsampling & Upweighting

  • Downsampling : Training on a disproportionately low subset of the majority class examples.
  • Upweighting : adding an example weight to each downsampled example, equal to the factor by which you downsampled.

Effects of Downsampling & Upweighting : Faster Convergence, Saving on Disk Space, Calibration

  1. Faster convergence : during training, the model sees the minority class more often than before, which helps it converge faster.
  2. Saving on disk space : consolidating the majority class into fewer examples with larger weights ⇒ the savings free disk space for the minority class (so we can collect a greater number and a wider range of examples from that class).
  3. Model calibration : upweighting ensures the model is still calibrated ⇒ its outputs can still be interpreted as probabilities.
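A minimal arithmetic sketch of the calibration point (the 1:200 ratio and the factor of 20 come from the fraud example below; the variable names are ours): downsampling alone inflates the positive rate the model sees, and upweighting restores the original base rate in the loss.

```python
positives = 1
negatives = 200
factor = 20  # downsampling factor applied to the majority (negative) class

# After downsampling: keep 1 in `factor` negatives.
negatives_kept = negatives // factor                       # 10
seen_pos_rate = positives / (positives + negatives_kept)   # ~9.1%: what the model sees

# Upweighting: each kept negative counts `factor` times in the loss,
# so the *effective* base rate matches the original distribution.
effective_neg = negatives_kept * factor                    # 200
effective_pos_rate = positives / (positives + effective_neg)
print(effective_pos_rate)  # ~0.005 -- same base rate as the raw data
```

Without the weights, a well-fit model would predict fraud probabilities near 9%, not the true 0.5%; with them, the probabilities stay interpretable.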

Process of Downsampling & Upweighting

A conceptual diagram of downsampling and upweighting
1. Downsampling : pulls a randomly selected example from a block representing the dataset of the majority class.
2. Upweighting : adds a weight to each randomly selected example.
  1. Downsample the majority class

    Fraud data Downsampling by a factor of 20
    From 0.5% positives (1 positive to 200 negatives) to 10% positives (1 positive to 10 negatives)
    • The 10% proportion of positives to negatives is still moderately imbalanced, but much better than the original, extremely imbalanced 0.5%.
  2. Upweight the downsampled class : add example weights to the downsampled class equal to the downsampling factor of 20

    • {example weight} = {original example weight} × {downsampling factor}
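The two steps above can be sketched in pure Python (the dataset shape, feature dict, and 10-positive/2000-negative counts are illustrative assumptions matching the 1:200 ratio):

```python
import random

random.seed(0)  # reproducible sketch

# Toy fraud dataset as (features, label, weight): 10 positives, 2000 negatives (1:200).
data = [({"x": i}, 1, 1.0) for i in range(10)] + \
       [({"x": i}, 0, 1.0) for i in range(2000)]

FACTOR = 20  # downsampling factor from the example above

# Step 1: downsample the majority (negative) class by FACTOR.
positives = [ex for ex in data if ex[1] == 1]
negatives = [ex for ex in data if ex[1] == 0]
negatives_down = random.sample(negatives, len(negatives) // FACTOR)

# Step 2: upweight the downsampled class:
# {example weight} = {original example weight} x {downsampling factor}
negatives_down = [(feats, label, w * FACTOR) for feats, label, w in negatives_down]

train = positives + negatives_down
pos_rate = len(positives) / len(train)
print(f"training positive rate: {pos_rate:.1%}")  # ~9.1%, vs ~0.5% originally
```

A training framework that accepts per-example weights (most do) then sees the 10% mix for faster convergence while the weighted loss still reflects the true 0.5% base rate.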