[EXAMTOPIC] Data Prep - Imbalanced data

Imbalanced Data

A classification data set with skewed class proportions has majority classes and minority classes.
  • When the training data is concentrated in a particular class, the model spends most of its training on that class and does not learn enough about the minority class (e.g., a fraud detection model).
    • The balancing issue corresponds to the difference in the number of samples across classes. With a greater imbalance ratio, the decision function favours the class with the larger number of samples (the majority class).
      Solution for Imbalanced data : SAMPLING METHOD.
  • First try training on the true distribution ! : If a model trained on the original data distribution works well and generalizes, there is no need to use sampling techniques.
  • A model trained on imbalanced data dominated by the majority class can display very high training accuracy (simply by predicting the majority class).
    • For imbalanced data, proper measures give better insight: Balanced Accuracy, Precision-Recall curves, AUC, F1-score (rather than plain accuracy)
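As a sketch of why plain accuracy misleads here (pure-Python toy example; the 99:1 split and the always-majority classifier are illustrative assumptions, not from the source):

```python
# Toy labels: 990 negatives, 10 positives (99:1 imbalance, illustrative).
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall, then balanced accuracy = mean of per-class recalls.
recall_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 10
recall_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / 990
balanced_accuracy = (recall_pos + recall_neg) / 2

print(accuracy)           # 0.99 -- looks great
print(balanced_accuracy)  # 0.5  -- no better than chance on the minority class
```

The 99% accuracy says nothing about fraud detection ability; balanced accuracy exposes that the minority class is never predicted.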

Downsampling & Upweighting

  • Downsampling : Training on a disproportionately low subset of the majority class examples.
  • Upweighting : adding an example weight to each downsampled example, equal to the factor by which you downsampled.

Effects of Downsampling & Upweighting : Faster Convergence, Saving on Disk Space, Calibration

  1. Faster convergence : during training, the model sees the minority class more often than before, which helps it converge faster.
  2. Saving on disk space : consolidating the majority class into fewer examples with larger weights ⇒ the savings free disk space for the minority class (so we can collect a greater number and a wider range of examples from that class).
  3. Model calibration : upweighting ensures the model is still calibrated ⇒ its outputs can still be interpreted as probabilities.
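A minimal arithmetic sketch of the calibration point (the 1:200 ratio and the factor of 20 come from the fraud example below; the variable names are ours): downsampling alone inflates the positive rate the model sees, and upweighting restores the original base rate in the loss.

```python
positives = 1
negatives = 200
factor = 20  # downsampling factor applied to the majority (negative) class

# After downsampling: keep 1 in `factor` negatives.
negatives_kept = negatives // factor                       # 10
seen_pos_rate = positives / (positives + negatives_kept)   # ~9.1%: what the model sees

# Upweighting: each kept negative counts `factor` times in the loss,
# so the *effective* base rate matches the original distribution.
effective_neg = negatives_kept * factor                    # 200
effective_pos_rate = positives / (positives + effective_neg)
print(effective_pos_rate)  # ~0.005 -- same base rate as the raw data
```

Without the weights, a well-fit model would predict fraud probabilities near 9%, not the true 0.5%; with them, the probabilities stay interpretable.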

Process of Downsampling & Upweighting

A conceptual diagram of downsampling and upweighting
1. Downsampling : pulls a randomly selected example from a block representing the dataset of the majority class.
2. Upweighting : adds a weight to each randomly selected example.
  1. Downsample the majority class

    Fraud data Downsampling by a factor of 20
    From 0.5% positives (1 positive to 200 negatives) to 10% positives (1 positive to 10 negatives)
    • The 10% proportion of positives to negatives is still moderately imbalanced, but much better than the original, extremely imbalanced 0.5%.
  2. Upweight the downsampled class : add example weights to the downsampled class equal to the downsampling factor of 20

    • {example weight} = {original example weight} × {downsampling factor}
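The two steps above can be sketched in pure Python (the dataset shape, feature dict, and 10-positive/2000-negative counts are illustrative assumptions matching the 1:200 ratio):

```python
import random

random.seed(0)  # reproducible sketch

# Toy fraud dataset as (features, label, weight): 10 positives, 2000 negatives (1:200).
data = [({"x": i}, 1, 1.0) for i in range(10)] + \
       [({"x": i}, 0, 1.0) for i in range(2000)]

FACTOR = 20  # downsampling factor from the example above

# Step 1: downsample the majority (negative) class by FACTOR.
positives = [ex for ex in data if ex[1] == 1]
negatives = [ex for ex in data if ex[1] == 0]
negatives_down = random.sample(negatives, len(negatives) // FACTOR)

# Step 2: upweight the downsampled class:
# {example weight} = {original example weight} x {downsampling factor}
negatives_down = [(feats, label, w * FACTOR) for feats, label, w in negatives_down]

train = positives + negatives_down
pos_rate = len(positives) / len(train)
print(f"training positive rate: {pos_rate:.1%}")  # ~9.1%, vs ~0.5% originally
```

A training framework that accepts per-example weights (most do) then sees the 10% mix for faster convergence while the weighted loss still reflects the true 0.5% base rate.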