Imbalanced Data
A classification data set with skewed class proportions ⇒ majority classes & minority classes
- If the training data is concentrated in certain classes, the model spends most of its training on those classes while the minority classes are not learned sufficiently (e.g., a fraud detection model).
- The balancing issue corresponds to the difference in the number of samples across classes. With a greater imbalance ratio, the decision function favours the class with the larger number of samples (the majority class).
⇒ Solution for Imbalanced data : SAMPLING METHOD.
- First try training on the true distribution! : If a model trained on the original data distribution works well and generalizes well, there is no need to use sampling techniques.
- A model trained on imbalanced data can display very high training accuracy simply by favouring the majority class.
- For imbalanced data, proper measures give better insight : balanced accuracy, precision-recall curves, AUC, F1-score (rather than plain accuracy)
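The gap between plain accuracy and these measures is easy to see in code. Below is a minimal sketch (scikit-learn assumed; the synthetic ~0.5%-positive dataset and every parameter value are illustrative, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic fraud-like data : ~0.5% positives (illustrative numbers)
X, y = make_classification(n_samples=50_000, weights=[0.995, 0.005],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]

print("accuracy          :", accuracy_score(y_te, pred))  # inflated by the majority class
print("balanced accuracy :", balanced_accuracy_score(y_te, pred))
print("F1-score          :", f1_score(y_te, pred))
print("ROC AUC           :", roc_auc_score(y_te, proba))
```

Accuracy alone can sit above 99% here even when the model catches few positives; the other measures expose that gap.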
Downsampling & Upweighting
- Downsampling : Training on a disproportionately low subset of the majority class examples.
- Upweighting : adding an example weight to the downsampled class equal to the factor by which you downsampled.
Effects of Downsampling ⇒ Upweighting : Faster Convergence, Saving on Disk Space, Calibration
- Faster convergence : during training the model sees the minority class more often than before, which helps it converge faster.
- Saving on disk space : consolidating the majority class into fewer examples with larger weights frees disk space for the minority class (so we can collect a greater number and a wider range of examples from that class).
- Model Calibration: Upweighting ensures our model is still calibrated ⇒ the outputs can still be interpreted as probabilities.
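The calibration point can be sanity-checked from the base rates alone. A small sketch in Python, using the fraud numbers from the example below (1 positive per 200 negatives, downsampling factor 20; all numbers illustrative):

```python
# Base-rate check for the calibration claim
pos, neg = 1, 200          # 1 positive to 200 negatives (~0.5% positives)
factor = 20                # downsampling factor

neg_down = neg / factor                     # 10 negatives remain after downsampling
raw_rate = pos / (pos + neg_down)           # ~9.1% : distorted base rate
weighted = pos / (pos + neg_down * factor)  # upweighting restores ~0.5%

print(f"positive rate after downsampling : {raw_rate:.3%}")
print(f"weighted positive rate           : {weighted:.3%}")
```

Downsampling alone inflates the apparent positive rate from about 0.5% to about 9%; counting each remaining negative 20 times in the loss restores the original base rate, which is why the model's outputs can still be read as probabilities.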
Process of Downsampling & Upweighting
A conceptual diagram of downsampling and upweighting
1. Downsampling : pulls randomly selected examples from a block representing the dataset of the majority class.
2. Upweighting : adds a weight to each randomly selected example.
Downsample the majority class
Fraud data example : downsampling by a factor of 20
- Original proportion of positives to negatives : 0.5%, with 1 positive to 200 negatives.
- After downsampling : 10%, with 1 positive to 10 negatives.
- A proportion of 10% is still moderately imbalanced, but it is much better than the original extremely imbalanced proportion of 0.5%.
Upweight the downsampled class : add example weights to the downsampled class by a factor of 20 (the downsampling factor).
{example weight} = {original example weight} × {downsampling factor}
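Putting both steps together, a minimal sketch (pandas and scikit-learn assumed; the synthetic DataFrame, the single feature `x`, and all numbers are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 20_000  # 1 positive to 200 negatives (0.5%)
df = pd.DataFrame({
    "x": np.r_[rng.normal(1, 1, n_pos), rng.normal(0, 1, n_neg)],
    "label": np.r_[np.ones(n_pos), np.zeros(n_neg)].astype(int),
})

FACTOR = 20

# 1. Downsampling : keep a random 1/20 subset of the majority (negative) class
pos = df[df["label"] == 1]
neg = df[df["label"] == 0].sample(frac=1 / FACTOR, random_state=0)
train = pd.concat([pos, neg])

# 2. Upweighting : example weight = original example weight (1) × downsampling factor
weights = train["label"].map({1: 1.0, 0: float(FACTOR)})

model = LogisticRegression()
model.fit(train[["x"]], train["label"], sample_weight=weights)

# With the weights, the mean predicted probability stays near the true 0.5% base rate
print(model.predict_proba(df[["x"]])[:, 1].mean())
```

Any estimator that accepts `sample_weight` (or an equivalent per-example weight in its loss) can stand in for `LogisticRegression` here.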