[PMLE CERTIFICATE - EXAMTOPIC] Data/Target leakage - Training data includes predictive information that is not available when you ask for a prediction
Data/Target leakage
- Training data includes predictive information that is not available when you ask for a prediction, that is, information from the future that cannot be known at prediction time.
- It can cause your model to show excellent evaluation metrics but perform poorly on real data.
Data/Target leakage - Time-Series Prediction model problems
Tabular data preparation best practices > Avoid target leakage
For example, suppose you want to know how much ice cream your store will sell tomorrow. You cannot include the target day's temperature in your training data, because you will not know the temperature (it hasn't happened yet). However, you could use the predicted temperature from the previous day, which could be included in the prediction request.
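The ice cream example can be sketched as follows: each training row pairs a day's sales with the temperature forecast issued the day before, never with the observed temperature of the target day. This is only a minimal illustration in plain Python; `build_training_rows`, the dictionary layouts, and the sample numbers are assumptions made up for this note, not part of the Google documentation.

```python
from datetime import date, timedelta

def build_training_rows(sales, forecasts):
    """Pair each day's sales with the forecast issued one day earlier.

    sales:     {target_day: units_sold}
    forecasts: {(issued_on, target_day): predicted_temperature}
    Both layouts are hypothetical and exist only for this sketch.
    """
    rows = []
    for target_day, units in sorted(sales.items()):
        issued_on = target_day - timedelta(days=1)
        predicted = forecasts.get((issued_on, target_day))
        if predicted is None:
            # No forecast existed at prediction time, so the row is skipped
            # rather than filled with the observed (future) temperature.
            continue
        rows.append(
            {"day": target_day, "forecast_temp": predicted, "units_sold": units}
        )
    return rows

# Made-up sample data.
sales = {date(2021, 7, 2): 120, date(2021, 7, 3): 95, date(2021, 7, 4): 140}
forecasts = {
    (date(2021, 7, 1), date(2021, 7, 2)): 31.0,
    (date(2021, 7, 2), date(2021, 7, 3)): 27.5,
}
rows = build_training_rows(sales, forecasts)
```

The same feature (the previous day's forecast) can then be sent in the prediction request, so training and serving see the same kind of information.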
- How to quickly solve machine learning forecasting problems using Pandas and BigQuery | Google Cloud Blog
- In time-series problems, it’s important to split them temporally so that you are not leaking future information that would not be available at test time into the trained model.
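A temporal split can be sketched in a few lines. The version below uses plain Python lists of dictionaries so that it runs without dependencies; the same comparison against a cutoff applies to a pandas DataFrame or a BigQuery `WHERE` clause. The function name `temporal_split`, the record layout, and the cutoff date are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def temporal_split(records, cutoff):
    """Put rows strictly before `cutoff` in train and the rest in test."""
    train = [r for r in records if r["timestamp"] < cutoff]
    test = [r for r in records if r["timestamp"] >= cutoff]
    return train, test

# Ten days of made-up hourly readings.
start = datetime(2021, 1, 1)
records = [
    {"timestamp": start + timedelta(hours=h), "temp": 10.0 + (h % 24)}
    for h in range(24 * 10)
]

# Train on the first eight days, test on the last two.
cutoff = datetime(2021, 1, 9)
train, test = temporal_split(records, cutoff)
```

Every test row is later than every training row, so nothing the model sees during training comes from the period it is evaluated on.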
Data leakage - other prediction models
Suppose you’ve trained a model to predict the “probability a patient has cancer” from medical records and that you've selected patient age, gender, prior medical conditions, hospital name, vital signs, and test results as features. Your model had excellent performance on held-out test data but performed terribly on new patients.
- The model was trained on a feature that wasn’t legitimately available at decision time ⇒ when the model was deployed into production, the distribution of this feature changed and it was no longer a reliable predictor.
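The leaked feature is hospital name. Because some hospitals specialize in particular diseases such as cancer, the model learned that this feature is important. At decision time (prediction time), however, hospital name cannot be used as a model feature, because it is not yet known which hospital the patient will be assigned to. Using (future) information that is not available as a feature at prediction time is called "data leakage".
- (+) The model was capable of handling this thanks to the out-of-vocabulary (OOV) buckets in its representation of words: it mapped such feature values (empty strings) to OOV buckets and did not raise an error.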
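To make the OOV remark concrete, here is a rough sketch of how a categorical lookup with out-of-vocabulary buckets can accept an unseen value such as an empty string without failing. It is not the implementation of any particular framework; `feature_index`, the vocabulary, and the bucket count are hypothetical.

```python
import zlib

def feature_index(value, vocabulary, num_oov_buckets=2):
    """Map a categorical value to an integer id.

    Known values use their vocabulary id. Unknown values (including the
    empty string) are hashed into one of `num_oov_buckets` extra ids
    placed after the vocabulary, so the lookup never raises.
    """
    if value in vocabulary:
        return vocabulary[value]
    # crc32 is used because it is stable across Python processes.
    bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return len(vocabulary) + bucket

# Made-up vocabulary learned from the training data.
vocabulary = {"Hospital A": 0, "Hospital B": 1}

known_id = feature_index("Hospital A", vocabulary)
missing_id = feature_index("", vocabulary)  # hospital not yet assigned
```

This is why the leak stays silent in production: the missing hospital name still yields a valid input, but the signal the model relied on during training is gone.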
EXAMTOPIC Q 36.
You are building a model to predict daily temperatures. You split the data randomly and then transformed the training and test datasets. Temperature data for model training is uploaded hourly. During testing, your model performed with 97% accuracy; however, after deploying to production, the model's accuracy dropped to 66%. How can you make your production model more accurate?
- ❌ A. Normalize the data for the training, and test datasets as two separate steps.
- ⭕ B. Split the training and test data based on time rather than a random split to avoid leakage.
- ❌ C. Add more data to your test set to ensure that you have a fair distribution and sample for testing.
- ❌ D. Apply data transformations before splitting, and cross-validate to make sure that the transformations are applied to both the training and test sets.
Because the time-series prediction model's test accuracy is very high at 97% and differs greatly from its production accuracy, a data leakage problem can be suspected.
- the prediction model for daily temperatures: testing accuracy 97%, production accuracy 66%
- training data is uploaded hourly
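The reasoning behind answer B can be illustrated with a small check. With hourly readings, a random split puts hours from the same day on both sides of the split, so the test set contains near-duplicates of training rows, while a time-based split does not. The sketch below uses made-up timestamps, a fixed seed, and an 80/20 ratio, all of which are assumptions for illustration only.

```python
import random
from datetime import datetime, timedelta

def shared_days(train, test):
    """Return the calendar days that appear in both train and test."""
    return {t.date() for t in train} & {t.date() for t in test}

# Thirty days of hourly timestamps.
start = datetime(2021, 1, 1)
hours = [start + timedelta(hours=h) for h in range(24 * 30)]

# Random 80/20 split, as described in the question.
rng = random.Random(0)
shuffled = hours[:]
rng.shuffle(shuffled)
split = int(len(shuffled) * 0.8)
random_train, random_test = shuffled[:split], shuffled[split:]

# Time-based 80/20 split: the first 24 days train, the last 6 days test.
cutoff = start + timedelta(days=24)
time_train = [t for t in hours if t < cutoff]
time_test = [t for t in hours if t >= cutoff]

random_overlap = shared_days(random_train, random_test)
time_overlap = shared_days(time_train, time_test)
print(len(random_overlap), "days shared by the random split")
print(len(time_overlap), "days shared by the time-based split")
```

The time-based split shares no days between train and test, which is closer to how the model is used in production, where it only ever predicts days it has not seen.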