[PMLE CERTIFICATE - EXAMTOPIC] DUMPS Q45-Q48

This post summarizes ExamTopics dumps questions Q45-Q48 and related material. (The answers are based on my own study and the discussion threads, so they may differ from the officially suggested answers.)

Q 48.

You started working on a classification problem with time series data and achieved an area under the receiver operating characteristic curve (AUC ROC) value of 99% for training data after just a few experiments. You haven't explored using any sophisticated algorithms or spent any time on hyperparameter tuning. What should your next step be to identify and fix the problem?

Time-Series data split
  • Classification on time-series data; 99% AUC ROC on training data after only a few experiments is suspiciously high. The question asks for the NEXT step, before trying sophisticated algorithms or hyperparameter tuning.
  • ❌ A. Address the model overfitting by using a less complex algorithm.
  • ✅ B. Address data leakage by applying nested cross-validation during model training.
  • ❌ C. Address data leakage by removing features highly correlated with the target value.
  • ❌ D. Address the model overfitting by tuning the hyperparameters to reduce the AUC ROC value.
  • Overfitting is usually detected by a large gap between training and validation error; here only the training score is given, so data leakage from an improper time-series split is the more likely cause.
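The point behind option B is that time-series data must be split chronologically, so that validation folds always lie in the future relative to their training folds. A minimal sketch of such an expanding-window split in plain Python (the index counts and fold sizes are illustrative, not from the exam):

```python
# Minimal sketch: expanding-window splits for time-series cross-validation.
# Each fold trains only on the past and validates on the future, so no
# look-ahead leakage is possible. Sample counts below are illustrative.

def time_series_split(n_samples, n_folds=3):
    """Yield (train_indices, val_indices) pairs in chronological order."""
    fold_size = n_samples // (n_folds + 1)
    splits = []
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * fold_size))                    # all past samples
        val_idx = list(range(k * fold_size, (k + 1) * fold_size))    # next time block
        splits.append((train_idx, val_idx))
    return splits

for train_idx, val_idx in time_series_split(12, n_folds=3):
    # Every training index precedes every validation index.
    assert max(train_idx) < min(val_idx)
    print(f"train={train_idx} val={val_idx}")
```

In practice you would use scikit-learn's `TimeSeriesSplit` (optionally inside nested cross-validation for hyperparameter search), which implements the same expanding-window idea.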

Data Leakage & Training-Serving Skew

To prevent data leakage & training-serving skew:
  • Before using any data, make sure you know what the data means and whether or not you should use it as a feature
  • Check the correlation in the Train tab. High correlations should be flagged for review.
  • Training-serving skew: make sure you only provide input features to the model that are available in the exact same form at serving time.
  • Data Leakage : When you use input features during training that "leak" information about the target that you are trying to predict which is unavailable when the model is actually served.
    • This can be detected when a feature that is highly correlated with the target column is included as one of the input features.
    • EX ) Model to predict whether a customer will sign up for a subscription in the next month and one of the input features is a future subscription payment from that customer. This can lead to strong model performance during testing, but not when deployed in production, since future subscription payment information isn't available at serving time.
  • Training-serving skew : When input features used during training time are different from the ones provided to the model at serving time, causing poor model quality in production.
    • EX 1) Building a model to predict hourly temperatures but training with data that only contains weekly temperatures.
    • EX 2) Always providing a student's grades in the training data when predicting student dropout, but not providing this information at serving time.
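The "feature highly correlated with the target" red flag described above can be checked programmatically before training. A minimal sketch using Pearson correlation on synthetic data, echoing the subscription-payment example (the feature names, values, and the 0.95 threshold are illustrative assumptions):

```python
# Minimal sketch: flag candidate leaky features by their correlation with the
# target. Data and the 0.95 threshold are illustrative, not official guidance.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Synthetic target: did the customer sign up next month?
target = [0, 1, 0, 1, 1, 0, 1, 0]

features = {
    "account_age_days": [30, 400, 120, 10, 500, 45, 90, 300],  # legitimate feature
    "future_payment":   [0, 1, 0, 1, 1, 0, 1, 0],  # mirrors the target -> leakage
}

for name, values in features.items():
    r = pearson(values, target)
    flag = "POSSIBLE LEAK" if abs(r) > 0.95 else "ok"
    print(f"{name}: r={r:+.2f} {flag}")
```

A feature like `future_payment` correlates almost perfectly with the target because it is not available at serving time; this is exactly the kind of column the Vertex AI Train tab's correlation warning asks you to review and, usually, drop.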