EXAMTOPIC DUMPS Q53-Q56 ; Data distribution considerations, Managed service for ML model, Test for Production-Readiness, Evaluation Metric for Imbalanced data & Anomaly Detection (The answers given here are based on my own study notes and the discussion threads, and may differ from the answers suggested on the official site.)
Q 53.
Your team is working on an NLP research project to predict political affiliation of authors based on articles they have written. You have a large training dataset that is structured like this:
You followed the standard 80%-10%-10% data distribution across the training, testing, and evaluation subsets. How should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?
Data distribution considerations
- A. Distribute texts randomly across the train-test-eval subsets:
Train set: [TextA1, TextB2, ...] Test set: [TextA2, TextC1, TextD2, ...] Eval set: [TextB1, TextC2, TextD1, ...]
- ⭕ B. Distribute authors randomly across the train-test-eval subsets:
(*) Train set: [TextA1, TextA2, TextD1, TextD2, ...] Test set: [TextB1, TextB2, ...] Eval set: [TextC1, TextC2, ...]
→ GIVEN THAT THE PREDICTION TARGET IS THE POLITICAL AFFILIATION OF AUTHORS, DISTRIBUTING AUTHORS RANDOMLY ACROSS THE TRAIN-TEST-EVAL SUBSETS MAKES MORE SENSE: NO AUTHOR'S WRITING STYLE LEAKS FROM THE TRAIN SET INTO THE TEST OR EVAL SET.
- C. Distribute sentences randomly across the train-test-eval subsets:
Train set: [SentenceA11, SentenceA21, SentenceB11, SentenceB21, SentenceC11, SentenceD21 ...] Test set: [SentenceA12, SentenceA22, SentenceB12, SentenceC22, SentenceC12, SentenceD22 ...] Eval set: [SentenceA13, SentenceA23, SentenceB13, SentenceC23, SentenceC13, SentenceD31 ...]
- D. Distribute paragraphs of texts (i.e., chunks of consecutive sentences) across the train-test-eval subsets:
Train set: [SentenceA11, SentenceA12, SentenceD11, SentenceD12 ...] Test set: [SentenceA13, SentenceB13, SentenceB21, SentenceD23, SentenceC12, SentenceD13 ...] Eval set: [SentenceA11, SentenceA22, SentenceB13, SentenceD22, SentenceC23, SentenceD11 ...]
"The moral: carefully consider how you split examples. Know what the data represents."
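The author-level split from option B can be sketched in a few lines of Python. This is a minimal illustration, not the exam's reference solution: the function name `split_by_author`, the `(author, text)` tuple layout, and the toy data are all assumptions made for the example. Note that the 80-10-10 proportion is applied to authors, so the proportion of texts is only approximate when authors have different numbers of texts.

```python
import random
from collections import defaultdict


def split_by_author(examples, seed=0, train_frac=0.8, test_frac=0.1):
    """Assign whole authors (not individual texts) to train/test/eval."""
    # Group every text under its author.
    by_author = defaultdict(list)
    for author, text in examples:
        by_author[author].append(text)

    # Shuffle the authors, not the texts, so one author never spans two subsets.
    authors = sorted(by_author)
    random.Random(seed).shuffle(authors)

    n_train = int(len(authors) * train_frac)
    n_test = int(len(authors) * test_frac)
    groups = {
        "train": authors[:n_train],
        "test": authors[n_train:n_train + n_test],
        "eval": authors[n_train + n_test:],
    }
    return {
        name: [(a, t) for a in members for t in by_author[a]]
        for name, members in groups.items()
    }


# Hypothetical toy data: 10 authors with 2 texts each.
examples = [(f"Author{i}", f"Text{i}_{j}") for i in range(10) for j in (1, 2)]
splits = split_by_author(examples)
```

With this toy data, 8 authors land in train and 1 each in test and eval, and both texts of any given author always stay together.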
Determine flaws in real-world ML experimental design
Q. Real World Example: 18th Century Literature
A professor of 18th Century Literature wanted to predict the political affiliation of authors based only on the "mind metaphors" each author used. A team of researchers built a big labeled data set from many authors' works, sentence by sentence, and split it into train/validation/test sets. The trained model did nearly perfectly on the test data, but the researchers felt the results were suspiciously accurate. What might have gone wrong?
- ⭕ Data Split A : Researchers put some of each author's examples in the training set, some in the validation set, and some in the test set.
→ The model was able to learn specific qualities of Richardson's use of language beyond just the metaphors he used, so in a sense it got to memorize a little extra about him before being applied at test time.
- Data Split B : Researchers put all of each author's examples in a single set. All of Richardson's examples might be in the training set, while all of Swift's examples might be in the validation set.
→ It was much more difficult to get good accuracy on the test data, which showed that it is much harder to predict political affiliation based only on the metaphor data.
Poor performance
- Review your schema.
Make sure all your columns have the correct type, and that you excluded from training any columns that were not predictive, such as ID columns.
- Review your data.
Missing values in non-nullable columns cause that row to be ignored. Make sure your data does not have too many errors.
- Export the test dataset and examine it.
By inspecting the data and analyzing when the model is making incorrect predictions, you might determine that you need more training data for a particular outcome, or that your training data introduced leakage.
- Increase the amount of training data.
If you don't have enough training data, model quality suffers. Make sure your training data is as unbiased as possible.
- Increase the training time.
If you had a short training time, you might get a higher-quality model by allowing it to train for a longer period of time.
Perfect performance
If your model returned near-perfect evaluation metrics, something might be wrong with your training data.
- (1) Target leakage
Target leakage happens when the training data includes a feature that cannot be known at prediction time and that is based on the outcome.
For example, if you included a Frequent Buyer number for a model trained to decide whether a first-time user would make a purchase, that model would have very high evaluation metrics but would perform poorly on real data, because the Frequent Buyer number could not be included.
To check for target leakage, review the Feature importance graph on the Evaluate tab for your model. Make sure the columns with high importance are truly predictive and are not leaking information about the target.
- (2) Time column : Data Split Issue
If the time of your data matters, make sure you used a Time column or a manual split based on time. Not doing so can skew your evaluation metrics.
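A manual split based on time can be sketched as follows. This is only an illustration of the idea, assuming rows are dictionaries with a sortable timestamp field; the function name `chronological_split`, the `day` field, and the toy rows are hypothetical and are not part of AutoML Tables.

```python
def chronological_split(rows, time_key, train_frac=0.8, valid_frac=0.1):
    """Split rows so that every training row is older than every validation row,
    and every validation row is older than every test row."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_train = int(len(ordered) * train_frac)
    n_valid = int(len(ordered) * valid_frac)
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]
    return train, valid, test


# Hypothetical toy rows, deliberately out of order.
rows = [{"day": d, "label": d % 2} for d in (7, 3, 9, 1, 5, 2, 8, 10, 4, 6)]
train, valid, test = chronological_split(rows, "day")
```

Because the newest rows are held out, the evaluation resembles how the model will be used: predicting on data that arrives after training.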
Q 54.
Your team has been tasked with creating an ML solution in Google Cloud to classify support requests for one of your platforms. You analyzed the requirements and decided to use TensorFlow to build the classifier so that you have full control of the model's code, serving, and deployment. You will use Kubeflow pipelines for the ML platform. To save time, you want to build on existing resources and use managed services instead of building a completely new model. How should you build the classifier?
Managed service for ML model
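- ❌ A. Use the Natural Language API to classify support requests. → Managed, but a pre-trained API with fixed categories; NOT a TensorFlow model you control
- ❌ B. Use AutoML Natural Language to build the support requests classifier. → Managed, but gives NO control over the model's code, serving, and deployment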
- ⭕ C. Use an established text classification model on AI Platform to perform transfer learning.
→ TF MODEL USING EXISTING RESOURCES, MANAGED SERVICE
- ❌ D. Use an established text classification model on AI Platform as-is to classify support requests.
→ Cannot work as-is, because the classes to predict will likely not be the same.
Q 55.
You recently joined a machine learning team that will soon release a new project. As a lead on the project, you are asked to determine the production readiness of the ML components. The team has already tested features and data, model development, and infrastructure. Which additional readiness check should you recommend to the team?
Test for Production-Readiness
- ❌ A. Ensure that training is reproducible.
→ INFRA
- ❌ B. Ensure that all hyperparameters are tuned.
→ MODEL
- ⭕ C. Ensure that model performance is monitored.
- ❌ D. Ensure that feature expectations are captured in the schema.
→ DATA
TESTING DATA/MODEL/INFRA & MONITORING are key considerations for ensuring the production-readiness of an ML system
DATA TESTS
1 - Feature expectations are captured in a schema.
2 - All features are beneficial.
3 - No feature’s cost is too much.
4 - Features adhere to meta-level requirements.
5 - The data pipeline has appropriate privacy controls.
6 - New features can be added quickly.
7 - All input feature code is tested.
MODEL TESTS
1 - Model specs are reviewed and submitted.
2 - Offline and online metrics correlate.
3 - All hyperparameters have been tuned.
4 - The impact of model staleness is known.
5 - A simpler model is not better.
6 - Model quality is sufficient on important data slices.
7 - The model is tested for considerations of inclusion.
ML INFRASTRUCTURE TESTS
1 - Training is reproducible.
2 - Model specs are unit tested.
3 - The ML pipeline is integration tested.
4 - Model quality is validated before serving.
5 - The model is debuggable.
6 - Models are canaried before serving.
7 - Serving models can be rolled back.
MONITORING TESTS
1 - Dependency changes result in notification.
2 - Data invariants hold for inputs.
3 - Training and serving are not skewed.
4 - Models are not too stale.
5 - Models are numerically stable.
6 - Computing performance has not regressed.
7 - Prediction quality has not regressed.
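As one illustration of a monitoring test, the last item ("Prediction quality has not regressed") could be checked with something like the sketch below. The function names, the choice of accuracy as the metric, the 0.02 tolerance, and the toy labels are all assumptions made for this example; a real system would pick the metric and threshold that match its own requirements.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def quality_regressed(baseline, current, tolerance=0.02):
    """Return True when the current metric is more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance


# Hypothetical labeled sample scored by last week's model and this week's model.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
last_week = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 9 of 10 correct
this_week = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]  # 6 of 10 correct

alert = quality_regressed(accuracy(labels, last_week), accuracy(labels, this_week))
```

Here `alert` is true because accuracy drops from 0.9 to 0.6, which is well beyond the tolerance.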
Q 56.
You work for a credit card company and have been asked to create a custom fraud detection model based on historical data using AutoML Tables. You need to prioritize detection of fraudulent transactions while minimizing false positives. Which optimization objective should you use when training the model?
Evaluation Metric for Imbalanced data & Anomaly Detection
- A. An optimization objective that minimizes Log loss
→ LOG LOSS FOR MULTICLASS-CLASSIFICATION
- B. An optimization objective that maximizes the Precision at a Recall value of 0.50
- ⭕ C. An optimization objective that maximizes the area under the precision-recall curve (AUC PR) value
→ Optimize results for predictions for the less common class.
- D. An optimization objective that maximizes the area under the receiver operating characteristic curve (AUC ROC) value
AUC PR optimizes for the less common class (here, the fraudulent transactions).
A higher threshold decreases FP, at the expense of more FN. A lower threshold decreases FN at the expense of more FP.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F1 = 2*(Precision*Recall)/(Precision+Recall)
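The formulas and the threshold trade-off above can be checked with a small sketch. The function names and the toy labels and scores are made up for illustration (3 positives out of 10 examples), and a prediction counts as positive when its score is at or above the threshold.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP and FN when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp, fp, fn


def precision_recall_f1(labels, scores, threshold):
    """Apply the Precision, Recall and F1 formulas, returning 0.0 on empty denominators."""
    tp, fp, fn = confusion_counts(labels, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced toy data: 3 positives, 7 negatives.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

for threshold in (0.25, 0.5, 0.85):
    print(threshold, confusion_counts(labels, scores, threshold),
          precision_recall_f1(labels, scores, threshold))
```

On this data, raising the threshold from 0.25 to 0.85 takes FP from 2 down to 0 while FN rises from 0 to 2, which is the trade-off described above.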
Model optimization objectives
Classification
CASE | Optimization obj
Distinguish between classes. Default value for binary classification. | AUC ROC
Keep prediction probabilities as accurate as possible. Only supported objective for multi-class classification. | Log loss
Optimize results for predictions for the less common class. | AUC PR
Optimize precision at a specific recall value. | Precision at Recall
Optimize recall at a specific precision value. | Recall at Precision
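To see why AUC PR is the objective suggested for the less common class, the sketch below compares the two areas on a deliberately imbalanced toy set. It is not how AutoML Tables computes its metrics: AUC ROC is computed here as the fraction of positive/negative pairs ranked correctly, and AUC PR is approximated by average precision, which is one common estimate of the area under the precision-recall curve. The function names and data are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical data: 2 frauds among 100 transactions.
# Five legitimate transactions are scored above both frauds.
labels = [0] * 5 + [1, 1] + [0] * 93
scores = [0.95] * 5 + [0.9, 0.8] + [0.1] * 93

print(round(roc_auc(labels, scores), 3))            # about 0.949
print(round(average_precision(labels, scores), 3))  # about 0.226
```

The five false alarms barely move AUC ROC because they are a tiny share of the 98 negatives, but they dominate the top of the ranking, so the precision-based metric drops sharply.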
Regression
CASE | Optimization obj
Capture more extreme values accurately. | RMSE
View extreme values as outliers with less impact on model. | MAE
Penalize error on relative size rather than absolute value. Especially helpful when both predicted and actual values can be quite large. | RMSLE
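The differences between the three regression objectives can be illustrated with their standard definitions. This is a minimal sketch with made-up numbers; the function names are chosen for the example, and RMSLE is computed here with `log(1 + x)`, a common convention that is assumed rather than taken from the source.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root mean squared logarithmic error: compares values on a relative scale."""
    return math.sqrt(
        sum((math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


actual = [10, 10, 10, 10]
small_errors = [12, 8, 12, 8]   # every prediction is off by 2
one_outlier = [10, 10, 10, 18]  # a single prediction is off by 8

print(mae(actual, small_errors), mae(actual, one_outlier))    # 2.0 2.0
print(rmse(actual, small_errors), rmse(actual, one_outlier))  # 2.0 4.0
```

MAE cannot tell the two prediction sets apart, while RMSE doubles for the set with the single extreme error. RMSLE, in turn, gives nearly the same value for predicting 110 against 100 as for 1100 against 1000, because both are roughly 10% off.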