Q 33.
You have a demand forecasting pipeline in production that uses Dataflow to preprocess raw data prior to model training and prediction. During preprocessing, you employ Z-score normalization on data stored in BigQuery and write it back to BigQuery. New training data is added every week. You want to make the process more efficient by minimizing computation time and manual intervention. What should you do?
MINIMIZE computation time & manual intervention for data normalization in BigQuery
- ❌ A. Normalize the data using Google Kubernetes Engine.
- ⭕ B. Translate the normalization algorithm into SQL for use with BigQuery. (see the SQL sketch after this list)
- ❌ C. Use the normalizer_fn argument in TensorFlow's Feature Column API.
- ❌ D. Normalize the data with Apache Spark using the Dataproc connector for BigQuery.
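Option B in practice: a minimal sketch (not from the original post) of pushing the Z-score computation into BigQuery as a single SQL statement, run from Python with the google.cloud.bigquery client. Project, dataset, table, and column names are placeholders.

```python
# Minimal sketch: Z-score normalization computed entirely inside BigQuery.
# `my_project.demand.*` and `sales_qty` are placeholder names.
from google.cloud import bigquery

client = bigquery.Client()

query = """
CREATE OR REPLACE TABLE `my_project.demand.features_normalized` AS
SELECT
  *,
  -- z-score: (x - mean) / stddev, computed over the full column
  SAFE_DIVIDE(sales_qty - AVG(sales_qty) OVER (),
              STDDEV_POP(sales_qty) OVER ()) AS sales_qty_z
FROM `my_project.demand.features_raw`
"""

client.query(query).result()  # runs inside BigQuery; no Dataflow job needed
```

A BigQuery scheduled query can rerun this statement when the weekly training data lands, which removes both the Dataflow preprocessing step and the manual intervention.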
Q 34.
You need to design a customized deep neural network in Keras that will predict customer purchases based on their purchase history. You want to explore model performance using multiple model architectures, store training data, and be able to compare the evaluation metrics in the same dashboard. What should you do?
Experiment on the model performance of multiple Keras DNN model architectures in the same dashboard.
- ❌ A. Create multiple models using AutoML Tables.
- ❌ B. Automate multiple training runs using Cloud Composer.
- ❌ C. Run multiple training jobs on AI Platform with similar job names.
- ⭕ D. Create an experiment in Kubeflow Pipelines to organize multiple runs. (see the sketch after the notes below)
Kubeflow Pipelines/AI Platform Pipelines
Kubeflow : End-to-end orchestration of machine learning pipelines.
- Allows for easy experimentation and reusability.
- Built on top of Kubernetes ⇒ scaling and portability.
- ❌ To use Kubeflow on GCP, additional work is required to set up and manage the Kubernetes cluster.
→ Google AI Platform Pipelines takes care of setting up a Google Kubernetes Engine cluster and a bucket, and of installing Kubeflow Pipelines.
- Visualize Results in the Pipelines UI | Kubeflow: a user interface (UI) for managing and tracking experiments, jobs, and runs; an end-to-end open-source platform; built-in notebook server service.
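As a rough illustration of option D, the sketch below uses the Kubeflow Pipelines SDK (kfp) to group several architecture runs under one experiment so their metrics appear side by side in the same Pipelines UI dashboard. The host URL, pipeline package, and parameter names are assumed placeholders, not taken from the question.

```python
# Minimal sketch: one KFP experiment that groups runs for several DNN architectures.
import kfp

client = kfp.Client(host="https://<your-pipelines-endpoint>")  # placeholder host

# One experiment = one dashboard grouping for all comparison runs.
experiment = client.create_experiment(name="keras-dnn-architectures")

# Launch one run per candidate architecture; metrics logged by the pipeline
# show up together under this experiment in the Pipelines UI.
for layers in ([64, 32], [128, 64, 32], [256, 128]):
    client.run_pipeline(
        experiment_id=experiment.id,
        job_name=f"dnn-{'-'.join(map(str, layers))}",
        pipeline_package_path="dnn_pipeline.yaml",        # placeholder compiled pipeline
        params={"hidden_units": ",".join(map(str, layers))},
    )
```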
Kubeflow Metadata
- Kubeflow pipeline running on premises; you need to record logs and data about deployed models for audit reasons: use Kubeflow Metadata.
- Allows tracking and managing metadata of machine learning workflows in Kubeflow.
- Metadata: information about runs, models, datasets, and other artifacts.
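A rough sketch, assuming the (legacy) Kubeflow Metadata SDK (`kubeflow-metadata` pip package) and an in-cluster metadata gRPC service; the host, workspace, and artifact fields below are illustrative assumptions from memory, not from the notes above.

```python
# Rough sketch: log a deployed model as a metadata artifact for audit purposes.
from kubeflow.metadata import metadata

store = metadata.Store(grpc_host="metadata-grpc-service.kubeflow", grpc_port=8080)
ws = metadata.Workspace(store=store, name="demand-forecasting",
                        description="audit trail for deployed models")
run = metadata.Run(workspace=ws, name="weekly-retrain-2021-12-10")
execution = metadata.Execution(name="train-and-deploy", workspace=ws, run=run)

# Record the deployed model as an output artifact so it can be audited later.
execution.log_output(metadata.Model(
    name="demand_forecaster",
    uri="gs://my-bucket/models/demand/v7",   # placeholder location
    version="v7"))
```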
TFX vs. Kubeflow
TFX
- Runs on Apache Beam; designed for machine learning deployment pipelines created with TensorFlow.
Kubeflow
- Runs on Kubernetes and offers pipelines for many frameworks: TensorFlow, PyTorch, XGBoost, ...
- Other tools: notebooks and metadata management.
Orchestration tool: Cloud Composer/Apache Airflow, especially for ETL & ELT
- Built on top of Apache Airflow.
- Fully managed service for orchestration.
- Create, schedule, monitor, and manage workflows.
- NOT suitable if low latency is required between tasks.
- Need to specify how many workers you want for a given Composer environment.
- Good fit for building batch orchestration workflows for data engineering (ETL); see the DAG sketch after this list.
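For reference, a minimal Airflow DAG of the kind Cloud Composer schedules for batch ETL; the DAG id and task commands are placeholder examples, not part of the exam material.

```python
# Minimal sketch: a weekly batch ETL DAG that Cloud Composer (managed Airflow) could run.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="weekly_feature_etl",
    start_date=datetime(2021, 12, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract raw data'")
    transform = BashOperator(task_id="transform", bash_command="echo 'normalize features'")
    load = BashOperator(task_id="load", bash_command="echo 'load to BigQuery'")

    extract >> transform >> load  # simple linear batch workflow
```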
Orchestration tool: Cloud Scheduler, especially for a single service
Orchestration tool: Workflows, especially for microservices
- Serverless: no infrastructure to manage or scale → no need to specify how many workers you need.
- Designed for latency-sensitive use cases: low latency or a high execution count.
Q 36.
You are building a model to predict daily temperatures. You split the data randomly and then transformed the training and test datasets. Temperature data for model training is uploaded hourly. During testing, your model performed with 97% accuracy; however, after deploying to production, the model's accuracy dropped to 66%. How can you make your production model more accurate?
Test Performance vs. Production Performance
- ❌ A. Normalize the data for the training and test datasets as two separate steps. → solution for overfitting
- ⭕ B. Split the training and test data based on time rather than a random split to avoid leakage. (see the sketch after this section)
- ❌ C. Add more data to your test set to ensure that you have a fair distribution and sample for testing. → solution for overfitting
- ❌ D. Apply data transformations before splitting, and cross-validate to make sure that the transformations are applied to both the training and test sets. → doesn't improve anything at all; splitting then transforming is no different from transforming then splitting if the transform logic is the same.
- Model predicts daily temperatures: time-series data; testing accuracy 97% vs. production accuracy 66% → data leakage?
Target leakage
Target leakage happens when your training data includes predictive information that is not available when you ask for a prediction. Target leakage can cause your model to show excellent evaluation metrics, but perform poorly on real data.
In time-series problems, it's important to split the data temporally so that you are not leaking future information, which would not be available at test time, into the trained model. If you leak it, you artificially increase your accuracy.
For example, suppose you want to know how much ice cream your store will sell tomorrow. You cannot include the target day's temperature in your training data, because you will not know the temperature (it hasn't happened yet). However, you could use the predicted temperature from the previous day, which could be included in the prediction request.
Tabular data preparation best practices > Avoid target leakage
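To make option B concrete, here is a small sketch of a time-based split in pandas, with the normalization statistics fitted on the training window only; the file and column names are assumed placeholders.

```python
# Minimal sketch: chronological train/test split for time-series data.
import pandas as pd

df = pd.read_csv("daily_temperatures.csv", parse_dates=["date"]).sort_values("date")

split_idx = int(len(df) * 0.8)              # earliest 80% of the timeline for training
train = df.iloc[:split_idx].copy()
test = df.iloc[split_idx:].copy()           # most recent 20% for testing

# Fit normalization statistics on the training window only, then reuse them on
# the test window, so no future information leaks into training.
mean, std = train["temperature"].mean(), train["temperature"].std()
train["temperature_z"] = (train["temperature"] - mean) / std
test["temperature_z"] = (test["temperature"] - mean) / std
```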