Section 3 - Designing data preparation and processing systems
This post summarizes notes on data preprocessing.
Best practices for preprocessing data in an ML pipeline on Google Cloud
Using BigQuery, Dataflow, AI Platform (ML Engine), and the tf.estimator API.
Data engineering vs. feature engineering
Preprocessing the data for ML
Data engineering
: the process of converting raw data into prepared data.
Feature engineering
: the process of tuning the prepared data to create the features expected by the ML model.
Raw data → Prepared data → Engineered features.
Raw data (or just data)
: the data in its source form (in a data lake) or in a transformed form (in a data warehouse), not prepared specifically for the ML task.
- e.g., data sent from streaming systems
Prepared data
: the dataset in a form ready for the ML task: parsed, joined, put into tabular form, and aggregated and summarized to the right granularity.
Engineered features
: the dataset with the tuned features expected by the model, produced by ML-specific operations.
- e.g., one-hot encoding categorical features, scaling numerical columns
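To make the three stages concrete, here is a minimal sketch in plain Python. The record layout (user, country, amount), the sample values, and the function names `prepare` and `engineer` are all made up for illustration; a real pipeline would do the same thing at scale with the services covered later in this post.

```python
from collections import defaultdict

# Hypothetical raw records: "user,country,amount", one line per transaction.
raw_events = [
    "u1,KR,12.50",
    "u1,KR,30.00",
    "u2,US,8.00",
    "u2,US,bad_value",  # corrupted amount
]

def prepare(lines):
    """Data engineering: parse raw lines and aggregate to one row per user."""
    spend = defaultdict(float)
    country_of = {}
    for line in lines:
        user, country, amount = line.split(",")
        try:
            value = float(amount)
        except ValueError:
            continue  # drop records with invalid values
        spend[user] += value
        country_of[user] = country
    return [
        {"user": u, "country": country_of[u], "spend": spend[u]}
        for u in sorted(spend)
    ]

def engineer(rows, vocabulary):
    """Feature engineering: scale the numeric column, one-hot encode the country."""
    max_spend = max(r["spend"] for r in rows) or 1.0
    features = []
    for r in rows:
        scaled = r["spend"] / max_spend
        one_hot = [1 if r["country"] == c else 0 for c in vocabulary]
        features.append([scaled] + one_hot)
    return features

prepared = prepare(raw_events)                          # raw data -> prepared data
engineered = engineer(prepared, vocabulary=["KR", "US"])  # prepared -> features
```

Note that `prepare` knows nothing about the model, while `engineer` only makes sense for a specific model input layout. That is the practical difference between the two stages.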
Preprocessing operations for structured data
- Data cleansing
  - Removing or correcting records with corrupted or invalid values, and removing records that are missing a large number of columns.
- Instance selection and partitioning
  - Selecting data points from the input dataset to create training, evaluation (validation), and test sets.
  - Techniques include repeatable random sampling, minority-class oversampling, and stratified partitioning.
- Feature tuning: improving the quality of a feature for ML
  - Scaling and normalizing numeric values, imputing missing values, clipping outliers, and adjusting values with skewed distributions.
- Representation transformation
  - Converting a numeric feature to a categorical feature (bucketization).
  - Converting categorical features to a numeric representation (one-hot encoding, learning with counts, sparse feature embeddings).
- Feature extraction: reducing the number of features by creating lower-dimension, more powerful data representations, using techniques such as PCA, embedding extraction, and hashing.
- Feature selection: selecting a subset of the input features for training the model and ignoring the irrelevant or redundant ones
  - Using filter or wrapper methods.
  - Simply dropping features that are missing a large number of values.
- Feature construction: creating new features, for example by polynomial expansion or feature crossing.
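A few of the operations listed above can be sketched with the Python standard library alone. This is only an illustration under my own assumptions: the function names, the 80/10/10 split, the use of an MD5 hash for repeatable partitioning, and the sample `ages` column are not from the reference article.

```python
import bisect
import hashlib
import statistics

def assign_split(key, train_pct=80, eval_pct=10):
    """Repeatable partitioning: the same key always lands in the same split."""
    bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + eval_pct:
        return "eval"
    return "test"

def impute_and_clip(values, lower, upper):
    """Feature tuning: fill missing values with the median, then clip outliers."""
    median = statistics.median(v for v in values if v is not None)
    filled = [median if v is None else v for v in values]
    return [min(max(v, lower), upper) for v in filled]

def min_max_scale(values):
    """Feature tuning: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def bucketize(value, boundaries):
    """Representation transformation: numeric value -> bucket index."""
    return bisect.bisect_right(boundaries, value)

def one_hot(value, vocabulary):
    """Representation transformation: categorical value -> 0/1 vector."""
    return [1 if value == item else 0 for item in vocabulary]

ages = [23, None, 41, 35, 120]                  # hypothetical column
clean = impute_and_clip(ages, lower=0, upper=90)
scaled = min_max_scale(clean)
age_bucket = bucketize(35, boundaries=[18, 30, 50])
country_vec = one_hot("KR", vocabulary=["JP", "KR", "US"])
split = assign_split("user-42")
```

Hashing a stable key instead of calling a random number generator is what makes the split repeatable: rerunning the job, or adding new rows, never moves an existing row from the training set into the test set.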
Preprocessing operations for unstructured data
- Text: stemming and lemmatization, TF-IDF calculation, n-gram extraction, embedding lookup.
- Images: clipping, resizing, cropping, Gaussian blur, and Canny filters.
- Transfer learning: treating all-but-last layers of the fully trained model as a feature engineering step.
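For the text operations, n-gram extraction and TF-IDF are simple enough to sketch by hand. The snippet below assumes whitespace tokenization and the plain, unsmoothed definition tf × log(N / df); libraries typically use smoothed variants, so the numbers will not match theirs exactly. The three sample sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the model reads raw data".split(),
    "the pipeline prepares data".split(),
    "the model serves predictions".split(),
]
bigrams = ngrams(docs[0], 2)
weights = tf_idf(docs)
```

A term that appears in every document (here "the") gets weight 0, while a term unique to one document (here "raw") gets the highest weight in that document.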
Machine learning pipeline on Google Cloud
This section covers the building blocks of a typical end-to-end pipeline that trains and serves TensorFlow ML models on Google Cloud using managed services.
High-level architecture of typical ML pipeline for training and serving TensorFlow models
Import raw data and store it
- in BigQuery (A)
- in Cloud Storage: for images, documents, audio, and video
Dataflow (B)
- Execute data engineering (preparation) and feature engineering at scale.
- Produce ML-ready training, evaluation, and test sets that are stored in Cloud Storage (in a format optimized for TensorFlow computations).
TF model trainer package (C), submitted to AI Platform (using the preprocessed data from the previous steps)
- Trainer Package
- Output : trained TensorFlow SavedModel exported to Cloud Storage.
The trained TensorFlow model is deployed to AI Platform as a microservice with a REST API so that it can be used for online predictions (or for batch prediction jobs).
After the model is deployed as a REST API, client apps and internal systems can invoke this API by sending requests with some data points and receiving responses from the model with predictions.
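As a rough sketch of what such a client call looks like, the snippet below builds (but does not send) an online prediction request. It assumes the AI Platform Prediction REST endpoint `projects/PROJECT/models/MODEL:predict` with a JSON body of the form `{"instances": [...]}`. The project name, model name, feature names, and the `ACCESS_TOKEN` placeholder are hypothetical, and `build_predict_request` is just a helper name I made up.

```python
import json
import urllib.request

def build_predict_request(project, model, instances, access_token):
    """Build a POST request for online prediction without sending it."""
    url = f"https://ml.googleapis.com/v1/projects/{project}/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_predict_request(
    project="my-project",                       # hypothetical project ID
    model="my_model",                           # hypothetical model name
    instances=[{"age": 35, "country": "KR"}],   # hypothetical data point
    access_token="ACCESS_TOKEN",                # placeholder, not a real token
)

# Sending it would look like this (needs a real token and a deployed model):
#   with urllib.request.urlopen(request) as response:
#       predictions = json.load(response)["predictions"]
```

Each entry in `instances` must match the input signature of the deployed SavedModel, so any feature engineering that is not baked into the model itself has to be repeated by the caller before building the request.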
Cloud Composer: orchestrates and automates the pipeline, acting as a scheduler that invokes the data preparation, model training, and model deployment steps.
Source&Reference : Data preprocessing for machine learning: options and recommendations