Designing data preparation and processing systems - Data Preprocessing Pipeline

Section 3 - Designing data preparation and processing systems: this section summarizes data preprocessing topics.

Best practices for preprocessing data in an ML pipeline on Google Cloud

Using BigQuery, Dataflow, AI Platform (formerly Cloud ML Engine), and the tf.estimator API.

Data engineering vs. feature engineering

  • Preprocessing the data for ML

    • Data engineering: the process of converting raw data into prepared data.
    • Feature engineering: tuning the prepared data to create the features expected by the ML model.
  • Raw data → Prepared data → Engineered features.

    • Raw data (or just data): the data in its raw form (in a data lake) or in a transformed form (in a data warehouse), not yet prepared specifically for the ML task
      • e.g., data sent from streaming systems
    • Prepared data: the dataset in a form ready for the ML task: parsed, joined, and put into tabular form, then aggregated and summarized to the right granularity
    • Engineered features: the dataset with the tuned features expected by the model (ML-specific operations)
      • e.g., one-hot-encoded categorical features, scaled numerical columns
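The three stages can be sketched on a toy record; all field names, the country vocabulary, and the scaling constant below are hypothetical, not from the source:

```python
# Hypothetical toy example of raw -> prepared -> engineered, in plain Python.

# Raw data: an event as it might arrive from a streaming system.
raw = {"user": "u1", "ts": "2021-03-01T10:00:00", "amount": "12.50", "country": "KR"}

# Prepared data: parsed and typed, in tabular form ready for the ML task.
prepared = {"user": raw["user"], "amount": float(raw["amount"]), "country": raw["country"]}

# Engineered features: ML-specific transforms (scaling, one-hot encoding).
COUNTRIES = ["KR", "US", "JP"]   # assumed vocabulary
MAX_AMOUNT = 100.0               # assumed scaling constant
features = {
    "amount_scaled": prepared["amount"] / MAX_AMOUNT,
    **{f"country_{c}": int(prepared["country"] == c) for c in COUNTRIES},
}
print(features)
# -> {'amount_scaled': 0.125, 'country_KR': 1, 'country_US': 0, 'country_JP': 0}
```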

Preprocessing operations for structured data

  • Data cleansing
    • Removing or correcting records with corrupted or invalid values, removing records that are missing a large number of columns.
  • Instance selection and partitioning
  • Feature tuning: Improving the quality of a feature for ML
    • scaling and normalizing numeric values, imputing missing values, clipping outliers, adjusting values with skewed distributions
  • Representation transformation
    • Converting a numeric feature to a categorical feature; e.g., bucketization
    • Converting categorical features to a numeric representation; e.g., one-hot encoding, learning with counts, sparse feature embeddings
  • Feature extraction: Reducing the number of features by creating lower-dimension, more powerful data representations, using techniques such as PCA, embedding extraction, and hashing
  • Feature selection: Selecting a subset of the input features for training the model, and ignoring the irrelevant or redundant ones
  • Feature construction: Creating new features, e.g., by polynomial expansion or feature crossing
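A pure-Python sketch of several of the operations above (imputation, clipping, min-max scaling, bucketization); the thresholds and bucket boundaries are illustrative assumptions, not values from the source:

```python
# Hedged sketch of common structured-data preprocessing operations.

def impute(values, sentinel=None):
    """Data cleansing: fill missing values with the mean of the observed ones."""
    observed = [v for v in values if v is not sentinel]
    mean = sum(observed) / len(observed)
    return [mean if v is sentinel else v for v in values]

def clip(v, lo, hi):
    """Feature tuning: clip outliers to the [lo, hi] range."""
    return max(lo, min(hi, v))

def minmax_scale(values):
    """Feature tuning: scale numeric values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def bucketize(v, boundaries):
    """Representation transformation: numeric -> categorical bucket index."""
    return sum(v >= b for b in boundaries)

ages = [22, None, 35, 120, 41]
ages = impute(ages)                                # [22, 54.5, 35, 120, 41]
ages = [clip(a, 0, 90) for a in ages]              # outlier 120 clipped to 90
buckets = [bucketize(a, [30, 60]) for a in ages]   # 3 assumed age buckets
print(minmax_scale(ages), buckets)
```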

Preprocessing operations for unstructured data

  • Text : stemming and lemmatization, TF-IDF calculation, n-gram extraction, embedding lookup.
  • Images : clipping, resizing, cropping, Gaussian blur, and Canny filters.
  • Transfer learning : treating all-but-last layers of the fully trained model as a feature engineering step.
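Two of the text operations above (n-gram extraction and TF-IDF calculation) can be sketched in plain Python; the smoothing-free TF-IDF formula used here is one common variant, not necessarily the one a given library implements:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Extract word n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """TF-IDF per document: term frequency x inverse document frequency."""
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    n_docs = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [["ml", "pipeline", "on", "cloud"], ["ml", "model", "training"]]
print(ngrams(docs[0], 2))          # [('ml', 'pipeline'), ('pipeline', 'on'), ('on', 'cloud')]
print(tfidf(docs)[0]["pipeline"])  # (1/4) * log(2): unique to one of two docs
```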

Machine learning pipeline on Google Cloud

How to build the building blocks of a typical end-to-end pipeline to train and serve TensorFlow ML models on Google Cloud using managed services.

High-level architecture of typical ML pipeline for training and serving TensorFlow models

Architecture diagram showing stages for processing data

  1. Import raw data and store

    • in BigQuery (A)
    • in Cloud Storage : in the case of images, docs, audio, video
  2. Dataflow (B)

    • Executes data engineering (preparation) and feature engineering at scale.
    • Produces ML-ready training, evaluation, and test sets that are stored in Cloud Storage (in TFRecord, a format optimized for TensorFlow computations).
  3. TF Model Trainer Package (C) submitted to AI Platform (using the preprocessed data from the previous steps)

  4. The trained TensorFlow model is deployed to AI Platform as a microservice that has a REST API so that it can be used for online predictions. (or batch prediction jobs.)

  5. After the model is deployed as a REST API, client apps and internal systems can invoke this API by sending requests with data points and receiving responses from the model with predictions.

  6. Cloud Composer : orchestrates and automates the pipeline, acting as a scheduler that invokes the data preparation, model training, and model deployment steps.
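As an illustration of step 5, the request to the deployed model's REST endpoint can be sketched as below. PROJECT and MODEL are hypothetical names, and the request body follows the AI Platform online-prediction format (a JSON object with an "instances" list); the feature keys are made-up examples:

```python
import json

# Hedged sketch of invoking a model deployed on AI Platform (step 5).
PROJECT, MODEL = "my-project", "my-model"   # hypothetical project/model names
url = f"https://ml.googleapis.com/v1/projects/{PROJECT}/models/{MODEL}:predict"

# Online-prediction request body: one dict per data point under "instances".
body = json.dumps({"instances": [{"amount_scaled": 0.125, "country_KR": 1}]})

# An authenticated POST of `body` to `url` (e.g. via google-auth + requests)
# would return a JSON response of the form {"predictions": [...]}.
print(url)
print(body)
```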


Source & Reference : Data preprocessing for machine learning: options and recommendations