Designing data preparation and processing systems - Data Preprocessing Pipeline

Section 3 - Designing data preparation and processing systems: this section summarizes data preprocessing topics.

Best practices for preprocessing data in an ML pipeline on Google Cloud

Using BigQuery, Dataflow, AI Platform (formerly Cloud ML Engine), and the tf.estimator API.

Data engineering vs. feature engineering

  • Preprocessing the data for ML

    • Data engineering: the process of converting raw data into prepared data.
    • Feature engineering: tuning the prepared data to create the features expected by the ML model.
  • Raw data → Prepared data → Engineered features.

    • Raw data (or just data): the data in its raw form (in a data lake) or in a transformed form (in a data warehouse), not yet prepared specifically for the ML task
      • e.g., data sent from streaming systems
    • Prepared data: the dataset in a form ready for the ML task: parsed, joined, and put into tabular form, then aggregated and summarized to the right granularity
    • Engineered features: the dataset with the tuned features expected by the model (ML-specific operations)
      • e.g., one-hot-encoded categorical features, scaled numerical columns
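The three stages can be sketched on a toy record; all field names, the country vocabulary, and the scaling constant below are hypothetical, not from the source:

```python
# Hypothetical toy example of raw -> prepared -> engineered, in plain Python.

# Raw data: an event as it might arrive from a streaming system.
raw = {"user": "u1", "ts": "2021-03-01T10:00:00", "amount": "12.50", "country": "KR"}

# Prepared data: parsed and typed, in tabular form ready for the ML task.
prepared = {"user": raw["user"], "amount": float(raw["amount"]), "country": raw["country"]}

# Engineered features: ML-specific transforms (scaling, one-hot encoding).
COUNTRIES = ["KR", "US", "JP"]   # assumed vocabulary
MAX_AMOUNT = 100.0               # assumed scaling constant
features = {
    "amount_scaled": prepared["amount"] / MAX_AMOUNT,
    **{f"country_{c}": int(prepared["country"] == c) for c in COUNTRIES},
}
print(features)
# -> {'amount_scaled': 0.125, 'country_KR': 1, 'country_US': 0, 'country_JP': 0}
```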

Preprocessing operations for structured data

  • Data cleansing
    • Removing or correcting records with corrupted or invalid values, removing records that are missing a large number of columns.
  • Instance selection and partitioning
  • Feature tuning: Improving the quality of a feature for ML
    • scaling and normalizing numeric values, imputing missing values, clipping outliers, adjusting values with skewed distributions
  • Representation transformation
    • Converting a numeric feature to a categorical feature; e.g., bucketization
    • Converting categorical features to a numeric representation; e.g., one-hot encoding, learning with counts, sparse feature embeddings
  • Feature extraction: Reducing the number of features by creating lower-dimension, more powerful data representations, using techniques such as PCA, embedding extraction, and hashing
  • Feature selection: Selecting a subset of the input features for training the model, and ignoring the irrelevant or redundant ones
  • Feature construction: Creating new features, e.g., by polynomial expansion or feature crossing
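A pure-Python sketch of several of the operations above (imputation, clipping, min-max scaling, bucketization); the thresholds and bucket boundaries are illustrative assumptions, not values from the source:

```python
# Hedged sketch of common structured-data preprocessing operations.

def impute(values, sentinel=None):
    """Data cleansing: fill missing values with the mean of the observed ones."""
    observed = [v for v in values if v is not sentinel]
    mean = sum(observed) / len(observed)
    return [mean if v is sentinel else v for v in values]

def clip(v, lo, hi):
    """Feature tuning: clip outliers to the [lo, hi] range."""
    return max(lo, min(hi, v))

def minmax_scale(values):
    """Feature tuning: scale numeric values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def bucketize(v, boundaries):
    """Representation transformation: numeric -> categorical bucket index."""
    return sum(v >= b for b in boundaries)

ages = [22, None, 35, 120, 41]
ages = impute(ages)                                # [22, 54.5, 35, 120, 41]
ages = [clip(a, 0, 90) for a in ages]              # outlier 120 clipped to 90
buckets = [bucketize(a, [30, 60]) for a in ages]   # 3 assumed age buckets
print(minmax_scale(ages), buckets)
```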

Preprocessing operations for unstructured data

  • Text : stemming and lemmatization, TF-IDF calculation, n-gram extraction, embedding lookup.
  • Images : clipping, resizing, cropping, Gaussian blur, and Canny filters.
  • Transfer learning : treating all-but-last layers of the fully trained model as a feature engineering step.
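Two of the text operations above (n-gram extraction and TF-IDF calculation) can be sketched in plain Python; the smoothing-free TF-IDF formula used here is one common variant, not necessarily the one a given library implements:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Extract word n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """TF-IDF per document: term frequency x inverse document frequency."""
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    n_docs = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [["ml", "pipeline", "on", "cloud"], ["ml", "model", "training"]]
print(ngrams(docs[0], 2))          # [('ml', 'pipeline'), ('pipeline', 'on'), ('on', 'cloud')]
print(tfidf(docs)[0]["pipeline"])  # (1/4) * log(2): unique to one of two docs
```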

Machine learning pipeline on Google Cloud

How to build the building blocks of a typical end-to-end pipeline to train and serve TensorFlow ML models on Google Cloud using managed services.

High-level architecture of typical ML pipeline for training and serving TensorFlow models

Architecture diagram showing stages for processing data

  1. Import raw data and store

    • in BigQuery (A)
    • in Cloud Storage : in the case of images, docs, audio, video
  2. Dataflow (B)

    • Executes data engineering (preparation) and feature engineering at scale.
    • Produces ML-ready training, evaluation, and test sets that are stored in Cloud Storage (in TFRecord, a format optimized for TensorFlow computations).
  3. TF Model Trainer Package (C) submitted to AI Platform (using the preprocessed data from the previous steps)

  4. The trained TensorFlow model is deployed to AI Platform as a microservice that has a REST API so that it can be used for online predictions. (or batch prediction jobs.)

  5. After the model is deployed as a REST API, client apps and internal systems can invoke this API by sending requests with data points and receiving responses from the model with predictions.

  6. Cloud Composer : orchestrates and automates the pipeline, acting as a scheduler that invokes the data preparation, model training, and model deployment steps.
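As an illustration of step 5, the request to the deployed model's REST endpoint can be sketched as below. PROJECT and MODEL are hypothetical names, and the request body follows the AI Platform online-prediction format (a JSON object with an "instances" list); the feature keys are made-up examples:

```python
import json

# Hedged sketch of invoking a model deployed on AI Platform (step 5).
PROJECT, MODEL = "my-project", "my-model"   # hypothetical project/model names
url = f"https://ml.googleapis.com/v1/projects/{PROJECT}/models/{MODEL}:predict"

# Online-prediction request body: one dict per data point under "instances".
body = json.dumps({"instances": [{"amount_scaled": 0.125, "country_KR": 1}]})

# An authenticated POST of `body` to `url` (e.g. via google-auth + requests)
# would return a JSON response of the form {"predictions": [...]}.
print(url)
print(body)
```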


Source & Reference : Data preprocessing for machine learning: options and recommendations