Production ML Systems - Tuning Prediction Performance (Batch / Streaming Pipelines)

This note summarizes how to tune prediction performance for each pipeline type: batch vs. streaming.

Prediction/Inference Performance

Performance must be considered at prediction time, not just during training.

  • 3 considerations for inference performance: throughput, latency, cost
    1. Throughput requirements : how many queries per second do you need to process?
    2. Latency requirements : how long can a single query take?
    3. Cost : in terms of infrastructure and maintenance
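Throughput and latency trade off against cost: at a fixed per-query latency, higher throughput means more serving replicas. A back-of-the-envelope sizing sketch (the numbers and the single-threaded-replica assumption are illustrative, not from the course):

```python
import math

def replicas_needed(target_qps: float, latency_s: float) -> int:
    """Estimate serving replicas, assuming one replica handles one
    query at a time, i.e. at most 1/latency_s queries per second."""
    per_replica_qps = 1.0 / latency_s
    return math.ceil(target_qps / per_replica_qps)

# e.g. 500 queries/sec at 40 ms per query -> 20 replicas
print(replicas_needed(500, 0.040))  # → 20
```

More replicas raise infrastructure cost, which is why all three requirements have to be set together rather than optimized one at a time.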
  • 3 Options for Prediction/Inference Implementation

    • Using a deployed model's REST/HTTP API for streaming
    • Using batch prediction jobs on Cloud ML Engine (CMLE)
    • Using direct model prediction in Cloud Dataflow for batch or streaming
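For the first option, a streaming client sends instances to the deployed model's REST endpoint. A minimal sketch of the request a client would build for CMLE / AI Platform online prediction; the project, model, and feature names are placeholders:

```python
import json

# Placeholders — substitute your own project and model.
PROJECT, MODEL = "my-project", "my-model"
url = f"https://ml.googleapis.com/v1/projects/{PROJECT}/models/{MODEL}:predict"

# The online prediction API expects a JSON body of the form
# {"instances": [...]}, one entry per prediction request.
body = json.dumps({"instances": [{"feature_a": 1.0, "feature_b": 3.5}]})

print(url)
print(body)
# A real client would POST `body` to `url` with an OAuth bearer token,
# e.g. obtained via the google-auth library.
```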

Batch data pipeline

Batch = Bounded Dataset

  1. Read Data & Data Processing
    • Read data from some persistent storage
      • Google Cloud Storage (data lake), BigQuery (data warehouse)
    • Processing, carried out by Cloud Dataflow, typically enriches the data with the predictions of an ML model
  2. Inference
    1. Using a TensorFlow SavedModel
      • Load the TF SavedModel from Cloud Storage into the Dataflow job and invoke it directly
    2. Using TF Serving
      • Access TF Serving via an HTTP endpoint as a microservice, from CMLE or from Kubeflow (running on Kubernetes Engine)
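The first option follows the usual Dataflow pattern: load the model once per worker, then reuse it for every element. A sketch of that pattern with a stub model so it runs without TensorFlow or Beam installed; in a real pipeline the class would extend `beam.DoFn` and `setup` would call `tf.saved_model.load` on a `gs://` path (the path and field names below are illustrative):

```python
class PredictDoFn:
    """Stub of a Dataflow DoFn that enriches elements with predictions."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None  # loaded lazily, once per worker

    def setup(self):
        # Real version: self.model = tf.saved_model.load(self.model_path)
        self.model = lambda x: {"score": x["value"] * 2}  # stub predictor

    def process(self, element):
        if self.model is None:
            self.setup()
        # Enrich the input element with the model's prediction.
        yield {**element, **self.model(element)}

fn = PredictDoFn("gs://my-bucket/models/saved_model")  # hypothetical path
print(list(fn.process({"id": 1, "value": 10})))
# → [{'id': 1, 'value': 10, 'score': 20}]
```

Loading in `setup` rather than in `process` is what keeps this fast: the model is deserialized once per worker instead of once per element.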

Prediction Performance for Batch Pipelines

  • Raw processing speed (fastest first)
    Cloud ML Engine (AI Platform) batch prediction > TF SavedModel in Dataflow > TF Serving on Cloud ML Engine
  • Maintainability (easiest first)
    Cloud ML Engine (AI Platform) batch prediction > TF Serving on Cloud ML Engine > TF SavedModel in Dataflow

Laurence said "what's not to love about a fully managed service?" As a fully managed service, CMLE / AI Platform batch prediction comes out on top for both raw processing speed and maintainability.

Using online predictions as a microservice allows for easier upgradability and dependency management than loading the current model version into the Dataflow job. By contrast, the TF SavedModel and TF Serving on CMLE options swap rankings depending on whether raw processing speed or maintainability matters more.

Streaming data pipeline

A streaming pipeline is similar, except that the input dataset is unbounded.