[DATA ENGINEER LEARNING PATH] 1 - Building and Operationalizing Data Processing Systems

CourseNote - Preparing for the Google Cloud Professional Data Engineer Exam | Google Cloud Skills Boost

Building and Operationalizing Pipelines

  • Continuous (streaming) data can arrive out of order.
    • Simple windowing can split related events into independent windows, losing the relationship between them.
    • Windowing on event time, rather than on arrival time, overcomes this limitation.

Dataflow does batch and streaming

  • Apache Beam : open-source, unified programming model for batch and streaming data processing.
    • Before Apache Beam, you needed two separate pipelines to balance latency, throughput, and fault tolerance.
  • Dataflow : Apache Beam as a service; a fully managed, autoscaling service that runs Beam pipelines (see the sketch below).
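
A minimal sketch of the unified model in the Beam Python SDK: the same transforms run in batch or streaming, and only the source and the runner change. The word-count logic and the option values are illustrative, not from the course.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # runner="DirectRunner" executes locally; runner="DataflowRunner" (plus
    # project, region, and temp_location options) hands the identical
    # pipeline to the managed Dataflow service.
    options = PipelineOptions(runner="DirectRunner")

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.Create(["go", "go", "stop"])  # swap in a file or Pub/Sub source
         | "PairWithOne" >> beam.Map(lambda word: (word, 1))
         | "CountPerWord" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))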

Dataflow solves many stream processing issues

  • Size
    Autoscaling and dynamic work rebalancing handle variable data volumes and growth.
  • Scalability and Fault-tolerance
    On-demand provisioning and distributed processing provide scale with fault tolerance.
  • Programming Model
    Windowing, triggering, incremental processing, and out-of-order/late data are addressed in the streaming model.
  • Unbounded data
    Efficient pipelines (Apache Beam) + Efficient execution (Dataflow).
  • There is no substitute for Dataflow's windowing capabilities when processing streaming data.

Dataflow windowing for streams

  • To compute averages on streaming data, we need to bound the computation within time windows.
  • Windows are the answer to "Where in event time?"
    • Windowing creates individual results for different slices of event time.
    • Windowing divides a PCollection up into finite chunks based on the event time of each message.
    • Useful in many contexts but is required when aggregating over infinite data.
    • Basic windowing methods : fixed, sliding, and session windows (see the sketch after this list).
      • Fixed time such as a daily window
      • Sliding and overlapping windows such as the last 24 hours
      • Session-based windows that are triggered to capture bursts of activity
  • Triggering controls when results are emitted to the next transforms in the pipeline.
  • Watermark is a heuristic that tracks how far behind event time the system's processing lags; together with triggers, it answers "When in processing time are results materialized?"
  • Triggers can emit speculative (early) results before the watermark and updated (late) results for data that arrives after it (see the trigger in the sketch below).
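
A sketch of the three basic window types and an early/late trigger in the Beam Python SDK. The durations, sample elements, and timestamps are illustrative assumptions.

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterProcessingTime, AfterWatermark)

    HOUR = 60 * 60

    with beam.Pipeline() as p:
        # Assumed input: elements with attached event-time timestamps.
        events = (p
                  | beam.Create([("sensor-1", 10.0), ("sensor-1", 20.0)])
                  | beam.Map(lambda e: window.TimestampedValue(e, 1700000000)))

        # Fixed: one non-overlapping window per day.
        daily = events | "Daily" >> beam.WindowInto(window.FixedWindows(24 * HOUR))

        # Sliding: each window spans 24 hours; a new one starts every hour.
        last_24h = events | "Last24h" >> beam.WindowInto(
            window.SlidingWindows(size=24 * HOUR, period=HOUR))

        # Session: a burst of activity closes after 10 minutes of inactivity.
        bursts = events | "Bursts" >> beam.WindowInto(window.Sessions(gap_size=10 * 60))

        # Triggering: speculative (early) panes each minute before the
        # watermark, updated (late) panes for data arriving after it.
        triggered = events | "Triggered" >> beam.WindowInto(
            window.FixedWindows(HOUR),
            trigger=AfterWatermark(early=AfterProcessingTime(60),
                                   late=AfterProcessingTime(60)),
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=HOUR)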

Side inputs in Dataflow

  • Pipeline to detect accidents
    A DetectAccidents transform uses the average speed at each location as a side input (see the sketch below).
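
A hedged sketch of this side-input pattern. The course names a DetectAccidents step but shows no code; the fields, the below-half-average threshold, and the sample data are illustrative assumptions.

    import apache_beam as beam

    class DetectAccidents(beam.DoFn):
        """Flags readings far below the average speed at their location."""
        def process(self, reading, avg_speeds):
            # avg_speeds is a dict side input: location -> mean speed.
            if reading["speed"] < 0.5 * avg_speeds[reading["location"]]:
                yield reading

    with beam.Pipeline() as p:
        readings = p | beam.Create([
            {"location": "A", "speed": 60.0},
            {"location": "A", "speed": 10.0},  # well below the local average
        ])
        avg_by_location = (readings
                           | beam.Map(lambda r: (r["location"], r["speed"]))
                           | beam.combiners.Mean.PerKey())
        (readings
         | beam.ParDo(DetectAccidents(),
                      avg_speeds=beam.pvalue.AsDict(avg_by_location))
         | beam.Map(print))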

Building and Operationalizing Processing Infrastructure

Building a streaming pipeline

  • Stream from Pub/Sub into BigQuery
    • BigQuery
      • Provides streaming ingest of unbounded data sets.
      • Supports streaming ingestion at a rate of 100,000 rows per table per second.
    • Pub/Sub
      • Guarantees delivery, but not the order of messages.
      • "At least once" : repeated delivery of the same message is possible.
        → If you have a TIMESTAMP, Dataflow stream processing can remove duplicates based on internal Pub/Sub ID and can work with out-of-order messages when computing aggregates.
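
A sketch of this streaming pattern in the Beam Python SDK. id_label names a Pub/Sub message attribute used to drop redelivered duplicates (honored by the Dataflow runner), and timestamp_attribute supplies the event timestamp; the topic, table, schema, and attribute names are placeholders.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/events",
               id_label="message_id",             # attribute used to remove duplicates
               timestamp_attribute="event_time")  # event time, not arrival time
         | "Parse" >> beam.Map(lambda data: {"payload": data.decode("utf-8")})
         | "Write" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.events",
               schema="payload:STRING",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))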

Data processing solutions

Scaling streaming beyond BigQuery

  • Cloud Bigtable : low latency / high throughput (see the sink sketch at the end of this section)
    • ~100,000 QPS at 6 ms latency on a 10-node cluster
    • More cost-efficient than Cloud Spanner for this throughput, which would need ~150 nodes
  • BigQuery : Easy, inexpensive
    • Latency on the order of seconds
    • Streaming ingest at about 100,000 rows per second
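
When BigQuery streaming inserts become the bottleneck, the same Beam pipeline can swap its sink to Cloud Bigtable. A minimal sketch, assuming placeholder project, instance, and table IDs and an illustrative row-key design.

    import datetime
    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable import row as bigtable_row

    def to_bigtable_row(event):
        # Row-key design drives Bigtable performance; "location#timestamp"
        # is one common pattern, used here purely for illustration.
        direct_row = bigtable_row.DirectRow(
            row_key=f"{event['location']}#{event['ts']}")
        direct_row.set_cell("data", "speed", str(event["speed"]),
                            timestamp=datetime.datetime.utcnow())
        return direct_row

    # Inside a pipeline, replace the BigQuery sink with:
    #   events | beam.Map(to_bigtable_row) | WriteToBigTable(
    #       project_id="my-project", instance_id="my-instance",
    #       table_id="events")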