[DATA ENGINEER LEARNING PATH] 1 - Building and Operationalizing Data Processing Systems

CourseNote - Preparing for the Google Cloud Professional Data Engineer Exam | Google Cloud Skills Boost

Building and Operationalizing Pipelines

  • Continuous (streaming) data can arrive out of order.
    • Simple windowing can split related events into independent windows, losing the relationship between them.
    • Windowing on event time, rather than on arrival time, overcomes this limitation.

Dataflow does batch and streaming

  • Apache Beam : open-source, unified programming model for batch and streaming data processing.
    • Before Apache Beam, you needed two separate pipelines to balance latency, throughput, and fault tolerance.
  • Dataflow : Apache Beam as a service; a fully managed, autoscaling service that runs Beam pipelines (see the sketch below).
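
A minimal sketch of the unified model in the Beam Python SDK: the same transforms run in batch or streaming, and only the source and the runner change. The word-count logic and the option values are illustrative, not from the course.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # runner="DirectRunner" executes locally; runner="DataflowRunner" (plus
    # project, region, and temp_location options) hands the identical
    # pipeline to the managed Dataflow service.
    options = PipelineOptions(runner="DirectRunner")

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.Create(["go", "go", "stop"])  # swap in a file or Pub/Sub source
         | "PairWithOne" >> beam.Map(lambda word: (word, 1))
         | "CountPerWord" >> beam.CombinePerKey(sum)
         | "Print" >> beam.Map(print))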

Dataflow solves many stream processing issues

  • Size
    Autoscaling and dynamic work rebalancing handle variable data volumes and growth.
  • Scalability and Fault-tolerance
    On-demand provisioning and distributed processing provide scale with fault tolerance.
  • Programming Model
    Windowing, triggering, incremental processing, and out-of-order/late data are addressed in the streaming model.
  • Unbounded data
    Efficient pipelines (Apache Beam) + Efficient execution (Dataflow).
  • There is no substitute for Dataflow's windowing capabilities when processing streaming data.

Dataflow windowing for streams

  • To compute averages on streaming data, we need to bound the computation within time windows.
  • Windows are the answer to "Where in event time?"
    • Windowing creates individual results for different slices of event time.
    • Windowing divides a PCollection up into finite chunks based on the event time of each message.
    • Useful in many contexts but is required when aggregating over infinite data.
    • Basic windowing methods : fixed, sliding, and session windows (see the sketch after this list).
      • Fixed time such as a daily window
      • Sliding and overlapping windows such as the last 24 hours
      • Session-based windows that are triggered to capture bursts of activity
  • Triggering controls when results are emitted to the next transforms in the pipeline.
  • Watermark is a heuristic that tracks how far behind event time the system's processing lags; together with triggers, it answers "When in processing time are results materialized?"
  • Triggers can emit speculative (early) results before the watermark and updated (late) results for data that arrives after it (see the trigger in the sketch below).
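
A sketch of the three basic window types and an early/late trigger in the Beam Python SDK. The durations, sample elements, and timestamps are illustrative assumptions.

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterProcessingTime, AfterWatermark)

    HOUR = 60 * 60

    with beam.Pipeline() as p:
        # Assumed input: elements with attached event-time timestamps.
        events = (p
                  | beam.Create([("sensor-1", 10.0), ("sensor-1", 20.0)])
                  | beam.Map(lambda e: window.TimestampedValue(e, 1700000000)))

        # Fixed: one non-overlapping window per day.
        daily = events | "Daily" >> beam.WindowInto(window.FixedWindows(24 * HOUR))

        # Sliding: each window spans 24 hours; a new one starts every hour.
        last_24h = events | "Last24h" >> beam.WindowInto(
            window.SlidingWindows(size=24 * HOUR, period=HOUR))

        # Session: a burst of activity closes after 10 minutes of inactivity.
        bursts = events | "Bursts" >> beam.WindowInto(window.Sessions(gap_size=10 * 60))

        # Triggering: speculative (early) panes each minute before the
        # watermark, updated (late) panes for data arriving after it.
        triggered = events | "Triggered" >> beam.WindowInto(
            window.FixedWindows(HOUR),
            trigger=AfterWatermark(early=AfterProcessingTime(60),
                                   late=AfterProcessingTime(60)),
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=HOUR)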

Side inputs in Dataflow

  • Pipeline to detect accidents
    A DetectAccidents transform uses the average speed at each location as a side input (see the sketch below).
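
A hedged sketch of this side-input pattern. The course names a DetectAccidents step but shows no code; the fields, the below-half-average threshold, and the sample data are illustrative assumptions.

    import apache_beam as beam

    class DetectAccidents(beam.DoFn):
        """Flags readings far below the average speed at their location."""
        def process(self, reading, avg_speeds):
            # avg_speeds is a dict side input: location -> mean speed.
            if reading["speed"] < 0.5 * avg_speeds[reading["location"]]:
                yield reading

    with beam.Pipeline() as p:
        readings = p | beam.Create([
            {"location": "A", "speed": 60.0},
            {"location": "A", "speed": 10.0},  # well below the local average
        ])
        avg_by_location = (readings
                           | beam.Map(lambda r: (r["location"], r["speed"]))
                           | beam.combiners.Mean.PerKey())
        (readings
         | beam.ParDo(DetectAccidents(),
                      avg_speeds=beam.pvalue.AsDict(avg_by_location))
         | beam.Map(print))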

Building and Operationalizing Processing Infrastructure

Building a streaming pipeline

  • Stream from Pub/Sub into BigQuery
    • BigQuery
      • Provides streaming ingest of unbounded data sets.
      • Supports streaming ingestion at a rate of 100,000 rows per table per second.
    • Pub/Sub
      • Guarantees delivery, but not the order of messages.
      • "At least once" : repeated delivery of the same message is possible.
        → If you have a TIMESTAMP, Dataflow stream processing can remove duplicates based on internal Pub/Sub ID and can work with out-of-order messages when computing aggregates.
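
A sketch of this streaming pattern in the Beam Python SDK. id_label names a Pub/Sub message attribute used to drop redelivered duplicates (honored by the Dataflow runner), and timestamp_attribute supplies the event timestamp; the topic, table, schema, and attribute names are placeholders.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/events",
               id_label="message_id",             # attribute used to remove duplicates
               timestamp_attribute="event_time")  # event time, not arrival time
         | "Parse" >> beam.Map(lambda data: {"payload": data.decode("utf-8")})
         | "Write" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.events",
               schema="payload:STRING",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))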

Data processing solutions

Scaling streaming beyond BigQuery

  • Cloud Bigtable : low latency / high throughput (see the sink sketch at the end of this section)
    • ~100,000 QPS at 6 ms latency on a 10-node cluster
    • More cost-efficient than Cloud Spanner for this throughput, which would need ~150 nodes
  • BigQuery : Easy, inexpensive
    • Latency on the order of seconds
    • Streaming ingest at about 100,000 rows per second
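
When BigQuery streaming inserts become the bottleneck, the same Beam pipeline can swap its sink to Cloud Bigtable. A minimal sketch, assuming placeholder project, instance, and table IDs and an illustrative row-key design.

    import datetime
    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable import row as bigtable_row

    def to_bigtable_row(event):
        # Row-key design drives Bigtable performance; "location#timestamp"
        # is one common pattern, used here purely for illustration.
        direct_row = bigtable_row.DirectRow(
            row_key=f"{event['location']}#{event['ts']}")
        direct_row.set_cell("data", "speed", str(event["speed"]),
                            timestamp=datetime.datetime.utcnow())
        return direct_row

    # Inside a pipeline, replace the BigQuery sink with:
    #   events | beam.Map(to_bigtable_row) | WriteToBigTable(
    #       project_id="my-project", instance_id="my-instance",
    #       table_id="events")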