CourseNote - Preparing for the Google Cloud Professional Data Engineer Exam | Google Cloud Skills Boost
Building and Operationalizing Pipelines
- Continuous data can arrive out of order.
- Simple windowing can separate related events into independent windows, losing relationship information.
- Time-based (event-time) windowing overcomes this limitation by grouping related events into the same window based on when they occurred rather than when they arrived.
Dataflow does batch and streaming
- Apache Beam: an open-source programming model that unifies batch and streaming processing.
- Before Apache Beam, you needed two separate pipelines (one batch, one streaming) to balance latency, throughput, and fault tolerance.
Dataflow: Apache Beam as a service, a fully managed autoscaling service that runs Beam pipelines.
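As a quick illustration, here is a sketch of pointing a Beam pipeline at the Dataflow runner through pipeline options; the project, region, and bucket names are hypothetical placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project/region/bucket values; swap the runner to
# "DirectRunner" to test the same pipeline locally before submitting it.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "dataflow"])
     | "Upper" >> beam.Map(str.upper)
     | "Print" >> beam.Map(print))
```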
Dataflow solves many stream processing issues
- Size: autoscaling and rebalancing handle variable volumes of data and growth.
- Scalability and fault tolerance: on-demand, distributed processing scales while remaining fault-tolerant.
- Programming model: windowing, triggering, incremental processing, and out-of-order/late data are addressed in the streaming model.
- Unbounded data: efficient pipelines (Apache Beam) + efficient execution (Dataflow).
There is no direct substitute for Dataflow's windowing capability when processing streaming data.
Dataflow windowing for streams
- To compute averages over streaming data, we need to bound the computation within time windows.
- Windows are the answer to "Where in event time?"
- Windowing creates individual results for different slices of event time.
- Windowing divides a PCollection up into finite chunks based on the event time of each message.
- Windowing is useful in many contexts, but it is required when aggregating over unbounded data.
- Basic windowing methods: fixed, sliding, and session-based windows (sketched in the code after this list).
- Fixed time such as a daily window
- Sliding and overlapping windows such as the last 24 hours
- Session-based windows that are triggered to capture bursts of activity
- Triggering controls how results are delivered to the next transforms in the pipeline.
- The watermark is a heuristic that tracks how far behind the system is in processing data relative to event time; triggers answer "When in processing time are results materialized?"
- Triggers can emit speculative results early (before the watermark) and updated results late (after the watermark).
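A minimal sketch of these windowing and triggering options in the Beam Python SDK; the toy events, window sizes, and firing intervals are made up for illustration.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

with beam.Pipeline() as p:
    # Toy (key, value, event-time-in-seconds) tuples; attaching timestamps
    # explicitly makes windowing use event time rather than arrival time.
    events = (p
        | beam.Create([("sensor-1", 5.0, 0), ("sensor-1", 7.0, 30),
                       ("sensor-2", 3.0, 95)])
        | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2])))

    # Fixed one-minute windows; swap in window.SlidingWindows(86400, 3600)
    # for a "last 24 hours, updated hourly" view, or window.Sessions(600)
    # to capture bursts of activity separated by 10-minute gaps.
    (events
     | beam.WindowInto(
         window.FixedWindows(60),
         # Speculative (early) results every 30 s of processing time, an
         # on-time result at the watermark, updated (late) results after it.
         trigger=AfterWatermark(early=AfterProcessingTime(30),
                                late=AfterCount(1)),
         accumulation_mode=AccumulationMode.ACCUMULATING,
         allowed_lateness=300)
     | beam.CombinePerKey(beam.combiners.MeanCombineFn())
     | beam.Map(print))
```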
Side inputs in Dataflow
- Pipeline to detect accidents: DetectAccidents uses the average speed at each location as a side input (sketched below).
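A hedged sketch of that side-input pattern in the Beam Python SDK; the readings, the below-average threshold, and the helper names are illustrative, not the course's actual code.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Toy (location, speed) readings; values are made up for illustration.
    current = p | "Current" >> beam.Create([("loc-1", 20.0), ("loc-2", 55.0)])
    history = p | "History" >> beam.Create([("loc-1", 60.0), ("loc-1", 58.0),
                                            ("loc-2", 57.0)])

    # Average speed per location, materialized as a dict side input.
    avg_speeds = history | beam.CombinePerKey(beam.combiners.MeanCombineFn())

    def detect_accidents(reading, averages):
        loc, speed = reading
        # Illustrative rule: flag a location whose current speed has
        # dropped far below its historical average.
        if speed < 0.5 * averages.get(loc, speed):
            yield loc

    (current
     | "DetectAccidents" >> beam.FlatMap(
         detect_accidents, averages=beam.pvalue.AsDict(avg_speeds))
     | beam.Map(print))
```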
Building and Operationalizing Processing Infrastructure
Building a streaming pipeline
- Stream from Pub/Sub into BigQuery (minimal sketch below).
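A minimal sketch of that pattern with the Beam Python SDK; the subscription, table, and schema names are hypothetical placeholders.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.events",
         schema="event_id:STRING,ts:TIMESTAMP,value:FLOAT",
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```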
BigQuery
- Can provide streaming ingest for unbounded data sets.
- Provides streaming ingestion at up to 100,000 rows per table per second.
Pub/Sub
- Guarantees delivery, but not the order of messages.
- "At least once" : repeated delivery of the same message is possible.
→ If you have a TIMESTAMP,Dataflow
stream processing can remove duplicates based on internal Pub/Sub ID and can work with out-of-order messages when computing aggregates.
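A sketch of the knobs involved when reading from Pub/Sub; the attribute names "unique_id" and "event_ts" are hypothetical values a publisher would set on each message.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    # By default the Dataflow runner deduplicates on the internal Pub/Sub
    # message ID. id_label switches deduplication to a publisher-set
    # attribute, and timestamp_attribute makes windowing use the event
    # time carried in that attribute instead of the arrival time.
    msgs = p | "Read" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-sub",
        id_label="unique_id",
        timestamp_attribute="event_ts")
    msgs | "Print" >> beam.Map(print)
```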
Data processing solutions
Scaling streaming beyond BigQuery
Cloud Bigtable: low latency / high throughput (write sketch below)
- ~100,000 QPS at 6 ms latency for a 10-node cluster
- More cost-efficient than Cloud Spanner (~150 nodes)

BigQuery: easy, inexpensive
- Latency on the order of seconds
- Streaming ingest at up to 100,000 rows/second
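When latency requirements push a stream beyond BigQuery, the same Beam pipeline can write to Cloud Bigtable instead; a hedged sketch with hypothetical project, instance, table, and column-family names.

```python
import datetime
import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable.row import DirectRow

def to_bigtable_row(event):
    # Row-key design drives Bigtable performance; "<sensor>#<ts>" is one
    # illustrative scheme, and the "data" column family is hypothetical.
    row = DirectRow(row_key=f"{event['sensor']}#{event['ts']}".encode())
    row.set_cell("data", b"value", str(event["value"]).encode(),
                 timestamp=datetime.datetime.utcnow())
    return row

with beam.Pipeline() as p:
    (p
     | beam.Create([{"sensor": "s1", "ts": 1640995200, "value": 42.0}])
     | beam.Map(to_bigtable_row)
     | WriteToBigTable(project_id="my-project",   # hypothetical names
                       instance_id="my-instance",
                       table_id="my-table"))
```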