Professional Data Engineer Sample Questions: Notes

Source: Professional Data Engineer Sample Questions

Q 1.

You are building storage for files for a data pipeline on Google Cloud. You want to support JSON files. The schema of these files will occasionally change. Your analyst teams will use running aggregate ANSI SQL queries on this data. What should you do?

  • ❌ A. Use BigQuery for storage. Provide format files for data load. Update the format files as needed.
    Maintaining and updating separate format files by hand is inefficient; schema auto-detection handles the occasional changes.
  • B. Use BigQuery for storage. Select "Automatically detect" in the Schema section.
    BigQuery natively stores JSON data whose schema occasionally changes and serves aggregate ANSI SQL queries.
    Enabling "Automatically detect" is the quickest way to absorb those schema changes (see the load-job sketch after this list).
  • ❌ C. Use Cloud Storage for storage. Link data as temporary tables in BigQuery and turn on the "Automatically detect" option in the Schema section of BigQuery.
    External tables over Cloud Storage are slower for the analysts' aggregate queries than native BigQuery storage.
  • ❌ D. Use Cloud Storage for storage. Link data as permanent tables in BigQuery and turn on the "Automatically detect" option in the Schema section of BigQuery.
    Same drawback as C: the data stays in Cloud Storage instead of native BigQuery storage.
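
Answer B in practice, as a hedged sketch: a BigQuery load job for newline-delimited JSON with schema auto-detection, written with the google-cloud-bigquery Python client. The project, dataset, table, and bucket names are placeholders.

```python
# Sketch of answer B: load newline-delimited JSON into a native BigQuery table
# with schema auto-detection. Project, dataset, table, and URI are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.analytics.events"  # hypothetical destination table
uri = "gs://my-bucket/events/*.json"      # hypothetical JSON source files

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # the "Automatically detect" option in the Schema section
    # Let occasional schema changes add new columns on append.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
```

Analysts can then run aggregate ANSI SQL directly against the native table, with no format files to maintain.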

Q 2.

You use a Hadoop cluster both for serving analytics and for processing and transforming data. The data is currently stored on HDFS in Parquet format. The data processing jobs run for 6 hours each night. Analytics users can access the system 24 hours a day.

  • Phase 1 is to quickly migrate the entire Hadoop environment without a major re-architecture.
  • Phase 2 will include migrating to BigQuery for analytics and to Dataflow for data processing.

You want to make the future migration to BigQuery and Dataflow easier by following Google-recommended practices and managed services. What should you do?

  • ❌ A. Lift and shift Hadoop/HDFS to Dataproc.
    Keeping persistent HDFS on Dataproc clusters is not recommended in Google Cloud; the data should move to Cloud Storage.
  • ❌ B. Lift and shift Hadoop/HDFS to Compute Engine.
    Compute Engine is not a managed Hadoop service, so this does not follow the Google-recommended practice of using managed services.
  • ❌ C. Create a single Dataproc cluster to support both analytics and data processing, and point it at a Cloud Storage bucket that contains the Parquet files that were previously stored on HDFS.
    Google recommends job-specific Dataproc clusters rather than one cluster shared by both workloads.
  • D. Create separate Dataproc clusters to support analytics and data processing, and point both at the same Cloud Storage bucket that contains the Parquet files that were previously stored on HDFS.
    Dataproc is the managed service for a quick lift-and-shift of the Hadoop environment.
    Separate clusters keep each workload job specific.
    The Parquet files kept in Cloud Storage can later be loaded into BigQuery with little effort, easing Phase 2 (see the sketch after this list).
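
The last point is worth a sketch: once the Parquet files live in Cloud Storage, the Phase 2 move into BigQuery is a single managed load job. A hedged illustration with the google-cloud-bigquery Python client; the bucket path and table name are placeholders.

```python
# Sketch of the Phase 2 step that answer D keeps easy: Parquet in Cloud Storage
# loads straight into BigQuery. Parquet is self-describing, so no schema is
# supplied. The bucket path and table name are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/warehouse/*.parquet",     # hypothetical Parquet files
    "my-project.analytics.events_from_hdfs",  # hypothetical destination table
    job_config=job_config,
)
load_job.result()
```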

Q 3.

You are building a new real-time data warehouse for your company and will use BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

  • ❌ A. Include ORDER BY DESC on timestamp column and LIMIT to 1.
    This returns only a single row for the whole result set, not one row per unique ID.
  • ❌ B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
    SUM would add the duplicated values together instead of discarding the duplicates.
  • ❌ C. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
    Rows whose LAG is NULL are the first (or only) row per ID, so this filter would also drop events that are not duplicated.
  • D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
    ROW_NUMBER = 1 picks out exactly one row for each unique ID, dropping the duplicates (see the query sketch after this list).
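
A sketch of the deduplicating query behind answer D, submitted through the BigQuery Python client. The table and column names (unique_id, event_timestamp) are assumptions, not part of the original question.

```python
# Sketch of answer D: ROW_NUMBER() partitioned by the unique ID keeps exactly
# one row per ID at query time, so streaming duplicates never reach the result.
# Table and column names (unique_id, event_timestamp) are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

dedup_sql = """
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY unique_id
      ORDER BY event_timestamp DESC  -- keep the most recent copy of each row
    ) AS row_num
  FROM `my-project.realtime.events`
)
WHERE row_num = 1
"""

for row in client.query(dedup_sql).result():
    print(dict(row))
```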

Q 4.

You are designing a streaming pipeline for ingesting player interaction data for a mobile game. You want the pipeline to handle out-of-order data delayed up to 15 minutes on a per-player basis and exponential growth in global users. What should you do?

  • A. Design a Dataflow streaming pipeline with session windowing and a minimum gap duration of 15 minutes. Use "individual player" as the key. Use Pub/Sub as a message bus for ingestion.
    Session windowing assigns a separate window to each data key, so each player's out-of-order data is handled within its own 15-minute gap (see the pipeline sketch after the windowing notes below).
    Pub/Sub scales automatically to handle exponentially growing traffic from around the globe.
  • ❌ B. Design a Dataflow streaming pipeline with session windowing and a minimum gap duration of 15 minutes. Use "individual player" as the key. Use Apache Kafka as a message bus for ingestion.
    Apache Kafka is not a Google-managed service; you would have to provision and scale it yourself.
  • ❌ C. Design a Dataflow streaming pipeline with a single global window of 15 minutes. Use Pub/Sub as a message bus for ingestion.
    A single global window is not keyed per player, so it cannot handle out-of-order data on a per-player basis.
  • ❌ D. Design a Dataflow streaming pipeline with a single global window of 15 minutes. Use Apache Kafka as a message bus for ingestion.
    Combines the drawbacks of B and C.

Dataflow Streaming Pipelines: Windowing Functions

Windowing functions divide unbounded collections into logical components, or "windows". Windowing functions group unbounded collections by the timestamps of the individual elements. Each window contains a finite number of elements.

Session windows

A session window contains elements within a gap duration of another element. The gap duration is an interval between new data in a data stream. If data arrives after the gap duration, the data is assigned to a new window.

  • For example, session windows can divide a data stream representing user mouse activity. This data stream might have long periods of idle time interspersed with many clicks. A session window can contain the data generated by the clicks.
  • Session windowing assigns different windows to each data key. Tumbling and hopping windows contain all elements in the specified time interval, regardless of data keys.
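
Putting the session-window points together with Q4's answer A, here is a hedged Apache Beam (Python SDK) sketch: events are read from Pub/Sub, keyed by player, and grouped into per-player session windows with a 15-minute gap. The topic name and the payload fields are assumptions.

```python
# Sketch of Q4 answer A with the Apache Beam Python SDK: per-player session
# windows with a 15-minute gap over a Pub/Sub stream. The topic name and the
# player_id field in the JSON payload are assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/player-events")  # hypothetical topic
        | "ParseJson" >> beam.Map(json.loads)
        | "KeyByPlayer" >> beam.Map(lambda event: (event["player_id"], event))
        # Each player (key) gets its own session windows; a new window starts
        # only after 15 minutes with no data for that player.
        | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=15 * 60))
        | "CountPerSession" >> beam.combiners.Count.PerKey()
        | "LogCounts" >> beam.Map(print)  # stand-in for a real sink
    )
```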

Q 5.

Your company is loading CSV files into BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?

  • ❌ A. The CSV data loaded in BigQuery is not flagged as CSV.
    The data was fully imported successfully, which would not be the case if the format were wrong.
  • ❌ B. The CSV data had invalid rows that were skipped on import.
    If invalid rows had been skipped, the data would not have been fully imported.
  • C. The CSV data loaded in BigQuery is not using BigQuery’s default encoding.
    BigQuery's default encoding for CSV is UTF-8; a file in a different encoding loads successfully but no longer matches the source byte-to-byte (see the load sketch after the ingestion-format notes below).
  • ❌ D. The CSV data has not gone through an ETL phase before loading into BigQuery.
    An ETL phase is not required for a faithful load; the mismatch comes from encoding.

Data ingestion format

Choose a data ingestion format based on the following factors:

  • Schema support. Avro, ORC, Parquet, and Firestore exports are self-describing formats. BigQuery creates the table schema automatically based on the source data. For JSON and CSV data, you can provide an explicit schema, or you can use schema auto-detection.
  • Flat data or nested and repeated fields. Avro, CSV, JSON, ORC, and Parquet all support flat data. Avro, JSON, ORC, Parquet, and Firestore exports also support data with nested and repeated fields. Nested and repeated data is useful for expressing hierarchical data. Nested and repeated fields also reduce duplication when denormalizing the data.
  • Embedded newlines. When you are loading data from JSON files, the rows must be newline delimited. BigQuery expects newline-delimited JSON files to contain a single record per line.
  • Encoding. BigQuery supports UTF-8 encoding for both nested or repeated and flat data. BigQuery supports ISO-8859-1 encoding for flat data only for CSV files.
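
Tying the encoding point back to Q5: if the source CSV is ISO-8859-1, declaring the encoding on the load job lets BigQuery transcode it correctly instead of treating the bytes as UTF-8. A minimal sketch with the google-cloud-bigquery Python client; the bucket path and table name are placeholders.

```python
# Sketch for Q5 answer C: declare the CSV file's real encoding so BigQuery does
# not read ISO-8859-1 bytes as UTF-8. Bucket path and table are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",   # match the source file's actual encoding
    skip_leading_rows=1,     # assume a header row
    autodetect=True,         # infer the schema from the header and data
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/customers.csv",  # hypothetical source file
    "my-project.sales.customers",            # hypothetical destination table
    job_config=job_config,
)
load_job.result()
```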