[PDE CERTIFICATE - EXAMTOPIC] DUMPS Q1-Q5

Google Professional Data Engineer Certificate EXAMTOPIC DUMPS Q1-Q5

Q 1.

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

  • ❌ A. Threading
    Nothing to do with overfitting.
  • ❌ B. Serialization
    Used when saving the model.
  • ✅ C. Dropout Methods
    Randomly dropping out units (both hidden and visible) in the neural network during training prevents units from co-adapting and reduces overfitting (see the sketch after the overfitting notes below).
  • ❌ D. Dimensionality Reduction
    Applicable to other branches of machine learning that may not involve neural networks.
Threading

Multithreading the training process to make training faster.

Serialization

The process of converting in-memory models/data into a format that can be stored.

Dimensionality reduction

The process of reducing the number of features. It can be used to avoid overfitting, but it is generally applied when training on a very large number of features, and it is more common in branches of machine learning that do not involve neural networks.

Overfitting

  • Sign of overfitting: the model fits the training data well but performs poorly on test data.
Solutions for overfitting
  • Early Stopping
  • Regularization
    • L1 and L2 Regularization
    • Dropout
    • Max-Norm Regularization
  • Data Augmentation
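
A minimal Keras sketch of dropout plus early stopping, assuming an illustrative binary-classification model (the layer sizes, the 0.5 rate, and the 100-feature input are placeholders, not part of the question):

```python
import tensorflow as tf

# Illustrative network; sizes are placeholders.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of units during training only
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # dropout is automatically disabled at inference time
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping (another remedy listed above) halts training when validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=50, callbacks=[early_stop])
```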

Q 2.

🙆🏻‍♀️ I've seen this same question on the actual PMLE exam!

You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

  • ❌ A. Continuously retrain the model on just the new data.
  • ✅ B. Continuously retrain the model on a combination of existing data and the new data.
    User preferences keep changing, so the new data should be used; the existing data is also needed to capture past behavior (see the sketch below).
  • ❌ C. Train on the existing data while using the new data as your test set.
  • ❌ D. Train on the new data while using the existing data as your test set.
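
A hedged sketch of option B's approach, assuming the historical and newly streamed examples are available as tf.data datasets (existing_ds, new_ds, and the shuffle/batch sizes are placeholders):

```python
import tensorflow as tf

def retrain(model: tf.keras.Model,
            existing_ds: tf.data.Dataset,
            new_ds: tf.data.Dataset) -> tf.keras.Model:
    """Retrain on a combination of historical data and newly streamed data."""
    # Mixing old and new examples keeps past behavior while adapting to
    # drifting user preferences.
    combined = existing_ds.concatenate(new_ds).shuffle(10_000).batch(256)
    model.fit(combined, epochs=1)
    return model
```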

Q 3.

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

  • ❌ A. Add capacity (memory and disk space) to the database server by the order of 200.
    Adding compute resources is not a recommended way to resolve a database schema design problem.
  • ❌ B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
    This just creates many tables.
    You would still need a self-join within each individual table to get a patient's data and their visits.
  • ✅ C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
    Splitting patients and visits into separate tables lets reports join two narrow tables instead of self-joining one wide table (see the schema sketch below).
  • ❌ D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
    You would still need a self-join within each clinic's table to get a patient's data and their visits.
Self-joins increase query overhead.
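
A rough sketch of option C's normalized layout, expressed as BigQuery DDL submitted through the Python client (the clinic dataset, table names, and columns are assumptions for illustration):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Split the single master table into a patients table and a visits table,
# so reports join on patient_id instead of self-joining one wide table.
client.query("""
CREATE TABLE IF NOT EXISTS clinic.patients (
  patient_id STRING NOT NULL,
  name       STRING,
  clinic_id  STRING
);
CREATE TABLE IF NOT EXISTS clinic.visits (
  visit_id   STRING NOT NULL,
  patient_id STRING NOT NULL,  -- references clinic.patients.patient_id
  visit_date DATE,
  notes      STRING
);
""").result()
```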

Avoiding SQL anti-patterns

Avoiding SQL anti-patterns | BigQuery | Google Cloud

Best practice: Avoid self-joins. Use a window (analytic) function instead.
  • Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially squares the number of output rows. This increase in output data can cause poor performance.
  • Instead of using a self-join, use a window (analytic) function to reduce the number of additional bytes that are generated by the query.
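
A hedged illustration of this advice using the BigQuery Python client, against the assumed clinic.visits schema from the sketch above: LAG() over a window partitioned by patient replaces a self-join that would otherwise pair each visit with the previous one.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Window (analytic) function instead of a self-join: for each visit,
# fetch the same patient's previous visit date.
query = """
SELECT
  patient_id,
  visit_date,
  LAG(visit_date) OVER (
    PARTITION BY patient_id
    ORDER BY visit_date
  ) AS previous_visit_date
FROM clinic.visits
"""
for row in client.query(query).result():
    print(row.patient_id, row.visit_date, row.previous_visit_date)
```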

Q 5.

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

  • ❌ A. Use federated data sources, and check data in the SQL query.
  • ❌ B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
  • ❌ C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
  • ✅ D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
    Use Dataflow to transform the data (see the Beam sketch below).
    outputDeadletterTable: a dead-letter BigQuery table is automatically created to catch messages that fail for various reasons, including message schemas that do not match the BigQuery table schema, malformed JSON, and messages that throw errors while being transformed by the JavaScript function.

New Updates on Pub/Sub to BigQuery Dataflow Templates from GCP
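
A minimal Apache Beam (Python SDK) sketch of option D's dead-letter pattern, assuming hypothetical bucket, table, and column names and a 3-column CSV; the Google-provided Dataflow templates expose the same idea via outputDeadletterTable:

```python
import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

EXPECTED_COLUMNS = 3  # assumed width of the customer's CSV rows

class ParseCsv(beam.DoFn):
    """Emit good rows on the main output and bad rows on a 'dead_letter' output."""
    def process(self, line):
        try:
            fields = next(csv.reader([line]))
            if len(fields) != EXPECTED_COLUMNS:
                raise ValueError("unexpected column count")
            yield {"id": fields[0], "name": fields[1], "amount": float(fields[2])}
        except Exception as err:
            yield beam.pvalue.TaggedOutput(
                "dead_letter", {"raw_line": line, "error": str(err)})

with beam.Pipeline(options=PipelineOptions()) as p:
    results = (
        p
        | "ReadCsv" >> beam.io.ReadFromText("gs://example-bucket/daily/*.csv")
        | "Parse" >> beam.ParDo(ParseCsv()).with_outputs("dead_letter", main="rows"))

    # Clean rows go to the main table.
    results.rows | "WriteRows" >> beam.io.WriteToBigQuery(
        "my-project:analytics.daily_dump",
        schema="id:STRING,name:STRING,amount:FLOAT",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

    # Malformed rows land in a dead-letter table for later analysis.
    results.dead_letter | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
        "my-project:analytics.daily_dump_errors",
        schema="raw_line:STRING,error:STRING",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
```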