Google Professional Data Engineer Certificate EXAMTOPIC DUMPS Q1-Q5
Q 1.
Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?
- ❌ A. Threading
→ Nothing to do with overfitting.
- ❌ B. Serialization
→ Used when saving the model.
- ⭕ C. Dropout Methods
→ Randomly drops units (both hidden and visible) from the neural network during training to reduce model complexity and prevent overfitting.
- ❌ D. Dimensionality Reduction
→ Applies to other branches of machine learning that may not involve neural networks.
Threading
Multithreading the training process to make training faster.
Serialization
Process of converting in-memory models or data into a structure that can be stored.
Dimensionality reduction
Process of reducing the number of features. It can be used to avoid overfitting, but it is generally applied when training on a very large number of features, and it belongs to other branches of machine learning that may not involve neural networks.
Overfitting
- Sign of overfitting: the model fits the training data well but performs poorly on test data.
Solutions for overfitting
- Early Stopping
- Regularization
- L1 and L2 Regularization
- Dropout (see the sketch after this list)
- Max-Norm Regularization
- Data Augmentation
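Below is a minimal TensorFlow/Keras sketch of the dropout method from option C. The layer widths, input shape, and loss are placeholder assumptions for illustration, not part of the question.

```python
import tensorflow as tf

# Hypothetical layer widths and a 784-feature input; the Dropout layers
# are the point of this sketch.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # randomly zero 50% of units each training step
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dropout is active only during training (model.fit); Keras disables it
# automatically at inference time (model.predict / model.evaluate).
```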
Q 2.
🙆🏻♀️ I've seen this same question on the actual PMLE exam!
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
- ❌ A. Continuously retrain the model on just the new data.
- ⭕ B. Continuously retrain the model on a combination of existing data and the new data.
→ User preferences keep changing, so the new data should be used; the existing data is also necessary to capture past behavior (see the sketch below).
- ❌ C. Train on the existing data while using the new data as your test set.
- ❌ D. Train on the new data while using the existing data as your test set.
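A minimal sketch of option B, assuming `model` is a compiled Keras model and `existing_ds` / `new_ds` are `tf.data.Dataset` objects of (features, label) pairs; the names, batch size, and shuffle buffer are illustrative assumptions only.

```python
import tensorflow as tf

def retrain(model, existing_ds, new_ds, epochs=1):
    """Retrain on a combination of historical data and newly streamed data."""
    combined = (
        existing_ds.concatenate(new_ds)   # keep past behavior + new preferences
        .shuffle(buffer_size=10_000)      # mix old and new examples
        .batch(256)
        .prefetch(tf.data.AUTOTUNE)
    )
    model.fit(combined, epochs=epochs)
    return model
```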
Q 3.
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?
- ❌ A. Add capacity (memory and disk space) to the database server by the order of 200.
→ Adding compute resources is not a recommended way to resolve a database schema problem.
- ❌ B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
→ This just creates many tables.
→ Each smaller table would still need a self-join to get a patient together with their visits.
- ⭕ C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
- ❌ D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
→ Each smaller table would still need a self-join to get a patient together with their visits.
A self-join increases the query overhead.
Avoiding SQL anti-patterns | BigQuery | Google Cloud
Best practice: Avoid self-joins. Use a window (analytic) function instead.
- Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially squares the number of output rows. This increase in output data can cause poor performance.
- Instead of using a self-join, use a window (analytic) function to reduce the number of additional bytes that are generated by the query (see the sketch below).
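A hedged sketch of the window-function approach using the google-cloud-bigquery Python client. The project, dataset, and column names (`my-project.clinic.visits`, `patient_id`, `visit_date`) are hypothetical, standing in for the normalized visits table from option C.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Instead of self-joining the visits table to find each visit's previous
# visit, use LAG() over a window partitioned by patient.
query = """
SELECT
  patient_id,
  visit_date,
  LAG(visit_date) OVER (
    PARTITION BY patient_id
    ORDER BY visit_date
  ) AS previous_visit_date
FROM `my-project.clinic.visits`  -- hypothetical normalized visits table
"""

for row in client.query(query).result():
    print(row.patient_id, row.visit_date, row.previous_visit_date)
```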
Q 5.
An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?
- ❌ A. Use federated data sources, and check data in the SQL query.
- ❌ B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
- ❌ C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
- ⭕ D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
→ Dataflow transforms the data.
→ outputDeadletterTable: a dead-letter BigQuery table is automatically created to catch messages that fail for various reasons, including message schemas that do not match the BigQuery table schema, malformed JSON, and messages that throw errors while being transformed by the JavaScript function (a minimal sketch follows the reference below).
New Updates on Pub/Sub to BigQuery Dataflow Templates from GCP
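Below is a minimal Apache Beam (Python SDK) sketch of option D: a batch pipeline that reads the CSV dump from GCS, writes valid rows to BigQuery, and routes malformed rows to a dead-letter table. The bucket path, table names, and column list are assumptions, not part of the question.

```python
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical schema for the customer's daily dump.
EXPECTED_COLUMNS = ["id", "name", "amount", "created_at"]


class ParseCsvRow(beam.DoFn):
    """Emit valid rows on the main output, failures on the 'dead_letter' output."""

    def process(self, line):
        try:
            fields = next(csv.reader([line]))
            if len(fields) != len(EXPECTED_COLUMNS):
                raise ValueError("unexpected column count: %d" % len(fields))
            yield dict(zip(EXPECTED_COLUMNS, fields))
        except Exception as err:  # corrupted or incorrectly formatted row
            yield beam.pvalue.TaggedOutput(
                "dead_letter", {"raw_line": line, "error": str(err)})


def run():
    # Pass --runner=DataflowRunner, --project, --region, --temp_location, etc.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        results = (
            p
            | "ReadCsv" >> beam.io.ReadFromText("gs://my-bucket/daily-dump/*.csv")
            | "ParseRows" >> beam.ParDo(ParseCsvRow()).with_outputs(
                "dead_letter", main="valid"))

        results.valid | "WriteValid" >> beam.io.WriteToBigQuery(
            "my-project:analytics.customer_dump",          # hypothetical table
            schema="id:STRING,name:STRING,amount:STRING,created_at:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

        results.dead_letter | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
            "my-project:analytics.customer_dump_errors",   # hypothetical table
            schema="raw_line:STRING,error:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)


if __name__ == "__main__":
    run()
```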