[PDE CERTIFICATE - EXAMTOPIC] DUMPS Q26-Q30

Google Professional Data Engineer Certificate EXAMTOPIC DUMPS Q26-Q30

Q 26.

Not Sure😥

You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?

  • ❌ A. Grant the consultant the Viewer role on the project.
    The external consultant is going to transform the data by writing code, so read-only access is not enough for the work and still exposes the private data.
  • ❌ B. Grant the consultant the Cloud Dataflow Developer role on the project.
    The Dataflow Developer role lets the consultant interact with Cloud Dataflow jobs without granting permission to view the data, but the consultant would still be developing against the real private dataset.
  • ❌ C. Create a service account and allow the consultant to log on with it.
    Service accounts are for non-human identities such as applications; a human consultant should not log in with one.
  • ⭕ D. Create an anonymized sample of the data for the consultant to work with in a different project.
    De-identifying the data before sharing keeps the real user data out of the consultant's reach (see the sketch below).
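A minimal sketch of what the anonymized sample could look like in code, assuming direct identifiers are replaced with salted hashes before the sample is copied into the consultant's project; the class name, salt, and example user ID are invented for illustration.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical helper: replace a direct identifier with a salted SHA-256 pseudonym
// before exporting a sample of the data into the consultant's separate project.
public class Anonymizer {

  // The salt is a project-local secret; it never leaves the original project.
  private static final String SALT = "project-local-secret";

  static String pseudonymize(String userId) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    byte[] hash = digest.digest((SALT + userId).getBytes(StandardCharsets.UTF_8));
    return HexFormat.of().formatHex(hash); // deterministic, non-reversible without the salt
  }

  public static void main(String[] args) throws Exception {
    System.out.println(pseudonymize("user-12345")); // same input always maps to the same pseudonym
  }
}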

Dataflow - IAM

Dataflow roles can currently be set on organizations and projects only.

  • dataflow.developer role : enables the developer to interact with Cloud Dataflow jobs; it does not grant permission to view the data (data privacy is preserved).
  • dataflow.worker role : provides the permissions necessary for a Compute Engine service account to execute work units for a Dataflow pipeline

IAM - Service Account


  • A service account is a special kind of account used by an application or compute workload, such as a Compute Engine virtual machine (VM) instance, rather than a person.
  • Applications use service accounts to make authorized API calls, authorized as either the service account itself, or as Google Workspace or Cloud Identity users through domain-wide delegation.
  • A service account is identified by its email address, which is unique to the account.
  • For example, a service account can be attached to a Compute Engine VM, so that applications running on that VM can authenticate as the service account. In addition, the service account can be granted IAM roles that let it access resources. The service account is used as the identity of the application, and the service account's roles control which resources the application can access.
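As a rough illustration of the last point, here is a sketch of an application authenticating as a service account with the Google Auth Library for Java; the key file path and scope are placeholders.

import com.google.auth.oauth2.GoogleCredentials;
import java.io.FileInputStream;
import java.util.List;

public class ServiceAccountAuth {
  public static void main(String[] args) throws Exception {
    // Load the service account key (placeholder path) and scope it for Google Cloud APIs.
    GoogleCredentials credentials = GoogleCredentials
        .fromStream(new FileInputStream("/path/to/service-account-key.json"))
        .createScoped(List.of("https://www.googleapis.com/auth/cloud-platform"));
    credentials.refreshIfExpired();
    // API calls made with this token run as the service account's identity;
    // the IAM roles granted to that account decide which resources are reachable.
    System.out.println("token acquired: " + (credentials.getAccessToken() != null));
  }
}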

Q 27.

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

  • ❌ A. Eliminate features that are highly correlated to the output labels.
    This would hurt model accuracy; a feature that is highly correlated with the output label is exactly the kind of feature that contributes the most to the model.
  • B. Combine highly co-dependent features into one representative feature.
    Feature construction, e.g., a FEATURE CROSS.
  • ❌ C. Instead of feeding in each feature individually, average their values in batches of 3.
    Averaging arbitrary groups of features discards information and may hurt model accuracy.
  • ❌ D. Remove the features that have null values for more than 50% of the training records.
    This may hurt model accuracy; null values can carry different meanings and each needs an appropriate handling strategy, otherwise the model becomes inaccurate.

Feature Engineering - Feature Construction

Data preprocessing for machine learning: options and recommendations | Cloud Architecture Center | Google Cloud

  • *Feature construction*. Creating new features either by using typical techniques, such as polynomial expansion (by using univariate mathematical functions) or feature crossing (to capture feature interactions). Features can also be constructed by using business logic from the domain of the ML use case.
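A tiny illustration of the feature cross mentioned above: two highly co-dependent categorical inputs are combined into one representative feature. The feature names and values are invented; a real pipeline would do this inside its preprocessing step.

import java.util.Map;

public class FeatureCross {
  // Combine two highly co-dependent categorical features into a single crossed feature.
  static String cross(String cloudCover, String humidityBucket) {
    return cloudCover + "_x_" + humidityBucket; // e.g. "overcast_x_high"
  }

  public static void main(String[] args) {
    Map<String, String> example = Map.of("cloud_cover", "overcast", "humidity_bucket", "high");
    // The crossed feature replaces the two originals, shrinking the input width
    // while keeping the interaction the model cares about.
    System.out.println(cross(example.get("cloud_cover"), example.get("humidity_bucket")));
  }
}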

Q 28.

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour. The data scientists have written the following code to read the data for new key features in the logs.

BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")

You want to improve the performance of this data read. What should you do?

  • ❌ A. Specify the TableReference object in the code.
    A TableReference only identifies the project, dataset, and table; it does not reduce how much data is read.
  • B. Use .fromQuery operation to read specific fields from the table.
    Optimizes the read by fetching only the required data.
  • ❌ C. Use of both the Google BigQuery TableSchema and TableFieldSchema classes.
    These classes only describe the table's structure; they do not limit what is read.
  • ❌ D. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
    This still needs to READ all the COLUMNS and only then apply the TRANSFORM.

Best Practice in BigQuery

BigQueryIO.read.

  • BigQueryIO.read.from()

    • Directly reads the whole table from BigQuery.
    • This function exports the whole table to temporary files in Google Cloud Storage, where it will later be read from.
    • This requires almost no computation, as it only performs an export job, and later Dataflow reads from Google Cloud Storage (not from BigQuery).
  • BigQueryIO.read.fromQuery()

    • Executes a query and then reads the results; READING ONLY REQUIRED DATA
      • This can greatly reduce the amount of data read; the best practice in BigQuery is to query only the columns that are required.
    • This function is more time-consuming, since a query must be executed first (which incurs the corresponding economic and computational cost).
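A minimal sketch of option B using the current Apache Beam Java SDK; the selected columns (timestamp, new_feature) are hypothetical, since the question does not say which fields the new key features require.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

public class ReadLogDataWithQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    // Read only the required columns instead of exporting the whole table.
    p.apply("ReadLogData",
        BigQueryIO.readTableRows()
            .fromQuery("SELECT timestamp, new_feature "
                + "FROM `clouddataflow-readonly.samples.log_data`")
            .usingStandardSql());
    p.run().waitUntilFinish();
  }
}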

Q 29.

Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?

  • ❌ A. Use a row key of the form <timestamp>.
    Starts with a timestamp; AVOID.
  • ❌ B. Use a row key of the form <sensorid>.
    No timestamp at all; include a timestamp as part of your row key if you often need to retrieve data based on the time when it was recorded.
  • ❌ C. Use a row key of the form <timestamp>#<sensorid>.
    Starts with a timestamp; AVOID.
  • D. Use a row key of the form <sensorid>#<timestamp>.
    Does not start with a timestamp but still includes it.
In Bigtable, each table has only one (unique) index: the row key.

Best Practice of Bigtable Row keys

Design your row key based on the queries you will use to retrieve the data

  • Keep your row keys short
  • Store multiple delimited values in each row key, so queries can use:
    • Prefixes
    • Ranges of rows defined by starting and ending row keys
      • If the row key contains integers, pad them with leading zeros so they sort correctly.
  • Use human-readable string values in your row keys
  • Design row keys that start with a common value and end with a granular value.
## Ex1 - row key that consists of device type, device ID, and the day the data is recorded
phone#4c410523#20200501
phone#4c410523#20200502
tablet#a0b81f74#20200501
tablet#a0b81f74#20200502

## Ex2 - granular value
asia#india#bangalore
asia#india#mumbai
asia#japan#okinawa
asia#japan#sapporo
southamerica#bolivia#cochabamba
southamerica#bolivia#lapaz
southamerica#chile#santiago
southamerica#chile#temuco

Row keys to "avoid"

  • Row keys that start with a timestamp.
    • Include a timestamp as part of your row key if you often need to retrieve data based on the time when it was recorded. Don't use a timestamp by itself or at the beginning of a row key, because this will cause sequential writes to be pushed onto a single node, creating a hotspot.
  • Row keys that cause related data to not be grouped together.
  • Sequential numeric IDs.
  • Frequently updated identifiers.
  • Hashed values.
  • Values expressed as raw bytes rather than human-readable strings.
  • Unless your use case demands it, avoid using personally identifiable information (PII) or user data in row keys or column family IDs.
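A small sketch of the <sensorid>#<timestamp> key shape chosen in Q29; the sensor IDs and timestamps are invented, the zero-padded timestamp keeps lexicographic key order aligned with time order, and a TreeMap stands in for Bigtable's sorted key space.

import java.util.TreeMap;

public class SensorRowKeys {
  // Build a row key of the form <sensorid>#<timestamp>, with the timestamp zero-padded.
  static String rowKey(String sensorId, long epochMillis) {
    return String.format("%s#%013d", sensorId, epochMillis);
  }

  public static void main(String[] args) {
    TreeMap<String, String> rows = new TreeMap<>(); // Bigtable also keeps rows sorted by key
    rows.put(rowKey("sensor-042", 1700000000000L), "temp=21.5");
    rows.put(rowKey("sensor-042", 1700000060000L), "temp=21.7");
    rows.put(rowKey("sensor-107", 1700000000000L), "temp=19.2");
    // All readings for one sensor are contiguous, so a dashboard can scan the prefix
    // "sensor-042#" plus a timestamp range, while writes from different sensors spread
    // across the key space instead of hammering a single node.
    rows.forEach((key, value) -> System.out.println(key + " -> " + value));
  }
}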

Q 30.

Your company's customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations. The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?

  • ❌ A. Add a node to the MySQL cluster and build an OLAP cube there.
    This still puts the analytical load on the same operational cluster.
  • B. Use an ETL tool to load the data from MySQL into Google BigQuery.
    BigQuery is built for ANALYTICS, and loading a copy of the data keeps the analytical workload off the operational MySQL cluster (see the sketch after this list).
  • ❌ C. Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
    ETL run directly against the live MySQL cluster still adds load to it.
  • ❌ D. Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.
    Dataproc is recommended mainly as the easiest way to migrate existing Hadoop workloads, not as an analytics warehouse, and mysqldump backups would have to be imported into Cloud SQL rather than "mounted".
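One hypothetical shape for option B's "ETL tool": a Dataflow (Apache Beam) job that reads from a MySQL read replica over JDBC and loads the rows into BigQuery. The host, table, and column names are invented, and a managed ETL service or Google's JDBC-to-BigQuery Dataflow template would work just as well.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

public class MySqlToBigQuery {
  public static void main(String[] args) {
    // Hypothetical destination schema for the analytics copy of the orders table.
    TableSchema schema = new TableSchema().setFields(List.of(
        new TableFieldSchema().setName("order_id").setType("INTEGER"),
        new TableFieldSchema().setName("customer_id").setType("INTEGER"),
        new TableFieldSchema().setName("total").setType("FLOAT")));

    Pipeline p = Pipeline.create();
    p.apply("ReadOrdersFromMySQL", JdbcIO.<TableRow>read()
            // Read from a replica (placeholder host) to keep load off the primary.
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver", "jdbc:mysql://replica-host:3306/sales"))
            .withQuery("SELECT order_id, customer_id, total FROM orders")
            .withRowMapper(rs -> new TableRow()
                .set("order_id", rs.getLong("order_id"))
                .set("customer_id", rs.getLong("customer_id"))
                .set("total", rs.getDouble("total")))
            .withCoder(TableRowJsonCoder.of()))
        .apply("LoadIntoBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:analytics.orders") // placeholder BigQuery table
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
    p.run().waitUntilFinish();
  }
}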