[PDE CERTIFICATE - EXAMTOPIC] DUMPS Q51-Q55


Q 51.

You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)

  • A. Get more training examples
    Adding more training data gives the algorithm more examples to generalize from and helps with the variance problem.
  • ❌ B. Reduce the number of training examples
  • C. Use a smaller set of features
    Reducing the feature set will ameliorate the overfitting and help with the variance problem.
  • ❌ D. Use a larger set of features
  • E. Increase the regularization parameters
    Increasing the regularization parameter will reduce overfitting and help with the variance problem.
  • ❌ F. Decrease the regularization parameters

Overfitting


Overfitting : performs well on the training set but high error on the test set. → HIGH VARIANCE

⭕/❌ Answer Explanation
The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set:

  • ❌ Try evaluating the hypothesis on a cross validation set rather than the test set. A cross validation set is useful for choosing hyperparameters such as the regularization parameter λ, but the train/test split is sufficient for debugging problems with the algorithm itself.
  • ❌ Try decreasing the regularization parameter λ. Decreasing the regularization parameter will increase the overfitting, not decrease it.
  • ⭕ Try using a smaller set of features. Reducing the feature set will ameliorate the overfitting and help with the variance problem.
  • ⭕ Try increasing the regularization parameter λ. Increasing the regularization parameter will reduce overfitting and help with the variance problem.
  • ⭕ Get more training examples. Adding more training data gives the algorithm more examples to generalize from and helps with the variance problem.
BigQuery ML - Overfitting

BigQuery ML supports two methods for preventing overfitting: early stopping and regularization.

  • Adjusting Regularization Parameter
    If the number of features is large compared to the size of the training set, try large values for the regularization parameters. The risk of overfitting is greater when there are only a few observations per feature.
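As a sketch of how these two options fit together (the project, dataset, model, and column names below are hypothetical), a BigQuery ML model can enable both regularization and early stopping via `CREATE MODEL` options:

```sql
-- Hypothetical spam classifier in BigQuery ML.
-- l1_reg / l2_reg set the regularization penalties; early_stop halts
-- training when the relative loss improvement drops below min_rel_progress.
CREATE OR REPLACE MODEL `myproject.mydataset.spam_classifier`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['is_spam'],
  l1_reg = 0.1,   -- try larger values when features outnumber observations
  l2_reg = 1.0,
  early_stop = TRUE,
  min_rel_progress = 0.01
) AS
SELECT * FROM `myproject.mydataset.spam_training_data`;
```

Larger `l1_reg`/`l2_reg` values shrink the learned weights, which is exactly the "increase the regularization parameters" fix from Q51.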

Underfitting

A hypothesis that performs poorly on the training set is suffering from high bias (underfitting).
Underfitting : performs poorly on the training set. → HIGH BIAS

⭕/❌ Answer Explanation
The poor performance on both the training and test sets suggests a high bias problem:

  • ⭕ Try adding polynomial features. More complex features increase the complexity of the hypothesis, improving the fit to both the train and test data.
  • ❌ Try increasing the regularization parameter λ. Increasing the regularization parameter forces the hypothesis to fit the data less closely, decreasing performance on both the training and test sets.
  • ❌ Try using a smaller set of features. Fewer features decrease the complexity of the hypothesis and make the bias problem worse.
  • ⭕ Try to obtain and use additional features. Additional features increase the complexity of the hypothesis, improving the fit to both the train and test data.
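In BigQuery ML, richer features such as polynomial terms can be added inside a `TRANSFORM` clause, so the expansion is applied consistently at training and prediction time. This is a hedged sketch: the model, dataset, and column names are hypothetical, and `ML.POLYNOMIAL_EXPAND` is assumed available as a BigQuery ML preprocessing function.

```sql
-- Hypothetical example of richer features to fight high bias:
-- expand feature_a and feature_b into degree-2 polynomial combinations.
CREATE OR REPLACE MODEL `myproject.mydataset.richer_model`
TRANSFORM (
  ML.POLYNOMIAL_EXPAND(STRUCT(feature_a, feature_b), 2) AS poly_features,
  label
)
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['label']
) AS
SELECT feature_a, feature_b, label
FROM `myproject.mydataset.training_data`;
```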

Q 53.

You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1:
What is the most likely cause of the delay for this query?

  • ❌ A. Users are running too many concurrent queries in the system
  • ❌ B. The [myproject:mydataset.mytable] table has too many partitions
  • ❌ C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
  • D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew

Avoiding SQL anti-patterns

Avoiding SQL anti-patterns | BigQuery | Google Cloud

  • Self-joins
    Use a window (analytic) function instead.
  • Data skew / Partition skew
    If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
    • Partitions become large when your partition key has a value that occurs more often than any other value. For example, grouping by a user_id field where there are many entries for guest or NULL.
    • Unbalanced joins
  • Cross joins (Cartesian product)
  • DML statements that update or insert single rows
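For the skew scenario in Q53, a common two-step approach (sketched here in standard SQL; the dominant key value is hypothetical) is to first confirm the skew, then filter the dominant value early so one worker does not process most of the rows:

```sql
-- Step 1: check whether one country value dominates the table.
SELECT country, COUNT(*) AS row_count
FROM `myproject.mydataset.mytable`
GROUP BY country
ORDER BY row_count DESC
LIMIT 10;

-- Step 2: if a single value dominates, filter it as early as possible
-- and, if needed, handle the dominant value in a separate query.
SELECT country, state, city, COUNT(*) AS n
FROM `myproject.mydataset.mytable`
WHERE country != 'dominant_value'   -- hypothetical skewed key
GROUP BY country, state, city;
```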