Google Professional Data Engineer Certificate EXAMTOPIC DUMPS Q51-Q55
Q 51.
You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)
- ⭕ A. Get more training examples
→ Adding more training data will increase the complexity of the training set and help with the variance problem.
- ❌ B. Reduce the number of training examples
- ⭕ C. Use a smaller set of features
→ Reducing the feature set will ameliorate the overfitting and help with the variance problem.
- ❌ D. Use a larger set of features
- ⭕ E. Increase the regularization parameters
→ Increasing the regularization parameter will reduce overfitting and help with the variance problem.
- ❌ F. Decrease the regularization parameters
Overfitting
Overfitting : performs well on the training set but has high error on the test set. → HIGH VARIANCE
⭕/❌ | Answer | Explanation |
---|---|---|
❌ | Try evaluating the hypothesis on a cross validation set rather than the test set. | A cross validation set is useful for choosing the optimal non-model parameters like the regularization parameter λ, but the train / test split is sufficient for debugging problems with the algorithm itself. |
❌ | Try decreasing the regularization parameter λ. | The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Decreasing the regularization parameter will increase the overfitting, not decrease it. |
⭕ | Try using a smaller set of features. | The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Reducing the feature set will ameliorate the overfitting and help with the variance problem. |
⭕ | Try increasing the regularization parameter λ. | The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Increasing the regularization parameter will reduce overfitting and help with the variance problem. |
⭕ | Get more training examples | The gap in errors between training and test suggests a high variance problem in which the algorithm has overfit the training set. Adding more training data will increase the complexity of the training set and help with the variance problem. |
BigQuery ML - Overfitting
BigQuery ML supports two methods for preventing overfitting: early stopping and regularization.
- Adjusting Regularization Parameter
If the number of features is large compared to the size of the training set, try large values for the regularization parameters. The risk of overfitting is greater when there are only a few observations per feature.
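To make the effect of the regularization parameter concrete, here is a minimal sketch in plain Python. It is not BigQuery ML code: it assumes a one-feature linear model y ≈ w·x with an L2 penalty λ·w², for which the best weight has the closed form w = Σxy / (Σx² + λ). The function name `ridge_weight` and the toy data are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form weight for y ≈ w * x with an L2 penalty lam * w**2.

    Minimizing sum((y - w*x)**2) + lam * w**2 gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical toy data where the unregularized fit is exactly w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Larger lam pulls the weight toward zero, limiting how hard the
# model can lean on any one feature.
for lam in (0.0, 14.0, 126.0):
    print(f"lambda={lam:6.1f}  weight={ridge_weight(xs, ys, lam):.3f}")
```

With λ = 0 the weight is the ordinary least-squares value; as λ grows, the weight shrinks toward zero. That shrinkage is what reduces variance when there are only a few observations per feature.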
Underfitting
Since the hypothesis performs poorly on the training set, it is suffering from high bias (underfitting)
Underfitting : performs poorly on the training set. → HIGH BIAS
⭕/❌ | Answer | Explanation |
---|---|---|
⭕ | Try adding polynomial features. | The poor performance on both the training and test sets suggests a high bias problem. Adding more complex features will increase the complexity of the hypothesis, thereby improving the fit to both the train and test data. |
❌ | Try increasing the regularization parameter λ. | The poor performance on both the training and test sets suggests a high bias problem. Increasing the regularization parameter will make the hypothesis fit the data even worse, decreasing both training and test set performance. |
❌ | Try using a smaller set of features. | The poor performance on both the training and test sets suggests a high bias problem. Using fewer features will decrease the complexity of the hypothesis and will make the bias problem worse. |
⭕ | Try to obtain and use additional features. | The poor performance on both the training and test sets suggests a high bias problem. Using additional features will increase the complexity of the hypothesis, thereby improving the fit to both the train and test data. |
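The "add polynomial features" row can be illustrated with a small sketch in plain Python. It assumes a hypothetical dataset whose true relationship is y = x², fitted by a one-weight model without an intercept; the helper names `fit_single_feature` and `mse` are made up for this example.

```python
def fit_single_feature(features, ys):
    """Least-squares weight for y ≈ w * feature (no intercept)."""
    sfy = sum(f * y for f, y in zip(features, ys))
    sff = sum(f * f for f in features)
    return sfy / sff


def mse(features, ys, w):
    """Mean squared error of the prediction w * feature."""
    return sum((y - w * f) ** 2 for f, y in zip(features, ys)) / len(ys)


# Hypothetical data: the true relationship is quadratic.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]

# Model 1: use x itself. A straight line through the origin cannot
# capture the curve, so the training error stays high (high bias).
w_linear = fit_single_feature(xs, ys)
linear_error = mse(xs, ys, w_linear)

# Model 2: add the polynomial feature x**2. The hypothesis is now
# expressive enough to fit the training data.
squared = [x * x for x in xs]
w_poly = fit_single_feature(squared, ys)
poly_error = mse(squared, ys, w_poly)

print("linear feature error:", linear_error)
print("squared feature error:", poly_error)
```

The linear model's training error is large even though it was fitted on this very data, which is the signature of underfitting. Adding the squared feature removes that error.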
Q 53.
You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1:
What is the most likely cause of the delay for this query?
- ❌ A. Users are running too many concurrent queries in the system
- ❌ B. The [myproject:mydataset.mytable] table has too many partitions
- ❌ C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
- ⭕ D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
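One way to reason about answer D is to measure how much of the data falls under the most frequent grouping key. The sketch below does this in plain Python on made-up country values; the function name `top_key_share` and the sample rows are hypothetical and are not taken from the question's table.

```python
from collections import Counter


def top_key_share(values):
    """Return the most frequent key and the fraction of rows it covers."""
    counts = Counter(values)
    key, n = counts.most_common(1)[0]
    return key, n / len(values)


# Hypothetical column contents: one value dominates the grouping key.
rows = ["US"] * 90 + ["KR"] * 5 + ["JP"] * 5

key, share = top_key_share(rows)
print(f"{key} covers {share:.0%} of rows")
```

When a single key covers most rows, the worker that handles that key does most of the work in a GROUP BY, while the others sit idle. That is why the query is slow regardless of when it runs.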
Avoiding SQL anti-patterns
Avoiding SQL anti-patterns | BigQuery | Google Cloud
- Self-joins
Use a window (analytic) function instead.
- Data skew / Partition skew
If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
  - Partitions become large when your partition key has a value that occurs more often than any other value. For example, grouping by a user_id field where there are many entries for guest or NULL.
- Unbalanced joins
- Cross joins (Cartesian product)
- DML statements that update or insert single rows
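The self-join advice in the list above can be sketched with a small comparison. The example runs in Python's built-in `sqlite3` module rather than BigQuery, and assumes an SQLite build with window-function support (3.25 or later); the `sales` table and its columns are hypothetical. `LAG` is also available as a window function in BigQuery, but syntax details may differ between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10), (2, 15), (3, 12)],
)

# Anti-pattern: join the table to itself to reach the previous day's row.
self_join = """
SELECT cur.day, cur.amount - prev.amount AS delta
FROM sales AS cur
JOIN sales AS prev ON prev.day = cur.day - 1
ORDER BY cur.day
"""

# Alternative: a window function reads the previous row in a single pass.
window = """
SELECT day, delta FROM (
  SELECT day, amount - LAG(amount) OVER (ORDER BY day) AS delta
  FROM sales
)
WHERE delta IS NOT NULL
ORDER BY day
"""

join_rows = conn.execute(self_join).fetchall()
window_rows = conn.execute(window).fetchall()
print(join_rows)
print(window_rows)
```

Both queries return the same day-over-day differences, but the window version scans the table once instead of matching it against itself.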