This post is for Google Cloud Professional ML Engineer certificate preparation: Feature Engineering - Properties of Good Features, Feature Crosses
Feature Engineering
What Makes a Good Feature
- Be related to the objective.
- Be known at prediction-time.
- Be numeric with meaningful magnitude.
- Have enough examples.
- Bring human insight to problem.
Feature Crosses
A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying (crossing) two or more input features together.
Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
- Useful when the data is not linearly separable.
- Bin the features and treat them as categoricals rather than crossing real-valued features. Crossing real values makes a change in one feature equivalent to the same change in the other feature (the product is the same either way).
- A feature cross works by memorizing the discretized feature space ⇒ each bin needs enough data for the inference to be statistically significant, so an effective cross needs a lot of data.
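A minimal sketch of the points above, using made-up numbers: the product of raw values maps different inputs to the same value, while a cross of binned values keeps one distinct category per bin combination. The helper names `bin_index` and `binned_cross` and the bin boundaries are assumptions for illustration only.

```python
from bisect import bisect_right


def raw_cross(a, b):
    """Cross two real-valued features by multiplying them."""
    return a * b


def bin_index(value, boundaries):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(boundaries, value)


def binned_cross(a, b, a_bounds, b_bounds):
    """Cross two binned features: one category per (bin_a, bin_b) pair."""
    return (bin_index(a, a_bounds), bin_index(b, b_bounds))


# Hypothetical bin boundaries, chosen only for this example.
A_BOUNDS = [2.5, 5.0]
B_BOUNDS = [2.5, 5.0]

# Two different inputs whose raw product is identical.
p1, p2 = (2.0, 6.0), (6.0, 2.0)

print(raw_cross(*p1) == raw_cross(*p2))        # True: the raw cross cannot tell them apart
print(binned_cross(*p1, A_BOUNDS, B_BOUNDS))   # (0, 2)
print(binned_cross(*p2, A_BOUNDS, B_BOUNDS))   # (2, 0)
```

Each `(bin_a, bin_b)` pair gets its own weight in a linear model, which is why every pair needs enough examples.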
Related ML Rules for Feature crosses
Rule #20 Combine and modify existing features to create new features in human-understandable ways.
The 2 most standard approaches to combine and modify features :
Discretization
: creating many discrete features from one continuous feature.
e.g., one feature which is 1 when age is less than 18, another feature which is 1 when age is between 18 and 35, et cetera. Use basic quantiles for boundaries.
Crosses
: combining two or more feature columns (sets of homogeneous features).
e.g., Base feature columns : {male, female}, {US, Canada, Mexico}
→ Feature cross : {male, female} × {US, Canada, Mexico}
→ One feature in the new column : {male, Canada}, representing male Canadians.
- What if crosses produce very large feature columns ?
→ It can take massive amounts of data to learn models with crosses of three or more base feature columns.
→ The model may overfit.
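Both approaches from Rule #20 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the age sample is made up, and `quantile_boundaries`, `discretize`, and `cross_columns` are hypothetical helper names, not part of any library.

```python
from itertools import product
from statistics import quantiles


def quantile_boundaries(values, n_bins):
    """Bin boundaries placed at the quantiles of the observed values."""
    return quantiles(values, n=n_bins)


def discretize(value, boundaries):
    """One-hot list: exactly one entry is 1, marking the bin of `value`."""
    index = sum(value >= b for b in boundaries)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]


def cross_columns(*columns):
    """All combinations of the values in the given feature columns."""
    return list(product(*columns))


# Discretization: a made-up age sample split into 4 quantile bins.
ages = [12, 17, 22, 29, 34, 41, 48, 55, 63, 70]
bounds = quantile_boundaries(ages, n_bins=4)
print(discretize(30, bounds))

# Cross: {male, female} x {US, Canada, Mexico}.
crossed = cross_columns(["male", "female"], ["US", "Canada", "Mexico"])
print(len(crossed))                    # 6
print(("male", "Canada") in crossed)   # True
```

The crossed column has 2 × 3 = 6 features, and the column size multiplies again with every further base column added to the cross.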
Rule #21 The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
⇒ Need to scale your learning to the size of your data.
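As a rough back-of-the-envelope illustration of Rule #21, the number of weights a cross adds is the product of its base column sizes, so the data available per weight shrinks quickly. The column sizes and example counts below are made up, and `cross_size` and `examples_per_weight` are hypothetical helper names.

```python
from math import prod


def cross_size(*column_sizes):
    """Number of features (and linear-model weights) produced by a cross."""
    return prod(column_sizes)


def examples_per_weight(n_examples, n_weights):
    """Average number of training examples available per weight."""
    return n_examples / n_weights


# {male, female} x {US, Canada, Mexico}
small = cross_size(2, 3)
print(small)                                   # 6

# A hypothetical three-way cross of columns with 100, 100 and 10 values.
large = cross_size(100, 100, 10)
print(large)                                   # 100000

# With one million examples (an assumed dataset size):
print(examples_per_weight(1_000_000, small))
print(examples_per_weight(1_000_000, large))   # 10.0
```

With the same dataset, the three-way cross leaves only about ten examples per weight on average, which is where the risk of overfitting comes from.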
Check your understanding : feature crosses
Different cities in California have markedly different housing prices. Suppose you must create a model to predict housing prices. Which of the following sets of features or feature crosses could learn city-specific relationships between roomsPerPerson and housing price?
❌ One feature cross:
[latitude X longitude X roomsPerPerson]
- Crossing real-valued features is not a good idea for this. Crossing the real value of, say, latitude with roomsPerPerson enables a 10% change in one feature (say, latitude) to be equivalent to a 10% change in the other feature (say, roomsPerPerson).
❌ Three separate binned features:
[binned latitude], [binned longitude], [binned roomsPerPerson]
- Binning is good because it enables the model to learn nonlinear relationships within a single feature.
- City exists in more than one dimension, so learning city-specific relationships requires crossing latitude and longitude.
❌ Two feature crosses:
[binned latitude X binned roomsPerPerson] and [binned longitude X binned roomsPerPerson]
- Binning is a good idea.
- A city is the conjunction of latitude and longitude, so separate feature crosses prevent the model from learning city-specific prices.
⭕ One feature cross:
[binned latitude X binned longitude X binned roomsPerPerson]
- Crossing binned latitude with binned longitude enables the model to learn city-specific effects of roomsPerPerson.
- Binning prevents a change in latitude producing the same result as a change in longitude. (Depending on the granularity of the bins, this feature cross could learn city-specific or neighborhood-specific or even block-specific effects.)
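One way to see why the latitude and longitude bins must be crossed together: without a cross, a linear model can only add a per-latitude-bin weight to a per-longitude-bin weight. The sketch below checks whether a grid of values can be written that way. The grid values and the function name `is_additive` are made up for illustration.

```python
def is_additive(grid):
    """True if grid[i][j] can be written as row_weight[i] + col_weight[j].

    This holds exactly when every interaction term
    grid[i][j] - grid[i][0] - grid[0][j] + grid[0][0] is zero.
    """
    base = grid[0][0]
    return all(
        abs(grid[i][j] - grid[i][0] - grid[0][j] + base) < 1e-9
        for i in range(len(grid))
        for j in range(len(grid[0]))
    )


# Rows are latitude bins, columns are longitude bins.
# Values are hypothetical effects of roomsPerPerson on price in each cell.
no_interaction = [[1.0, 2.0], [3.0, 4.0]]
city_specific = [[1.0, 0.0], [0.0, 1.0]]

print(is_additive(no_interaction))  # True: separate lat and lon weights suffice
print(is_additive(city_specific))   # False: needs one weight per (lat, lon) cell
```

A crossed feature assigns one weight to each (latitude bin, longitude bin) cell, so it can represent either grid.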
EXAMTOPIC 43
You are an ML engineer at a global car manufacturer. You need to build an ML model to predict car sales in different cities around the world. Which features or feature crosses should you use to train city-specific relationships between car type and number of sales?
❌ A. Three individual features: binned latitude, binned longitude, and one-hot encoded car type.
❌ B. One feature obtained as an element-wise product between latitude, longitude, and car type.
⭕ C. One feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type.
❌ D. Two feature crosses as an element-wise product: the first between binned latitude and one-hot encoded car type, and the second between binned longitude and one-hot encoded car type.
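city-specific : conjunction of latitude and longitude ⇒ A and D are ruled out, because neither crosses latitude with longitude. B is ruled out because it crosses raw values without binning.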
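A small sketch of what a cross of one-hot encoded features looks like. It assumes the cross is realized as a flattened outer product of the one-hot vectors, which is one common way to read the "product" in the answer options; the bin counts and the helper names `one_hot` and `cross_one_hot` are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def cross_one_hot(*vectors):
    """Flattened outer product of one-hot vectors.

    The result is itself one-hot, with one slot per combination of inputs.
    """
    result = [1]
    for vec in vectors:
        result = [r * v for r in result for v in vec]
    return result


# Hypothetical sizes: 3 latitude bins, 2 longitude bins, 3 car types.
lat = one_hot(1, 3)
lon = one_hot(0, 2)
car = one_hot(2, 3)

crossed = cross_one_hot(lat, lon, car)
print(len(crossed))      # 18 = 3 * 2 * 3
print(sum(crossed))      # 1: still one-hot
print(crossed.index(1))  # 8 = (1 * 2 + 0) * 3 + 2
```

Every (latitude bin, longitude bin, car type) combination gets its own slot, so a linear model can learn a separate weight for each car type in each location cell.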
Source/Reference : Feature Crosses, EXAMTOPIC, Best Practices for ML Engineering