Feature Engineering Patterns
I wanted to document a few common feature engineering patterns here for quick reference:
Absolute Time
The easy way to encode an absolute time is as a continuous feature: the duration elapsed since the epoch.
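For example, a minimal sketch w/ pandas, assuming df has a datetime64 column named timestamp (the column name is made up):

import pandas as pd
# datetime64[ns] values are stored as nanoseconds since the epoch;
# integer-divide to get seconds since 1970-01-01
df['secs_since_epoch'] = df['timestamp'].astype('int64') // 10**9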
Time of day
The naive way to do this is to bucketize into parts of the day or to make a continuous feature out of the hour. But there is a more sophisticated way using trigonometry.
Buckets
For example, add 3 one-hot columns: morning, afternoon, night, and cluster times into their respective buckets. Or just add 1 column that represents the hour of the day (0-23) as a continuous feature.
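A minimal sketch of the bucket version w/ pandas; the bucket boundaries (midnight-noon, noon-6PM, 6PM-midnight) are an arbitrary choice for illustration:

import pandas as pd
# assign each hour 0-23 to a part-of-day bucket
df['part_of_day'] = pd.cut(df.hr, bins=[0, 12, 18, 24],
                           labels=['morning', 'afternoon', 'night'], right=False)
# expand the bucket into one-hot columns
df = pd.concat([df, pd.get_dummies(df['part_of_day'])], axis=1)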
Radians
The problem with using hour of day is that it ruins the circular nature of time by adding a discontinuity between 11PM (hour 23) and midnight (hour 0). Instead, consider time as an angle on the unit circle and then convert that to 2D Cartesian space.
The discontinuity is bad b/c the linear encoding says hours 23 and 0 are far apart when there is really only 1 hour between them!
Technically, this means adding two new features, x and y, in Cartesian space. Now hours 23 and 0 are right next to each other.
import numpy as np
# map the hour (0-23) onto the unit circle: 2*pi radians per 24 hours
df['hr_sin'] = np.sin(df.hr * (2. * np.pi / 24))
df['hr_cos'] = np.cos(df.hr * (2. * np.pi / 24))
# same idea for months: shift 1-12 to 0-11, then 2*pi radians per 12 months
df['mnth_sin'] = np.sin((df.mnth - 1) * (2. * np.pi / 12))
df['mnth_cos'] = np.cos((df.mnth - 1) * (2. * np.pi / 12))
List
Consider a variable-length list of items, for example, all the restaurants a user has listed, and suppose you want to use them as features/predictors for a model.
Embeddings
The deep learning approach is to learn an embedding for each item and pool (average) the embeddings into a single fixed-length vector.
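A minimal sketch of average pooling w/ numpy; the embedding matrix and item ids are made-up stand-ins for learned embeddings:

import numpy as np
emb = np.random.randn(1000, 32)        # stand-in for a learned (num_items, dim) embedding table
item_ids = [12, 87, 402]               # the items in one user's variable-length list
user_vec = emb[item_ids].mean(axis=0)  # fixed-length (32,) pooled feature vector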
BOW
Build a fixed-length feature vector of size $V$ (the vocabulary size) and populate it w/ item counts or frequencies.
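A minimal sketch w/ scikit-learn's CountVectorizer; each user's item list is joined into one token string, and the lists here are made up:

from sklearn.feature_extraction.text import CountVectorizer
users = ['sushi_ko taqueria sushi_ko', 'taqueria pho_house']  # one string of items per user
vec = CountVectorizer()
X = vec.fit_transform(users)  # (n_users, V) sparse matrix of item counts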
TF-IDF
A popular weighting scheme for frequency counts that discounts common items and up-weights rarer, more informative ones.
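The same sketch w/ TF-IDF weights instead of raw counts, reusing the made-up users list from above:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(users)  # (n_users, V) matrix of TF-IDF weights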
Skewed Distribution
If your predictor or response has a skewed distribution, sometimes it’s possible to run a log transform on it, which might make it more normal. Whether this is good depends on whether your underlying estimator has normality assumptions.
The scikit-learn documentation has an interesting use case for this, AND it has an actual estimator that does it!
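A minimal sketch of the response case; I’m assuming something like sklearn’s TransformedTargetRegressor, which fits on the transformed target and inverts the transform at predict time (the data here is made up):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = np.expm1(X @ np.array([1.0, 2.0, 0.5]))  # synthetic, right-skewed target
# fit Ridge on log1p(y); predictions are mapped back w/ expm1
model = TransformedTargetRegressor(regressor=Ridge(), func=np.log1p, inverse_func=np.expm1)
model.fit(X, y)
preds = model.predict(X)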
Lat/Lng
Latitude and longitude as numerical features alone don’t give much discriminative power. An interesting technique that I like is doing bucketed feature crosses. So first you discretize by some type of bucketing to make lat and lng ranges, then you do feature crosses on the ranges to make geo squares!
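A minimal sketch, assuming df has lat and lng columns; the 0.1-degree grid size is an arbitrary choice for illustration:

import numpy as np
import pandas as pd
# discretize each coordinate into 0.1-degree buckets
lat_bucket = np.floor(df['lat'] / 0.1).astype(int)
lng_bucket = np.floor(df['lng'] / 0.1).astype(int)
# cross the two bucket ids into one categorical "geo square" feature
df['geo_square'] = lat_bucket.astype(str) + '_' + lng_bucket.astype(str)
df = pd.concat([df, pd.get_dummies(df['geo_square'], prefix='geo')], axis=1)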
How do you handle cases where a point on the map is on the boundary of one of your grid squares? Reminds me of the time encoding task…