Python Packages Specialize for Feature Engineering Techniques
Several Python packages specialize in feature engineering techniques, which can help automate tasks like encoding, scaling, generating interaction features, or extracting domain-specific features. Here are some popular feature engineering packages and examples of how to use them:
1. Feature-engine
Feature-engine provides a variety of feature engineering techniques, including encoding, variable transformation, feature creation, and feature selection. It integrates well with scikit-learn and allows you to create pipelines with ease.
Example with Feature-engine
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import MeanMedianImputer
from feature_engine.transformation import PowerTransformer
import pandas as pd
# Load your dataset
df = pd.read_csv("your_dataset.csv")
# Example pipeline for feature engineering
# 1. Fill missing values in numeric columns
imputer = MeanMedianImputer(imputation_method="mean", variables=["numerical_column"])
df_imputed = imputer.fit_transform(df)
# 2. Apply one-hot encoding to categorical columns
encoder = OneHotEncoder(variables=["categorical_column"])
df_encoded = encoder.fit_transform(df_imputed)
# 3. Apply power transformation to normalize skewed distributions
transformer = PowerTransformer(variables=["numerical_column"])
df_transformed = transformer.fit_transform(df_encoded)
df_transformed.head()
2. Featuretools
Featuretools is a library for automated feature engineering, especially useful for relational datasets (i.e., datasets with multiple tables). It uses a technique called “Deep Feature Synthesis” to automatically create new features based on relationships and aggregations.
Example with Featuretools
import featuretools as ft
import pandas as pd
# Load data
df = pd.read_csv("your_dataset.csv")
# Create an EntitySet and add the dataframe
es = ft.EntitySet(id="dataset")
es = es.entity_from_dataframe(entity_id="data", dataframe=df, index="index_column")
# Perform deep feature synthesis to automatically create new features
features, feature_defs = ft.dfs(entityset=es, target_entity="data", max_depth=2)
features.head()
3. tsfresh (for Time Series Data)
tsfresh
is a great package for extracting features from time series data. It provides a comprehensive set of feature extraction functions tailored for time series, like statistical metrics and frequency domain transformations.
Example with tsfresh
from tsfresh import extract_features
import pandas as pd
# Assuming your dataset is in a long format with columns "id", "time", and "value"
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 2],
"time": [1, 2, 3, 1, 2, 3],
"value": [10, 20, 15, 10, 20, 15]
})
# Extract features
extracted_features = extract_features(df, column_id="id", column_sort="time")
extracted_features.head()
4. Scikit-learn Pipelines with FeatureUnion
For basic feature engineering in scikit-learn, the FeatureUnion
module allows you to combine multiple feature engineering steps, like scaling, polynomial features, and encoding, into a single pipeline.
Example with Scikit-learn Pipelines
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline, FeatureUnion
import pandas as pd
# Load your dataset
df = pd.read_csv("your_dataset.csv")
# Example feature engineering pipeline
pipeline = Pipeline([
("features", FeatureUnion([
("poly_features", PolynomialFeatures(degree=2)),
("scaler", StandardScaler())
]))
])
# Fit and transform the dataset
X_transformed = pipeline.fit_transform(df[["numerical_column"]])
X_transformed[:5]
5. Kats (for Time Series)
Kats, developed by Facebook, offers advanced feature engineering for time series data, including trend detection, seasonal decomposition, anomaly detection, and feature extraction.
Example with Kats
from kats.tsfeatures.tsfeatures import TsFeatures
from kats.consts import TimeSeriesData
import pandas as pd
# Example time series data
df = pd.DataFrame({
"time": pd.date_range(start="2020-01-01", periods=100, freq="D"),
"value": range(100)
})
# Convert to Kats TimeSeriesData format
ts_data = TimeSeriesData(df)
# Initialize Kats features and calculate
ts_features = TsFeatures()
features = ts_features.transform(ts_data)
print(features)
These packages offer a wide range of feature engineering capabilities, from automated feature extraction to custom transformations for specific data types. You can choose the best one depending on your dataset and the type of features you want to generate.