How to Perform Stepwise Regression in Python
- Stepwise Regression in Python
- Stepwise Regression With the statsmodels Library in Python
- Stepwise Regression With the sklearn Library in Python
- Stepwise Regression With the mlxtend Library in Python
This tutorial will discuss the methods to perform stepwise regression in Python.
Stepwise Regression in Python
Stepwise regression is a method used in statistics and machine learning to select a subset of features for building a linear regression model. Stepwise regression aims to minimize the model’s complexity while maintaining a high accuracy level.
This method is particularly useful in cases where the number of features is large, and it’s unclear which features are important for the prediction.
Stepwise Regression With the statsmodels Library in Python
The statsmodels library provides the OLS() class, which serves as the building block for stepwise regression. statsmodels has no ready-made stepwise function, but OLS() reports a p-value for every coefficient, so the selection can be driven manually: forward selection starts with an empty model and adds variables one by one based on the significance of their coefficients, while backward elimination starts with all variables and removes the ones that are not significant.
Here is an example of how to fit a model with OLS() in statsmodels and inspect the p-values it reports.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Load the data
data = pd.read_csv("data.csv")
# Define the dependent and independent variables
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]
# Add a constant column so the model includes an intercept
x = sm.add_constant(x)
# Fit the full OLS model
result = sm.OLS(y, x).fit()
# Print the summary of the model (coefficients, p-values, R-squared)
print(result.summary())
Output: an OLS regression results table showing the coefficients, standard errors, p-values, and R-squared of the fitted model.
We first load the data in the above code example and define the dependent and independent variables. Then, we fit the model using the OLS() class from the statsmodels.api module and print a model summary, which includes information such as the coefficients of the variables, p-values, and R-squared value. These p-values are what a manual stepwise procedure uses to decide which variable to add or drop next.
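Because statsmodels leaves the selection loop to us, the backward-elimination half of the procedure can be scripted around OLS(). The helper below is only a rough sketch, not a statsmodels API: the backward_eliminate name and the 0.05 significance cutoff are assumptions made for illustration.
import pandas as pd
import statsmodels.api as sm
# Drop the least significant feature until every remaining p-value
# is at or below the chosen threshold (assumed cutoff: 0.05)
def backward_eliminate(x, y, threshold=0.05):
    features = list(x.columns)
    while features:
        # Refit OLS (with an intercept) on the current feature set
        model = sm.OLS(y, sm.add_constant(x[features])).fit()
        # Ignore the intercept when looking for the weakest feature
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] <= threshold:
            return features, model
        features.remove(worst)
    return features, None
# Example usage with the same data as above
data = pd.read_csv("data.csv")
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]
selected, final_model = backward_eliminate(x, y)
print(selected)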
Stepwise Regression With the sklearn Library in Python
The sklearn library provides an RFE (Recursive Feature Elimination) class for performing stepwise regression. This method starts with all features and recursively eliminates features based on their importance.
The RFE class uses a specified estimator (such as a linear regression model) to estimate the importance of the features and recursively removes the least important feature at each iteration.
Here is an example of how to use the RFE class in sklearn.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Load the data
data = pd.read_csv("data.csv")
# Define the dependent and independent variables
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]
# Create a linear regression estimator
estimator = LinearRegression()
# Create the RFE object and specify the number of features to select
selector = RFE(estimator, n_features_to_select=5)
# Fit the RFE object to the data
selector = selector.fit(x, y)
# Print the selected features
print(x.columns[selector.support_])
Output:
Index(['Tenure', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Exited'], dtype='object')
We first load the data in the above code example and define the dependent and independent variables. Then, we create a linear regression estimator and an RFE object.
We set the number of features to select as 5, which means that the final model will only include the top 5 features according to their importance. Next, we fit the RFE object to the data and print the selected features.
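Beyond the boolean support_ mask, the fitted selector also ranks every feature and can reduce the dataset to the chosen columns. A short usage sketch, reusing the selector fitted above:
# Rank of each feature: 1 means selected, higher ranks were eliminated earlier
print(dict(zip(x.columns, selector.ranking_)))
# Keep only the selected columns in the feature matrix
x_selected = selector.transform(x)
print(x_selected.shape)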
It’s worth noting that RFE uses the specified estimator to compute the importance of the features, so it is important to use an appropriate estimator for the data. RFE can also be used with other estimators, such as Random Forest or SVM, as shown in the sketch below.
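For instance, swapping in a tree-based model only changes the estimator argument. The sketch below assumes the same data.csv used above and relies on RandomForestRegressor, which exposes the feature_importances_ attribute that RFE reads at each elimination step.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
data = pd.read_csv("data.csv")
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]
# RFE works with any estimator that exposes coef_ or feature_importances_
rf_selector = RFE(RandomForestRegressor(n_estimators=100, random_state=0), n_features_to_select=5)
rf_selector = rf_selector.fit(x, y)
print(x.columns[rf_selector.support_])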
Stepwise Regression With the mlxtend Library in Python
The mlxtend library provides the SequentialFeatureSelector class (commonly imported as SFS) for performing stepwise regression. It supports forward selection, backward elimination, and floating variants that combine the two directions.
In forward mode, it starts with an empty model and adds variables one by one, keeping at each step the feature that most improves the estimator’s cross-validated score; in backward mode, it starts with all features and removes the least useful one at each step.
Here is an example of how to use SequentialFeatureSelector in mlxtend.
import sys
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
# Compatibility shim for older mlxtend releases that still import
# sklearn.externals.joblib; it must run before mlxtend is imported
sys.modules["sklearn.externals.joblib"] = joblib
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
# Load the data
data = pd.read_csv("data.csv")
# Define the dependent and independent variables
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]
# Create a linear regression estimator
estimator = LinearRegression()
# Create the SFS object and specify the number of features to select
sfs = SFS(estimator, k_features=5, forward=True, floating=False, scoring="r2", cv=5)
# Fit the SFS object to the data
sfs = sfs.fit(x, y)
# Print the indices of the selected features
print(sfs.k_feature_idx_)
Output:
(1, 2, 4, 6, 7)
We first load the data in this example and define the dependent and independent variables. Then, we create a linear regression estimator and an SFS object.
We set the number of features to select as 5, which means that the final model will only include the 5 features that give the best cross-validated score. Next, we fit the SFS object to the data and print the indices of the selected features.
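The fitted selector also stores the names of the selected columns and the cross-validated score of the best subset, which are easier to read than raw indices. A short sketch, reusing the sfs object fitted above:
# Column names of the selected features (available when fitting on a DataFrame)
print(sfs.k_feature_names_)
# Mean cross-validated R-squared of the selected subset
print(sfs.k_score_)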
It’s worth noting that SequentialFeatureSelector scores candidate feature subsets by cross-validating the specified estimator, so it is important to use an appropriate estimator for the data. The class also lets us set the direction of the selection process, the scoring metric, and the number of cross-validation folds to use, as shown in the sketch below.
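For example, flipping the forward flag turns the same call into backward elimination. The sketch below reuses the x, y, and imports from the example above; the scoring metric and fold count are arbitrary choices for illustration.
# Backward elimination: start from all features and drop one at a time
sfs_backward = SFS(LinearRegression(), k_features=5, forward=False, floating=False, scoring="neg_mean_absolute_error", cv=3)
sfs_backward = sfs_backward.fit(x, y)
print(sfs_backward.k_feature_names_)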
In summary, stepwise regression is a powerful technique for feature selection in linear regression models. The statsmodels, sklearn, and mlxtend libraries provide different methods for performing stepwise regression in Python, each with advantages and disadvantages.
The choice of method will depend on the problem’s specific requirements and the availability of the data. It is important to note that stepwise regression can be prone to overfitting, and using it in combination with other feature selection techniques and cross-validation is recommended (see the sketch below).
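As one concrete way to combine elimination with cross-validation, sklearn's RFECV class wraps recursive feature elimination in a cross-validation loop and picks the number of features automatically. A minimal sketch, assuming the same data.csv as in the examples above:
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
data = pd.read_csv("data.csv")
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]
# Eliminate features recursively and keep the count that maximizes the cross-validated R-squared
selector = RFECV(LinearRegression(), step=1, cv=5, scoring="r2")
selector = selector.fit(x, y)
print("Optimal number of features:", selector.n_features_)
print(x.columns[selector.support_])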