Multiple Regression in Python

- Use the statsmodels.api Module to Perform Multiple Linear Regression in Python
- Use the numpy.linalg.lstsq Method to Perform Multiple Linear Regression in Python
- Use the scipy.optimize.curve_fit() Method to Perform Multiple Linear Regression in Python
This tutorial will discuss multiple linear regression and how to implement it in Python.
Multiple linear regression is a model that estimates the relationship between two or more independent variables and a single response variable by fitting a linear equation to the observed data. It helps quantify how the response changes as the independent variables change. In standard multiple linear regression, all the independent variables are taken into account simultaneously.
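For instance, with an equation of the form y = b0 + b1*x1 + b2*x2, the model predicts the response as a linear combination of the independent variables. A minimal sketch with made-up coefficient values (b0, b1, and b2 here are illustrative, not fitted):

```python
import numpy as np

# Illustrative coefficients for y = b0 + b1*x1 + b2*x2 (made-up values, not fitted)
b0, b1, b2 = 1.5, 0.3, -0.1

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])

# The model predicts the response as a linear combination of the predictors
y_pred = b0 + b1 * x1 + b2 * x2
print(y_pred)
```

The sections below show three ways to estimate such coefficients from data.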
Use the statsmodels.api Module to Perform Multiple Linear Regression in Python

The statsmodels.api module in Python is equipped with functions to implement linear regression. We will use the OLS() function, which performs ordinary least squares regression.
We can either import a dataset using the pandas module or create our own dummy data to perform multiple regression. We separate the dependent and independent variables before applying the linear regression model between them.

We create a regression model using the OLS() function, pass the dependent and independent variables to it, and fit the model using the fit() function. In our example, we have created some arrays to demonstrate multiple regression.
See the code below.
import statsmodels.api as sm
import numpy as np

y = [1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1]
X = [
    [0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 4, 3, 1, 2, 3, 4, 1, 3, 9, 7],
]


def reg_m(y, x):
    # stack the predictors column by column and add a constant (intercept) term
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results


print(reg_m(y, X).summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.241
Model: OLS Adj. R-squared: 0.121
Method: Least Squares F-statistic: 2.007
Date: Wed, 16 Jun 2021 Prob (F-statistic): 0.147
Time: 23:57:15 Log-Likelihood: -40.810
No. Observations: 23 AIC: 89.62
Df Residuals: 19 BIC: 94.16
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.0287 0.135 -0.213 0.834 -0.311 0.254
x2 0.2684 0.160 1.678 0.110 -0.066 0.603
x3 0.1339 0.160 0.839 0.412 -0.200 0.468
const 1.5123 0.986 1.534 0.142 -0.551 3.576
==============================================================================
Omnibus: 9.472 Durbin-Watson: 2.447
Prob(Omnibus): 0.009 Jarque-Bera (JB): 7.246
Skew: -1.153 Prob(JB): 0.0267
Kurtosis: 4.497 Cond. No. 29.7
==============================================================================
The summary() function allows us to print the results and coefficients of the regression. The R-squared and Adjusted R-squared values tell us about the goodness of fit of the regression.
Use the numpy.linalg.lstsq Method to Perform Multiple Linear Regression in Python

The numpy.linalg.lstsq method returns the least-squares solution to a provided equation of the form Ax = B by computing the vector x that minimizes the Euclidean norm ||B - Ax||.

We can use it to perform multiple regression as shown below.
import numpy as np
y = [1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1]
X = [
[0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
[4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
[4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 4, 3, 1, 2, 3, 4, 1, 3, 9, 7],
]
X = np.transpose(X)  # transpose so that each row is one observation
X = np.c_[X, np.ones(X.shape[0])]  # add a column of ones for the intercept (bias) term
linreg = np.linalg.lstsq(X, y, rcond=None)[0]
print(linreg)
Output:
[ 0.1338682 0.26840334 -0.02874936 1.5122571 ]
We can compare the coefficients for each variable with the previous method and notice that they match. Here, the final result is a NumPy array.
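Since lstsq only returns the coefficient vector, predictions have to be computed by hand. A minimal sketch that reuses the same design matrix to obtain the fitted values and the residual sum of squares:

```python
import numpy as np

y = np.array([1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1])
X = np.transpose([
    [0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 4, 3, 1, 2, 3, 4, 1, 3, 9, 7],
])
A = np.c_[X, np.ones(X.shape[0])]  # design matrix; last column is the intercept

coef = np.linalg.lstsq(A, y, rcond=None)[0]

# Fitted values and residual sum of squares from the least-squares solution
y_hat = A @ coef
rss = np.sum((y - y_hat) ** 2)
print(y_hat)
print(rss)
```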
Use the scipy.optimize.curve_fit() Method to Perform Multiple Linear Regression in Python

This method takes a user-defined model function and uses non-linear least squares to fit that function to the given data.
See the code below.
from scipy.optimize import curve_fit
import numpy as np


def function_calc(x, a, b, c):
    # linear model using the first two predictor rows of x
    return a + b * x[0] + c * x[1]


y = [1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1]
X = [
    [0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 4, 3, 1, 2, 3, 4, 1, 3, 9, 7],
]

popt, pcov = curve_fit(function_calc, X, y)
print(popt)
print(pcov)
Output:
[1.44920591 0.12720273 0.26001833]
[[ 0.84226681 -0.06637804 -0.06977243]
[-0.06637804 0.02333829 -0.01058201]
[-0.06977243 -0.01058201 0.02288467]]
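The diagonal of the covariance matrix pcov holds the variance of each parameter estimate, so taking its square root gives the one-standard-deviation errors on a, b, and c. A minimal sketch reusing the same fit (only the two predictor rows the model actually uses are passed in):

```python
import numpy as np
from scipy.optimize import curve_fit


def function_calc(x, a, b, c):
    return a + b * x[0] + c * x[1]


y = [1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1]
X = [
    [0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
]

popt, pcov = curve_fit(function_calc, X, y)

# Standard errors are the square roots of the covariance matrix diagonal
perr = np.sqrt(np.diag(pcov))
print(perr)
```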