How to Use the Predict Function on a Linear Regression Model in R
After building a linear regression model with the lm() function, we can use it to predict values of the response (also called the output or dependent) variable for new values of the feature (also called the input or independent) variables.
In this article, we will look at the basic arguments of R’s predict() function in the context of linear regression. In particular, we will see that the function expects the new data as a data frame whose column names match those used to build the model.
Use the predict() Function on a Linear Regression Model in R
To demonstrate the predict() function, we will first build a linear regression model with some sample data.
Observe the column names in the data frame, and note how they are used in the linear regression formula.
Example Code:
Feature = c(15:24)
set.seed(654)
Response = 2* c(15:24) + 5 + rnorm(10, 0,3)
DFR = data.frame(Response, Feature)
DFR
# The arguments formula and data are named for clarity.
LR_mod = lm(formula = Response~Feature, data = DFR)
LR_mod
Output:
> Feature = c(15:24)
> set.seed(654)
> Response = 2* c(15:24) + 5 + rnorm(10, 0,3)
> DFR = data.frame(Response, Feature)
> DFR
   Response Feature
1  32.71905      15
2  35.83089      16
3  44.06888      17
4  40.71729      18
5  43.28590      19
6  47.45182      20
7  50.19730      21
8  51.81954      22
9  53.22364      23
10 51.69406      24
> # The arguments formula and data are named for clarity.
> LR_mod = lm(formula = Response~Feature, data = DFR)
> LR_mod
Call:
lm(formula = Response ~ Feature, data = DFR)
Coefficients:
(Intercept)      Feature
      2.096        2.205
Use predict() to Predict the Response
Now that we have a linear regression model, we can use R’s predict() function to predict values of the response corresponding to new values of the feature variables.
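To predict the response for new values, we pass two arguments to the predict() function. If the new data is omitted, predict() returns the fitted values for the data used to build the model.
- A model object.
- New data.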
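As a small illustration of the role of the second argument, the sketch below calls predict() with only the model object and compares the result with fitted(). It rebuilds the sample model from this article so that it can run on its own; the name fitted_from_predict is introduced here only for illustration.

```r
# Rebuild the article's sample model so this block runs on its own.
Feature = c(15:24)
set.seed(654)
Response = 2 * c(15:24) + 5 + rnorm(10, 0, 3)
DFR = data.frame(Response, Feature)
LR_mod = lm(formula = Response ~ Feature, data = DFR)

# With no newdata, predict() returns one value per row of the original data.
fitted_from_predict = predict(object = LR_mod)
length(fitted_from_predict)

# fitted() extracts the fitted values stored in the model object;
# all.equal() compares the two results.
all.equal(unname(fitted_from_predict), unname(fitted(LR_mod)))
```

This is why the new data matters: without it, or when R cannot use it, we only get predictions for the rows the model was built on.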
In this context, there are two important considerations.
- We need to provide the new data as a data frame. In our example, the feature is a single variable, but if we give a vector to the predict() function, we will get an error.
- If the column name of the feature variable in the new data frame differs from the name of the corresponding column in the original data frame, R does not use the new values. We get a warning, and the output contains predictions for the original data instead.
Example Code:
# First, let us create new values for the feature variable.
NewFeature = c(20.5, 16.5, 22.5)
# If we provide a vector, we get an error.
predict(object = LR_mod, newdata = NewFeature)
# Make a data frame.
DFR2 = data.frame(NewFeature)
# No error this time, but a warning and the wrong output.
# R saw the correct number of rows in the new data but did not use them.
predict(LR_mod, newdata = DFR2)
Output:
> # First, let us create new values for the feature variable.
> NewFeature = c(20.5, 16.5, 22.5)
> # If we provide a vector, we get an error.
> predict(object = LR_mod, newdata = NewFeature)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
'data' must be a data.frame, environment, or list
> # Make a data frame.
> DFR2 = data.frame(NewFeature)
> # No error this time, but a warning and the wrong output.
> # R saw the correct number of rows in the new data but did not use them.
> predict(LR_mod, newdata = DFR2)
       1        2        3        4        5        6        7        8        9       10
35.17674 37.38209 39.58745 41.79280 43.99816 46.20351 48.40887 50.61422 52.81958 55.02494
Warning message:
'newdata' had 3 rows but variables found have 10 rows
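The ten values in this output are most likely explained by R not finding a column named Feature in DFR2 and falling back on the Feature vector that still exists in the workspace. One way to catch this before calling predict() is to compare the predictor names in the model with the column names of the new data. The following is a minimal sketch of such a check, not part of predict() itself; the names needed and missing_cols are introduced here only for illustration.

```r
# Rebuild the article's sample model so this block runs on its own.
Feature = c(15:24)
set.seed(654)
Response = 2 * c(15:24) + 5 + rnorm(10, 0, 3)
DFR = data.frame(Response, Feature)
LR_mod = lm(formula = Response ~ Feature, data = DFR)

# The new data frame with the mismatched column name.
NewFeature = c(20.5, 16.5, 22.5)
DFR2 = data.frame(NewFeature)

# Names of the predictor variables in the model formula (response removed).
needed = all.vars(delete.response(terms(LR_mod)))
needed

# Predictors that the new data frame does not supply under the same name.
missing_cols = setdiff(needed, names(DFR2))
missing_cols
```

If missing_cols is not empty, the new data frame needs its columns renamed before it is passed to predict().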
We must ensure two things to get the correct output from the predict() function.
- We must pass a data frame to the newdata argument of the predict() function. This was done above after the first error.
- The column names of the feature variables must be the same as those used in the original data frame to build the model. We will make this change in the following code segment.
It is also good practice to name the arguments.
With these aspects addressed, we get the expected output from the predict() function.
Example Code:
# The feature column of the new data frame is given the same name as in the original data frame.
DFR3 = data.frame(Feature = NewFeature)
# Finally, predict() works as expected.
predict(LR_mod, newdata = DFR3)
Output:
> # The feature column of the new data frame is given the same name as in the original data frame.
> DFR3 = data.frame(Feature = NewFeature)
> # Finally, predict() works as expected.
> predict(LR_mod, newdata = DFR3)
       1        2        3
47.30619 38.48477 51.71690
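As a sanity check, the predictions for a simple linear regression can be recomputed by hand as intercept + slope * feature, using the coefficients stored in the model. The sketch below does this and then stores the predictions next to the inputs. It rebuilds the sample model so that it can run on its own; the names pred, b, manual, and the Predicted column are introduced here only for illustration.

```r
# Rebuild the article's sample model so this block runs on its own.
Feature = c(15:24)
set.seed(654)
Response = 2 * c(15:24) + 5 + rnorm(10, 0, 3)
DFR = data.frame(Response, Feature)
LR_mod = lm(formula = Response ~ Feature, data = DFR)

# New data with the same column name as in the original data frame.
NewFeature = c(20.5, 16.5, 22.5)
DFR3 = data.frame(Feature = NewFeature)
pred = predict(LR_mod, newdata = DFR3)

# Recompute the predictions by hand from the model coefficients.
b = coef(LR_mod)
manual = b[["(Intercept)"]] + b[["Feature"]] * NewFeature

# Compare the two results, ignoring the names attached by predict().
all.equal(unname(pred), manual)

# Keep each prediction in the same row as the feature value it belongs to.
DFR3$Predicted = pred
DFR3
```

Keeping the predictions in the same data frame as the inputs makes it harder to mismatch rows later on.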