How to Implement the train() Function in R

Sheeraz Gul, Feb 12, 2024
  1. Syntax of the train Function
  2. Install and Load Packages to Use the train Function
  3. Implement the train() Function in R for Basic Model Training for Regression
  4. Implement the train() Function in R for Classification With Random Forests
  5. Implement the train() Function in R for Model Tuning With Cross-Validation
  6. Conclusion

Machine learning has become an indispensable tool in data analysis, enabling data scientists to uncover patterns, make predictions, and derive insights from complex datasets. Among the machine learning libraries available in R, the caret package stands out with its comprehensive set of tools for model training, tuning, and evaluation.

At the heart of this package lies the train function, a versatile tool that simplifies the process of implementing and using various machine learning algorithms. In this article, we’ll delve into the intricacies of the train function, exploring its capabilities and providing examples to illustrate its usage.

Syntax of the train Function

The train function is a central component of the caret package, short for Classification And REgression Training. The primary objective of train is to streamline the process of building and evaluating predictive models.

It is particularly useful when dealing with a variety of machine learning algorithms, as it provides a consistent interface for model training across different methods.

The basic syntax is as follows:

train(formula, data, method, ...)

Here’s a breakdown of the main parameters:

  • formula: A symbolic description of the model, specifying the relationship between the predictors and the response variable.
  • data: The dataset containing the variables specified in the formula.
  • method: The modeling method or algorithm to be used for training the model. This can be any algorithm supported by the caret package, such as lm for linear regression, rf for random forests, etc.
  • ...: Additional arguments that depend on the chosen modeling method. These additional arguments allow you to customize the model training process, such as specifying hyperparameters or control parameters.

It’s important to note that the train function is quite flexible and can be customized based on the specific needs of your modeling task. The exact arguments and their interpretation may vary depending on the modeling method chosen.
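As a quick sketch of how the ... arguments work: once the caret package is loaded (covered in the next section), extra arguments placed after method are simply forwarded to the underlying fitting function. In the hypothetical call below, ntree is not an argument of train itself; it is passed through to randomForest.

# Minimal sketch: ntree is forwarded through ... to randomForest()
rf_fit <- train(Species ~ ., data = iris, method = "rf", ntree = 500)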

Install and Load Packages to Use the train Function

Before diving into the train function, ensure you have the required packages installed. Installing caret together with its Suggests dependencies should also pull in packages used later in this article, such as mlbench (which provides the BostonHousing data) and randomForest. Open your R console and execute the following command:

install.packages("caret", dependencies = c("Depends", "Suggests"))

Once installed, load the package into your workspace:

library(caret)
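If you would rather install caret only when it is not already present, a small convenience check (a common R pattern, not something caret itself requires) achieves the same result:

# Install caret only if it is missing, then load it
if (!requireNamespace("caret", quietly = TRUE)) {
    install.packages("caret", dependencies = c("Depends", "Suggests"))
}
library(caret)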

Implement the train() Function in R for Basic Model Training for Regression

Let’s start with a basic example of using the train function for linear regression.

To start, we load the BostonHousing dataset into our R environment. This dataset ships with the mlbench package (installed above as one of caret's Suggests dependencies), so we load that package first and then call the data function. The dataset contains housing-related information for various neighborhoods, making it suitable for regression analysis.

library(mlbench)
data(BostonHousing)

With the dataset in place, we define our regression formula using the ~ operator. Here, we want to predict the median value of owner-occupied homes (medv) based on all other available features in the dataset.

This is expressed as medv ~ ., where the dot (.) indicates the use of all other variables for prediction.

formula <- medv ~ .

Now, we are ready to use the train function to train our linear regression model. We pass in the formula, the dataset (BostonHousing), and specify the modeling method as lm (linear regression).

lm_model <- train(formula, data = BostonHousing, method = "lm")

The result is stored in the lm_model object, and we can print it to the console for a comprehensive overview of the trained model.

print(lm_model)

When you run this code, you should see output similar to the following:

Linear Regression

506 samples
 13 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 506, 506, 506, 506, 506, 506, ...
Resampling results:

  RMSE      Rsquared   MAE
  4.843917  0.7243338  3.45056

Tuning parameter 'intercept' was held constant at a value of TRUE

This output provides valuable information about the trained linear regression model, including the Root Mean Squared Error (RMSE), R-squared, and Mean Absolute Error (MAE). These metrics are computed from the bootstrap resamples and estimate how well the model is likely to perform on unseen data.
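With the model trained, caret's predict method can be used to generate predictions on new data. The snippet below is a minimal sketch that simply reuses the first few rows of BostonHousing as stand-in new data; in practice you would predict on a held-out test set.

# Predict medv for the first few rows (illustration only)
lm_preds <- predict(lm_model, newdata = head(BostonHousing))
print(lm_preds)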

Implement the train() Function in R for Classification With Random Forests

Now, let’s explore classification using the famous iris dataset and a random forest model.

Our first step is to load the iris dataset into our R environment. This dataset contains measurements of sepal and petal dimensions for three different species of iris flowers.

data(iris)

Next, we define our classification formula using the ~ operator. Here, our goal is to predict the species of iris flowers (Species) based on their sepal and petal dimensions.

We express this as Species ~ ., indicating the utilization of all other available variables for prediction.

formula <- Species ~ .

Now, we employ the train function to train our random forest classification model. We provide the formula, the dataset (iris), and specify the modeling method as rf (random forest).

rf_model <- train(formula, data = iris, method = "rf")

Once the model is trained, we can inspect the details by printing the rf_model object. This provides us with valuable information about the random forest model, including the resampled accuracy for each candidate value of the mtry tuning parameter and the value that was ultimately selected.

print(rf_model)

You should observe an output similar to the following:

Random Forest

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
  2     0.9508815  0.9254858
  3     0.9513779  0.9262463
  4     0.9505901  0.9251102

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.

This output offers valuable information about the trained Random Forest classification model. It includes accuracy, kappa, and the optimal tuning parameter values determined during the training process.

Understanding these metrics is essential for assessing the model’s performance on the classification task.
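Beyond the printed summary, caret provides helpers for inspecting the fitted model. As a small sketch, varImp reports variable importance scores, and predict combined with confusionMatrix summarizes the predicted classes (here applied back to the training data purely for illustration).

# Variable importance for the random forest model
print(varImp(rf_model))

# Predicted species and a confusion matrix against the observed labels
rf_preds <- predict(rf_model, newdata = iris)
print(confusionMatrix(rf_preds, iris$Species))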

Implement the train() Function in R for Model Tuning With Cross-Validation

One of the strengths of the train function is its ability to perform hyperparameter tuning with cross-validation. Let’s demonstrate this using the mtcars dataset and a support vector machine (SVM).

Firstly, we load the mtcars dataset into our R environment. This dataset contains information about various car models, making it suitable for regression analysis.

data(mtcars)

We define our regression formula, specifying that we want to predict miles per gallon (mpg) based on all other available variables in the dataset.

formula <- mpg ~ .

To perform model tuning with cross-validation, we need to set up control parameters using the trainControl function. In this example, we choose repeated k-fold cross-validation with 4 folds (a repeats argument can also be supplied to control how many times the folds are repeated) and save the predictions for further analysis.

ctrl <- trainControl(
    method = "repeatedcv", number = 4, savePredictions = TRUE, verboseIter = TRUE,
    returnResamp = "all"
)

Now, we use the train function to train our SVM model with a radial kernel. We provide the regression formula, the dataset (mtcars), the modeling method as svmRadial, and the previously defined control parameters.

svm_model <- train(formula, data = mtcars, method = "svmRadial", trControl = ctrl)

Upon completion, we print the trained SVM model to the console. This output includes information about the tuning parameters, sigma (the radial kernel parameter) and C (the cost), as well as performance metrics across different folds and repetitions.

print(svm_model)
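The fitted object also stores the tuning results directly. As a final sketch, bestTune holds the sigma and C combination selected during cross-validation, and a hand-picked grid can be supplied via tuneGrid if you want to control exactly which values are searched (the grid values below are arbitrary examples, not recommendations).

# The sigma/C combination chosen by cross-validation
print(svm_model$bestTune)

# Optional: search a custom grid instead of caret's default candidates
grid <- expand.grid(sigma = c(0.05, 0.1, 0.2), C = c(0.5, 1, 2))
svm_tuned <- train(formula, data = mtcars, method = "svmRadial",
                   trControl = ctrl, tuneGrid = grid)
print(svm_tuned)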