Clever Algorithms
Please Note that this is an early access preview of the book.
Stepwise Regression
Stepwise Regression, Stepwise Selection, Stepwise Multiple Linear Regression.
Taxonomy
Stepwise Regression is a Model Selection method for selecting a regression model. It was originally proposed as an extension to Multiple Linear Regression called Stepwise Multiple Linear Regression. It has since been generalized to other regression models, such as Logistic Regression (Stepwise Logistic Regression). It is considered an improvement over Model Selection methods for Regression such as All Subsets, because it avoids enumerating every subset of variables, and is related to other Model Selection methods for Regression such as Ridge Regression. It is also related to Regularized Regression methods such as LASSO and the Least Angle Regression algorithm used to solve it.
Strategy
The information processing objective of the technique is to produce a regression model that minimizes a specified measure, hypothesis test or criterion, for example the F-tests of the original description, or more simply, as here, the Sum of Squared Residuals (SSR). The technique is a greedy hill-climbing algorithm that successively adds the one remaining variable that results in the largest reduction in the SSR and then refits the model. The procedure stops once no further improvement to the specified criterion can be achieved or there are no more variables to add to the model. This process alone is called Forward Model Selection. Conversely, the greedy algorithm may start with all variables and successively remove the one variable whose removal results in the smallest increase in the SSR. This approach alone is called Backward Model Selection. Stepwise Regression applies Forward Model Selection as well as Backward Model Selection, allowing variables that have become redundant through the addition of other variables to be removed from the model later in the process.
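The forward half of the strategy can be sketched directly in base R. This is a minimal illustration, not the book's listing: the function name forward_select and the absolute stopping threshold min_gain are hypothetical, and the SSR is used as the criterion in place of an F-test.

```r
# Greedy forward selection that minimizes the sum of squared residuals (SSR).
# forward_select and min_gain are hypothetical names used for illustration.
forward_select <- function(data, response, min_gain) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  # start from the intercept-only model
  best_ssr <- sum(resid(lm(reformulate("1", response), data = data))^2)
  while (length(candidates) > 0) {
    # SSR of the model obtained by adding each remaining candidate in turn
    ssr <- sapply(candidates, function(v) {
      fit <- lm(reformulate(c(selected, v), response), data = data)
      sum(resid(fit)^2)
    })
    # stop when the best candidate reduces the SSR by less than min_gain
    if (best_ssr - min(ssr) < min_gain) break
    best <- names(which.min(ssr))
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    best_ssr <- min(ssr)
  }
  selected
}

# example: y depends on x1 only; x2 and x3 are unrelated noise
set.seed(1)  # arbitrary seed, only for repeatability
d <- data.frame(x1 = runif(100, 0, 10),
                x2 = runif(100, 1, 2),
                x3 = runif(100, 2, 3))
d$y <- d$x1 + rnorm(100)
print(forward_select(d, "y", min_gain = 10))
```

Because adding a variable to a least-squares fit never increases the SSR, some threshold is needed for the loop to stop before every variable has been added. The value 10 used here is an arbitrary choice for this dataset.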
This procedure requires the specification of threshold parameters for the minimum change in the selected measure required to add a variable to the model (F-to-enter) and to remove a variable from it (F-to-remove). The Forward method is applied until it can no longer add any more variables, then the Backward method is applied. This sequence repeats until the model stabilizes. One may start with no variables and successively grow a model, called Forward Stepwise Regression, or start with all variables and successively prune the model, called Backward Stepwise Regression.
Heuristics
Code Listing
Listing (below) provides a code listing of the Stepwise Multiple Linear Regression method in R. Figure (below) provides a plot of the training dataset with the line of best fit highlighted.
The example uses the lm and step functions from the stats core package.
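The test problem is a four-dimensional dataset of 100 samples, where the dependent variable y is x1 plus Gaussian noise and the variables x2 and x3 are random and unrelated to y.
# define 4 variable regression problem of 100 samples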
regression_dataset <- function() {
  x1 <- runif(100, 0, 10) # explanatory variable
  x2 <- runif(100, 1, 2) # random, unrelated to y
  x3 <- runif(100, 2, 3) # random, unrelated to y
  y <- x1 + rnorm(100) # dependent on x1, plus Gaussian noise
  data.frame(x1, x2, x3, y) # return the dataset
}
# get the data
data <- regression_dataset()
# split data into train and test (67%/33%)
training_set <- sample(100,67)
train <- data[training_set,]
test <- data[-training_set,]
# create a linear regression model using ordinary least squares
base_model <- lm(
  y~x1+x2+x3, # predict y given x1, x2 and x3
  data=train, # training dataset
  subset=NULL, # use every row of the training dataset
  weights=NULL, # no weighting on the observations
  method="qr") # QR decomposition (efficient matrix calculation method)
# apply the Stepwise Regression procedure
selected_model <- step(
  base_model, # the model on which to operate
  scope=y~x1+x2+x3, # largest model that may be considered
  scale=0, # estimate the scale for the AIC statistic
  direction="both", # use forward and backward selection
  trace=1, # provide debug information during the execution
  keep=NULL, # no filter function for models
  steps=1000, # maximum steps to execute
  k=2) # Use AIC as the test criterion (use log(n) for BIC)
# summarize the selected linear regression model
summary(selected_model)
# display the selected variables
names(selected_model$model)
# regression model diagnostic plots with the training data
par(mfrow=c(2,2))
plot(selected_model)
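# plot the training data and the line of best fit
par(mfrow=c(1,1)) # return to a single plot panel
plot(y~x1, data=train) # y against x1, the variable y depends on
# draw the fitted line from the intercept and x1 coefficient (assumes x1 was selected)
abline(coef=coef(selected_model)[c("(Intercept)", "x1")], lty=5, col="red")
# make predictions for the test data
predictions <- predict(selected_model, newdata=test[,1:3])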
# compute mean squared error
print(mean((test$y-predictions)^2))
Example of Stepwise Multiple Linear Regression in R using the lm and step functions in the stats core package.
Download: stats_stepwise_linear_regression.R. Unit test available in the github project.
Plot 2D training dataset with the line of best fit.
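The listing selects variables with AIC, and its final comment notes that a penalty of log(n) gives BIC instead. As a rough sketch of how the criterion can be varied, the code below regenerates a dataset of the same shape, runs step with both penalties, and calls drop1 to show the F statistic for removing each variable, which is the kind of test used in the original description of the method. The seed and the names d, full, aic_model, bic_model and f_table are illustrative assumptions, not part of the original listing.

```r
# regenerate a dataset with the same shape as the listing's test problem
set.seed(42)  # arbitrary seed, only for repeatability
n <- 100
d <- data.frame(x1 = runif(n, 0, 10),
                x2 = runif(n, 1, 2),
                x3 = runif(n, 2, 3))
d$y <- d$x1 + rnorm(n)

# full model with all three candidate variables
full <- lm(y ~ x1 + x2 + x3, data = d)

# stepwise selection with the AIC penalty (k = 2)
aic_model <- step(full, scope = y ~ x1 + x2 + x3,
                  direction = "both", trace = 0, k = 2)

# stepwise selection with the BIC penalty (k = log(n))
bic_model <- step(full, scope = y ~ x1 + x2 + x3,
                  direction = "both", trace = 0, k = log(n))

print(formula(aic_model))
print(formula(bic_model))

# F statistic for removing each variable from the full model
f_table <- drop1(full, test = "F")
print(f_table)
```

The BIC penalty grows with the sample size, so it tends to favour smaller models than AIC. The drop1 table makes a single backward step visible, which can help when choosing a removal threshold.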
Other packages that provide stepwise selection include the
References
Primary Sources
Efroymson is credited with Stepwise Multiple Linear Regression, in which he used the significance of an F-test for the add and remove decisions [Efroymson1960]. Breaux provides an early and salient overview of Efroymson's Stepwise Multiple Regression in a technical report that promotes the method as computationally efficient compared with enumerating all subsets of regression models [Breaux1967].
More Information
Hocking provides an early and detailed review of forward and backward variable selection methods and of Stepwise Multiple Linear Regression [Hocking1976]. Miller provides an analysis of the convergence of Stepwise Regression [Miller1996]. Whittingham et al. provide a modern enumeration of the known problems with, and reasons against, using Stepwise Multiple Regression for modelling, and express concern about its commonplace use in the field of ecology [Whittingham2006]. Mundry and Nunn provide a similar review that focuses on the limitations of the method for statistical inference [Mundry2009]. Faraway provides a good treatment of stepwise procedures for regression with examples in R [Faraway2002] (page 125).
Please Note: This content was automatically generated from the book content and may contain minor differences.