The Method
Linear regression is used to determine how an outcome variable, called the dependent variable, linearly depends on a set of known variables, called the independent variables. The dependent variable is typically denoted by
- Check the significance of the coefficients, and remove insignificant independent variables if desired.
- Check the R² value of the model.
- Check the predictive ability of the model on out-of-sample data.
- Check for multicollinearity.
Linear Regression in R
Suppose your training data frame is called "TrainingData", your dependent variable is called "DependentVar", and you have two independent variables, called "IndependentVar1" and "IndependentVar2". Then you can build a linear regression model in R called "RegModel" with the following command:
RegModel = lm(DependentVar ~ IndependentVar1 + IndependentVar2, data = TrainingData)
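Once a model is fitted, its coefficients, their significance, and its R² can be inspected with summary(). The following is a minimal sketch, assuming the built-in mtcars data set as a stand-in for "TrainingData"; the variables mpg, wt, and hp and the names ModelSummary, CoefTable, PValues, and TrainR2 are illustrative choices, not part of the original example.

```r
# Stand-in data: the mtcars data set that ships with R (an assumption for illustration).
TrainingData = mtcars

# Fit mpg as a linear function of weight (wt) and horsepower (hp).
RegModel = lm(mpg ~ wt + hp, data = TrainingData)

# Full report: coefficients, standard errors, p-values, and R-squared.
ModelSummary = summary(RegModel)
print(ModelSummary)

# The coefficient table is a matrix; its "Pr(>|t|)" column holds the p-values.
CoefTable = coef(ModelSummary)
PValues = CoefTable[, "Pr(>|t|)"]
print(PValues)

# R-squared measured on the training data.
TrainR2 = ModelSummary$r.squared
print(TrainR2)
```

A variable with a large p-value is a candidate for removal, after which the model can be refitted and compared.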
To see the coefficients, their significance, and the R² value of the model, you can call summary(RegModel).
To check for multicollinearity, correlations can be computed with the cor() function:
cor(TrainingData$IndependentVar1, TrainingData$IndependentVar2)
cor(TrainingData)
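Note that cor() on a whole data frame only works when every column is numeric. As a hedged sketch, assuming the built-in iris data set as a stand-in (it has one factor column), the numeric columns can be selected first and highly correlated pairs listed. The 0.7 cutoff and the names NumericCols, CorMatrix, HighPairs, and HighCor are illustrative assumptions, not fixed rules.

```r
# Stand-in data with one non-numeric column (iris ships with R).
TrainingData = iris

# cor(TrainingData) would fail here because Species is a factor,
# so keep only the numeric columns first.
NumericCols = sapply(TrainingData, is.numeric)
CorMatrix = cor(TrainingData[, NumericCols])
print(round(CorMatrix, 2))

# List each pair of variables once (upper triangle only) whose absolute
# correlation exceeds the cutoff. The cutoff is an illustrative choice.
Cutoff = 0.7
HighPairs = which(abs(CorMatrix) > Cutoff & upper.tri(CorMatrix), arr.ind = TRUE)
HighCor = data.frame(
  Var1 = rownames(CorMatrix)[HighPairs[, "row"]],
  Var2 = colnames(CorMatrix)[HighPairs[, "col"]],
  Correlation = CorMatrix[HighPairs],
  row.names = NULL
)
print(HighCor)
```

If two independent variables are highly correlated, one option is to drop one of them and refit the model.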
If your out-of-sample data, or test set, is called "TestData", you can compute test set predictions and the test set R² with the following commands:
TestPredictions = predict(RegModel, newdata=TestData)
SSE = sum((TestData$DependentVar - TestPredictions)^2)
SST = sum((TestData$DependentVar - mean(TrainingData$DependentVar))^2)
Rsquared = 1 - SSE/SST
In a nutshell, the test set R² compares two errors on the test data: SSE is the error of the model's predictions, and SST is the error of a baseline that always predicts the mean of the dependent variable in the training data.
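The calculation above can be wrapped in a small function so that it is easy to reuse across models. This is only a sketch: OutOfSampleR2 is a hypothetical helper name, and the mtcars split below is an assumed stand-in for real training and test sets.

```r
# Hypothetical helper: out-of-sample R-squared, using the training mean as baseline.
OutOfSampleR2 = function(Model, TrainData, TestData, Outcome) {
  Predictions = predict(Model, newdata = TestData)
  SSE = sum((TestData[[Outcome]] - Predictions)^2)               # model error on the test set
  SST = sum((TestData[[Outcome]] - mean(TrainData[[Outcome]]))^2) # baseline error on the test set
  1 - SSE / SST
}

# Stand-in data: split the built-in mtcars data into training and test sets.
set.seed(1)
TrainRows = sample(nrow(mtcars), 22)
TrainingData = mtcars[TrainRows, ]
TestData = mtcars[-TrainRows, ]

RegModel = lm(mpg ~ wt + hp, data = TrainingData)
TestR2 = OutOfSampleR2(RegModel, TrainingData, TestData, "mpg")
print(TestR2)
```

Unlike the training set R², the test set R² can be negative: that happens when the model predicts the test data worse than the training-mean baseline does.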
Tips and Tricks
Quick tip on getting linear regression predictions in R, posted by HamsterHuey (this post is about Unit 2, Lecture 1, Video 4: Linear Regression in R)
Suppose you have a linear regression model in R as shown in the lectures:
RunsReg = lm(RS ~ OBP + SLG, data=moneyball)
Then, if you need to calculate the predicted runs scored for a single entity with, for example, OBP = 0.4 and SLG = 0.5, you can easily calculate it as follows:
predict(RunsReg, data.frame(OBP=0.4, SLG=0.5))
For a sequence of players/teams you can do the following:
predict(RunsReg, data.frame(OBP=c(0.4, 0.45, 0.5), SLG=c(0.5, 0.45, 0.4)))
Sure beats having to manually extract coefficients and then calculate the predicted value each time (although it is important to understand the underlying form of the linear regression equation).
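To make the underlying equation concrete, here is a rough sketch that computes one prediction by hand from the coefficients and checks it against predict(). It assumes the built-in mtcars data as a stand-in, since the moneyball data does not ship with R; the names Model, NewCar, ViaPredict, and ByHand are illustrative.

```r
# Stand-in model on mtcars (the moneyball data is not bundled with R).
Model = lm(mpg ~ wt + hp, data = mtcars)
NewCar = data.frame(wt = 3.0, hp = 110)

# Prediction via predict().
ViaPredict = predict(Model, newdata = NewCar)

# The same prediction by hand:
# intercept + coefficient * value, summed over the independent variables.
B = coef(Model)
ByHand = B["(Intercept)"] + B["wt"] * NewCar$wt + B["hp"] * NewCar$hp

# Both routes agree up to floating-point rounding.
print(all.equal(unname(ViaPredict), unname(ByHand)))
```

The same pattern applies to RunsReg: the prediction is the intercept plus the OBP coefficient times OBP plus the SLG coefficient times SLG.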