The Method
Linear regression is used to determine how an outcome variable, called the dependent variable, linearly depends on a set of known variables, called the independent variables.
Once a model has been built:
- Check the significance of the coefficients, and remove insignificant independent variables if desired.
- Check the R² value of the model.
- Check the predictive ability of the model on out-of-sample data.
- Check for multicollinearity.
Linear Regression in R
Suppose your training data frame is called "TrainingData", your dependent variable is called "DependentVar", and you have two independent variables, called "IndependentVar1" and "IndependentVar2". Then you can build a linear regression model in R called "RegModel" with the following command:
RegModel = lm(DependentVar ~ IndependentVar1 + IndependentVar2, data = TrainingData)
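To see a summary of the model, including the coefficients, their significance, and the R² value, use the summary function:
summary(RegModel)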
To check for multicollinearity, correlations can be computed with the cor() function:
cor(TrainingData$IndependentVar1, TrainingData$IndependentVar2)
cor(TrainingData)
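Note that cor() applied to a whole data frame expects every column to be numeric. The sketch below shows one way to handle a data frame that also has a non-numeric column. The data values and the "Label" column are made up for illustration and are not part of the original example.

```r
# Hypothetical training data with one non-numeric column (values are made up).
TrainingData = data.frame(
  DependentVar    = c(10, 12, 15, 19, 24),
  IndependentVar1 = c(1, 2, 3, 4, 5),
  IndependentVar2 = c(2.1, 3.9, 6.2, 8.0, 9.8),
  Label           = c("a", "b", "c", "d", "e")
)

# Keep only the numeric columns, since cor() needs numeric input.
NumericCols = sapply(TrainingData, is.numeric)
CorMatrix = cor(TrainingData[, NumericCols])

# Correlation between the two independent variables; a value close to
# 1 or -1 suggests multicollinearity.
CorMatrix["IndependentVar1", "IndependentVar2"]
```

In this made-up data, IndependentVar2 is roughly twice IndependentVar1, so the two are almost perfectly correlated and only one of them would normally be kept in the model.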
If your out-of-sample data, or test set, is called "TestData", you can compute test set predictions and the test set R² as follows:
TestPredictions = predict(RegModel, newdata=TestData)
SSE = sum((TestData$DependentVar - TestPredictions)^2)
SST = sum((TestData$DependentVar - mean(TrainingData$DependentVar))^2)
Rsquared = 1 - SSE/SST
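In a nutshell, the test set R² is a three-way comparison between the test data, the model's predictions, and the training data. SSE measures how far the test data is from the model's predictions, and SST measures how far the test data is from the mean of the training data.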
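The same computation can be written as formulas. The symbols are notation assumed here for illustration: y_i are the n test set values of the dependent variable, \hat{y}_i are the model's predictions for them, and \bar{y}_{\text{train}} is the mean of the dependent variable in the training set.

```latex
\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad
\mathrm{SST} = \sum_{i=1}^{n} \left( y_i - \bar{y}_{\text{train}} \right)^2, \qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
```

Because SST uses the training mean, the test set R² can be negative when the model predicts the test data worse than that baseline does.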
Tips and Tricks
Quick tip on getting linear regression predictions in R posted by HamsterHuey (this post is about Unit 2 / Unit 2, Lecture 1, Video 4: Linear Regression in R)
Suppose you have a linear regression model in R as shown in the lectures:
RunsReg = lm(RS ~ OBP + SLG, data=moneyball)
Then, if you need to calculate the predicted runs scored for a single entity with (for example) OBP = 0.4 and SLG = 0.5, you can easily calculate it as follows:
predict(RunsReg, data.frame(OBP=0.4, SLG=0.5))
For a sequence of players/teams you can do the following:
predict(RunsReg, data.frame(OBP=c(0.4, 0.45, 0.5), SLG=c(0.5, 0.45, 0.4)))
Sure beats having to manually extract coefficients and then calculate the predicted value each time (although it is important to understand the underlying form of the linear regression equation).
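To make the underlying equation concrete, here is a minimal sketch that computes one prediction by hand from the coefficients and compares it with predict(). The lecture's moneyball data is not reproduced here, so the data frame below is a synthetic stand-in with made-up values; only the column names RS, OBP and SLG follow the example above.

```r
# Synthetic stand-in for the moneyball data (values are made up for illustration).
set.seed(1)
moneyball = data.frame(OBP = runif(50, 0.28, 0.38), SLG = runif(50, 0.35, 0.50))
moneyball$RS = -800 + 2700 * moneyball$OBP + 1600 * moneyball$SLG + rnorm(50, sd = 20)

RunsReg = lm(RS ~ OBP + SLG, data = moneyball)

# Manual prediction: intercept + coefficient * value for each variable.
b = coef(RunsReg)
ManualPrediction = b["(Intercept)"] + b["OBP"] * 0.4 + b["SLG"] * 0.5

# The same prediction using predict().
AutoPrediction = predict(RunsReg, data.frame(OBP = 0.4, SLG = 0.5))

# Both approaches agree (names are dropped before comparing).
all.equal(unname(ManualPrediction), unname(AutoPrediction))
```

The manual route grows with every variable added to the model, which is why predict() is the more convenient choice once the equation itself is understood.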