## The Method

Linear regression is used to determine how an outcome variable, called the dependent variable, linearly depends on a set of known variables, called the independent variables. The dependent variable is typically denoted by y, and the independent variables are denoted by x1, x2, ..., xk, where k is the number of different independent variables. We are interested in finding the best possible coefficients β0, β1, β2, ..., βk such that our predicted values:

ŷ = β0 + β1x1 + β2x2 + ... + βkxk

are as close as possible to the actual y values. This is achieved by minimizing the sum of the squared differences between the actual values, y, and the predictions, ŷ. These differences, (y - ŷ), are often called error terms or residuals. Once you have constructed a linear regression model, it is important to evaluate it by going through the following steps:
• Check the significance of the coefficients, and remove insignificant independent variables if desired.
• Check the R2 value of the model.
• Check the predictive ability of the model on out-of-sample data.
• Check for multicollinearity.
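The prediction formula and residuals defined above can be checked by hand on a small made-up dataset (the variable names and numbers below are purely illustrative, not from the course):

```r
# Tiny illustrative dataset (made-up numbers)
toy = data.frame(x1 = c(1, 2, 3, 4, 5),
                 x2 = c(2, 1, 4, 3, 5),
                 y  = c(3, 4, 8, 9, 13))

toyModel = lm(y ~ x1 + x2, data = toy)

# Rebuild the predictions yhat = b0 + b1*x1 + b2*x2 from the coefficients
b    = coef(toyModel)
yhat = b[1] + b[2]*toy$x1 + b[3]*toy$x2

# They match the model's fitted values, and the residuals are y - yhat
all.equal(as.numeric(yhat), as.numeric(fitted(toyModel)))
res = toy$y - yhat
```

Among all possible coefficient choices, the ones lm() returns make sum(res^2) as small as possible.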

## Linear Regression in R

Suppose your training data frame is called "TrainingData", your dependent variable is called "DependentVar", and you have two independent variables, called "IndependentVar1" and "IndependentVar2". Then you can build a linear regression model in R called "RegModel" with the following command:

RegModel = lm(DependentVar ~ IndependentVar1 + IndependentVar2, data = TrainingData)
 
To see the R2 of the model, the coefficients, and the significance of the coefficients, you can use the summary function:

summary(RegModel)
 
To check for multicollinearity, pairwise correlations can be computed with the cor() function (note that calling cor() on an entire data frame requires all of its columns to be numeric):

cor(TrainingData$IndependentVar1, TrainingData$IndependentVar2)
cor(TrainingData)
 
If your out-of-sample data, or test set, is called "TestData", you can compute test set predictions and the test set R2 with the following commands:

TestPredictions = predict(RegModel, newdata=TestData)
SSE = sum((TestData$DependentVar - TestPredictions)^2)
SST = sum((TestData$DependentVar - mean(TrainingData$DependentVar))^2)
Rsquared = 1 - SSE/SST

In a nutshell, the test set R2 compares the model against a simple baseline: SSE measures how far the test data falls from the model's predictions, while SST measures how far the test data falls from the mean of the training set's dependent variable (the prediction a baseline model would always make).
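This baseline comparison can be made concrete with a small made-up train/test example (the variable names and numbers here are illustrative only):

```r
# Illustrative train/test split (made-up numbers)
train = data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 5, 8))
test  = data.frame(x = c(5, 6),       y = c(9, 12))

model = lm(y ~ x, data = train)
pred  = predict(model, newdata = test)

SSE = sum((test$y - pred)^2)            # model error on the test set
SST = sum((test$y - mean(train$y))^2)   # baseline error: always predict the training mean
Rsquared = 1 - SSE/SST                  # close to 1 when the model beats the baseline
```

A test set R2 near 1 means the model's predictions are much closer to the test data than the training mean is; a value near 0 (or negative) means the model does no better than the baseline.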


## Tips and Tricks

Quick tip on getting linear regression predictions in R, posted by HamsterHuey (this post is about Unit 2, Lecture 1, Video 4: Linear Regression in R).
Suppose you have a linear regression model in R as shown in the lectures:

RunsReg = lm(RS ~ OBP + SLG, data=moneyball)
 
Then, if you need to calculate the predicted Runs scored for a single entity with (for example) OBP = 0.4, SLG = 0.5, you can easily calculate it as follows:

predict(RunsReg, data.frame(OBP=0.4, SLG=0.5))
 
For a sequence of players/teams you can do the following:

predict(RunsReg, data.frame(OBP=c(0.4, 0.45, 0.5), SLG=c(0.5, 0.45, 0.4)))
 
This sure beats having to manually extract the coefficients and calculate the predicted value each time (although it is still important to understand the underlying form of the linear regression equation).
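For reference, here is what the manual calculation that predict() replaces looks like. Since the moneyball data frame is not included in these notes, the sketch below fits the same formula to a few made-up rows; the numbers are illustrative only:

```r
# Illustrative stand-in for the moneyball data (made-up numbers)
df = data.frame(OBP = c(0.30, 0.35, 0.40, 0.32),
                SLG = c(0.40, 0.48, 0.50, 0.41),
                RS  = c(700, 790, 850, 720))
RunsReg = lm(RS ~ OBP + SLG, data = df)

# Extract the named coefficient vector and plug in OBP = 0.4, SLG = 0.5 by hand
b      = coef(RunsReg)
manual = b["(Intercept)"] + b["OBP"]*0.4 + b["SLG"]*0.5

# The same value comes from predict()
auto = predict(RunsReg, data.frame(OBP = 0.4, SLG = 0.5))
```

The two values agree because predict() does exactly this plug-in calculation for each row of the new data frame.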