Linear Regression is a statistical technique that correlates
the change in a variable (a series of data that recurs at fixed intervals) to
other variable/s. The representation of the relationship is called the linear
regression model. It is called linear because the relationship is linearly
additive. Below is an example of a linear regression model:
Where Y is the dependent
variable and x, z are independent variables. b and c are coefficient weights
that determine the strength of the relationship between the respective
independent variables and the dependent variable. Regression models try to
make the developed relationship as predictive of the data as possible by
selecting weights with which to combine variables such that the variance or
error between the predicted value and the actual value for each observation
in the series is minimized. This obviously means that the model will never
exactly predict actual values without error. In fact the modeling method
assumes that any normal occurence of data will have some random error
associated with it, the model only tries to minimize systematic error that
is present due to incomplete estimation or data biases.
Linear Regression Applications
Identifying Causal
Drivers of Demand: Linear Regression Models help in identifying the causal
factors that drive business demand and the proportional impact or contribution
of each factor. Linear Models can also
be used to forecast future business demand based on future values of the
causal factors (if they are known). Linear Model output also
helps in simulating the effect of changes in the causal factors on business
demand. One example is Marketing-Mix Modeling to measure ROI of Marketing
Investments. Linear Regression can be used to decompose total Sales in a
given time period into sales driven by different Marketing drivers in
addition to business and environmental drivers. A detailed Case Study
illustrating how the output of Linear Regression can be used to accomplish
this is available here.
Terminology
• Variables:
Variables are measurements of occurrences of a recurring event taken at
regular intervals or measurements of different instances of similar events
that can take on different possible values. E.g. the price of gasoline
recorded at monthly intervals, the height of children between the age of 6 and
10.
• Dependent Variable:
A variable whose value depends on the value of other variables in a model.
E.g. the price of corn oil, which is directly dependent on the price of corn.
• Independent
Variables: Variables whose value is not dependent on other variables in a
model. E.g. in the above example, the price of corn would be one of the
independent variables driving the price of corn oil. An independent variable
is specific to a model and a variable that is independent in one model can be
dependent in another.
• Model: A system
that represents the relationship between variables, both dependent and
independent.
Model Development Process
Model Identification:
Identifying a linear
regression model requires determining what the dependent variable is,
determining what independent variables should be included in the model and
what are the coefficient weights for each independent variable.
Coefficient weights are
identified by using a statistical method called Ordinary Least Squares (OLS),
which is available in all statistical packages and is also available in
Microsoft Excel in the Data Analysis suite.
Parameters
(Coefficients):
These are the
coefficient weights that measure the strength of the relationship between
independent and dependent variables. The two dimensions of coefficients are
direction and magnitude. The direction, indicated by the sign (+ve or –ve),
determine whether the dependent variable increases for an increase in the
independent variable (+ve) or decreases for an increase in the independent
variable (-ve). Generally the magnitude of coefficients can be compared only
if two independent variables have the same unit of measurement, otherwise the
variables need to be normalized to a standard scale to be compared
(statistical packages including Excel can directly output standardized
coefficients that can be compared to measure strength of the relationship
across different independent variables).
T-Statistics
T-Statistics aid in
determining whether an independent variable should be included in a model. A
variable is typically included in a model if it exceeds a pre-determined
threshold level or ‘critical value’. Thresholds are determined for different
levels of confidence. For e.g. to be 95% confident that a variable should be
included in a model, or in other words to tolerate only a 5% chance that a
variable doesn’t belong in a model, a T-statistic of greater than 1.98 (if the
coefficient is positive) or less than -1.98 (if the coefficient is negative)
is considered statistically significant.
Confidence T-Statistic
Critical Value
90% +/-1.66
95% +/-1.98
99% +/-2.63
Durbin-Watson
Statistic:
One peculiar feature of
data recorded over time, like monthly sales, is that it tends to be correlated
over time. For e.g. high sales months may be tend to be followed by high sales
months and low sales months by more low sales months. This may be caused
either by seasonal/cyclical trends or seasonal promotion, marketing or
competitive effects. Whatever the factor causing this correlation, correlated
errors violate one of the fundamental assumptions needed for least squares
regression- independence of errors or in other words random errors.
Durbin-Watson Statistic is a measure used to detect such correlations. Every
model has one measure for Durbin-Watson statistic. Durbin-Watson Statistic,
ranges in value from 0 to 4 with an ideal value of 2 indicating that errors
are not correlated (although values from 1.75 to 2.25 may be considered
acceptable). A value significantly below 2 indicates a positive correlation
and a value significantly greater than 2 suggests negative correlation. In
either case the model specification needs to be reviewed to identify variables
potentially omitted or redundant variables. Excel Regression package does not
estimate the Durbin-Watson statistic, but it can be easily calculated from the
output.
Multi-collinearity
Multi-collinearity is a
condition when independent variables included in a model are correlated with
each other. Multi-collinearity may result in redundant variables being
included in the model, which in itself is not such a terrible thing. The real
damage caused by multi-collinearity is that it causes large standard errors in
estimating the coefficients. In simpler terms it causes the estimated
t-statistics for correlated or multi-collinear variables to be insignificant,
thus resulting in significant variables to appear to be insignificant. Multi-collinearity
can be identified by the Variance Inflation factor (VIF), which is a statistic
calculated for each variable in a model. A VIF greater than 5 may suggest that
the concerned variable is multi-collinear with others in the model and may
need to be dropped. VIF cannot be calculated with the Excel Regression
package, it needs to be calculated using more sophisticated packages like SAS,
SPSS or S-Plus.
Multicollinearity can be controlled by
shrinkage techniques like Ridge Regressions, but a better strategy is to
combine collinear variables using techniques like Factor Analysis or
Principal Components Analysis.
R-Square
Any model is only as
good as it is able to predict the actual outcome with accuracy. R-Square is a
measure of how well the model is able to predict the changes in the actual
data. R-Square ranges between 0 and 1, with values over 0.5 indicating a good
fit between the predictions and actual data.
Mean Absolute Percent
Error (MAPE)
MAPE is a measure of how
high or low are the differences between the predictions and actual data. For
e.g. 15% MAPE means on average the predictions from a model will be 15% higher
or lower than actual.