
Linear Regression Models

Linear Regression is a statistical technique that relates the change in one variable (a series of measurements, often recorded at fixed intervals) to changes in one or more other variables. The representation of this relationship is called the linear regression model. It is called linear because the relationship is linearly additive. Below is an example of a linear regression model:

Y = a + bx + cz

Where Y is the dependent variable, x and z are independent variables, and a is the intercept (constant term). b and c are coefficient weights that determine the strength of the relationship between the respective independent variables and the dependent variable. Regression models try to make the developed relationship as predictive of the data as possible by selecting weights with which to combine variables such that the error between the predicted value and the actual value for each observation in the series is minimized. This obviously means that the model will never exactly predict actual values without error. In fact, the modeling method assumes that any normal occurrence of data will have some random error associated with it; the model only tries to minimize the systematic error that is present due to incomplete estimation or data biases.
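To make the notation concrete, the short sketch below (a minimal illustration using Python and NumPy; the coefficients and data are hypothetical, not taken from any real model) evaluates the relationship Y = a + bx + cz for a few observations and shows the residual errors that the regression aims to keep as small as possible overall.

```python
import numpy as np

# Hypothetical intercept and coefficient weights for Y = a + b*x + c*z
a, b, c = 10.0, 2.5, -1.2

# A few observations of the independent variables x and z
x = np.array([1.0, 2.0, 3.0, 4.0])
z = np.array([0.5, 1.5, 1.0, 2.0])

# Actual observed values of the dependent variable Y
y_actual = np.array([14.1, 13.5, 16.9, 17.2])

# Predicted values from the linear (additive) relationship
y_pred = a + b * x + c * z

# The model never predicts the actual values exactly; these residual
# errors are what the regression tries to minimize in aggregate
errors = y_actual - y_pred
print(y_pred)    # predicted Y for each observation
print(errors)    # residual error for each observation
```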

Linear Regression Applications

Identifying Causal Drivers of Demand: Linear Regression Models help in identifying the causal factors that drive business demand and the proportional impact or contribution of each factor. Linear Models can also be used to forecast future business demand based on future values of the causal factors (if they are known). Linear Model output also helps in simulating the effect of changes in the causal factors on business demand. One example is Marketing-Mix Modeling to measure the ROI of Marketing Investments: Linear Regression can be used to decompose total Sales in a given time period into sales driven by different Marketing drivers in addition to business and environmental drivers, as sketched below. A detailed Case Study illustrating how the output of Linear Regression can be used to accomplish this is available here.
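As a rough illustration of the decomposition idea (the driver names, coefficients and figures below are entirely hypothetical, not drawn from the Case Study), once a model such as Sales = baseline + b1*TV + b2*Promotion has been fitted, each driver's contribution in a period is simply its coefficient multiplied by its value in that period:

```python
# Hypothetical fitted coefficients from a marketing-mix regression
intercept = 500.0                                  # baseline sales
coefs = {"tv_spend": 0.8, "promo_depth": 120.0}

# Driver values for one period (hypothetical)
drivers = {"tv_spend": 300.0, "promo_depth": 1.5}

# Contribution of each driver = coefficient * driver value
contributions = {name: coefs[name] * value for name, value in drivers.items()}
total_sales = intercept + sum(contributions.values())

print(contributions)   # {'tv_spend': 240.0, 'promo_depth': 180.0}
print(total_sales)     # 920.0 = baseline plus driver contributions
```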

Terminology

• Variables: Variables are measurements of occurrences of a recurring event taken at regular intervals, or measurements of different instances of similar events, that can take on different possible values. E.g. the price of gasoline recorded at monthly intervals, or the height of children between the ages of 6 and 10.

• Dependent Variable: A variable whose value depends on the value of other variables in a model. E.g. the price of corn oil, which is directly dependent on the price of corn.

• Independent Variables: Variables whose values are not dependent on other variables in a model. E.g. in the above example, the price of corn would be one of the independent variables driving the price of corn oil. Independence is specific to a model: a variable that is independent in one model can be dependent in another.

• Model: A system that represents the relationship between variables, both dependent and independent.

Model Development Process

Model Identification:

Identifying a linear regression model requires determining the dependent variable, the independent variables that should be included in the model, and the coefficient weights for each independent variable.

Coefficient weights are identified using a statistical method called Ordinary Least Squares (OLS), which is available in all statistical packages and in Microsoft Excel through the Data Analysis ToolPak.
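For readers who prefer a programmatic route, the sketch below shows one way OLS estimates can be obtained (here with Python and NumPy's least-squares solver; the data are made up for illustration):

```python
import numpy as np

# Hypothetical observations of two independent variables and Y
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
z = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([9.6, 13.7, 13.1, 17.0, 15.9, 20.2, 19.3, 23.6])

# Design matrix: a column of ones estimates the intercept a
X = np.column_stack([np.ones_like(x), x, z])

# Ordinary Least Squares: coefficients that minimize the squared error
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b_hat, c_hat = coeffs
print(a_hat, b_hat, c_hat)   # estimates of the weights a, b and c
```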

Parameters (Coefficients):

These are the coefficient weights that measure the strength of the relationship between independent and dependent variables. The two dimensions of a coefficient are direction and magnitude. The direction, indicated by the sign (+ve or -ve), determines whether the dependent variable increases for an increase in the independent variable (+ve) or decreases for an increase in the independent variable (-ve). Generally, the magnitudes of coefficients can be compared only if the two independent variables have the same unit of measurement; otherwise the variables need to be normalized to a standard scale before they can be compared (statistical packages, including Excel, can directly output standardized coefficients that can be compared to measure the strength of the relationship across different independent variables).
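One common way to obtain comparable standardized coefficients, sketched below on made-up data (the variable names are hypothetical), is to rescale every variable to mean 0 and standard deviation 1 before fitting, so that each coefficient is expressed in standard-deviation units:

```python
import numpy as np

def standardize(v):
    # Rescale a variable to mean 0 and standard deviation 1 (z-score)
    return (v - v.mean()) / v.std()

# Hypothetical raw variables measured in very different units
price = np.array([2.1, 2.4, 2.2, 2.8, 3.0, 2.6])        # dollars per unit
ad_spend = np.array([120, 150, 130, 180, 200, 160.0])   # thousands of dollars
sales = np.array([510, 540, 525, 570, 600, 555.0])      # units sold

# Standardize everything; the intercept drops out of the standardized fit
X = np.column_stack([standardize(price), standardize(ad_spend)])
y = standardize(sales)
beta_std, *_ = np.linalg.lstsq(X, y, rcond=None)

# These standardized coefficients are directly comparable across variables
print(beta_std)
```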

T-Statistics

T-Statistics aid in determining whether an independent variable should be included in a model. A variable is typically included if its T-statistic exceeds a pre-determined threshold level or ‘critical value’. Thresholds are determined for different levels of confidence. For example, to be 95% confident that a variable belongs in a model, or in other words to tolerate only a 5% chance that it does not, a T-statistic greater than 1.98 (if the coefficient is positive) or less than -1.98 (if the coefficient is negative) is considered statistically significant, as illustrated in the sketch following the table below.

Confidence     T-Statistic Critical Value
90%            +/-1.66
95%            +/-1.98
99%            +/-2.63
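As a concrete (hypothetical) illustration of applying these critical values, the sketch below uses Python with the statsmodels package, assuming it is installed, to fit a small model and flag which coefficients clear the 95% threshold of +/-1.98:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: x genuinely drives y, z does not
rng = np.random.default_rng(0)
x = rng.normal(size=100)
z = rng.normal(size=100)
y = 10 + 2.5 * x + rng.normal(scale=1.0, size=100)

X = sm.add_constant(np.column_stack([x, z]))     # add the intercept column
fit = sm.OLS(y, X).fit()

# Keep variables whose T-statistic exceeds the 95% critical value
critical = 1.98
for name, t in zip(["intercept", "x", "z"], fit.tvalues):
    print(f"{name}: t = {t:.2f}, include at 95% confidence? {abs(t) > critical}")
```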

Durbin-Watson Statistic:

One peculiar feature of data recorded over time, like monthly sales, is that it tends to be correlated over time. For example, high-sales months may tend to be followed by high-sales months and low-sales months by more low-sales months. This may be caused by seasonal/cyclical trends or by seasonal promotion, marketing or competitive effects. Whatever the factor causing this correlation, correlated errors violate one of the fundamental assumptions needed for least squares regression: independence of errors, or in other words random errors. The Durbin-Watson statistic is a measure used to detect such correlations; every model has a single Durbin-Watson statistic. It ranges in value from 0 to 4, with an ideal value of 2 indicating that errors are not correlated (although values from 1.75 to 2.25 may be considered acceptable). A value significantly below 2 indicates a positive correlation and a value significantly above 2 suggests a negative correlation. In either case the model specification needs to be reviewed to identify potentially omitted or redundant variables. The Excel Regression package does not estimate the Durbin-Watson statistic, but it can easily be calculated from the output.
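The calculation from regression output is straightforward: the statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch (in Python, on hypothetical residuals) follows:

```python
import numpy as np

def durbin_watson(residuals):
    # Sum of squared successive differences divided by sum of squared residuals
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residuals taken from a fitted regression's output
residuals = [1.2, 0.8, 1.1, -0.4, -0.9, -0.7, 0.3, 0.6]
print(durbin_watson(residuals))   # values near 2 suggest uncorrelated errors
```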

Multi-collinearity

Multi-collinearity is a condition in which independent variables included in a model are correlated with each other. Multi-collinearity may result in redundant variables being included in the model, which in itself is not such a terrible thing. The real damage caused by multi-collinearity is that it inflates the standard errors of the estimated coefficients. In simpler terms, it causes the estimated t-statistics for correlated or multi-collinear variables to be insignificant, making genuinely significant variables appear insignificant. Multi-collinearity can be identified by the Variance Inflation Factor (VIF), a statistic calculated for each independent variable in a model. A VIF greater than 5 may suggest that the concerned variable is multi-collinear with others in the model and may need to be dropped. VIF cannot be calculated with the Excel Regression package; it needs to be calculated using more sophisticated packages like SAS, SPSS or S-Plus.
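For readers without those packages, VIF can also be computed directly: each independent variable is regressed on the remaining independent variables, and VIF = 1 / (1 - R²) of that auxiliary regression. The sketch below (Python/NumPy, on deliberately correlated made-up data) illustrates this:

```python
import numpy as np

def vif(X):
    # X: one column per independent variable
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y_j = X[:, j]                               # variable being checked
        others = np.delete(X, j, axis=1)            # remaining variables
        A = np.column_stack([np.ones(n), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(A, y_j, rcond=None)
        resid = y_j - A @ coef
        r2 = 1 - resid.var() / y_j.var()            # R-square of the auxiliary fit
        out.append(1.0 / (1.0 - r2))
    return out

# Hypothetical drivers: x2 is nearly collinear with x1, x3 is unrelated
rng = np.random.default_rng(2)
x1 = np.arange(10, dtype=float)
x2 = 0.9 * x1 + rng.normal(0, 0.3, 10)
x3 = rng.normal(size=10)
print(vif(np.column_stack([x1, x2, x3])))   # large values flag collinear variables
```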

Multi-collinearity can be controlled by shrinkage techniques like Ridge Regression, but a better strategy is to combine collinear variables using techniques like Factor Analysis or Principal Components Analysis, as sketched below.
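Both routes are readily available in standard libraries; the sketch below (using Python's scikit-learn, on made-up collinear drivers) shows a ridge fit alongside the alternative of collapsing the collinear pair into a single principal component:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA

# Hypothetical, nearly collinear marketing drivers (e.g. TV GRPs and TV spend)
rng = np.random.default_rng(1)
tv_grps = rng.normal(100, 10, 50)
tv_spend = 2.0 * tv_grps + rng.normal(0, 2, 50)
X = np.column_stack([tv_grps, tv_spend])
y = 50 + 0.4 * tv_grps + rng.normal(0, 5, 50)

# Option 1: ridge regression shrinks the coefficients of collinear variables
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)

# Option 2: combine the collinear pair into one principal component,
# then use that single factor as the independent variable instead
tv_factor = PCA(n_components=1).fit_transform(X)
print(tv_factor[:5])
```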

R-Square

Any model is only as good as its ability to predict the actual outcome accurately. R-Square is a measure of how well the model predicts the changes in the actual data. R-Square ranges between 0 and 1, with higher values indicating a better fit; values over 0.5 generally indicate a good fit between the predictions and the actual data.
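R-Square can be computed directly from predictions and actuals as one minus the ratio of unexplained to total variation. A minimal sketch (Python, made-up numbers) follows:

```python
import numpy as np

def r_square(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)        # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)    # total variation
    return 1 - ss_res / ss_tot

actual = [100, 110, 95, 120, 130]
predicted = [102, 108, 98, 118, 127]
print(r_square(actual, predicted))   # ~0.96: predictions track the data closely
```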

Mean Absolute Percent Error (MAPE)

MAPE is a measure of how large, on average, the differences between the predictions and the actual data are, expressed as a percentage of the actual values. For example, a MAPE of 15% means that on average the predictions from a model will be 15% higher or lower than the actual values.
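A minimal sketch of the calculation (Python, made-up numbers) is shown below; each error is taken as a percentage of the corresponding actual value and the absolute percentages are averaged:

```python
import numpy as np

def mape(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Average absolute error as a percentage of the actual values
    return np.mean(np.abs((actual - predicted) / actual)) * 100

actual = [100, 110, 95, 120, 130]
predicted = [85, 125, 100, 110, 150]
print(mape(actual, predicted))   # ~11.5%: predictions are off by about 11.5% on average
```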

 © Copyright 2010 Joy V Joseph. All Rights Reserved.