Thursday, October 8, 2009

Multiple Regression Model - High Correlations like Dividing by Zero

In multiple regression models, statisticians build formulas that combine several factors in order to predict a value. The example used in class is the popular model real estate agents use for market valuation of houses. A typical model will incorporate things like the number of bedrooms, whether there is a basement, square footage, home assessment value, etc. However, you can very quickly see that some of these factors are 'related' (a house with many bedrooms, a basement or a high home assessment value will probably have more square footage).

In class, we were shown one formula that used only AREA to predict market value and a second that used both AREA and ASSESS (home tax assessment value). Intuitively, you can tell that AREA and ASSESS are positively correlated to some degree, and this was reflected in the fitted formulas.

MARKET1 = X + a AREA + e
MARKET2 = Y + b AREA + c ASSESS + e

Where X and Y are the intercepts, a, b and c are the fitted weights, and 'e' is the error.

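In the two fitted formulas, b was less than a. What does that mean? In the first model, 'a' has to carry everything AREA can explain, including the part of the value it shares with ASSESS. Once ASSESS is added, some of that shared ('correlated') value is picked up by 'c' instead, so the weight on AREA shrinks.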
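To see the shrinking weight with numbers, here is a minimal sketch on made-up data (not the class data set). The units, the sample size and the names `ols` and `delta` are my own assumptions for illustration. It fits both formulas by ordinary least squares and also checks that, within the sample, a equals b plus c times the weight you get when predicting ASSESS from AREA.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Made-up data (an assumption, not the class data set). AREA is in thousands
# of square feet; ASSESS and MARKET are in thousands of dollars.
area = rng.uniform(0.8, 3.0, n)
assess = 100 * area + rng.normal(0, 20, n)      # correlated with AREA
market = 50 + 60 * area + 0.5 * assess + rng.normal(0, 10, n)

def ols(y, *columns):
    """Least-squares fit with an intercept; returns [intercept, weights...]."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

X, a = ols(market, area)               # MARKET1 = X + a AREA
Y, b, c = ols(market, area, assess)    # MARKET2 = Y + b AREA + c ASSESS
_, delta = ols(assess, area)           # weight on AREA when predicting ASSESS

print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.3f}")
print(f"b + c * delta = {b + c * delta:.2f}  (matches a)")
```

With positively correlated factors and a positive 'c', the term c * delta is positive, which is why b comes out smaller than a.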

But what if they are VERY highly correlated? If you deliberately choose one factor that is a linear construction of another (for a correlation of 1), it becomes impossible to give each factor its own weight, as the two move in perfect harmony. Imagine AREA is perfectly correlated with ASSESS, or

ASSESS = d AREA

then substituting d AREA for ASSESS in the second model gives
MARKET2 = Y + b AREA + cd AREA + e
MARKET2 = Y + (b + cd) AREA + e
Since there is only one 'real' factor, this would mean that:
  • MARKET1 = MARKET2, the second model would be identical to the first model.
  • 'a' would be expressed as (b + cd)
  • X = Y
Another problem is that you can allocate b and c in any proportion you want to create MARKET2, as long as b + cd adds up to a (there is no meaningful way to split the weight between b and cd). It's a very awkward solution and reminds me of dividing by zero in math problems: things break down as a quantity approaches a limiting value, here as the correlation approaches 1. Mathematically, I believe it would be described as one factor having no component orthogonal to the other, which leaves the fit with no unique solution.
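The perfectly correlated case can be sketched the same way. This is only an illustration under assumed numbers (d = 100, made-up AREA and MARKET values, and a hypothetical helper `predict`). It shows that the three-column design matrix has rank 2 rather than 3, and that very different pairs of b and c give the same predictions whenever b + cd equals a.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
d = 100.0                                   # assumed: ASSESS = d * AREA exactly

# Made-up data (an assumption for illustration only).
area = rng.uniform(0.8, 3.0, n)
assess = d * area
market = 50 + 110 * area + rng.normal(0, 10, n)

# Columns: intercept, AREA, ASSESS. The third is a multiple of the second.
design = np.column_stack([np.ones(n), area, assess])
rank = np.linalg.matrix_rank(design)
print("rank of design matrix:", rank)

# One-factor fit: MARKET1 = X + a AREA
coef1, *_ = np.linalg.lstsq(design[:, :2], market, rcond=None)
X, a = coef1

def predict(Y, b, c):
    """MARKET2 = Y + b AREA + c ASSESS for given weights."""
    return Y + b * area + c * assess

# Three different allocations, all with b + c*d == a.
pred_all_area = predict(X, a, 0.0)
pred_all_assess = predict(X, 0.0, a / d)
pred_mixed = predict(X, a - 500.0, 500.0 / d)

print(np.allclose(pred_all_area, pred_all_assess))
print(np.allclose(pred_all_area, pred_mixed))
```

The rank deficiency is where the divide-by-zero feeling comes from: the usual least-squares formula needs the inverse of a matrix built from these columns, and with a redundant column that matrix is singular.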

Another interesting result is that if 'b' is not significantly different from 'a', it suggests that the correlation between AREA and ASSESS is probably low (or that ASSESS adds little to the prediction).

Also, if any of the 'weight' letters (except for 'e') is close to zero, it suggests that the prediction factor doesn't have much bearing on the model, but high numbers don't necessarily imply importance (it really matters what units the factor variables are measured in, as that changes the size of the weight relative to the final prediction).
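The point about units can be checked with a small sketch, again on made-up data with an assumed helper `fit`. Measuring AREA in thousands of square feet instead of square feet makes its weight 1000 times larger, yet the predictions do not change at all.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Made-up data (an assumption): AREA in square feet, MARKET in thousands of dollars.
area_sqft = rng.uniform(800, 3000, n)
market = 50 + 0.11 * area_sqft + rng.normal(0, 10, n)

def fit(x, y):
    """Least-squares fit of y = intercept + weight * x; returns (coef, predictions)."""
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

coef_sqft, pred_sqft = fit(area_sqft, market)
coef_ksqft, pred_ksqft = fit(area_sqft / 1000.0, market)   # same data, new unit

print("weight per square foot:          ", coef_sqft[1])
print("weight per thousand square feet: ", coef_ksqft[1])
print("same predictions:", np.allclose(pred_sqft, pred_ksqft))
```

So the raw size of a weight says little by itself; it has to be read together with the scale of the factor it multiplies.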
