Regression Assumptions 2: Multicollinearity and you

Peter Humburg

Linear Regression Assumptions

  • Linear relationship between dependent and independent variables
  • Homoscedasticity

Linear Regression Assumptions

  • Linear relationship between dependent and independent variables
  • Homoscedasticity
  • Normally distributed residuals

Linear Regression Assumptions

  • Linear relationship between dependent and independent variables
  • Homoscedasticity
  • Normally distributed residuals
  • No (perfect) multicollinearity


What is Multicollinearity anyway?

  • Not just pairwise correlation between predictors

  • Correlation between one predictor and a linear combination of all other predictors in the model

What does that mean?

\[ Y = \beta_0 + \beta_1 X_1 + ... + \beta_i X_i + \beta_A A + \varepsilon \]

If there is perfect multicollinearity for predictor A, we can express A as a linear combination of the other predictors:

\[ A = \gamma_1 X_1 + ... + \gamma_i X_i \]

Where does it come from?

  • Several variables measuring related things
    • measures of depression and anxiety
    • measures of physical fitness
  • Variables derived from others
    • BMI, weight, and height
    • Age and Age\(^2\)

What does it do?

Multicollinearity does

  • increase the standard errors of coefficient estimates
  • make estimates unstable and sensitive to which other variables are in the model
    • estimates of coefficients can change dramatically
    • hypothesis tests of coefficients may lead to different conclusions

We need some data!

  • Will construct a dataset with multicollinearity
  • It is very artificial, but at least we know what is going on

\[ Y = 0.1~X_1 + 0.2~X_2 + 0.3~X_3 + 0.4~X_4 + 0.5~X_5 + \varepsilon \]

\(A\) is a linear combination of \(X_1\), …, \(X_5\)

 y X1 X2  X3  X4  X5    A
11 17 18 8.6 9.5 1.0 10.2
12 22 15 9.3 6.6 3.7 10.7
11 20 15 8.1 4.3 2.4  9.7
14 23 15 9.3 3.9 4.6 10.7
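
The slides don’t show the simulation code; a minimal sketch of how data like this could be generated in R (sample size, predictor distributions, and noise level are assumptions; the weights defining \(A\) match those derived on the following slides):

set.seed(42)                                     # assumption: any seed works
n <- 100                                         # the later output implies n = 100
X <- replicate(5, rnorm(n, mean = 10, sd = 4))   # assumed predictor distributions
colnames(X) <- paste0("X", 1:5)
A <- as.vector(X %*% c(0.25, 0.20, 0.15, 0.10, 0.05))        # exact linear combination of the Xs
y <- as.vector(X %*% c(0.1, 0.2, 0.3, 0.4, 0.5)) + rnorm(n)  # assumed error SD of 1
data_mult <- data.frame(y, X, A)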

Finding the linear combination

\[ A = \gamma_1 X_1 + ... + \gamma_5 X_5 \]
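
This linear combination can be recovered by regressing \(A\) on the other predictors; a minimal sketch (assuming the data frame is called data_mult, as in the later code):

summary(lm(A ~ X1 + X2 + X3 + X4 + X5, data = data_mult))   # a (near-)perfect fit is expected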

            Estimate Std. Error  t value Pr(>|t|)
(Intercept)     0.00          0 -9.6e-01     0.34
X1              0.25          0  9.4e+15     0.00
X2              0.20          0  6.2e+15     0.00
X3              0.15          0  5.4e+15     0.00
X4              0.10          0  3.6e+15     0.00
X5              0.05          0  1.7e+15     0.00

Warning: essentially perfect fit

\[ A = 0.25~X_1 + 0.2~X_2 + 0.15~X_3 + 0.1~X_4 + 0.05~X_5 \]

Let’s see what happens!

What happens if \(X_1\), …, \(X_5\) and \(A\) are included in the same model?
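
A minimal sketch of that model (again assuming the data frame data_mult):

fit_all <- lm(y ~ X1 + X2 + X3 + X4 + X5 + A, data = data_mult)
summary(fit_all)   # R reports NA for A because it is perfectly collinear with X1–X5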

            Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.60       1.55    0.39     0.70
X1              0.09       0.06    1.68     0.10
X2              0.18       0.07    2.57     0.01
X3              0.28       0.06    4.71     0.00
X4              0.40       0.06    6.91     0.00
X5              0.52       0.06    8.46     0.00
A                 NA         NA      NA       NA

Multicollinearity! R can’t estimate all the parameters and drops \(A\).

Dropping redundant variables is fine

  • Can drop one variable from a multicollinear set without losing anything
  • May want to think about which variable to drop
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.60       1.55    0.39      0.7
X1             -0.91       0.15   -6.02      0.0
X2             -0.63       0.13   -4.69      0.0
X3             -0.32       0.10   -3.10      0.0
X5              0.32       0.07    4.48      0.0
A               4.02       0.58    6.91      0.0

That looks different…

The model fit is unaffected

\(Y = \beta_1~X_1 + ... + \beta_5~X_5\)

\(Y = \beta_1~X_1 + \beta_2~X_2 + \beta_3~X_3 + \beta_5~X_5 + \beta_A~A\)
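
A quick check of this claim, reusing the simulated data (a sketch): the two parameterisations span the same predictor space, so the fitted values and \(R^2\) agree.

fit_x <- lm(y ~ X1 + X2 + X3 + X4 + X5, data = data_mult)
fit_a <- lm(y ~ X1 + X2 + X3 + X5 + A,  data = data_mult)
all.equal(unname(fitted(fit_x)), unname(fitted(fit_a)))   # should be TRUE: same fitted values
c(summary(fit_x)$r.squared, summary(fit_a)$r.squared)     # identical R-squared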

Interpretation may be difficult

Model with \(X_1\), …, \(X_5\):

            Estimate Std. Error Pr(>|t|)
(Intercept)     0.60       1.55     0.70
X1              0.09       0.06     0.10
X2              0.18       0.07     0.01
X3              0.28       0.06     0.00
X4              0.40       0.06     0.00
X5              0.52       0.06     0.00

Model with \(X_1\), \(X_2\), \(X_3\), \(X_5\), and \(A\):

            Estimate Std. Error Pr(>|t|)
(Intercept)     0.60       1.55      0.7
X1             -0.91       0.15      0.0
X2             -0.63       0.13      0.0
X3             -0.32       0.10      0.0
X5              0.32       0.07      0.0
A               4.02       0.58      0.0

\[ Y = -0.9~X_1 -0.6~X_2 - 0.3~X_3 + 0.3~X_5 + 4~A + \varepsilon \]

\[ Y = -0.9~X_1 -0.6~X_2 - 0.3~X_3 + 0.3~X_5 + 4~(0.25~X_1 + 0.2~X_2 + 0.15~X_3 + 0.1~X_4 + 0.05~X_5) + \varepsilon \]

\[ Y = -0.9~X_1 -0.6~X_2 - 0.3~X_3 + 0.3~X_5 + 1~X_1 + 0.8~X_2 + 0.6~X_3 + 0.4~X_4 + 0.2~X_5 + \varepsilon \]

\[ Y = 0.1~X_1 + 0.2~X_2 + 0.3~X_3 + 0.4~X_4 + 0.5~X_5 + \varepsilon \]

Results can be misleading

            Estimate Std. Error Pr(>|t|)
(Intercept)     0.60       1.55      0.7
X1             -0.91       0.15      0.0
X2             -0.63       0.13      0.0
X3             -0.32       0.10      0.0
X5              0.32       0.07      0.0
A               4.02       0.58      0.0
  • Without understanding the relationship between \(A\) and the \(X_i\), these coefficients are misleading
    • What is the effect of \(X_1\) on \(Y\)?
  • In real data we usually don’t know how to decode \(A\)

This is the problem with multicollinearity

What to do?

  • Be aware of the issue!
  • No perfect solution
  • But there may be imperfect ones

Diagnosing multicollinearity

How do you quantify multicollinearity?

Remember this?

\[ A = \gamma_1 X_1 + ... + \gamma_5 X_5 \]

  • We can do that for every predictor
    • For each predictor \(X_i\), regress it on all other predictors and record \(R^2_i\)
    • \(R^2_i\) is the proportion of variance of \(X_i\) explained by the other predictors in the model
    • \(1-R^2_i\) is the proportion of variance unique to \(X_i\)

Use \(\frac{1}{1-R^2_i}\) as measure of multicollinearity.

This is the Variance Inflation Factor (VIF)
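
To connect the formula to code, the VIF for a single predictor can be computed by hand; a sketch for \(X_1\), using the model with \(X_1\), \(X_2\), \(X_3\), \(X_5\), and \(A\) from the next slides:

r2_x1 <- summary(lm(X1 ~ X2 + X3 + X5 + A, data = data_mult))$r.squared
1 / (1 - r2_x1)   # VIF for X1; should match the olsrr output that follows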

Calculating & Interpreting VIF

  • R: car::vif() or olsrr::ols_vif_tol()
  • SPSS: /STATISTICS=COLLIN
  • Stata: estat vif
  • Higher VIF means higher multicollinearity
  • VIF > 5 or VIF > 10 are often used as thresholds for serious multicollinearity

Checking VIF (in R)

library(olsrr)                                    # provides ols_vif_tol()
fit <- lm(y ~ X1 + X2 + X3 + X5 + A, data = data_mult)
ols_vif_tol(fit)                                  # tolerance (1 - R^2_i) and VIF for each predictor
  Variables Tolerance  VIF
1        X1     0.125  8.0
2        X2     0.238  4.2
3        X3     0.302  3.3
4        X5     0.706  1.4
5         A     0.065 15.5
  • What now?

Handling Multicollinearity

Some options

  • Drop variable with highest VIF

Option 1: Dropping variables

VIF: X1: 8, X2: 4.2, X3: 3.3, X5: 1.4, A: 15.5

Remove \(A\) from the model
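
A sketch of the refit without \(A\) (the summaries below are the “Before” and “After” results):

fit_dropA <- lm(y ~ X1 + X2 + X3 + X5, data = data_mult)
summary(fit_dropA)                # "After" output below
olsrr::ols_vif_tol(fit_dropA)     # VIFs reported at the end of this slide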

Before


Call:
lm(formula = y ~ X1 + X2 + X3 + X5 + A, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6003     1.5518    0.39   0.6998    
X1           -0.9113     0.1513   -6.02  3.3e-08 ***
X2           -0.6280     0.1340   -4.69  9.4e-06 ***
X3           -0.3249     0.1048   -3.10   0.0026 ** 
X5            0.3238     0.0722    4.48  2.1e-05 ***
A             4.0212     0.5816    6.91  5.6e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

After


Call:
lm(formula = y ~ X1 + X2 + X3 + X5, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-3.655 -1.114  0.039  1.101  3.014 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.4270     1.7713    2.50  0.01416 *  
X1            0.0611     0.0680    0.90  0.37151    
X2            0.1686     0.0836    2.02  0.04668 *  
X3            0.2737     0.0722    3.79  0.00026 ***
X5            0.5875     0.0749    7.84  6.5e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.5 on 95 degrees of freedom
Multiple R-squared:  0.475, Adjusted R-squared:  0.453 
F-statistic: 21.5 on 4 and 95 DF,  p-value: 1.22e-12

VIF: X1: 1.1, X2: 1.1, X3: 1.1, X5: 1

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors

Option 2: Use PCA

  • PCA will turn our correlated predictors into a new set of uncorrelated ones
pca <- prcomp(data_mult[, c("X1", "X2", "X3", "X5", "A")])  # PCA of the (correlated) predictors
data_pc <- cbind(data_mult["y"], pca$x)                     # keep y, replace predictors with PC scores
fit_pc <- lm(y ~ ., data = data_pc)                         # regress y on all principal components
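
The loadings shown on the “Making sense of the numbers” slide are the rotation matrix of this PCA; a sketch for inspecting them, using the pca object above:

round(pca$rotation, 2)   # rows: original variables, columns: principal components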

Before


Call:
lm(formula = y ~ X1 + X2 + X3 + X5 + A, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6003     1.5518    0.39   0.6998    
X1           -0.9113     0.1513   -6.02  3.3e-08 ***
X2           -0.6280     0.1340   -4.69  9.4e-06 ***
X3           -0.3249     0.1048   -3.10   0.0026 ** 
X5            0.3238     0.0722    4.48  2.1e-05 ***
A             4.0212     0.5816    6.91  5.6e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

After


Call:
lm(formula = y ~ ., data = data_pc)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.4041     0.1207  102.73  < 2e-16 ***
PC1           0.0487     0.0493    0.99    0.326    
PC2          -0.4072     0.0544   -7.49  3.7e-11 ***
PC3          -0.5037     0.0604   -8.34  6.1e-13 ***
PC4          -0.1561     0.0754   -2.07    0.041 *  
PC5          -4.1423     0.6169   -6.71  1.4e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

Making sense of the numbers

  • Principal components are difficult to interpret
  • Each PC is a linear combination of the original variables

Loadings

     PC1   PC2   PC3   PC4   PC5
X1  0.86  0.22 -0.05 -0.39  0.23
X2  0.39 -0.38 -0.20  0.79  0.19
X3  0.02 -0.86  0.30 -0.39  0.14
X5 -0.14 -0.20 -0.93 -0.27  0.06
A   0.28 -0.17 -0.07 -0.01 -0.94

Call:
lm(formula = y ~ ., data = data_pc)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.4041     0.1207  102.73  < 2e-16 ***
PC1           0.0487     0.0493    0.99    0.326    
PC2          -0.4072     0.0544   -7.49  3.7e-11 ***
PC3          -0.5037     0.0604   -8.34  6.1e-13 ***
PC4          -0.1561     0.0754   -2.07    0.041 *  
PC5          -4.1423     0.6169   -6.71  1.4e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16
  • No inherent meaning
  • At least we know what we are getting

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors
    • no multicollinearity
    • hard to interpret results
  • Use penalized regression

Option 3: Let the algorithm decide

  • Penalized regression can fit the model even when there is perfect multicollinearity
  • Lasso-type penalties will drop variables as needed
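
The slides don’t show the call that produced the output below; one common choice is the lasso via the glmnet package (a sketch; the penalty type and tuning are assumptions):

library(glmnet)
X <- as.matrix(data_mult[, c("X1", "X2", "X3", "X4", "X5", "A")])
cv_fit <- cv.glmnet(X, data_mult$y, alpha = 1)  # alpha = 1: lasso penalty
coef(cv_fit, s = "lambda.min")                  # sparse coefficient matrix, as below
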
7 x 1 sparse Matrix of class "dgCMatrix"
               s1
(Intercept) 0.948
X1          .    
X2          0.086
X3          0.210
X4          0.353
X5          0.495
A           0.387

  • Estimates are biased
  • Interpretation still tricky

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors
    • no multicollinearity
    • hard to interpret results
  • Use penalized regression
    • biased estimates
    • can help with variable selection
  • Remove the shared component from one of the predictors

Option 4: More regressions!

  • Can fit a regression model to identify shared variance between one predictor and the others
    • Just like when computing VIF

\[ A = \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 + \gamma_5 X_5 + \varepsilon_A \]

  • The residuals \(\varepsilon_A\) have no linear relationship with the other predictors
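
A sketch of the two steps (resA is the name used in the output on the next slide):

fit_A <- lm(A ~ X1 + X2 + X3 + X5, data = data_mult)   # shared component of A
data_mult$resA <- resid(fit_A)                         # part of A not explained by the other predictors
fit_res <- lm(y ~ X1 + X2 + X3 + X5 + resA, data = data_mult)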

Option 4: More regressions!

Removing other predictors from \(A\)


Call:
lm(formula = A ~ X1 + X2 + X3 + X5, data = data_mult)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4338 -0.1455 -0.0276  0.1432  0.4621 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.95164    0.25575    3.72  0.00034 ***
X1           0.24183    0.00982   24.62  < 2e-16 ***
X2           0.19810    0.01208   16.40  < 2e-16 ***
X3           0.14885    0.01042   14.29  < 2e-16 ***
X5           0.06558    0.01082    6.06  2.7e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.21 on 95 degrees of freedom
Multiple R-squared:  0.935, Adjusted R-squared:  0.933 
F-statistic:  344 on 4 and 95 DF,  p-value: <2e-16

Using residuals of \(A\) instead of \(A\)


Call:
lm(formula = y ~ X1 + X2 + X3 + X5 + resA, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6003     1.5518    0.39    0.700    
X1            0.0611     0.0557    1.10    0.275    
X2            0.1686     0.0685    2.46    0.016 *  
X3            0.2737     0.0591    4.63  1.2e-05 ***
X5            0.5875     0.0613    9.58  1.5e-15 ***
resA          4.0212     0.5816    6.91  5.6e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors
    • no multicollinearity
    • hard to interpret results
  • Use penalized regression
    • biased estimates
    • can help with variable selection
  • Remove the shared component from one of the predictors
    • need to be careful with interpretation
    • can work well if there is one troublesome predictor

Other things to consider

  • Think carefully before including variables you know to be related
  • Use orthogonal polynomials when including non-linear terms (see the sketch after this list)
    • Centering variables also helps here
  • If you only care about predictions, don’t worry about any of this!
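
A minimal sketch of the orthogonal-polynomial point, using a hypothetical data frame my_data with an age predictor (neither is part of the example data above):

fit_raw  <- lm(y ~ age + I(age^2), data = my_data)   # raw terms: age and age^2 are strongly correlated
fit_orth <- lm(y ~ poly(age, 2), data = my_data)     # poly() creates orthogonal (uncorrelated) terms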

Questions?