Regression Assumptions 2: Multicollinearity and you

Peter Humburg

Linear Regression Assumptions

  • Linear relationship between dependent and independent variables
  • Homoscedasticity

Linear Regression Assumptions

  • Linear relationship between dependent and independent variables
  • Homoscedasticity
  • Normally distributed residuals

Linear Regression Assumptions

  • Linear relationship between dependent and independent variables
  • Homoscedasticity
  • Normally distributed residuals
  • No (perfect) multicollinearity


What is Multicollinearity anyway?

  • Not just pairwise correlation between predictors

  • Correlation between one predictor and a linear combination of all other predictors in the model

What does that mean?

\[ Y = \beta_0 + \beta_1 X_1 + ... + \beta_i X_i + \beta_A A + \varepsilon \]

If there is perfect multicollinearity for predictor A, we can express A as a linear combination of the other predictors:

\[ A = \gamma_1 X_1 + ... + \gamma_i X_i \]

Where does it come from?

  • Several variables measuring related things
    • measures of depression and anxiety
    • measures of physical fitness
  • Variables derived from others
    • BMI, weight, and height
    • Age and Age\(^2\)

What does it do?

Multicollinearity does

  • increase the standard errors of coefficient estimates
  • make estimates unstable and sensitive to which other variables are in the model
    • estimates of coefficients can change dramatically
    • hypothesis tests of coefficients may lead to different conclusions

We need some data!

  • Will construct a dataset with multicollinearity
  • It is very artificial, but at least we know what is going on

\[ Y = 0.1~X_1 + 0.2~X_2 + 0.3~X_3 + 0.4~X_4 + 0.5~X_5 + \varepsilon \]

\(A\) is a linear combination of \(X_1\), …, \(X_5\)

 y X1 X2  X3  X4  X5    A
11 17 18 8.6 9.5 1.0 10.2
12 22 15 9.3 6.6 3.7 10.7
11 20 15 8.1 4.3 2.4  9.7
14 23 15 9.3 3.9 4.6 10.7
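
The slides don’t show the simulation code; a minimal sketch of how data like this could be generated in R (sample size, predictor distributions, and noise level are assumptions; the weights defining \(A\) match those derived on the following slides):

set.seed(42)                                     # assumption: any seed works
n <- 100                                         # the later output implies n = 100
X <- replicate(5, rnorm(n, mean = 10, sd = 4))   # assumed predictor distributions
colnames(X) <- paste0("X", 1:5)
A <- as.vector(X %*% c(0.25, 0.20, 0.15, 0.10, 0.05))        # exact linear combination of the Xs
y <- as.vector(X %*% c(0.1, 0.2, 0.3, 0.4, 0.5)) + rnorm(n)  # assumed error SD of 1
data_mult <- data.frame(y, X, A)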

Finding the linear combination

\[ A = \gamma_1 X_1 + ... + \gamma_5 X_5 \]
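
This linear combination can be recovered by regressing \(A\) on the other predictors; a minimal sketch (assuming the data frame is called data_mult, as in the later code):

summary(lm(A ~ X1 + X2 + X3 + X4 + X5, data = data_mult))   # a (near-)perfect fit is expected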

            Estimate Std. Error  t value Pr(>|t|)
(Intercept)     0.00          0 -9.6e-01     0.34
X1              0.25          0  9.4e+15     0.00
X2              0.20          0  6.2e+15     0.00
X3              0.15          0  5.4e+15     0.00
X4              0.10          0  3.6e+15     0.00
X5              0.05          0  1.7e+15     0.00

Warning: essentially perfect fit

\[ A = 0.25~X_1 + 0.2~X_2 + 0.15~X_3 + 0.1~X_4 + 0.05~X_5 \]

Let’s see what happens!

What happens if \(X_1\), …, \(X_5\) and \(A\) are included in the same model?
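
A minimal sketch of that model (again assuming the data frame data_mult):

fit_all <- lm(y ~ X1 + X2 + X3 + X4 + X5 + A, data = data_mult)
summary(fit_all)   # R reports NA for A because it is perfectly collinear with X1–X5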

            Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.60       1.55    0.39     0.70
X1              0.09       0.06    1.68     0.10
X2              0.18       0.07    2.57     0.01
X3              0.28       0.06    4.71     0.00
X4              0.40       0.06    6.91     0.00
X5              0.52       0.06    8.46     0.00
A                 NA         NA      NA       NA

Multicollinearity! R can’t estimate all the parameters and drops \(A\).

Dropping redundant variables is fine

  • Can drop one variable from a multicollinear set without losing anything
  • May want to think about which variable to drop
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.60       1.55    0.39      0.7
X1             -0.91       0.15   -6.02      0.0
X2             -0.63       0.13   -4.69      0.0
X3             -0.32       0.10   -3.10      0.0
X5              0.32       0.07    4.48      0.0
A               4.02       0.58    6.91      0.0

That looks different…

The model fit is unaffected

\(Y = \beta_1~X_1 + ... + \beta_5~X_5\)

\(Y = \beta_1~X_1 + \beta_2~X_2 + \beta_3~X_3 + \beta_5~X_5 + \beta_A~A\)
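
A quick check of this claim, reusing the simulated data (a sketch): the two parameterisations span the same predictor space, so the fitted values and \(R^2\) agree.

fit_x <- lm(y ~ X1 + X2 + X3 + X4 + X5, data = data_mult)
fit_a <- lm(y ~ X1 + X2 + X3 + X5 + A,  data = data_mult)
all.equal(unname(fitted(fit_x)), unname(fitted(fit_a)))   # should be TRUE: same fitted values
c(summary(fit_x)$r.squared, summary(fit_a)$r.squared)     # identical R-squared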

Interpretation may be difficult

Model with \(X_1\), …, \(X_5\):

            Estimate Std. Error Pr(>|t|)
(Intercept)     0.60       1.55     0.70
X1              0.09       0.06     0.10
X2              0.18       0.07     0.01
X3              0.28       0.06     0.00
X4              0.40       0.06     0.00
X5              0.52       0.06     0.00

Model with \(X_1\), \(X_2\), \(X_3\), \(X_5\), and \(A\):

            Estimate Std. Error Pr(>|t|)
(Intercept)     0.60       1.55      0.7
X1             -0.91       0.15      0.0
X2             -0.63       0.13      0.0
X3             -0.32       0.10      0.0
X5              0.32       0.07      0.0
A               4.02       0.58      0.0

\[ Y = -0.9~X_1 -0.6~X_2 - 0.3~X_3 + 0.3~X_5 + 4~A + \varepsilon \]

\[ Y = -0.9~X_1 -0.6~X_2 - 0.3~X_3 + 0.3~X_5 + 4~(0.25~X_1 + 0.2~X_2 + 0.15~X_3 + 0.1~X_4 + 0.05~X_5) + \varepsilon \]

\[ Y = -0.9~X_1 -0.6~X_2 - 0.3~X_3 + 0.3~X_5 + 1~X_1 + 0.8~X_2 + 0.6~X_3 + 0.4~X_4 + 0.2~X_5 + \varepsilon \]

\[ Y = 0.1~X_1 + 0.2~X_2 + 0.3~X_3 + 0.4~X_4 + 0.5~X_5 + \varepsilon \]

Results can be misleading

            Estimate Std. Error Pr(>|t|)
(Intercept)     0.60       1.55      0.7
X1             -0.91       0.15      0.0
X2             -0.63       0.13      0.0
X3             -0.32       0.10      0.0
X5              0.32       0.07      0.0
A               4.02       0.58      0.0
  • Without understanding the relationship between \(A\) and the \(X_i\), these coefficients are misleading
    • What is the effect of \(X_1\) on \(Y\)?
  • In real data we usually don’t know how to decode \(A\)

This is the problem with multicollinearity

What to do?

  • Be aware of the issue!
  • No perfect solution
  • But there may be imperfect ones

Diagnosing multicollinearity

How do you quantify multicollinearity?

Remember this?

\[ A = \gamma_1 X_1 + ... + \gamma_5 X_5 \]

  • We can do that for every predictor
    • For each predictor \(X_i\), regress it on all other predictors and record \(R^2_i\)
    • \(R^2_i\) is the proportion of variance of \(X_i\) explained by the other predictors in the model
    • \(1-R^2_i\) is the proportion of variance unique to \(X_i\)

Use \(\frac{1}{1-R^2_i}\) as measure of multicollinearity.

This is the Variance Inflation Factor (VIF)
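
To connect the formula to code, the VIF for a single predictor can be computed by hand; a sketch for \(X_1\), using the model with \(X_1\), \(X_2\), \(X_3\), \(X_5\), and \(A\) from the next slides:

r2_x1 <- summary(lm(X1 ~ X2 + X3 + X5 + A, data = data_mult))$r.squared
1 / (1 - r2_x1)   # VIF for X1; should match the olsrr output that follows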

Calculating & Interpreting VIF

  • R: car::vif() or olsrr::ols_vif_tol()
  • SPSS: /STATISTICS=COLLIN
  • Stata: estat vif
  • Higher VIF means higher multicollinearity
  • VIF > 5 or VIF > 10 are often used as thresholds for serious multicollinearity

Checking VIF (in R)

library(olsrr)                                    # provides ols_vif_tol()
fit <- lm(y ~ X1 + X2 + X3 + X5 + A, data = data_mult)
ols_vif_tol(fit)                                  # tolerance (1 - R^2_i) and VIF for each predictor
  Variables Tolerance  VIF
1        X1     0.125  8.0
2        X2     0.238  4.2
3        X3     0.302  3.3
4        X5     0.706  1.4
5         A     0.065 15.5
  • What now?

Handling Multicollinearity

Some options

  • Drop variable with highest VIF

Option 1: Dropping variables

VIF: X1: 8, X2: 4.2, X3: 3.3, X5: 1.4, A: 15.5

Remove \(A\) from the model
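
A sketch of the refit without \(A\) (the summaries below are the “Before” and “After” results):

fit_dropA <- lm(y ~ X1 + X2 + X3 + X5, data = data_mult)
summary(fit_dropA)                # "After" output below
olsrr::ols_vif_tol(fit_dropA)     # VIFs reported at the end of this slide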

Before


Call:
lm(formula = y ~ X1 + X2 + X3 + X5 + A, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6003     1.5518    0.39   0.6998    
X1           -0.9113     0.1513   -6.02  3.3e-08 ***
X2           -0.6280     0.1340   -4.69  9.4e-06 ***
X3           -0.3249     0.1048   -3.10   0.0026 ** 
X5            0.3238     0.0722    4.48  2.1e-05 ***
A             4.0212     0.5816    6.91  5.6e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

After


Call:
lm(formula = y ~ X1 + X2 + X3 + X5, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-3.655 -1.114  0.039  1.101  3.014 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.4270     1.7713    2.50  0.01416 *  
X1            0.0611     0.0680    0.90  0.37151    
X2            0.1686     0.0836    2.02  0.04668 *  
X3            0.2737     0.0722    3.79  0.00026 ***
X5            0.5875     0.0749    7.84  6.5e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.5 on 95 degrees of freedom
Multiple R-squared:  0.475, Adjusted R-squared:  0.453 
F-statistic: 21.5 on 4 and 95 DF,  p-value: 1.22e-12

VIF: X1: 1.1, X2: 1.1, X3: 1.1, X5: 1

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors

Option 2: Use PCA

  • PCA will turn our correlated predictors into a new set of uncorrelated ones
pca <- prcomp(data_mult[, c("X1", "X2", "X3", "X5", "A")])  # PCA of the (correlated) predictors
data_pc <- cbind(data_mult["y"], pca$x)                     # keep y, replace predictors with PC scores
fit_pc <- lm(y ~ ., data = data_pc)                         # regress y on all principal components
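
The loadings shown on the “Making sense of the numbers” slide are the rotation matrix of this PCA; a sketch for inspecting them, using the pca object above:

round(pca$rotation, 2)   # rows: original variables, columns: principal components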

Before


Call:
lm(formula = y ~ X1 + X2 + X3 + X5 + A, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6003     1.5518    0.39   0.6998    
X1           -0.9113     0.1513   -6.02  3.3e-08 ***
X2           -0.6280     0.1340   -4.69  9.4e-06 ***
X3           -0.3249     0.1048   -3.10   0.0026 ** 
X5            0.3238     0.0722    4.48  2.1e-05 ***
A             4.0212     0.5816    6.91  5.6e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

After


Call:
lm(formula = y ~ ., data = data_pc)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.4041     0.1207  102.73  < 2e-16 ***
PC1           0.0487     0.0493    0.99    0.326    
PC2          -0.4072     0.0544   -7.49  3.7e-11 ***
PC3          -0.5037     0.0604   -8.34  6.1e-13 ***
PC4          -0.1561     0.0754   -2.07    0.041 *  
PC5          -4.1423     0.6169   -6.71  1.4e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

Making sense of the numbers

  • Principal components are difficult to interpret
  • Each PC is a linear combination of the original variables

Loadings

     PC1   PC2   PC3   PC4   PC5
X1  0.86  0.22 -0.05 -0.39  0.23
X2  0.39 -0.38 -0.20  0.79  0.19
X3  0.02 -0.86  0.30 -0.39  0.14
X5 -0.14 -0.20 -0.93 -0.27  0.06
A   0.28 -0.17 -0.07 -0.01 -0.94

Call:
lm(formula = y ~ ., data = data_pc)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.4041     0.1207  102.73  < 2e-16 ***
PC1           0.0487     0.0493    0.99    0.326    
PC2          -0.4072     0.0544   -7.49  3.7e-11 ***
PC3          -0.5037     0.0604   -8.34  6.1e-13 ***
PC4          -0.1561     0.0754   -2.07    0.041 *  
PC5          -4.1423     0.6169   -6.71  1.4e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16
  • No inherent meaning
  • At least we know what we are getting

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors
    • no multicollinearity
    • hard to interpret results
  • Use penalized regression

Option 3: Let the algorithm decide

  • Penalized regression can fit the model even when there is perfect multicollinearity
  • Lasso-type penalties will drop variables as needed
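
The slides don’t show the call that produced the output below; one common choice is the lasso via the glmnet package (a sketch; the penalty type and tuning are assumptions):

library(glmnet)
X <- as.matrix(data_mult[, c("X1", "X2", "X3", "X4", "X5", "A")])
cv_fit <- cv.glmnet(X, data_mult$y, alpha = 1)  # alpha = 1: lasso penalty
coef(cv_fit, s = "lambda.min")                  # sparse coefficient matrix, as below
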
7 x 1 sparse Matrix of class "dgCMatrix"
               s1
(Intercept) 0.948
X1          .    
X2          0.086
X3          0.210
X4          0.353
X5          0.495
A           0.387

  • Estimates are biased
  • Interpretation still tricky

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors
    • no multicollinearity
    • hard to interpret results
  • Use penalized regression
    • biased estimates
    • can help with variable selection
  • Remove the shared component from one of the predictors

Option 4: More regressions!

  • Can fit a regression model to identify shared variance between one predictor and the others
    • Just like when computing VIF

\[ A = \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 + \gamma_5 X_5 + \varepsilon_A \]

  • The residuals \(\varepsilon_A\) have no linear relationship with the other predictors
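
A sketch of the two steps (resA is the name used in the output on the next slide):

fit_A <- lm(A ~ X1 + X2 + X3 + X5, data = data_mult)   # shared component of A
data_mult$resA <- resid(fit_A)                         # part of A not explained by the other predictors
fit_res <- lm(y ~ X1 + X2 + X3 + X5 + resA, data = data_mult)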

Option 4: More regressions!

Removing other predictors from \(A\)


Call:
lm(formula = A ~ X1 + X2 + X3 + X5, data = data_mult)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4338 -0.1455 -0.0276  0.1432  0.4621 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.95164    0.25575    3.72  0.00034 ***
X1           0.24183    0.00982   24.62  < 2e-16 ***
X2           0.19810    0.01208   16.40  < 2e-16 ***
X3           0.14885    0.01042   14.29  < 2e-16 ***
X5           0.06558    0.01082    6.06  2.7e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.21 on 95 degrees of freedom
Multiple R-squared:  0.935, Adjusted R-squared:  0.933 
F-statistic:  344 on 4 and 95 DF,  p-value: <2e-16

Using residuals of \(A\) instead of \(A\)


Call:
lm(formula = y ~ X1 + X2 + X3 + X5 + resA, data = data_mult)

Residuals:
   Min     1Q Median     3Q    Max 
-2.750 -0.725 -0.042  0.828  2.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6003     1.5518    0.39    0.700    
X1            0.0611     0.0557    1.10    0.275    
X2            0.1686     0.0685    2.46    0.016 *  
X3            0.2737     0.0591    4.63  1.2e-05 ***
X5            0.5875     0.0613    9.58  1.5e-15 ***
resA          4.0212     0.5816    6.91  5.6e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 94 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.633 
F-statistic: 35.2 on 5 and 94 DF,  p-value: <2e-16

Handling Multicollinearity

Some options

  • Drop variable with highest VIF
    • will reduce multicollinearity
    • may substantially reduce model fit
  • Use PCA to get uncorrelated predictors
    • no multicollinearity
    • hard to interpret results
  • Use penalized regression
    • biased estimates
    • can help with variable selection
  • Remove the shared component from one of the predictors
    • need to be careful with interpretation
    • can work well if there is one troublesome predictor

Other things to consider

  • Think carefully before including variables you know to be related
  • Use orthogonal polynomials when including non-linear terms (see the sketch after this list)
    • Centering variables also helps here
  • If you only care about predictions, don’t worry about any of this!
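
A minimal sketch of the orthogonal-polynomial point, using a hypothetical data frame my_data with an age predictor (neither is part of the example data above):

fit_raw  <- lm(y ~ age + I(age^2), data = my_data)   # raw terms: age and age^2 are strongly correlated
fit_orth <- lm(y ~ poly(age, 2), data = my_data)     # poly() creates orthogonal (uncorrelated) terms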

Questions?