When is Multicollinearity an Issue?

A causal perspective
Author

Matthew Reda

Multicollinearity arises when independent variables in a regression model are highly correlated, leading to challenges in estimating individual effects accurately. This issue can inflate standard errors, reduce the precision of coefficient estimates, and complicate the interpretation of results. Understanding when multicollinearity becomes problematic is crucial for developing robust regression models. This section explores the conditions under which multicollinearity affects model performance, discusses its implications, and provides strategies for detection and mitigation.

The DAG

Paid Media on Sales Seasonality Seasonality Demand Demand Seasonality->Demand OLV Impression OLV Impression Seasonality->OLV Impression Social Impression Social Impression Seasonality->Social Impression Search Query Search Query Seasonality->Search Query Demand->Search Query Paid Search Click Paid Search Click Demand->Paid Search Click Sales Sales Demand->Sales Auction Auction Paid Search Impression Paid Search Impression Auction->Paid Search Impression OLV Impression->Demand Video Platform Sentiment Video Platform Sentiment Video Platform Sentiment->OLV Impression Social Impression->Demand Search Query->Auction Search Query->Paid Search Impression Search Query->Paid Search Click Organic Search Organic Search Search Query->Organic Search Paid Search Impression->Paid Search Click Paid Search Click->Sales Organic Search->Sales Price Price Price->Demand Price->Sales
Figure 1: The causal graph reflects our assumptions about the data-generating process in this example. This model may need refinement, so it’s essential to review it with experts and stakeholders to enhance its structure and incorporate additional factors that could influence the variables of interest.

A Directed Acyclic Graph (DAG) serves as a model of causality, where directional relationships between nodes indicate that causes flow from one node to another along the arrows, forming a distinct pathway of influence. Each arrow represents a causal link, signaling how changes in one node could potentially impact another. In a DAG, the relationships between nodes (depicted by an arrow) are not restricted to being linear; they can incorporate complex, non-linear interactions.

Edges will be color-coded to emphasize whether they represent causal paths (green ), biasing paths (red ), or non-causal paths (black ). Refer to Figure 2 and Figure 3.

Nodes indicating exposures (factors for which we want to understand the effect) will have a green background (). The outcome variable of interest will have a blue background (), while variables adjusted for will have a light gray background (). Nodes with a dashed outline represent elements that are challenging or impossible to observe directly, while nodes with a solid outline and no fill represent variables for which data is available or can be acquired.

When High VIFs are an issue

While high VIFs are not inherently problematic, they serve as a useful diagnostic tool for identifying potential issues in regression results. Let’s explore this further with a slightly more complex example.

How effective is OLV at driving sales?

From the causal model, we can see that seasonality is part of the necessary adjustment set to accurately estimate the effect of OLV impressions on sales. Although seasonality is unobserved, we can reasonably approximate it using a periodic function (e.g., sine or cosine terms) to capture seasonal fluctuations over time. Let’s consider the following model were we adjust for seasonality (using fourier terms), price, olv impressions, and Video Platform Sentiment (olv_sentiment in the models). Note that Video Platform Sentiment is not a confounder, typically we would not adjust for it, but we include it here to illustrate the point that adjusting for variables that are not confounders can needlessly increase VIFs.

Paid Media on Sales OLV Impression OLV Impression Demand Demand OLV Impression->Demand Sales Sales Price Price Price->Sales Price->Demand Video Platform Sentiment Video Platform Sentiment Video Platform Sentiment->OLV Impression Seasonality Seasonality Seasonality->OLV Impression Seasonality->Sales Search Query Search Query Seasonality->Search Query Social Impression Social Impression Seasonality->Social Impression Seasonality->Demand Paid Search Impression Paid Search Impression Search Query->Paid Search Impression Paid Search Click Paid Search Click Search Query->Paid Search Click Organic Search Organic Search Search Query->Organic Search Auction Auction Search Query->Auction Paid Search Impression->Paid Search Click Social Impression->Demand Paid Search Click->Sales Organic Search->Sales Demand->Sales Demand->Search Query Demand->Paid Search Click Auction->Paid Search Impression
Figure 5: The causal graph for the relationship between OLV Impressions and Sales. Adjusting for Seasonality and Price closes all biasing paths, allowing for an unbiased estimate of the effect of OLV Impressions on Sales. Including Video Platform Sentement does nothing to reduce bias and only adds variance of our estimate.
Paid Media on Sales OLV Impression OLV Impression Demand Demand OLV Impression->Demand Sales Sales Price Price Price->Sales Price->Demand Seasonality Seasonality Seasonality->OLV Impression Seasonality->Sales Search Query Search Query Seasonality->Search Query Social Impression Social Impression Seasonality->Social Impression Seasonality->Demand Paid Search Impression Paid Search Impression Search Query->Paid Search Impression Paid Search Click Paid Search Click Search Query->Paid Search Click Organic Search Organic Search Search Query->Organic Search Auction Auction Search Query->Auction Paid Search Impression->Paid Search Click Social Impression->Demand Video Platform Sentiment Video Platform Sentiment Video Platform Sentiment->OLV Impression Paid Search Click->Sales Organic Search->Sales Demand->Sales Demand->Search Query Demand->Paid Search Click Auction->Paid Search Impression
Figure 6: Adjusting for Seasonality and Price closes all biasing paths, allowing for an unbiased estimate of the effect of OLV Impressions on Sales.

To capture the seasonal fluctuations in the data, we can use Fourier terms to approximate the unobserved seasonality variable. These terms are periodic functions that can model the cyclical patterns in the data, allowing us to adjust for the influence of seasonality on sales and OLV impressions. By including these terms in the regression model, we can effectively control for the confounding effect of seasonality and obtain an unbiased estimate of the effect of OLV impressions on sales.

I will use a model helper to generated the fourier

Figure 7: Fourier seasonality
Table 4: VIF for the regression model with Fourier components with the addition of OLV sentiment
               VIF
hill_olv     : 70.91
log_price    : 57.59
olv_sentiment: 4.43
fourier_sin_0: 1.01
fourier_sin_1: 1.01
fourier_sin_2: 1.0
fourier_cos_0: 1.42
fourier_cos_1: 1.01
fourier_cos_2: 1.01
Table 5: VIF for the regression model with Fourier components without the addition of OLV sentiment
               VIF
hill_olv     : 18.98
log_price    : 18.84
fourier_sin_0: 1.01
fourier_sin_1: 1.0
fourier_sin_2: 1.0
fourier_cos_0: 1.14
fourier_cos_1: 1.0
fourier_cos_2: 1.0

Fourier components provide an orthogonal basis for the space of periodic functions with a given period \(L\) . This orthogonality helps to address multicollinearity, which is why the VIFs in Table 5 and Table 4 are all near 1 for the fourier components!

Table 6: The over adjusted regression model with Fourier components with the addition of OLV sentiment
OLS Regression Results
Dep. Variable: sales R-squared: 0.888
Model: OLS Adj. R-squared: 0.881
Method: Least Squares F-statistic: 128.3
Date: Tue, 19 Nov 2024 Prob (F-statistic): 9.51e-65
Time: 19:17:22 Log-Likelihood: 3.4816
No. Observations: 156 AIC: 13.04
Df Residuals: 146 BIC: 43.54
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 10.7020 0.324 32.982 0.000 10.061 11.343
hill_olv 0.7217 0.433 1.665 0.098 -0.135 1.578
log_price -1.0720 0.127 -8.433 0.000 -1.323 -0.821
olv_sentiment 0.0183 0.161 0.114 0.909 -0.299 0.336
fourier_sin_0 0.0449 0.028 1.591 0.114 -0.011 0.101
fourier_sin_1 0.0214 0.028 0.768 0.444 -0.034 0.077
fourier_sin_2 0.0166 0.028 0.596 0.552 -0.038 0.072
fourier_cos_0 -0.8321 0.041 -20.141 0.000 -0.914 -0.750
fourier_cos_1 0.1959 0.028 6.963 0.000 0.140 0.251
fourier_cos_2 0.0084 0.028 0.303 0.763 -0.046 0.063
Omnibus: 0.997 Durbin-Watson: 1.689
Prob(Omnibus): 0.607 Jarque-Bera (JB): 0.912
Skew: -0.186 Prob(JB): 0.634
Kurtosis: 2.964 Cond. No. 53.9


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Table 7: The correctly adjusted regression model with Fourier components and no OLV sentiment
OLS Regression Results
Dep. Variable: sales R-squared: 0.888
Model: OLS Adj. R-squared: 0.882
Method: Least Squares F-statistic: 145.4
Date: Tue, 19 Nov 2024 Prob (F-statistic): 7.68e-66
Time: 19:18:03 Log-Likelihood: 3.4747
No. Observations: 156 AIC: 11.05
Df Residuals: 147 BIC: 38.50
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 10.6756 0.226 47.201 0.000 10.229 11.123
hill_olv 0.7678 0.157 4.898 0.000 0.458 1.078
log_price -1.0712 0.126 -8.468 0.000 -1.321 -0.821
fourier_sin_0 0.0444 0.028 1.597 0.112 -0.011 0.099
fourier_sin_1 0.0215 0.028 0.773 0.441 -0.033 0.076
fourier_sin_2 0.0167 0.028 0.603 0.548 -0.038 0.071
fourier_cos_0 -0.8289 0.030 -27.407 0.000 -0.889 -0.769
fourier_cos_1 0.1953 0.028 7.081 0.000 0.141 0.250
fourier_cos_2 0.0086 0.028 0.313 0.755 -0.046 0.063
Omnibus: 1.098 Durbin-Watson: 1.688
Prob(Omnibus): 0.577 Jarque-Bera (JB): 0.987
Skew: -0.195 Prob(JB): 0.610
Kurtosis: 2.979 Cond. No. 26.6


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Table 8: The correctly adjusted regression model with Fourier components and no OLV sentiment
OLS Regression Results
Dep. Variable: sales R-squared: 0.063
Model: OLS Adj. R-squared: 0.051
Method: Least Squares F-statistic: 5.166
Date: Tue, 19 Nov 2024 Prob (F-statistic): 0.00674
Time: 19:27:23 Log-Likelihood: -162.03
No. Observations: 156 AIC: 330.1
Df Residuals: 153 BIC: 339.2
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 10.1355 0.637 15.919 0.000 8.878 11.393
log_price -0.8167 0.355 -2.299 0.023 -1.519 -0.115
hill_olv_predicted 0.9943 0.463 2.148 0.033 0.080 1.909
Omnibus: 23.961 Durbin-Watson: 0.227
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11.734
Skew: 0.487 Prob(JB): 0.00283
Kurtosis: 2.075 Cond. No. 26.5


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Here, we observe that although the VIF values are considerably lower than those reported in Table 1 in the previous example (1.1.1), the impact on our ability to accurately measure the effect of OLV impressions on sales is significantly more detrimental.

Comparing Table 7 to Table 6 we see that the inclusion of Video Platform Sentiment (represented as olv_sentement in the model), did not bias the estimate of the total effect of OLV impressions on sales. However, it did lead to an increase in p-values, pushing them above our significance threshold of 0.05.

Appendix A - Paid Search Data

Appendix B - OLV Data

Reuse

Citation

BibTeX citation:
@online{reda,
  author = {Reda, Matthew},
  title = {When Is {Multicollinearity} an {Issue?}},
  url = {https://redam94.github.io/common_regression_issues/multicollinearity.html},
  langid = {en}
}
For attribution, please cite this work as:
Reda, Matthew. n.d. “When Is Multicollinearity an Issue?” https://redam94.github.io/common_regression_issues/multicollinearity.html.