Common Regression Issues

This repository is more of a knowledge base than a traditional codebase. Hopefully, you’ll find some of these tips insightful, along with a selection of helpful utility functions for generating synthetic data.

Author

Matthew Reda

Please hop on over to the site, where I cover several topics that I’ve encountered while modelling. Feel free to raise an issue if there is a topic that you think should be discussed, or if you find anything on the site that needs changing, clarifying, or correcting.

Sampling Error

This section covers the problems that arise when the independent variables in a regression model are measured with error. Such errors can lead to biased and inconsistent parameter estimates, even when the sampling design is unbiased and respondents answer accurately. Practical examples, such as a survey measuring brand advertisement recall and its effect on sales, show how sampling error can distort regression results. To address these issues, the section explores strategies such as latent variable models and careful modelling of the observation process.
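The distortion described above can be reproduced in a few lines. The following is a minimal sketch, not code from the site: the sample sizes, the recall range, and the true slope are all made-up assumptions, and `ols_slope` is a hypothetical helper. Sales are driven by the true weekly recall rate, but the regression only sees an unbiased survey estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)

n_weeks = 2000          # number of weekly observations (assumed)
n_respondents = 50      # survey respondents per week (assumed)
true_slope = 10.0       # assumed effect of recall rate on sales

# True weekly ad-recall rate in the population.
true_recall = rng.uniform(0.2, 0.6, size=n_weeks)

# Sales depend on the true recall rate, not on the surveyed estimate.
sales = 100 + true_slope * true_recall + rng.normal(0, 1, size=n_weeks)

# An unbiased survey still only observes a binomial sample each week.
observed_recall = rng.binomial(n_respondents, true_recall) / n_respondents


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x (hypothetical helper)."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


slope_true = ols_slope(true_recall, sales)          # regress on the true rate
slope_observed = ols_slope(observed_recall, sales)  # regress on the survey estimate

print(f"slope using true recall:     {slope_true:.2f}")
print(f"slope using surveyed recall: {slope_observed:.2f}")
```

Under these assumptions the slope estimated from the surveyed recall should be pulled toward zero relative to the slope estimated from the true recall, even though each weekly survey estimate is unbiased.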

Multicollinearity

This section examines the challenges posed by multicollinearity in regression models, particularly from a causal perspective. It uses Directed Acyclic Graphs (DAGs) to illustrate how multicollinearity can complicate the estimation of causal effects, with examples such as assessing the impact of paid search impressions on sales. The section emphasizes the importance of identifying and adjusting for confounding variables to obtain unbiased estimates, while noting that the high Variance Inflation Factors (VIFs) that accompany correlated regressors indicate inflated standard errors and less precise estimates. It also discusses strategies to mitigate multicollinearity’s impact, such as focusing on relevant adjustment variables and refining causal models in collaboration with experts and stakeholders.
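The trade-off between confounding bias and a high VIF can be sketched with simulated data. This is an illustrative example under an assumed DAG (demand drives both impressions and sales), with made-up coefficients. The helpers `ols` and `vif` are hypothetical and written with plain NumPy; they are not the site's utility functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Assumed DAG: demand -> impressions, demand -> sales, impressions -> sales.
demand = rng.normal(size=n)
impressions = 0.9 * demand + rng.normal(scale=0.3, size=n)
sales = 2.0 * impressions + 3.0 * demand + rng.normal(size=n)


def ols(X, y):
    """Least-squares coefficients with an intercept prepended (hypothetical helper)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    X1 = np.column_stack([np.ones(len(target)), others])
    beta, *_ = np.linalg.lstsq(X1, target, rcond=None)
    resid = target - X1 @ beta
    r_squared = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r_squared)


X = np.column_stack([impressions, demand])
naive_effect = ols(impressions, sales)[1]   # omits the confounder
adjusted_effect = ols(X, sales)[1]          # adjusts for demand
impressions_vif = vif(X, 0)

print(f"naive effect of impressions:    {naive_effect:.2f}")
print(f"adjusted effect of impressions: {adjusted_effect:.2f}")
print(f"VIF of impressions:             {impressions_vif:.1f}")
```

In this setup, dropping `demand` to lower the VIF would trade a precise but biased estimate for an imprecise but unbiased one. The adjusted model recovers something close to the assumed effect of 2.0 despite the high VIF.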

The Illusion of Significance

Statistical models drive millions in spending decisions, yet beneath their precise-looking numbers lurks a dangerous problem. This post examines how the practice of selecting variables based on p-values creates a statistical house of cards, especially in Marketing Mix Modeling (MMM) and budget optimization. I show why common techniques like stepwise regression inevitably produce overconfident models with biased estimates that violate the very statistical principles they claim to uphold. These methodological flaws can result in money flowing to the wrong marketing channels based on illusory performance metrics. I demonstrate why Bayesian approaches offer a more honest alternative by naturally tempering overconfidence, incorporating what we already know, and providing intuitive uncertainty measures. Through techniques like spike-and-slab priors (or regularized horseshoe priors) and Bayesian Model Averaging (BMA), analysts can move beyond arbitrary significance thresholds toward probability-based decision-making. While Bayesian methods do require more computational horsepower and thoughtful prior specification, modern software has made them increasingly accessible. Using simulated examples inspired by real-world marketing and economic modeling, I show how Bayesian methods produce more reliable insights that lead to smarter budget allocation decisions.
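The bias introduced by filtering on significance can be illustrated with a small simulation. This sketch is not the post's own example: the effect size, sample size, and number of simulated datasets are assumptions, and a fixed t cutoff of about 1.98 stands in for a two-sided p < 0.05 test with 98 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims, n_obs = 5000, 100
true_effect = 0.1   # assumed small but real channel effect

# One simulated dataset per row.
x = rng.normal(size=(n_sims, n_obs))
y = true_effect * x + rng.normal(size=(n_sims, n_obs))

# Per-dataset OLS slope and t-statistic (simple regression with intercept).
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
sxx = (xc ** 2).sum(axis=1)
slope = (xc * yc).sum(axis=1) / sxx
resid = yc - slope[:, None] * xc
sigma2 = (resid ** 2).sum(axis=1) / (n_obs - 2)
t_stat = slope / np.sqrt(sigma2 / sxx)

# Keep only the "significant" estimates, as a p < 0.05 filter roughly would.
significant = np.abs(t_stat) > 1.98
mean_all = slope.mean()
mean_selected = np.abs(slope[significant]).mean()
share_selected = significant.mean()

print(f"true effect:                        {true_effect:.3f}")
print(f"mean estimate, all datasets:        {mean_all:.3f}")
print(f"mean |estimate|, significant only:  {mean_selected:.3f}")
print(f"share of datasets kept:             {share_selected:.1%}")
```

Averaged over every dataset, the estimate is close to the true effect. Averaged over only the datasets that clear the significance filter, it is noticeably larger, because a small effect measured with this much noise can only reach significance when the estimate happens to overshoot.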

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{reda,
  author = {Reda, Matthew},
  title = {Common {Regression} {Issues}},
  url = {https://redam94.github.io/common_regression_issues/index.html},
  langid = {en}
}

For attribution, please cite this work as:

Reda, Matthew. n.d. “Common Regression Issues.” https://redam94.github.io/common_regression_issues/index.html.