Least Squares: Your Go-To Method, But Is It Always the Best?

When tasked with fitting a line to data, least squares regression is often the first technique analysts reach for. Its simplicity, ease of implementation, and widespread acceptance make it a cornerstone of statistical modeling. But is it always the best choice? While least squares has served as a trusted ally for decades, it’s not without its limitations—especially when dealing with complex datasets. By leaning solely on this method, you might be missing opportunities for better performance and clearer insights. Let’s unpack the challenges and explore smarter alternatives that could elevate your models.


Least squares: The goal of least squares regression is to minimize the sum of squared residuals (errors) between the observed values and the fitted values:

$$\mathrm{Objective:}\ \min_{\beta} \sum_{i=1}^{n} \left( y_i - \widehat{y}_i \right)^2$$
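As a concrete illustration, here is a minimal sketch of an ordinary least squares fit using NumPy's `lstsq`; the data, coefficients, and seed are purely illustrative.

```python
import numpy as np

# Synthetic data: y = 2.0 * x + 1.0 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=50)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Solve min_beta ||y - X beta||^2
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, slope:", beta)
```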

Least squares thrives on data, sometimes to its detriment. When the number of predictors approaches or exceeds the number of observations, the model becomes overly flexible and can fit the training data perfectly. At first glance, this sounds ideal, but it spells trouble: the model memorizes noise and irrelevant patterns in the training data, leading to poor generalization on new, unseen data. In practice, your perfectly tuned model can produce wildly inaccurate predictions when faced with fresh inputs. This is called overfitting.

What is overfitting? Imagine a tailor who is obsessed with every little detail: measuring your posture, the way you breathe, how you sit, even the slight slouch in your shoulders. The result? A suit or dress so specific to those exact measurements that it only looks good if you stand and move in one exact way. If you gain or lose weight, or move differently, it no longer fits well.
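A quick way to see this in code is to fit plain least squares with more predictors than training points and compare the error on the training data to the error on fresh data. The sketch below uses synthetic data; all sizes and coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 20, 200, 40  # more predictors than training points

# Only the first 3 predictors actually matter; the rest are pure noise
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]

def make_data(n):
    X = rng.normal(size=(n, p))
    y = X @ true_beta + rng.normal(scale=1.0, size=n)
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Minimum-norm least squares fit (with p > n it matches the training data exactly)
beta_hat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

train_mse = np.mean((y_train - X_train @ beta_hat) ** 2)
test_mse = np.mean((y_test - X_test @ beta_hat) ** 2)
print(f"train MSE: {train_mse:.4f}")  # essentially zero
print(f"test MSE:  {test_mse:.4f}")   # much larger: overfitting
```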

1. Subset Selection: Choose the Best Predictors

Best Subset Selection: Considers all possible subsets of predictors (computationally expensive).

Stepwise Selection: Iteratively adds or removes predictors based on improvement in model fit (e.g., the Akaike Information Criterion).
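Scikit-learn does not ship AIC-based stepwise selection out of the box, but its `SequentialFeatureSelector` performs forward stepwise selection scored by cross-validation, which serves as a practical stand-in. A minimal sketch, with an entirely synthetic dataset:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
# Only features 0, 3, and 7 truly drive the response (illustrative)
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 1.5 * X[:, 7] + rng.normal(size=100)

# Forward stepwise selection, scored by cross-validated R^2
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
selector.fit(X, y)
print("selected features:", np.flatnonzero(selector.get_support()))
```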

2. Shrinkage Methods: Regularize Coefficients

Shrinkage methods add a penalty term to the least squares objective to control overfitting.

Ridge Regression:

Adds an L2 penalty (the sum of squared coefficients):

$$\mathrm{Objective:}\ \min_{\beta} \left( \sum_{i=1}^{n}\left(y_i - \widehat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)$$

Effect: Shrinks coefficients toward zero but doesn’t set them exactly to zero.

Use Case: When all predictors are expected to contribute somewhat.
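A minimal ridge fit with scikit-learn might look like the following; note that scikit-learn calls the penalty strength `alpha` rather than λ, and both the data and the value of `alpha` here are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

# alpha plays the role of lambda: larger alpha means stronger shrinkage
ridge = Ridge(alpha=10.0)
ridge.fit(X, y)

# Coefficients are shrunk toward zero, but typically none are exactly zero
print("exactly-zero coefficients:", np.sum(ridge.coef_ == 0))
print("coefficients:", np.round(ridge.coef_, 3))
```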

Lasso Regression:

Adds an L1 norm penalty:

$$\mathrm{Objective:}\ \min_{\beta} \left( \sum_{i=1}^{n} \left( y_i - \widehat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \right)$$

Effect: Encourages sparsity by setting some coefficients exactly to zero.

Use Case: When feature selection and interpretability are critical.

In both methods, λ controls the strength of the penalty, with larger values of λ leading to greater regularization.
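By contrast, a lasso fit on the same kind of data zeroes out many coefficients. Again, scikit-learn's `alpha` corresponds to λ, and in practice you would tune it (for example with `LassoCV`); the values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

# The L1 penalty drives most irrelevant coefficients exactly to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print("exactly-zero coefficients:", np.sum(lasso.coef_ == 0))
print("nonzero features:", np.flatnonzero(lasso.coef_))
```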


3. Dimension Reduction: Simplify Complexity

Dimension reduction transforms the predictors into a smaller set of uncorrelated components.

Principal Component Analysis (PCA): Projects the data onto k orthogonal components,

$$Z = XP$$

where P is the matrix of eigenvectors of

$$X^{\top} X$$

Use Case: Reducing dimensionality while preserving most of the data’s variability.
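To connect the formula to code, the sketch below centers X, computes the eigenvectors P of XᵀX, and forms the scores Z = XP, keeping the top k components. The data and k are illustrative; scikit-learn's `PCA` is used only as a cross-check and should give the same projection up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)  # center columns so X^T X is proportional to the covariance

# Eigenvectors of X^T X, sorted by decreasing eigenvalue
eigvals, P = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]
P = P[:, order]

k = 3
Z = Xc @ P[:, :k]  # scores: projection onto the top k components

# Cross-check against scikit-learn (components may differ by sign)
Z_sklearn = PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn)))
```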