Regression & Linear Models

The foundation of Machine Learning algorithms.

Almost all Machine Learning algorithms build on these principles. Even complex Neural Networks are essentially layers of linear regressions followed by non-linear activations. Understanding regression is understanding how machines "learn" relationships from data.

1. Simple & Multiple Linear Regression

Linear regression models the relationship between a dependent variable $y$ and one or more independent variables $X$ using a linear function.

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon $$

"y equals beta-nought plus beta-one x-one ... plus epsilon (error)."

Why is this used in ML?

It is the simplest form of **Supervised Learning** for regression tasks (predicting continuous values). The weights ($\beta$) represent the *importance* of each feature.

Code Implementation


# Scikit-Learn: Linear Regression
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic Data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.1, 8.2, 10.1]) # Roughly y = 2x

model = LinearRegression()
model.fit(X, y)

# Learned Parameters
intercept = model.intercept_
slope = model.coef_[0]
# Result: Intercept: -0.01, Slope: 2.03

2. Least Squares Estimation

Least squares is the standard method for finding the best-fitting line: it minimizes the sum of the squared vertical differences (residuals) between the observed data points and the fitted line. In practice this is usually written as the Mean Squared Error (the average rather than the raw sum), which has the same minimizer.

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

"Mean Squared Error equals one over n times the sum of squared differences between actual y and predicted y."

Why is this used in ML?

This introduces the concept of a **Loss Function**. Training a model means minimizing this Loss Function via optimization (like the analytical Ordinary Least Squares solution or Gradient Descent).
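
Code Implementation

A minimal sketch, assuming the same synthetic data as in the example above: Ordinary Least Squares solved in closed form via the normal equation. Scikit-Learn's LinearRegression performs the same minimization internally.

# Ordinary Least Squares in closed form: beta = (X^T X)^{-1} X^T y
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.1, 8.2, 10.1])

# Prepend a column of ones so the intercept (beta_0) is estimated as well
X_b = np.c_[np.ones((len(X), 1)), X]

beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # [intercept, slope] ~ [-0.01, 2.03]

y_hat = X_b @ beta
mse = np.mean((y - y_hat) ** 2)  # Mean Squared Error of the fit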

3. Logistic Regression

Despite the name, this is used for **Classification**. It applies the Sigmoid function to the linear output to squash predictions between 0 and 1 (probabilities).

$$ P(y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} $$

"Probability of y being 1 given X equals 1 divided by (1 plus e to the negative linear predictor)."

Why is this used in ML?

It is the baseline for binary classification. It also introduces **non-linearity** (the Sigmoid activation): a single sigmoid neuron is essentially a logistic regression, which is why logistic regression is often described as a building block of Neural Networks.

Code Implementation


# Scikit-Learn: Logistic Regression
import numpy as np
from sklearn.linear_model import LogisticRegression

# Binary Data (0 or 1)
X = np.array([[1], [2], [10], [11]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# Prediction for new data
prob = clf.predict_proba([[5]])[0][1]
# Probability of Class 1 for X=5: 0.329

4. Regularization: Lasso, Ridge, Elastic Net

Techniques to prevent **Overfitting** by adding a penalty term to the Loss Function.

  • Lasso (L1): Adds absolute value of coefficients. Can reduce weights to zero (Feature Selection).
  • Ridge (L2): Adds squared value of coefficients. Shrinks weights but keeps them non-zero.
  • Elastic Net: Combines L1 and L2 penalties.

$$ \text{Loss} + \lambda \sum_j |\beta_j| \;\; \text{(Lasso)} \quad \text{vs.} \quad \text{Loss} + \lambda \sum_j \beta_j^2 \;\; \text{(Ridge)} $$

Why is this used in ML?

Regularization is crucial when you have many features or little data. It simplifies the model so that it generalizes better to unseen data (the Bias-Variance Tradeoff).

Code Implementation


# Scikit-Learn: Ridge (L2) Regularization
import numpy as np
from sklearn.linear_model import Ridge

# Noisy small dataset
X_rng = np.random.rand(10, 1)
y_rng = 2 * X_rng + 0.5 + np.random.randn(10, 1) * 0.5

ridge = Ridge(alpha=1.0) # alpha is the regularization strength (lambda)
ridge.fit(X_rng, y_rng)

coef = ridge.coef_[0][0]
# Ridge Coefficient: e.g. 1.3033 (varies between runs since no random seed is fixed)
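
To illustrate the feature-selection effect of the L1 penalty described above, here is a small sketch; the five-feature synthetic dataset and the alpha value are illustrative choices.

# Scikit-Learn: Lasso (L1) can drive irrelevant coefficients to exactly zero
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_many = rng.normal(size=(50, 5))                           # 5 features
y_many = 3 * X_many[:, 0] + rng.normal(scale=0.1, size=50)  # only feature 0 matters

lasso = Lasso(alpha=0.1)
lasso.fit(X_many, y_many)

coefs = lasso.coef_
# Typically only the first coefficient is clearly non-zero; the rest collapse to (near) zero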

5. Model Assumptions

Linear regression relies on several key assumptions to be valid (a quick diagnostic sketch follows the list):

  • Linearity: The relationship between X and y is linear.
  • Independence: Observations are independent of each other.
  • Homoscedasticity: The variance of errors is constant across all levels of X.
  • Normality: The errors follow a normal distribution.
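
A rough sketch of how two of these assumptions can be checked on fitted residuals; the Shapiro-Wilk test and the simple variance split are illustrative choices, not the only diagnostics.

# Checking normality and homoscedasticity of residuals on a synthetic fit
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_chk = rng.uniform(0, 10, size=(100, 1))
y_chk = 2 * X_chk[:, 0] + 1 + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X_chk, y_chk)
residuals = y_chk - model.predict(X_chk)

# Normality: Shapiro-Wilk test (a large p-value gives no evidence against normal errors)
stat, p_value = stats.shapiro(residuals)

# Homoscedasticity (rough check): residual variance should be similar for small vs. large X
low, high = X_chk[:, 0] < 5, X_chk[:, 0] >= 5
var_low, var_high = residuals[low].var(), residuals[high].var()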

6. Residual Analysis

Analyzing the residuals ($y - \hat{y}$) helps debug the model. If assumptions are violated, residuals will show patterns instead of random noise.

Why is this used in ML?

"Debugging" in ML often means checking residuals. If residuals have a pattern (e.g., a curve), it means your linear model missed non-linear information, suggesting you might need a more complex model or feature engineering.

7. References & Further Reading