Probability Distributions

Common discrete and continuous distributions used in Machine Learning.

1. Discrete Distributions

Distributions where the random variable takes values in a countable set, typically the non-negative integers.

1.1 Bernoulli Distribution

Models a single trial with two possible outcomes (Success/Failure), occurring with probability $p$ and $1-p$ respectively.

$$ P(X=k) = p^k (1-p)^{1-k} \quad \text{for } k \in \{0, 1\} $$

"Probability of X equals k is p to the power of k times (1 minus p) to the power of (1 minus k), where k is 0 or 1."

Why is this used in ML?

Think of Binary Classification problems, like determining if an email is "Spam" or "Not Spam". The model predicts the probability $p$ that an input belongs to the positive class (1). Each prediction is essentially a Bernoulli trial: a weighted coin flip where the model outputs the 'weight' (the probability $p$).

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: p=0.6
# PMF at k=1 (Success)
prob_success = stats.bernoulli.pmf(1, p=0.6)
# Result: 0.6
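
In fact, the standard binary cross-entropy loss is just the negative log of this Bernoulli PMF evaluated at the true label. A minimal sketch of that connection (the numbers are illustrative):

import numpy as np

# Model's predicted probability of class 1, and the true label
p_hat, y = 0.6, 1

# Bernoulli negative log-likelihood = binary cross-entropy
bce = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
# Result: 0.5108 (= -log(0.6))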

1.2 Binomial Distribution

Models the number of successes in $n$ independent Bernoulli trials, each with success probability $p$.

$$ P(X=k) = \binom{n}{k} p^k (1-p)^{n-k} $$

"Probability of X equals k is 'n choose k' times p to the power of k times (1 minus p) to the power of (n minus k)."

Why is this used in ML?

In A/B Testing, we show Version A to $n$ users and count how many click (the successes). The Binomial distribution lets us calculate whether the observed click count deviates from a baseline rate by more than chance alone would explain, helping us decide which website version is truly better.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: n=10 trials, p=0.5 (fair coin)
# Probability of exactly 5 heads
prob_5_heads = stats.binom.pmf(5, n=10, p=0.5)
# Result: 0.2461
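
Building on the A/B framing, a minimal significance sketch, assuming an illustrative baseline click rate of 5%: the survival function gives the probability of seeing a click count at least this extreme under the baseline.

from scipy import stats

# Observed: 70 clicks out of n=1000 users; baseline rate p=0.05
# P(X >= 70) under Binomial(n=1000, p=0.05)
p_value = stats.binom.sf(69, n=1000, p=0.05)
# A tiny p_value suggests the lift is not just chance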

1.3 Geometric Distribution

Models the number of trials needed to get the first success, where each trial succeeds independently with probability $p$.

$$ P(X=k) = (1-p)^{k-1}p $$

"Probability of X equals k is (1 minus p) to the power of (k minus 1) times p."

Why is this used in ML?

Used in User Behavior Analysis, for example: "How many ads does a user need to see before they finally click one?" Understanding this helps in estimating marketing costs and Customer Acquisition Cost (CAC) by modeling the "distance" to a conversion.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: p=0.2
# Probability of success on exactly the 3rd trial
prob_3rd_try = stats.geom.pmf(3, p=0.2)
# Result: 0.128
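
Extending the ads example with an illustrative 20% click probability per ad, the CDF answers "how likely is a click within the first 5 impressions?", and the mean gives the expected number of impressions needed:

from scipy import stats

# P(first click within 5 ads) = CDF at k=5
prob_within_5 = stats.geom.cdf(5, p=0.2)
# Result: 0.6723 (= 1 - 0.8**5)

# Expected ads until the first click = 1/p
expected_ads = stats.geom.mean(p=0.2)
# Result: 5.0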

1.4 Poisson Distribution

Models the number of events occurring in a fixed interval of time or space.

$$ P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} $$

"Probability of X equals k is lambda to the power of k times e to the power of negative lambda, divided by k factorial."

Why is this used in ML?

Critical for Anomaly Detection. If a server typically receives 5 requests/minute (average rate $\lambda=5$), we can calculate the probability of seeing 100 requests in a minute. If that probability is vanishingly small, the system flags it as an anomaly (e.g., a DDoS attack), because it falls far outside the expected Poisson behavior.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: lambda=3 (average events per interval)
# Probability of exactly 5 events
prob_5_events = stats.poisson.pmf(5, mu=3)  # SciPy calls lambda "mu"
# Result: 0.1008
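
A minimal anomaly-check sketch following the server example (the alert threshold is an illustrative choice):

from scipy import stats

# Baseline: lambda=5 requests/minute; observed: 100 requests
observed = 100

# Tail probability P(X >= 100) = P(X > 99) under Poisson(5)
tail_prob = stats.poisson.sf(observed - 1, mu=5)

# Flag as anomalous if the tail probability is below a chosen threshold
is_anomaly = tail_prob < 1e-6  # True here: the tail is effectively zero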

2. Continuous Distributions

Distributions where the random variable can take any value within a range.

2.1 Uniform Distribution

All outcomes in the range $[a, b]$ are equally likely.

$$ f(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b $$

"The probability density f of x is 1 divided by (b minus a), for x between a and b."

Why is this used in ML?

Weight Initialization: When training a Neural Network, we can't start with all weights at zero (every neuron would compute the same output and receive the same gradient, so the model won't learn). Instead we pick small random numbers from a Uniform distribution. This ensures fairness: every neuron gets a random starting point within the same range, breaking symmetry without favoring any specific direction.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: a=0, b=10
# PDF at x=5
pdf_at_5 = stats.uniform.pdf(5, loc=0, scale=10)  # loc=a, scale=b-a
# Result: 0.1
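
A minimal initialization sketch with NumPy, using the common heuristic of drawing each weight uniformly from $[-1/\sqrt{n}, 1/\sqrt{n}]$, where $n$ is the number of inputs to the layer (the layer sizes are illustrative):

import numpy as np

fan_in, fan_out = 128, 64      # layer sizes (illustrative)
limit = 1.0 / np.sqrt(fan_in)  # common heuristic for the init range

# Each weight is drawn uniformly from [-limit, limit]:
# small, random, and symmetric around zero
W = np.random.uniform(low=-limit, high=limit, size=(fan_in, fan_out))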

2.2 Normal (Gaussian) Distribution

The bell curve. Symmetric, defined by mean $\mu$ and std dev $\sigma$.

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} $$

"The probability density f of x is one over sigma times root 2 pi, times e raised to negative one-half times quantity (x minus mu over sigma) squared."

Why is this used in ML?

The Gaussian Assumption is everywhere. In Linear Regression, we assume the errors (residuals) follow a Normal distribution: we expect most predictions to be off by a small amount (near the peak) and very few to be massively wrong (the tails). Standardizing data (Z-score) rescales features to mean 0 and standard deviation 1, putting them on a common scale so algorithms converge faster.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: mean=0, std=1 (Standard Normal)
# PDF at x=0 (the peak)
pdf_at_peak = stats.norm.pdf(0, loc=0, scale=1)
# Result: 0.3989
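
A minimal Z-score sketch with NumPy (the sample values are illustrative): standardization rescales a feature to mean 0 and standard deviation 1.

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()
# x has mean 5.0 and standard deviation 2.0,
# so z = (x - 5) / 2 has mean 0 and standard deviation 1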

2.3 Exponential Distribution

Models the time between events in a Poisson process.

$$ f(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0 $$

"The probability density f of x is lambda times e to the negative lambda x."

Why is this used in ML?

Survival Analysis & Churn Prediction. We model "Time to Failure" or "Time to Churn". Instead of just predicting if a customer will leave, specific models use the Exponential distribution to predict when they might leave. Its "decay" curve encodes a constant hazard rate (the memoryless property): the chance of the event in the next instant is the same no matter how long we have already waited.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: scale = 1/lambda. If lambda=0.5, scale=2.
# PDF at x=2
pdf_time = stats.expon.pdf(2, scale=2)
# Result: 0.1839
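
Continuing the churn framing, a minimal survival sketch with an illustrative churn rate of $\lambda = 0.5$ per month: the survival function $P(T > t) = e^{-\lambda t}$ gives the chance a customer is still active after $t$ months.

from scipy import stats

# lambda=0.5 per month -> scale = 1/lambda = 2
# P(customer is still active after t=4 months)
prob_still_active = stats.expon.sf(4, scale=2)
# Result: 0.1353 (= e**-2)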

2.4 Gamma Distribution

Generalization of the Exponential distribution. Models the waiting time until $k$ events have occurred.

$$ f(x) \propto x^{k-1}e^{-x/\theta} $$

"The probability density f of x is proportional to x to the (k minus 1) times e to the negative x over theta."

Why is this used in ML?

Used in Process Optimization. While Exponential models the wait for one event, Gamma models the wait time for $k$ events to happen sequentially. In Bayesian ML, it's used as a "conjugate prior" for the precision of a Normal distribution—essentially helping us estimate how "uncertain" or "noisy" our data is.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: a=2 (shape k), scale=2 (theta)
# PDF at x=3
pdf_val = stats.gamma.pdf(3, a=2, scale=2)
# Result: 0.1673
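
A minimal conjugate-update sketch for the Normal-precision case mentioned above, assuming a known mean and a Gamma(shape $a$, rate $b$) prior on the precision: the posterior is Gamma with shape $a + n/2$ and rate $b + \frac{1}{2}\sum_i (x_i - \mu)^2$. All numbers are illustrative.

import numpy as np

x = np.array([4.8, 5.1, 5.3, 4.9, 5.4])  # observed data (illustrative)
mu = 5.0                                  # known mean
a0, b0 = 1.0, 1.0                         # Gamma(shape, rate) prior on precision

# Conjugate update: shape += n/2, rate += (sum of squared deviations) / 2
a_post = a0 + len(x) / 2
b_post = b0 + np.sum((x - mu) ** 2) / 2

# Posterior mean of the precision (shape/rate)
# (Note: SciPy's stats.gamma uses scale = 1/rate.)
precision_estimate = a_post / b_post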

2.5 Beta Distribution

Defined on $[0, 1]$. Used for modeling probabilities of probabilities.

$$ f(x) \propto x^{\alpha-1}(1-x)^{\beta-1} $$

"The probability density f of x is proportional to x to the (alpha minus 1) times (1 minus x) to the (beta minus 1)."

Why is this used in ML?

Reinforcement Learning (Thompson Sampling). Imagine a slot machine (Multi-Armed Bandit). You want to know the probability of winning ($p$). We use a Beta distribution to represent our "belief" about $p$. Initially, it's flat (we know nothing). As we play and win/lose, we update the Beta shape. It peaks around the true winning rate. It's the standard way to model "Probability of a Probability".

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: alpha=2, beta=2 (symmetric, bell-like on [0, 1])
# PDF at x=0.5
pdf_mid = stats.beta.pdf(0.5, a=2, b=2)
# Result: 1.5
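
A minimal Thompson Sampling sketch for a two-armed bandit, as described above (the win/loss counts are illustrative): each arm's belief about $p$ is a Beta, and we pull the arm whose sampled win rate is highest.

import numpy as np

rng = np.random.default_rng(seed=0)

# Per-arm record of (wins, losses) so far -- illustrative counts
arms = [(12, 8), (30, 45)]

# Belief per arm: Beta(wins + 1, losses + 1), i.e. a flat prior updated by data
samples = [rng.beta(w + 1, l + 1) for w, l in arms]

# Pull the arm with the highest sampled win probability
chosen_arm = int(np.argmax(samples))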

2.6 Chi-square ($\chi^2$) Distribution

Sum of squared standard normal variables.

$$ Q = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k $$

"Q is the sum of k squared standard normal variables Z-sub-i, which follows a Chi-square distribution with k degrees of freedom."

Why is this used in ML?

Feature Selection. We use the Chi-square test to check independence between categorical variables. For example, is "Color" related to "Sales"? We compare the Actual counts of Red/Blue items sold vs. the Expected counts if there were no relationship. A high Chi-square score means they are related, so "Color" is a useful feature to keep for training.

Code Implementation


# Library: SciPy
from scipy import stats

# Setup: df=2 (degrees of freedom)
# PDF at x=1
pdf_val = stats.chi2.pdf(1, df=2)
# Result: 0.3033
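
A minimal independence-test sketch with SciPy for the Color-vs-Sales example (the counts are illustrative):

from scipy import stats

# Contingency table: rows = Color (Red, Blue), columns = (Sold, Not Sold)
observed = [[30, 20],
            [15, 35]]

# chi2_contingency compares the observed counts against the
# counts expected under independence
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
# A small p_value suggests Color and Sales are related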
