Probability Basics

Core probability theory for modeling uncertainty in Machine Learning.

1. Core Probability Theory

Probability theory provides a framework for modeling uncertainty. In Machine Learning, we deal with uncertain events (noisy data, stochastic processes, model predictions), making probability the language of ML.

$$ 0 \leq P(E) \leq 1 $$

"The probability of an event E is between 0 and 1, inclusive."

2. Probability Rules

Addition Rule

For any two events A and B:

$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$
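
A minimal numeric sketch (card-drawing values assumed for illustration), with A = "heart" and B = "face card" when drawing one card from a standard 52-card deck:

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_heart = 13 / 52          # P(A)
p_face = 12 / 52           # P(B)
p_heart_face = 3 / 52      # P(A and B): jack, queen, king of hearts
p_heart_or_face = p_heart + p_face - p_heart_face
# Result: 22/52 ≈ 0.4231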

Multiplication Rule

For any two events A and B:

$$ P(A \cap B) = P(A|B)P(B) $$
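
A minimal sketch (values assumed): drawing two cards without replacement, with B = "first card is an ace" and A = "second card is an ace":

# Multiplication rule: P(A and B) = P(A|B) * P(B)
p_first_ace = 4 / 52                  # P(B)
p_second_ace_given_first = 3 / 51     # P(A|B)
p_both_aces = p_second_ace_given_first * p_first_ace
# Result: ≈ 0.0045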

Bayes’ Theorem

Describes the probability of an event based on prior knowledge of conditions that might be related to it.

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

"Probability of A given B equals Probability of B given A times Probability of A, divided by Probability of B."

Why is this used in ML?

Bayes’ Theorem is the foundation of Naive Bayes Classifiers and Bayesian Inference. It allows us to update model beliefs as we acquire new data.

Code Implementation


# Calculate P(Disease | Positive Test)
# Given: P(D)=0.01, P(Pos|D)=0.99, P(Pos|~D)=0.05

p_d = 0.01                  # Prior: P(D)
p_pos_given_d = 0.99        # Likelihood: P(Pos|D)
p_pos_given_not_d = 0.05    # False positive rate: P(Pos|~D)

# Evidence via the law of total probability:
# P(Pos) = P(Pos|D)P(D) + P(Pos|~D)P(~D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
# Result: 0.0594

# Posterior via Bayes' Theorem
p_d_given_pos = (p_pos_given_d * p_d) / p_pos
# Result: 0.1667

3. Conditional Probability

The probability of an event occurring, given that another event has already occurred.

$$ P(A|B) = \frac{P(A \cap B)}{P(B)} $$
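
A minimal sketch applying the definition to assumed counts: out of 100 days, it rained on 40, and it rained with heavy traffic on 30.

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_rain = 40 / 100                  # P(B): it rains
p_rain_and_traffic = 30 / 100      # P(A and B): rain and heavy traffic
p_traffic_given_rain = p_rain_and_traffic / p_rain
# Result: 0.75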

4. Independence

Two events A and B are independent if the occurrence of one does not affect the probability of occurrence of the other.

$$ P(A \cap B) = P(A)P(B) \iff P(A|B) = P(A) $$

Why is this used in ML?

The Naive Bayes assumption is that features are conditionally independent given the class label, which simplifies computation significantly.
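
A minimal sketch checking the independence condition numerically for a fair die (example assumed): A = "roll is even", B = "roll is at most 4".

# A = {2, 4, 6}, B = {1, 2, 3, 4}, A and B = {2, 4}
p_a = 3 / 6
p_b = 4 / 6
p_a_and_b = 2 / 6
independent = abs(p_a_and_b - p_a * p_b) < 1e-12
# p_a * p_b = 1/3 = p_a_and_b, so A and B are independent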

5. Random Variables

A variable whose possible values are numerical outcomes of a random phenomenon.

Discrete Random Variables

Can take on a countable number of distinct values (e.g., outcome of a die roll).

$$ \sum_{x} P(X=x) = 1 $$

Continuous Random Variables

Can take on any value within a continuous range, i.e., uncountably many possible values (e.g., height, time). Defined by a Probability Density Function (PDF).

$$ \int_{-\infty}^{\infty} f(x) dx = 1 $$

Code Implementation


from scipy import stats

# 1. Normal Distribution (Continuous)
# PDF at x=0 for Standard Normal (mean=0, std=1)
norm_pdf_0 = stats.norm.pdf(0)
# Result: 0.3989

# CDF at x=0 (Probability that X <= 0)
norm_cdf_0 = stats.norm.cdf(0)
# Result: 0.5

# 2. Binomial Distribution (Discrete)
# PMF: 10 trials, p=0.5, prob of exactly 5 heads
binom_pmf_5 = stats.binom.pmf(5, n=10, p=0.5)
# Result: 0.2461
    

6. Expectation, Variance, & Covariance

Expectation (Expected Value)

The long-run average value of repetitions of the experiment.

$$ E[X] = \sum x P(x) \quad \text{(Discrete)} $$
$$ E[X] = \int x f(x) dx \quad \text{(Continuous)} $$

Variance

Measures the spread of a random variable around its expected value.

$$ \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 $$

Covariance

Measure of the joint variability of two random variables.

$$ \text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] $$

Code Implementation


import numpy as np

# Rolling a fair die (1-6), p=1/6
val = np.array([1, 2, 3, 4, 5, 6])
prob = np.array([1/6] * 6)

# Expected Value E[X] = sum(x * p(x))
ev = np.sum(val * prob)
# Result: 3.5

# Variance Var(X) = E[X^2] - (E[X])^2
var = np.sum((val**2) * prob) - ev**2
# Result: 2.92
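
The block above covers expectation and variance; covariance, defined earlier, can be illustrated with a minimal sketch (sample values assumed) comparing the definition against NumPy's np.cov:

# Two small samples with a positive relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Covariance from the definition: E[(X - E[X])(Y - E[Y])]
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))
# Result: 1.6  (population covariance)

# np.cov with ddof=0 uses the same normalization (divide by N)
cov_numpy = np.cov(x, y, ddof=0)[0, 1]
# Result: 1.6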
    

7. Joint & Marginal Distributions

Joint Probability Distribution

Gives the probability that two or more random variables fall within a particular range or discrete set of values simultaneously.

$$ P(X=x, Y=y) $$

Marginal Probability Distribution

The probability distribution of a subset of the collection of random variables, obtained by summing (discrete) or integrating (continuous) over the other variables.

$$ P(X=x) = \sum_{y} P(X=x, Y=y) $$

Why is this used in ML?

Understanding joint and marginal distributions is crucial for Generative Models (like GANs and VAEs) which try to learn the joint probability distribution of the data.

Code Implementation


import pandas as pd

# Scenario: Weather vs Commute Mode
data = {
    'Weather': ['Sunny', 'Sunny', 'Rainy', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Rainy', 'Sunny', 'Rainy'],
    'Commute': ['Walk', 'Bus',  'Bus',   'Walk',  'Car',   'Bus',   'Walk',  'Car',   'Bus',   'Car']
}
df = pd.DataFrame(data)

# Joint Probability Table
joint_probs = pd.crosstab(df['Weather'], df['Commute'], normalize=True)
# Commute  Bus  Car  Walk
# Weather
# Rainy    0.2  0.3   0.0
# Sunny    0.2  0.0   0.3

# Marginal Probability (Weather)
# Sum across columns (axis=1)
marginal_weather = joint_probs.sum(axis=1)
# Result: 
# {'Rainy': 0.5, 'Sunny': 0.5}

# Marginal Probability (Commute)
# Sum across rows (axis=0)
marginal_commute = joint_probs.sum(axis=0)
# Result: 
# {'Bus': 0.4, 'Car': 0.3, 'Walk': 0.3}
    

8. References & Further Reading