Statistics for Machine Learning
Mathematical foundations for data science algorithms.
Current Datasets
We are using two correlated datasets ($X$ and $Y$) to demonstrate these concepts.
Dataset X: [12, 15, 12, 18, 20, 22, 12, 25, 30]
Dataset Y: [11, 14, 13, 19, 21, 24, 11, 26, 31]
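The snippets in every section below assume a shared setup roughly like the following (a minimal sketch; data_x, data_y, and df are the names the snippets use):
import numpy as np
import pandas as pd
from scipy import stats

data_x = [12, 15, 12, 18, 20, 22, 12, 25, 30]
data_y = [11, 14, 13, 19, 21, 24, 11, 26, 31]
df = pd.DataFrame({'x': data_x, 'y': data_y})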
1. Mean (Average)
The arithmetic average of a finite set of numbers: the sum of the values divided by how many there are.
"Mu (mean) equals one divided by N times the sum of x-sub-i where i goes from 1 to N."
Why is this used in ML?
The mean is the most common measure of central tendency. It is used in Data Normalization (e.g., Mean Normalization) to center data around zero, which speeds up Gradient Descent convergence.
Code Implementation
# Numpy
mean_val = np.mean(data_x)
# Result: 18.444444444444443
# Pandas
df['x'].mean()
# Result: 18.444444444444443
# Scipy
stats.tmean(data_x)
# Result: 18.444444444444443
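To illustrate the normalization use case above, a minimal mean-normalization sketch (the name x_norm is illustrative):
x = np.array(data_x, dtype=float)
# Mean normalization: center on zero, then divide by the range
x_norm = (x - x.mean()) / (x.max() - x.min())
# x_norm now has mean ~0, which helps Gradient Descent converge faster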
2. Median
The middle value separating the higher half from the lower half of a data sample.
"The Median is the value at position (n+1)/2 if n is odd, or the average of the two middle values if n is even."
Why is this used in ML?
The median is robust to outliers. In Data Preprocessing, we often replace missing values (imputation) with the median instead of the mean when the feature has outliers or a skewed distribution.
Code Implementation
# Numpy
median_val = np.median(data_x)
# Result: 18.0
# Pandas
df['x'].median()
# Result: 18.0
# Scipy
stats.scoreatpercentile(data_x, 50)
# Result: 18.0
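A minimal imputation sketch along those lines (the series s and its values are hypothetical):
s = pd.Series([12, 15, np.nan, 18, 200])   # 200 is an outlier
s_imputed = s.fillna(s.median())           # the median (16.5) is unaffected by the outlier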
3. Mode
The value that appears most often in a set of data values.
Why is this used in ML?
The mode is crucial for Categorical Data Imputation. If a categorical feature has missing values, we typically fill them with the most frequent category (the mode).
Code Implementation
# Scipy
mode_val = stats.mode(data_x).mode  # stats.mode returns a ModeResult; .mode holds the value
# Result: 12
# Pandas
df['x'].mode()[0]  # .mode() returns a Series (ties are possible); take the first entry
# Result: 12
# Numpy (via Unique)
vals, counts = np.unique(data_x, return_counts=True)
mode = vals[np.argmax(counts)]
# Result: 12
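A minimal categorical-imputation sketch (the color feature and its values are hypothetical):
color = pd.Series(['red', 'blue', 'red', None, 'green'])
color_imputed = color.fillna(color.mode()[0])   # fills the missing entry with 'red'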
4. Range
The difference between the largest and smallest values.
"Range equals the maximum value of x minus the minimum value of x."
Why is this used in ML?
Range gives a quick sense of the data spread. It is fundamental in Min-Max Scaling (Normalization), which scales data to a fixed range (usually 0 to 1) using the formula $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.
Code Implementation
# Python (Manual)
r = max(data_x) - min(data_x)
# Result: 18
# Numpy
r = np.ptp(data_x)
# Result: 18
# Pandas
r = df['x'].max() - df['x'].min()
# Result: 18
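A minimal Min-Max Scaling sketch using the formula above (x_scaled is illustrative; scikit-learn's MinMaxScaler performs the same transformation):
x = np.array(data_x, dtype=float)
# Min-Max scaling: map the feature onto [0, 1] using its range
x_scaled = (x - x.min()) / (x.max() - x.min())
# the minimum (12) maps to 0.0 and the maximum (30) maps to 1.0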
5. Percentiles & Quantiles
Values below which a certain percentage of data falls. Quartiles divide data into four equal parts (25%, 50%, 75%).
"P-sub-k is the value where k percent of the data lies below it."
Why is this used in ML?
Percentiles help understand distribution and identify outliers. Box Plots visualize the 25th ($Q1$) and 75th ($Q3$) percentiles. They are also used in Quantile Binning to handle skewed features.
Code Implementation
# Numpy (25th & 75th Percentiles)
p25, p75 = np.percentile(data_x, [25, 75])
# Result: 12.0, 22.0
# Pandas
p25 = df['x'].quantile(0.25)
# Result: 12.0
# Scipy
stats.scoreatpercentile(data_x, 25)
# Result: 12.0
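A minimal Quantile Binning sketch with pandas (four bins requested; duplicates='drop' is needed here because the repeated 12s produce duplicate bin edges):
bins = pd.qcut(df['x'], q=4, duplicates='drop')
print(bins.value_counts())   # roughly equal-sized buckets despite the skew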
6. Interquartile Range (IQR)
A measure of statistical dispersion equal to the difference between the 75th and 25th percentiles.
"IQR equals the third quartile (75th percentile) minus the first quartile (25th percentile)."
Why is this used in ML?
IQR is the standard method for Outlier Detection. A common rule is that any data point falling below $Q1 - 1.5 \times \text{IQR}$ or above $Q3 + 1.5 \times \text{IQR}$ is considered an outlier and often removed or capped.
Code Implementation
# Numpy
iqr = np.percentile(data_x, 75) - np.percentile(data_x, 25)
# Result: 10.0
# Scipy
stats.iqr(data_x)
# Result: 10.0
# Pandas
iqr = df['x'].quantile(0.75) - df['x'].quantile(0.25)
# Result: 10.0
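A minimal sketch of the $1.5 \times \text{IQR}$ rule applied to this dataset (variable names are illustrative):
q1, q3 = np.percentile(data_x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences: -3.0 and 37.0
outliers = [v for v in data_x if v < lower or v > upper]
# no point falls outside the fences here, so outliers is empty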
7. Variance
A measure of dispersion that represents how spread out the data points are from the mean.
"Sigma squared (Variance) equals the average of the squared differences between each data point (x-sub-i) and the mean (mu)."
Why is this used in ML?
Variance helps us understand the spread of data. In Principal Component Analysis (PCA), we look for directions (components) that maximize variance to retain the most information while reducing dimensionality.
Code Implementation
# Python (Manual)
mean = sum(data_x) / len(data_x)
variance = sum((x - mean) ** 2 for x in data_x) / (len(data_x) - 1)
# Result: 41.02777777777778
# Numpy (Sample Variance, ddof=1)
var_val = np.var(data_x, ddof=1)
# Result: 41.02777777777778
# Pandas
df['x'].var()
# Result: 41.02777777777778
# Scipy
stats.tvar(data_x)
# Result: 41.027777777777786
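Since the formula above is the population version while the results are the sample version, a quick sketch of how NumPy's ddof argument switches between the two:
x = np.array(data_x, dtype=float)
pop_var = np.var(x)             # divides by n (population variance), ~36.47
sample_var = np.var(x, ddof=1)  # divides by n - 1 (sample variance), ~41.03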
8. Standard Deviation
The square root of the variance, quantifying the amount of variation of a set of data values.
"Sigma (Standard Deviation) is the square root of the variance."
Why is this used in ML?
Standard Deviation is the basis for Z-Score Standardization ($z = \frac{x - \mu}{\sigma}$). This scales features so they have $\mu=0$ and $\sigma=1$, ensuring features with larger ranges don't dominate objective functions in algorithms like SVMs and KNN.
Code Implementation
# Python (Manual)
mean = sum(data_x) / len(data_x)
variance = sum((x - mean) ** 2 for x in data_x) / (len(data_x) - 1)
std_dev = variance ** 0.5
# Result: 6.405292950191878
# Numpy
std_val = np.std(data_x, ddof=1)
# Result: 6.405292950191878
# Pandas
df['x'].std()
# Result: 6.405292950191878
# Scipy
stats.tstd(data_x)
# Result: 6.405292950191879
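A minimal Z-Score Standardization sketch (z is illustrative; scikit-learn's StandardScaler performs the same transformation):
x = np.array(data_x, dtype=float)
z = (x - x.mean()) / x.std()   # np.std defaults to the population std (ddof=0)
# z now has mean ~0 and standard deviation 1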
9. Covariance
A measure of the joint variability of two random variables ($X$ and $Y$). Positive covariance means they move together.
"Covariance of X and Y equals the average of the product of differences of X from its mean and Y from its mean."
Why is this used in ML?
Covariance indicates the direction of relationship. It is central to Multivariate Gaussian Distributions. In PCA, the eigenvectors of the Covariance Matrix determine the principal components.
Code Implementation
# Python (Manual)
n = len(data_x)
mean_x, mean_y = sum(data_x) / n, sum(data_y) / n
cov = sum((data_x[i] - mean_x) * (data_y[i] - mean_y) for i in range(n)) / (n - 1)
# Result: 45.55555555555556
# Numpy (Returns Covariance Matrix)
cov_matrix = np.cov(data_x, data_y)
cov_val = cov_matrix[0][1]
# Result: 45.55555555555556
# Pandas
cov_val = df['x'].cov(df['y'])
# Result: 45.55555555555556
# Scipy (Derived from Pearson R)
r, _ = stats.pearsonr(data_x, data_y)
cov = r * stats.tstd(data_x) * stats.tstd(data_y)
# Result: 45.55555555555555
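A minimal sketch of the PCA connection mentioned above, treating x and y as a two-feature dataset; the eigenvectors of the covariance matrix give the principal components (all names here are illustrative):
X = np.column_stack([data_x, data_y]).astype(float)
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)   # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, eigenvalues ascending
pc1 = eigvecs[:, -1]                     # direction of maximum variance = first principal component
projected = X_centered @ pc1             # data projected onto that direction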
10. Correlation
A normalized measure of the strength and direction of the linear relationship between two variables, ranging from -1 to +1.
"r (Correlation) equals the covariance of X and Y divided by the product of the standard deviation of X and the standard deviation of Y."
Why is this used in ML?
Correlation is superior to covariance for Feature Selection because it is scale-invariant. Pearson measures linear relationships, while Spearman measures monotonic relationships (rank-based), which is useful for non-linear data.
Pearson vs Spearman
Pearson: Assumes linearity and normal distribution.
Spearman: Non-parametric, works on rank order.
Code Implementation
# Python (Manual)
# Assumes manual_cov, manual_std_x, and manual_std_y are calculated
corr = manual_cov / (manual_std_x * manual_std_y)
# Result: 0.9923963192185303
# Numpy (Pearson Matrix)
corr_matrix = np.corrcoef(data_x, data_y)
corr_val = corr_matrix[0, 1]   # the off-diagonal entry is the Pearson r between x and y
# Result: 0.9923963192185306
# Pandas
p_corr = df['x'].corr(df['y'], method='pearson')
s_corr = df['x'].corr(df['y'], method='spearman')
# Pearson: 0.9923963192185306 | Spearman: 0.9873144969898835
# Scipy
pearson_r, _ = stats.pearsonr(data_x, data_y)     # returns (r, p-value)
spearman_r, _ = stats.spearmanr(data_x, data_y)   # returns (correlation, p-value)
# Pearson: 0.9923963192185301 | Spearman: 0.9873144969898835
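A minimal feature-selection sketch built on the correlation matrix, assuming df holds all candidate features (the 0.95 threshold and the name to_drop are illustrative choices):
corr = df.corr(method='pearson').abs()
# keep only the upper triangle so each feature pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
# x and y correlate at ~0.99, so 'y' is flagged as redundant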