Ch 8: Unsupervised Learning - Intermediate¶

Track: Practitioner | Try code in Playground | Back to chapter overview

Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb in Jupyter.

Chapter 8: Unsupervised Learning¶

Notebook 02 - Intermediate: Advanced Clustering¶

Beyond K-Means: hierarchical clustering, density-based methods, and Gaussian mixtures for real-world data shapes.

What you'll learn: - Hierarchical (agglomerative) clustering and dendrograms - DBSCAN for density-based clustering - Gaussian Mixture Models (GMMs) - Comparing clustering algorithms on different data shapes

Time estimate: 2.5 hours

Try it yourself: Experiment with different linkage methods (single, complete, average, ward) on the hierarchical clustering example. Change eps and min_samples in DBSCAN to see how they affect cluster formation.

Common mistakes: Using K-Means on non-convex shapes (e.g., moons), ignoring the k-distance graph when tuning DBSCAN, or assuming spherical clusters when data is elliptical.

1. Hierarchical Clustering¶

Hierarchical clustering builds a tree of clusters instead of requiring a fixed number of clusters up front. The agglomerative (bottom-up) approach proceeds as follows:

Start — treat every data point as its own single-point cluster.
Merge — find the two closest clusters and merge them into one.
Repeat — keep merging until only a single cluster remains (or until a stopping criterion is met).

The result is a hierarchy that can be visualised as a dendrogram — a tree diagram showing the order and distance of each merge.

Linkage criteria¶

"Distance between two clusters" can be measured in several ways:

Linkage	Definition	Tendency
Single	Minimum distance between any pair of points across two clusters	Produces elongated, chain-like clusters
Complete	Maximum distance between any pair of points across two clusters	Produces compact, roughly equal-sized clusters
Average	Mean distance between all pairs of points across two clusters	Compromise between single and complete
Ward	Minimises the total within-cluster variance at each merge	Tends to produce equally sized, spherical clusters

Ward linkage is the most commonly used default and works well when clusters are roughly spherical.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

np.random.seed(42)

# Generate synthetic data with 4 well-separated clusters
X_hier, y_hier = make_blobs(
    n_samples=200, centers=4, cluster_std=0.8, random_state=42
)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel — raw data
axes[0].scatter(X_hier[:, 0], X_hier[:, 1], s=30, alpha=0.7, edgecolors='k', linewidths=0.3)
axes[0].set_title('Raw Data (200 points, 4 clusters)')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Right panel — dendrogram using Ward linkage
Z_ward = linkage(X_hier, method='ward')
dendrogram(
    Z_ward,
    truncate_mode='lastp',
    p=30,
    leaf_rotation=90,
    leaf_font_size=8,
    ax=axes[1],
    color_threshold=12
)
axes[1].set_title('Dendrogram (Ward Linkage, truncated to 30 leaves)')
axes[1].set_xlabel('Cluster (size)')
axes[1].set_ylabel('Merge Distance')
axes[1].axhline(y=12, color='r', linestyle='--', label='Cut at distance = 12')
axes[1].legend()

plt.tight_layout()
plt.show()

The dendrogram shows the full merge history. By drawing a horizontal cut line we decide how many clusters to keep — each vertical line that crosses the cut corresponds to one cluster.

Comparing linkage methods¶

Let's visualise how the four linkage types partition the same dataset.

linkage_methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(1, 4, figsize=(20, 4.5))

for ax, method in zip(axes, linkage_methods):
    Z = linkage(X_hier, method=method)
    labels = fcluster(Z, t=4, criterion='maxclust')
    scatter = ax.scatter(
        X_hier[:, 0], X_hier[:, 1],
        c=labels, cmap='viridis', s=30, alpha=0.7, edgecolors='k', linewidths=0.3
    )
    ax.set_title(f'{method.capitalize()} linkage')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.suptitle('Agglomerative Clustering — 4 Linkage Methods (k=4)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

# Scikit-learn's AgglomerativeClustering with Ward linkage
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
agg_labels = agg.fit_predict(X_hier)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(
    X_hier[:, 0], X_hier[:, 1],
    c=y_hier, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3
)
axes[0].set_title('Ground-Truth Labels')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

axes[1].scatter(
    X_hier[:, 0], X_hier[:, 1],
    c=agg_labels, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3
)
axes[1].set_title('AgglomerativeClustering (Ward, k=4)')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print(f"Cluster sizes: {np.bincount(agg_labels)}")

2. DBSCAN¶

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach to clustering:

It does not require the number of clusters in advance.
It defines clusters as dense regions separated by sparse regions.
Points that don't belong to any dense region are labelled as noise (label = -1).

Key parameters¶

Parameter	Meaning
`eps` (ε)	Maximum distance between two points for them to be considered neighbours
`min_samples`	Minimum number of points within ε-distance to form a dense region

Point types¶

Core point — has at least min_samples neighbours within ε.
Border point — within ε of a core point but doesn't have enough neighbours itself.
Noise point — neither core nor border; isolated outliers.

DBSCAN can discover clusters of arbitrary shape and naturally identifies outliers — something centroid-based methods like K-Means cannot do.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

np.random.seed(42)

# Generate two moons (non-convex dataset)
X_moons, y_moons = make_moons(n_samples=500, noise=0.08, random_state=42)

# Apply DBSCAN and K-Means
db_moons = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)
km_moons = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_moons)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='coolwarm', s=20, alpha=0.7)
axes[0].set_title('Ground Truth')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=km_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)
axes[1].scatter(km_moons.cluster_centers_[:, 0], km_moons.cluster_centers_[:, 1],
                marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)
axes[1].set_title('K-Means (k=2) — Fails on non-convex shapes')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=db_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)
axes[2].set_title('DBSCAN (eps=0.2) — Correctly separates crescents')
axes[2].set_xlabel('Feature 1')
axes[2].set_ylabel('Feature 2')

plt.suptitle('K-Means vs DBSCAN on the Moons Dataset', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

3. Choosing DBSCAN Parameters¶

Picking eps and min_samples can be tricky. A practical heuristic:

Set min_samples ≈ 2 × number of features (a reasonable default).
For each point compute the distance to its k-th nearest neighbour (k = min_samples).
Sort these distances and plot them — the k-distance graph.
Look for the "elbow" — the point where the curve bends sharply upward. The distance at that elbow is a good candidate for eps.

from sklearn.neighbors import NearestNeighbors

k = 5  # same as min_samples
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X_moons)
distances, _ = nn.kneighbors(X_moons)

k_distances = np.sort(distances[:, k - 1])[::-1]

plt.figure(figsize=(10, 5))
plt.plot(k_distances, linewidth=1.5)
plt.axhline(y=0.2, color='r', linestyle='--', label='eps = 0.2 (our choice)')
plt.title(f'k-Distance Graph (k={k}) — Elbow Indicates Good eps')
plt.xlabel('Points (sorted by descending k-distance)')
plt.ylabel(f'Distance to {k}-th Nearest Neighbour')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Effect of different eps values on DBSCAN results
eps_values = [0.05, 0.1, 0.2, 0.3, 0.5]
fig, axes = plt.subplots(1, len(eps_values), figsize=(22, 4))

for ax, eps in zip(axes, eps_values):
    db = DBSCAN(eps=eps, min_samples=5).fit(X_moons)
    labels = db.labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()

    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]

    for k_label, col in zip(sorted(unique_labels), colors):
        if k_label == -1:
            col = [0, 0, 0, 1]  # black for noise
        mask = labels == k_label
        ax.scatter(X_moons[mask, 0], X_moons[mask, 1], c=[col], s=15, alpha=0.7)

    ax.set_title(f'eps={eps}\n{n_clusters} clusters, {n_noise} noise')
    ax.set_xlabel('Feature 1')

axes[0].set_ylabel('Feature 2')
plt.suptitle('Effect of eps on DBSCAN (min_samples=5)', fontsize=14, y=1.05)
plt.tight_layout()
plt.show()

Observations: - eps too small (0.05) → most points classified as noise; many tiny clusters. - eps just right (0.2) → two clean crescent clusters with very little noise. - eps too large (0.5) → everything merges into a single cluster.

The k-distance graph helps you find that sweet spot without trial and error.

4. Gaussian Mixture Models¶

A Gaussian Mixture Model assumes that the data is generated from a mixture of a finite number of Gaussian (normal) distributions with unknown parameters.

GMM vs K-Means¶

Aspect	K-Means	GMM
Cluster assignment	Hard — each point belongs to exactly one cluster	Soft — each point has a probability for every cluster
Cluster shape	Spherical (Voronoi cells)	Elliptical (full covariance matrices)
Outlier handling	None — every point is assigned	Naturally down-weights low-probability points
Output	Cluster label	Probability vector over all clusters

GMMs are fit using the Expectation-Maximisation (EM) algorithm: 1. E-step — compute the probability that each point belongs to each Gaussian component. 2. M-step — update each component's mean, covariance, and weight to maximise log-likelihood. 3. Repeat until convergence.

from sklearn.mixture import GaussianMixture

np.random.seed(42)

# Create elongated / elliptical clusters that challenge K-Means
n_per_cluster = 200
cov1 = [[2.0, 1.5], [1.5, 1.5]]
cov2 = [[1.5, -1.2], [-1.2, 1.5]]
cov3 = [[0.5, 0.0], [0.0, 2.5]]

cluster1 = np.random.multivariate_normal([0, 0], cov1, n_per_cluster)
cluster2 = np.random.multivariate_normal([5, 5], cov2, n_per_cluster)
cluster3 = np.random.multivariate_normal([8, 0], cov3, n_per_cluster)

X_gmm = np.vstack([cluster1, cluster2, cluster3])
y_gmm_true = np.array([0]*n_per_cluster + [1]*n_per_cluster + [2]*n_per_cluster)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Ground truth
axes[0].scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_gmm_true, cmap='tab10', s=15, alpha=0.6)
axes[0].set_title('Ground Truth (Elliptical Clusters)')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# K-Means
km_gmm = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_gmm)
axes[1].scatter(X_gmm[:, 0], X_gmm[:, 1], c=km_gmm.labels_, cmap='tab10', s=15, alpha=0.6)
axes[1].scatter(km_gmm.cluster_centers_[:, 0], km_gmm.cluster_centers_[:, 1],
                marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)
axes[1].set_title('K-Means (k=3) — Spherical assumption')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

# GMM
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X_gmm)
gmm_labels = gmm.predict(X_gmm)
axes[2].scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=15, alpha=0.6)
axes[2].set_title('GMM (3 components) — Elliptical fit')
axes[2].set_xlabel('Feature 1')
axes[2].set_ylabel('Feature 2')

plt.suptitle('K-Means vs GMM on Elliptical Clusters', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

# Visualise GMM probability contours
x_min, x_max = X_gmm[:, 0].min() - 2, X_gmm[:, 0].max() + 2
y_min, y_max = X_gmm[:, 1].min() - 2, X_gmm[:, 1].max() + 2
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
grid_points = np.column_stack([xx.ravel(), yy.ravel()])

log_prob = gmm.score_samples(grid_points)
log_prob = log_prob.reshape(xx.shape)

fig, ax = plt.subplots(figsize=(10, 7))
ax.contourf(xx, yy, np.exp(log_prob), levels=30, cmap='YlOrRd', alpha=0.6)
ax.contour(xx, yy, np.exp(log_prob), levels=10, colors='darkred', linewidths=0.5, alpha=0.5)
ax.scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=10, alpha=0.7,
           edgecolors='k', linewidths=0.2)

for i in range(gmm.n_components):
    ax.scatter(gmm.means_[i, 0], gmm.means_[i, 1],
               marker='+', s=300, c='black', linewidths=3)

ax.set_title('GMM Probability Density Contours')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()

Model selection with BIC and AIC¶

How many Gaussian components should we use? We can use information criteria:

BIC (Bayesian Information Criterion) — penalises model complexity more heavily.
AIC (Akaike Information Criterion) — lighter penalty.

Lower is better for both. We fit GMMs with different numbers of components and pick the one with the lowest BIC (or AIC).

n_components_range = range(1, 10)
bic_scores = []
aic_scores = []

for n in n_components_range:
    gmm_test = GaussianMixture(n_components=n, covariance_type='full', random_state=42)
    gmm_test.fit(X_gmm)
    bic_scores.append(gmm_test.bic(X_gmm))
    aic_scores.append(gmm_test.aic(X_gmm))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(list(n_components_range), bic_scores, 'bo-', label='BIC', linewidth=2)
ax.plot(list(n_components_range), aic_scores, 'rs--', label='AIC', linewidth=2)
ax.axvline(x=3, color='green', linestyle=':', alpha=0.7, label='True number of components (3)')
ax.set_xlabel('Number of Components')
ax.set_ylabel('Score (lower is better)')
ax.set_title('GMM Model Selection: BIC and AIC')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Best BIC at n_components = {np.argmin(bic_scores) + 1}")
print(f"Best AIC at n_components = {np.argmin(aic_scores) + 1}")

5. Algorithm Comparison¶

Let's put all four algorithms head-to-head on three different data geometries:

Blobs — well-separated spherical clusters
Moons — two interleaving crescents
Varied-variance blobs — spherical clusters with very different densities

from sklearn.preprocessing import StandardScaler

np.random.seed(42)

n_samples = 500

# Dataset 1: standard blobs
X_blobs, y_blobs = make_blobs(n_samples=n_samples, centers=3, cluster_std=1.0, random_state=42)

# Dataset 2: moons
X_moons2, y_moons2 = make_moons(n_samples=n_samples, noise=0.07, random_state=42)

# Dataset 3: varied-variance blobs
X_varied, y_varied = make_blobs(
    n_samples=n_samples, centers=3, cluster_std=[0.5, 2.5, 1.0], random_state=42
)

datasets = [
    ('Blobs', X_blobs, {'n_clusters': 3, 'eps': 1.0}),
    ('Moons', X_moons2, {'n_clusters': 2, 'eps': 0.2}),
    ('Varied', X_varied, {'n_clusters': 3, 'eps': 1.5}),
]

fig, axes = plt.subplots(3, 4, figsize=(22, 15))

for row, (name, X, params) in enumerate(datasets):
    X_scaled = StandardScaler().fit_transform(X)
    n_c = params['n_clusters']
    eps = params['eps']

    # K-Means
    km = KMeans(n_clusters=n_c, random_state=42, n_init=10).fit(X_scaled)
    # Agglomerative
    agg = AgglomerativeClustering(n_clusters=n_c, linkage='ward').fit(X_scaled)
    # DBSCAN
    db = DBSCAN(eps=eps, min_samples=5).fit(X_scaled)
    # GMM
    gm = GaussianMixture(n_components=n_c, random_state=42).fit(X_scaled)

    results = [
        ('K-Means', km.labels_),
        ('Agglomerative', agg.labels_),
        ('DBSCAN', db.labels_),
        ('GMM', gm.predict(X_scaled)),
    ]

    for col, (algo_name, labels) in enumerate(results):
        ax = axes[row, col]
        unique_labels = set(labels)
        n_clust = len(unique_labels) - (1 if -1 in unique_labels else 0)

        noise_mask = labels == -1
        ax.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1],
                   c=labels[~noise_mask], cmap='viridis', s=12, alpha=0.7)
        if noise_mask.any():
            ax.scatter(X_scaled[noise_mask, 0], X_scaled[noise_mask, 1],
                       c='red', marker='x', s=15, alpha=0.5, label='noise')
            ax.legend(fontsize=8)

        if row == 0:
            ax.set_title(algo_name, fontsize=13, fontweight='bold')
        ax.set_ylabel(f'{name}' if col == 0 else '', fontsize=12)
        ax.text(0.02, 0.98, f'{n_clust} cluster(s)',
                transform=ax.transAxes, fontsize=9, va='top',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))

plt.suptitle('Algorithm Comparison Across Data Geometries', fontsize=16, y=1.01)
plt.tight_layout()
plt.show()

Summary¶

When to use each algorithm¶

Algorithm	Best for	Weaknesses	Must specify k?
K-Means	Large datasets with spherical clusters	Cannot handle non-convex shapes; sensitive to outliers	Yes
Agglomerative Clustering	Small-to-medium datasets; exploring hierarchy	O(n³) time complexity; hard to scale	Yes (or cut dendrogram)
DBSCAN	Arbitrary shapes; datasets with noise/outliers	Sensitive to `eps`; struggles with varying densities	No
Gaussian Mixture Model	Elliptical clusters; need soft assignments	Assumes Gaussian components; sensitive to initialisation	Yes

Rules of thumb¶

Start simple: try K-Means first. If results look poor, consider the data geometry.
Non-convex shapes? → Use DBSCAN.
Elliptical or overlapping clusters? → Use GMM.
Need a hierarchy or dendrogram? → Use Agglomerative Clustering.
Noisy data with outliers? → DBSCAN naturally handles noise.
Need probability estimates? → GMM provides soft assignments.

Generated by Berta AI | Created by Luigi Pascal Rondanini