Problem
The same algorithm in two languages can have wildly different real-world cost. K-means is a small, well-defined kernel that exercises memory layout, vectorization, and parallelism — a clean target for a head-to-head.
Approach
- Python side: hand-rolled NumPy implementation (to control the assignment + update loops directly) plus scikit-learn as the industrial baseline.
- Rust side: from-scratch implementation over a
Vec<DataPoint>array-of-structs layout, with Rayon for optional data-parallel assignment/update (a flat-matrix, BLAS-backedndarrayrewrite is the obvious next step for large n). - Iso-algorithm design: every engine uses the same k-means++ initialization, and scikit-learn runs at
n_init=1— so any difference is the engine, not the seeding. - All run on the same synthetic Gaussian-blob datasets across 9 sizes (1k–256k), 3 feature counts, and 2 cluster budgets, with 3 repeats per cell.
- Metric: end-to-end CLI wall time (load CSV → fit k = 1..K → write CSV), sampled peak RSS, and ARI/NMI against ground-truth labels.
Headline results (2026 rerun)
- Hand-rolled Rust is a median ~4.5x faster than pure-Python NumPy and ~11x leaner in peak memory — biggest at small/medium n, and the memory win never erodes with scale.
- scikit-learn is slower than NumPy until ~128k and overtakes Rust only where both n and k are large (the big-GEMM corner) — its BLAS, not a better algorithm.
- Rayon adds at most ~1.3x over serial Rust, and over-subscribing all cores hurts at small n.
- With a matched k-means++ init, every engine essentially ties on accuracy (median ARI 1.0) — the earlier quality gap was an initialization gap, not an algorithm gap.
Links
- Repo: github.com/nilesh-patil/bench-kmeans-rust
- Blog post: Where the 15x went: benchmarking a Rust k-means rewrite — the full write-up, figures, and the crossover story.
- Related: Distributed K-Means Clustering in Python