Problem

The same algorithm in two languages can have wildly different real-world cost. K-means is a small, well-defined kernel that exercises memory layout, vectorization, and parallelism — a clean target for a head-to-head.

Approach

  • Python side: hand-rolled NumPy implementation (to control the assignment + update loops directly) plus scikit-learn as the industrial baseline.
  • Rust side: from-scratch implementation over a Vec<DataPoint> array-of-structs layout, with Rayon for optional data-parallel assignment/update (a flat-matrix, BLAS-backed ndarray rewrite is the obvious next step for large n).
  • Iso-algorithm design: every engine uses the same k-means++ initialization, and scikit-learn runs at n_init=1 — so any difference is the engine, not the seeding.
  • All run on the same synthetic Gaussian-blob datasets across 9 sizes (1k–256k), 3 feature counts, and 2 cluster budgets, with 3 repeats per cell.
  • Metric: end-to-end CLI wall time (load CSV → fit k = 1..K → write CSV), sampled peak RSS, and ARI/NMI against ground-truth labels.

Headline results (2026 rerun)

  • Hand-rolled Rust is a median ~4.5x faster than pure-Python NumPy and ~11x leaner in peak memory — biggest at small/medium n, and the memory win never erodes with scale.
  • scikit-learn is slower than NumPy until ~128k and overtakes Rust only where both n and k are large (the big-GEMM corner) — its BLAS, not a better algorithm.
  • Rayon adds at most ~1.3x over serial Rust, and over-subscribing all cores hurts at small n.
  • With a matched k-means++ init, every engine essentially ties on accuracy (median ARI 1.0) — the earlier quality gap was an initialization gap, not an algorithm gap.