Nilesh Patil

Adaptive OpenMP scheduling in SimuCell3D for tissue mechanics simulations

2026-04-16T19:30:00+05:30

Biology often looks like chemistry, but many of its hardest questions are mechanical: how a sheet of cells bends into a tube, how a spheroid opens into a fluid-filled vesicle, or how a growing tumor pushes against the tissue around it. These processes depend not just on genes and signaling, but on force, geometry, material properties, and the way cells physically interact. Cell-based simulations give us a way to study those rules directly, but detailed 3D models have usually forced a tradeoff: simplify the shape, or simulate only a small number of cells.

SimuCell3D is an open-source C++ engine built to push past that limitation. It models each cell as a triangulated mesh and simulates tissue growth and deformation at subcellular resolution, including proliferation, extracellular matrix, fluid cavities, nuclei, and the uneven mechanical properties of polarized epithelia. It can work with spheroids, vesicles, sheets, tubes, and more irregular geometries imported from microscopy images, making it useful for inferring biomechanical parameters from realistic tissue structures.

The ETH Iber lab released v1.0 in 2024. This post is a field report from the fork I’ve been building on top of it. Apart from a few bug fixes in the original code, I focus on a practical performance bottleneck: load imbalance in the simulator’s OpenMP-parallel force and contact loops. I’ll walk through where the imbalance came from, how I measured it, and what it took to make the scheduler adapt so large, high-detail 3D tissue simulations become more tractable.

On the surface, the Static (= v1.0) binary looked like it was using all of the CPU. top showed all eight cores pegged - utilization near 800%, the picture of a machine working hard. A tracing profiler told a different story: on the worst steps, close to a third of that “busy” time was threads idling at a barrier while one of them finished an oversized slice.

That gap is invisible to top because it reports aggregate CPU time: 800% across eight cores looks identical whether one thread sprints while seven idle or all eight share the load evenly. It simply can’t see the imbalance. A per-thread view can. If you train models, you already know this failure by a different name: a data-parallel step where one worker drew all the long sequences and the rest sit idle at the all-reduce barrier. Same physics, different address space.

TL;DR. Static OpenMP scheduling handed equal-sized index ranges to every thread regardless of per-cell cost - so whichever thread drew the heavy cells kept them. I built a workload estimator, wired it to per-loop schedule selection, and let it adapt as the tissue grows. Result: ~2.0–2.5× speedups at matched cell counts across the best-supported range (50–250 cells), with consistent wins to ~2.3–2.4× through 1,000 - 10,000 cells on thinner data, and a suggestive ~3× at the largest sizes. SimuCell3D is a C++ 3D tissue-mechanics simulator developed by the Iber lab at ETH Zürich.

The threads were mostly standing around - waiting to pick up computational work

In SimuCell3D the bulk of the work is computing forces, contacts, and time steps for cells in a 3D tissue. I ran the Static binary under a tracing profiler and measured. During contact detection - the phase that dominates the runtime - a measurable fraction of each parallel region was lost to threads waiting at the barrier while a straggler finished an oversized share. The simulation always completed. The imbalance just silently ate throughput as the complexity of the tissue increased.

Measured thread_imbalance_pct across the benchmark run, Static vs Adaptive, plotted against tissue size (log scale). Adaptive sits consistently below Static across the whole range - Static: mean 14.6%, max 31.1%, median 16%; Adaptive: mean 10.7%, max 16%, median 13%. The v1 outliers reaching 31% are the straggler events that static scheduling manufactures by handing all the heavy cells to one thread.

The bottleneck was in the OpenMP directives. Static mode leans entirely on fixed scheduling: three bare #pragma omp parallel for loops in solver.cpp with no schedule() clause (on GCC/Clang, the default is static with equal-sized contiguous chunks), an explicit schedule(static) in time_integration.cpp, and another bare parallel loop in the contact model. Static scheduling slices the cells into equal index ranges up front - thread 0 takes cells 0–15, thread 1 takes 16–31, and so on - and never rebalances, which is only optimal if every cell costs the same. For a tissue simulation that assumption fails on the second cell.

The reason is geometric: each cell is a triangulated mesh, and meshes drift apart in cost. A freshly divided child has fewer faces than its parent; a growing cell has more; a cell wedged in heavy contact spends far longer in its per-face loops than an isolated one. Static scheduling hands out cells by index, not by cost - so whichever thread happened to draw the heavy cells on step one keeps drawing them for the rest of the run.
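To make the straggler effect concrete, here is a small Python sketch (not code from SimuCell3D) that simulates both policies on the 128-cell, 8-thread split described above. The cost values, the function names, and the imbalance metric are illustrative assumptions; in particular, the metric below may not match how thread_imbalance_pct is computed in the fork.

```python
import heapq

def static_schedule(costs, num_threads):
    """Per-thread busy time when cells are split into contiguous, equal-sized index ranges."""
    per_thread = -(-len(costs) // num_threads)  # ceiling division
    return [sum(costs[t * per_thread:(t + 1) * per_thread]) for t in range(num_threads)]

def dynamic_schedule(costs, num_threads, chunk=1):
    """Per-thread busy time when the next free thread pulls the next chunk from a shared queue."""
    heap = [(0.0, t) for t in range(num_threads)]  # (time the thread becomes free, thread id)
    heapq.heapify(heap)
    busy = [0.0] * num_threads
    for start in range(0, len(costs), chunk):
        free_at, t = heapq.heappop(heap)
        work = sum(costs[start:start + chunk])
        busy[t] += work
        heapq.heappush(heap, (free_at + work, t))
    return busy

def imbalance_pct(busy):
    """Average share of the parallel region spent waiting at the barrier (illustrative metric)."""
    wall = max(busy)
    return 100.0 * (wall - sum(busy) / len(busy)) / wall

# 128 cells on 8 threads; the first 16 cells are 3x heavier (say, in dense contact).
costs = [3.0] * 16 + [1.0] * 112
static_busy = static_schedule(costs, 8)
dynamic_busy = dynamic_schedule(costs, 8)
print(f"static : wall={max(static_busy):.0f}, imbalance={imbalance_pct(static_busy):.1f}%")
print(f"dynamic: wall={max(dynamic_busy):.0f}, imbalance={imbalance_pct(dynamic_busy):.1f}%")
```

On this toy input, static leaves thread 0 with 48 units of work against 16 for every other thread, while the dynamic queue evens out at 20 units per thread. The simulation ignores the per-grab overhead that dynamic scheduling pays in practice.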

A single number for “how uneven is it?”

Before changing anything, we need a number to optimize - a fast signal that tells the scheduler “right now the work is lumpy, switch to a finer-grained strategy.”

On this experimental workload, adaptive-mode CoV peaked at 0.16 and never reached the 0.4 or 0.6 chunk-band thresholds built into the scheduler. The chunk divisor stayed at 4 - the coarsest setting - for the entire experiment. The high-CoV dynamic-chunking regime the system is designed for was never exercised on this run. Read what follows as an honest baseline for understanding what the CoV machinery does and why.

Coefficient of variation ( CoV = σ/μ ) measures how spread out a distribution is relative to its mean. When every cell costs the same, CoV is zero and static scheduling is optimal. As some cells grow much heavier than others, CoV climbs and dynamic scheduling starts to win. The catch: we can’t measure CoV by running the loop - that’s the work we’re trying to schedule. We need to estimate each cell’s cost before the loop starts. The estimator in src/solver.cpp is a weighted sum of structural features per cell: a base_cost proportional to face count; a contact_cost scaled by a compile-time constant for the active contact model (0.25 for node–face springs, 0.28 for node–node coupling, 0.32 for face–face coupling, selected by a #if CONTACT_MODEL_INDEX switch so exactly one is live in a given binary); an integration_cost of 0.65 × base_cost for dynamic cells; plus smaller terms for polarization, growth, and mesh quality.

The coefficients were fit by hand against measured per-cell timings and rounded to two decimals. None of this is precise, and it doesn’t need to be at this point. The accurate alternative - profiling every cell’s real cost each step - would cost more than the imbalance it’s trying to remove. A cheap structural proxy wins as long as it stays roughly monotonic with real cost and runs fast. Both hold: the estimator is O(N_cells) and returns a single float.
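A minimal Python sketch of the idea, assuming a simplified feature set: the contact weights and the 0.65 integration factor are the values quoted above, but the choice of num_contacts as the quantity the contact weight multiplies, the function names, and the sample cells are my assumptions for illustration. The smaller polarization, growth, and mesh-quality terms are left out.

```python
from statistics import mean, pstdev

# Contact-cost weights quoted in the post; exactly one is compiled into a given binary.
CONTACT_WEIGHT = {"node_face": 0.25, "node_node": 0.28, "face_face": 0.32}

def estimate_cell_cost(num_faces, num_contacts, is_dynamic, contact_model="node_face"):
    """Cheap structural proxy for one cell's per-step cost."""
    base_cost = float(num_faces)  # proportional to face count
    # Assumption: the contact term scales with the number of contacts on the cell.
    contact_cost = CONTACT_WEIGHT[contact_model] * num_contacts
    integration_cost = 0.65 * base_cost if is_dynamic else 0.0
    return base_cost + contact_cost + integration_cost

def workload_cov(costs):
    """Coefficient of variation sigma / mu; 0.0 for an empty or all-zero workload."""
    mu = mean(costs) if costs else 0.0
    return pstdev(costs) / mu if mu > 0 else 0.0

# Hypothetical tissue: (faces, contacts, is_dynamic) per cell.
cells = [(500, 40, True), (520, 60, True), (260, 10, True), (900, 300, True)]
costs = [estimate_cell_cost(*c) for c in cells]
print(f"estimated CoV = {workload_cov(costs):.2f}")
```

The whole estimate is one pass over the cells plus a mean and a standard deviation, which is what keeps it cheap enough to run before every scheduling decision.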

Measured workload CoV vs tissue size across experimental run. Dashed lines mark the 0.4 and 0.6 chunk-size-band thresholds built into calculate_optimal_chunk_size(). Adaptive-mode CoV: mean 0.11, median 0.13, max 0.16. v1 CoV: mean 0.15, median 0.16, max 0.31. Neither mode reached either band. On this workload, the adaptive chunk divisor stayed at 4 (coarsest setting) throughout the entire run.

Three scheduling modes, chosen by how lumpy the work is

static hands each thread a fixed pile up front (fast, but one thread can get stuck with all the slow cells); dynamic gives everyone a shared queue they pull from as they finish (no idle threads, but a small per-grab cost); guided is the middle ground - big grabs first, shrinking toward the tail. If you’ve ever tuned dynamic batching or a work-stealing pool, this is the same trade-off: granularity versus coordination overhead.

The function calculate_optimal_chunk_size() turns the CoV directly into a choice of granularity via a divisor: 4 for CoV ≤ 0.4 (mild imbalance, coarse chunks), 10 for 0.4 < CoV ≤ 0.6 (moderate), 20 for CoV > 0.6 (high imbalance, fine chunks). Then chunk = max(1, min(num_cells / (num_threads × divisor), 100)):

// CoV → chunk granularity (src/solver.cpp, paraphrased)
int divisor = cov <= 0.4 ? 4      // mild imbalance → coarse chunks
            : cov <= 0.6 ? 10     // moderate
            :              20;    // high imbalance → fine chunks
// below 100 cells, always uses divisor=4 (hardcoded fast path)
int chunk = std::max(1, std::min(num_cells / (num_threads * divisor), 100));
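As a quick check of the arithmetic, here is a Python transcription of the paraphrased formula, evaluated for a few cases. The thread count of 8 matches the eight-core machine mentioned earlier; the function name and the sample inputs are mine, and the sub-100-cell fast path is modeled simply as forcing divisor 4.

```python
def calculate_chunk_size(num_cells, num_threads, cov):
    """Python transcription of the paraphrased CoV -> chunk-size rule."""
    if num_cells < 100 or cov <= 0.4:
        divisor = 4       # fast path below 100 cells, or mild imbalance: coarse chunks
    elif cov <= 0.6:
        divisor = 10      # moderate imbalance
    else:
        divisor = 20      # high imbalance: fine chunks
    # Integer division mirrors the int arithmetic in the C++ snippet.
    return max(1, min(num_cells // (num_threads * divisor), 100))

for cells, cov in [(50, 0.16), (1000, 0.16), (1000, 0.5), (1000, 0.7), (6091, 0.16)]:
    print(cells, cov, calculate_chunk_size(cells, 8, cov))
# 50 cells,    CoV 0.16 -> 50 // 32    = 1
# 1,000 cells, CoV 0.16 -> 1000 // 32  = 31
# 1,000 cells, CoV 0.5  -> 1000 // 80  = 12
# 1,000 cells, CoV 0.7  -> 1000 // 160 = 6
# 6,091 cells, CoV 0.16 -> 6091 // 32  = 190, capped at 100
```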

The chunk-size formula is only half of mode selection. The actual path the code takes in adaptive mode is a two-component process:

First, lookup_benchmark_mode (src/solver.cpp:218–245) consults a 13-entry static table keyed on cell-count range alone - not on thread count or CoV - mapping each range to a fixed schedule mode. The lookup table entries carry hardcoded rationale strings from earlier profiling.
Second, multi_factor_heuristic (src/solver.cpp:250–314) runs independently, using CoV as its primary input: CoV > 0.6 → dynamic; CoV < 0.15 and tasks_per_thread ≥ 4 → static; 512 ≤ cells ≤ 4096 at moderate CoV → guided.

When the two paths disagree, the benchmark table always wins. The heuristic’s suggestion is logged in verbose output but never applied (src/solver.cpp:725–740). You build a multi-factor heuristic and then hardcode it to lose to a lookup table - the reasoning is that observed speedup data is more trustworthy than a computed estimate. It’s an unusual design choice, and it’s worth knowing.

The less obvious insight is that different loop categories have different workload shapes, so they should not share one schedule. initialize_per_loop_schedules() (src/solver.cpp:884–909) sets four fixed structural assignments:

Contact detection → omp_sched_dynamic, chunk = max(1, base_chunk). Most irregular; costs vary with mesh density and contact geometry.
Time integration → omp_sched_guided, chunk = max(1, base_chunk × 2). More uniform per-cell cost; guided’s shrinking-grab schedule captures most benefit without dynamic’s per-grab overhead.
Mesh updates (face-type classification) → omp_sched_static, chunk = 0 (equal distribution). The one loop category where cost per cell genuinely is uniform - static maximises cache locality here.
Cell division → omp_sched_dynamic, chunk = max(1, base_chunk / 2). Rarest and most variable phase; finer-grained dynamic avoids stragglers on the few iterations it fires.
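The four assignments above can be written out as a small lookup. This is a Python restatement for readability, not the C++ in initialize_per_loop_schedules(); the dictionary keys are names I made up, and base_chunk is assumed to come from the chunk-size formula.

```python
def per_loop_schedules(base_chunk):
    """Map each loop category to (schedule kind, chunk size), following the list above."""
    return {
        "contact_detection": ("dynamic", max(1, base_chunk)),       # most irregular
        "time_integration":  ("guided",  max(1, base_chunk * 2)),   # more uniform
        "mesh_update":       ("static",  0),                        # 0 = equal distribution
        "cell_division":     ("dynamic", max(1, base_chunk // 2)),  # rare, highly variable
    }

# With base_chunk = 31 (1,000 cells, 8 threads, divisor 4):
for loop, (kind, chunk) in per_loop_schedules(31).items():
    print(f"{loop:18s} {kind:8s} chunk={chunk}")
```

The max(1, ...) guards matter at small tissue sizes: with base_chunk = 1 the cell-division loop would otherwise ask for a chunk of 0.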

Note: 8 of 10 #pragma omp parallel for loops in src/ carry schedule(runtime), enabling this late-binding approach. Two bare loops without a schedule clause remain in cell_divider.cpp:27 and poisson_sampling.cpp:142; these are not yet wired into the adaptive machinery.

The last piece is adaptation over time. Every 50 iterations (COV_UPDATE_INTERVAL = 50), two functions run back-to-back (src/solver.cpp:1083 and 1086). update_workload_heterogeneity() handles CoV recalculation for non-adaptive modes, gated by a combined condition at line 1153: if CoV has changed by more than 20% and the mode is not adaptive, it then checks whether the implied chunk size would also shift by more than 20% (line 1161) before calling omp_set_schedule(). In adaptive mode that outer gate skips both operations entirely. adaptive_schedule_update() (src/solver.cpp:1211–1296) handles all schedule changes in adaptive mode: it is called every 50 iterations, re-evaluates the simulation phase, and only if the phase has changed does it call omp_set_schedule() and reset recent_division_count_.

A three-phase design, two phases exercised

Across the 41.3-hour run and all three committed benchmark runs, grep GROWTH over all five performance-diagnostics logs returns zero matches. GROWTH never fired.

The scheduler tracks three simulation phases:

INITIALIZATION (fewer than 10 cells): dynamic. Thread count exceeds task count; any fixed assignment starves threads.
GROWTH (division_rate > 0.01, where division_rate = recent_division_count_ / (num_cells × 50)): dynamic if CoV > 0.4, else guided. Cell counts and costs are changing fast.
HOMEOSTASIS (all other cases): static if cell count > 1,000, else guided. The tissue has settled; costs are stable enough that static’s cache locality pays off at scale.
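The three rules above fit in a few lines. The sketch below is a Python restatement under stated assumptions: the function names are mine, and I assume the 50 in the division-rate formula is the same 50-iteration window as COV_UPDATE_INTERVAL.

```python
WINDOW = 50  # assumption: same 50-iteration window as COV_UPDATE_INTERVAL

def detect_phase(num_cells, recent_division_count):
    """Classify the simulation phase from cell count and recent divisions."""
    if num_cells < 10:
        return "INITIALIZATION"
    division_rate = recent_division_count / (num_cells * WINDOW)
    return "GROWTH" if division_rate > 0.01 else "HOMEOSTASIS"

def schedule_for_phase(phase, num_cells, cov):
    """Pick the OpenMP schedule kind for a phase, following the list above."""
    if phase == "INITIALIZATION":
        return "dynamic"
    if phase == "GROWTH":
        return "dynamic" if cov > 0.4 else "guided"
    return "static" if num_cells > 1000 else "guided"

for cells, divisions in [(5, 1), (1000, 500), (1000, 501), (6091, 20)]:
    phase = detect_phase(cells, divisions)
    print(cells, divisions, phase, schedule_for_phase(phase, cells, cov=0.16))
```

Under these assumptions, a 1,000-cell tissue needs more than 0.01 × 1,000 × 50 = 500 counted divisions to enter GROWTH, which gives a feel for how demanding the trigger is.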

Detected simulation phase across the full 41.3h run (adaptive mode). INITIALIZATION fires for the first ~10,600 iterations while the tissue has fewer than 10 cells. Then the tissue transitions directly to HOMEOSTASIS and stays there for the remainder - growing from 10 to 6,091 cells. GROWTH never fired.

This is an honest observation about the workload, not a design flaw. The growth-from-1-cell scenario using parameters_paper_exact.xml divides slowly and steadily enough that recent_division_count_ never exceeded 0.01 × num_cells × 50 - the GROWTH trigger threshold - so the phase stayed in HOMEOSTASIS throughout. A faster-dividing parameter set - or a scenario that starts from a small fixed tissue and forces rapid expansion - would exercise GROWTH. On this workload, the adaptive scheduler spent 105 samples in INITIALIZATION and 348 samples in HOMEOSTASIS, never touching the phase the GROWTH branch was written for.

The numbers

At run’s end adaptive shows 0.030 IPS versus v1’s 0.044 IPS - adaptive looks slower. It’s not: it’s managing 2.66× more cells (6,091 vs 2,288). The right comparison is at matched cell counts.

Iterations per second vs cell count (log-log), from the 41.3h benchmark run. Adaptive sits above v1 at every matched cell count. v1 data terminates at ~2,288 cells; adaptive continues to 6,091. The apparent “slower” IPS at run’s end for adaptive is because it’s managing 2.66× more cells - at matched cell counts, adaptive is consistently faster.

At matched cell counts the picture is clear: adaptive is faster across the board, and the gap widens as the tissue grows. Speedups:

1.43× at 10 cells;
2.07× at 50 cells;
Climbing to ~2.3–2.5× through the 100–500 cell range. The 50–250 cell band is best supported statistically (30, 35, and 22 v1 samples respectively).
At 500 cells there are only 6 v1 samples; at 1,000 cells there are 5; and the 3.05× at 2,000 cells rests on 2 v1 samples, so treat that as suggestive rather than firm.

Adaptive is solidly ~2.0–2.5× faster across the range where the data is dense, with a plausible upward trend at the largest sizes that needs more data to pin down.

The scaling exponent (time per iteration ~ N^α) tells a related story. Adaptive: α = 1.136 (R² = 0.999). v1: α = 1.213 (R² = 0.998). Lower is closer to linear; adaptive’s ~6% better exponent means the gap widens gradually as the tissue grows, which matches what the throughput curve shows.
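A short worked calculation shows what the exponent difference implies on its own. If time per iteration scales as N^α in both modes, the ratio t_v1 / t_adaptive grows as N^(1.213 − 1.136) = N^0.077. The helper name below is mine, and the calculation uses only the two fitted exponents, not the raw timings.

```python
ALPHA_ADAPTIVE = 1.136
ALPHA_V1 = 1.213

def relative_gap_growth(n_from, n_to):
    """Factor by which t_v1 / t_adaptive grows between two cell counts, from the exponents alone."""
    return (n_to / n_from) ** (ALPHA_V1 - ALPHA_ADAPTIVE)

growth = relative_gap_growth(50, 2000)
print(f"gap grows by a factor of {growth:.2f} from 50 to 2,000 cells")  # about 1.33
print(f"implied speedup at 2,000 cells: {2.07 * growth:.2f}x")          # about 2.75x
```

Starting from the 2.07× measured at 50 cells, the fits alone imply roughly 2.75× at 2,000 cells. That points the same way as the 3.05× measured there, though that measurement rests on only two v1 samples.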

The biological signal is essentially a non-event, which is what you want: median pressure deviation between the two modes is 1.71% across 823 matched iterations. The adaptive scheduler changes how threads pick up work - not what the physics computes.

Where the runtime actually goes

To understand why adaptive is faster, it helps to look at where the time goes.

Phase-time fractions (cells > 100) from the 41.3h run. v1: contact detection 81.6%, polarization + internal forces 14.1%, time integration 2.6%, mesh refinement 1.8%. Adaptive: contact detection 52.5%, polarization + internal forces 35.7%, time integration 7.4%, mesh refinement 4.4%. Contact detection drops from 82% to 52% of mean iteration time - by far the biggest shift. The second-largest phase is polarization and internal forces, not time integration.

Contact detection dominates v1 at 81.6% of iteration time. In adaptive mode it drops to 52.5% - not just because of scheduling, but because the contact detection improvements (USPG rewrite, Morton sorting, SAP switching) run alongside the scheduler. The second-most-expensive phase is polarization and internal forces (14.1% → 35.7%), which becomes more visible in adaptive mode precisely because contact detection has gotten faster. Time integration accounts for only 2.6–7.4% of iteration time - a minor contributor, not a dominant phase.

This is worth stating plainly for causal clarity: the load-imbalance improvement (mean thread_imbalance_pct from 14.6% to 10.7%) is real and consistent across the full run. But the biggest lever in the speedup numbers is the contact-detection work - faster algorithms plus better scheduling of an inherently irregular phase. I don’t have a clean ablation between the USPG/Morton changes and the scheduler; the 41.3h run exercised both together. Turning off Morton sorting and re-running is the measurement this section is missing.

Three changes that compounded the gains

Faster contact detection:

Contact detection is both the most irregular phase and the most expensive, so speeding it up multiplies with the scheduling win. Its spatial-lookup containers (an unbounded uniform grid, “USPG”) switched from std::forward_list to std::vector - better cache behaviour and fewer pointer chases. Morton sorting of faces before USPG insertion was added separately: faces near each other in space land near each other in memory, and the cache stops thrashing. Above 500 cells, a different broad-phase algorithm - Sweep-and-Prune - switches on automatically (ADAPTIVE_SAP_CELL_THRESHOLD = 500). SAP projects objects onto axes and finds overlapping intervals; it scales better than a uniform grid on sparse, large-N scenes and produces exactly the same candidate pairs - it is not a coarser approximation. The contact_detection_algorithm XML parameter accepts uspg, sweep_and_prune, or adaptive; the default is uspg unless explicitly set.
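For readers who haven’t met Morton (Z-order) sorting, here is a minimal Python sketch of the general technique: quantize each point to a grid, interleave the bits of its three integer coordinates into one key, and sort by that key. The 10-bit quantization, the function names, and the use of plain points instead of face centroids are my assumptions; the fork’s C++ implementation may differ in all of these details.

```python
def part1by2(x):
    """Spread the low 10 bits of x so that two zero bits separate each original bit."""
    x &= 0x000003FF
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton3d(ix, iy, iz):
    """Interleave three 10-bit grid coordinates into one 30-bit Morton key."""
    return part1by2(ix) | (part1by2(iy) << 1) | (part1by2(iz) << 2)

def morton_order(points, lo, hi):
    """Indices of points sorted along a Z-order curve over the bounding box [lo, hi]."""
    def grid(v, a, b):
        # Quantize to 0..1023, clamping points that sit on or outside the box.
        return min(1023, max(0, int((v - a) / (b - a) * 1024)))
    keys = [morton3d(*(grid(p[i], lo[i], hi[i]) for i in range(3))) for p in points]
    return sorted(range(len(points)), key=keys.__getitem__)

points = [(0.9, 0.9, 0.9), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.95, 0.9, 0.9)]
order = morton_order(points, lo=(0.0, 0.0, 0.0), hi=(1.0, 1.0, 1.0))
print(order)  # [1, 2, 0, 3]: the two points near the origin come first, then the two near (1, 1, 1)
```

Storing faces in this order means that spatial neighbours tend to be memory neighbours, which is the cache effect described above.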

Better CI and memory-safety tooling:

v1.0’s CI ran one Release build and ctest -C Release. The fork adds Debug builds, AddressSanitizer and UndefinedBehaviorSanitizer, and a clang-format check (commit a2ca28e). Latency profiling was added separately in commit b0aac1a (2026-02-15). The sanitizers earned their place immediately: they caught a heap-use-after-free in local_mesh_refiner::split_edge - a reference left dangling after a vector reallocation - that had survived careful manual review (commit d5e2112).

A correctness sweep:

Alongside the performance work: a division-by-zero guard in mat33::inverse, NaN suppression in vec3::angle, a fix for a cell_lst.size() data race in the parallel cell-division loop (cell_divider.cpp), and null checks in parameter_reader. The parallel division bug is the kind that only shows up under the heavier thread utilization the new schedules produce, which is why it mattered to fix it before trusting any benchmark.

What’s still open

A few honest gaps remain:

Thread Sanitizer isn’t in CI yet. Only ASan and UBSan run. The cell-division parallel section still produces enough benign-looking races that TSan is noisy, and that noise needs triaging before it can gate the build.
The Python wrapper hasn’t caught up. The new --schedule= and --diagnostics-csv= CLI flags exist on the C++ binary but aren’t exposed through the pybind11 wrapper - simucell3d_wrapper only forwards simulation parameters, cell list, thread count, and verbosity. Python users can’t reach the new knobs yet. Tracked for the next release.
assert() in hot paths. 306 assert() calls still live in production src/ (all runtime assert(), none static_assert). Several have been converted to proper runtime checks in the critical paths; the rest are a slower migration.
The high-CoV machinery is unexercised on committed workloads. The 0.4/0.6 chunk-band thresholds and the GROWTH phase detector are implemented and confirmed in the source, but neither fired on the 41.3h paper-exact run. A faster-dividing scenario would be the right workload to exercise them.

Three things this taught me

One: the most consequential scheduling parameter wasn’t the integrator or time step - it was the order in which threads picked up work. The right choice costs almost nothing at runtime. The wrong choice costs 2×+, silently, because CPU utilization stays pinned at 100% even while thread utilization tanks. top tells you nothing useful here; a tracing profiler tells you everything.

Two: instrumentation before optimization. It would have been easy to flip schedule(static) to schedule(dynamic) and call it done. Building the full measurement stack took longer - but it produced the uncomfortable observation that CoV never triggered the fancy band machinery on this workload, that GROWTH never fired, that the high-CoV branches are correct but untested on these parameters. Knowing those gaps is more useful than assuming everything worked.

Three: fix correctness before you trust a benchmark. The sanitizers found a memory bug that survived manual review; the parallel-division race only surfaces under the heavier thread utilization the new schedules cause. In the right order, you catch those before they quietly contaminate your numbers.

Code: github.com/nilesh-patil/simucell3d (tag v2.0, branch main)
Reference: Runser et al., SimuCell3D, Nature Computational Science (2024)
Project page: SimuCell3D on Side Projects

Distributed K-Means Clustering in Python

2020-05-20T15:30:00+05:30

Introduction

K-means clustering is one of the most widely used unsupervised machine learning algorithms for partitioning data into k clusters. While the algorithm is conceptually simple and computationally efficient for moderate-sized datasets, it faces significant challenges on large datasets, where every iteration has to scan millions or billions of data points.

In this post, we’ll explore how to implement distributed k-means clustering in Python using popular frameworks like PySpark and Dask, enabling us to handle massive datasets that don’t fit into memory on a single machine.

K-Means Algorithm Overview

Before diving into distributed implementations, let’s quickly review the standard k-means algorithm:

Initialization: Sample k cluster centroids randomly
Assignment Step: Assign each data point to the nearest centroid
Update Step: Recalculate centroids as the mean of assigned points
Verify: Check for the stop condition
Repeat: steps 2-4 until convergence or the maximum number of iterations is reached

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Standard scikit-learn k-means for small datasets
def standard_kmeans_example():
    # Generate sample data
    np.random.seed(42)
    X = np.random.randn(1000, 2)
    
    # Apply k-means
    kmeans = KMeans(n_clusters=3, random_state=42)
    labels = kmeans.fit_predict(X)
    
    # Plot results
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    plt.scatter(kmeans.cluster_centers_[:, 0], 
                kmeans.cluster_centers_[:, 1], 
                c='red', marker='x', s=200)
    plt.title('Standard K-Means Clustering')
    plt.show()
    
    return kmeans, labels
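To connect the steps listed earlier to code, here is a minimal from-scratch sketch in NumPy. It is not how scikit-learn implements KMeans (which uses k-means++ seeding and optimized routines); the function name, the plain random initialization, and the tolerance-based stop condition are assumptions chosen to mirror the list as literally as possible.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: sample k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 2. Assignment: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of assigned points (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Verify: stop once the centroids have essentially stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two tight, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(10.0, 0.1, (50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
print(np.round(centroids, 1))
```

The assignment step builds an n × k distance matrix in memory, which is exactly the part that stops scaling on a single machine.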

Challenges with Large-Scale Data

When dealing with large datasets, traditional k-means implementations face several challenges:

Memory Constraints: Large datasets may not fit into memory of a single machine
Computational Complexity: $O(n \times k \times d \times i)$ time complexity becomes prohibitive where n is the number of data points, k is the number of clusters, d is the dimensionality, and i is the number of iterations
I/O Bottlenecks: Reading massive datasets from disk makes the transfer between disk and memory a bottleneck
Scalability: Single-machine limitations prevent processing datasets beyond hardware capacity
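A back-of-the-envelope calculation makes the first two challenges concrete. The dataset dimensions below are made-up illustrative numbers, the helper name is hypothetical, and the estimate assumes dense 8-byte floats with no copies or overhead.

```python
def kmeans_cost(n, k, d, i, bytes_per_value=8):
    """Rough operation count and raw in-memory size for k-means on an n x d dataset."""
    distance_ops = n * k * d * i                 # the O(n * k * d * i) term
    memory_gb = n * d * bytes_per_value / 1e9    # dense float64 matrix, no overhead
    return distance_ops, memory_gb

ops, mem_gb = kmeans_cost(n=1_000_000_000, k=10, d=50, i=20)
print(f"{ops:.1e} multiply-adds, {mem_gb:.0f} GB just to hold the data")
# 1.0e+13 multiply-adds, 400 GB just to hold the data
```

At that scale the data alone exceeds the RAM of a typical single machine, before any intermediate arrays are allocated.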

Distributed K-Means Implementations

Distributed K-Means with PySpark

Apache Spark provides good support for distributed k-means clustering through its MLlib library. Here’s how to implement it:

Setting Up PySpark Environment

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans as SparkKMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql.functions import col
import pyspark.sql.functions as F

# Initialize Spark session
def create_spark_session():
    spark = SparkSession.builder \
        .appName("DistributedKMeans") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
        .getOrCreate()
    
    spark.sparkContext.setLogLevel("WARN")
    return spark

Implementing Distributed K-Means

def distributed_kmeans_pyspark(spark, data_path, k=3, max_iter=100):
    """
    Perform distributed k-means clustering using PySpark
    
    Parameters:
    -----------
    spark : SparkSession
        Active Spark session
    data_path : str
        Path to the dataset (CSV, Parquet, etc.)
    k : int
        Number of clusters
    max_iter : int
        Maximum number of iterations
        
    Returns:
    --------
    model : KMeansModel
        Trained k-means model
    predictions : DataFrame
        DataFrame with cluster assignments
    """
    
    # Load data
    df = spark.read.option("header", "true").csv(data_path)
    
    # Convert string columns to numeric (if needed)
    numeric_cols = [col_name for col_name in df.columns 
                    if col_name not in ['id', 'label']]
    
    for col_name in numeric_cols:
        df = df.withColumn(col_name, col(col_name).cast("double"))
    
    # Create feature vector
    assembler = VectorAssembler(
        inputCols=numeric_cols,
        outputCol="features"
    )
    df_vectorized = assembler.transform(df)
    
    # Initialize and train k-means model
    kmeans = SparkKMeans(
        k=k,
        maxIter=max_iter,
        seed=42,
        featuresCol="features",
        predictionCol="cluster"
    )
    
    model = kmeans.fit(df_vectorized)
    
    # Make predictions
    predictions = model.transform(df_vectorized)
    
    # Display cluster centers
    centers = model.clusterCenters()
    print("Cluster Centers:")
    for i, center in enumerate(centers):
        print(f"Cluster {i}: {center}")
    
    # Calculate Within Set Sum of Squared Errors (WSSSE)
    wssse = model.summary.trainingCost
    print(f"Within Set Sum of Squared Errors: {wssse}")
    
    return model, predictions

# Example usage
def run_pyspark_example():
    spark = create_spark_session()
    
    # Generate sample distributed dataset
    sample_data = spark.range(0, 100000).select(
        F.rand(seed=42).alias("feature1"),
        F.rand(seed=43).alias("feature2"),
        F.rand(seed=44).alias("feature3")
    )
    
    # Save to temporary location for demonstration
    sample_data.write.mode("overwrite").option("header", "true").csv("/tmp/sample_data")
    
    # Run distributed k-means
    model, predictions = distributed_kmeans_pyspark(
        spark, "/tmp/sample_data", k=5, max_iter=50
    )
    
    # Show sample predictions
    predictions.select("feature1", "feature2", "feature3", "cluster").show(20)
    
    spark.stop()
    return model, predictions

Advanced PySpark K-Means with Custom Initialization

def advanced_kmeans_pyspark(spark, df, k=3, init_method="k-means++"):
    """
    Advanced k-means implementation with custom initialization strategies
    """
    
    if init_method == "k-means++":
        # Spark ML has no "k-means++" mode; its scalable variant is
        # k-means|| (k-means parallel). It oversamples candidate centers in
        # a few passes (initSteps) instead of making k sequential passes.
        kmeans = SparkKMeans(k=k, initMode="k-means||", initSteps=2)
    elif init_method == "random":
        kmeans = SparkKMeans(k=k, initMode="random")
    else:
        raise ValueError(f"Unsupported init_method: {init_method!r}")
    
    # Add convergence tolerance
    kmeans.setTol(1e-4)
    
    # Train model
    model = kmeans.fit(df)
    
    # Evaluate model performance
    silhouette_evaluator = ClusteringEvaluator()
    predictions = model.transform(df)
    silhouette = silhouette_evaluator.evaluate(predictions)
    
    print(f"Silhouette Score: {silhouette}")
    
    return model, predictions
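It helps to see why k-means distributes so well. The update step only needs, per cluster, the sum of assigned points and their count, and both can be computed independently on each partition and then added together. The NumPy sketch below illustrates that general map-reduce pattern on in-memory "partitions"; it is not Spark's actual code, and the function names are hypothetical.

```python
import numpy as np

def partial_stats(X_part, centroids):
    """'Map' side: per-cluster sums and counts for one partition."""
    d2 = ((X_part[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=np.int64)
    np.add.at(sums, labels, X_part)   # accumulate points into their cluster's row
    np.add.at(counts, labels, 1)
    return sums, counts

def distributed_step(partitions, centroids):
    """One k-means iteration: combine per-partition statistics into new centroids."""
    stats = [partial_stats(p, centroids) for p in partitions]   # runs per worker
    sums = sum(s for s, _ in stats)                              # 'reduce' side
    counts = sum(c for _, c in stats)
    new_centroids = centroids.astype(float)                      # astype returns a copy
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
start = X[:4].copy()
split = distributed_step(np.array_split(X, 5), start)   # five "workers"
whole = distributed_step([X], start)                     # single machine
print(np.allclose(split, whole))
```

Only k × d sums and k counts travel over the network per partition per iteration, regardless of how many rows the partition holds.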

Distributed K-Means with Dask

Dask provides another robust framework for distributed computing in Python, with a more Pythonic API and a specific focus on numerical computing:

Setting Up Dask Environment

import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client, as_completed
from dask_ml.cluster import KMeans as DaskKMeans
import numpy as np

def create_dask_client(n_workers=4):
    """
    Create a Dask client for distributed computing
    """
    client = Client(n_workers=n_workers, threads_per_worker=2)
    print(f"Dask dashboard available at: {client.dashboard_link}")
    return client

Implementing Distributed K-Means with Dask

def distributed_kmeans_dask(data_path, k=3, max_iter=100, chunk_size="100MB"):
    """
    Perform distributed k-means clustering using Dask
    
    Parameters:
    -----------
    data_path : str
        Path to the dataset
    k : int
        Number of clusters
    max_iter : int
        Maximum number of iterations
    chunk_size : str
        Size of data chunks for processing
        
    Returns:
    --------
    model : DaskKMeans
        Trained k-means model
    labels : dask.array
        Cluster assignments
    """
    
    # Load data with Dask; blocksize sets the size of each partition
    df = dd.read_csv(data_path, blocksize=chunk_size)
    
    # Convert to a Dask array for clustering
    # Exclude non-numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    X = df[numeric_cols].to_dask_array(lengths=True)
    
    # Initialize Dask k-means
    kmeans = DaskKMeans(
        n_clusters=k,
        max_iter=max_iter,
        random_state=42,
        init_max_iter=3  # Iterations of the k-means|| initialization
    )
    
    # Fit the model
    print("Training distributed k-means model...")
    kmeans.fit(X)
    
    # Predict cluster labels
    labels = kmeans.predict(X)
    
    # Get cluster centers
    centers = kmeans.cluster_centers_
    print(f"Cluster centers shape: {centers.shape}")
    
    # Calculate inertia (within-cluster sum of squares)
    inertia = kmeans.inertia_
    print(f"Inertia: {inertia}")
    
    return kmeans, labels

# Example with synthetic data generation
def generate_large_dataset_dask(n_samples=1000000, n_features=10, n_centers=5):
    """
    Generate a large synthetic dataset using Dask
    """
    from sklearn.datasets import make_blobs
    
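    # Draw the blob centers once so every chunk shares the same clusters.
    # Passing an int for `centers` would re-draw them for each random_state.
    rng = np.random.RandomState(42)
    centers = rng.uniform(-10.0, 10.0, size=(n_centers, n_features))
    
    # Generate data in chunks
    chunk_size = 100000
    chunks = []
    
    for i in range(0, n_samples, chunk_size):
        current_chunk_size = min(chunk_size, n_samples - i)
        X_chunk, _ = make_blobs(
            n_samples=current_chunk_size,
            centers=centers,
            random_state=42 + i,
            cluster_std=1.5
        )
        chunks.append(da.from_array(X_chunk, chunks=(current_chunk_size, n_features)))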
    
    # Concatenate chunks
    X_large = da.concatenate(chunks, axis=0)
    return X_large

def run_dask_example():
    """
    Complete example of distributed k-means with Dask
    """
    # Create Dask client
    client = create_dask_client(n_workers=4)
    
    try:
        # Generate large synthetic dataset
        print("Generating large synthetic dataset...")
        X = generate_large_dataset_dask(n_samples=500000, n_features=8, n_centers=4)
        
        # Apply k-means clustering
        print("Applying distributed k-means...")
        kmeans = DaskKMeans(n_clusters=4, random_state=42)
        
        # Fit, then read the labels computed during fitting
        kmeans.fit(X)
        labels = kmeans.labels_
        
        # Compute results
        unique_labels = da.unique(labels).compute()
        print(f"Unique cluster labels: {unique_labels}")
        
        # Calculate cluster statistics
        centers = kmeans.cluster_centers_
        print(f"Cluster centers:\n{centers}")
        
        return kmeans, labels
        
    finally:
        # Clean up
        client.close()

Incremental K-Means with Dask

Note: dask_ml.cluster.KMeans does not expose partial_fit, so calling it raises AttributeError. For true incremental / streaming k-means, use dask_ml.wrappers.Incremental wrapping scikit-learn’s MiniBatchKMeans, which does support partial_fit.

def incremental_kmeans_dask(data_stream, k=3, batch_size=10000):
    """
    Implement incremental k-means for streaming data using Dask's Incremental
    wrapper around scikit-learn's MiniBatchKMeans.

    dask_ml.cluster.KMeans does NOT have partial_fit; use Incremental instead.
    """
    from sklearn.cluster import MiniBatchKMeans
    from dask_ml.wrappers import Incremental

    # Wrap MiniBatchKMeans (which supports partial_fit) with Dask's Incremental
    base = MiniBatchKMeans(n_clusters=k, random_state=42)
    kmeans = Incremental(base)

    # Process data in batches — Incremental delegates to partial_fit internally
    for batch in data_stream:
        kmeans.partial_fit(batch)

        # Track convergence via the underlying estimator
        if hasattr(kmeans.estimator_, 'inertia_'):
            print(f"Current inertia: {kmeans.estimator_.inertia_}")

    return kmeans

Performance Comparison

Let’s compare the performance of different implementations:

import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def performance_comparison():
    """
    Compare performance of different k-means implementations
    """
    # Generate test data
    sizes = [10000, 50000, 100000]
    results = {}
    
    for size in sizes:
        print(f"\nTesting with {size} samples...")
        X, _ = make_blobs(n_samples=size, centers=5, n_features=10, random_state=42)
        
        # Scikit-learn (single machine, data fully in memory)
        start_time = time.time()
        sklearn_kmeans = KMeans(n_clusters=5, random_state=42)
        sklearn_kmeans.fit(X)
        sklearn_time = time.time() - start_time
        
        # Dask (if data fits in memory)
        start_time = time.time()
        X_dask = da.from_array(X, chunks=(10000, 10))
        dask_kmeans = DaskKMeans(n_clusters=5, random_state=42)
        dask_kmeans.fit(X_dask)
        dask_time = time.time() - start_time
        
        results[size] = {
            'sklearn': sklearn_time,
            'dask': dask_time,
            'speedup': sklearn_time / dask_time
        }
        
        print(f"Scikit-learn: {sklearn_time:.2f}s")
        print(f"Dask: {dask_time:.2f}s")
        print(f"Speedup: {sklearn_time/dask_time:.2f}x")
    
    return results

Best Practices

1. Data Preprocessing

def preprocess_for_clustering(df):
    """
    Best practices for data preprocessing
    """
    # Handle missing values
    df = df.fillna(df.mean())
    
    # Standardize features
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    df_scaled = scaler.fit_transform(df)
    
    # Remove outliers (optional)
    from scipy import stats
    z_scores = np.abs(stats.zscore(df_scaled))
    df_clean = df_scaled[(z_scores < 3).all(axis=1)]
    
    return df_clean

2. Optimal Number of Clusters

def find_optimal_k_distributed(X, max_k=10):
    """
    Sweep k from 1..max_k and record the within-cluster sum of squares (inertia)
    for each fit. Returns the k-range and the inertia curve for elbow inspection.
    """
    inertias = []
    k_range = list(range(1, max_k + 1))

    for k in k_range:
        kmeans = DaskKMeans(n_clusters=k, random_state=42)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)

    # Plot elbow curve
    import matplotlib.pyplot as plt
    plt.figure(figsize=(10, 6))
    plt.plot(k_range, inertias, 'bo-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method for Optimal k')
    plt.show()

    return k_range, inertias


def find_elbow_point(k_range, inertias):
    """
    Pick the k whose point on the inertia curve is farthest (in perpendicular
    distance) from the straight line connecting the first and last points.
    A simple, robust heuristic for elbow detection that avoids a manual eyeball.
    """
    import numpy as np
    points = np.array(list(zip(k_range, inertias)), dtype=float)
    line_vec = points[-1] - points[0]
    line_vec_norm = line_vec / np.linalg.norm(line_vec)
    vec_from_first = points - points[0]
    scalar_proj = vec_from_first @ line_vec_norm
    proj_points = np.outer(scalar_proj, line_vec_norm) + points[0]
    distances = np.linalg.norm(points - proj_points, axis=1)
    return int(k_range[int(np.argmax(distances))])

3. Memory Management

def optimize_memory_usage():
    """
    Tips for optimizing memory usage in distributed k-means
    """
    
    # 1. Use appropriate chunk sizes
    chunk_size = "100MB"  # Adjust based on available memory
    
    # 2. Use float32 instead of float64 when possible
    dtype = np.float32
    
    # 3. Persist intermediate results strategically
    # df.persist()  # Only when data will be reused multiple times
    
    # 4. Use garbage collection
    import gc
    gc.collect()
    
    return {
        'chunk_size': chunk_size,
        'dtype': dtype
    }

Conclusion

Distributed k-means clustering is essential for handling large-scale datasets that exceed single-machine capabilities. Both PySpark and Dask offer robust solutions:

PySpark MLlib is ideal when:

Working with very large datasets (>1TB)
Integration with existing Spark ecosystem
Need for production-grade fault tolerance

Dask is preferred when:

Working with Python-centric workflows
Need for interactive development
Integration with existing NumPy/Pandas code

Key Takeaways:

Preprocessing is crucial for distributed clustering success
Chunk size optimization significantly impacts performance
Initialization methods (k-means++) are important for convergence
Monitoring convergence and performance metrics is essential
Memory management becomes critical at scale

The choice between frameworks depends on your specific use case, data size, and existing infrastructure. Both approaches can handle datasets that would be impossible to process on a single machine, making k-means clustering accessible for big data applications.

# Final example: Complete pipeline
def complete_distributed_kmeans_pipeline(data_path, framework='dask'):
    """
    Complete pipeline for distributed k-means clustering
    """
    if framework == 'dask':
        client = create_dask_client()
        try:
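            # Load and preprocess data
            df = dd.read_csv(data_path)
            # preprocess_for_clustering calls DataFrame.fillna, so it needs a
            # pandas DataFrame; compute() materializes the data and assumes it
            # fits in memory. The cleaned NumPy array is wrapped back into Dask.
            X = da.from_array(preprocess_for_clustering(df.compute()))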
            
            # Find optimal k
            k_range, inertias = find_optimal_k_distributed(X)
            optimal_k = find_elbow_point(k_range, inertias)
            
            # Train final model
            kmeans = DaskKMeans(n_clusters=optimal_k, random_state=42)
            labels = kmeans.fit_predict(X)
            
            return kmeans, labels
        finally:
            client.close()
    
    elif framework == 'pyspark':
        spark = create_spark_session()
        try:
            # Select k with the same elbow scan as the Dask branch.
            # toPandas() collects the data onto the driver, so this step
            # assumes the dataset fits in memory.
            df_spark = spark.read.csv(data_path, header=True, inferSchema=True)
            X = df_spark.toPandas().values
            k_range, inertias = find_optimal_k_distributed(X)
            optimal_k = find_elbow_point(k_range, inertias)

            model, predictions = distributed_kmeans_pyspark(
                spark, data_path, k=optimal_k
            )
            return model, predictions
        finally:
            spark.stop()

Galactic Morphology using Deep-Learning

2017-07-26T01:09:55+05:30

Introduction

Astronomy has historically been one of the most data-intensive fields, and a major chunk of this data consists of images captured by telescopes, both terrestrial and space-based. A big-data project that aims to collate this data from various sources into a coherent picture of the universe is the Sloan Digital Sky Survey (SDSS).

To quote the project website:

The Sloan Digital Sky Survey has created the most detailed three-dimensional maps of the Universe ever made, with deep multi-color images of one third of the sky, and spectra for more than three million astronomical objects. Learn and explore all phases and surveys — past, present, and future — of the SDSS.

A citizen science project called Galaxy Zoo was launched in 2007. Through this project, thousands of volunteers classified 100k+ images of galaxies. The flow-chart of questions asked to volunteers, as shown on the project website, is as follows.

Data Description

The dataset consists of 100k+ jpeg images and the corresponding score vector for each image. The score vector has 37 values where each value represents the weighted score from volunteers in the project.

The important point to remember is that the scores aren’t probabilities per se. They are weighted scores, so each one lies between 0 and 1, but the sub-scores for a question don’t necessarily sum to 1.

Each image is of size 424×424×3 and each value is between 0–255. A good practice is to rescale the data. In our experiments, we normalize the images by computing $\mu_\text{channel}$ and $\sigma_\text{channel}$ over the full dataset, then normalizing each channel using its corresponding $\mu$ and $\sigma$. The channels and cell values do not represent the physical aspect of data collection — they are standardized to the accepted image format range of 0–255 — so the normalization is more of a hack for better gradient updates than a domain-knowledge-based modification.
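The per-channel normalization described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names, the small `eps` guard, and the random stand-in batch are hypothetical and not taken from the original experiments.

```python
import numpy as np

def channel_stats(images):
    """Per-channel mean and standard deviation for images shaped (N, H, W, 3)."""
    pixels = images.reshape(-1, images.shape[-1]).astype(np.float64)
    return pixels.mean(axis=0), pixels.std(axis=0)

def normalize(images, mu, sigma, eps=1e-8):
    """Subtract the channel mean and divide by the channel std (eps is an assumed guard)."""
    return (images.astype(np.float64) - mu) / (sigma + eps)

# Random stand-in for a handful of 424x424 RGB images with values in 0-255
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 424, 424, 3), dtype=np.uint8)

mu, sigma = channel_stats(batch)
normalized = normalize(batch, mu, sigma)
print(normalized.mean(axis=(0, 1, 2)), normalized.std(axis=(0, 1, 2)))
```

For the full 100k+ image dataset the statistics would more likely be accumulated as running sums over batches rather than by holding every image in memory at once.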

A few sample images from the dataset:

These images are read in as numpy arrays in python with the following representation:

Fully Convolutional Classifier

A convolutional network takes your image array as input, extracts the features that best represent the task at hand, and then produces a classification or regression output. Standard classification models use a one-hot encoding of the target: the correct class is assigned 1 while all other classes in the dataset are assigned 0. The output vector is of length c, where c is the total number of classes in the dataset.
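The 0/1 target representation described above (commonly called one-hot encoding) can be illustrated with a short NumPy sketch. The helper name `one_hot` and the toy labels are assumptions made for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([2, 0, 1])          # class index of each observation
targets = one_hot(labels, num_classes=3)
print(targets)
```

In the galaxy task the target is instead a vector of 37 weighted scores, which is why the problem is treated as regression rather than classification.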

In the Galaxy Morphology classification task, we use standard .jpeg images to learn the shape attributes as a vector of length 37 which describes its properties. We set it up as a regression task in this case, since our ground truth is a weighted version of votes gathered from volunteers.

The model takes in a normalized array representing the input image. This array passes through the following layers, stacked one after another:

Convolutional layer : It consists of a set of learnt features. In standard modeling terminology, the features that a model uses are usually handcrafted, i.e. some transformation of the raw input data. With images, the kernels that form the convolutional layer are expected to learn optimal features for the task at hand, instead of relying on features crafted by a domain expert. Since the kernels are learnt with respect to the output, the features generated at each step should ideally be the optimal representation of the input provided at that step.
Pooling layer : The pooling layer reduces the size of the incoming representation by applying a downsampling function. Max-pooling takes the maximum over a given window of the array as its representative value. Similarly, average-pooling takes the average over the window.
Activation : Activation functions are used to introduce non-linearity in the model. This layer applies a given function to each element of the input array. Standard activation functions include ‘relu’, ‘softmax’, ‘sigmoid’, ‘tanh’ etc.
Dropout: a regularization technique developed to reduce overfitting in deep neural networks. At training time, the activations of a randomly chosen fraction of neurons are set to zero; at prediction time, the weights learnt for those units are multiplied by the keep-probability p. Below, (a) shows a standard fully-connected network and (b) shows the same network with dropout applied — at each training step a different subset of activations is zeroed, which prevents any single neuron from dominating the learned representation.
Batch normalization: a major problem during training is that the distribution of inputs to each successive layer shifts as the parameters of preceding layers update. This internal covariate shift forces small learning rates and slow convergence. Batch normalization addresses it by normalizing each channel’s activations to $\mu_x = 0$ and $\sigma_x = 1$ across the current mini-batch, then applying a learned affine transform:
\[X_\text{out} = \gamma \cdot \frac{X_\text{in} - \mu_X}{\sigma_X} + \beta\]
Here $\mu_X$ and $\sigma_X$ are computed channelwise. The effect is that subsequent layers see a more stable input distribution, which permits larger learning rates and faster training.
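The dropout and batch normalization layers above can be sketched in plain NumPy. This is a hedged illustration, not the training code used here: the function names are hypothetical, `eps` is an assumed numerical guard that does not appear in the formula, and the running statistics that a real batch-norm layer keeps for prediction time are omitted.

```python
import numpy as np

def dropout_train(x, keep_prob, rng):
    """Training time: zero each activation with probability 1 - keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask

def dropout_predict(x, keep_prob):
    """Prediction time: scaling activations by keep_prob is equivalent
    to scaling the outgoing weights by keep_prob."""
    return x * keep_prob

def batch_norm(x, gamma, beta, eps=1e-5):
    """Channelwise batch normalization for x shaped (N, H, W, C)."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per channel
    return gamma * x_hat + beta             # learned affine transform

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(16, 8, 8, 4))

out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
dropped = dropout_train(x, keep_prob=0.8, rng=rng)
scaled = dropout_predict(x, keep_prob=0.8)
print(out.mean(axis=(0, 1, 2)), out.std(axis=(0, 1, 2)))
```

With `gamma = 1` and `beta = 0` the output of each channel has mean close to 0 and standard deviation close to 1, matching the formula above.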

Setup

The layers above are stacked into a module and we experiment with several network structures, starting from well-established architectures and moving toward more recently published ones. The final layers are Dense (fully connected) layers; their output is compared with the expected output to compute the loss for each observation. The loss is back-propagated to update the convolutional kernel weights and the dense layer weights, with the expectation that, given enough data, the network learns features that are optimal for the task at hand.

In our setup, features are extracted successively from the raw galaxy image and a regression head matches the human-generated score vector for the same image. The score is a vector of length 37, where each entry encodes one physical aspect of the galaxy — together describing its morphology.

References

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15.
Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning.

Characterizing & analyzing networks : NYC taxi data

2017-03-15T01:09:55+05:30

Introduction

Transportation networks offer a fascinating opportunity to identify a local population’s travel habits and daily routines, and a usage-data-driven way to augment city-planning decisions. In our analysis, we focus on exploring travel patterns of New York City residents using 146+ million taxi trips taken in 2015. The complete code, visualizations & reports are available in the Github Repository.

Prior work

GPS-based transportation networks have been studied in detail for traffic flow analysis and determining social dynamics^[1]. Bike sharing datasets have been used for clustering locations based on the usage profile^[2] and predicting bike demand^[3]. GPS-based taxi datasets have been used to identify mobility patterns in Shanghai, China^[4].

In^[4], the trip distribution has been characterized as a combination of 3 independent types, and non-negative matrix factorization has been used to identify 3 patterns from 1.58 million trips in Shanghai, China. This is the core of our approach, as we are attempting to characterize taxi usage in New York City with a similar dataset. Prior analysis has also been done to produce transportation network graphs from street geometries^[5] and subway maps^[6].

Data Description

a) Raw data :

New York City’s Taxi & Limousine Commission has made the taxi-trips dataset available for public use since 2009^[7]. We used this dataset for our analysis. The complete dataset contains approximately 150 million trips per year, and each row represents one trip, with features for starting and stopping point, distance travelled, taxi charge, time taken etc. We use the 2015 dataset, which contains 146,112,990 trips in total. To map the geolocation data to census tracts, we used the extremely helpful NYC land use dataset^[8].

b) Data transformation :

We use the following variables for each trip:

Trip starting timestamp
Start point (Lat/Long)
Trip stopping timestamp
Stop point(Lat/Long)
Charges

Using these trips, we build a directed graph with the start and stop locations of each trip as nodes. As an additional condition, we only use pairs of locations with a high number of trips (more than 500 in the year for a given pair). Prior work suggests rounding location coordinates to 2 decimal places as another option. Given our difficulties analyzing the dataset with 40,000+ nodes, we are now reducing our network with 2 separate approaches:

One node for 5 Manhattan blocks
Using 6 million most frequent trips to get 1275 most frequently travelled edges.
In our current network, each node represents 200m x 200m around it and each edge represents the total number of trips between two nodes in a given year.
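The graph construction described above (snap coordinates to a grid, count trips per ordered pair of locations, keep only busy pairs) can be sketched with the standard library. The function names and the toy coordinates are hypothetical; the 500-trip threshold and the 2-decimal rounding come from the text.

```python
from collections import Counter

def snap(lat, lon, decimals=2):
    """Round a coordinate pair so that nearby points share one node."""
    return (round(lat, decimals), round(lon, decimals))

def build_edges(trips, min_trips=500, decimals=2):
    """Count trips per directed (start, stop) pair and keep pairs with more than min_trips."""
    counts = Counter(
        (snap(s_lat, s_lon, decimals), snap(e_lat, e_lon, decimals))
        for s_lat, s_lon, e_lat, e_lon in trips
    )
    return {edge: n for edge, n in counts.items() if n > min_trips}

# Toy data: 501 trips from one midtown cell to one airport cell, 3 trips back
trips = (
    [(40.7506, -73.9935, 40.6413, -73.7781)] * 300
    + [(40.7512, -73.9941, 40.6409, -73.7779)] * 201
    + [(40.6413, -73.7781, 40.7506, -73.9935)] * 3
)
edges = build_edges(trips)
for (start, stop), n in edges.items():
    print(start, "->", stop, n)
```

The two slightly different pickup points collapse into the same node after rounding, and the return direction is dropped because it falls below the threshold.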

We create features for month, day, weekday, period of the day etc. from the timestamp. An issue we faced during our initial analysis was that, due to the geography of the network itself, multiple nodes represented the same location because of multiple entrances (e.g. Penn Station has multiple entries and exits, and our network had multiple nodes representing essentially the same real-world location).

Same location : multiple close nodes

We finally decided to merge our dataset with US Census Bureau census tracts, which removed the above problem. We have 580+ nodes in the final network, with each census tract as a node and the number of trips between two census tracts as an edge.

Exploratory Analysis

1 . Trips taken in each month (fig-i) peak between March and May and drop substantially from June onwards. This can be attributed directly to the weather pattern, as commuters are expected to avoid walking long distances in low temperatures or rainy weather.

(fig-i) Trips per month, 2015

2 . For our full dataset, the degree distribution is shown in the first plot, whereas the second plot shows the degree distribution for the graph generated using edges with at least 500 trips in 2015.

(fig-ii) Degree distribution, full vs. filtered network

3 . The heat-map (fig-iii) shows relative trip density for each hour of the day in 2015. From this, we conclude that the busiest hours are 6AM to 9AM. We attribute most of this traffic to users on their daily work commute, whereas on weekends there is a remarkable increase in density between 12AM and 4AM. We are looking for an approach to perform a similar analysis on the network structure generated from our subset, to create a temporal traffic density visualization.

(fig-iii) Trip density by hour of day and day of week

4 . We analyzed the cost vs. duration relationship for the trips and found an abnormally large number of constant-cost trips. We attribute these trips to:

Tips being rounded off to nearest 5/10
Traffic delays within a trip increasing its duration

Full Network analysis

The full network is approximately represented on its actual geographic location and we have mapped out-degree to node size & trips as edge thickness. We observed that:

The suburbs are served less than Manhattan, the Upper East/West Side and downtown
Transportation hubs are also network hubs, and office areas are the next most central nodes
Surprisingly, the East Village & Lower East Side are also among the least connected parts of the complete network, even though these areas are not geographically separated like the suburbs

When we divide our nodes into two subcategories, those with at least 500 trips and those with fewer than 500, and plot in-degree and out-degree against the total number of trips, two stark contrasts appear. The top two graphs in Figure 5 show nodes with at least 500 trips, with the blue graph marking the in-degree to trips ratio and the red graph the out-degree. The bottom two graphs show nodes with fewer than 500 trips.

For number of trips >=500, most of the outliers on the far right of the graphs are located at inner-city attractions such as Madison Square Garden and Penn Station. This means that a large number of people come to these attractions from a relatively small number of places, most of which are located in Manhattan (e.g. 250,000 trips coming from about 200 places, or 1,250 trips per place on average).

For number of trips >=500, most of the outliers on the far right of the graphs are located at airports (LaGuardia as well as JFK), and their trips-to-degree ratio is much smaller, meaning that a small number of people come from each of many different places. We can also easily identify places with low connectivity by looking at the “tail” of the plot: the smaller this ratio is, the farther away the node is from Manhattan.
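The trips-to-degree ratio used in the comparison above can be computed directly from an edge list. The sketch below is illustrative only: `degree_stats` and the tract names are hypothetical, and the toy numbers simply reproduce the 250,000 trips from about 200 places example.

```python
from collections import defaultdict

def degree_stats(edges):
    """edges maps (origin, destination) -> trip count; returns per-node degrees and trip totals."""
    stats = defaultdict(
        lambda: {"in_degree": 0, "out_degree": 0, "trips_in": 0, "trips_out": 0}
    )
    for (origin, destination), trips in edges.items():
        stats[origin]["out_degree"] += 1
        stats[origin]["trips_out"] += trips
        stats[destination]["in_degree"] += 1
        stats[destination]["trips_in"] += trips
    return dict(stats)

# Hypothetical hub: 200 origin tracts, each sending 1,250 trips to it
edges = {(f"tract_{i}", "hub"): 1250 for i in range(200)}
hub = degree_stats(edges)["hub"]
trips_per_origin = hub["trips_in"] / hub["in_degree"]
print(hub["trips_in"], hub["in_degree"], trips_per_origin)
```

A high ratio corresponds to many trips arriving from few places, while a low ratio corresponds to a few trips from each of many places.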

We divided the network into 3 communities, using multilevel community detection in igraph. The above plot has these communities mapped by size to number of trips leaving each node in the whole year. The 3 communities can be described as follows:

The Blue labels represent community A, and the nodes neatly fall into Manhattan and adjoining New Jersey areas which turn out to be most well connected nodes. They are well connected to Manhattan (Manhattan as well as New Jersey locations) and the other two communities (Only Manhattan locations)
The Green labels represent community B, and it represents the locations with highest taxi connectivity to north parts of the city which in turn is due to least city transport connectivity (Bus/Metro etc) – towards north in general
The red labels represent community C, representing north NYC, Queens & Bronx which we understand are in the same community due to least connectivity towards south in general.
We wanted to determine if the suburb structure found by Dash Nelson & Rae^[11] using a national dataset holds true at a local level, which is why this result is interesting: based on our primary exploration, our inference is that within a city it doesn’t hold up.

We plotted a snapshot of the trips leaving major NYC areas, which shows that Manhattan is the most connected of all, whereas most trips from the Lower East Side, East Village & Brooklyn end up towards the northern side of NYC, with only a small fraction ending within the community itself.

Final comments

Cities are different from other networks in the sense that minor rerouting is usually pretty straightforward i.e. a detour of one block is easy to take and usually does not lead to significant change in cost, time or length of the route.
A city like New York won’t have a single critical location apart from the structural hubs (Metro hubs, airports & bus hubs) – this is pretty apparent from the degree centrality analysis. The closest thing to a central location in NYC is its avenues, specifically Broadway & 6th avenue. Broadway runs North to South whereas 6th Avenue runs South to North (one-way routes).
Another interesting observation from the analysis is that East village & below is similar to suburbs in terms of taxi usage. This is surprising because as stated by Tobler, the first law of geography is “everything is related to everything else, but near things are more related than distant things.”^[12] & this first law is the foundation of spatial dependence and spatial autocorrelation utilized specifically for the inverse distance weighting method for spatial interpolation^[13].

References

P. S. Castro, D. Zhang, C. Chen, S. Li, and G. Pan. From taxi GPS traces to social and community dynamics: A survey. ACM Comput. Surv., December 2013.
C. Etienne and O. Latifa. Model-based count series clustering for bike sharing system usage mining. ACM Trans. Intell. Syst. Technol., July 2014
D. Singhvi, S. Singhvi, P. Frazier, S. Henderson, E. Mahony, D. Shmoys, and D. Woodard. Predicting bike usage for new york city’s bike sharing system. In AAAI Workshops, 2015.
Peng C, Jin X, Wong K-C, Shi M, Lio P (2012) Collective Human Mobility Pattern from Taxi Trips in Urban Area. PLoS ONE 7(4): e34487. doi:10.1371/journal.pone.0034487
P. Crucitti, V. Latora, and S. Porta. Centrality measures in spatial networks of urban streets. Physical Review E, 73(3):036125, 2006.
Derrible S (2012) Network Centrality of Metro Systems. PLoS ONE 7(7): e40575. doi:10.1371/journal.pone.0040575
NYC taxi data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
NYC land use: https://www1.nyc.gov/site/planning/index.page
W. Cui, H. Zhou, H. Qu, P. C. Wong and X. Li, “Geometry-Based Edge Clustering for Graph Visualization,” in IEEE Transactions on visualization and Computer Graphics, vol. 14, no. 6, pp. 1277-1284, Nov.-Dec. 2008 doi: 10.1109/TVCG.2008.135
Holten, D, & Wijk, J 2009, ‘Force-Directed Edge Bundling for Graph Visualization’, Computer Graphics Forum, 28, 3, pp. 983-990, Business Source Premier, EBSCOhost, viewed 12 November 2016.
Dash Nelson G, Rae A (2016) An Economic Geography of the United States: From Commutes to Megaregions. PLoS ONE 11(11): e0166083. doi:10.1371/journal.pone.0166083
Tobler W. A computer movie simulating urban growth in the Detroit region. Economic Geography 1970;46: 234–240.
https://en.wikipedia.org/wiki/Tobler’s_first_law_of_geography

Working with numpy

2017-03-04T11:40:55+05:30

NumPy is a Python library that provides fast computation over arrays (vectors, matrices, tensors). It is faster than the equivalent base-Python because operations are vectorized, and the resulting code stays close to the mathematical notation of the underlying operation — without the bookkeeping and overhead of element-wise loops.

import numpy as np

Vectors

Create vectors by generating different sequences of numbers

# sequence of integers between given bounds

w = np.arange(10,25, step=1)

# 10 random integers between given bounds

x = np.random.randint(low=0,high=10,size=10)

# 10 real numbers drawn from standard normal distribution

y = np.random.randn(10)

# A vector of length 10 with all zeroes

z = np.zeros(10)

# Another convenient way to generate a vector or even an array of zeroes is as follows:

z = np.zeros_like(y)

# generate sequence of numbers between given bounds & fixed step

s = np.arange(start=15, stop=35, step=2)

print(x)

[0 0 9 2 3 6 5 4 1 0]

Single vector operations

Sum of a sequence

\[\text{Sum} = \displaystyle\sum_{i=1}^{n} x_i\]

x.sum()

Adding a constant to each element of a vector

\[x_{i,\text{new}} = x_i + c\]

c = 2

X_new = x+c
print(X_new)

[ 2  2 11  4  5  8  7  6  3  2]

Multiplying every element of a vector by a constant

\[x_{i,\text{new}} = x_i \cdot c\]

c = 5

X_new = x*c
print(X_new)

[ 0  0 45 10 15 30 25 20  5  0]

Reverse a vector

S_new = s[::-1]
print(S_new)

[33 31 29 27 25 23 21 19 17 15]

Calculate basic statistical measures

Mean ($\mu$)

x = np.random.randint(low=0,high=1000,size=100)

x.mean(dtype=np.float32)

475.54001

Standard deviation ($\sigma$)

x.std(dtype=np.float32)

298.57318

Variance ($\sigma^2$)

x.var(dtype=np.float32)

89145.938
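As a quick check of what these methods compute, the sketch below reproduces them from their definitions on a small hand-picked vector (the vector is an illustrative assumption, not data from this post). NumPy divides by n by default (`ddof=0`), i.e. it returns the population variance and standard deviation.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.sum() / x.size                   # (1/n) * sum of x_i
var = ((x - mean) ** 2).sum() / x.size    # (1/n) * sum of squared deviations
std = np.sqrt(var)                        # square root of the variance

print(mean, var, std)                     # 5.0 4.0 2.0
print(x.mean(), x.var(), x.std())         # same values from the built-in methods
```

Passing `ddof=1` to `var` or `std` switches to the sample estimate, which divides by n - 1 instead.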

Subset a vector

Index for maximum & minimum values in a sequence

x.argmax()

x.argmin()

Subset using index

# elements at indices 2 to 4, i.e. the 3rd to 5th elements (the stop index 5 is excluded)

x[2:5]

array([ 31,   1, 561])

Matrices

Create a matrix

Get a matrix of particular shape by providing numbers

x = np.array([
            [1,2,3,4],
            [1,2,3,4],
            [1,2,3,4]
         ])
x

array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]])

Transpose of a matrix

y = np.array([
            [1,2,3,4],
            [1,2,3,4],
            [1,2,3,4]
         ]).T
y

array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])

Get a matrix of particular shape

np.zeros(shape=(4,5))

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

np.ones(shape=(4,5))

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

np.random.rand(4,5)

array([[ 0.09429882,  0.34480325,  0.11695385,  0.96194279,  0.53927071],
       [ 0.78844899,  0.7351646 ,  0.43960103,  0.20815778,  0.50149201],
       [ 0.26338585,  0.89077065,  0.20248855,  0.90770632,  0.91826611],
       [ 0.62807109,  0.48525764,  0.55865624,  0.88327996,  0.51471048]])

np.ones_like(x)

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])

Matrix operations

Multiply a matrix by constant

5 * x

array([[ 5, 10, 15, 20],
       [ 5, 10, 15, 20],
       [ 5, 10, 15, 20]])

Multiply a matrix by another

y@x  # New matrix multiplication operator in python3.5+ !

array([[ 3,  6,  9, 12],
       [ 6, 12, 18, 24],
       [ 9, 18, 27, 36],
       [12, 24, 36, 48]])

np.dot(y,x) # numpy based dot product

array([[ 3,  6,  9, 12],
       [ 6, 12, 18, 24],
       [ 9, 18, 27, 36],
       [12, 24, 36, 48]])

x*y.T # elementwise multiplication or Hadamard product of two matrices with the same shape

array([[ 1,  4,  9, 16],
       [ 1,  4,  9, 16],
       [ 1,  4,  9, 16]])

Subset a matrix

x[:,:]

array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]])

x[:,1:3]

array([[2, 3],
       [2, 3],
       [2, 3]])

Human activity recognition

2017-02-16T01:09:55+05:30

Introduction

A modern smartphone comes equipped with a variety of sensors, from motion detectors to optical calibrators. The data collected by these sensors is valuable for better aligning the applications on the phone with the user’s lifestyle. In this project, we focus on using data collected from motion sensors to build a model that identifies the type of activity being performed with minimal computation. The end goal is a model that classifies the activity with high accuracy without exhausting the limited computational resources available on a single phone.

The project is hosted here: Github

Data Collection and Preparation

We used the data provided by the Human Activity Recognition research project, which built this database from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. The complete data & related papers can be accessed at: UCI ML repository page

Data was collected from 30 volunteers aged 19-48 years. Each record contains features such as acceleration along the x, y, z axes, angular velocity along the x, y, z axes, 561 attributes derived from these basic measurements, an identifier for the user, and the activity being performed.

There are 6 categories of activities being performed:

standing
sitting
laying
walking
walking-downstairs
walking-upstairs

The raw data has separate text files for most of the variable groups; we used the dataset packaged as a single .RData file. In this dataset, a single column (‘subject’) identifies the user and the last column (‘activity’) identifies the activity being performed when the measurements were taken. All other attributes are stored in the same column-oriented format. Note that the values in the dataset have already been normalized.

Exploratory Analysis

a) High dimensionality:

The dataset contains 561 features and we started out by exploring how these are related to each other & whether there are some which can be safely ignored for our problem.

b) Correlation Check:

We built a correlation matrix for all 561 variables in one go to identify any apparent patterns in the relationships. Most of these features are highly correlated with each other, so it is reasonable to drop many of them: within a group of highly correlated features, one representative carries nearly the same information as the rest.

Fig 01. Correlation Matrix between all 561 features

c) Variance Check:

We checked our variables for zero or low variance so that they can be removed before running any analysis. Variables which barely change have low variance and will have little impact on the classification model.
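A minimal sketch of such a check, assuming features are stored as plain lists keyed by name. The cut-off of 1e-4 is a hypothetical value chosen for the toy data, not the one used in the project; an absolute cut-off is only meaningful here because the features are already normalized to a common scale.

```python
from statistics import pvariance

def low_variance_features(columns, min_variance=1e-4):
    """Return the names of columns whose population variance is below min_variance."""
    return [name for name, values in columns.items()
            if pvariance(values) < min_variance]

# Toy columns (hypothetical values, already on a common scale).
toy = {
    "constant": [0.5, 0.5, 0.5, 0.5],
    "nearly_constant": [0.5, 0.5, 0.501, 0.5],
    "informative": [-0.9, 0.1, 0.8, -0.2],
}
to_drop = low_variance_features(toy)
print(to_drop)
```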

d) Missing value Check:

We checked for any missing values in our columns, which might lead to errors in any future analysis but didn’t find any and so proceeded with the complete dataset.

e) Visual exploration:

We also started out with basic visual exploration of the dataset by plotting distributions of the variables for each category, but given the large number of variables involved, we dropped the idea. In general, though, the distributions show two distinct major groups, as in:

Representative distribution

Model

The first step was to create a train & test set. We split our data into two sets in a 7:3 ratio by random sampling without replacement, so that both sets are drawn from the same distribution as the complete dataset. An alternative is to do this sampling within each output class (stratified sampling), which keeps the class proportions the same in both sets. In our case, the result wasn’t significantly different.
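The per-class alternative can be sketched in a few lines of plain Python. This is an illustration rather than the project code: the function name `stratified_split` and the toy labels are made up for the example, and only the 70/30 ratio and the seed of 42 come from the write-up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=42):
    """Split row indices so each class contributes ~test_fraction of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):          # fixed class order keeps the split reproducible
        indices = by_class[label]
        rng.shuffle(indices)                # sampling without replacement within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 10 rows per activity.
labels = ["walking"] * 10 + ["sitting"] * 10 + ["laying"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```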

For modelling, we used the following techniques on our training set:

SVM – Support vector machines
Random forest (Final Model)

To gauge the stability of the model, we used the out-of-bag (OOB) score calculated during model building as a stand-in for a validation set & optimized our model to increase this score. To measure true performance, we used a separate test set which was not included in any of our variable selection, model training or validation phases. High accuracy on this independent test set is evidence that the model is not overfitting our training data & hence should generalize well.
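The OOB score works because each tree is trained on a bootstrap sample, and a bootstrap sample of n rows leaves any given row out with probability (1 - 1/n)^n, which is close to 1/e (about 37%) for large n. Those left-out rows act as a small validation set for that tree. The sketch below only demonstrates the sampling step; the helper name and the row count of 1000 are assumptions for illustration.

```python
import random

def bootstrap_with_oob(n_rows, rng):
    """Draw n_rows indices with replacement; return (in-bag draws, out-of-bag indices)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_with_oob(1000, rng)
oob_fraction = len(out_of_bag) / 1000
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```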

We used Random forest variable importance scores to determine the final variables to build our submission model. This process of variable selection was done iteratively & various parameters were tested. To maintain reproducibility, we set RandomState=42 at the beginning of the code so as to have uniform train/test sets & variables every time we run this code.

We started out with all 561 variables & reduced the total features to 5 in our final model. The focus of our process was to follow an algorithmic approach instead of a domain-knowledge-based model building process & hence we relied on the OOB score & variable importance to determine the optimal number of features, the number of trees & which features to use.

Step-by-step process

1. Set RandomState = 42

2. Split data into two sets:

   - train(70%)

   - test(30%)

3. Using training set & ALL features, build a random forest ensemble

4. From variable importance measure generated during the previous step,
   rank features according to their importance in differentiating between
   categories

Variable importance scores

5. Determine optimal number of trees & variables by iterating over 0-150
   trees & for 1-25 variables

Determining optimal number of variables & trees for training

6. For the final step, we use 5 of the most important measures determined
   in this fashion & number of trees = 50.
7. Using oob score during training phase & accuracy score from the test
   set as final step, we freeze this model for final submission
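Step 5 can be read as a small grid search over tree and feature counts. The sketch below shows one way to express it, under stated assumptions: `evaluate` is a hypothetical callable that would train a forest and return its OOB score, the selection rule (fewest features, then fewest trees, among settings within a tolerance of the best score) is an assumption made for illustration and is not stated in the write-up, and `toy_oob_score` is a made-up saturating curve so the example runs without training anything. It does not reproduce results from this project.

```python
from itertools import product

def pick_smallest_good_config(evaluate, tree_counts, feature_counts, tolerance=0.005):
    """Score every (n_trees, n_features) pair and return the cheapest near-best one."""
    scores = {(t, f): evaluate(t, f) for t, f in product(tree_counts, feature_counts)}
    best = max(scores.values())
    good = [config for config, score in scores.items() if score >= best - tolerance]
    # Prefer fewer features first, then fewer trees (assumed tie-break rule).
    chosen = min(good, key=lambda config: (config[1], config[0]))
    return chosen, scores

def toy_oob_score(n_trees, n_features):
    """Hypothetical stand-in for 'train a forest and read its OOB score'."""
    return 0.95 * (1 - 0.5 ** n_features) * (1 - 0.9 ** n_trees)

chosen, scores = pick_smallest_good_config(
    toy_oob_score,
    tree_counts=range(10, 151, 10),   # a forest needs at least one tree, so the grid starts above 0
    feature_counts=range(1, 26),
)
print(chosen, round(scores[chosen], 4))
```

Picking the smallest setting within a tolerance, rather than the single best score, favors a cheaper model, which matters when the target is a phone.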

The only major assumption in our choice of algorithm (RandomForest) is that random forests don’t usually overfit the training set. This assumption breaks down when the training dataset is extremely biased, but in our case it’s relatively balanced & hence we chose it over other algorithms.

Analysis and Results

1. Important Features:

Using the previously described feature selection, we determined that the following features were important for building our classification model:

angle(X,gravityMean)
tGravityAcc-mean()-Y
tGravityAcc-min()-X
tGravityAcc-max()-X
tBodyAcc-mad()-X

The importance scores for the final model are shown in the figure:

Importance scores for final selected features

2. We used SVM & RandomForest for the final model & their accuracy scores along with confusion matrices are as shown:

Accuracy Scores :

	Random-Forest	SVM
Train	94.50% (oob)	83.48%
Test	94.37%	82.37%

Confusion matrix :

Confusion matrix for test dataset
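For readers who want to see how the accuracy score relates to the confusion matrix, here is a minimal sketch in plain Python. The labels below are toy values invented for the example, not predictions from our model.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes, in the order of `classes`."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def accuracy(matrix):
    """Share of rows on the diagonal, i.e. predictions that match the actual class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy labels (hypothetical): one 'sitting' row is mistaken for 'standing'.
classes = ["sitting", "standing", "walking"]
actual = ["sitting", "sitting", "standing", "standing", "walking", "walking"]
predicted = ["sitting", "standing", "standing", "standing", "walking", "walking"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:>9}", row)
print("accuracy:", round(accuracy(matrix), 3))
```

Off-diagonal cells show which pairs of activities get confused with each other, which a single accuracy number hides.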

Given the high accuracy we get on the test dataset, we are confident in using a RandomForest-based model for detecting human activity from the smartphone dataset.
From the final model, we also see that some categories are fairly straightforward to classify compared to others. We have shown this using a scatterplot matrix colored by category, as shown in the figure:

Fig 04. Distribution of tBodyAccJerk-std()-X across all 6 categories

References

Random Forest:

SVM:

OOB Score:

UCI-ML dataset location:

https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Scikit-Learn:

http://scikit-learn.org/stable/index.html

Github Page:

https://github.com/nilesh-patil/HumanActivityRecognition

Visualizing distributions

2017-01-15T01:09:55+05:30

Once you have your data, usually you start by building summaries, checking for outliers, anomalies in the data & visualizing it from different angles. Here, we’ll look at a few common approaches to visualize distributions (in a highly general sense).

Connect to data

%pylab inline

import pandas as pd
import seaborn as sns
import sqlite3


db_path = './data/world-development-indicators/database.sqlite'

conn = sqlite3.connect(db_path)
db = conn.cursor()
db.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(db.fetchall())

data_countries = pd.read_sql_query('select * from Country',conn)
data_series = pd.read_sql_query('select * from Series',conn)
data_indicators = pd.read_sql_query('select * from Indicators',conn)

Histogram

Data Prep

selected_indicators = ['Life expectancy at birth, female (years)',
                       'Life expectancy at birth, male (years)',
                       'Life expectancy at birth, total (years)']
countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators)

data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)

data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot[['feature','type']] = data_plot['IndicatorName'].str.split(', ',expand=True)
data_plot.reset_index(inplace=True, drop=True)

Plot

nbins = 15
sns.set(style="white",
        palette="pastel",
        color_codes=True,
        rc={'figure.figsize':(12,8),
            'figure.dpi':500})

# sns.distplot is deprecated (since seaborn 0.11); use sns.histplot with kde=True
sns.histplot(data_plot.Value[data_plot.type=='female (years)'], bins=nbins, kde=True)
sns.histplot(data_plot.Value[data_plot.type=='male (years)'], bins=nbins, kde=True)
sns.histplot(data_plot.Value[data_plot.type=='total (years)'], bins=nbins, kde=True)
plt.legend(['Female', 'Male', 'Total'], bbox_to_anchor=(1.12,1.04))
plt.xlim((25,100))
plt.grid(color='black',linestyle='-.',linewidth=0.25)
plt.title('Life expectancy at birth ( In years )')

Scatter Plot

Data Prep

selected_indicators = ['Unemployment, female (% of female labor force)',
                       'Unemployment, male (% of male labor force)']

countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators)

data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)

data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot[['feature','type']] = data_plot['IndicatorName'].str.split(', ',expand=True)
data_plot.reset_index(inplace=True, drop=True)
# strip the unit suffix; regex=False makes the parentheses literal
data_plot['type'] = data_plot.type.str.replace(' (% of male labor force)', '', regex=False)
data_plot['type'] = data_plot.type.str.replace(' (% of female labor force)', '', regex=False)
data_plot = data_plot.pivot_table(values='Value',index='CountryName',columns='type')

Plot

sns.set(style="white",
        palette="pastel",
        rc={'figure.figsize':(7,5),
            'figure.dpi':500})

sns.lmplot(x = 'female', y = 'male', data = data_plot, fit_reg=False, x_jitter=1.5, y_jitter=1.5)
plt.xlim((0,40))
plt.ylim((0,40))
plt.grid(color='black', linestyle='-.', linewidth=0.25)
plt.title('Unemployment (% of total)',)
plt.savefig('./plots/02.scatter.png',orientation='landscape',dpi=500);

Density plot

Data Prep

selected_indicators = ['Mortality rate, adult, female (per 1,000 female adults)',
                       'Mortality rate, adult, male (per 1,000 male adults)']

countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators)

data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)

data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot[['feature','type']] = data_plot['IndicatorName'].str.split(', adult, ',expand=True)
data_plot.reset_index(inplace=True, drop=True)
# strip the unit suffix; regex=False makes the parentheses literal
data_plot['type'] = data_plot.type.str.replace(' (per 1,000 female adults)', '', regex=False)
data_plot['type'] = data_plot.type.str.replace(' (per 1,000 male adults)', '', regex=False)
data_plot = data_plot.pivot_table(values='Value',index='CountryName',columns='type')

Plot

sns.set(style="white",
        palette="pastel",
        color_codes=True,
        rc={
            'figure.figsize':(10,6),
            'figure.dpi':200
           })

# seaborn >= 0.12 deprecates positional Series args; use keyword form
sns.kdeplot(data=data_plot, x='male', color='red')
sns.kdeplot(data=data_plot, x='female', color='blue')
plt.grid(color='black',linestyle='-.', linewidth=0.25)
plt.title('Mortality rate')
plt.ylim((0,0.006))
plt.xlim((-100,700))
plt.savefig('./03.density.png');

Boxplot

Data prep

selected_indicators = ['Merchandise trade (% of GDP)']

countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators)

data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)

data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot.reset_index(inplace=True, drop=True)
data_plot['Region'] = data_plot.merge(right=data_countries,on='CountryCode',how='left')['Region']

Plot

columns_order = sorted(data_plot.Region.unique())

sns.set(style="white",
        palette="pastel",
        color_codes=True,
        rc={
            'figure.figsize':(10,6),'figure.dpi':200
           })

sns.boxplot(x='Region',
            y='Value',
            palette='autumn',
            order=columns_order,
            width=0.4,
            fliersize=3,
            data=data_plot);
plt.grid(color='black',linestyle='-.', linewidth=0.25)
plt.xticks(rotation=30)
plt.title('Merchandise trade')
plt.ylabel('% of GDP');
plt.savefig('./04.boxplot.png');

Violin plot

Data prep

selected_indicators = [ 'CO2 emissions from gaseous fuel consumption (% of total)',
                        'CO2 emissions from liquid fuel consumption (% of total)']

countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators)

data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)

data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot.reset_index(inplace=True, drop=True)
data_plot['Region'] = data_plot.merge(right=data_countries,on='CountryCode',how='left')['Region']

Plot

import matplotlib.patches as mpatches

columns_order = sorted(data_plot.Region.unique())

sns.set(style="white",
        palette="pastel",
        color_codes=True,
        rc={
            'figure.figsize':(16,10),'figure.dpi':250
           })
sns.violinplot(x ='Region',
               y='Value',
               hue='IndicatorName',
               linewidth=0.25,
               inner="quart",
               palette={"CO2 emissions from gaseous fuel consumption (% of total)": "y",
                        "CO2 emissions from liquid fuel consumption (% of total)": "b"},
               data=data_plot,
               split=True)
plt.grid(color='black',linestyle='-.', linewidth=0.25)
plt.title('CO$_2$ emission')
plt.ylabel('% of total')

gas_patch = mpatches.Patch(color='yellow', label='Gaseous',alpha=0.5)
liquid_patch = mpatches.Patch(color='skyblue', label='Liquid')
plt.legend(handles=[gas_patch, liquid_patch], bbox_to_anchor=(0.2, 0.99), fontsize='x-large')
plt.savefig('./plots/05.violinplot.png', dpi=250, bbox_inches='tight');

Heatmap

Data prep

selected_indicators_export = [
    'Merchandise exports to developing economies in East Asia & Pacific (% of total merchandise exports)',
    'Merchandise exports to developing economies in Latin America & the Caribbean (% of total merchandise exports)',
    'Merchandise exports to developing economies in Middle East & North Africa (% of total merchandise exports)',
    'Merchandise exports to developing economies in South Asia (% of total merchandise exports)',
    'Merchandise exports to developing economies in Sub-Saharan Africa (% of total merchandise exports)',
    'Merchandise exports to developing economies outside region (% of total merchandise exports)',
    'Merchandise exports to developing economies within region (% of total merchandise exports)',
    'Merchandise exports to economies in the Arab World (% of total merchandise exports)',
    'Merchandise exports to high-income economies (% of total merchandise exports)'
]

selected_indicators_imports = [
    'Merchandise imports from developing economies in East Asia & Pacific (% of total merchandise imports)',
    'Merchandise imports from developing economies in Latin America & the Caribbean (% of total merchandise imports)',
    'Merchandise imports from developing economies in Middle East & North Africa (% of total merchandise imports)',
    'Merchandise imports from developing economies in South Asia (% of total merchandise imports)',
    'Merchandise imports from developing economies in Sub-Saharan Africa (% of total merchandise imports)',
    'Merchandise imports from developing economies outside region (% of total merchandise imports)',
    'Merchandise imports from developing economies within region (% of total merchandise imports)',
    'Merchandise imports from economies in the Arab World (% of total merchandise imports)',
    'Merchandise imports from high-income economies (% of total merchandise imports)'
]

countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators_export)
data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)
data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot.reset_index(inplace=True, drop=True)
data_plot['Region'] = data_plot.merge(right=data_countries,on='CountryCode',how='left')['Region']
data_export = data_plot.pivot_table(values='Value',columns='Region',index='IndicatorName')


countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators_imports)
data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)
data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot.reset_index(inplace=True, drop=True)
data_plot['Region'] = data_plot.merge(right=data_countries,on='CountryCode',how='left')['Region']
data_import = data_plot.pivot_table(values='Value',columns='Region',index='IndicatorName')

Plot

sns.set(style="white",
        color_codes=True,
        rc={
            'figure.figsize':(20,8),
            'figure.dpi':250
           })
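fig, (imports, exports) = plt.subplots(1, 2, sharex=True)

# xlabels / ylabels are not defined anywhere in this snippet; the definitions
# below are an assumption: regions as columns, shortened indicator names as rows
xlabels = sorted(data_import.columns)
ylabels = [name.replace('Merchandise imports from ', '')
               .replace(' (% of total merchandise imports)', '')
           for name in data_import.index]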

im1 = sns.heatmap(data_import.loc[:,xlabels],
                  ax=imports,
                  center=50,
                  cbar=False,
                  cmap="YlGnBu")
imports.set_yticklabels(ylabels)
imports.set_ylabel('')
imports.set_xlabel('')
imports.set_title('Imports :');

im2 = sns.heatmap(data_export.loc[:,xlabels],
                  ax=exports,
                  center=50,
                  yticklabels=False,
                  cbar=False,
                  cmap="YlGnBu")
exports.set_ylabel('')
exports.set_xlabel('')
exports.set_title('Exports :');
fig.subplots_adjust(wspace=0.05, hspace=0)

mappable = im1.get_children()[0]
fig.colorbar(mappable, ax = [imports,exports],orientation = 'vertical')
plt.savefig('./plots/06.heatmap.png', dpi=250, bbox_inches='tight');

Rugs

Data prep

selected_indicators = ['Merchandise trade (% of GDP)']

countries = data_countries.CountryCode[data_countries.Region!=''].unique()
condition = data_indicators.IndicatorName.isin(selected_indicators)

data_plot = data_indicators.loc[condition,:]
condition = data_plot.CountryCode.isin(countries)
data_plot = data_plot.loc[condition,:]
data_plot.sort_values(['CountryName','IndicatorName','Year'], inplace=True)

data_plot = data_plot.groupby(['CountryName','IndicatorName'], as_index=False).last()
data_plot.reset_index(inplace=True, drop=True)
data_plot['Region'] = data_plot.merge(right=data_countries,on='CountryCode',how='left')['Region']

Plot

columns_order = sorted(data_plot.Region.unique())

sns.set(style="white",
        palette="pastel",
        color_codes=True,
        rc={
            'figure.figsize':(12,8),'figure.dpi':500
           })

g = sns.FacetGrid(data_plot,
                  col="Region",
                  col_wrap=4,
                  col_order=columns_order,subplot_kws={'ylim':(0,0.02)})
# sns.distplot is deprecated; draw the density and the rug separately with sns.kdeplot and sns.rugplot
g.map(sns.kdeplot, "Value")
g.map(sns.rugplot, "Value");
plt.savefig('./plots/07.rugplot.png', dpi=500, bbox_inches='tight');