<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.nilesh42.science/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.nilesh42.science/" rel="alternate" type="text/html" /><updated>2026-06-06T15:57:37+05:30</updated><id>https://www.nilesh42.science/feed.xml</id><title type="html">Nilesh Patil</title><subtitle>AI leader building deployable AI systems, agentic workflows, and organizational adoption in regulated and large-scale environments.</subtitle><author><name>Nilesh Patil</name></author><entry><title type="html">Adaptive OpenMP scheduling in SimuCell3D for tissue mechanics simulations</title><link href="https://www.nilesh42.science/posts/simucell3d-adaptive-openmp-scheduling/" rel="alternate" type="text/html" title="Adaptive OpenMP scheduling in SimuCell3D for tissue mechanics simulations" /><published>2026-04-16T19:30:00+05:30</published><updated>2026-04-16T19:30:00+05:30</updated><id>https://www.nilesh42.science/posts/simucell3d-adaptive-openmp-scheduling</id><content type="html" xml:base="https://www.nilesh42.science/posts/simucell3d-adaptive-openmp-scheduling/"><![CDATA[<style>
/* Scoped to this post: light hero needs a dark title instead of the theme's white */
.page__hero--overlay .page__title,
.page__hero--overlay .page__lead,
.page__hero--overlay .page__meta,
.page__hero--overlay .page__meta a { color: #13233a !important; text-shadow: none !important; }
</style>

<p>Biology often looks like chemistry, but many of its hardest questions are mechanical: how a sheet of cells bends into a tube, how a spheroid opens into a fluid-filled vesicle, or how a growing tumor pushes against the tissue around it. These processes depend not just on genes and signaling, but on force, geometry, material properties, and the way cells physically interact. Cell-based simulations give us a way to study those rules directly, but detailed 3D models have usually forced a tradeoff: simplify the shape, or simulate only a small number of cells.</p>

<p><a href="https://git.bsse.ethz.ch/iber/Publications/2024_runser_simucell3d">SimuCell3D</a> is an open-source C++ engine built to push past that limitation. It models each cell as a triangulated mesh and simulates tissue growth and deformation at subcellular resolution, including proliferation, extracellular matrix, fluid cavities, nuclei, and the uneven mechanical properties of polarized epithelia. It can work with spheroids, vesicles, sheets, tubes, and more irregular geometries imported from microscopy images, making it useful for inferring biomechanical parameters from realistic tissue structures.</p>

<p>The ETH Iber lab released v1.0 in 2024. This post is a field report from the <a href="https://github.com/nilesh-patil/simucell3d">fork</a> I’ve been building on top of it. Apart from a few bug fixes in the original code, I focus on a practical performance bottleneck: load imbalance in the simulator’s OpenMP-parallel force and contact loops. I’ll walk through where the imbalance came from, how I measured it, and what it took to make the scheduler adapt so large, high-detail 3D tissue simulations become more tractable.</p>

<p>On the surface, the <code class="language-plaintext highlighter-rouge">Static</code> (= v1.0) binary looked like it was using all of the CPU.
<code class="language-plaintext highlighter-rouge">top</code> showed all eight cores pegged - utilization near 800%, the picture of a machine working hard. A tracing profiler told a different story: on the worst steps, close to a third of that “busy” time was threads idling at a barrier while one of them finished an oversized slice.</p>

<p>That gap is invisible to <code class="language-plaintext highlighter-rouge">top</code> because it reports <em>aggregate</em> CPU time: 800% across eight cores looks identical whether one thread sprints while seven idle or all eight share the load evenly. It simply can’t see the imbalance. A per-thread view can. If you train models, you already know this failure by a different name: a data-parallel step where one worker drew all the long sequences and the rest sit idle at the all-reduce barrier. Same physics, different address space.</p>

<blockquote>
  <p><strong>TL;DR.</strong> Static OpenMP scheduling handed equal-sized index ranges to every thread regardless of per-cell cost - so whichever thread drew the heavy cells kept them. I built a workload estimator, wired it to per-loop schedule selection, and let it adapt as the tissue grows. Result: <strong>~2.0–2.5× speedups at matched cell counts across the best-supported range (50–250 cells)</strong>, with consistent wins of ~2.3–2.4× through ~1,000 cells on thinner data, and a suggestive ~3× at the largest matched sizes. SimuCell3D is a C++ 3D tissue-mechanics simulator developed by the <a href="https://www.nature.com/articles/s43588-024-00620-9">Iber lab at ETH Zürich</a>.</p>
</blockquote>

<hr />

<h2 id="the-threads-were-mostly-standing-around---waiting-to-pickup-computational-work">The threads were mostly standing around - waiting to pick up computational work</h2>

<p>In SimuCell3D the bulk of the work is computing forces, contacts, and time steps for cells in a 3D tissue. I ran the <code class="language-plaintext highlighter-rouge">Static</code> binary under a tracing profiler and measured where the time went. During contact detection - the phase that dominates the runtime - a measurable fraction of each parallel region was lost to threads waiting at the barrier while a straggler finished an oversized share. The simulation always completed; the imbalance just silently ate throughput as the tissue grew more complex.</p>

<p><img src="/images/blog/simucell3d/figure-1.png" alt="Measured thread imbalance vs tissue size: Static vs Adaptive" />
<em>Measured <code class="language-plaintext highlighter-rouge">thread_imbalance_pct</code> across the benchmark run, <strong>Static</strong> vs <strong>Adaptive</strong>, plotted against tissue size (log scale). Adaptive sits consistently below Static across the whole range - <strong>Static</strong>: mean 14.6%, max 31.1%, median 16%; <strong>Adaptive</strong>: mean 10.7%, max 16%, median 13%. The Static (v1) outliers reaching 31% are the straggler events that static scheduling manufactures by handing all the heavy cells to one thread.</em></p>

<p>The bottleneck was in the OpenMP directives: <code class="language-plaintext highlighter-rouge">Static</code> mode leans entirely on fixed scheduling. It has three bare <code class="language-plaintext highlighter-rouge">#pragma omp parallel for</code> loops in <code class="language-plaintext highlighter-rouge">solver.cpp</code> with no <code class="language-plaintext highlighter-rouge">schedule()</code> clause (on GCC/Clang, the default is static with equal-sized contiguous chunks), an explicit <code class="language-plaintext highlighter-rouge">schedule(static)</code> in <code class="language-plaintext highlighter-rouge">time_integration.cpp</code>, and another bare parallel loop in the contact model. Static scheduling slices the cells into equal index ranges up front - thread 0 takes cells 0–15, thread 1 takes 16–31, and so on - and never rebalances, which implicitly assumes every cell costs the same. For a tissue simulation that assumption fails on the second cell.</p>

<p>The reason is geometric: each cell is a triangulated mesh, and meshes drift apart in cost. A freshly divided child has fewer faces than its parent; a growing cell has more; a cell wedged in heavy contact spends far longer in its per-face loops than an isolated one. Static scheduling hands out cells by index, not by cost - so whichever thread happened to draw the heavy cells on step one keeps drawing them for the rest of the run.</p>

<hr />

<h2 id="a-single-number-for-how-uneven-is-it">A single number for “how uneven is it?”</h2>

<p>Before changing anything, we need a number to optimize - a fast signal that tells the scheduler “right now the work is lumpy, switch to a finer-grained strategy.”</p>

<p>On this experimental workload, adaptive-mode CoV peaked at 0.16 and never reached the 0.4 or 0.6 chunk-band thresholds built into the scheduler. The chunk divisor stayed at 4 - the coarsest setting - for the entire experiment. The high-CoV dynamic-chunking regime the system is designed for was never exercised on this run. Read what follows as an honest baseline for understanding what the CoV machinery does and why.</p>

<p><strong>Coefficient of variation</strong> <strong>(CoV = σ/μ)</strong> measures how spread out a distribution is relative to its mean. When every cell costs the same, CoV is zero and static scheduling is optimal. As some cells grow much heavier than others, CoV climbs and dynamic scheduling starts to win. The catch: we can’t measure CoV by <em>running</em> the loop - that’s the work we’re trying to schedule. We need to estimate each cell’s cost <em>before</em> the loop starts. The estimator in <code class="language-plaintext highlighter-rouge">src/solver.cpp</code> is a weighted sum of structural features per cell: a <code class="language-plaintext highlighter-rouge">base_cost</code> proportional to face count; a <code class="language-plaintext highlighter-rouge">contact_cost</code> scaled by a compile-time constant for the active contact model (<code class="language-plaintext highlighter-rouge">0.25</code> for node–face springs, <code class="language-plaintext highlighter-rouge">0.28</code> for node–node coupling, <code class="language-plaintext highlighter-rouge">0.32</code> for face–face coupling, selected by a <code class="language-plaintext highlighter-rouge">#if CONTACT_MODEL_INDEX</code> switch so exactly one is live in a given binary); an <code class="language-plaintext highlighter-rouge">integration_cost</code> of <code class="language-plaintext highlighter-rouge">0.65 × base_cost</code> for dynamic cells; plus smaller terms for polarization, growth, and mesh quality.</p>

<p>The coefficients were fit by hand against measured per-cell timings and rounded to two decimals. None of this is precise, and it doesn’t need to be at this point. The accurate alternative - profiling every cell’s real cost each step - would cost more than the imbalance it’s trying to remove. A cheap structural proxy wins as long as it stays <em>roughly monotonic</em> with real cost and runs fast. Both hold: the estimator is <strong>O(N_cells)</strong> and returns a single float.</p>
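<p>A stripped-down sketch of that idea, under stated assumptions: the <code class="language-plaintext highlighter-rouge">CellFeatures</code> struct and both function names are hypothetical, the contact term is assumed to scale with a count of contacting faces (the post only says it is scaled by the contact-model constant), and the smaller polarization, growth, and mesh-quality terms are left out.</p>

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical per-cell summary; the fork's real data structures differ.
struct CellFeatures {
    int num_faces;          // triangles in the cell mesh
    int num_contact_faces;  // faces currently in contact (assumed feature)
    bool is_dynamic;        // whether the cell is time-integrated
};

constexpr double kContactWeight = 0.25;      // node-face springs constant from the post
constexpr double kIntegrationWeight = 0.65;  // integration_cost = 0.65 * base_cost

// Cheap structural proxy for per-cell cost: one pass, no timing involved.
double estimate_cell_cost(const CellFeatures& c) {
    const double base_cost = static_cast<double>(c.num_faces);
    const double contact_cost = kContactWeight * c.num_contact_faces;
    const double integration_cost = c.is_dynamic ? kIntegrationWeight * base_cost : 0.0;
    return base_cost + contact_cost + integration_cost;
}

// CoV = sigma / mu over the estimated costs (population standard deviation).
double workload_cov(const std::vector<CellFeatures>& cells) {
    if (cells.empty()) return 0.0;
    double sum = 0.0;
    double sum_sq = 0.0;
    for (const CellFeatures& c : cells) {
        const double cost = estimate_cell_cost(c);
        sum += cost;
        sum_sq += cost * cost;
    }
    const double n = static_cast<double>(cells.size());
    const double mean = sum / n;
    if (mean <= 0.0) return 0.0;
    const double variance = std::max(0.0, sum_sq / n - mean * mean);
    return std::sqrt(variance) / mean;
}
```

<p>Two static, contact-free cells with 100 and 300 faces give estimated costs of 100 and 300, so μ = 200, σ = 100, and CoV = 0.5. A tissue of identical cells gives exactly zero.</p>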

<p><img src="/images/blog/simucell3d/figure-2.png" alt="Workload CoV vs tissue size: v1 vs adaptive with 0.4 and 0.6 reference lines" />
<em>Measured workload CoV vs tissue size across the experimental run. Dashed lines mark the 0.4 and 0.6 chunk-size-band thresholds built into <code class="language-plaintext highlighter-rouge">calculate_optimal_chunk_size()</code>. Adaptive-mode CoV: mean 0.11, median 0.13, max 0.16. v1 CoV: mean 0.15, median 0.16, max 0.31. Neither mode reached either band. On this workload, the adaptive chunk divisor stayed at 4 (coarsest setting) throughout the entire run.</em></p>

<hr />

<h2 id="three-scheduling-modes-chosen-by-how-lumpy-the-work-is">Three scheduling modes, chosen by how lumpy the work is</h2>

<p><code class="language-plaintext highlighter-rouge">static</code> hands each thread a fixed pile up front (fast, but one thread can get stuck with all the slow cells); <code class="language-plaintext highlighter-rouge">dynamic</code> gives everyone a shared queue they pull from as they finish (no idle threads, but a small per-grab cost); <code class="language-plaintext highlighter-rouge">guided</code> is the middle ground - big grabs first, shrinking toward the tail. If you’ve ever tuned dynamic batching or a work-stealing pool, this is the same trade-off: granularity versus coordination overhead.</p>

<p>The function <code class="language-plaintext highlighter-rouge">calculate_optimal_chunk_size()</code> turns the CoV directly into a choice of granularity via a divisor: <code class="language-plaintext highlighter-rouge">4</code> for CoV ≤ 0.4 (mild imbalance, coarse chunks), <code class="language-plaintext highlighter-rouge">10</code> for 0.4 &lt; CoV ≤ 0.6 (moderate), <code class="language-plaintext highlighter-rouge">20</code> for CoV &gt; 0.6 (high imbalance, fine chunks). Then <code class="language-plaintext highlighter-rouge">chunk = max(1, min(num_cells / (num_threads × divisor), 100))</code>:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// CoV → chunk granularity (src/solver.cpp, paraphrased)</span>
<span class="kt">int</span> <span class="n">divisor</span> <span class="o">=</span> <span class="n">cov</span> <span class="o">&lt;=</span> <span class="mf">0.4</span> <span class="o">?</span> <span class="mi">4</span>      <span class="c1">// mild imbalance → coarse chunks</span>
            <span class="o">:</span> <span class="n">cov</span> <span class="o">&lt;=</span> <span class="mf">0.6</span> <span class="o">?</span> <span class="mi">10</span>     <span class="c1">// moderate</span>
            <span class="o">:</span>              <span class="mi">20</span><span class="p">;</span>    <span class="c1">// high imbalance → fine chunks</span>
<span class="c1">// below 100 cells, always uses divisor=4 (hardcoded fast path)</span>
<span class="kt">int</span> <span class="n">chunk</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">max</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">num_cells</span> <span class="o">/</span> <span class="p">(</span><span class="n">num_threads</span> <span class="o">*</span> <span class="n">divisor</span><span class="p">),</span> <span class="mi">100</span><span class="p">));</span>
</code></pre></div></div>
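<p>For readers who want to plug numbers in, the same rule as a self-contained sketch. The function names are mine, and the way the sub-100-cell fast path is folded into the divisor is an assumption based on the comment in the snippet above, not a copy of the fork’s code.</p>

```cpp
#include <algorithm>

// Divisor bands from the post: coarser chunks at low CoV, finer at high CoV.
int chunk_divisor(double cov, int num_cells) {
    if (num_cells < 100) return 4;  // assumed shape of the small-tissue fast path
    if (cov <= 0.4) return 4;       // mild imbalance
    if (cov <= 0.6) return 10;      // moderate imbalance
    return 20;                      // high imbalance
}

// chunk = max(1, min(num_cells / (num_threads * divisor), 100))
int optimal_chunk_size(double cov, int num_cells, int num_threads) {
    const int divisor = chunk_divisor(cov, num_cells);
    const int raw = num_cells / (std::max(1, num_threads) * divisor);
    return std::max(1, std::min(raw, 100));
}
```

<p>On eight threads, a 6,091-cell tissue at CoV 0.16 gives 6091 / 32 = 190, which the cap clamps to 100. The same tissue at CoV 0.7 would get 6091 / 160 = 38, and a 10-cell tissue bottoms out at the floor of 1.</p>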

<p>The chunk-size formula is only half of mode selection. The actual path the code takes in adaptive mode has two parts:</p>
<ul>
  <li>First, <code class="language-plaintext highlighter-rouge">lookup_benchmark_mode</code> (src/solver.cpp:218–245) consults a 13-entry static table keyed on cell-count range alone - not on thread count or CoV - mapping each range to a fixed schedule mode. The lookup table entries carry hardcoded rationale strings from earlier profiling.</li>
  <li>Second, <code class="language-plaintext highlighter-rouge">multi_factor_heuristic</code> (src/solver.cpp:250–314) runs independently, using CoV as its primary input: CoV &gt; 0.6 → dynamic; CoV &lt; 0.15 and tasks_per_thread ≥ 4 → static; 512 ≤ cells ≤ 4096 at moderate CoV → guided.</li>
</ul>

<p><strong>When the two paths disagree, the benchmark table always wins.</strong> The heuristic’s suggestion is logged in verbose output but never applied (src/solver.cpp:725–740). You build a multi-factor heuristic and then hardcode it to lose to a lookup table - the reasoning is that observed speedup data is more trustworthy than a computed estimate. It’s an unusual design choice, and it’s worth knowing.</p>
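<p>The precedence rule can be sketched in a few lines (C++17). Everything here is illustrative: the names are hypothetical, the table lookup is reduced to a <code class="language-plaintext highlighter-rouge">table_mode</code> argument because the 13 entries are not listed in this post, <code class="language-plaintext highlighter-rouge">tasks_per_thread</code> is assumed to mean cells divided by threads, “moderate CoV” is assumed to mean 0.15 ≤ CoV ≤ 0.6, and returning no suggestion for uncovered cases is my own choice.</p>

```cpp
#include <optional>

enum class Mode { Static, Dynamic, Guided };

// CoV-driven suggestion following the three rules quoted above.
std::optional<Mode> heuristic_suggestion(double cov, int num_cells, int num_threads) {
    const int tasks_per_thread = num_threads > 0 ? num_cells / num_threads : num_cells;
    if (cov > 0.6) return Mode::Dynamic;
    if (cov < 0.15 && tasks_per_thread >= 4) return Mode::Static;
    if (cov >= 0.15 && num_cells >= 512 && num_cells <= 4096) return Mode::Guided;
    return std::nullopt;  // cases the post does not describe
}

struct Decision {
    Mode applied;                   // what actually gets scheduled
    std::optional<Mode> suggested;  // kept for logging only
    bool disagreement;              // true when the heuristic was overruled
};

// The benchmark table always wins; the heuristic is recorded, never applied.
Decision select_mode(Mode table_mode, std::optional<Mode> suggested) {
    const bool disagreement = suggested.has_value() && *suggested != table_mode;
    return Decision{table_mode, suggested, disagreement};
}
```

<p>So a call like <code class="language-plaintext highlighter-rouge">select_mode(Mode::Static, heuristic_suggestion(0.7, 1000, 8))</code> applies <code class="language-plaintext highlighter-rouge">Static</code> and flags a disagreement, which is the situation the verbose log is there to surface.</p>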

<p>The less obvious insight is that <em>different loop categories have different workload shapes</em>, so they should not share one schedule. <code class="language-plaintext highlighter-rouge">initialize_per_loop_schedules()</code> (src/solver.cpp:884–909) sets four fixed structural assignments:</p>

<ul>
  <li><strong>Contact detection</strong> → <code class="language-plaintext highlighter-rouge">omp_sched_dynamic</code>, chunk = max(1, base_chunk). Most irregular; costs vary with mesh density and contact geometry.</li>
  <li><strong>Time integration</strong> → <code class="language-plaintext highlighter-rouge">omp_sched_guided</code>, chunk = max(1, base_chunk × 2). More uniform per-cell cost; guided’s shrinking-grab schedule captures most benefit without dynamic’s per-grab overhead.</li>
  <li><strong>Mesh updates (face-type classification)</strong> → <code class="language-plaintext highlighter-rouge">omp_sched_static</code>, chunk = 0 (equal distribution). The one loop category where cost per cell genuinely is uniform - static maximises cache locality here.</li>
  <li><strong>Cell division</strong> → <code class="language-plaintext highlighter-rouge">omp_sched_dynamic</code>, chunk = max(1, base_chunk / 2). Rarest and most variable phase; finer-grained dynamic avoids stragglers on the few iterations it fires.</li>
</ul>
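<p>The four assignments above fit in a small table. This sketch uses its own enum so it compiles without OpenMP; the struct and function names are hypothetical. In an OpenMP build the natural way to apply an entry would be <code class="language-plaintext highlighter-rouge">omp_set_schedule(kind, chunk)</code> just before a loop marked <code class="language-plaintext highlighter-rouge">schedule(runtime)</code>, though how the fork sequences those calls is not shown here.</p>

```cpp
#include <algorithm>

enum class SchedKind { Static, Dynamic, Guided };

struct LoopSchedule {
    SchedKind kind;
    int chunk;  // 0 is used here to mean "let the runtime pick the default split"
};

struct PerLoopSchedules {
    LoopSchedule contact_detection;
    LoopSchedule time_integration;
    LoopSchedule mesh_update;
    LoopSchedule cell_division;
};

// One schedule per loop category, derived from a shared base chunk size.
PerLoopSchedules make_per_loop_schedules(int base_chunk) {
    return PerLoopSchedules{
        {SchedKind::Dynamic, std::max(1, base_chunk)},      // most irregular
        {SchedKind::Guided, std::max(1, base_chunk * 2)},   // more uniform
        {SchedKind::Static, 0},                             // uniform cost
        {SchedKind::Dynamic, std::max(1, base_chunk / 2)},  // rare, variable
    };
}
```

<p>With <code class="language-plaintext highlighter-rouge">base_chunk = 10</code> this yields chunks of 10, 20, 0, and 5. The <code class="language-plaintext highlighter-rouge">max(1, …)</code> guards matter at small tissue sizes, where <code class="language-plaintext highlighter-rouge">base_chunk / 2</code> would otherwise round down to zero.</p>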

<p>Note: 8 of 10 <code class="language-plaintext highlighter-rouge">#pragma omp parallel for</code> loops in <code class="language-plaintext highlighter-rouge">src/</code> carry <code class="language-plaintext highlighter-rouge">schedule(runtime)</code>, enabling this late-binding approach. Two bare loops without a schedule clause remain in <code class="language-plaintext highlighter-rouge">cell_divider.cpp:27</code> and <code class="language-plaintext highlighter-rouge">poisson_sampling.cpp:142</code>; these are not yet wired into the adaptive machinery.</p>

<p>The last piece is <em>adaptation over time</em>. Every 50 iterations (<code class="language-plaintext highlighter-rouge">COV_UPDATE_INTERVAL = 50</code>), two functions run back-to-back (src/solver.cpp:1083 and 1086). <code class="language-plaintext highlighter-rouge">update_workload_heterogeneity()</code> handles CoV recalculation for non-adaptive modes, gated by a combined condition at line 1153: if CoV has changed by more than 20% <em>and</em> the mode is not adaptive, it then checks whether the implied chunk size would also shift by more than 20% (line 1161) before calling <code class="language-plaintext highlighter-rouge">omp_set_schedule()</code>. In adaptive mode that outer gate skips both operations entirely. <code class="language-plaintext highlighter-rouge">adaptive_schedule_update()</code> (src/solver.cpp:1211–1296) handles all schedule changes in adaptive mode: it is called every 50 iterations, re-evaluates the simulation phase, and only if the phase has changed does it call <code class="language-plaintext highlighter-rouge">omp_set_schedule()</code> and reset <code class="language-plaintext highlighter-rouge">recent_division_count_</code>.</p>

<hr />

<h2 id="a-three-phase-design-two-phases-exercised">A three-phase design, two phases exercised</h2>

<p>In 41.3 hours and across all three committed benchmark runs, <code class="language-plaintext highlighter-rouge">grep GROWTH</code> across all five performance-diagnostics logs returns zero matches. GROWTH never fired.</p>

<p>The scheduler tracks three simulation phases:</p>

<ul>
  <li><strong>INITIALIZATION</strong> (fewer than 10 cells): <code class="language-plaintext highlighter-rouge">dynamic</code>. Thread count exceeds task count; any fixed assignment starves threads.</li>
  <li><strong>GROWTH</strong> (division_rate &gt; 0.01, where division_rate = recent_division_count_ / (num_cells × 50)): <code class="language-plaintext highlighter-rouge">dynamic</code> if CoV &gt; 0.4, else <code class="language-plaintext highlighter-rouge">guided</code>. Cell counts and costs are changing fast.</li>
  <li><strong>HOMEOSTASIS</strong> (all other cases): <code class="language-plaintext highlighter-rouge">static</code> if cell count &gt; 1,000, else <code class="language-plaintext highlighter-rouge">guided</code>. The tissue has settled; costs are stable enough that static’s cache locality pays off at scale.</li>
</ul>
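<p>The three rules translate almost directly into code. The sketch below follows the thresholds listed above; the function names and the free-standing layout are hypothetical, and the 50 in the denominator is assumed to be the same 50-iteration update window.</p>

```cpp
enum class Phase { Initialization, Growth, Homeostasis };
enum class SchedKind { Static, Dynamic, Guided };

constexpr int kCovUpdateInterval = 50;  // iterations per evaluation window

// Classify the simulation phase from cell count and recent division activity.
Phase detect_phase(int num_cells, int recent_division_count) {
    if (num_cells < 10) return Phase::Initialization;
    const double division_rate =
        static_cast<double>(recent_division_count) /
        (static_cast<double>(num_cells) * kCovUpdateInterval);
    return division_rate > 0.01 ? Phase::Growth : Phase::Homeostasis;
}

// Schedule choice per phase, as described in the list above.
SchedKind schedule_for_phase(Phase phase, double cov, int num_cells) {
    switch (phase) {
        case Phase::Initialization:
            return SchedKind::Dynamic;
        case Phase::Growth:
            return cov > 0.4 ? SchedKind::Dynamic : SchedKind::Guided;
        case Phase::Homeostasis:
            return num_cells > 1000 ? SchedKind::Static : SchedKind::Guided;
    }
    return SchedKind::Guided;  // unreachable; keeps compilers quiet
}
```

<p>Rearranging the trigger is instructive: <code class="language-plaintext highlighter-rouge">division_rate &gt; 0.01</code> means <code class="language-plaintext highlighter-rouge">recent_division_count_ &gt; 0.5 × num_cells</code>. A 100-cell tissue needs more than 50 divisions inside a single window before GROWTH is detected.</p>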

<p><img src="/images/blog/simucell3d/figure-3.png" alt="Detected scheduler phase vs iteration, colored by phase: INITIALIZATION then HOMEOSTASIS; GROWTH never appears" />
<em>Detected simulation phase across the full 41.3h run (adaptive mode). INITIALIZATION fires for the first ~10,600 iterations while the tissue has fewer than 10 cells. Then the tissue transitions directly to HOMEOSTASIS and stays there for the remainder - growing from 10 to 6,091 cells. GROWTH never fired. Across all three committed benchmark runs, <code class="language-plaintext highlighter-rouge">grep GROWTH</code> across all five performance-diagnostics logs returns zero matches.</em></p>

<p>This is an honest observation about the workload, not a design flaw. The growth-from-1-cell scenario using <code class="language-plaintext highlighter-rouge">parameters_paper_exact.xml</code> divides slowly and steadily enough that <code class="language-plaintext highlighter-rouge">recent_division_count_</code> never exceeded 0.01 × num_cells × 50 - the GROWTH trigger threshold - so the phase stayed in HOMEOSTASIS throughout. A faster-dividing parameter set - or a scenario that starts from a small fixed tissue and forces rapid expansion - would exercise GROWTH. On <em>this</em> workload, the adaptive scheduler spent 105 samples in INITIALIZATION and 348 samples in HOMEOSTASIS, never touching the phase the GROWTH branch was written for.</p>

<hr />

<h2 id="the-numbers">The numbers</h2>

<p>At run’s end adaptive shows <strong>0.030 IPS versus v1’s 0.044 IPS</strong> - adaptive looks slower. It’s not: it’s managing 2.66× as many cells (6,091 vs 2,288). The right comparison is at matched cell counts.</p>

<p><img src="/images/blog/simucell3d/figure-4.png" alt="Throughput vs tissue size (log-log) for v1 and adaptive across the full run" />
<em>Iterations per second vs cell count (log-log), from the 41.3h benchmark run. Adaptive sits above v1 at every matched cell count. v1 data terminates at ~2,288 cells; adaptive continues to 6,091. The apparent “slower” IPS at run’s end for adaptive is because it’s managing 2.66× as many cells - at matched cell counts, adaptive is consistently faster.</em></p>

<p>At matched cell counts the picture is clear: adaptive is faster across the board, and the gap widens as the tissue grows. Speedups:</p>
<ul>
  <li>1.43× at 10 cells;</li>
  <li>2.07× at 50 cells;</li>
  <li>~2.3–2.5× through the 100–500 cell range. The 50–250 cell band is best supported statistically (30, 35, and 22 v1 samples respectively).</li>
  <li>Beyond that the data thins out: there are only 6 v1 samples at 500 cells and 5 at 1,000 cells, and the 3.05× at 2,000 cells rests on 2 v1 samples, so treat that as suggestive rather than firm.</li>
</ul>

<p>Adaptive is solidly ~2.0–2.5× faster across the range where the data is dense, with a plausible upward trend at the largest sizes that needs more data to pin down.</p>

<p>The scaling exponent (time per iteration ~ N^α) tells a related story. Adaptive: α = 1.136 (R² = 0.999). v1: α = 1.213 (R² = 0.998). Lower is closer to linear; adaptive’s ~6% better exponent means the gap widens gradually as the tissue grows, which matches what the throughput curve shows.</p>
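<p>For reference, an exponent like this is typically obtained as the least-squares slope of log(time) against log(N). The sketch below shows that fit plus a helper for what two exponents imply about the gap; both function names are hypothetical, and this is not the analysis script behind the figures above.</p>

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Least-squares slope of log(t) against log(n): the alpha in t ~ n^alpha.
double fit_scaling_exponent(const std::vector<double>& n, const std::vector<double>& t) {
    const std::size_t m = std::min(n.size(), t.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        const double x = std::log(n[i]);
        const double y = std::log(t[i]);
        sx += x;
        sy += y;
        sxx += x * x;
        sxy += x * y;
    }
    const double count = static_cast<double>(m);
    const double denom = count * sxx - sx * sx;
    return denom != 0.0 ? (count * sxy - sx * sy) / denom : 0.0;
}

// If both modes follow power laws, their time ratio scales as
// n^(alpha_slow - alpha_fast); this returns its growth from n0 to n1.
double ratio_growth(double alpha_slow, double alpha_fast, double n0, double n1) {
    return std::pow(n1 / n0, alpha_slow - alpha_fast);
}
```

<p>With α = 1.213 and 1.136 the ratio scales as N^0.077, so the exponents alone predict the gap growing by roughly 1.3× between 50 and 2,000 cells: a gradual widening rather than a dramatic one.</p>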

<p>The biological signal is essentially a non-event, which is what you want: median pressure deviation between the two modes is 1.71% across 823 matched iterations. The adaptive scheduler changes how threads pick up work - not what the physics computes.</p>

<hr />

<h2 id="where-the-runtime-actually-goes">Where the runtime actually goes</h2>

<p>To understand <em>why</em> adaptive is faster, it helps to look at where the time goes.</p>

<p><img src="/images/blog/simucell3d/figure-5.png" alt="Stacked horizontal bar: fraction of iteration time by phase, v1 vs adaptive, cells &gt;100" />
<em>Phase-time fractions (cells &gt; 100) from the 41.3h run. v1: contact detection 81.6%, polarization + internal forces 14.1%, time integration 2.6%, mesh refinement 1.8%. Adaptive: contact detection 52.5%, polarization + internal forces 35.7%, time integration 7.4%, mesh refinement 4.4%. Contact detection drops from 82% to 52% of mean iteration time - by far the biggest shift. The second-largest phase is polarization and internal forces, not time integration.</em></p>

<p>Contact detection dominates v1 at 81.6% of iteration time. In adaptive mode it drops to 52.5% - not just because of scheduling, but because the contact detection improvements (USPG rewrite, Morton sorting, SAP switching) run alongside the scheduler. The second-most-expensive phase is polarization and internal forces (14.1% → 35.7%), which becomes more visible in adaptive mode precisely because contact detection has gotten faster. Time integration accounts for only 2.6–7.4% of iteration time - a minor contributor, not a dominant phase.</p>

<p>This is worth stating plainly for causal clarity: the load-imbalance improvement (mean <code class="language-plaintext highlighter-rouge">thread_imbalance_pct</code> from 14.6% to 10.7%) is real and consistent across the full run. But the biggest lever in the speedup numbers is the contact-detection work - faster algorithms plus better scheduling of an inherently irregular phase. I don’t have a clean ablation between the USPG/Morton changes and the scheduler; the 41.3h run exercised both together. Turning off Morton sorting and re-running is the measurement this section is missing.</p>

<hr />

<h2 id="three-changes-that-compounded-the-gains">Three changes that compounded the gains</h2>

<p><strong>Faster contact detection:</strong></p>

<p>Contact detection is both the most irregular phase and the most expensive, so speeding it up multiplies with the scheduling win. Its spatial-lookup containers (an unbounded uniform grid, “USPG”) switched from <code class="language-plaintext highlighter-rouge">std::forward_list</code> to <code class="language-plaintext highlighter-rouge">std::vector</code> - better cache behaviour and fewer pointer chases. Morton sorting of faces before USPG insertion was added separately: faces near each other in space land near each other in memory and the cache stops thrashing. Above 500 cells, a different broad-phase algorithm - Sweep-and-Prune - switches on automatically (<code class="language-plaintext highlighter-rouge">ADAPTIVE_SAP_CELL_THRESHOLD = 500</code>). SAP projects objects onto axes and finds overlapping intervals; it scales better than a uniform grid in sparse, large-N scenes and produces exactly the same candidate pairs - it is not a coarser approximation. The <code class="language-plaintext highlighter-rouge">contact_detection_algorithm</code> XML parameter accepts <code class="language-plaintext highlighter-rouge">uspg</code>, <code class="language-plaintext highlighter-rouge">sweep_and_prune</code>, or <code class="language-plaintext highlighter-rouge">adaptive</code>; the default is <code class="language-plaintext highlighter-rouge">uspg</code> unless explicitly set.</p>
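<p>For readers who haven’t met Morton (Z-order) codes: the idea is to interleave the bits of quantized x, y, and z coordinates so that sorting by the resulting integer keeps spatial neighbours close in memory. Below is the textbook 30-bit variant (C++17) as a sketch; the fork’s actual bit width, quantization, and function names may differ, and everything named here is illustrative.</p>

```cpp
#include <algorithm>
#include <cstdint>

// Spread the low 10 bits of v so two zero bits separate each input bit.
std::uint32_t expand_bits_10(std::uint32_t v) {
    v &= 0x000003FFu;
    v = (v | (v << 16)) & 0x030000FFu;
    v = (v | (v << 8)) & 0x0300F00Fu;
    v = (v | (v << 4)) & 0x030C30C3u;
    v = (v | (v << 2)) & 0x09249249u;
    return v;
}

// 30-bit Morton code from three 10-bit grid coordinates (x in the lowest bit).
std::uint32_t morton3d(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return expand_bits_10(x) | (expand_bits_10(y) << 1) | (expand_bits_10(z) << 2);
}

// Map a coordinate in [lo, hi] onto the 0..1023 grid, clamping outliers.
std::uint32_t quantize_10(double value, double lo, double hi) {
    if (hi <= lo) return 0;
    const double unit = std::clamp((value - lo) / (hi - lo), 0.0, 1.0);
    return static_cast<std::uint32_t>(unit * 1023.0);
}
```

<p>Sorting faces by <code class="language-plaintext highlighter-rouge">morton3d</code> of their quantized centroids before inserting them into the grid is one way to get the locality described above. The cost is a sort per rebuild, which pays off only while cache misses dominate the lookup.</p>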

<p><strong>Better CI and memory-safety tooling:</strong></p>

<p>v1.0’s CI ran one Release build and <code class="language-plaintext highlighter-rouge">ctest -C Release</code>. The fork adds Debug builds, AddressSanitizer and UndefinedBehaviorSanitizer, and a clang-format check (<a href="https://github.com/nilesh-patil/simucell3d/commit/a2ca28e">commit <code class="language-plaintext highlighter-rouge">a2ca28e</code></a>). Latency profiling was added separately in <a href="https://github.com/nilesh-patil/simucell3d/commit/b0aac1a">commit <code class="language-plaintext highlighter-rouge">b0aac1a</code></a> (2026-02-15). The sanitizers earned their place immediately: they caught a heap-use-after-free in <code class="language-plaintext highlighter-rouge">local_mesh_refiner::split_edge</code> - a reference left dangling after a vector reallocation - that had survived careful manual review (<a href="https://github.com/nilesh-patil/simucell3d/commit/d5e2112">commit <code class="language-plaintext highlighter-rouge">d5e2112</code></a>).</p>

<p><strong>A correctness sweep:</strong></p>

<p>Alongside the performance work: a division-by-zero guard in <code class="language-plaintext highlighter-rouge">mat33::inverse</code>, NaN suppression in <code class="language-plaintext highlighter-rouge">vec3::angle</code>, a fix for a <code class="language-plaintext highlighter-rouge">cell_lst.size()</code> data race in the parallel cell-division loop (<code class="language-plaintext highlighter-rouge">cell_divider.cpp</code>), and null checks in <code class="language-plaintext highlighter-rouge">parameter_reader</code>. The parallel division bug is the kind that <em>only</em> shows up under the heavier thread utilization the new schedules produce, which is why it mattered to fix it before trusting any benchmark.</p>

<hr />

<h2 id="whats-still-open">What’s still open</h2>

<p>A few honest gaps remain:</p>

<ul>
  <li><strong>Thread Sanitizer isn’t in CI yet.</strong> Only ASan and UBSan run. The cell-division parallel section still produces enough benign-looking races that TSan is noisy, and that noise needs triaging before it can gate the build.</li>
  <li><strong>The Python wrapper hasn’t caught up.</strong> The new <code class="language-plaintext highlighter-rouge">--schedule=</code> and <code class="language-plaintext highlighter-rouge">--diagnostics-csv=</code> CLI flags exist on the C++ binary but aren’t exposed through the pybind11 wrapper - <code class="language-plaintext highlighter-rouge">simucell3d_wrapper</code> only forwards simulation parameters, cell list, thread count, and verbosity. Python users can’t reach the new knobs yet. Tracked for the next release.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">assert()</code> in hot paths.</strong> 306 <code class="language-plaintext highlighter-rouge">assert()</code> calls still live in production <code class="language-plaintext highlighter-rouge">src/</code> (all runtime <code class="language-plaintext highlighter-rouge">assert()</code>, none <code class="language-plaintext highlighter-rouge">static_assert</code>). Several have been converted to proper runtime checks in the critical paths; the rest are a slower migration.</li>
  <li><strong>The high-CoV machinery is unexercised on committed workloads.</strong> The 0.4/0.6 chunk-band thresholds and the GROWTH phase detector are implemented and confirmed in the source, but neither fired on the 41.3h paper-exact run. A faster-dividing scenario would be the right workload to exercise them.</li>
</ul>

<hr />

<h2 id="three-things-this-taught-me">Three things this taught me</h2>

<p><strong>One: the most consequential performance knob wasn’t the integrator or time step - it was the order in which threads picked up work.</strong> The right choice costs almost nothing at runtime. The wrong choice costs 2×+, silently, because CPU utilization stays pinned at 100% even while thread utilization tanks. <code class="language-plaintext highlighter-rouge">top</code> tells you nothing useful here; a tracing profiler tells you everything.</p>

<p><strong>Two: instrumentation before optimization.</strong> It would have been easy to flip <code class="language-plaintext highlighter-rouge">schedule(static)</code> to <code class="language-plaintext highlighter-rouge">schedule(dynamic)</code> and call it done. Building the full measurement stack took longer - but it produced the uncomfortable observation that CoV never triggered the fancy band machinery on this workload, that GROWTH never fired, that the high-CoV branches are correct but untested on these parameters. Knowing those gaps is more useful than assuming everything worked.</p>

<p><strong>Three: fix correctness before you trust a benchmark.</strong> The sanitizers found a memory bug that survived manual review; the parallel-division race only surfaces under the heavier thread utilization the new schedules cause. In the right order, you catch those before they quietly contaminate your numbers.</p>

<hr />

<ul>
  <li><strong>Code</strong>: <a href="https://github.com/nilesh-patil/simucell3d">github.com/nilesh-patil/simucell3d</a> (tag <code class="language-plaintext highlighter-rouge">v2.0</code>, branch <code class="language-plaintext highlighter-rouge">main</code>)</li>
  <li><strong>Reference</strong>: Runser et al., <a href="https://www.nature.com/articles/s43588-024-00620-9">SimuCell3D</a>, <em>Nature Computational Science</em> (2024)</li>
  <li><strong>Project page</strong>: <a href="/portfolio/simucell3d/">SimuCell3D on Side Projects</a></li>
</ul>]]></content><author><name>Nilesh Patil</name></author><category term="blog" /><category term="cpp" /><category term="hpc" /><category term="openmp" /><category term="simucell3d" /><category term="computational-biology" /><category term="profiling" /><summary type="html"><![CDATA[Measuring work imbalance and teaching a tissue simulator's scheduler to adapt for dynamic workloads.]]></summary></entry><entry><title type="html">Distributed K-Means Clustering in Python</title><link href="https://www.nilesh42.science/posts/distributed-kmeans-clustering/" rel="alternate" type="text/html" title="Distributed K-Means Clustering in Python" /><published>2020-05-20T15:30:00+05:30</published><updated>2020-05-20T15:30:00+05:30</updated><id>https://www.nilesh42.science/posts/distributed-kmeans-clustering</id><content type="html" xml:base="https://www.nilesh42.science/posts/distributed-kmeans-clustering/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p><a href="https://en.wikipedia.org/wiki/K-means_clustering">K-means clustering</a> is one of the most widely used unsupervised machine learning algorithms for partitioning data into k clusters. While the algorithm is conceptually simple and computationally efficient for moderate-sized datasets, it becomes expensive on large datasets, where every iteration has to scan millions or billions of data points.</p>

<p>In this post, we’ll explore how to implement distributed k-means clustering in Python using popular frameworks like <strong>PySpark</strong> and <strong>Dask</strong>, enabling us to handle massive datasets that don’t fit into memory on a single machine.</p>

<h2 id="k-means-algorithm-overview">K-Means Algorithm Overview</h2>

<p>Before diving into distributed implementations, let’s quickly review the standard k-means algorithm:</p>

<ol>
  <li><strong>Initialization</strong>: Sample k cluster centroids randomly</li>
  <li><strong>Assignment Step</strong>: Assign each data point to the nearest centroid</li>
  <li><strong>Update Step</strong>: Recalculate centroids as the mean of assigned points</li>
  <li><strong>Verify</strong>: Check the stop condition</li>
  <li><strong>Repeat</strong>: Repeat steps 2-4 until convergence or until the maximum number of iterations is reached</li>
</ol>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="c1"># Standard scikit-learn k-means for small datasets
</span><span class="k">def</span> <span class="nf">standard_kmeans_example</span><span class="p">():</span>
    <span class="c1"># Generate sample data
</span>    <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
    <span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
    
    <span class="c1"># Apply k-means
</span>    <span class="n">kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    
    <span class="c1"># Plot results
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">labels</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'viridis'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">cluster_centers_</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> 
                <span class="n">kmeans</span><span class="p">.</span><span class="n">cluster_centers_</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> 
                <span class="n">c</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Standard K-Means Clustering'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">kmeans</span><span class="p">,</span> <span class="n">labels</span>
</code></pre></div></div>
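
<p>For intuition, the steps listed above can also be written out directly in NumPy. This is a minimal sketch rather than what scikit-learn does internally: the function names <code class="language-plaintext highlighter-rouge">assign_labels</code> and <code class="language-plaintext highlighter-rouge">lloyd_kmeans</code>, the tolerance, and the choice to keep the previous centroid of an empty cluster are all assumptions made for illustration.</p>

```python
import numpy as np


def assign_labels(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each row of X."""
    diffs = X[:, None, :] - centroids[None, :, :]
    return np.einsum("nkd,nkd->nk", diffs, diffs).argmin(axis=1)


def lloyd_kmeans(X, k, max_iter=100, tol=1e-8, seed=0):
    """Toy k-means; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid
        labels = assign_labels(X, centroids)

        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                new_centroids[j] = members.mean(axis=0)

        # Step 4: stop once the centroids have effectively stopped moving
        converged = np.allclose(new_centroids, centroids, rtol=0.0, atol=tol)
        centroids = new_centroids
        if converged:
            break  # Step 5 is the loop itself

    return centroids, assign_labels(X, centroids)
```

<p>The sketch materializes an n × k × d array of differences in every iteration, so it is only practical for small inputs. That is the limitation the distributed versions later in this post are meant to address.</p>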

<h2 id="challenges-with-large-scale-data">Challenges with Large-Scale Data</h2>

<p>When dealing with large datasets, traditional k-means implementations face several challenges:</p>

<ol>
  <li><strong>Memory Constraints</strong>: Large datasets may not fit into the memory of a single machine</li>
  <li><strong>Computational Complexity</strong>: The $O(n \times k \times d \times i)$ time complexity becomes prohibitive, where <code class="language-plaintext highlighter-rouge">n</code> is the number of data points, <code class="language-plaintext highlighter-rouge">k</code> is the number of clusters, <code class="language-plaintext highlighter-rouge">d</code> is the dimensionality, and <code class="language-plaintext highlighter-rouge">i</code> is the number of iterations</li>
  <li><strong>I/O Bottlenecks</strong>: Reading massive datasets from disk is slow, and the transfer between disk and memory can dominate the run time</li>
  <li><strong>Scalability</strong>: Single-machine limitations prevent processing datasets beyond hardware capacity</li>
</ol>
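
<p>To make the first two points concrete, a quick back-of-the-envelope estimate helps. The dataset size below (one billion points, 50 features, 10 clusters, 20 iterations) is a made-up example, and the helper <code class="language-plaintext highlighter-rouge">kmeans_cost_estimate</code> is hypothetical. It ignores constant factors and any framework overhead.</p>

```python
def kmeans_cost_estimate(n, k, d, iterations, bytes_per_value=8):
    """Rough cost of Lloyd-style k-means; constant factors are ignored."""
    # Every iteration compares each point with each centroid across d dimensions
    distance_terms = n * k * d * iterations
    # Dense float64 storage for the n x d feature matrix, in gigabytes
    data_gb = n * d * bytes_per_value / 1e9
    return distance_terms, data_gb


# Assumed workload, chosen only for illustration
terms, data_gb = kmeans_cost_estimate(n=1_000_000_000, k=10, d=50, iterations=20)
print(f"{terms:.1e} distance terms, {data_gb:.0f} GB of raw features")
```

<p>With these assumed numbers, the raw float64 feature matrix alone is 400 GB and a full run evaluates on the order of 10<sup>13</sup> distance terms, which is well beyond what a typical single machine can hold in memory.</p>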

<h2 id="distributed-k-means-implementations">Distributed K-Means Implementations</h2>

<h3 id="distributed-k-means-with-pyspark">Distributed K-Means with PySpark</h3>

<p><a href="https://spark.apache.org/">Apache Spark</a> provides good support for distributed k-means clustering through its MLlib library. Here’s how to implement it:</p>

<h4 id="setting-up-pyspark-environment">Setting Up PySpark Environment</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="kn">from</span> <span class="nn">pyspark.ml.clustering</span> <span class="kn">import</span> <span class="n">KMeans</span> <span class="k">as</span> <span class="n">SparkKMeans</span>
<span class="kn">from</span> <span class="nn">pyspark.ml.feature</span> <span class="kn">import</span> <span class="n">VectorAssembler</span>
<span class="kn">from</span> <span class="nn">pyspark.ml.evaluation</span> <span class="kn">import</span> <span class="n">ClusteringEvaluator</span>
<span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span>
<span class="kn">import</span> <span class="nn">pyspark.sql.functions</span> <span class="k">as</span> <span class="n">F</span>

<span class="c1"># Initialize Spark session
</span><span class="k">def</span> <span class="nf">create_spark_session</span><span class="p">():</span>
    <span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span> \
        <span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">"DistributedKMeans"</span><span class="p">)</span> \
        <span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.sql.adaptive.enabled"</span><span class="p">,</span> <span class="s">"true"</span><span class="p">)</span> \
        <span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">"spark.sql.adaptive.coalescePartitions.enabled"</span><span class="p">,</span> <span class="s">"true"</span><span class="p">)</span> \
        <span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
    
    <span class="n">spark</span><span class="p">.</span><span class="n">sparkContext</span><span class="p">.</span><span class="n">setLogLevel</span><span class="p">(</span><span class="s">"WARN"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">spark</span>
</code></pre></div></div>

<h4 id="implementing-distributed-k-means">Implementing Distributed K-Means</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">distributed_kmeans_pyspark</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">data_path</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="s">"""
    Perform distributed k-means clustering using PySpark
    
    Parameters:
    -----------
    spark : SparkSession
        Active Spark session
    data_path : str
        Path to the dataset (CSV, Parquet, etc.)
    k : int
        Number of clusters
    max_iter : int
        Maximum number of iterations
        
    Returns:
    --------
    model : KMeansModel
        Trained k-means model
    predictions : DataFrame
        DataFrame with cluster assignments
    """</span>
    
    <span class="c1"># Load data
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">"header"</span><span class="p">,</span> <span class="s">"true"</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">)</span>
    
    <span class="c1"># Convert string columns to numeric (if needed)
</span>    <span class="n">numeric_cols</span> <span class="o">=</span> <span class="p">[</span><span class="n">col_name</span> <span class="k">for</span> <span class="n">col_name</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span> 
                    <span class="k">if</span> <span class="n">col_name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'id'</span><span class="p">,</span> <span class="s">'label'</span><span class="p">]]</span>
    
    <span class="k">for</span> <span class="n">col_name</span> <span class="ow">in</span> <span class="n">numeric_cols</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">withColumn</span><span class="p">(</span><span class="n">col_name</span><span class="p">,</span> <span class="n">col</span><span class="p">(</span><span class="n">col_name</span><span class="p">).</span><span class="n">cast</span><span class="p">(</span><span class="s">"double"</span><span class="p">))</span>
    
    <span class="c1"># Create feature vector
</span>    <span class="n">assembler</span> <span class="o">=</span> <span class="n">VectorAssembler</span><span class="p">(</span>
        <span class="n">inputCols</span><span class="o">=</span><span class="n">numeric_cols</span><span class="p">,</span>
        <span class="n">outputCol</span><span class="o">=</span><span class="s">"features"</span>
    <span class="p">)</span>
    <span class="n">df_vectorized</span> <span class="o">=</span> <span class="n">assembler</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    
    <span class="c1"># Initialize and train k-means model
</span>    <span class="n">kmeans</span> <span class="o">=</span> <span class="n">SparkKMeans</span><span class="p">(</span>
        <span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">,</span>
        <span class="n">maxIter</span><span class="o">=</span><span class="n">max_iter</span><span class="p">,</span>
        <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
        <span class="n">featuresCol</span><span class="o">=</span><span class="s">"features"</span><span class="p">,</span>
        <span class="n">predictionCol</span><span class="o">=</span><span class="s">"cluster"</span>
    <span class="p">)</span>
    
    <span class="n">model</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_vectorized</span><span class="p">)</span>
    
    <span class="c1"># Make predictions
</span>    <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">df_vectorized</span><span class="p">)</span>
    
    <span class="c1"># Display cluster centers
</span>    <span class="n">centers</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">clusterCenters</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Cluster Centers:"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">center</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">centers</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Cluster </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">center</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="c1"># Calculate Within Set Sum of Squared Errors (WSSSE)
</span>    <span class="n">wssse</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">trainingCost</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Within Set Sum of Squared Errors: </span><span class="si">{</span><span class="n">wssse</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">model</span><span class="p">,</span> <span class="n">predictions</span>

<span class="c1"># Example usage
</span><span class="k">def</span> <span class="nf">run_pyspark_example</span><span class="p">():</span>
    <span class="n">spark</span> <span class="o">=</span> <span class="n">create_spark_session</span><span class="p">()</span>
    
    <span class="c1"># Generate sample distributed dataset
</span>    <span class="n">sample_data</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100000</span><span class="p">).</span><span class="n">select</span><span class="p">(</span>
        <span class="n">F</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">"feature1"</span><span class="p">),</span>
        <span class="n">F</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="mi">43</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">"feature2"</span><span class="p">),</span>
        <span class="n">F</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="mi">44</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">"feature3"</span><span class="p">)</span>
    <span class="p">)</span>
    
    <span class="c1"># Save to temporary location for demonstration
</span>    <span class="n">sample_data</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="s">"overwrite"</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">"header"</span><span class="p">,</span> <span class="s">"true"</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">"/tmp/sample_data"</span><span class="p">)</span>
    
    <span class="c1"># Run distributed k-means
</span>    <span class="n">model</span><span class="p">,</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">distributed_kmeans_pyspark</span><span class="p">(</span>
        <span class="n">spark</span><span class="p">,</span> <span class="s">"/tmp/sample_data"</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">50</span>
    <span class="p">)</span>
    
    <span class="c1"># Show sample predictions
</span>    <span class="n">predictions</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="s">"feature1"</span><span class="p">,</span> <span class="s">"feature2"</span><span class="p">,</span> <span class="s">"feature3"</span><span class="p">,</span> <span class="s">"cluster"</span><span class="p">).</span><span class="n">show</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
    
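    <span class="c1"># Note: the returned DataFrame cannot be evaluated once the session is stopped
</span>    <span class="n">spark</span><span class="p">.</span><span class="n">stop</span><span class="p">()</span>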
    <span class="k">return</span> <span class="n">model</span><span class="p">,</span> <span class="n">predictions</span>
</code></pre></div></div>
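
<p>Conceptually, what makes k-means easy to distribute is that the update step only needs per-cluster sums and counts, which can be computed independently on each partition and then added together. The sketch below illustrates that map-reduce pattern in plain NumPy. It is not MLlib's actual implementation, and the names <code class="language-plaintext highlighter-rouge">partial_stats</code> and <code class="language-plaintext highlighter-rouge">distributed_step</code> are hypothetical.</p>

```python
import numpy as np


def partial_stats(partition, centroids):
    """Map step: per-cluster coordinate sums and point counts for one partition."""
    k, d = centroids.shape
    diffs = partition[:, None, :] - centroids[None, :, :]
    labels = np.einsum("nkd,nkd->nk", diffs, diffs).argmin(axis=1)
    sums = np.zeros((k, d))
    np.add.at(sums, labels, partition)  # add each row to the sum of its cluster
    counts = np.bincount(labels, minlength=k)
    return sums, counts


def distributed_step(partitions, centroids):
    """Reduce step: merge the partial statistics and recompute the centroids."""
    # In a real cluster each call below would run on a different worker
    stats = [partial_stats(p, centroids) for p in partitions]
    total_sums = sum(s for s, _ in stats)
    total_counts = sum(c for _, c in stats)

    new_centroids = centroids.astype(float)  # astype returns a copy
    nonempty = total_counts > 0  # empty clusters keep their previous centroid
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty][:, None]
    return new_centroids
```

<p>Only the small <code class="language-plaintext highlighter-rouge">(k, d)</code> sums and the <code class="language-plaintext highlighter-rouge">k</code> counts leave each partition, so the amount of data exchanged per iteration does not depend on the number of points.</p>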

<h4 id="advanced-pyspark-k-means-with-custom-initialization">Advanced PySpark K-Means with Custom Initialization</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">advanced_kmeans_pyspark</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">init_method</span><span class="o">=</span><span class="s">"k-means++"</span><span class="p">):</span>
    <span class="s">"""
    Advanced k-means implementation with custom initialization strategies
    """</span>
    
    <span class="k">if</span> <span class="n">init_method</span> <span class="o">==</span> <span class="s">"k-means++"</span><span class="p">:</span>
        <span class="c1"># Spark ML has no "k-means++" initMode; its scalable variant is
</span>        <span class="c1"># k-means|| (k-means parallel). Both aim for good seeding, but k-means||
</span>        <span class="c1"># samples many candidates per pass and needs only initSteps passes
</span>        <span class="c1"># over the data instead of k sequential ones.
</span>        <span class="n">kmeans</span> <span class="o">=</span> <span class="n">SparkKMeans</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">,</span> <span class="n">initMode</span><span class="o">=</span><span class="s">"k-means||"</span><span class="p">,</span> <span class="n">initSteps</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
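    <span class="k">elif</span> <span class="n">init_method</span> <span class="o">==</span> <span class="s">"random"</span><span class="p">:</span>
        <span class="n">kmeans</span> <span class="o">=</span> <span class="n">SparkKMeans</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">,</span> <span class="n">initMode</span><span class="o">=</span><span class="s">"random"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Fail early instead of hitting a NameError on `kmeans` below
</span>        <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unsupported init_method: </span><span class="si">{</span><span class="n">init_method</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>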
    
    <span class="c1"># Add convergence tolerance
</span>    <span class="n">kmeans</span><span class="p">.</span><span class="n">setTol</span><span class="p">(</span><span class="mf">1e-4</span><span class="p">)</span>
    
    <span class="c1"># Train model
</span>    <span class="n">model</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    
    <span class="c1"># Evaluate model performance
</span>    <span class="n">silhouette_evaluator</span> <span class="o">=</span> <span class="n">ClusteringEvaluator</span><span class="p">()</span>
    <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    <span class="n">silhouette</span> <span class="o">=</span> <span class="n">silhouette_evaluator</span><span class="p">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span>
    
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Silhouette Score: </span><span class="si">{</span><span class="n">silhouette</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">model</span><span class="p">,</span> <span class="n">predictions</span>
</code></pre></div></div>
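
<p>For comparison with the comment in the code above, sequential k-means++ seeding picks each new centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal NumPy sketch follows. The function name <code class="language-plaintext highlighter-rouge">kmeans_plus_plus_init</code> is hypothetical, and this is not how Spark seeds its clusters.</p>

```python
import numpy as np


def kmeans_plus_plus_init(X, k, seed=0):
    """Sequential k-means++ seeding (D^2 sampling) on an in-memory array."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # The first centroid is a uniformly random data point
    centroids = [X[rng.integers(len(X))]]

    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.einsum("nkd,nkd->nk", diffs, diffs).min(axis=1)
        # Points far from all current centroids are more likely to be picked;
        # already chosen points have distance 0 and cannot be picked again
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])

    return np.asarray(centroids)
```

<p>Each of the k - 1 rounds needs a full pass over the data to refresh the distances. That sequential cost is what k-means|| is designed to reduce on a cluster.</p>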

<h3 id="distributed-k-means-with-dask">Distributed K-Means with Dask</h3>

<p>Dask is another robust framework for distributed computing in Python. Its API is more Pythonic and is aimed specifically at numerical computing workloads:</p>

<h4 id="setting-up-dask-environment">Setting Up Dask Environment</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dask.dataframe</span> <span class="k">as</span> <span class="n">dd</span>
<span class="kn">import</span> <span class="nn">dask.array</span> <span class="k">as</span> <span class="n">da</span>
<span class="kn">from</span> <span class="nn">dask.distributed</span> <span class="kn">import</span> <span class="n">Client</span><span class="p">,</span> <span class="n">as_completed</span>
<span class="kn">from</span> <span class="nn">dask_ml.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span> <span class="k">as</span> <span class="n">DaskKMeans</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">create_dask_client</span><span class="p">(</span><span class="n">n_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">):</span>
    <span class="s">"""
    Create a Dask client for distributed computing
    """</span>
    <span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="n">n_workers</span><span class="o">=</span><span class="n">n_workers</span><span class="p">,</span> <span class="n">threads_per_worker</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Dask dashboard available at: </span><span class="si">{</span><span class="n">client</span><span class="p">.</span><span class="n">dashboard_link</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">client</span>
</code></pre></div></div>

<h4 id="implementing-distributed-k-means-with-dask">Implementing Distributed K-Means with Dask</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">distributed_kmeans_dask</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">chunk_size</span><span class="o">=</span><span class="s">"100MB"</span><span class="p">):</span>
    <span class="s">"""
    Perform distributed k-means clustering using Dask
    
    Parameters:
    -----------
    data_path : str
        Path to the dataset
    k : int
        Number of clusters
    max_iter : int
        Maximum number of iterations
    chunk_size : str
        Size of data chunks for processing
        
    Returns:
    --------
    model : DaskKMeans
        Trained k-means model
    labels : dask.array
        Cluster assignments
    """</span>
    
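    <span class="c1"># Load data with Dask; blocksize sets the partition size (e.g. "100MB")
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">dd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">blocksize</span><span class="o">=</span><span class="n">chunk_size</span><span class="p">)</span>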
    
    <span class="c1"># Convert the numeric columns to a Dask array for clustering
</span>    <span class="c1"># (non-numeric columns are excluded)
</span>    <span class="n">numeric_cols</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">select_dtypes</span><span class="p">(</span><span class="n">include</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">number</span><span class="p">]).</span><span class="n">columns</span>
    <span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">numeric_cols</span><span class="p">].</span><span class="n">to_dask_array</span><span class="p">(</span><span class="n">lengths</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    
    <span class="c1"># Initialize Dask k-means
</span>    <span class="n">kmeans</span> <span class="o">=</span> <span class="n">DaskKMeans</span><span class="p">(</span>
        <span class="n">n_clusters</span><span class="o">=</span><span class="n">k</span><span class="p">,</span>
        <span class="n">max_iter</span><span class="o">=</span><span class="n">max_iter</span><span class="p">,</span>
        <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
        <span class="n">init_max_iter</span><span class="o">=</span><span class="mi">3</span>  <span class="c1"># Rounds of the default k-means|| initialization
</span>    <span class="p">)</span>
    
    <span class="c1"># Fit the model
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"Training distributed k-means model..."</span><span class="p">)</span>
    <span class="n">kmeans</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    
    <span class="c1"># Predict cluster labels
</span>    <span class="n">labels</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    
    <span class="c1"># Get cluster centers
</span>    <span class="n">centers</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">cluster_centers_</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Cluster centers shape: </span><span class="si">{</span><span class="n">centers</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="c1"># Calculate inertia (within-cluster sum of squares)
</span>    <span class="n">inertia</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">inertia_</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Inertia: </span><span class="si">{</span><span class="n">inertia</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">kmeans</span><span class="p">,</span> <span class="n">labels</span>

<span class="c1"># Example with synthetic data generation
</span><span class="k">def</span> <span class="nf">generate_large_dataset_dask</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">1000000</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">n_centers</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
    <span class="s">"""
    Generate a large synthetic dataset using Dask
    """</span>
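    <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
    <span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">make_blobs</span>
    
    <span class="c1"># Draw the cluster centers once so every chunk samples the same n_centers blobs
</span>    <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
    <span class="n">centers</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_centers</span><span class="p">,</span> <span class="n">n_features</span><span class="p">))</span>
    
    <span class="c1"># Generate data in chunks
</span>    <span class="n">chunk_size</span> <span class="o">=</span> <span class="mi">100000</span>
    <span class="n">chunks</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">,</span> <span class="n">chunk_size</span><span class="p">):</span>
        <span class="n">current_chunk_size</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">chunk_size</span><span class="p">,</span> <span class="n">n_samples</span> <span class="o">-</span> <span class="n">i</span><span class="p">)</span>
        <span class="n">X_chunk</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">make_blobs</span><span class="p">(</span>
            <span class="n">n_samples</span><span class="o">=</span><span class="n">current_chunk_size</span><span class="p">,</span>
            <span class="n">centers</span><span class="o">=</span><span class="n">centers</span><span class="p">,</span>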
            <span class="n">n_features</span><span class="o">=</span><span class="n">n_features</span><span class="p">,</span>
            <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span>
            <span class="n">cluster_std</span><span class="o">=</span><span class="mf">1.5</span>
        <span class="p">)</span>
        <span class="n">chunks</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">da</span><span class="p">.</span><span class="n">from_array</span><span class="p">(</span><span class="n">X_chunk</span><span class="p">,</span> <span class="n">chunks</span><span class="o">=</span><span class="p">(</span><span class="n">current_chunk_size</span><span class="p">,</span> <span class="n">n_features</span><span class="p">)))</span>
    
    <span class="c1"># Concatenate chunks
</span>    <span class="n">X_large</span> <span class="o">=</span> <span class="n">da</span><span class="p">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">chunks</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">X_large</span>

<span class="k">def</span> <span class="nf">run_dask_example</span><span class="p">():</span>
    <span class="s">"""
    Complete example of distributed k-means with Dask
    """</span>
    <span class="c1"># Create Dask client
</span>    <span class="n">client</span> <span class="o">=</span> <span class="n">create_dask_client</span><span class="p">(</span><span class="n">n_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
    
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Generate large synthetic dataset
</span>        <span class="k">print</span><span class="p">(</span><span class="s">"Generating large synthetic dataset..."</span><span class="p">)</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">generate_large_dataset_dask</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">500000</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">n_centers</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
        
        <span class="c1"># Apply k-means clustering
</span>        <span class="k">print</span><span class="p">(</span><span class="s">"Applying distributed k-means..."</span><span class="p">)</span>
        <span class="n">kmeans</span> <span class="o">=</span> <span class="n">DaskKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
        
        <span class="c1"># Fit and predict
</span>        <span class="n">labels</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        
        <span class="c1"># Compute results
</span>        <span class="n">unique_labels</span> <span class="o">=</span> <span class="n">da</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">labels</span><span class="p">).</span><span class="n">compute</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unique cluster labels: </span><span class="si">{</span><span class="n">unique_labels</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="c1"># Calculate cluster statistics
</span>        <span class="n">centers</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">cluster_centers_</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Cluster centers:</span><span class="se">\n</span><span class="si">{</span><span class="n">centers</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">kmeans</span><span class="p">,</span> <span class="n">labels</span>
        
    <span class="k">finally</span><span class="p">:</span>
        <span class="c1"># Clean up
</span>        <span class="n">client</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>

<h4 id="incremental-k-means-with-dask">Incremental K-Means with Dask</h4>

<blockquote>
  <p><strong>Note:</strong> <code class="language-plaintext highlighter-rouge">dask_ml.cluster.KMeans</code> does <strong>not</strong> expose
<code class="language-plaintext highlighter-rouge">partial_fit</code>, so calling it raises <code class="language-plaintext highlighter-rouge">AttributeError</code>. For incremental or
streaming k-means, wrap scikit-learn’s <code class="language-plaintext highlighter-rouge">MiniBatchKMeans</code>, which does support
<code class="language-plaintext highlighter-rouge">partial_fit</code>, in <code class="language-plaintext highlighter-rouge">dask_ml.wrappers.Incremental</code>.</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">incremental_kmeans_dask</span><span class="p">(</span><span class="n">data_stream</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">10000</span><span class="p">):</span>
    <span class="s">"""
    Implement incremental k-means for streaming data using Dask's Incremental
    wrapper around scikit-learn's MiniBatchKMeans.

    dask_ml.cluster.KMeans does NOT have partial_fit; use Incremental instead.
    """</span>
    <span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">MiniBatchKMeans</span>
    <span class="kn">from</span> <span class="nn">dask_ml.wrappers</span> <span class="kn">import</span> <span class="n">Incremental</span>

    <span class="c1"># Wrap MiniBatchKMeans (which supports partial_fit) with Dask's Incremental
</span>    <span class="n">base</span> <span class="o">=</span> <span class="n">MiniBatchKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="n">k</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
    <span class="n">kmeans</span> <span class="o">=</span> <span class="n">Incremental</span><span class="p">(</span><span class="n">base</span><span class="p">)</span>

    <span class="c1"># Process data in batches — Incremental delegates to partial_fit internally
</span>    <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">data_stream</span><span class="p">:</span>
        <span class="n">kmeans</span><span class="p">.</span><span class="n">partial_fit</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>

        <span class="c1"># Track convergence via the underlying estimator
</span>        <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">estimator_</span><span class="p">,</span> <span class="s">'inertia_'</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Current inertia: </span><span class="si">{</span><span class="n">kmeans</span><span class="p">.</span><span class="n">estimator_</span><span class="p">.</span><span class="n">inertia_</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">kmeans</span>
</code></pre></div></div>
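
<p>To make the streaming update concrete, here is a minimal pure-Python sketch of the per-sample centroid update that mini-batch k-means is built on: each center keeps a count and moves toward every new point by a step of 1/count. It is an illustration under simplifying assumptions, not the actual scikit-learn or Dask-ML implementation. The names <code class="language-plaintext highlighter-rouge">nearest_center</code>, <code class="language-plaintext highlighter-rouge">partial_fit_step</code>, <code class="language-plaintext highlighter-rouge">centers</code>, and <code class="language-plaintext highlighter-rouge">counts</code> are hypothetical, and the data is a toy example.</p>

```python
import math


def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def partial_fit_step(batch, centers, counts):
    """Update centers in place from one batch of points.

    Each point pulls its nearest center toward itself with learning rate
    1 / (number of points that center has absorbed so far), so a center is
    always the running mean of the points it has absorbed.
    """
    for point in batch:
        j = nearest_center(point, centers)
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [(1.0 - eta) * c + eta * x for c, x in zip(centers[j], point)]
    return centers, counts


# Two well-separated groups arriving in two batches (toy data)
centers = [[0.0, 0.0], [10.0, 10.0]]  # assumed initial guesses
counts = [0, 0]
stream = [
    [[1.0, 1.0], [9.0, 9.0]],
    [[3.0, 1.0], [11.0, 9.0]],
]
for batch in stream:
    partial_fit_step(batch, centers, counts)

print(centers)  # [[2.0, 1.0], [10.0, 9.0]]
```

<p>Because only the centers and their counts are kept between batches, memory use is independent of how many batches have been seen, which is what makes this style of update suitable for streams.</p>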

<h2 id="performance-comparison">Performance Comparison</h2>

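<p>The benchmark below times scikit-learn’s in-memory <code class="language-plaintext highlighter-rouge">KMeans</code> against Dask-ML’s <code class="language-plaintext highlighter-rouge">KMeans</code> on the same synthetic data at three sizes. All three sizes fit comfortably in memory, so Dask’s scheduling overhead can dominate and a speedup below 1 is plausible here:</p>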

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">make_blobs</span>

<span class="k">def</span> <span class="nf">performance_comparison</span><span class="p">():</span>
    <span class="s">"""
    Compare performance of different k-means implementations
    """</span>
    <span class="c1"># Generate test data
</span>    <span class="n">sizes</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10000</span><span class="p">,</span> <span class="mi">50000</span><span class="p">,</span> <span class="mi">100000</span><span class="p">]</span>
    <span class="n">results</span> <span class="o">=</span> <span class="p">{}</span>
    
    <span class="k">for</span> <span class="n">size</span> <span class="ow">in</span> <span class="n">sizes</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Testing with </span><span class="si">{</span><span class="n">size</span><span class="si">}</span><span class="s"> samples..."</span><span class="p">)</span>
        <span class="n">X</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">make_blobs</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="n">size</span><span class="p">,</span> <span class="n">centers</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
        
        <span class="c1"># Scikit-learn (single machine, in-memory baseline)
</span>        <span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="n">sklearn_kmeans</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
        <span class="n">sklearn_kmeans</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="n">sklearn_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start_time</span>
        
        <span class="c1"># Dask (if data fits in memory)
</span>        <span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
        <span class="n">X_dask</span> <span class="o">=</span> <span class="n">da</span><span class="p">.</span><span class="n">from_array</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">chunks</span><span class="o">=</span><span class="p">(</span><span class="mi">10000</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
        <span class="n">dask_kmeans</span> <span class="o">=</span> <span class="n">DaskKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
        <span class="n">dask_kmeans</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_dask</span><span class="p">)</span>
        <span class="n">dask_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start_time</span>
        
        <span class="n">results</span><span class="p">[</span><span class="n">size</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">'sklearn'</span><span class="p">:</span> <span class="n">sklearn_time</span><span class="p">,</span>
            <span class="s">'dask'</span><span class="p">:</span> <span class="n">dask_time</span><span class="p">,</span>
            <span class="s">'speedup'</span><span class="p">:</span> <span class="n">sklearn_time</span> <span class="o">/</span> <span class="n">dask_time</span>
        <span class="p">}</span>
        
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Scikit-learn: </span><span class="si">{</span><span class="n">sklearn_time</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">s"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Dask: </span><span class="si">{</span><span class="n">dask_time</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">s"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Speedup: </span><span class="si">{</span><span class="n">sklearn_time</span><span class="o">/</span><span class="n">dask_time</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">x"</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">results</span>
</code></pre></div></div>
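
<p>A single wall-clock measurement is noisy, and <code class="language-plaintext highlighter-rouge">time.time()</code> can jump if the system clock is adjusted. One way to get steadier numbers is a best-of-N timer built on <code class="language-plaintext highlighter-rouge">time.perf_counter()</code>. The helper below is a sketch; <code class="language-plaintext highlighter-rouge">best_of</code> is a hypothetical name and the workload is a stand-in for a model fit.</p>

```python
import time


def best_of(fn, repeats=3):
    """Run fn() `repeats` times and return (best_seconds, last_result).

    The minimum is reported because it is the run least disturbed by other
    activity on the machine.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


# Stand-in workload; in the benchmark above this would be a call such as
# lambda: KMeans(n_clusters=5, random_state=42).fit(X)
seconds, total = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3: {seconds:.4f}s")
```

<p>For Dask, it is also worth timing the array construction and the fit separately, since the two costs scale differently.</p>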

<h2 id="best-practices">Best Practices</h2>

<h3 id="1-data-preprocessing">1. Data Preprocessing</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">preprocess_for_clustering</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="s">"""
    Best practices for data preprocessing
    """</span>
    <span class="c1"># Handle missing values
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span>
    
    <span class="c1"># Standardize features
</span>    <span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
    <span class="n">scaler</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span>
    <span class="n">df_scaled</span> <span class="o">=</span> <span class="n">scaler</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    
    <span class="c1"># Remove outliers (optional): after standardization each value already is a z-score
</span>    <span class="n">z_scores</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">df_scaled</span><span class="p">)</span>
    <span class="n">df_clean</span> <span class="o">=</span> <span class="n">df_scaled</span><span class="p">[(</span><span class="n">z_scores</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">).</span><span class="nb">all</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)]</span>
    
    <span class="k">return</span> <span class="n">df_clean</span>
</code></pre></div></div>

<h3 id="2-optimal-number-of-clusters">2. Optimal Number of Clusters</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">find_optimal_k_distributed</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
    <span class="s">"""
    Sweep k from 1..max_k and record the within-cluster sum of squares (inertia)
    for each fit. Returns the k-range and the inertia curve for elbow inspection.
    """</span>
    <span class="n">inertias</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">k_range</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">max_k</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>

    <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">k_range</span><span class="p">:</span>
        <span class="n">kmeans</span> <span class="o">=</span> <span class="n">DaskKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="n">k</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
        <span class="n">kmeans</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="n">inertias</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">kmeans</span><span class="p">.</span><span class="n">inertia_</span><span class="p">)</span>

    <span class="c1"># Plot elbow curve
</span>    <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span><span class="p">,</span> <span class="s">'bo-'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Number of Clusters (k)'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Inertia'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Elbow Method for Optimal k'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

    <span class="k">return</span> <span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span>


<span class="k">def</span> <span class="nf">find_elbow_point</span><span class="p">(</span><span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span><span class="p">):</span>
    <span class="s">"""
    Pick the k whose point on the inertia curve is farthest (in perpendicular
    distance) from the straight line connecting the first and last points.
    A simple, robust heuristic for elbow detection that avoids a manual eyeball.
    """</span>
    <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
    <span class="n">points</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span>
    <span class="n">line_vec</span> <span class="o">=</span> <span class="n">points</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">points</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">line_vec_norm</span> <span class="o">=</span> <span class="n">line_vec</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">line_vec</span><span class="p">)</span>
    <span class="n">vec_from_first</span> <span class="o">=</span> <span class="n">points</span> <span class="o">-</span> <span class="n">points</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">scalar_proj</span> <span class="o">=</span> <span class="n">vec_from_first</span> <span class="o">@</span> <span class="n">line_vec_norm</span>
    <span class="n">proj_points</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">outer</span><span class="p">(</span><span class="n">scalar_proj</span><span class="p">,</span> <span class="n">line_vec_norm</span><span class="p">)</span> <span class="o">+</span> <span class="n">points</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">distances</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">points</span> <span class="o">-</span> <span class="n">proj_points</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">k_range</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">distances</span><span class="p">))])</span>
</code></pre></div></div>
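
<p>The perpendicular-distance heuristic above works on raw (k, inertia) coordinates. Inertia is often orders of magnitude larger than k, so a common variant first rescales both axes to [0, 1] so that neither dominates the distance. The following is a pure-Python sketch of that variant; <code class="language-plaintext highlighter-rouge">find_elbow_normalized</code> is a hypothetical name, the inertia values are made up, and <code class="language-plaintext highlighter-rouge">k_range</code> is assumed to be sorted in ascending order.</p>

```python
import math


def find_elbow_normalized(k_range, inertias):
    """Elbow by max distance to the chord, after scaling both axes to [0, 1].

    Assumes k_range is sorted in ascending order.
    """
    k_min, k_max = min(k_range), max(k_range)
    i_min, i_max = min(inertias), max(inertias)
    if k_max == k_min or i_max == i_min:
        return k_range[0]  # degenerate curve: nothing to choose between

    xs = [(k - k_min) / (k_max - k_min) for k in k_range]
    ys = [(v - i_min) / (i_max - i_min) for v in inertias]

    # Chord from the first to the last point of the normalized curve
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    chord_len = math.hypot(x1 - x0, y1 - y0)

    def distance_to_chord(x, y):
        # |cross product| / |chord| is the perpendicular distance to the line
        return abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) / chord_len

    distances = [distance_to_chord(x, y) for x, y in zip(xs, ys)]
    return k_range[distances.index(max(distances))]


# Made-up inertia curve that flattens after k = 3
k_values = [1, 2, 3, 4, 5, 6]
inertia_values = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0]
elbow_k = find_elbow_normalized(k_values, inertia_values)
print(elbow_k)  # 3
```

<p>Either version is a heuristic: when the curve has no clear bend, treat the result as a starting point and check it against domain knowledge or a second metric.</p>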

<h3 id="3-memory-management">3. Memory Management</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">optimize_memory_usage</span><span class="p">():</span>
    <span class="s">"""
    Tips for optimizing memory usage in distributed k-means
    """</span>
    
    <span class="c1"># 1. Use appropriate chunk sizes
</span>    <span class="n">chunk_size</span> <span class="o">=</span> <span class="s">"100MB"</span>  <span class="c1"># Adjust based on available memory
</span>    
    <span class="c1"># 2. Use float32 instead of float64 when possible
</span>    <span class="n">dtype</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">float32</span>
    
    <span class="c1"># 3. Persist intermediate results strategically
</span>    <span class="c1"># df.persist()  # Only when data will be reused multiple times
</span>    
    <span class="c1"># 4. Use garbage collection
</span>    <span class="kn">import</span> <span class="nn">gc</span>
    <span class="n">gc</span><span class="p">.</span><span class="n">collect</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="p">{</span>
        <span class="s">'chunk_size'</span><span class="p">:</span> <span class="n">chunk_size</span><span class="p">,</span>
        <span class="s">'dtype'</span><span class="p">:</span> <span class="n">dtype</span>
    <span class="p">}</span>
</code></pre></div></div>
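
<p>The chunk-size and dtype tips are easier to apply with the arithmetic written out. The sketch below computes how many rows fit in a 100 MB chunk and how much memory a dense array needs as <code class="language-plaintext highlighter-rouge">float64</code> versus <code class="language-plaintext highlighter-rouge">float32</code>. The helper names are hypothetical, and 100 MB is taken as 10<sup>8</sup> bytes, which is an assumption.</p>

```python
BYTES_PER_VALUE = {"float64": 8, "float32": 4}


def rows_per_chunk(target_bytes, n_features, dtype):
    """How many full rows of n_features values fit in target_bytes."""
    return target_bytes // (n_features * BYTES_PER_VALUE[dtype])


def dataset_bytes(n_samples, n_features, dtype):
    """Memory needed to hold the full dense array."""
    return n_samples * n_features * BYTES_PER_VALUE[dtype]


target = 100 * 10**6  # 100 MB, taken as 10^8 bytes (assumption)
n_features = 10

rows64 = rows_per_chunk(target, n_features, "float64")
rows32 = rows_per_chunk(target, n_features, "float32")
print(rows64, rows32)  # 1250000 2500000

# A 1,000,000 x 10 array: 80 MB as float64, 40 MB as float32
print(dataset_bytes(1_000_000, n_features, "float64") / 10**6)  # 80.0
print(dataset_bytes(1_000_000, n_features, "float32") / 10**6)  # 40.0
```

<p>Halving the dtype width doubles the rows per chunk, so the same memory budget covers the data with half as many tasks.</p>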

<h2 id="conclusion">Conclusion</h2>

<p>Distributed k-means clustering is essential for handling large-scale datasets that exceed single-machine capabilities. Both PySpark and Dask offer robust solutions:</p>

<p><strong>PySpark MLlib</strong> is ideal when:</p>
<ul>
  <li>You work with very large datasets (&gt;1TB)</li>
  <li>You need to integrate with an existing Spark ecosystem</li>
  <li>You need production-grade fault tolerance</li>
</ul>

<p><strong>Dask</strong> is preferred when:</p>
<ul>
  <li>You work in Python-centric workflows</li>
  <li>You need interactive development</li>
  <li>You need to integrate with existing NumPy/Pandas code</li>
</ul>

<p><strong>Key Takeaways:</strong></p>

<ol>
  <li><strong>Preprocessing</strong> is crucial for distributed clustering success</li>
  <li><strong>Chunk size</strong> optimization significantly impacts performance</li>
  <li><strong>Initialization methods</strong> (k-means++) are important for convergence</li>
  <li><strong>Monitoring</strong> convergence and performance metrics is essential</li>
  <li><strong>Memory management</strong> becomes critical at scale</li>
</ol>
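
<p>To illustrate the initialization takeaway, here is a minimal pure-Python sketch of k-means++ seeding: the first center is picked uniformly at random, and each later center is sampled with probability proportional to its squared distance from the nearest center chosen so far. The function names and toy data are hypothetical, and the sketch assumes <code class="language-plaintext highlighter-rouge">k</code> is no larger than the number of distinct points. Library implementations add refinements that are not shown here.</p>

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus_init(points, k, seed=0):
    """Pick k initial centers from points using k-means++ seeding.

    Assumes k is no larger than the number of distinct points.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # Already-chosen points have weight 0, so they are not picked again
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Three tight, well-separated groups (toy data)
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (10.0, 10.0), (10.1, 10.0), (10.0, 10.1),
    (-10.0, 10.0), (-10.1, 10.0), (-10.0, 10.1),
]
initial_centers = kmeans_plus_plus_init(points, k=3, seed=0)
print(initial_centers)
```

<p>With well-separated groups like these, the distance weighting tends to place the initial centers in different groups, which is why this seeding usually needs fewer iterations than uniform random starts.</p>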

<p>The choice between frameworks depends on your specific use case, data size, and existing infrastructure. Both approaches can handle datasets that would be impossible to process on a single machine, making k-means clustering accessible for big data applications.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Final example: Complete pipeline
</span><span class="k">def</span> <span class="nf">complete_distributed_kmeans_pipeline</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">framework</span><span class="o">=</span><span class="s">'dask'</span><span class="p">):</span>
    <span class="s">"""
    Complete pipeline for distributed k-means clustering
    """</span>
    <span class="k">if</span> <span class="n">framework</span> <span class="o">==</span> <span class="s">'dask'</span><span class="p">:</span>
        <span class="n">client</span> <span class="o">=</span> <span class="n">create_dask_client</span><span class="p">()</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Load and preprocess data; preprocess_for_clustering relies on pandas and
</span>            <span class="c1"># scikit-learn, so materialize the Dask DataFrame before calling it
</span>            <span class="n">df</span> <span class="o">=</span> <span class="n">dd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">)</span>
            <span class="n">X</span> <span class="o">=</span> <span class="n">preprocess_for_clustering</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">compute</span><span class="p">())</span>
            
            <span class="c1"># Find optimal k
</span>            <span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span> <span class="o">=</span> <span class="n">find_optimal_k_distributed</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
            <span class="n">optimal_k</span> <span class="o">=</span> <span class="n">find_elbow_point</span><span class="p">(</span><span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span><span class="p">)</span>
            
            <span class="c1"># Train final model
</span>            <span class="n">kmeans</span> <span class="o">=</span> <span class="n">DaskKMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="n">optimal_k</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
            <span class="n">labels</span> <span class="o">=</span> <span class="n">kmeans</span><span class="p">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
            
            <span class="k">return</span> <span class="n">kmeans</span><span class="p">,</span> <span class="n">labels</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="n">client</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
    
    <span class="k">elif</span> <span class="n">framework</span> <span class="o">==</span> <span class="s">'pyspark'</span><span class="p">:</span>
        <span class="n">spark</span> <span class="o">=</span> <span class="n">create_spark_session</span><span class="p">()</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Select k with the same elbow scan as the Dask branch. Note that
</span>            <span class="c1"># toPandas() collects the full dataset on the driver, so run the scan
</span>            <span class="c1"># on a sample when the data does not fit in driver memory.
</span>            <span class="n">df_spark</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">csv</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">inferSchema</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
            <span class="n">X</span> <span class="o">=</span> <span class="n">df_spark</span><span class="p">.</span><span class="n">toPandas</span><span class="p">().</span><span class="n">values</span>
            <span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span> <span class="o">=</span> <span class="n">find_optimal_k_distributed</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
            <span class="n">optimal_k</span> <span class="o">=</span> <span class="n">find_elbow_point</span><span class="p">(</span><span class="n">k_range</span><span class="p">,</span> <span class="n">inertias</span><span class="p">)</span>

            <span class="n">model</span><span class="p">,</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">distributed_kmeans_pyspark</span><span class="p">(</span>
                <span class="n">spark</span><span class="p">,</span> <span class="n">data_path</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">optimal_k</span>
            <span class="p">)</span>
            <span class="k">return</span> <span class="n">model</span><span class="p">,</span> <span class="n">predictions</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="n">spark</span><span class="p">.</span><span class="n">stop</span><span class="p">()</span>
</code></pre></div></div>]]></content><author><name>Nilesh Patil</name></author><category term="blog" /><category term="machine-learning" /><category term="clustering" /><category term="distributed-computing" /><category term="python" /><category term="pyspark" /><category term="dask" /><summary type="html"><![CDATA[Implementing scalable k-means clustering using distributed computing frameworks]]></summary></entry><entry><title type="html">Galactic Morphology using Deep-Learning</title><link href="https://www.nilesh42.science/posts/galactic-morphology-using-deep-learning/" rel="alternate" type="text/html" title="Galactic Morphology using Deep-Learning" /><published>2017-07-26T01:09:55+05:30</published><updated>2017-07-26T05:18:19+05:30</updated><id>https://www.nilesh42.science/posts/galactic-morphology-using-deep-learning</id><content type="html" xml:base="https://www.nilesh42.science/posts/galactic-morphology-using-deep-learning/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>Astronomy has historically been one of the most data-intensive fields, and a major chunk of this data takes the form of images captured by telescopes, both terrestrial and space-based. The <a href="http://www.sdss.org/">Sloan Digital Sky Survey</a> is a large data project that aims to collate this data from various sources into a coherent picture of the universe.</p>

<p>To quote the project website:</p>

<blockquote>
  <p>The Sloan Digital Sky Survey has created the most detailed three-dimensional maps of the Universe ever made, with deep multi-color images of one third of the sky, and spectra for more than three million astronomical objects. Learn and explore all phases and surveys — past, present, and future — of the SDSS.</p>
</blockquote>

<p>A citizen science project called <a href="https://www.galaxyzoo.org">Galaxy Zoo</a> was launched in 2007. Through this project, thousands of volunteers classified 100k+ images of galaxies. The flow-chart of questions asked of volunteers, as shown on the project website, is as follows.</p>

<p><img src="/images/blog/galaxyzoo/00.galaxyzoo-tree.png" alt="Galaxy Zoo decision tree" class="center-image" height="850px" width="1050px" /></p>

<h2 id="data-description">Data Description</h2>

<p>The dataset consists of 100k+ jpeg images and the corresponding score vector for each image. The score vector has 37 values where each value represents the weighted score from volunteers in the project.</p>

<p>The important point to remember is that the scores aren’t probabilities per se. They are weighted scores, so each one lies between 0 and 1, but the sub-scores for a question don’t necessarily sum to 1.</p>

<p>Each image is of size <code class="language-plaintext highlighter-rouge">424×424×3</code> and each value lies in the range <code class="language-plaintext highlighter-rouge">0</code>–<code class="language-plaintext highlighter-rouge">255</code>. A good practice is to rescale the data. In our experiments, we compute $\mu_\text{channel}$ and $\sigma_\text{channel}$ over the full dataset, then normalize each channel by subtracting its $\mu$ and dividing by its $\sigma$. The channel values do not carry the physical units of data collection (they are standardized to the accepted image format range of <code class="language-plaintext highlighter-rouge">0</code>–<code class="language-plaintext highlighter-rouge">255</code>), so the normalization is more of a hack for better gradient updates than a domain-knowledge-based modification.</p>
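<p>The per-channel normalization described above can be sketched in NumPy as follows. This is a minimal illustration rather than the code used in the experiments: the <code class="language-plaintext highlighter-rouge">images</code> array is a random stand-in for the dataset, and the small 32×32 size is an assumption made only to keep the example fast.</p>

```python
import numpy as np

# Hypothetical stand-in for the dataset: 8 random RGB images with values in 0-255.
# (Small 32x32 images keep the sketch fast; the real images are 424x424x3.)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 32, 32, 3)).astype(np.float64)

# One mean and one standard deviation per colour channel, computed over
# every image and every pixel position (axes 0, 1 and 2).
channel_mean = images.mean(axis=(0, 1, 2))
channel_std = images.std(axis=(0, 1, 2))

# Broadcasting applies the matching statistics to each channel.
normalized = (images - channel_mean) / channel_std

print(normalized.mean(axis=(0, 1, 2)))  # close to 0 for each channel
print(normalized.std(axis=(0, 1, 2)))   # close to 1 for each channel
```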

<p>A few sample images from the dataset:</p>

<p><img src="/images/blog/galaxyzoo/01.galaxies.png" alt="Galaxy sample" class="center-image" /></p>

<p>These images are read in as <code class="language-plaintext highlighter-rouge">numpy</code> arrays in Python with the following representation:</p>

<p><img src="/images/blog/galaxyzoo/02.numpy_array.png" alt="Numpy array" class="center-image" height="250px" width="350px" /></p>

<h2 id="fully-convolutional-classifier">Fully Convolutional Classifier</h2>

<p>A convolutional network takes in an image array as input, extracts the features from this array that best represent the task at hand, and then produces a classification or regression output. Standard classification models use a one-hot (one-vs-rest) encoding of the output: the correct class is assigned <code class="language-plaintext highlighter-rouge">1</code> while all other classes in the dataset are assigned <code class="language-plaintext highlighter-rouge">0</code>. The output vector is of length <code class="language-plaintext highlighter-rouge">c</code>, where <code class="language-plaintext highlighter-rouge">c</code> is the total number of classes in the dataset.</p>

<p>In the galaxy morphology task, we use standard <code class="language-plaintext highlighter-rouge">.jpeg</code> images to learn the shape attributes of each galaxy as a vector of length 37 that describes its properties. We set this up as a regression task, since our ground truth is a weighted version of the votes gathered from volunteers.</p>
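<p>To make the difference between the two kinds of target concrete, here is a small sketch. The class count, the labels and the random scores are all hypothetical placeholders; only the length of 37 comes from the dataset description.</p>

```python
import numpy as np

# Classification: each label becomes a row of length c with a single 1.
num_classes = 4                      # hypothetical c
labels = np.array([2, 0, 3])         # hypothetical class indices
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# Regression (this post): the target is a real-valued score vector of length 37.
# Random values stand in for the weighted volunteer scores here.
rng = np.random.default_rng(0)
score_target = rng.uniform(0.0, 1.0, size=37)
print(score_target.shape)  # (37,)
```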

<p>The model takes in a normalized array representing the input image. This array passes through the following layers, stacked one after another:</p>

<ol>
  <li>
    <p><em>Convolutional layer</em>: It consists of a set of learnt features. In standard modeling, the features that a model uses are usually handcrafted, i.e. some transformation of the raw input data designed by a domain expert. In a convolutional layer, the kernels are instead expected to learn features that are optimal for the task at hand. Since the kernels are learnt with respect to the output, the features generated at each step should ideally be the best representation of the input provided at that step.</p>
  </li>
  <li>
    <p><em>Pooling layer</em>: The pooling layer reduces the size of the incoming representation by applying a downsampling function. The <code class="language-plaintext highlighter-rouge">max-pooling</code> layer takes the maximum over a given window of the array as the representative value for that window. Similarly, <code class="language-plaintext highlighter-rouge">average-pooling</code> takes the average over the window.</p>
  </li>
  <li>
    <p><em>Activation</em>: Activation functions introduce non-linearity into the model. This layer applies a given function to each element of the input array. Standard activation functions used in models are ‘relu’, ‘softmax’, ‘sigmoid’, ‘tanh’ etc.</p>
  </li>
  <li>
    <p><em>Dropout</em>: a regularization technique developed to reduce overfitting in deep neural networks. At training time, the activations of a randomly chosen fraction of neurons are set to <code class="language-plaintext highlighter-rouge">zero</code>; at prediction time, no activations are dropped and the learnt weights are instead multiplied by the keep-probability <code class="language-plaintext highlighter-rouge">p</code>. Below, (a) shows a standard fully-connected network and (b) shows the same network with dropout applied. At each training step a different subset of activations is zeroed, which prevents any single neuron from dominating the learned representation.</p>

    <p><img src="/images/blog/galaxyzoo/03.droput_representation.png" alt="Dropout network" class="center-image" height="250px" width="500px" /></p>
  </li>
  <li>
    <p><em>Batch normalization</em>: a major problem during training is that the distribution of inputs to each successive layer shifts as the parameters of preceding layers update. This <em>internal covariate shift</em> forces small learning rates and slow convergence. Batch normalization addresses it by normalizing each channel’s activations to $\mu_X = 0$ and $\sigma_X = 1$ across the current mini-batch, then applying a learned affine transform:</p>

\[X_\text{out} = \gamma \cdot \frac{X_\text{in} - \mu_X}{\sigma_X} + \beta\]

    <p>Here $\mu_X$ and $\sigma_X$ are computed channelwise. The effect is that subsequent layers see a more stable input distribution, which permits larger learning rates and faster training.</p>
  </li>
</ol>
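<p>The pooling, activation, dropout and batch-normalization steps listed above can be sketched in plain NumPy. This is an illustrative sketch, not the layers used in our models: the function names, the 2×2 pooling window, the random feature maps and the small <code class="language-plaintext highlighter-rouge">eps</code> added for numerical stability are all assumptions made for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool_2x2(x):
    """Max-pool a (batch, height, width, channels) array with a 2x2 window, stride 2.
    Height and width are assumed to be even."""
    b, h, w, c = x.shape
    blocks = x.reshape(b, h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(2, 4))

def relu(x):
    """Element-wise activation: negative values become zero."""
    return np.maximum(x, 0.0)

def dropout_train(x, keep_prob):
    """Training-time dropout: zero each activation with probability 1 - keep_prob."""
    mask = keep_prob > rng.random(x.shape)
    return x * mask

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each channel over the mini-batch, then apply the affine transform.
    eps is an assumed stabilizer that is not part of the formula above."""
    mu = x.mean(axis=(0, 1, 2))
    sigma = np.sqrt(x.var(axis=(0, 1, 2)) + eps)
    return gamma * (x - mu) / sigma + beta

# Hypothetical feature maps: batch of 4, 8x8 spatial size, 3 channels.
features = rng.normal(loc=2.0, scale=3.0, size=(4, 8, 8, 3))

pooled = max_pool_2x2(features)                     # shape (4, 4, 4, 3)
activated = relu(pooled)
dropped = dropout_train(activated, keep_prob=0.5)
normed = batch_norm(features, gamma=np.ones(3), beta=np.zeros(3))

print(pooled.shape)
print(normed.mean(axis=(0, 1, 2)))  # close to 0 per channel
```

<p>With <code class="language-plaintext highlighter-rouge">gamma</code> set to ones and <code class="language-plaintext highlighter-rouge">beta</code> to zeros the affine transform is the identity, so the output keeps zero mean and unit variance per channel; during training both would be learned.</p>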

<h3 id="setup">Setup</h3>

<p>The layers above are stacked into a module and we experiment with several network structures, starting from well-established architectures and moving toward more recently published ones. The final layers are <code class="language-plaintext highlighter-rouge">Dense</code> (fully connected) layers; their output is compared with the expected output to compute the loss for each observation. The loss is back-propagated to update the convolutional kernel weights and the dense layer weights, with the expectation that, given enough data, the network learns features that are optimal for the task at hand.</p>

<p>In our setup, features are extracted successively from the raw galaxy image and a regression head matches the human-generated score vector for the same image. The score is a vector of length <code class="language-plaintext highlighter-rouge">37</code>, where each entry encodes one physical aspect of the galaxy — together describing its morphology.</p>
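<p>As a rough illustration of how a regression head can be scored against the length-37 vector, the sketch below computes a squared-error loss. The choice of mean squared error is an assumption made for the example, since the loss is not specified above, and the targets and predictions are random placeholders.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 5 galaxies, 37 scores each (random stand-ins).
target = rng.uniform(0.0, 1.0, size=(5, 37))
predicted = np.clip(target + rng.normal(scale=0.1, size=target.shape), 0.0, 1.0)

# Assumed loss: mean squared error per observation, then averaged over the batch.
per_image_mse = ((predicted - target) ** 2).mean(axis=1)
batch_rmse = np.sqrt(per_image_mse.mean())

print(per_image_mse.shape)  # (5,)
print(batch_rmse)
```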

<h2 id="references">References</h2>

<ol>
  <li>Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., &amp; Salakhutdinov, R. (2014). <a href="https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a>. <em>Journal of Machine Learning Research</em>, 15.</li>
  <li>Ioffe, S. &amp; Szegedy, C. (2015). <a href="https://arxiv.org/abs/1502.03167">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a>. <em>Proceedings of the 32nd International Conference on Machine Learning</em>.</li>
</ol>]]></content><author><name>Nilesh Patil</name></author><category term="blog" /><category term="deep learning" /><category term="galaxy" /><category term="computer-vision" /><category term="machine learning" /><category term="neural networks" /><summary type="html"><![CDATA[Training a deep neural net to understand galactic structure]]></summary></entry><entry><title type="html">Characterizing &amp;amp; analyzing networks : NYC taxi data</title><link href="https://www.nilesh42.science/posts/transportation-graph-nyc-taxi-data/" rel="alternate" type="text/html" title="Characterizing &amp;amp; analyzing networks : NYC taxi data" /><published>2017-03-15T01:09:55+05:30</published><updated>2017-03-15T05:18:19+05:30</updated><id>https://www.nilesh42.science/posts/transportation-graph-nyc-taxi-data</id><content type="html" xml:base="https://www.nilesh42.science/posts/transportation-graph-nyc-taxi-data/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>Transportation networks offer a fascinating opportunity to identify a local population’s travel habits and daily routines, and a usage-data-driven way to inform city-planning decisions. In our analysis, we explore the travel patterns of New York City residents using the 146+ million taxi trips taken in 2015. The complete code, visualizations &amp; reports are available in the <a href="https://github.com/nilesh-patil/TransportationFlowNetwork"><strong>Github Repository</strong></a>.</p>

<h2 id="prior-work">Prior work</h2>

<p>GPS-based transportation networks have been studied in detail for traffic flow analysis and for determining social dynamics<sup>[1]</sup>. Bike sharing datasets have been used for clustering locations based on their usage profile<sup>[2]</sup> and for predicting bike demand<sup>[3]</sup>. GPS-based taxi datasets have been used to identify mobility patterns in Shanghai, China<sup>[4]</sup>.</p>

<p>In<sup>[4]</sup>, the trip distribution is characterized as a combination of 3 independent types, and non-negative matrix factorization is used to identify 3 patterns from 1.58 million trips in Shanghai, China. This is the core of our approach, as we are attempting to characterize taxi usage in New York City with a similar dataset. Prior analysis has also been done to produce transportation network graphs from street geometries<sup>[5]</sup> and subway maps<sup>[6]</sup>.</p>
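<p>For readers unfamiliar with non-negative matrix factorization, here is a minimal sketch using the standard multiplicative update rules. It is not the procedure from<sup>[4]</sup>: the function <code class="language-plaintext highlighter-rouge">nmf</code>, the synthetic 24×6 matrix and the iteration count are assumptions chosen for illustration; only the rank of 3 mirrors the three patterns mentioned above.</p>

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)
    using multiplicative updates that reduce the Frobenius reconstruction error."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical input: 24 hourly rows x 6 zones, built from 3 hidden patterns.
rng = np.random.default_rng(1)
true_W = rng.random((24, 3))
true_H = rng.random((3, 6))
trips = true_W @ true_H

W, H = nmf(trips, rank=3)
relative_error = np.linalg.norm(trips - W @ H) / np.linalg.norm(trips)
print(W.shape, H.shape, relative_error)
```

<p>Each row of <code class="language-plaintext highlighter-rouge">H</code> can be read as one pattern and each row of <code class="language-plaintext highlighter-rouge">W</code> as the non-negative mix of patterns that reconstructs the corresponding observation.</p>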

<h2 id="data-description">Data Description</h2>

<p>a)  <em>Raw data :</em></p>

<p>New York City’s Taxi &amp; Limousine Commission has made the taxi-trips dataset available for public use since 2009<sup>[7]</sup>, and we used it for our analysis. The complete dataset contains approximately 150 million trips per year. Each row represents one trip, with features for the starting and stopping points, distance travelled, taxi charge, time taken etc. We use the 2015 dataset, which contains 146,112,990 trips in total. To map the geolocation data to census tracts, we used the extremely helpful NYC land-use dataset<sup>[8]</sup>.</p>

<p>b)  <em>Data transformation :</em></p>

<p>We use the following variables for each trip:</p>
<ul>
  <li>Trip starting timestamp</li>
  <li>Start point (Lat/Long)</li>
  <li>Trip stopping timestamp</li>
  <li>Stop point (Lat/Long)</li>
  <li>Charges</li>
</ul>

<p>Using these trips, we build a directed graph with the start and stop locations of each trip as nodes. As an additional condition, we only use pairs of locations with a high number of trips (more than 500 in the year for a given pair). From prior work we gathered that rounding location coordinates to 2 decimal places is also an option. Given our difficulties analyzing the dataset with 40,000+ nodes, we are now reducing the network using 2 separate approaches:</p>

<ul>
  <li>One node for 5 Manhattan blocks</li>
  <li>Using the 6 million most frequent trips to get the 1275 most frequently travelled edges</li>
</ul>

<p>In our current network, each node represents the 200m x 200m area around it and each edge represents the total number of trips between two nodes in a given year.</p>
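<p>The graph construction described above (rounded coordinates as nodes, trip counts as directed edge weights, and a minimum-trips filter) can be sketched as follows. The function <code class="language-plaintext highlighter-rouge">build_trip_graph</code>, its parameters and the toy coordinates are hypothetical; this is an illustration of the idea, not the code from the repository.</p>

```python
from collections import Counter

def build_trip_graph(trips, decimals=2, min_trips=500):
    """Count trips between rounded (lat, lon) locations and keep busy edges.

    trips: iterable of ((start_lat, start_lon), (stop_lat, stop_lon)) pairs.
    Returns a dict mapping (start_node, stop_node) to its trip count.
    """
    counts = Counter()
    for (start_lat, start_lon), (stop_lat, stop_lon) in trips:
        start = (round(start_lat, decimals), round(start_lon, decimals))
        stop = (round(stop_lat, decimals), round(stop_lon, decimals))
        counts[(start, stop)] += 1
    # Keep only pairs with more than min_trips trips, as described above.
    return {edge: n for edge, n in counts.items() if n > min_trips}

# Hypothetical toy data: one busy pair and one rare pair.
busy = [((40.7506, -73.9935), (40.7580, -73.9855))] * 3
rare = [((40.6413, -73.7781), (40.7128, -74.0060))]
edges = build_trip_graph(busy + rare, decimals=2, min_trips=2)
print(edges)
```

<p>Because the key is the ordered pair (start, stop), a trip from A to B and a trip from B to A are counted on separate edges, which is what makes the graph directed.</p>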

<p>We create features for month, day, weekday, period of the day etc. from the timestamp. An issue we faced during our initial analysis was that, due to the geography of the network itself, multiple nodes represented the same location because of multiple entrances (e.g. Penn Station has multiple entries and exits, and our network had multiple nodes representing essentially the same real-world location).</p>

<p>Same location : multiple close nodes</p>

<p><img src="/images/blog/graphs/nycTaxiData/image2.png" alt="nycTaxiDataimage2" class="center-image" height="300px" width="300px" /></p>

<p>We finally decided to merge our dataset with US Census Bureau census tracts, which removed the above problem. The final network has 580+ nodes: each census tract is a node, and the number of trips between two census tracts is represented as an edge.</p>

<h2 id="exploratory-analysis">Exploratory Analysis</h2>

<p>1 . The number of trips taken in each month (fig-i) peaks between March and May and drops substantially from June onwards. We attribute this to the weather pattern, as commuters are expected to avoid walking long distances in low temperatures or rainy weather.</p>

<p><img src="/images/blog/graphs/nycTaxiData/image3.png" alt="Monthly trip volume across the 2015 calendar year" class="center-image" height="300px" width="850px" /></p>
<center>(fig-i) Trips per month, 2015</center>

<p>2 . For our full dataset, the degree distribution is shown in the first plot, whereas the second plot shows the degree distribution for the graph generated using only edges with at least 500 trips in 2015.</p>

<p><img src="/images/blog/graphs/nycTaxiData/image4.png" alt="Degree distribution: full graph (left) and filtered to edges with ≥500 trips (right)" class="center-image" height="300px" width="850px" /></p>
<center>(fig-ii) Degree distribution, full vs. filtered network</center>
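<p>A degree distribution of this kind can be computed from an edge list in a few lines. The sketch below uses a hypothetical toy edge list and a hypothetical helper <code class="language-plaintext highlighter-rouge">degree_distribution</code>; it counts total degree (in plus out), which is an assumption about how the plots were produced.</p>

```python
from collections import Counter

# Hypothetical directed edge list: (origin, destination, trips in the year).
edges = [
    ("A", "B", 900), ("A", "C", 650), ("B", "A", 700),
    ("C", "A", 120), ("C", "B", 40), ("D", "A", 15),
]

def degree_distribution(edge_list):
    """Map each total degree (in + out) to the number of nodes with that degree."""
    degree = Counter()
    for origin, destination, _ in edge_list:
        degree[origin] += 1
        degree[destination] += 1
    return Counter(degree.values())

full = degree_distribution(edges)
filtered = degree_distribution([e for e in edges if e[2] >= 500])
print(sorted(full.items()))      # [(1, 1), (3, 2), (5, 1)]
print(sorted(filtered.items()))  # [(1, 1), (2, 1), (3, 1)]
```

<p>Note that a node whose edges all fall below the threshold (node D here) disappears from the filtered distribution entirely.</p>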

<p>3 . The heat-map (fig-iii) shows the relative trip density for each hour of the day in 2015. From this, we conclude that the busiest hours are 6AM to 9AM, and we attribute most of this traffic to users on their daily work commute. There is also a remarkable increase in density between 12AM and 4AM on weekends. We are looking for an approach to perform a similar analysis on the network structure generated from our subset, to create a temporal traffic-density visualization.</p>

<p><img src="/images/blog/graphs/nycTaxiData/image5.png" alt="Hourly trip-density heatmap across the year — rows are days of the week, columns are hours of the day" class="center-image" height="1000px" width="2000px" /></p>
<center>(fig-iii) Trip density by hour of day and day of week</center>
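<p>The counts behind a heat-map like this are a tally of trips per (weekday, hour) cell. A minimal sketch using only the standard library is shown below; the timestamps and the timestamp format are hypothetical examples, not rows from the dataset.</p>

```python
from collections import Counter
from datetime import datetime

# Hypothetical pickup timestamps; the real data has one per trip.
pickups = [
    "2015-03-02 08:15:00",  # Monday
    "2015-03-02 08:40:00",  # Monday
    "2015-03-07 01:30:00",  # Saturday
    "2015-03-08 02:05:00",  # Sunday
]

def hourly_density(timestamps):
    """Count trips per (weekday, hour) cell; weekday 0 is Monday."""
    cells = Counter()
    for stamp in timestamps:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        cells[(t.weekday(), t.hour)] += 1
    return cells

density = hourly_density(pickups)
print(density[(0, 8)])  # trips on Mondays during the 8 AM hour: 2
```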

<p>4 . We analyzed the cost vs duration relationship for the trips and found an abnormally large number of constant-cost trips. We attribute these trips to:</p>

<ul>
  <li>
    <p>Tips being rounded off to the nearest 5/10</p>
  </li>
  <li>
    <p>Traffic delays that lengthen the duration of a trip</p>
  </li>
</ul>

<p><img src="/images/blog/graphs/nycTaxiData/image6.png" alt="Cost-versus-duration scatter showing constant-cost outliers attributable to rounded tips and traffic delays" class="center-image" height="500px" width="500px" /></p>

<h2 id="full-network-analysis">Full Network analysis</h2>

<p><img src="/images/blog/graphs/nycTaxiData/image7.png" alt="Taxi-Graph" class="center-image" height="750px" width="750px" /></p>

<p>The full network is laid out approximately on its actual geographic location, with out-degree mapped to node size &amp; number of trips mapped to edge thickness. We observed that:</p>

<ul>
  <li>The suburbs are served less than Manhattan, the Upper East/West Side and downtown</li>
  <li>Transportation hubs are also network hubs, and office areas are the next most central nodes</li>
  <li>Surprisingly, the East Village &amp; Lower East Side are also among the least connected parts of the complete network, even though these areas are not geographically separated like the suburbs</li>
</ul>

<p><img src="/images/blog/graphs/nycTaxiData/image8.PNG" alt="In-degree and out-degree versus total trips, split by trip-volume tier" class="center-image" height="750px" width="750px" /></p>

<p>When we divide our nodes into two subcategories, those with 500 or more trips and those with fewer than 500, and plot in-degree and out-degree against the total number of trips, a stark contrast appears. The top two graphs in the figure above are for nodes with 500 or more trips: the blue graph marks the in-degree to trips ratio and the red graph the out-degree. The bottom two graphs are for nodes with fewer than 500 trips.</p>

<p>For number of trips &gt;= 500, most of the outliers on the far right of the graphs are physically located at inner-city attractions such as Madison Square Garden and Penn Station. This means that a large number of people come to these attractions from a relatively small number of places, and most of these origins are located in Manhattan (e.g. 250,000 trips coming from about 200 places, or on average 1250 trips from a single locale).</p>

<p>For number of trips &gt;= 500, most of the outliers on the far right of the graphs are physically located at airports (LaGuardia as well as JFK), and their trips-versus-degree ratio is much smaller, meaning that a small number of people arrive from all sorts of places. We can also easily identify places with low connectivity by looking at the “tail” of the plot: we found that the smaller this ratio is, the farther away the node is from Manhattan.</p>
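<p>The trips-versus-degree ratio discussed above is the total number of arriving trips divided by the number of distinct origins. The sketch below computes it on a hypothetical toy edge list (node names and counts are invented) and reproduces the 250,000 / 200 arithmetic from the text.</p>

```python
from collections import defaultdict

# Hypothetical directed edges: (origin, destination, trips in the year).
edges = [
    ("midtown", "penn_station", 1500),
    ("chelsea", "penn_station", 1000),
    ("midtown", "jfk", 30),
    ("chelsea", "jfk", 20),
    ("harlem", "jfk", 10),
]

in_degree = defaultdict(int)
in_trips = defaultdict(int)
for origin, destination, trips in edges:
    in_degree[destination] += 1      # one more distinct origin
    in_trips[destination] += trips   # total arriving trips

trips_per_origin = {node: in_trips[node] / in_degree[node] for node in in_degree}
print(trips_per_origin)  # {'penn_station': 1250.0, 'jfk': 20.0}

# The example quoted above: 250,000 trips from about 200 places.
print(250_000 / 200)  # 1250.0
```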

<p><img src="/images/blog/graphs/nycTaxiData/image9.png" alt="Network laid out by geographic location; node size encodes out-degree, edge thickness encodes trip count" class="center-image" height="750px" width="750px" /></p>

<p>We divided the network into 3 communities using multilevel community detection in igraph. The above plot shows these communities, with node size mapped to the number of trips leaving each node over the whole year. The 3 communities can be described as follows:</p>

<ul>
  <li>The blue labels represent <strong>community A</strong>. Its nodes fall neatly into Manhattan and the adjoining New Jersey areas, which turn out to be the most well-connected nodes. They are well connected to Manhattan (Manhattan as well as New Jersey locations) and to the other two communities (only Manhattan locations)</li>
  <li>The green labels represent <strong>community B</strong>: the locations with the highest taxi connectivity to the northern parts of the city, which in turn is due to having the least public transport connectivity (bus/metro etc.) towards the north in general</li>
  <li>The red labels represent <strong>community C</strong>: north NYC, Queens &amp; the Bronx, which we understand to be in the same community because they have the least connectivity towards the south in general</li>
  <li>We wanted to determine whether the suburb structure found by Dash Nelson &amp; Rae<sup>[11]</sup> using a national dataset holds true at a local level, which is why this result is interesting: based on our primary exploration, our inference is that within a city it doesn’t hold up.</li>
</ul>
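<p>Multilevel community detection works by greedily increasing the modularity of the partition. To show what that score measures, the sketch below computes modularity for a given partition of a small graph. It is a pure-Python illustration and not the igraph call we used: the function <code class="language-plaintext highlighter-rouge">modularity</code>, the toy graph and the choice to treat edges as undirected are assumptions.</p>

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: list of (u, v, weight); community: dict node -> community label.
    Q = sum over communities of (internal weight / m) - (degree sum / 2m)^2.
    """
    m = sum(w for _, _, w in edges)
    internal = defaultdict(float)
    degree_sum = defaultdict(float)
    for u, v, w in edges:
        degree_sum[community[u]] += w
        degree_sum[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical graph: two triangles joined by a single edge.
edges = [
    (1, 2, 1), (2, 3, 1), (1, 3, 1),
    (4, 5, 1), (5, 6, 1), (4, 6, 1),
    (3, 4, 1),
]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
merged = {node: "A" for node in split}

print(modularity(edges, split))   # 5/14, about 0.357
print(modularity(edges, merged))  # 0.0
```

<p>Splitting the two triangles scores higher than lumping every node together, which is the kind of improvement the algorithm searches for when it moves nodes between communities.</p>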

<p><img src="/images/blog/graphs/nycTaxiData/image10.png" alt="Community-detection partitioning of the trip network — each color is one community" class="center-image" height="750px" width="750px" /></p>

<p>We plotted a snapshot of the trips leaving major NYC areas. It shows that Manhattan is the most connected of all, whereas most trips from the Lower East Side, East Village &amp; Brooklyn end up towards the northern side of NYC, with only a small fraction ending up within the community itself.</p>

<h2 id="final-comments">Final comments</h2>

<ul>
  <li>
    <p>Cities are different from other networks in the sense that minor rerouting is usually pretty straightforward i.e. a detour of one block is easy to take and usually does not lead to significant change in cost, time or length of the route.</p>
  </li>
  <li>
    <p>A city like New York won’t have a single critical location apart from the structural hubs (Metro hubs, airports &amp; bus hubs) – this is pretty apparent from the degree centrality analysis. The closest thing to a central location in NYC is its avenues, specifically Broadway &amp; 6th avenue. Broadway runs North to South whereas 6th Avenue runs South to North (one-way routes).</p>
  </li>
  <li>
    <p>Another interesting observation from the analysis is that the East Village &amp; below is similar to the suburbs in terms of taxi usage. This is surprising because, as stated by Tobler, the first law of geography is “everything is related to everything else, but near things are more related than distant things.”<sup>[12]</sup> This first law is the foundation of spatial dependence and spatial autocorrelation, and is used specifically in the inverse distance weighting method for spatial interpolation<sup>[13]</sup>.</p>
  </li>
</ul>

<h2 id="references">References</h2>

<ol>
  <li>P. S. Castro, D. Zhang, C. Chen, S. Li, and G. Pan. From taxi gps traces to social and community dynamics: A survey. ACM Comput. Surv., December 2013.</li>
  <li>C. Etienne and O. Latifa. Model-based count series clustering for bike sharing system usage mining. ACM Trans. Intell. Syst. Technol., July 2014.</li>
  <li>D. Singhvi, S. Singhvi, P. Frazier, S. Henderson, E. Mahony, D. Shmoys, and D. Woodard. Predicting bike usage for new york city’s bike sharing system. In AAAI Workshops, 2015.</li>
  <li>Peng C, Jin X, Wong K-C, Shi M, Lio P (2012) Collective Human Mobility Pattern from Taxi Trips in Urban Area. PLoS ONE 7(4): e34487. doi:10.1371/journal.pone.0034487</li>
  <li>P. Crucitti, V. Latora, and S. Porta. Centrality measures in spatial networks of urban streets. Physical Review E, 73(3):036125, 2006.</li>
  <li>Derrible S (2012) Network Centrality of Metro Systems. PLoS ONE 7(7): e40575. doi:10.1371/journal.pone.0040575</li>
  <li>NYC taxi data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml</li>
  <li>NYC land use: <a href="https://www1.nyc.gov/site/planning/index.page">https://www1.nyc.gov/site/planning/index.page</a></li>
  <li>W. Cui, H. Zhou, H. Qu, P. C. Wong and X. Li, “Geometry-Based Edge Clustering for Graph Visualization,” in <em>IEEE Transactions on visualization and Computer Graphics</em>, vol. 14, no. 6, pp. 1277-1284, Nov.-Dec. 2008 doi: 10.1109/TVCG.2008.135</li>
  <li>Holten, D, &amp; Wijk, J 2009, ‘Force-Directed Edge Bundling for Graph Visualization’, <em>Computer Graphics Forum</em>, 28, 3, pp. 983-990, Business Source Premier, EBSCO<em>host</em>, viewed 12 November 2016.</li>
  <li>Dash Nelson G, Rae A (2016) An Economic Geography of the United States: From Commutes to Megaregions. PLoS ONE 11(11): e0166083. doi:10.1371/journal.pone.0166083</li>
  <li>Tobler W. A computer movie simulating urban growth in the Detroit region. Economic Geography 1970;46: 234–240.</li>
  <li>https://en.wikipedia.org/wiki/Tobler's_first_law_of_geography</li>
</ol>]]></content><author><name>Nilesh Patil</name></author><category term="blog" /><category term="graph" /><category term="network" /><category term="visualization" /><category term="nyc" /><category term="transportation" /><summary type="html"><![CDATA[Analyzing a real world graph : transportation network in NYC]]></summary></entry><entry><title type="html">Working with numpy</title><link href="https://www.nilesh42.science/posts/working-with-numpy/" rel="alternate" type="text/html" title="Working with numpy" /><published>2017-03-04T11:40:55+05:30</published><updated>2017-07-11T00:39:55+05:30</updated><id>https://www.nilesh42.science/posts/working-with-numpy</id><content type="html" xml:base="https://www.nilesh42.science/posts/working-with-numpy/"><![CDATA[<p><a href="http://www.numpy.org">NumPy</a> is a Python library that provides fast computation over arrays (vectors, matrices, tensors). It is faster than the equivalent base-Python because operations are vectorized, and the resulting code stays close to the mathematical notation of the underlying operation — without the bookkeeping and overhead of element-wise loops.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
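<p>As a small illustration of what “vectorized” means here, the sketch below computes the same sum of squares twice: once with an explicit Python loop and once as a single array expression. The variable names are hypothetical and no timing is shown; the point is only that the two forms agree while the second reads like the formula.</p>

```python
import numpy as np

values = np.arange(1, 1001)           # the integers 1 to 1000

# Base-Python version: an explicit loop with a running total.
total_loop = 0
for v in values.tolist():
    total_loop += v * v

# Vectorized version: square every element, then sum, with no Python-level loop.
total_vectorized = int((values ** 2).sum())

print(total_loop, total_vectorized)  # both 333833500
```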

<h2 id="vectors">Vectors</h2>

<h3 id="create-vectors-by-generating-different-sequence-of-numbers">Create vectors by generating different sequence of numbers</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># sequence of integers between given bounds
</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">25</span><span class="p">,</span> <span class="n">step</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># 10 random integers between given bounds
</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">low</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="n">high</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

<span class="c1"># 10 real numbers drawn from standard normal distribution
</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>

<span class="c1"># A vector of length 10 with all zeroes
</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>

<span class="c1"># Another convenient way to generate a vector or even an array of zeroes is as follows:
</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

<span class="c1"># generate sequence of numbers between given bounds &amp; fixed step
</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">stop</span><span class="o">=</span><span class="mi">35</span><span class="p">,</span> <span class="n">step</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[0 0 9 2 3 6 5 4 1 0]
</code></pre></div></div>

<h3 id="single-vector-operations">Single vector operations</h3>

<h4 id="sum-of-a-sequence">Sum of a sequence</h4>

\[\text{Sum} = \displaystyle\sum_{i=1}^{n} x_i\]

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>30
</code></pre></div></div>

<h4 id="adding-a-constant-to-each-element-of-a-vector">Adding a constant to each element of a vector</h4>

\[x_{i,\text{new}} = x_i + c\]

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="mi">2</span>

<span class="n">X_new</span> <span class="o">=</span> <span class="n">x</span><span class="o">+</span><span class="n">c</span>
<span class="k">print</span><span class="p">(</span><span class="n">X_new</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 2  2 11  4  5  8  7  6  3  2]
</code></pre></div></div>

<h4 id="multiplying-every-element-of-a-vector-by-a-constant">Multiplying every element of a vector by a constant</h4>

\[x_{i,\text{new}} = x_i \cdot c\]

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="mi">5</span>

<span class="n">X_new</span> <span class="o">=</span> <span class="n">x</span><span class="o">*</span><span class="n">c</span>
<span class="k">print</span><span class="p">(</span><span class="n">X_new</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 0  0 45 10 15 30 25 20  5  0]
</code></pre></div></div>

<h4 id="reverse-a-vector">Reverse a vector</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">S_new</span> <span class="o">=</span> <span class="n">s</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">S_new</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[33 31 29 27 25 23 21 19 17 15]
</code></pre></div></div>
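<p>One detail worth knowing is that a reversed slice is a view on the original data rather than a copy. A small sketch, assuming <code class="language-plaintext highlighter-rouge">s</code> was built with <code class="language-plaintext highlighter-rouge">np.arange</code> so that it matches the output above:</p>

```python
import numpy as np

# Assumed definition: chosen to match the reversed output printed above
s = np.arange(15, 34, 2)

s_rev = s[::-1]               # a view with a negative stride, no data is copied
s_rev_copy = s[::-1].copy()   # an independent array

print(s_rev)
print(np.shares_memory(s, s_rev), np.shares_memory(s, s_rev_copy))
```

<p>Call <code class="language-plaintext highlighter-rouge">.copy()</code> when the reversed array must not change if <code class="language-plaintext highlighter-rouge">s</code> is modified later.</p>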

<h3 id="calculate-basic-statistical-measures">Calculate basic statistical measures</h3>

<h4 id="mean-mu">Mean ($\mu$)</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">low</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span><span class="n">high</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>


<span class="n">x</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>475.54001
</code></pre></div></div>

<h4 id="standard-deviation-sigma">Standard deviation ($\sigma$)</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>298.57318
</code></pre></div></div>

<h4 id="variance-sigma2">Variance ($\sigma^2$)</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>89145.938
</code></pre></div></div>
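<p>The three measures are linked: the variance is the square of the standard deviation, and both default to the population form (dividing by n). A short sketch with a seeded generator; the seed and the variable names are assumptions made for repeatability, so the numbers will differ from the outputs above:</p>

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen only for repeatability
x = rng.integers(low=0, high=1000, size=100)

mu = x.mean()
sigma = x.std()                   # population standard deviation (ddof=0)
sigma_sample = x.std(ddof=1)      # sample standard deviation, divides by n - 1

print(mu, sigma, x.var())
print(np.isclose(sigma ** 2, x.var()))
```

<p>Pass <code class="language-plaintext highlighter-rouge">ddof=1</code> when the data is a sample and an unbiased variance estimate is wanted.</p>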

<h3 id="subset-a-vector">Subset a vector</h3>

<h4 id="index-for-maximum--minimum-values-in-a-sequence">Index for maximum &amp; minimum values in a sequence</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="p">.</span><span class="n">argmax</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>37
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="p">.</span><span class="n">argmin</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>3
</code></pre></div></div>

<h4 id="subset-using-index">Subset using index</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Elements at indices 2, 3 and 4 (the stop index 5 is excluded)
</span>
<span class="n">x</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="mi">5</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([ 31,   1, 561])
</code></pre></div></div>
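<p>As a quick check of how these pieces fit together, the sketch below uses a small illustrative array (not the random data from the post) to show that <code class="language-plaintext highlighter-rouge">argmax</code> and <code class="language-plaintext highlighter-rouge">argmin</code> return positions, and that slices are half-open:</p>

```python
import numpy as np

x = np.array([40, 7, 31, 1, 561, 90])   # illustrative values, not the post's data

i_max, i_min = x.argmax(), x.argmin()
print(i_max, x[i_max])   # position and value of the maximum
print(i_min, x[i_min])   # position and value of the minimum

# Slices are half-open: the start index is included, the stop index is excluded
print(x[2:5])
```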

<h2 id="matrices">Matrices</h2>

<h3 id="create-a-matrix">Create a matrix</h3>

<h4 id="get-a-matrix-of-particular-shape-by-providing-numbers">Get a matrix of particular shape by providing numbers</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span>
            <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span>
            <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span>
            <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">]</span>
         <span class="p">])</span>
<span class="n">x</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]])
</code></pre></div></div>

<h4 id="transpose-of-a-matrix">Transpose of a matrix</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span>
            <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span>
            <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span>
            <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">]</span>
         <span class="p">]).</span><span class="n">T</span>
<span class="n">y</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])
</code></pre></div></div>
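<p>Transposing swaps the two axes, so element (i, j) of the result is element (j, i) of the original. A minimal sketch reusing the same values:</p>

```python
import numpy as np

x = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [1, 2, 3, 4]])
y = x.T

print(x.shape, y.shape)          # rows and columns swap
print(y[3, 0] == x[0, 3])        # element (i, j) of y is element (j, i) of x
print(np.array_equal(y.T, x))    # transposing twice returns the original
```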

<h4 id="get-a-matrix-of-particular-shape">Get a matrix of particular shape</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 0.09429882,  0.34480325,  0.11695385,  0.96194279,  0.53927071],
       [ 0.78844899,  0.7351646 ,  0.43960103,  0.20815778,  0.50149201],
       [ 0.26338585,  0.89077065,  0.20248855,  0.90770632,  0.91826611],
       [ 0.62807109,  0.48525764,  0.55865624,  0.88327996,  0.51471048]])
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])
</code></pre></div></div>
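<p>The outputs above differ in one subtle way: <code class="language-plaintext highlighter-rouge">np.zeros</code> and <code class="language-plaintext highlighter-rouge">np.ones</code> produce floats by default, while <code class="language-plaintext highlighter-rouge">np.ones_like</code> inherits the dtype of its argument. A small sketch (the names <code class="language-plaintext highlighter-rouge">z</code>, <code class="language-plaintext highlighter-rouge">z_int</code> and <code class="language-plaintext highlighter-rouge">o</code> are illustrative):</p>

```python
import numpy as np

x = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [1, 2, 3, 4]])

z = np.zeros(shape=(4, 5))                  # float64 unless a dtype is given
z_int = np.zeros(shape=(4, 5), dtype=int)   # explicit integer dtype
o = np.ones_like(x)                         # copies both shape and dtype from x

print(z.dtype, z_int.dtype, o.dtype)
print(o.shape)
```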

<h3 id="matrix-operations">Matrix operations</h3>

<h4 id="multiply-a-matrix-by-constant">Multiply a matrix by constant</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">5</span> <span class="o">*</span> <span class="n">x</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 5, 10, 15, 20],
       [ 5, 10, 15, 20],
       [ 5, 10, 15, 20]])
</code></pre></div></div>

<h4 id="multiply-a-matrix-by-another">Multiply a matrix by another</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span><span class="o">@</span><span class="n">x</span>  <span class="c1"># Matrix multiplication operator, available in Python 3.5+
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 3,  6,  9, 12],
       [ 6, 12, 18, 24],
       [ 9, 18, 27, 36],
       [12, 24, 36, 48]])
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="n">x</span><span class="p">)</span> <span class="c1"># numpy based dot product
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 3,  6,  9, 12],
       [ 6, 12, 18, 24],
       [ 9, 18, 27, 36],
       [12, 24, 36, 48]])
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">.</span><span class="n">T</span> <span class="c1"># element-wise multiplication (Hadamard product) of two matrices with the same shape
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1,  4,  9, 16],
       [ 1,  4,  9, 16],
       [ 1,  4,  9, 16]])
</code></pre></div></div>
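<p>The difference between the two kinds of product is easiest to see through shapes: matrix multiplication needs the inner dimensions to agree, while the element-wise product needs both operands to have the same shape. A sketch using the same values as <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> above (<code class="language-plaintext highlighter-rouge">prod</code> and <code class="language-plaintext highlighter-rouge">hadamard</code> are illustrative names):</p>

```python
import numpy as np

x = np.array([[1, 2, 3, 4]] * 3)   # shape (3, 4), same values as x above
y = x.T                            # shape (4, 3)

prod = y @ x                       # (4, 3) @ (3, 4) gives (4, 4)
hadamard = x * y.T                 # element-wise, both operands are (3, 4)

print(prod.shape, hadamard.shape)
print(np.array_equal(prod, np.dot(y, x)))   # @ and np.dot agree for 2-D arrays
```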

<h3 id="subset-a-matrix">Subset a matrix</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="p">[:,:]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]])
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="p">[:,</span><span class="mi">1</span><span class="p">:</span><span class="mi">3</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[2, 3],
       [2, 3],
       [2, 3]])
</code></pre></div></div>]]></content><author><name>Nilesh Patil</name></author><category term="blog" /><category term="numpy" /><category term="python" /><summary type="html"><![CDATA[Vectors, matrices, and basic linear algebra with NumPy]]></summary></entry><entry><title type="html">Human activity recognition</title><link href="https://www.nilesh42.science/posts/human-activity-recognition/" rel="alternate" type="text/html" title="Human activity recognition" /><published>2017-02-16T01:09:55+05:30</published><updated>2017-02-15T23:49:19+05:30</updated><id>https://www.nilesh42.science/posts/human-activity-recognition</id><content type="html" xml:base="https://www.nilesh42.science/posts/human-activity-recognition/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>A modern smartphone comes equipped with a variety of sensors, from motion detectors to optical calibrators. The data collected by these sensors is valuable for better aligning the applications on the phone with the user’s lifestyle. In this project, we focused on using data collected from motion sensors to build a model which identifies the type of activity being performed with minimal computation. The end goal is a model which classifies the activity with high accuracy while staying within the limited computational resources available on a single phone.</p>

<p>The project is hosted here: <a href="https://github.com/nilesh-patil/HumanActivityRecognition">Github</a></p>

<h2 id="data-collection-and-preparation">Data Collection and Preparation</h2>

<p>We used the data provided by the Human Activity Recognition research project, which built this database from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. The complete data &amp; related papers can be accessed at: <a href="https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones">UCI ML repository page</a></p>

<p>Data was collected from 30 volunteers aged between 19 and 48 years. Each record contains features such as acceleration along the x, y, z axes and angular velocity along the x, y, z axes, the 561 attributes derived from these basic measurements, an identifier for the user &amp; the activity being performed.</p>

<p>There are 6 categories of activities being performed:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">standing</code></li>
  <li><code class="language-plaintext highlighter-rouge">sitting</code></li>
  <li><code class="language-plaintext highlighter-rouge">laying</code></li>
  <li><code class="language-plaintext highlighter-rouge">walking</code></li>
  <li><code class="language-plaintext highlighter-rouge">walking-downstairs</code></li>
  <li><code class="language-plaintext highlighter-rouge">walking-upstairs</code></li>
</ol>

<p>The raw data has separate text files for most of the variable groups; we used the dataset packaged as a single <code class="language-plaintext highlighter-rouge">.RData</code> file. In this dataset, a single column (‘subject’) identifies the user and the last column (‘activity’) identifies the activity being performed when the measurements were taken. All other attributes are available in the same column-oriented format. Note that the values in the dataset have already been normalized.</p>

<h2 id="exploratory-analysis">Exploratory Analysis</h2>

<p>a) <em>High dimensionality:</em></p>

<p>The dataset contains 561 features and we started out by exploring how these are related to each other &amp; whether there are some which can be safely ignored for our problem.</p>

<p>b) <em>Correlation Check:</em></p>

<p>We built a correlation matrix for all 561 variables in one go to identify any apparent patterns in the relationships. Most of these features are highly correlated with each other, so it is reasonable to drop most of them: within a group of highly correlated features, a single representative carries nearly the same information.</p>

<center><figure class="full">
	<img src="/images/blog/activityRecognition/image8.png" alt="Correlation matrix for all 561 sensor features" height="1000px" width="1000px" />
	<figcaption>Fig 01. Correlation Matrix between all 561 features</figcaption>
</figure></center>
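<p>The correlation check described above can be sketched as a greedy filter: walk through the features and keep one only if it is not strongly correlated with a feature already kept. This is an illustration on synthetic data, not the code used in the project; the 0.95 threshold and all names are assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
base = rng.normal(size=(200, 3))
# Hypothetical data: columns 3 and 4 are near-copies of columns 0 and 1
X = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1] + 0.01 * rng.normal(size=200),
])

corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-by-feature matrix
threshold = 0.95                              # assumed cut-off

keep = []
for j in range(X.shape[1]):
    # keep a feature only if it is not highly correlated with one already kept
    if not any(corr[j, k] > threshold for k in keep):
        keep.append(j)

print(keep)
```

<p>The order of the columns decides which member of a correlated group survives, so a different ordering can keep a different representative.</p>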

<p>c) <em>Variance Check:</em></p>

<p>We checked our variables for zero or low variance so that they could be removed before running any analysis. Variables which barely change have low variance and will have little impact on the classification model.</p>
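<p>A variance filter of this kind takes only a few lines of NumPy. The sketch below uses synthetic data with one constant column; the threshold value and the names are assumptions for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X[:, 2] = 0.5                    # hypothetical constant (zero-variance) column

variances = X.var(axis=0)        # one variance per column
min_variance = 1e-8              # assumed threshold
selected = np.flatnonzero(variances > min_variance)

print(variances.round(3))
print(selected)                  # indices of the columns that are kept
```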

<p>d) <em>Missing value Check:</em></p>

<p>We checked for any missing values in our columns, which might lead to errors in any future analysis but didn’t find any and so proceeded with the complete dataset.</p>

<p>e) <em>Visual exploration:</em></p>

<p>We also started out with basic visual exploration of the dataset by plotting distributions of the variables for each category, but given the large number involved, we dropped the idea. In general, though, the distributions show two distinct major groups, as in the figure below:</p>

<center><figure class="full">
	<img src="/images/blog/activityRecognition/image9.png" alt="Representative distribution of sensor features by activity category" height="750px" width="750px" />
	<figcaption>Representative distribution</figcaption>
</figure></center>

<h2 id="model">Model</h2>

<p>The first step was to create a train &amp; test set. We split our data into two sets in a 7:3 ratio by random sampling without replacement. This keeps the train &amp; test sets broadly representative of the complete dataset. An alternative is to sample within each output class (stratified sampling). In our case, the result wasn’t significantly different.</p>
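<p>The two splitting strategies can be compared with plain NumPy. This is a sketch on hypothetical labels (6 classes with 50 rows each), not the project's data; the seed and names are assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed for a repeatable split
y = np.repeat(np.arange(6), 50)      # hypothetical labels: 6 classes x 50 rows

# Plain random split: shuffle all row indices once, then cut at 70%
idx = rng.permutation(len(y))
cut = (7 * len(y)) // 10
train_idx, test_idx = idx[:cut], idx[cut:]

# Per-class split: apply the same 70/30 cut inside each class
strat_train = []
for c in np.unique(y):
    rows = rng.permutation(np.flatnonzero(y == c))
    strat_train.extend(rows[: (7 * len(rows)) // 10])
strat_train = np.array(strat_train)
strat_test = np.setdiff1d(np.arange(len(y)), strat_train)

print(len(train_idx), len(test_idx))
print(np.bincount(y[strat_train]))   # identical class counts in the training set
```

<p>With roughly balanced classes the two approaches give similar class proportions, which matches the observation that the result did not change much.</p>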

<p>For modelling, we used the following techniques on our training set:</p>
<ul>
  <li>SVM – Support vector machines</li>
  <li>Random forest (Final Model)</li>
</ul>

<p>To determine stability of the model, we used the out-of-bag (OOB) score calculated during the model building phase as a stand-in for a validation set &amp; optimized our model to increase this score. For determining true performance, we used a separate test set which was not included in any of our variable selection, model training or validation phases. High accuracy on this independent test set is evidence that the model is not overfitting our training data &amp; hence should generalize well.</p>

<p>We used Random forest variable importance scores to determine the final variables to build our submission model. This process of variable selection was done iteratively &amp; various parameters were tested. To maintain reproducibility, we set RandomState=42 at the beginning of the code so as to have uniform train/test sets &amp; variables every time we run this code.</p>

<p>We started out with all 561 variables &amp; reduced the total features to 5 in our final model. The focus of our process was to follow an algorithmic approach instead of a domain-knowledge-based model building process &amp; hence we relied on the OOB score &amp; variable importance to determine the optimal number of features, the number of trees &amp; which features to use.</p>

<h3 id="step-by-step-process">Step-by-step process</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Set RandomState = 42

2. Split data into two sets:

   - train(70%)

   - test(30%)

3. Using training set &amp; ALL features, build a random forest ensemble

4. From variable importance measure generated during the previous step,
   rank features according to their importance in differentiating between
   categories
</code></pre></div></div>

<center><figure class="full">
	<img src="/images/blog/activityRecognition/image10.png" alt="Random forest variable importance scores for all 561 features" height="400px" width="750px" />
	<figcaption>Variable importance scores</figcaption>
</figure></center>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5. Determine optimal number of trees &amp; variables by iterating over 0-150
   trees &amp; for 1-25 variables
</code></pre></div></div>

<center>
 <figure class="half">
  	<img src="/images/blog/activityRecognition/image2.png" alt="OOB error vs number of variables selected" height="400px" width="350px" />
  	<img src="/images/blog/activityRecognition/image1.png" alt="OOB error vs number of trees" height="350px" width="350px" />
  	<figcaption>Determining optimal number of variables &amp; trees for training</figcaption>
 </figure>
</center>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6. For the final step, we use 5 of the most important measures determined
   in this fashion &amp; number of trees = 50.
7. Using oob score during training phase &amp; accuracy score from the test
   set as final step, we freeze this model for final submission
</code></pre></div></div>
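<p>The steps above can be sketched with scikit-learn, which the references list. This is an illustration under stated assumptions, not the project's actual code: the data is a synthetic stand-in generated by <code class="language-plaintext highlighter-rouge">make_classification</code>, the first forest uses an arbitrary 100 trees, and the grid search of step 5 is left out.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # mirrors RandomState = 42 in step 1

# Synthetic stand-in for the 561-feature sensor data (assumption)
X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           n_classes=6, random_state=SEED)

# Step 2: 70/30 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

# Steps 3-4: fit on all features, then rank them by importance
full = RandomForestClassifier(n_estimators=100, random_state=SEED)
full.fit(X_train, y_train)
ranking = np.argsort(full.feature_importances_)[::-1]

# Step 6: keep the 5 most important features and use 50 trees
top5 = ranking[:5]
final = RandomForestClassifier(n_estimators=50, oob_score=True,
                               random_state=SEED)
final.fit(X_train[:, top5], y_train)

# Step 7: compare the OOB estimate with accuracy on the held-out test set
print(final.oob_score_)
print(final.score(X_test[:, top5], y_test))
```

<p>If the OOB score and the test accuracy are close, the OOB score is behaving as a reasonable proxy for validation performance.</p>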

<p>The only major assumption in our choice of algorithm (RandomForest) is that random forests don’t usually overfit the training set. This assumption breaks down when the training dataset is extremely biased, but in our case it’s relatively balanced &amp; hence we chose it over other algorithms.</p>

<h2 id="analysis-and-results">Analysis and Results</h2>

<p><em>1. Important Features:</em></p>

<p>Using the previously described feature selection, we determined that the following features were important for building our classification model:</p>

<ul>
  <li>angle(X,gravityMean)</li>
  <li>tGravityAcc-mean()-Y</li>
  <li>tGravityAcc-min()-X</li>
  <li>tGravityAcc-max()-X</li>
  <li>tBodyAcc-mad()-X</li>
</ul>

<p>The importance scores for the final model are shown in the figure:</p>

<center><figure class="full">
	<img src="/images/blog/activityRecognition/image3.png" alt="Variable importance scores for the final 5 selected features" height="500px" width="500px" />
	<figcaption>Importance scores for final selected features</figcaption>
</figure></center>

<p><em>2. We used SVM &amp; RandomForest for the final model &amp; their accuracy scores along with confusion matrices are as shown:</em></p>

<p><strong>Accuracy Scores :</strong></p>

<table class="table">
  <thead>
    <tr>
      <th style="text-align: center"> </th>
      <th style="text-align: center">Random-Forest</th>
      <th style="text-align: center">SVM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><em>Train</em></td>
      <td style="text-align: center">94.50% (oob)</td>
      <td style="text-align: center">83.48%</td>
    </tr>
    <tr>
      <td style="text-align: center"><em>Test</em></td>
      <td style="text-align: center">94.37%</td>
      <td style="text-align: center">82.37%</td>
    </tr>
  </tbody>
</table>

<p><strong>Confusion matrix :</strong></p>

<center>
  <figure class="half">
   	<img src="/images/blog/activityRecognition/08.a.ConfusionMatrix-Test_RF.png" alt="Confusion matrix for Random Forest on test set" />
   	<img src="/images/blog/activityRecognition/08.b.ConfusionMatrix-Test_SVM.png" alt="Confusion matrix for SVM on test set" />
  	 <figcaption>Confusion matrix for test dataset</figcaption>
  </figure>
</center>
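<p>For readers who want to see how a confusion matrix and the accuracy score relate, here is a small NumPy sketch. The labels are hypothetical and cover only 3 integer-encoded classes; they are not the project's results.</p>

```python
import numpy as np

# Hypothetical labels for 3 activity classes, encoded as integers
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)    # rows: true class, columns: predicted class

accuracy = np.trace(cm) / cm.sum()    # correct predictions lie on the diagonal
print(cm)
print(accuracy)
```

<p>Off-diagonal cells show which pairs of activities get confused with each other, which a single accuracy number hides.</p>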

<ol>
  <li>
    <p>Given the high accuracy we get on the test dataset, we are confident in using a RandomForest-based model for detecting human activity from the smartphone dataset.</p>
  </li>
  <li>
    <p>From the final model, we also see that some categories are fairly straightforward to classify compared to others. We have shown this using a scatterplot matrix colored by category, as shown in the figure:</p>
  </li>
</ol>
<center><figure class="full">
	<img src="/images/blog/activityRecognition/image11.png" alt="Scatterplot matrix colored by activity category showing separability of classes" height="750px" width="750px" />
	<figcaption>Fig 04. Distribution of tBodyAccJerk-std()-X across all 6 categories</figcaption>
</figure></center>

<h2 id="references">References</h2>

<p><em>Random Forest:</em></p>

<ul>
  <li><a href="https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm">https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm</a></li>
  <li><a href="http://scikit-learn.org/stable/modules/ensemble.html#forest">http://scikit-learn.org/stable/modules/ensemble.html#forest</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Random_forest">https://en.wikipedia.org/wiki/Random_forest</a></li>
</ul>

<p><em>SVM:</em></p>

<ul>
  <li><a href="http://scikit-learn.org/stable/modules/svm.html">http://scikit-learn.org/stable/modules/svm.html</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Support_vector_machine">https://en.wikipedia.org/wiki/Support_vector_machine</a></li>
</ul>

<p><em>OOB Score:</em></p>

<ul>
  <li><a href="https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr">https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr</a></li>
  <li><a href="http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html">http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html</a></li>
</ul>

<p><em>UCI-ML dataset location:</em></p>

<ul>
  <li><a href="https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones">https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones</a></li>
</ul>

<p><em>Scikit-Learn:</em></p>

<ul>
  <li><a href="http://scikit-learn.org/stable/index.html">http://scikit-learn.org/stable/index.html</a></li>
</ul>

<p><em>Github Page:</em></p>

<ul>
  <li><a href="https://github.com/nilesh-patil/HumanActivityRecognition">https://github.com/nilesh-patil/HumanActivityRecognition</a></li>
</ul>]]></content><author><name>Nilesh Patil</name></author><category term="blog" /><category term="sensors" /><category term="model" /><category term="RandomForest" /><category term="project" /><summary type="html"><![CDATA[Activity detection from sensor data]]></summary></entry><entry><title type="html">Visualizing distributions</title><link href="https://www.nilesh42.science/posts/visualizing-and-comparing-distributions/" rel="alternate" type="text/html" title="Visualizing distributions" /><published>2017-01-15T01:09:55+05:30</published><updated>2017-01-14T23:49:19+05:30</updated><id>https://www.nilesh42.science/posts/visualizing-and-comparing-distributions</id><content type="html" xml:base="https://www.nilesh42.science/posts/visualizing-and-comparing-distributions/"><![CDATA[<p>Once you have your data, usually you start by building summaries, checking for outliers, anomalies in the data &amp; visualizing it from different angles. Here, we’ll look at a few common approaches to visualize distributions (in a highly general sense).</p>

<h2 id="connect-to-data">Connect to data</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">sqlite3</span>


<span class="n">db_path</span> <span class="o">=</span> <span class="s">'./data/world-development-indicators/database.sqlite'</span>

<span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">db_path</span><span class="p">)</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="n">db</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT name FROM sqlite_master WHERE type='table';"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">db</span><span class="p">.</span><span class="n">fetchall</span><span class="p">())</span>

<span class="n">data_countries</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql_query</span><span class="p">(</span><span class="s">'select * from Country'</span><span class="p">,</span><span class="n">conn</span><span class="p">)</span>
<span class="n">data_series</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql_query</span><span class="p">(</span><span class="s">'select * from Series'</span><span class="p">,</span><span class="n">conn</span><span class="p">)</span>
<span class="n">data_indicators</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql_query</span><span class="p">(</span><span class="s">'select * from Indicators'</span><span class="p">,</span><span class="n">conn</span><span class="p">)</span>

</code></pre></div></div>

<h2 id="histogram">Histogram</h2>

<h3 id="data-prep">Data Prep</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected_indicators</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Life expectancy at birth, female (years)'</span><span class="p">,</span>
                       <span class="s">'Life expectancy at birth, male (years)'</span><span class="p">,</span>
                       <span class="s">'Life expectancy at birth, total (years)'</span><span class="p">]</span>
<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">[[</span><span class="s">'feature'</span><span class="p">,</span><span class="s">'type'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">[</span><span class="s">'IndicatorName'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">', '</span><span class="p">,</span><span class="n">expand</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="plot">Plot</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nbins</span> <span class="o">=</span> <span class="mi">15</span>
<span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
        <span class="n">palette</span><span class="o">=</span><span class="s">"pastel"</span><span class="p">,</span>
        <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">rc</span><span class="o">=</span><span class="p">{</span><span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">),</span>
            <span class="s">'figure.dpi'</span><span class="p">:</span><span class="mi">500</span><span class="p">})</span>

<span class="c1"># sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True is the replacement
</span><span class="n">sns</span><span class="p">.</span><span class="n">histplot</span><span class="p">(</span><span class="n">data_plot</span><span class="p">.</span><span class="n">Value</span><span class="p">[</span><span class="n">data_plot</span><span class="p">.</span><span class="nb">type</span><span class="o">==</span><span class="s">'female (years)'</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="n">nbins</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">histplot</span><span class="p">(</span><span class="n">data_plot</span><span class="p">.</span><span class="n">Value</span><span class="p">[</span><span class="n">data_plot</span><span class="p">.</span><span class="nb">type</span><span class="o">==</span><span class="s">'male (years)'</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="n">nbins</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">histplot</span><span class="p">(</span><span class="n">data_plot</span><span class="p">.</span><span class="n">Value</span><span class="p">[</span><span class="n">data_plot</span><span class="p">.</span><span class="nb">type</span><span class="o">==</span><span class="s">'total (years)'</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="n">nbins</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'Female'</span><span class="p">,</span> <span class="s">'Male'</span><span class="p">,</span> <span class="s">'Total'</span><span class="p">],</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.12</span><span class="p">,</span><span class="mf">1.04</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">((</span><span class="mi">25</span><span class="p">,</span><span class="mi">100</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span><span class="n">linestyle</span><span class="o">=</span><span class="s">'-.'</span><span class="p">,</span><span class="n">linewidth</span><span class="o">=</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Life expectancy at birth ( In years )'</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/images/blog/distributions/01.histogram.png" alt="Histogram of life expectancy at birth by sex across countries" class="center-image" height="300px" width="850px" /></p>

<h2 id="scatter-plot">Scatter plot</h2>

<h3 id="data-prep-1">Data prep</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">selected_indicators</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Unemployment, female (% of female labor force)'</span><span class="p">,</span>
                       <span class="s">'Unemployment, male (% of male labor force)'</span><span class="p">]</span>

<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">[[</span><span class="s">'feature'</span><span class="p">,</span><span class="s">'type'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">[</span><span class="s">'IndicatorName'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">', '</span><span class="p">,</span><span class="n">expand</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># literal match (regex=False): no escaping needed, and the result is the same across pandas versions
</span><span class="n">data_plot</span><span class="p">[</span><span class="s">'type'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="nb">type</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' (% of male labor force)'</span><span class="p">,</span><span class="s">''</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">[</span><span class="s">'type'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="nb">type</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' (% of female labor force)'</span><span class="p">,</span><span class="s">''</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">'Value'</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="s">'CountryName'</span><span class="p">,</span><span class="n">columns</span><span class="o">=</span><span class="s">'type'</span><span class="p">)</span>

</code></pre></div></div>

<h3 id="plot-1">Plot</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
        <span class="n">palette</span><span class="o">=</span><span class="s">"pastel"</span><span class="p">,</span>
        <span class="n">rc</span><span class="o">=</span><span class="p">{</span><span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mi">7</span><span class="p">,</span><span class="mi">5</span><span class="p">),</span>
            <span class="s">'figure.dpi'</span><span class="p">:</span><span class="mi">500</span><span class="p">})</span>

<span class="n">sns</span><span class="p">.</span><span class="n">lmplot</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">'female'</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">'male'</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">,</span> <span class="n">fit_reg</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">x_jitter</span><span class="o">=</span><span class="mf">1.5</span><span class="p">,</span> <span class="n">y_jitter</span><span class="o">=</span><span class="mf">1.5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span><span class="mi">40</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span><span class="mi">40</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'-.'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Unemployment (% of total)'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'./plots/02.scatter.png'</span><span class="p">,</span><span class="n">orientation</span><span class="o">=</span><span class="s">'landscape'</span><span class="p">,</span><span class="n">dpi</span><span class="o">=</span><span class="mi">500</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/images/blog/distributions/02.scatter.png" alt="Scatter plot of male vs female unemployment rates by country" class="center-image" height="500px" width="750px" /></p>

<h2 id="density-plot">Density plot</h2>

<h3 id="data-prep-2">Data prep</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected_indicators</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Mortality rate, adult, female (per 1,000 female adults)'</span><span class="p">,</span>
                       <span class="s">'Mortality rate, adult, male (per 1,000 male adults)'</span><span class="p">]</span>

<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">[[</span><span class="s">'feature'</span><span class="p">,</span><span class="s">'type'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">[</span><span class="s">'IndicatorName'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">', adult, '</span><span class="p">,</span><span class="n">expand</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># literal match (regex=False): no escaping needed, and the result is the same across pandas versions
</span><span class="n">data_plot</span><span class="p">[</span><span class="s">'type'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="nb">type</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' (per 1,000 female adults)'</span><span class="p">,</span><span class="s">''</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">[</span><span class="s">'type'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="nb">type</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' (per 1,000 male adults)'</span><span class="p">,</span><span class="s">''</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">'Value'</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="s">'CountryName'</span><span class="p">,</span><span class="n">columns</span><span class="o">=</span><span class="s">'type'</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="plot-2">Plot</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
        <span class="n">palette</span><span class="o">=</span><span class="s">"pastel"</span><span class="p">,</span>
        <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">rc</span><span class="o">=</span><span class="p">{</span>
            <span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mi">10</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span>
            <span class="s">'figure.dpi'</span><span class="p">:</span><span class="mi">200</span>
           <span class="p">})</span>

<span class="c1"># long-form keyword API (data=..., x=...), available in seaborn &gt;= 0.11
</span><span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data_plot</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">'male'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data_plot</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">'female'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span><span class="n">linestyle</span><span class="o">=</span><span class="s">'-.'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Mortality rate'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span><span class="mf">0.006</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">((</span><span class="o">-</span><span class="mi">100</span><span class="p">,</span><span class="mi">700</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'./03.density.png'</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/images/blog/distributions/03.density.png" alt="Kernel density estimate of adult mortality rates for male vs female populations" class="center-image" height="500px" width="750px" /></p>

<h2 id="boxplot">Boxplot</h2>

<h3 id="data-prep-3">Data prep</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected_indicators</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Merchandise trade (% of GDP)'</span><span class="p">]</span>

<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">[</span><span class="s">'Region'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">right</span><span class="o">=</span><span class="n">data_countries</span><span class="p">,</span><span class="n">on</span><span class="o">=</span><span class="s">'CountryCode'</span><span class="p">,</span><span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">)[</span><span class="s">'Region'</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="plot-3">Plot</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns_order</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">data_plot</span><span class="p">.</span><span class="n">Region</span><span class="p">.</span><span class="n">unique</span><span class="p">())</span>  <span class="c1"># alphabetical region order for the x-axis
</span>
<span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
        <span class="n">palette</span><span class="o">=</span><span class="s">"pastel"</span><span class="p">,</span>
        <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">rc</span><span class="o">=</span><span class="p">{</span>
            <span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mi">10</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span><span class="s">'figure.dpi'</span><span class="p">:</span><span class="mi">200</span>
           <span class="p">})</span>

<span class="n">sns</span><span class="p">.</span><span class="n">boxplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Region'</span><span class="p">,</span>
            <span class="n">y</span><span class="o">=</span><span class="s">'Value'</span><span class="p">,</span>
            <span class="n">palette</span><span class="o">=</span><span class="s">'autumn'</span><span class="p">,</span>
            <span class="n">order</span><span class="o">=</span><span class="n">columns_order</span><span class="p">,</span>
            <span class="n">width</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
            <span class="n">fliersize</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
            <span class="n">data</span><span class="o">=</span><span class="n">data_plot</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span><span class="n">linestyle</span><span class="o">=</span><span class="s">'-.'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Merchandise trade'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'% of GDP'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'./04.boxplot.png'</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/images/blog/distributions/04.boxplot.png" alt="Box plot of merchandise trade as percent of GDP by world region" class="center-image" height="500px" width="950px" /></p>

<h2 id="violin-plot">Violin plot</h2>

<h3 id="data-prep-4">Data prep</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected_indicators</span> <span class="o">=</span> <span class="p">[</span> <span class="s">'CO2 emissions from gaseous fuel consumption (% of total)'</span><span class="p">,</span>
                        <span class="s">'CO2 emissions from liquid fuel consumption (% of total)'</span><span class="p">]</span>

<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">[</span><span class="s">'Region'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">right</span><span class="o">=</span><span class="n">data_countries</span><span class="p">,</span><span class="n">on</span><span class="o">=</span><span class="s">'CountryCode'</span><span class="p">,</span><span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">)[</span><span class="s">'Region'</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="plot-4">Plot</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.patches</span> <span class="k">as</span> <span class="n">mpatches</span>

<span class="n">columns_order</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">data_plot</span><span class="p">.</span><span class="n">Region</span><span class="p">.</span><span class="n">unique</span><span class="p">())</span>

<span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
        <span class="n">palette</span><span class="o">=</span><span class="s">"pastel"</span><span class="p">,</span>
        <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">rc</span><span class="o">=</span><span class="p">{</span>
            <span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mi">16</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span><span class="s">'figure.dpi'</span><span class="p">:</span><span class="mi">250</span>
           <span class="p">})</span>
<span class="n">sns</span><span class="p">.</span><span class="n">violinplot</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span><span class="s">'Region'</span><span class="p">,</span>
               <span class="n">y</span><span class="o">=</span><span class="s">'Value'</span><span class="p">,</span>
               <span class="n">hue</span><span class="o">=</span><span class="s">'IndicatorName'</span><span class="p">,</span>
               <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span>
               <span class="n">inner</span><span class="o">=</span><span class="s">"quart"</span><span class="p">,</span>
               <span class="n">palette</span><span class="o">=</span><span class="p">{</span><span class="s">"CO2 emissions from gaseous fuel consumption (% of total)"</span><span class="p">:</span> <span class="s">"y"</span><span class="p">,</span>
                        <span class="s">"CO2 emissions from liquid fuel consumption (% of total)"</span><span class="p">:</span> <span class="s">"b"</span><span class="p">},</span>
               <span class="n">data</span><span class="o">=</span><span class="n">data_plot</span><span class="p">,</span>
               <span class="n">split</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span><span class="n">linestyle</span><span class="o">=</span><span class="s">'-.'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'CO$_2$ emission'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'% of total'</span><span class="p">)</span>

<span class="n">gas_patch</span> <span class="o">=</span> <span class="n">mpatches</span><span class="p">.</span><span class="n">Patch</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'yellow'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Gaseous'</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">liquid_patch</span> <span class="o">=</span> <span class="n">mpatches</span><span class="p">.</span><span class="n">Patch</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'skyblue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Liquid'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">handles</span><span class="o">=</span><span class="p">[</span><span class="n">gas_patch</span><span class="p">,</span> <span class="n">liquid_patch</span><span class="p">],</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.99</span><span class="p">),</span> <span class="n">fontsize</span><span class="o">=</span><span class="s">'x-large'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'./plots/05.violinplot.png'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">250</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/images/blog/distributions/05.violinplot.png" alt="Violin plot of gaseous vs liquid CO2 emissions by region" class="center-image" height="750px" width="950px" /></p>

<h2 id="heatmap">Heatmap</h2>

<h3 id="data-prep-5">Data prep</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected_indicators_export</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">'Merchandise exports to developing economies in East Asia &amp; Pacific (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to developing economies in Latin America &amp; the Caribbean (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to developing economies in Middle East &amp; North Africa (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to developing economies in South Asia (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to developing economies in Sub-Saharan Africa (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to developing economies outside region (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to developing economies within region (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to economies in the Arab World (% of total merchandise exports)'</span><span class="p">,</span>
    <span class="s">'Merchandise exports to high-income economies (% of total merchandise exports)'</span>
<span class="p">]</span>

<span class="n">selected_indicators_imports</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">'Merchandise imports from developing economies in East Asia &amp; Pacific (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from developing economies in Latin America &amp; the Caribbean (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from developing economies in Middle East &amp; North Africa (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from developing economies in South Asia (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from developing economies in Sub-Saharan Africa (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from developing economies outside region (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from developing economies within region (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from economies in the Arab World (% of total merchandise imports)'</span><span class="p">,</span>
    <span class="s">'Merchandise imports from high-income economies (% of total merchandise imports)'</span>
<span class="p">]</span>

<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators_export</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">[</span><span class="s">'Region'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">right</span><span class="o">=</span><span class="n">data_countries</span><span class="p">,</span><span class="n">on</span><span class="o">=</span><span class="s">'CountryCode'</span><span class="p">,</span><span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">)[</span><span class="s">'Region'</span><span class="p">]</span>
<span class="n">data_export</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">'Value'</span><span class="p">,</span><span class="n">columns</span><span class="o">=</span><span class="s">'Region'</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="s">'IndicatorName'</span><span class="p">)</span>


<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators_imports</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">[</span><span class="s">'Region'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">right</span><span class="o">=</span><span class="n">data_countries</span><span class="p">,</span><span class="n">on</span><span class="o">=</span><span class="s">'CountryCode'</span><span class="p">,</span><span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">)[</span><span class="s">'Region'</span><span class="p">]</span>
<span class="n">data_import</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">'Value'</span><span class="p">,</span><span class="n">columns</span><span class="o">=</span><span class="s">'Region'</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="s">'IndicatorName'</span><span class="p">)</span>
</code></pre></div></div>
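
<p>The export and import blocks above repeat the same filter, sort, and latest-value steps. As a minimal sketch, not part of the original notebook, those steps could be folded into one helper. The function name <code>latest_by_region</code> and the toy frames are hypothetical; only the column names are taken from the code above, and the sketch assumes each <code>CountryCode</code> belongs to a single region.</p>

```python
import pandas as pd


def latest_by_region(data_indicators, data_countries, indicator_names):
    """Latest value per country and indicator, tagged with the country's region.

    Hypothetical helper: it mirrors the repeated data prep steps in this post.
    """
    # Aggregates (e.g. "World") carry an empty Region, so they are dropped here.
    region_of = (data_countries.loc[data_countries.Region != '']
                 .drop_duplicates('CountryCode')
                 .set_index('CountryCode')['Region'])

    # Keep the requested indicators, then only rows for real countries.
    rows = data_indicators[data_indicators.IndicatorName.isin(indicator_names)]
    rows = rows[rows.CountryCode.isin(region_of.index)]

    # Sort by year so that .last() picks the most recent observation.
    latest = (rows.sort_values(['CountryName', 'IndicatorName', 'Year'])
                  .groupby(['CountryName', 'IndicatorName'], as_index=False)
                  .last())
    latest['Region'] = latest['CountryCode'].map(region_of)
    return latest


# Toy frames (made-up values) that only imitate the column layout used above.
toy_countries = pd.DataFrame({
    'CountryCode': ['AAA', 'BBB', 'WLD'],
    'Region': ['North', 'South', ''],
})
toy_indicators = pd.DataFrame({
    'CountryName': ['A', 'A', 'B', 'World'],
    'CountryCode': ['AAA', 'AAA', 'BBB', 'WLD'],
    'IndicatorName': ['X', 'X', 'X', 'X'],
    'Year': [2000, 2010, 2005, 2010],
    'Value': [1.0, 2.0, 3.0, 4.0],
})

result = latest_by_region(toy_indicators, toy_countries, ['X'])
print(result)
```

<p>With a helper like this, <code>data_export</code> and <code>data_import</code> would each reduce to one call followed by the same <code>pivot_table(values='Value', columns='Region', index='IndicatorName')</code> step used above.</p>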

<h3 id="plot-5">Plot</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
        <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">rc</span><span class="o">=</span><span class="p">{</span>
            <span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mi">20</span><span class="p">,</span><span class="mi">8</span><span class="p">),</span>
            <span class="s">'figure.dpi'</span><span class="p">:</span><span class="mi">250</span>
           <span class="p">})</span>
<span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">imports</span><span class="p">,</span> <span class="n">exports</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">im1</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">data_import</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span><span class="n">xlabels</span><span class="p">],</span>
                  <span class="n">ax</span><span class="o">=</span><span class="n">imports</span><span class="p">,</span>
                  <span class="n">center</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
                  <span class="n">cbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
                  <span class="n">cmap</span><span class="o">=</span><span class="s">"YlGnBu"</span><span class="p">)</span>
<span class="n">imports</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">ylabels</span><span class="p">)</span>
<span class="n">imports</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
<span class="n">imports</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
<span class="n">imports</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Imports :'</span><span class="p">);</span>

<span class="n">im2</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">data_export</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span><span class="n">xlabels</span><span class="p">],</span>
                  <span class="n">ax</span><span class="o">=</span><span class="n">exports</span><span class="p">,</span>
                  <span class="n">center</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
                  <span class="n">yticklabels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
                  <span class="n">cbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
                  <span class="n">cmap</span><span class="o">=</span><span class="s">"YlGnBu"</span><span class="p">)</span>
<span class="n">exports</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
<span class="n">exports</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
<span class="n">exports</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Exports :'</span><span class="p">);</span>
<span class="n">fig</span><span class="p">.</span><span class="n">subplots_adjust</span><span class="p">(</span><span class="n">wspace</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">hspace</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="n">mappable</span> <span class="o">=</span> <span class="n">im1</span><span class="p">.</span><span class="n">get_children</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">fig</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">mappable</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="p">[</span><span class="n">imports</span><span class="p">,</span><span class="n">exports</span><span class="p">],</span><span class="n">orientation</span> <span class="o">=</span> <span class="s">'vertical'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'./plots/06.heatmap.png'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">250</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/images/blog/distributions/06.heatmap.png" alt="Heatmap of merchandise imports and exports by region" class="center-image" height="500px" width="1000px" /></p>

<h2 id="rugs">Rugs</h2>

<h3 id="data-prep-6">Data prep</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected_indicators</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Merchandise trade (% of GDP)'</span><span class="p">]</span>

<span class="n">countries</span> <span class="o">=</span> <span class="n">data_countries</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">[</span><span class="n">data_countries</span><span class="p">.</span><span class="n">Region</span><span class="o">!=</span><span class="s">''</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">IndicatorName</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">selected_indicators</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_indicators</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">condition</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">CountryCode</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">countries</span><span class="p">)</span>
<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">condition</span><span class="p">,:]</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">data_plot</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'CountryName'</span><span class="p">,</span><span class="s">'IndicatorName'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">last</span><span class="p">()</span>
<span class="n">data_plot</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">data_plot</span><span class="p">[</span><span class="s">'Region'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data_plot</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">right</span><span class="o">=</span><span class="n">data_countries</span><span class="p">,</span><span class="n">on</span><span class="o">=</span><span class="s">'CountryCode'</span><span class="p">,</span><span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">)[</span><span class="s">'Region'</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="plot-6">Plot</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns_order</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">data_plot</span><span class="p">.</span><span class="n">Region</span><span class="p">.</span><span class="n">unique</span><span class="p">())</span>

<span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
        <span class="n">palette</span><span class="o">=</span><span class="s">"pastel"</span><span class="p">,</span>
        <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">rc</span><span class="o">=</span><span class="p">{</span>
            <span class="s">'figure.figsize'</span><span class="p">:(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">),</span><span class="s">'figure.dpi'</span><span class="p">:</span><span class="mi">500</span>
           <span class="p">})</span>

<span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">FacetGrid</span><span class="p">(</span><span class="n">data_plot</span><span class="p">,</span>
                  <span class="n">col</span><span class="o">=</span><span class="s">"Region"</span><span class="p">,</span>
                  <span class="n">col_wrap</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
                  <span class="n">col_order</span><span class="o">=</span><span class="n">columns_order</span><span class="p">,</span><span class="n">subplot_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'ylim'</span><span class="p">:(</span><span class="mi">0</span><span class="p">,</span><span class="mf">0.02</span><span class="p">)})</span>
<span class="c1"># sns.distplot is deprecated since seaborn 0.11; draw the density with sns.kdeplot and the rug with sns.rugplot
</span><span class="n">g</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">sns</span><span class="p">.</span><span class="n">kdeplot</span><span class="p">,</span> <span class="s">"Value"</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">sns</span><span class="p">.</span><span class="n">rugplot</span><span class="p">,</span> <span class="s">"Value"</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'./plots/07.rugplot.png'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">);</span>
</code></pre></div></div>

<p><img src="/images/blog/distributions/07.rugplot.png" alt="Rug plots of merchandise trade as percent of GDP faceted by region" class="center-image" height="600px" width="1000px" /></p>]]></content><author><name>Nilesh Patil</name></author><category term="blog" /><category term="data" /><category term="distribution" /><category term="visualization" /><category term="seaborn" /><summary type="html"><![CDATA[Common visualization examples for distributions]]></summary></entry></feed>