Adaptive OpenMP scheduling in SimuCell3D for tissue mechanics simulations

A sheet of cells decides to fold

A flat layer of identical cells can stay one layer thick. It can also buckle, pile into a stratified mass, or roll itself into a closed tube. The cells’ chemistry alone does not decide which of these happens. Mechanics sets the exact shape: each cell pushes back on its volume, pulls its surface taut under cortical tension, and sticks to its neighbors with some adhesion, and the tissue settles into whatever geometry balances those forces. So if you want to understand how a gut tubule closes or how a spheroid hollows into a fluid-filled vesicle, you are not tracking a gene. You are solving for the shape a few thousand deformable surfaces fall into when they are squeezed against each other in three dimensions. That is hard to measure in a real embryo and harder to reason about by hand, which is why people simulate it.

Each cell in SimuCell3D is a closed triangulated surface, on average 121 nodes and 238 faces; neighbors interact through a contact model handling adhesion and volumetric exclusion. Cell shape is not constrained by the representation.

SimuCell3D takes the literal version of that picture. Every cell is a closed triangulated surface, around 121 nodes and 238 faces on average, free to deform into whatever shape the forces demand. Each membrane carries an energy potential with four terms - internal pressure, area elasticity, surface tension, and bending stiffness - and the cells interact through a contact model that handles adhesion and volumetric exclusion. Steve Runser, Roman Vetter, and Dagmar Iber built it at ETH Zurich’s D-BSSE and published it in Nature Computational Science in 2024. The paper’s benchmark grows a tissue from a single cell to roughly 125,000 cells in about a day of compute. At that resolution the number that tends to bind is how many cells you can reach before the compute budget runs out.

I forked SimuCell3D to push that number. Under the same wall-clock budget, my fork reached 19,958 cells where the v1.0 baseline stopped at 9,693. A second run on different hardware repeated the pattern and, carried further, reached 26,534 cells against 12,851 for the v1 baseline beside it. Both are close to twice the tissue for the same compute budget. What moved the number was the order in which threads pick up work, so I built an adaptive OpenMP scheduler to balance it. The integrator and the time step stayed where they were. Instrumenting that scheduler to measure the win is where the more informative result turned up: most of the machinery it added barely engaged. Its CoV thresholds never tripped and its GROWTH phase never fired, and the speedup that remains is shared between the new schedules and a plainer contact-detection change that shipped alongside them, which this run cannot cleanly separate. I label the two runs RUN01, the headline, and RUN02, the reproduction, and carry both through the figures below.

The threads were mostly standing around

top had been lying to me the whole time: eight cores pegged near 800% utilization, the textbook picture of a CPU-bound job. A tracing profiler told the real story. On the worst steps, close to a third of that “busy” time was threads standing at a barrier, waiting for one overloaded thread to finish an oversized slice.

That idle third is invisible to top because it reports aggregate CPU time. 800% across eight cores looks identical whether one thread sprints while seven idle or all eight share the load evenly. Only a per-thread trace tells you which. If you train models, you already know this failure by another name: a data-parallel step where one worker drew all the long sequences while the rest sat idle at the all-reduce barrier. Under the profiler the idle time concentrated in contact detection, the phase that dominates the runtime. The imbalance never crashed anything; it just quietly ate throughput as the tissue grew.

The bottleneck lived in the OpenMP directives. Static mode leans entirely on fixed scheduling: bare #pragma omp parallel for loops in solver.cpp with no schedule() clause (on GCC and Clang the default is static, equal-sized contiguous chunks), an explicit schedule(static) in time_integration.cpp, and another bare parallel loop in the contact model. Static scheduling slices the cells into equal index ranges up front - thread 0 takes cells 0-15, thread 1 takes 16-31, and so on - and never rebalances. For a tissue simulation that assumption fails on the second cell.

The costs really do diverge. Each cell is a triangulated mesh, and meshes drift apart in cost as the tissue evolves. A freshly divided child has fewer faces than its parent; a growing cell has more; a cell wedged into a contact pocket spends far longer in its per-face loops than an isolated one. Picture two cells on the same step: a fresh low-face child, and a crowded neighbor jammed against four others. Static scheduling hands them out by index range, blind to how much each one costs, so whichever thread drew that crowded cell on step one keeps drawing its kind for the rest of the run. I could not fix that without first measuring it, and what I needed was one fast number that says, right now, how lumpy the work is.

A single number for how uneven the work is

The coefficient of variation (CoV = σ/μ) measures how spread out a distribution is relative to its mean. When every cell costs the same, CoV is zero and static scheduling is optimal. As some cells grow much heavier than others, CoV climbs and dynamic scheduling starts to win. The catch: you cannot measure CoV by running the loop, because running it is the work you are trying to schedule. The cost has to be estimated before the loop starts.
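As a concrete reference for the definition, a minimal sketch of CoV over a vector of estimated per-cell costs. It assumes the population standard deviation; whether the fork divides by N or N-1 is not stated here, and the function name is illustrative.

```cpp
#include <cmath>
#include <vector>

// Coefficient of variation (sigma / mu) of estimated per-cell costs.
// Uses the population standard deviation; returns 0 for empty or
// zero-mean input so callers never divide by zero.
double coefficient_of_variation(const std::vector<double>& cost) {
    if (cost.empty()) return 0.0;
    const double n = static_cast<double>(cost.size());
    double mean = 0.0;
    for (double c : cost) mean += c;
    mean /= n;
    if (mean == 0.0) return 0.0;
    double variance = 0.0;
    for (double c : cost) variance += (c - mean) * (c - mean);
    variance /= n;
    return std::sqrt(variance) / mean;
}
```

Identical costs give 0; two cells costing 1 and 3 give a mean of 2, a standard deviation of 1, and a CoV of 0.5.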

So the estimator in src/solver.cpp is a weighted sum of structural features per cell: a base_cost proportional to face count; a contact_cost scaled by a compile-time constant for the active contact model (0.25 for node-face springs, 0.28 for node-node coupling, 0.32 for face-face coupling); an integration_cost of 0.65 x base_cost for dynamic cells; plus smaller terms for polarization, growth, and mesh quality. The coefficients were fit by hand against measured per-cell timings and rounded to two decimals. Run the two cells from a moment ago through it. The fresh child, a handful of faces and no contacts, scores a low base_cost and almost no contact_cost; the crowded neighbor scores several times higher on both. CoV is just how far apart those scores spread across the whole tissue.

None of this is precise, and it does not need to be. The accurate alternative, profiling every cell’s real cost each step, would cost more than the imbalance it removes. A cheap structural proxy wins as long as it stays roughly monotonic with real cost and runs fast. Both hold: the estimator is O(N_cells) and returns a single float.

The first surprise is that the machinery barely engaged. The adaptive CoV peaked at 0.144 in RUN01 and 0.157 in RUN02, well short of the 0.4 and 0.6 chunk-band thresholds built into the scheduler. The chunk divisor stayed at 4, its coarsest setting, for both experiments. The high-CoV dynamic-chunking regime the system is designed for was never exercised, so both runs are a strong stress test for cell count but leave the scheduler’s high-heterogeneity branch untested.

Workload CoV versus tissue size (log x). Solid blue is RUN02's adaptive scheduler to 26,534 cells, faint blue RUN01's, red dashed the Static (v1) baseline to v1's 12,851-cell reach. The dashed lines at 0.4 and 0.6 are the scheduler's own chunk-band thresholds (from `calculate_optimal_chunk_size()`); no curve reaches either, so the adaptive chunk divisor stays at its coarsest setting throughout. RUN01 adaptive: mean 0.106, median 0.122, max 0.144; RUN02 adaptive max 0.157. Static: mean 0.164, median 0.161, max 0.212.

Three modes, four loops, three phases, and what actually ran

The scheduler picks among three OpenMP modes. static hands each thread a fixed pile up front: fast, but one thread can get stuck with all the slow cells. dynamic gives everyone a shared queue they pull from as they finish: no idle threads, but a small per-grab cost. guided is the middle ground, big grabs first, shrinking toward the tail. If you have ever tuned dynamic batching or a work-stealing pool, this is the same trade-off: granularity against coordination overhead.

calculate_optimal_chunk_size() turns the CoV directly into a granularity via a divisor: 4 for CoV ≤ 0.4 (mild imbalance, coarse chunks), 10 for 0.4 < CoV ≤ 0.6 (moderate), 20 for CoV > 0.6 (high imbalance, fine chunks). Then chunk = max(1, min(num_cells / (num_threads x divisor), 100)):

// CoV -> chunk granularity (src/solver.cpp, paraphrased)
int divisor = cov <= 0.4 ? 4      // mild imbalance -> coarse chunks
            : cov <= 0.6 ? 10     // moderate
            :              20;    // high imbalance -> fine chunks
// below 100 cells, always uses divisor=4 (hardcoded fast path)
int chunk = std::max(1, std::min(num_cells / (num_threads * divisor), 100));

A bigger divisor means smaller chunks, and a smaller chunk is finer load balancing. Since CoV never crossed 0.4 on either run, that divisor never left 4.
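Plugging numbers into the paraphrased formula shows how the divisor and the cap of 100 interact. The helper below is a hypothetical stand-in that mirrors the paraphrase above, including the below-100-cells fast path; its name and signature are not the fork's.

```cpp
#include <algorithm>

// Hypothetical helper mirroring the paraphrased formula; not the fork's signature.
int chunk_for(double cov, int num_cells, int num_threads) {
    // Below 100 cells the divisor is pinned to 4, as in the paraphrase.
    const int divisor = (num_cells < 100 || cov <= 0.4) ? 4
                      : (cov <= 0.6)                    ? 10
                      :                                   20;
    // Integer division, then clamp the result to [1, 100].
    return std::max(1, std::min(num_cells / (num_threads * divisor), 100));
}
```

At 1,000 cells on 8 threads, a CoV of 0.144 gives a chunk of 31, and a CoV of 0.7 would give 6. If the paraphrase matches the source, the cap of 100 takes over once num_cells reaches 100 x num_threads x divisor, which on 8 threads is 3,200 cells at divisor 4 and 16,000 at divisor 20, so at 19,958 cells every band would resolve to the same chunk of 100.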

Different loop categories have different workload shapes, so they should not share one schedule. initialize_per_loop_schedules() sets four fixed structural assignments: contact detection runs omp_sched_dynamic, the most irregular phase, its cost varying with mesh density and contact geometry; time integration runs omp_sched_guided, more uniform per-cell cost, so guided’s shrinking grabs capture most of the benefit without dynamic’s per-grab overhead; mesh updates run omp_sched_static, the one category where per-cell cost genuinely is uniform and static maximizes cache locality; and cell division runs omp_sched_dynamic with a finer chunk, the rarest and most variable phase. Most hot #pragma omp parallel for loops in the fork carry schedule(runtime), which is what enables this late binding.
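The late-binding mechanism can be sketched with the standard OpenMP runtime call omp_set_schedule() and a schedule(runtime) loop. Everything else here is hypothetical: the enum and function names, the choice of chunk 0 (runtime default) for mesh updates, and the halving used to stand in for the "finer chunk" on cell division. The OpenMP calls are guarded so the sketch also compiles serially.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical names; the kinds mirror the four assignments described above.
enum class LoopCategory { ContactDetection, TimeIntegration, MeshUpdate, CellDivision };
enum class Kind { Static, Dynamic, Guided };

struct LoopSchedule {
    Kind kind;
    int chunk;  // values below 1 ask the OpenMP runtime for its default chunk
};

LoopSchedule schedule_for(LoopCategory category, int chunk) {
    switch (category) {
        case LoopCategory::ContactDetection: return {Kind::Dynamic, chunk};
        case LoopCategory::TimeIntegration:  return {Kind::Guided, chunk};
        case LoopCategory::MeshUpdate:       return {Kind::Static, 0};
        // Assumption: "finer" is modeled as half the chunk, at least 1.
        case LoopCategory::CellDivision:     return {Kind::Dynamic, std::max(1, chunk / 2)};
    }
    return {Kind::Static, 0};
}

// Bind the schedule that the next schedule(runtime) loop will pick up.
void apply_schedule(const LoopSchedule& s) {
#ifdef _OPENMP
    const omp_sched_t kind = s.kind == Kind::Dynamic ? omp_sched_dynamic
                           : s.kind == Kind::Guided  ? omp_sched_guided
                           :                           omp_sched_static;
    omp_set_schedule(kind, s.chunk);
#else
    (void)s;  // serial build: nothing to bind
#endif
}

// Stand-in for a hot loop: the pragma stays fixed, the schedule is chosen at run time.
double sum_costs(const std::vector<double>& cost, LoopCategory category, int chunk) {
    apply_schedule(schedule_for(category, chunk));
    double total = 0.0;
    const long n = static_cast<long>(cost.size());
    #pragma omp parallel for schedule(runtime) reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += cost[static_cast<std::size_t>(i)];
    }
    return total;
}
```

The point of the design is that the loop body and its pragma never change; only the call that precedes the loop decides whether the iterations are handed out as static slices, a dynamic queue, or guided grabs.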

Phases are the other axis. They decide when in the tissue’s life to shift the global default. The scheduler tracks three: INITIALIZATION (dynamic, thread count exceeds task count), GROWTH (triggered when the recent division rate exceeds 0.01, where the rate is $\text{recent divisions} / (N_{\text{cells}} \times 50)$; then dynamic if CoV > 0.4, else guided), and HOMEOSTASIS (static if cell count > 1,000, else guided). Every 50 iterations (COV_UPDATE_INTERVAL = 50) the solver recalculates the CoV and reconsiders the phase.

The phases tell the same story. In the maximum-cell fast-growth run, grep GROWTH against the adaptive performance-diagnostics log returns zero matches. On this workload the adaptive scheduler spent 41 samples in INITIALIZATION and 167 in HOMEOSTASIS, never touching the phase the GROWTH branch was written for; RUN02 split 40 and 179 the same way. The workload is what kept it dormant: the run metadata points to parameters/parameters_128k_fast_growth.xml, but the phase detector still never saw a recent division rate high enough to cross the threshold.

The scheduler defines three phases but the fast-growth workload only entered two. Both runs ran INITIALIZATION while the tissue was tiny, then jumped straight to HOMEOSTASIS and stayed; the division rate never crossed the 0.01 that triggers GROWTH, so the branch written for it never ran.

About 2x the cells per compute budget

I treat what follows as two independent runs and keep their numbers separate. RUN01, the 8-core run, is the headline: 19,958 adaptive cells against 9,693 for the v1 baseline. RUN02, on a 7-core machine and carried until the adaptive process was killed at about 98 GB of RAM, is a reproduction that happens to reach further: 26,534 adaptive cells against 12,851 for v1. Both ratios round to 2.06x (19,958 / 9,693 ≈ 2.059 and 26,534 / 12,851 ≈ 2.065). The 26,534 is where memory ran out, and the two machines differ, so on the wall-clock plots compare the shapes and treat the absolute times as machine-specific.

Cell count versus wall-clock, both axes log2 so every gridline is one doubling. RUN01 (8 cores, solid) reached 19,958 adaptive cells against 9,693 for v1; RUN02 (7 cores, dashed) reached 26,534 against 12,851, about 2x more cells for the same budget in each. The two runs are on different machines, so compare the shapes: adaptive growth is a near-straight line, with cells proportional to t^0.88. Each doubling still costs more than the last, stretching from about 3 hours for the 2k to 4k doubling to about 14 hours for 8k to 16k.
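The power law and the rising cost per doubling are two views of the same fit. A short derivation, assuming the exponent 0.88 holds across the whole range (the plotted curve only approximates it):

```latex
\[
N \propto t^{0.88}
\quad\Longrightarrow\quad
t \propto N^{1/0.88} \approx N^{1.136}
\]
\[
\frac{\Delta t_{2N \to 4N}}{\Delta t_{N \to 2N}}
= \frac{t(4N) - t(2N)}{t(2N) - t(N)}
= 2^{1/0.88} \approx 2.20
\]
```

So each doubling takes about 2.2 times as long as the one before it. The 8k to 16k doubling sits two doublings after 2k to 4k, giving a factor of about 2.20^2 ≈ 4.8, and 3 hours x 4.8 ≈ 14.5 hours, in line with the roughly 14 hours read off the plot.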

At matched cell counts the picture is cleaner than the raw endpoints: adaptive is faster in every usable band after 25 cells, with one midrange dip around 250-499 cells in RUN01 that RUN02 does not repeat.

Matched-cell speedup, binned by cell count: median adaptive IPS divided by median v1 IPS in the same band, for RUN01 (blue) and RUN02 (grey). RUN01 peaks at 2.75x (150-249 cells) with a single dip to 1.50x at 250-499; RUN02 holds a steadier 2.2 to 2.6x, the dip does not return, and it extends to a 10k-20k band RUN01 never reached. The 10-24 cell band is omitted because v1 has only one cleaned sample there in RUN01.

The physics barely moves. Median pressure deviation between the two modes is 2.93% across 389 matched iterations in the cleaned biological outputs (RUN01), and 2.33% across 403 in RUN02. Reordering the work changes how fast the simulation reaches its answer and shifts the answer itself by only a few percent.

Where the 2x actually came from, and what is still open

Break one mean iteration down by phase and you can see where each scheduler spends its time.

Share of mean iteration time per phase, Static (v1) to Adaptive, cells > 100. Contact detection falls from 82% of the iteration under Static to 54% (RUN01) and 57% (RUN02); polarization and internal forces become the largest remaining phase because contact detection shrank around them, while that work itself held flat. Static: contact 82 / polarization plus forces 14 / time integration 3 / mesh 2. Adaptive RUN01: 54 / 33 / 8 / 5. RUN02: 57 / 30 / 7 / 6.

Contact detection drops from ~82% of mean iteration time under Static to ~54% in RUN01 and ~57% in RUN02. Polarization and internal forces rise to the second-largest phase because contact detection got faster around them; the work itself held flat. Time integration stays a minor 3 to 8% throughout.

Here is the caveat, stated once and in full. This run supports the matched-cell throughput comparison, but it does not isolate the scheduler from the one contact-detection change that was actually live. The runs used the default detection path: the config sets no contact_detection_algorithm, and the code default is uspg. On that path the fork’s real change is the spatial grid itself, whose per-voxel storage switched from a std::forward_list to a std::vector of vectors for cache locality (the source claims a 1.3-2x gain on that container). So the 82% to 54% drop bundles the new schedules with that grid change, and I cannot separate the two here.
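For readers who have not met the container change, here is a minimal sketch of a uniform grid with one std::vector of face ids per voxel. It illustrates the general vector-of-vectors layout only; the class, its methods, and the use of integer face ids are hypothetical and do not reproduce the fork's grid.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical uniform spatial grid: one contiguous vector of face ids per voxel.
class VoxelGrid {
public:
    VoxelGrid(std::size_t nx, std::size_t ny, std::size_t nz)
        : nx_(nx), ny_(ny), nz_(nz), voxels_(nx * ny * nz) {}

    std::size_t num_voxels() const { return nx_ * ny_ * nz_; }

    void insert(std::size_t ix, std::size_t iy, std::size_t iz, int face_id) {
        voxels_[index(ix, iy, iz)].push_back(face_id);
    }

    // Iterating a voxel walks one contiguous array instead of chasing
    // separately allocated list nodes, which is the cache-locality argument.
    const std::vector<int>& faces_in(std::size_t ix, std::size_t iy,
                                     std::size_t iz) const {
        return voxels_[index(ix, iy, iz)];
    }

    // clear() keeps each voxel's capacity, so refilling the grid on the
    // next step usually avoids reallocating.
    void clear() {
        for (std::vector<int>& v : voxels_) v.clear();
    }

private:
    // Row-major flattening of (ix, iy, iz) into the voxel array.
    std::size_t index(std::size_t ix, std::size_t iy, std::size_t iz) const {
        return (iz * ny_ + iy) * nx_ + ix;
    }

    std::size_t nx_, ny_, nz_;
    std::vector<std::vector<int>> voxels_;
};
```

A std::forward_list allocates one node per inserted face and scatters them across the heap; the vector version pays an occasional reallocation in exchange for contiguous reads during the neighbor scan, which is the access pattern contact detection repeats most.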

The two fancier contact-detection features are in the fork but never ran on this workload. Morton-code sorting of faces and the Sweep-and-Prune broad-phase (the one gated on ADAPTIVE_SAP_CELL_THRESHOLD = 500) both live only behind the non-default sweep_and_prune and adaptive settings; under the default uspg, the face-face model takes its direct-grid path and calls neither. So they join the GROWTH phase and the CoV thresholds on the same list: implemented, confirmed in the source, and never exercised by these runs. What actually moved the number was the schedule and the humbler vector-of-vectors grid.

A few other things stayed open, and I will come back to them when my day job leaves room:

Thread Sanitizer is not in CI yet. Only ASan and UBSan run. The cell-division parallel section still produces enough benign-looking races that TSan is noisy, and that noise needs triaging before it can gate the build.
The Python wrapper has not caught up. The new --schedule= and --diagnostics-csv= CLI flags exist on the C++ binary but are not exposed through the pybind11 wrapper, so simucell3d_wrapper cannot reach the new knobs.
assert() in hot paths. Over 300 assert() calls still live in production src/ (all runtime assert(), none static_assert). Several critical paths are converted; the rest are a slower migration.
A lot of the built machinery is idle in the current config, so the measured throughput is closer to a lower bound. The 0.4/0.6 chunk-band thresholds, the GROWTH phase detector, and the two non-default detection algorithms (Morton-sorted USPG and Sweep-and-Prune) are all confirmed in the source, but none fired on the fast-growth runs that reached 19,958 and then 26,534 cells. A workload with sharper per-cell cost heterogeneity, and a run that sets contact_detection_algorithm away from the default, are the right ways to load those branches.

Fork on GitHub | Project page | SimuCell3D paper (Nature Computational Science, 2024)
