Reading the Doppler angle off B-mode: a deep-learning replication, tuned two ways
A spectral-Doppler velocity measurement is only as good as one angle. Blood velocity is recovered from the Doppler equation,
\[f_d = \frac{2 f_0 \, v \cos\theta}{c},\]

where $f_d$ is the measured Doppler shift, $f_0$ the transmit frequency, $v$ the blood velocity, $c$ the speed of sound in tissue, and $\theta$ the angle between the ultrasound beam and the direction of flow. The $\cos\theta$ multiplies the velocity directly, so a sloppy angle is a sloppy velocity, and angle correction is set by hand on the scanner. Vascular labs miss it often enough that accreditation reviewers flag improper angle correction in a large share of applications.
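To make the sensitivity concrete, here is a minimal sketch that inverts the Doppler equation above and computes the velocity error caused by a mis-set angle. The function names and the 1540 m/s speed of sound (a conventional soft-tissue value) are assumptions for illustration, not part of the original pipeline.

```python
import math

SPEED_OF_SOUND = 1540.0  # m/s in soft tissue; a conventional value, assumed here


def velocity_from_shift(f_d, f_0, theta_deg, c=SPEED_OF_SOUND):
    """Invert the Doppler equation: v = c * f_d / (2 * f_0 * cos(theta))."""
    return c * f_d / (2.0 * f_0 * math.cos(math.radians(theta_deg)))


def relative_velocity_error(true_deg, set_deg):
    """Fractional velocity error when the angle cursor is set to set_deg
    but the true beam-to-flow angle is true_deg."""
    return math.cos(math.radians(true_deg)) / math.cos(math.radians(set_deg)) - 1.0


for true_deg in (30, 45, 60, 70):
    err = relative_velocity_error(true_deg, true_deg + 5)
    print(f"true {true_deg} deg, set {true_deg + 5} deg: {100 * err:+.1f}% velocity error")
```

The same 5° mistake costs roughly 6% in velocity at a true angle of 30° and roughly 18% at 60°, because $1/\cos\theta$ steepens as $\theta$ grows.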
In 2019, Patil & Anand asked whether a network could read $\theta$ straight off a grayscale B-mode image of the carotid — no color Doppler, no segmentation — and reported a best model at under 3° mean absolute error, $R^2 \approx 0.99$. I am the first author. Seven years later I rebuilt the whole thing from scratch in Keras 3 / JAX, test-first, with one goal: replicate it cleanly, understand why it works, and then push the estimator as far as it will go.
Results at a glance (84 carotid images, ~10 volunteers; Apple M4 Max; Keras 3 / JAX; everything built test-first):
| | |
|---|---|
| Replication | A frozen DenseNet201 + a small head reproduces the paper’s Table I at 5.84% MAPE (3.77° MAE), once you fix the pooling. The win was never the backbone. |
| The pooling insight | Global average pooling is rotation-invariant, which is wrong for an orientation target. An orientation-preserving grid pooling head lifts the frozen backbone from ~14% to 5.84% MAPE with no fine-tuning. |
| Best estimator | An Optuna-tuned 5-model ensemble reaches 2.79% MAPE / 1.96° MAE ($R^2$ 0.995) under image-level sampling, and 8.53% / 5.93° ($R^2$ 0.952) under the stricter patient-level sampling. |
| Clinical-grade | Split-conformal 90% intervals are ±20.5° at 95% empirical coverage; Bland–Altman shows +4.3° bias vs the reference reading; test-time augmentation cuts base-image error 7.8° → 4.7°. |
Replication: it was the pooling, not the backbone
Before improving anything I had to actually hit Table I. My first frozen-feature runs landed around 14% MAPE — nowhere near the paper. The instinct is to blame “frozen features can’t reach fine-tuned numbers.” That instinct was wrong.
The culprit was global average pooling. Averaging a convolutional feature map over all spatial positions is partly rotation-invariant — which is exactly the wrong inductive bias when the thing you are predicting is an orientation. The pooling was throwing away the spatial layout that encodes the vessel’s direction. Swapping in an orientation-preserving grid pooling (average-pool the feature map down to a small $G\times G$ grid, then flatten, instead of collapsing to a single vector) dropped frozen DenseNet201 from ~14% to 5.84% MAPE — reproducing Table I with no fine-tuning at all.
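The difference between the two pooling heads can be shown on a toy feature map. This is a NumPy sketch under stated assumptions, not the library's head: the function names are hypothetical, and it requires the map size to divide evenly by the grid, which a real backbone output may not.

```python
import numpy as np


def global_average_pool(fmap):
    """Collapse an (H, W, C) feature map to a single C-vector."""
    return fmap.mean(axis=(0, 1))


def grid_pool(fmap, grid=2):
    """Average-pool an (H, W, C) map down to (grid, grid, C), then flatten."""
    h, w, c = fmap.shape
    assert h % grid == 0 and w % grid == 0, "sketch assumes an evenly divisible map"
    cells = fmap.reshape(grid, h // grid, grid, w // grid, c).mean(axis=(1, 3))
    return cells.reshape(-1)


# A one-channel map with a diagonal "vessel", and the same map rotated by 90 degrees.
diagonal = np.eye(4)[..., None]
rotated = np.rot90(diagonal)

print("GAP:      ", global_average_pool(diagonal), global_average_pool(rotated))
print("grid pool:", grid_pool(diagonal), grid_pool(rotated))
```

Global average pooling returns the same vector for both orientations, so a head on top of it cannot tell them apart. The flattened grid keeps which cell the activation fell in, and that is the information an orientation regressor needs.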
That is the central deep-learning lesson of the rebuild: a frozen ImageNet backbone already sees the vessel orientation; you just have to not pool it away.
Two ways to ask “how good is it?”
The original pipeline manufactures its training data the standard way for a tiny clinical cohort: take 84 base images, rotate each through $[-60°, +60°]$ in 5° steps to make 25 labelled variants (the label is just the base angle plus the rotation), giving ~2,100 images. That rotation sweep is doing double duty — it is both the augmentation and the label generator.
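The arithmetic of the sweep is easy to check. The sketch below only generates the rotation/label pairs (no image is rotated); the function name and the example base angle are made up for illustration.

```python
def rotation_sweep(base_angle_deg, lo=-60, hi=60, step=5):
    """Return (rotation, label) pairs for one base image.

    The label is the base angle plus the applied rotation, as in the text.
    Whether a positive rotation is clockwise depends on the image library,
    so the sign convention here is an assumption.
    """
    return [(rot, base_angle_deg + rot) for rot in range(lo, hi + 1, step)]


variants = rotation_sweep(base_angle_deg=55.0)  # 55.0 is a made-up base angle
n_base_images = 84
print(len(variants), "variants per image,", n_base_images * len(variants), "images in total")
```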
Once you have a synthetic set like that, there are two honest ways to score a model, and they answer different questions. I implemented both behind one config flag:
- Image-level sampling — the paper’s protocol: draw the train/test split over the augmented image corpus. This measures angle accuracy across the full population of orientations and imaging conditions the augmentation spans. It is the standard computer-vision way to evaluate a model on a synthetic/augmented dataset.
- Patient-level sampling — hold out whole volunteers, so the test set is anatomy the model has never seen at any rotation. This is a stricter lens: cross-subject generalization.
Neither is “the wrong one.” Image-level tells you how well the estimator interpolates across the imaging manifold; patient-level tells you how it travels to a new patient. A complete write-up reports both — and, crucially, tunes both to their own best rather than tuning one and quoting the other.
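A minimal sketch of the two protocols, assuming each augmented sample carries a patient identifier. The dictionary schema, function names, and the toy corpus layout are hypothetical and do not reflect the real cohort or the library's config flag.

```python
import random


def image_level_split(samples, test_frac=0.2, seed=0):
    """Shuffle the augmented images themselves and cut."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(test_frac * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def patient_level_split(samples, test_frac=0.2, seed=0):
    """Hold out whole patients, so no test anatomy appears in training."""
    patients = sorted({s["patient"] for s in samples})
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_frac * len(patients)))
    held_out = set(patients[:n_test])
    train_set = [s for s in samples if s["patient"] not in held_out]
    test_set = [s for s in samples if s["patient"] in held_out]
    return train_set, test_set


# Toy corpus: 10 patients x 8 base images x 25 rotations (not the real cohort layout).
samples = [
    {"patient": p, "image": i, "rotation": r}
    for p in range(10) for i in range(8) for r in range(-60, 61, 5)
]

for name, split in (("image", image_level_split), ("patient", patient_level_split)):
    train_set, test_set = split(samples)
    shared = {s["patient"] for s in train_set} & {s["patient"] for s in test_set}
    print(f"{name}-level: {len(train_set)} train, {len(test_set)} test, "
          f"{len(shared)} patients on both sides")
```

Under the image-level split, rotated copies of the same base image can land on both sides of the cut, which is why it measures interpolation across the augmented population. The patient-level split removes that overlap by construction.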

The same frozen grid-pooling models under two sampling protocols. Image-level sampling scores accuracy across the augmented-image population; patient-level sampling is the harder cross-subject regime. Reporting both is the honest thing to do for a small cohort.
How far the estimator climbs
With the pooling fixed, the question becomes: what is the best estimator I can build under each protocol?

Frozen backbone bake-off. DenseNet201 beats ConvNeXt and EfficientNetV2 — newer is not better for frozen small-data transfer.
First, newer backbones do not help. A frozen-feature bake-off across the modern zoo — ConvNeXt, EfficientNetV2 — leaves DenseNet201 on top. With 84 images, the foundation-model-scale encoders have nothing to grip.
What did help was treating it as a small, careful tuning problem. I ran an Optuna TPE search over the head and optimizer, once per protocol, each under that protocol’s own cross-validation. The search runs cheaply against cached frozen features, so each trial is a shallow head fit rather than a backbone pass (one feature extraction per backbone serves both protocols). I then stacked the five tuned backbones into an ensemble and scored it out-of-fold:
- Image-level sampling: the tuned ensemble reaches 2.79% MAPE / 1.96° MAE ($R^2$ 0.995) — better than the paper’s best single model, and a clean improvement on the 5.84% frozen replication.
- Patient-level sampling: the tuned ensemble reaches 8.53% MAPE / 5.93° MAE ($R^2$ 0.952), with the out-of-fold mean at 9.89%.
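The post does not say which meta-learner the stack uses, so the sketch below assumes the simplest one: linear least squares with an intercept, cross-fitted so the stacker itself is scored out-of-fold. All data here is synthetic and the function names are hypothetical.

```python
import numpy as np


def stack_out_of_fold(member_preds, y, groups):
    """Cross-fitted linear stacking.

    member_preds: (n, m) out-of-fold predictions from m tuned members.
    groups: (n,) fold id per sample (a patient id for patient-level folds).
    For each fold, fit least-squares weights plus an intercept on the other
    folds and predict the held-out fold.
    """
    X = np.column_stack([member_preds, np.ones(len(y))])
    stacked = np.empty(len(y), dtype=float)
    for g in np.unique(groups):
        held_out = groups == g
        w, *_ = np.linalg.lstsq(X[~held_out], y[~held_out], rcond=None)
        stacked[held_out] = X[held_out] @ w
    return stacked


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))


# Synthetic stand-in data: five members with different biases and noise levels.
rng = np.random.default_rng(0)
y = rng.uniform(30.0, 80.0, size=400)
biases = np.array([5.0, -2.0, 8.0, 0.0, 3.0])
noise = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
member_preds = y[:, None] + biases + rng.normal(size=(400, 5)) * noise
groups = np.arange(400) % 10  # ten pseudo-patients used as folds

plain_mean = member_preds.mean(axis=1)
stacked = stack_out_of_fold(member_preds, y, groups)
print(f"plain mean MAPE {mape(y, plain_mean):.2f}%, stacked MAPE {mape(y, stacked):.2f}%")
```

On this toy data the stacker wins mainly because it can remove each member's bias and down-weight the noisier members. A plain mean only works once the members are individually well behaved.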

The arc from the frozen replication through the Optuna-tuned ensemble. Tuning the head and ensembling — not a bigger backbone — is what moves the number.
Tuning also made the members well-calibrated enough that even a plain mean ensemble works — the untuned mean was a useless 21.9%.
Clinical-grade, and honest about it
A point estimate is not a clinical tool; a calibrated interval is. Three things, all computed post-hoc on the held-out predictions:
- Conformal intervals. Split-conformal with a patient-disjoint calibration set gives 90% prediction intervals of ±20.5° with 95.2% empirical coverage — valid, and honest about how wide an image-only estimate really is.
- Agreement. A Bland–Altman analysis against the reference reading shows a small +4.3° bias. One caveat I refuse to paper over: there is exactly one human reading per image, so this is method-versus-reference, not inter-observer agreement. I will not fabricate a second radiologist to make the plot look more clinical.
- Test-time augmentation. Averaging predictions over rotations of the same frame (de-rotating each, then circular-averaging) cuts the per-image MAE from 7.8° to 4.7° — free accuracy at inference, no retraining.
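The test-time augmentation step in the last bullet can be sketched as follows. The "model" here is a stand-in with invented per-view errors, and the code treats angles as directions with a 360° period; an orientation defined only modulo 180° would need the angles doubled before averaging. Function names are hypothetical.

```python
import math


def circular_mean_deg(angles_deg):
    """Mean direction of a set of angles, via the summed unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


def tta_predict(predict_at_rotation, rotations_deg):
    """predict_at_rotation(r) is the model's angle for the frame rotated by r.

    De-rotate each prediction (subtract r, matching label = base + rotation),
    then circular-average the de-rotated estimates.
    """
    derotated = [predict_at_rotation(r) - r for r in rotations_deg]
    return circular_mean_deg(derotated)


true_angle = 60.0
fake_errors = {-10: 3.0, -5: -2.0, 0: 4.0, 5: -1.0, 10: -4.0}  # invented per-view errors


def fake_model(rotation_deg):
    # Stand-in for "rotate the frame, run the network": truth + rotation + error.
    return true_angle + rotation_deg + fake_errors[rotation_deg]


estimate = tta_predict(fake_model, sorted(fake_errors))
print(f"single view: {fake_model(0):.1f} deg, TTA estimate: {estimate:.2f} deg")
```

The gain comes from per-view errors that partly cancel. If every rotated view shared the same bias, averaging would not remove it.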

Calibration. Empirical coverage sits at or above nominal at every level — the intervals are valid, not optimistic.
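For readers unfamiliar with split-conformal intervals, here is a minimal sketch assuming the absolute residual as the nonconformity score, which yields the symmetric ± form quoted above. The residuals and predictions are invented numbers, and the function names are hypothetical.

```python
import math


def conformal_halfwidth(calib_abs_residuals, alpha=0.10):
    """Split-conformal half-width: the ceil((n + 1) * (1 - alpha))-th smallest
    absolute residual on the calibration set."""
    n = len(calib_abs_residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(calib_abs_residuals)[k - 1]


def empirical_coverage(y_true, y_pred, halfwidth):
    """Fraction of test points whose truth lies within pred +/- halfwidth."""
    hits = sum(abs(t - p) <= halfwidth for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


calib_residuals = [float(r) for r in range(1, 21)]  # invented residuals, in degrees
q = conformal_halfwidth(calib_residuals, alpha=0.10)

y_true = [50.0, 62.0, 71.0, 45.0]  # invented test angles
y_pred = [48.0, 70.0, 95.0, 44.0]
print(f"90% interval: pred +/- {q:.1f} deg, "
      f"empirical coverage {empirical_coverage(y_true, y_pred, q):.2f}")
```

The guarantee holds only if calibration and test points are exchangeable, which is why the calibration set has to be patient-disjoint from the training data.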
And because “the model attends to the vessel” is a claim, not a hope, I rendered Grad-CAM over the trained image→angle pipeline. The attention does land on vessel-wall structure rather than on speckle in the corners.

Grad-CAM on the trained DenseNet201 pipeline — where the network looks when it reads the angle.
What I left on the table (honestly)
This whole rebuild ran on a laptop, and I kept it honest about the ceiling. End-to-end fine-tuning of the strong backbones runs out of memory on the Apple GPU; the modern self-supervised encoders (DINOv2, a medical-ultrasound foundation model) need a CUDA box and a reworked dependency stack. Those are real next steps, not silent omissions. A classical, hand-crafted structure-tensor baseline (MAE 3.16°, with a learned+classical circular fusion at 2.72°) is a reminder that the learned model and a purely geometric cue capture partly complementary information.
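For context on what a structure-tensor cue looks like, here is a minimal NumPy sketch of the textbook estimator on a synthetic stripe pattern. It is not the baseline from the rebuild: the function name is hypothetical, there is no smoothing or region selection, and angles are measured in array coordinates (x along columns, y along rows, pointing down).

```python
import numpy as np


def structure_tensor_orientation(img):
    """Dominant gradient orientation of a 2-D image, in degrees modulo 180."""
    gy, gx = np.gradient(img.astype(float))  # derivatives along rows (y), columns (x)
    jxx, jyy, jxy = (gx * gx).mean(), (gy * gy).mean(), (gx * gy).mean()
    return np.degrees(0.5 * np.arctan2(2.0 * jxy, jxx - jyy)) % 180.0


# Synthetic stripes whose intensity varies along a direction 30 degrees from the x axis.
phi = np.radians(30.0)
yy, xx = np.mgrid[0:128, 0:128]
stripes = np.sin(0.3 * (xx * np.cos(phi) + yy * np.sin(phi)))

grad_dir = structure_tensor_orientation(stripes)
edge_dir = (grad_dir + 90.0) % 180.0  # the stripes run perpendicular to the gradient
print(f"gradient orientation {grad_dir:.1f} deg, stripe orientation {edge_dir:.1f} deg")
```

The estimate depends only on image gradients, so it fails in different places than a learned regressor does. That is consistent with the two cues being partly complementary.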
The code is a typed, fully test-driven Keras-3 library — the two sampling protocols, the pooling fix, the Optuna tuning, the ensembling, and the conformal/Bland–Altman/calibration/Grad-CAM stack — with every figure regenerated from results/ by script. The interactive write-up lives on the project site. The one sentence I would attach to the original paper: a frozen backbone already sees the angle — the work is in not pooling it away, and in tuning and calibrating what’s left.