May 6, 2026
Commits
1efeee9 — Self-heal zombie training jobs and finer-grained progress bar
Fixes stuck running rows on the ML training dashboard. A new _probe_running_job(job_id, job) helper asks Modal whether a running FunctionCall is still alive; reaper kills, OOMs, and other silent terminations now land as failed with the exception captured. Wired into both /ml/training/jobs/{id} and /ml/training/runs/{id} so dashboard reloads self-clear stuck rows.
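A minimal sketch of the probe logic described above, not the actual helper. The record fields (status, call_id, error) and the injected lookup_call parameter are hypothetical. In a Modal deployment lookup_call would presumably be modal.FunctionCall.from_id, whose get(timeout=0) is expected to raise TimeoutError while the call is still running. Marking a cleanly returned call as completed is also an assumption.

```python
from typing import Any, Callable


def probe_running_job(job: dict[str, Any], lookup_call: Callable[[str], Any]) -> dict[str, Any]:
    """Return the job record, downgraded to 'failed' if its call has died."""
    if job.get("status") != "running":
        return job  # only rows marked running can be zombies
    try:
        call = lookup_call(job["call_id"])  # hypothetical field name
        call.get(timeout=0)  # non-blocking poll of the remote call
    except TimeoutError:
        return job  # no result yet: the call is still executing
    except Exception as exc:  # reaper kill, OOM, or an error raised in the job
        return {**job, "status": "failed", "error": f"{type(exc).__name__}: {exc}"}
    # Assumption: the call returned normally but the row was never updated.
    return {**job, "status": "completed"}
```

Calling this from both GET handlers before serializing the row is what would let a plain dashboard reload clear a stuck entry.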
The trainer now emits total_steps (= state.max_steps) on epoch_started and step events, so the dashboard renders step 64/72 instead of the confusing step 64/24, which paired the run-wide global_step with a per-epoch denominator.
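The mismatch is easy to reproduce with small numbers. The sketch below assumes a run of 3 epochs with 24 steps each, which is consistent with the 64/24 and 64/72 labels above but is not stated in the commit. The function name and the event field names other than total_steps are hypothetical.

```python
def step_event(global_step: int, max_steps: int) -> dict:
    """Build a step event that carries the run-wide denominator."""
    return {"type": "step", "global_step": global_step, "total_steps": max_steps}


steps_per_epoch, num_epochs = 24, 3          # assumed run shape
max_steps = steps_per_epoch * num_epochs     # 72, the run-wide total

event = step_event(global_step=64, max_steps=max_steps)

# Old label: run-wide numerator over a per-epoch denominator.
old_label = f"step {event['global_step']}/{steps_per_epoch}"
# New label: numerator and denominator both count steps across the whole run.
new_label = f"step {event['global_step']}/{event['total_steps']}"

print(old_label)  # step 64/24
print(new_label)  # step 64/72
```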
The progress bar in ml-training.html now fills in proportion to global_step / total_steps, with two layered tick overlays: per-step (faint) and per-epoch (bolder), driven by the --steps and --epochs CSS custom properties. Tick density is capped at 200 so that very long runs do not render as a solid black bar.
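The fill and tick arithmetic can be sketched as follows. The template does this in the browser, so the Python here only illustrates the calculation. The function name is hypothetical, and clamping the tick counts with min() is an assumption; the commit says only that density is capped at 200.

```python
MAX_TICKS = 200  # cap taken from the commit message


def progress_style(global_step: int, total_steps: int, epochs: int) -> dict:
    """Compute the fill width and the tick counts for the two overlays."""
    if total_steps > 0:
        # Clamp to [0, 1] so a stray event cannot overflow the bar.
        fraction = min(max(global_step / total_steps, 0.0), 1.0)
    else:
        fraction = 0.0
    return {
        "width": f"{fraction * 100:.1f}%",
        "--steps": str(min(total_steps, MAX_TICKS)),   # faint per-step ticks
        "--epochs": str(min(epochs, MAX_TICKS)),       # bolder per-epoch ticks
    }


print(progress_style(64, 72, 3))
# {'width': '88.9%', '--steps': '72', '--epochs': '3'}
```

With 10,000 total steps the same function would report 200 for --steps, which keeps the faint overlay from merging into a solid band.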
Changed:
dev/ml/ml_endpoint.py: _probe_running_job helper for zombie detection; wired into job and run GET endpoints (+68/-29)
dev/ml/training/trainer.py: emit total_steps on epoch_started and step events (+5)
dev/static_site/templates/demos/ml-training.html: proportional progress bar with step/epoch tick overlays (+73/-7)