← All docs changelog/2026-05-06.md

May 6, 2026

Commits

1efeee9 — Self-heal zombie training jobs and finer-grained progress bar

Fixes stuck running rows on the ML training dashboard. A new _probe_running_job(job_id, job) helper asks Modal whether a running FunctionCall is still alive; reaper kills, OOMs, and other silent terminations now land as failed with the exception captured. Wired into both /ml/training/jobs/{id} and /ml/training/runs/{id} so dashboard reloads self-clear stuck rows.

The trainer now emits total_steps (= state.max_steps) on epoch_started and step events so the dashboard renders step 64/72 instead of the confusing step 64/24 (global_step vs per-epoch denominator).

Progress bar in ml-training.html fills proportional to global_step / total_steps, with two layered tick overlays: per-step (faint) and per-epoch (bolder), driven by --steps / --epochs CSS custom properties. Tick density capped at 200 to avoid solid-black rendering on very long runs.

Changed:

  • dev/ml/ml_endpoint.py_probe_running_job helper for zombie detection; wired into job + run GET endpoints (+68/-29)
  • dev/ml/training/trainer.py — emit total_steps on epoch_started and step events (+5)
  • dev/static_site/templates/demos/ml-training.html — proportional progress bar with step/epoch tick overlays (+73/-7)