Apr 22, 2026

Commits

`86034e2` — Simplify model versioning — shared Volume, config-driven env pins

Collapsed three overlapping version-resolution mechanisms (_current_{env}.json pointers, per-env Volumes, S3 registry preflight) into one: a single Modal Volume living in the main environment, mounted cross-env by every app. Active version per env is now declared in dev/core/model_versions.py as either a literal "vN" or "latest", where "latest" resolves through a single {slot}/_latest.json pointer that the trainer advances on successful runs.
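The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not the contents of core/model_versions.py: the nested dict shape of MODEL_VERSIONS, the slot name, the resolve_version() signature, and the "version" key inside _latest.json are all assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical config shape: slot -> env -> literal "vN" or "latest".
MODEL_VERSIONS = {
    "example_slot": {"dev": "latest", "beta": "v3", "main": "v2"},
}


def resolve_version(slot: str, env: str, volume_root: Path) -> str:
    """Return the active version for a slot in the given env."""
    pin = MODEL_VERSIONS[slot][env]
    if pin != "latest":
        # A literal pin such as "v3" is used as-is.
        return pin
    # "latest" defers to the single per-slot pointer on the shared Volume.
    # The {"version": "vN"} payload is an assumed format.
    pointer = volume_root / slot / "_latest.json"
    return json.loads(pointer.read_text())["version"]
```

With this shape, promoting a version in one env is a one-line config change, while envs pinned to "latest" follow whatever the trainer last wrote to the pointer.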

Also fixes a trainer bug where a partial HuggingFace download (directory exists but no weight files) was treated as "already complete". The skip check now looks for actual .safetensors / .bin files instead of a non-empty directory.

CI preflight manifest jobs (preflight_manifest_beta, preflight_manifest_main) removed — S3 registry no longer drives path resolution.

New files:

core/model_versions.py — MODEL_VERSIONS config + resolve_version() helper

Changed:

core/core.py — ml_model_volume pinned to environment_name="main" for cross-env mount
ml/ml_endpoint.py — diagnose now surfaces _latest.json
ml/mobile/litert.py — manifest resolves via MODEL_VERSIONS instead of "newest LiteRT in registry"
ml/training/model_registry.py — drop _current_{env}.json and legacy fallback; uses MODEL_VERSIONS then _latest.json
ml/training/trainer.py — _has_weight_files() guard replaces naive "listdir is non-empty" skip check
.gitlab-ci.yml — drop preflight_manifest_{beta,main} jobs

`d202c02` — Short-circuit convert-litert endpoint to skip GPU cold start

The POST /ml/mobile/convert-litert endpoint was spawning an A10G container only to hit NotImplementedError in litert.py. Now returns HTTP 501 at the endpoint level before any GPU allocation, avoiding a ~60s cold-start penalty. The modal_convert_litert Modal function remains as scaffold for when litert-torch E-series support stabilizes upstream.
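The short-circuit can be illustrated without any web framework. In this sketch the handler name, the LITERT_CONVERSION_IMPLEMENTED flag, and the spawn_gpu_job callable are hypothetical stand-ins; the point is only that the 501 is returned before anything that would allocate a GPU container is called.

```python
from http import HTTPStatus
from typing import Any, Callable, Dict, Tuple

# Hypothetical flag; flip once the conversion path actually works.
LITERT_CONVERSION_IMPLEMENTED = False


def convert_litert_endpoint(
    payload: Dict[str, Any],
    spawn_gpu_job: Callable[[Dict[str, Any]], Any],
) -> Tuple[int, Dict[str, Any]]:
    """Return (status_code, body) for a convert-litert request."""
    if not LITERT_CONVERSION_IMPLEMENTED:
        # Bail out at the endpoint level: no GPU function is invoked,
        # so no cold start is paid for a request that cannot succeed.
        return int(HTTPStatus.NOT_IMPLEMENTED), {
            "detail": "LiteRT conversion is not implemented yet"
        }
    return int(HTTPStatus.OK), {"result": spawn_gpu_job(payload)}
```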

Changed:

ml/ml_endpoint.py — early 501 return for convert-litert

`d14eecb` — Guard vLLM output indexing against empty results

VllmE2B/VllmE4B/Vllm31B.generate() and the _make_vllm_generate factory accessed outputs[0].outputs[0].text with no bounds check. An empty outputs list (oversized prompt, OOM, tokenizer failure) raised IndexError, surfacing as an opaque 500 through .remote.aio(). Now each path checks both list depths, logs a warning with prompt length and max_tokens, and returns "" so requests complete with a structured response.
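The guard amounts to checking both list depths before indexing. The helper below is an illustrative sketch, not the code in ml/ml_endpoint.py: the function name and log message are assumptions, and it relies only on the outputs[i].outputs[j].text shape quoted above.

```python
import logging
from typing import Any, Sequence

logger = logging.getLogger(__name__)


def first_completion_text(
    outputs: Sequence[Any], prompt: str, max_tokens: int
) -> str:
    """Return outputs[0].outputs[0].text, or "" if either list is empty."""
    if not outputs or not outputs[0].outputs:
        # Log enough context to diagnose oversized prompts or OOM later.
        logger.warning(
            "vLLM returned no completions (prompt_len=%d, max_tokens=%d)",
            len(prompt),
            max_tokens,
        )
        return ""
    return outputs[0].outputs[0].text
```

Returning "" keeps the response structured, at the cost that callers must treat an empty string as a possible failure signal rather than a valid completion.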

Changed:

ml/ml_endpoint.py — bounds checks on vLLM output indexing
