Apr 16, 2026
Commits
3c68e07 — Wire up real vLLM serving for TS_mobile2B (Gemma 4 E2B)
First real vLLM model serving, replacing the stub with live Gemma 4 E2B inference on A10G GPUs. Uses Modal's @modal.cls() + .remote.aio() RPC pattern instead of @modal.web_server (which has 303 redirect issues for POST requests).
Key discoveries during bringup:
- vLLM 0.19.0 pins transformers<5, but Gemma 4 needs >=5.5.0; install transformers after vLLM via run_commands
- T4 GPUs lack enough shared memory for vLLM's Triton attention kernels; A10G is required for E2B
- vLLM's --limit-mm-per-prompt expects JSON, not key=value
- min_containers=0 with a 600s scaledown window for cost control
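The JSON requirement for --limit-mm-per-prompt can be illustrated with a minimal sketch that builds the server argument list. The helper name, the model placeholder, and the modality keys and limits are assumptions for illustration, not values taken from the commit.

```python
import json


def build_vllm_args(model: str, mm_limits: dict[str, int]) -> list[str]:
    """Build a vLLM CLI argument list (hypothetical helper)."""
    return [
        "vllm", "serve", model,
        # The flag takes a JSON object; a key=value string such as
        # "image=0" is what the changelog reports as not accepted.
        "--limit-mm-per-prompt", json.dumps(mm_limits),
    ]


# "MODEL_ID_PLACEHOLDER" and the limits below are illustrative only.
args = build_vllm_args("MODEL_ID_PLACEHOLDER", {"image": 0, "audio": 0})
print(args[-1])
```

Serializing with json.dumps avoids hand-written quoting mistakes when the argument list is later passed to a subprocess.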
Changed:
- core/core.py: aiohttp added to ml Image
- gateway/providers.py: stream_model_slot() routes to real vLLM via .remote.aio()
- ml_endpoint.py: VllmE2B Modal class (A10G), diagnostic tools (diagnose_volume, test_vllm_startup)
- serving/vllm_server.py: real vLLM HTTP calls with stub fallback
2a7f598 — Fix USERS_BASE hostname rewrite to work on prod domains
The old .replace('ml-', 'users-') only matched environments with an ml-<env> prefix and silently no-op'd on prod (ml.grizzlebear.io), causing /login POSTs to hit the ML service and return 404 right after sign-in on /comparison and /dashboard.
Changed:
- comparison.html: fixed USERS_BASE derivation
- dashboard.html: fixed USERS_BASE derivation
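The failure mode is easy to reproduce outside the browser. The sketch below ports the hostname logic to Python purely for illustration: the real fix lives in the pages' JavaScript and may be written differently. The function names, the ml-dev hostname, and the assumption that the prod users host is users.grizzlebear.io are all hypothetical.

```python
import re


def users_host_old(ml_host: str) -> str:
    # Old behaviour: only rewrites hosts containing the literal "ml-".
    return ml_host.replace("ml-", "users-")


def users_host(ml_host: str) -> str:
    # Rewrite a leading "ml" whether it is followed by "-" (ml-<env>)
    # or "." (prod); anything else is left untouched.
    return re.sub(r"^ml(?=[-.])", "users", ml_host, count=1)


print(users_host_old("ml.grizzlebear.io"))  # unchanged: the silent no-op
print(users_host("ml.grizzlebear.io"))
print(users_host("ml-dev.grizzlebear.io"))
```

Anchoring the pattern at the start of the hostname keeps it from rewriting an "ml" that happens to appear later in the domain.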
40ccb75 — Route E4B and 31B slots through vLLM providers
Completed model-slot dispatch for all three Gemma 4 classes (VllmE2B, VllmE4B, Vllm31B). Previously only TS_mobile2B was routed to real vLLM; the others fell through to stubs.
Changed:
- gateway/providers.py: dispatch logic for all three vLLM classes
- serving/vllm_server.py: class definitions for E4B and 31B
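One way to picture the slot dispatch is a lookup table with a stub fallback. This is a sketch only: the changelog confirms that TS_mobile2B maps to VllmE2B, while the other two slot-to-class pairings and the resolve_provider helper are assumptions made for illustration.

```python
# Slot name -> vLLM class name. Only the first pairing is stated in the
# changelog; the other two are assumed from the slot and class names.
SLOT_TO_VLLM_CLASS = {
    "TS_mobile2B": "VllmE2B",
    "TS_mobile4B": "VllmE4B",  # assumed pairing
    "TS_Modal": "Vllm31B",     # assumed pairing
}


def resolve_provider(slot: str) -> str:
    """Return the vLLM class name for a slot, or "stub" if unmapped."""
    return SLOT_TO_VLLM_CLASS.get(slot, "stub")


print(resolve_provider("TS_mobile2B"))
print(resolve_provider("unknown-slot"))
```

Keeping the mapping in one table makes a missing route fall through to the stub visibly, instead of through a forgotten branch.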
b0fd993 — Add iOS LiteRT download pipeline
iOS apps can now fetch Gemma 4 E2B and E4B .litertlm bundles via a manifest endpoint that returns a presigned S3 URL. Base (v0) bundles come from Google's pre-converted HuggingFace repos. Finetuned conversion is scaffolded but stubbed until litert-torch E-series support stabilizes.
Architecture:
- core/data.py gains store_ml_artifact / presigned_ml_artifact_get / head_ml_artifact / delete_ml_artifact; all ML bucket access now routes through core/data.py
- ModelVersion extended with export_format, file_size_bytes, sha256, s3_key
- New ml/mobile/litert.py centralizes all LiteRT flow: HF → Volume → S3
New endpoints:
- POST /ml/mobile/ingest-litert: admin, one-time per slot
- POST /ml/mobile/convert-litert: admin, per finetuned version (stub)
- GET /ml/mobile/manifest/{slot}: iOS client; returns version, size, sha256, presigned URL
Verified on dev: TS_mobile2B v0 (2.58 GB), TS_mobile4B v0 (3.65 GB), TS_Modal correctly returns 404 (cloud-only).
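Because the manifest carries a size and a sha256, a client can check a downloaded bundle before loading it. The sketch below shows that check in Python for illustration only (the real client is an iOS app). The manifest key names are assumed to mirror the ModelVersion fields, and verify_bundle is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_bundle(path: Path, manifest: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a downloaded file against the manifest's size and sha256."""
    # Cheap size check first, so a truncated download fails fast.
    if path.stat().st_size != manifest["file_size_bytes"]:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash in chunks so multi-GB bundles are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == manifest["sha256"]


# Demo with a tiny stand-in file; key names are assumptions.
payload = b"not a real model"
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "demo.litertlm"
    bundle.write_bytes(payload)
    manifest = {
        "file_size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    ok = verify_bundle(bundle, manifest)
    tampered = verify_bundle(bundle, {**manifest, "sha256": "0" * 64})

print(ok, tampered)
```

Checking the size before hashing is a design choice rather than a requirement: it rejects incomplete downloads without reading the whole file.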
New files:
- ml/mobile/litert.py: LiteRT download, upload, conversion pipeline
Changed:
- core/data.py: ML artifact S3 helpers
- ml_endpoint.py: three new /ml/mobile/ routes
- training/model_registry.py: ModelVersion schema extensions
- ML_PIPELINE_SPEC.md: LiteRT pipeline documentation
4ac3982 — docs and iterations/improvements logs
Updated project documentation and added IMPROVE.md for tracking iteration ideas.
Changed:
- docs/architecture.md, docs/services.md, docs/changelog/: doc updates
- IMPROVE.md: new iterations/improvements log