Week 16 — Apr 14-20, 2026
Summary
Major vLLM serving milestone: all three Gemma 4 model slots now serve real inference via Modal .remote.aio() RPC. New iOS LiteRT download pipeline delivers on-device model bundles via presigned S3 URLs. Production hostname bug fixed for auth flow on comparison/dashboard pages.
8 commits (dev + jh) | 21 files changed | +2,737 / -865 lines
Highlights
Real vLLM Serving — All 3 Gemma 4 Slots (Apr 16)
Moved from stubs to live inference for all three Gemma 4 classes. The classes use Modal's @modal.cls() + .remote.aio() pattern rather than @web_server, which has 303 redirect issues. E2B runs on an A10G because the T4 lacks enough shared memory for Triton attention. Containers scale to zero (min_containers=0) with a 600s scaledown window for cost control.
iOS LiteRT Download Pipeline (Apr 16)
New /ml/mobile/ endpoints let iOS apps fetch .litertlm model bundles. Base v0 bundles come from HuggingFace pre-converted repos. Manifest endpoint returns version, file size, sha256, and presigned S3 URL. Verified on dev: E2B (2.58 GB), E4B (3.65 GB).
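As an illustration of how a client might use the manifest fields, here is a minimal Python sketch of checking a downloaded bundle against its manifest. The real client is an iOS app, and the key names `size_bytes` and `sha256` as well as both helper names are assumptions made for this example, not the endpoint's actual schema.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-GB bundles never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path: Path, manifest: dict) -> bool:
    """Return True only if both size and checksum match the manifest.

    `size_bytes` and `sha256` are hypothetical key names.
    """
    if path.stat().st_size != manifest["size_bytes"]:
        return False  # cheap check first: truncated or wrong file
    return sha256_of_file(path) == manifest["sha256"].lower()
```

Checking the size before hashing lets a truncated download fail fast without reading gigabytes from disk.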
Model Comparison UI (Apr 14)
New /comparison page for side-by-side streaming evaluation across all 6 models: Gemini, Claude, OpenAI (teachers) and TS_Modal, TS_mobile4B, TS_mobile2B (Gemma 4 slots). Each model has a version selector. The page uses the SSE (Server-Sent Events) multiplexing in chat.py.
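One way such multiplexing could work is sketched below: several per-model token streams are merged into a single Server-Sent Events stream, with each event tagged by model so the page can route it to the right column. This is an assumption-laden illustration, not the code in chat.py; the function name `multiplex_sse` and the event payload shape (`model`, `token`, `error`) are hypothetical.

```python
import asyncio
import json
from typing import AsyncIterator, Dict


async def multiplex_sse(streams: Dict[str, AsyncIterator[str]]) -> AsyncIterator[str]:
    """Merge per-model token streams into one SSE stream, tagging each event."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel: one upstream has finished

    async def pump(model: str, stream: AsyncIterator[str]) -> None:
        try:
            async for token in stream:
                await queue.put({"model": model, "token": token})
        except Exception as exc:  # surface a failing model without killing the rest
            await queue.put({"model": model, "error": str(exc)})
        finally:
            await queue.put(done)

    tasks = [asyncio.create_task(pump(m, s)) for m, s in streams.items()]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is done:
                remaining -= 1
                continue
            # SSE framing: a "data:" line followed by a blank line
            yield f"data: {json.dumps(item)}\n\n"
    finally:
        for task in tasks:
            task.cancel()  # stop upstreams if the client disconnects early
```

Token order is preserved within each model, while events from different models interleave in arrival order.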
Volume Environment Isolation Fix (Apr 14)
The _current.json version pointer is now namespaced as _current_{env}.json, so dev, jh, and main each track their active model version independently on the shared Volume.
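The namespacing can be sketched as follows. This is a minimal illustration using a local directory in place of the Volume mount; the helper names and the `{"version": ...}` file contents are assumptions, and any Modal Volume commit or reload calls are omitted.

```python
import json
from pathlib import Path
from typing import Optional


def pointer_path(volume_root: Path, env: str) -> Path:
    """Each environment gets its own pointer file: _current_{env}.json."""
    return volume_root / f"_current_{env}.json"


def set_current_version(volume_root: Path, env: str, version: str) -> None:
    """Write the pointer via a temp file so readers never see a partial write."""
    path = pointer_path(volume_root, env)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({"version": version}))  # hypothetical schema
    tmp.replace(path)


def get_current_version(volume_root: Path, env: str) -> Optional[str]:
    """Return the active version for this env, or None if none was set."""
    path = pointer_path(volume_root, env)
    if not path.exists():
        return None
    return json.loads(path.read_text())["version"]
```

With one file per environment, promoting a model on dev cannot change what jh or main serve.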
Production Hostname Fix (Apr 16)
USERS_BASE derivation in comparison.html and dashboard.html used .replace('ml-', 'users-'), which was a silent no-op on prod domains (no -env suffix, so the string 'ml-' never appears) and caused a 404 on login. Fixed to handle both patterns.
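The failure mode and one possible fix are sketched below. The actual fix lives in JavaScript inside the two HTML pages; this Python rendering only illustrates the logic. The hostnames are made up, and the function name `derive_users_base` and the regex are assumptions rather than the shipped code.

```python
import re

# Match "ml" as a standalone hostname segment, whether it is followed by
# "-<env>" (dev-style hosts) or directly by "." (prod-style hosts).
_ML_SEGMENT = re.compile(r"(?<![a-z0-9])ml(?=[-.])")


def derive_users_base(host: str) -> str:
    """Swap the ml segment for users, with or without an env suffix."""
    return _ML_SEGMENT.sub("users", host, count=1)


# Hypothetical hostnames for illustration only.
dev_host = "acme--ml-dev.example.com"
prod_host = "acme--ml.example.com"

# The original approach leaves the prod host unchanged, with no error raised.
naive_prod = prod_host.replace("ml-", "users-")

fixed_dev = derive_users_base(dev_host)
fixed_prod = derive_users_base(prod_host)
```

Because a plain string replace returns its input unchanged when nothing matches, the bug produced a wrong URL rather than an exception, which is why it only showed up as a 404 at login.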
SSE_ONLY Encryption Removal (Apr 14, jh branch)
Removed the SSE_ONLY encryption type. Video recordings are stored as PLAINTEXT until a proper per-account video encryption strategy is designed.
Daily Breakdown
Apr 14 (3 commits)
- 2a7024a Add model comparison demo + fix Volume env isolation
- b9b7b53 Add project documentation - architecture, service catalog, and changelogs
- 9bc71f9 Remove SSE_ONLY encryption type - video encryption strategy TBD (jh branch)
Apr 16 (5 commits)
- 3c68e07 Wire up real vLLM serving for TS_mobile2B (Gemma 4 E2B)
- 2a7f598 Fix USERS_BASE hostname rewrite to work on prod domains
- 40ccb75 Route E4B and 31B slots through vLLM providers
- b0fd993 Add iOS LiteRT download pipeline
- 4ac3982 docs and iterations/improvements logs
New Files
| File | Purpose |
|---|---|
| `dev/ml/comparison.html` | Side-by-side streaming comparison UI for all 6 models |
| `dev/ml/mobile/litert.py` | LiteRT download, upload, conversion pipeline for iOS |
Modified Files
- `dev/core/core.py`: `aiohttp` added to `mlImage`; `transformers>=5.5.0` override
- `dev/core/data.py`: ML artifact S3 helpers (`store_ml_artifact`, `presigned_ml_artifact_get`, etc.), SSE_ONLY removal
- `dev/ml/gateway/chat.py`: Routes Claude, OpenAI, model slots; timing metadata
- `dev/ml/gateway/providers.py`: Provider adapters for Claude/OpenAI/vLLM; all 3 slot dispatch
- `dev/ml/ml_endpoint.py`: `VllmE2B`/`VllmE4B`/`Vllm31B` Modal classes; `/comparison` route; `/ml/mobile/` endpoints; diagnostic tools
- `dev/ml/serving/vllm_server.py`: Real vLLM inference with stub fallback; E4B/31B classes
- `dev/ml/training/model_registry.py`: Env-aware `_current_{env}.json`; `ModelVersion` schema extensions
- `dev/ml/comparison.html`: USERS_BASE hostname fix
- `dev/ml/dashboard.html`: USERS_BASE hostname fix
- `specs/ML_PIPELINE_SPEC.md`: Updated with env isolation docs, LiteRT pipeline