Apr 16, 2026

Commits

`3c68e07` — Wire up real vLLM serving for TS_mobile2B (Gemma 4 E2B)

First real vLLM model serving: the stub is replaced with live Gemma 4 E2B inference on A10G GPUs. Serving uses Modal's @modal.cls() + .remote.aio() RPC pattern instead of @modal.web_server, which has 303 redirect issues on POST requests.

Key discoveries during bringup:

vLLM 0.19.0 pins transformers<5, but Gemma 4 needs >=5.5.0, so transformers is installed after vLLM via run_commands
T4 GPUs lack sufficient shared memory for vLLM's Triton attention kernels, so E2B requires an A10G
vLLM's --limit-mm-per-prompt expects JSON, not key=value
min_containers=0 with a 600s scaledown window keeps idle cost down
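
The --limit-mm-per-prompt point is easy to get wrong, so here is a minimal sketch of building the server arguments with a JSON-encoded limit. The helper name `build_vllm_args`, the `MODEL_ID` placeholder, and the `{"image": 1}` limit are illustrative assumptions, not values taken from the repo.

```python
import json


def build_vllm_args(model_id: str, mm_limits: dict[str, int]) -> list[str]:
    """Build a vLLM server command line (illustrative helper, not from the repo)."""
    return [
        "vllm", "serve", model_id,
        # Pass the per-prompt multimodal limits as a JSON object,
        # e.g. {"image": 1}, rather than the key=value form image=1.
        "--limit-mm-per-prompt", json.dumps(mm_limits),
    ]


# MODEL_ID is a placeholder; substitute the real model path or repo id.
args = build_vllm_args("MODEL_ID", {"image": 1})
```

Encoding with json.dumps rather than hand-writing the string avoids quoting mistakes when the limits dictionary grows.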

Changed:

core/core.py — aiohttp added to mlImage
gateway/providers.py — stream_model_slot() routes to real vLLM via .remote.aio()
ml_endpoint.py — VllmE2B Modal class (A10G), diagnostic tools (diagnose_volume, test_vllm_startup)
serving/vllm_server.py — real vLLM HTTP calls with stub fallback

`2a7f598` — Fix USERS_BASE hostname rewrite to work on prod domains

The old .replace('ml-', 'users-') only matched hostnames with an ml-<env> prefix and silently no-op'd on prod (ml.grizzlebear.io). As a result, /login POSTs hit the ML service and returned 404 right after sign-in on /comparison and /dashboard.
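
The failure mode can be reproduced in a few lines. The actual fix lives in browser JavaScript inside the two HTML files and is not shown here, so the following is only a Python sketch of one possible rewrite. It assumes the users service sits at users.<domain> on prod and users-<env>.<domain> elsewhere; the `users_host` helper and the `ml-dev` hostname are hypothetical.

```python
import re


def users_host(ml_host: str) -> str:
    """Map an ML hostname to its users-service counterpart (sketch)."""
    # Match a leading "ml" label whether it is followed by "-<env>" or by ".".
    return re.sub(r"^ml(?=[-.])", "users", ml_host)


# Old behaviour: the prod hostname contains no "ml-", so replace() changes nothing.
old_prod = "ml.grizzlebear.io".replace("ml-", "users-")

# Sketched behaviour: both the prod and the env-prefixed forms are rewritten.
new_prod = users_host("ml.grizzlebear.io")
new_dev = users_host("ml-dev.grizzlebear.io")  # "ml-dev" is a hypothetical env name
```

Anchoring the pattern at the start of the hostname also avoids rewriting an "ml" that happens to appear later in the domain.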

Changed:

comparison.html — fixed USERS_BASE derivation
dashboard.html — fixed USERS_BASE derivation

`40ccb75` — Route E4B and 31B slots through vLLM providers

Completed model-slot dispatch for all three Gemma 4 classes (VllmE2B, VllmE4B, Vllm31B). Previously only TS_mobile2B was routed to real vLLM; the others fell through to stubs.
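
A table-driven lookup is one way to picture this dispatch. The sketch below is not the code in gateway/providers.py: the `SLOT_TO_CLASS` table and `resolve_provider` helper are hypothetical, and pairing TS_Modal with Vllm31B is an assumption based on that slot being cloud-only.

```python
# Slot name -> Modal class name. The TS_mobile2B and TS_mobile4B pairings follow
# the E2B/E4B naming; TS_Modal -> Vllm31B is an assumption.
SLOT_TO_CLASS = {
    "TS_mobile2B": "VllmE2B",
    "TS_mobile4B": "VllmE4B",
    "TS_Modal": "Vllm31B",
}


def resolve_provider(slot: str) -> str:
    """Return the vLLM class name for a slot, or "stub" if none is registered."""
    return SLOT_TO_CLASS.get(slot, "stub")
```

With a table like this, a slot that is missing from the mapping falls back to the stub explicitly instead of falling through by accident.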

Changed:

gateway/providers.py — dispatch logic for all three vLLM classes
serving/vllm_server.py — class definitions for E4B and 31B

`b0fd993` — Add iOS LiteRT download pipeline

iOS apps can now fetch Gemma 4 E2B and E4B .litertlm bundles via a manifest endpoint that returns a presigned S3 URL. Base (v0) bundles come from Google's pre-converted HuggingFace repos. Finetuned conversion is scaffolded but stubbed until litert-torch E-series support stabilizes.

Architecture:

core/data.py gains store_ml_artifact / presigned_ml_artifact_get / head_ml_artifact / delete_ml_artifact — all ML bucket access now routes through core/data.py
ModelVersion extended with export_format, file_size_bytes, sha256, s3_key
New ml/mobile/litert.py centralizes all LiteRT flow: HF → Volume → S3
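
For orientation, the extended ModelVersion record might look roughly like the dataclass below. Only the four new field names come from this commit; the `slot` and `version` fields, all types, and the defaults are assumptions, and the real schema in training/model_registry.py may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelVersion:
    slot: str      # hypothetical existing field, e.g. "TS_mobile2B"
    version: int   # hypothetical existing field, 0 for the base bundle

    # Fields added in this commit (types and defaults are assumptions).
    export_format: Optional[str] = None    # e.g. "litertlm"
    file_size_bytes: Optional[int] = None  # size of the uploaded bundle
    sha256: Optional[str] = None           # hex digest of the bundle
    s3_key: Optional[str] = None           # object key in the ML bucket


base = ModelVersion(slot="TS_mobile2B", version=0, export_format="litertlm")
```

Making the new fields optional in this sketch lets records created before the extension load without a migration; whether the real schema does the same is not stated in the commit.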

New endpoints:

POST /ml/mobile/ingest-litert — admin, one-time per slot
POST /ml/mobile/convert-litert — admin, per finetuned version (stub)
GET /ml/mobile/manifest/{slot} — iOS client: version, size, sha256, presigned URL
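
Because the manifest carries both a size and a sha256, a client can check a download before loading it. The real client is an iOS app, so the Python below only illustrates the check. The `verify_bundle` helper and the manifest key names (borrowed from the ModelVersion fields) are assumptions about the response shape.

```python
import hashlib


def verify_bundle(data: bytes, manifest: dict) -> bool:
    """Check downloaded bytes against the manifest's size and sha256."""
    # Cheap size comparison first, then the full digest.
    if len(data) != manifest["file_size_bytes"]:
        return False
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]


# Stand-in payload and manifest; a real bundle would be streamed from the presigned URL.
payload = b"example bundle bytes"
manifest = {
    "file_size_bytes": len(payload),
    "sha256": hashlib.sha256(payload).hexdigest(),
}

ok = verify_bundle(payload, manifest)
tampered = verify_bundle(payload + b"!", manifest)
```

For multi-gigabyte bundles, a real client would hash the file in chunks rather than hold it all in memory.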

Verified on dev: TS_mobile2B v0 (2.58 GB) and TS_mobile4B v0 (3.65 GB) are served; TS_Modal correctly returns 404 because it is cloud-only.

New files:

ml/mobile/litert.py — LiteRT download, upload, conversion pipeline

Changed:

core/data.py — ML artifact S3 helpers
ml_endpoint.py — three new /ml/mobile/ routes
training/model_registry.py — ModelVersion schema extensions
ML_PIPELINE_SPEC.md — LiteRT pipeline documentation

`4ac3982` — docs and iterations/improvements logs

Updated project documentation and added IMPROVE.md for tracking iteration ideas.

Changed:

docs/architecture.md, docs/services.md, docs/changelog/ — doc updates
IMPROVE.md — new iterations/improvements log
