
Apr 16, 2026

Commits

3c68e07 — Wire up real vLLM serving for TS_mobile2B (Gemma 4 E2B)

First real vLLM model serving: the stub is replaced with live Gemma 4 E2B inference on A10G GPUs. Serving uses Modal's @modal.cls() + .remote.aio() RPC pattern instead of @modal.web_server, which has 303 redirect issues for POST requests.

Key discoveries during bringup:

  • vLLM 0.19.0 pins transformers<5 but Gemma 4 needs >=5.5.0 — install after vLLM via run_commands
  • T4 GPUs lack shared memory for vLLM's Triton attention kernels — A10G required for E2B
  • vLLM's --limit-mm-per-prompt expects JSON, not key=value
  • min_containers=0 with a 600s scaledown window lets containers scale to zero when idle, for cost control
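
The JSON requirement for --limit-mm-per-prompt can be illustrated with a minimal sketch. This is not the project's launch code: the helper name build_vllm_args, the placeholder model path, and the image limit of 1 are assumptions chosen for illustration.

```python
import json


def build_vllm_args(model: str, mm_limits: dict[str, int]) -> list[str]:
    """Build an argv list for `vllm serve` (hypothetical helper).

    The multimodal limit is serialized as JSON, e.g. {"image": 1},
    rather than the key=value form (image=1).
    """
    return [
        "vllm",
        "serve",
        model,
        "--limit-mm-per-prompt",
        json.dumps(mm_limits),
    ]


# Placeholder model path; substitute the real model location.
args = build_vllm_args("path/to/model", {"image": 1})
print(args[-1])
```

Passing the arguments as a list (for example to subprocess) avoids shell quoting problems with the braces and quotes in the JSON value.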

Changed:

  • core/core.py — aiohttp added to mlImage
  • gateway/providers.py — stream_model_slot() routes to real vLLM via .remote.aio()
  • ml_endpoint.py — VllmE2B Modal class (A10G), diagnostic tools (diagnose_volume, test_vllm_startup)
  • serving/vllm_server.py — real vLLM HTTP calls with stub fallback

2a7f598 — Fix USERS_BASE hostname rewrite to work on prod domains

The old .replace('ml-', 'users-') only matched environments with an ml-<env> prefix and silently no-op'd on prod (ml.grizzlebear.io). As a result, /login POSTs hit the ML service and returned 404 right after sign-in on /comparison and /dashboard.
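
The failure mode can be reproduced in a few lines. The real code is JavaScript inside the two HTML pages; the Python sketch below only mirrors the logic. The function names, the ml-dev.example.com host, and the assumption that the prod users service lives at users.<domain> are all hypothetical, and the regex is one possible fix rather than the one actually committed.

```python
import re


def old_users_base(ml_host: str) -> str:
    # Mirrors the old rewrite: only a literal "ml-" is replaced,
    # so a host starting with "ml." passes through unchanged.
    return ml_host.replace("ml-", "users-", 1)


def users_base(ml_host: str) -> str:
    # Hypothetical fix: rewrite the leading "ml" label whether it is
    # followed by "-" (ml-<env>) or "." (prod).
    return re.sub(r"^ml(?=[-.])", "users", ml_host)


for host in ("ml-dev.example.com", "ml.grizzlebear.io"):
    print(host, old_users_base(host), users_base(host))
```

The lookahead keeps the rewrite anchored to the first hostname label, so a host that merely contains "ml" elsewhere is left alone.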

Changed:

  • comparison.html — fixed USERS_BASE derivation
  • dashboard.html — fixed USERS_BASE derivation

40ccb75 — Route E4B and 31B slots through vLLM providers

Completed model-slot dispatch for all three Gemma 4 classes (VllmE2B, VllmE4B, Vllm31B). Previously only TS_mobile2B was routed to real vLLM; the others fell through to stubs.
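
A table-driven dispatch is one way to picture this routing. The sketch below is illustrative only: the TS_mobile2B to VllmE2B pairing comes from this changelog, while the other two pairings, the resolve_provider name, and the "stub" sentinel are assumptions.

```python
# Slot name -> vLLM class name. Only the first pairing is stated in the
# changelog; the other two are assumed for illustration.
SLOT_TO_VLLM_CLASS = {
    "TS_mobile2B": "VllmE2B",
    "TS_mobile4B": "VllmE4B",  # assumed pairing
    "TS_Modal": "Vllm31B",     # assumed pairing
}

STUB = "stub"  # hypothetical sentinel for the stub fallback


def resolve_provider(slot: str) -> str:
    """Return the vLLM class name for a slot, or the stub sentinel."""
    return SLOT_TO_VLLM_CLASS.get(slot, STUB)


print(resolve_provider("TS_mobile4B"))
print(resolve_provider("unknown-slot"))
```

With a mapping like this, a slot falls through to the stub only when it is missing from the table, which makes gaps such as the earlier E4B and 31B omissions easy to spot.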

Changed:

  • gateway/providers.py — dispatch logic for all three vLLM classes
  • serving/vllm_server.py — class definitions for E4B and 31B

b0fd993 — Add iOS LiteRT download pipeline

iOS apps can now fetch Gemma 4 E2B and E4B .litertlm bundles via a manifest endpoint that returns a presigned S3 URL. Base (v0) bundles come from Google's pre-converted HuggingFace repos. Finetuned conversion is scaffolded but stubbed until litert-torch E-series support stabilizes.

Architecture:

  • core/data.py gains store_ml_artifact / presigned_ml_artifact_get / head_ml_artifact / delete_ml_artifact — all ML bucket access now routes through core/data.py
  • ModelVersion extended with export_format, file_size_bytes, sha256, s3_key
  • New ml/mobile/litert.py centralizes all LiteRT flow: HF → Volume → S3
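
As a rough sketch of how the extended ModelVersion fields could feed a manifest response: the four new field names are taken from the list above, but the version field, the build_manifest helper, the JSON key names, and the presign callable (standing in for presigned_ml_artifact_get, whose real signature is not shown here) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelVersion:
    version: str          # hypothetical field, not listed in the changelog
    export_format: str
    file_size_bytes: int
    sha256: str
    s3_key: str


def build_manifest(mv: ModelVersion, presign: Callable[[str], str]) -> dict:
    """Assemble a manifest payload; `presign` maps an S3 key to a URL."""
    return {
        "version": mv.version,
        "size": mv.file_size_bytes,
        "sha256": mv.sha256,
        "url": presign(mv.s3_key),
    }


# Fake presigner for demonstration; real code would call the S3 helper.
def fake_presign(key: str) -> str:
    return "https://example.invalid/" + key


mv = ModelVersion("v0", "litertlm", 1024, "ab" * 32, "slot/v0/model.litertlm")
manifest = build_manifest(mv, fake_presign)
print(manifest["url"])
```

Injecting the presigner keeps the manifest logic testable without S3 access, which fits the rule that all bucket access goes through core/data.py.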

New endpoints:

  • POST /ml/mobile/ingest-litert — admin, one-time per slot
  • POST /ml/mobile/convert-litert — admin, per finetuned version (stub)
  • GET /ml/mobile/manifest/{slot} — iOS client: version, size, sha256, presigned URL
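
Because the manifest carries both size and sha256, a client can validate a downloaded bundle before loading it. The real client is an iOS app; the Python sketch below only illustrates the check, and the verify_bundle name and its parameters are hypothetical.

```python
import hashlib
import os


def verify_bundle(path: str, expected_size: int, expected_sha256: str,
                  chunk_size: int = 1 << 20) -> bool:
    """Check a downloaded file against the manifest's size and sha256."""
    # Cheap check first: a size mismatch avoids hashing a multi-GB file.
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so the bundle never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

Streaming matters at these sizes, since the base bundles are several gigabytes each.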

Verified on dev: TS_mobile2B v0 (2.58 GB) and TS_mobile4B v0 (3.65 GB); TS_Modal correctly returns 404 (cloud-only).

New files:

  • ml/mobile/litert.py — LiteRT download, upload, conversion pipeline

Changed:

  • core/data.py — ML artifact S3 helpers
  • ml_endpoint.py — three new /ml/mobile/ routes
  • training/model_registry.py — ModelVersion schema extensions
  • ML_PIPELINE_SPEC.md — LiteRT pipeline documentation

4ac3982 — docs and iterations/improvements logs

Updated project documentation and added IMPROVE.md for tracking iteration ideas.

Changed:

  • docs/architecture.md, docs/services.md, docs/changelog/ — doc updates
  • IMPROVE.md — new iterations/improvements log