June 2, 2026
Major ML sprint: laid the full foundation for the three-track investigation system — data contracts, perception (SAM2 + Cosmos 3), and reasoning. 17 code commits.
Commits
f4d4700 — ml(contracts): add v0 data contracts + validator (SH-0a)
Three JSON Schema contracts that keep the three investigation tracks aligned: capture-bundle (Track 1 iOS capture → Track 2 perception), scene-facts (Track 2 perception → Track 3 reasoning), and data-pile / KB (progressive-markdown front-matter validation for the reasoning knowledge base). Includes a validator script that checks all valid/invalid fixtures.
Added:
dev/ml/contracts/ — README, validate.py, requirements.txt
dev/ml/contracts/capture-bundle/ — schema + valid/invalid fixtures
dev/ml/contracts/scene-facts/ — schema + valid/invalid fixtures
dev/ml/contracts/data-pile/ — schema + valid/invalid fixtures
Justfile — validate-contracts recipe
7aaa2f4 — ml(contracts): add shared loader for v0 contracts
Shared loader module for the three contracts. Loads and caches schemas, validates data against them, and parses markdown front-matter for the data-pile contract.
Added:
dev/ml/contracts/__init__.py
dev/ml/contracts/loader.py — schema loading, validation, front-matter parsing (+79)
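The front-matter step can be sketched roughly as follows. This is an illustration only, not the contents of loader.py: the helper name split_front_matter, the `---` fence convention, the flat `key: value` parsing, and the example keys (title, kind) are all assumptions, and the real loader may well use a YAML parser instead.

```python
from typing import Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a markdown document into (front_matter, body).

    Hypothetical helper: assumes front-matter is a block of flat
    `key: value` lines fenced by `---` at the very top of the file.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front-matter block present

    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing fence found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    # Opening fence without a closing fence: treat as malformed.
    raise ValueError("unterminated front-matter block")


# Example document; the keys are invented for illustration.
doc = "---\ntitle: Kitchen\nkind: room\n---\n# Kitchen\nNorth wall has a window."
meta, body = split_front_matter(doc)
```

The parsed dictionary is what would then be handed to the data-pile schema check; documents without a front-matter block come back with an empty dictionary and the text unchanged.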
da50e4f — ml(contracts): scene-facts 0.2.0 — obb optional for 2D-only perception
Relaxes objects[].obb from required to optional and adds optional bbox_2d + primary_keyframe_ref fields, so 2D-only perception (SAM2 on keyframes) validates before Stage-2 depth/pose fusion fills the 3D bounding box.
Changed:
dev/ml/contracts/scene-facts/scene-facts.schema.json — relax obb, add 2D fields (+17/-3)
dev/ml/contracts/README.md — changelog entry (+4)
Added:
dev/ml/contracts/scene-facts/examples/valid/scene_facts.2d_only.example.json (+40)
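To illustrate the effect of the 0.2.0 change, the sketch below checks a 2D-only object against a required-key list before and after the relaxation. Only obb, bbox_2d, and primary_keyframe_ref come from the commit; the id and label fields, the required sets, and the bbox_2d coordinate layout are assumptions made for the example, and the real check is done by JSON Schema validation rather than this hand-rolled function.

```python
from typing import Any, Dict, List

# Assumed required keys per object; only `obb` is known from the commit.
REQUIRED_BEFORE = ["id", "label", "obb"]
REQUIRED_AFTER = ["id", "label"]  # obb relaxed to optional in 0.2.0


def missing_keys(obj: Dict[str, Any], required: List[str]) -> List[str]:
    """Return the required keys that are absent from obj."""
    return [key for key in required if key not in obj]


# A 2D-only object as SAM2 on keyframes might emit it (values invented).
obj_2d_only = {
    "id": "obj-1",
    "label": "window",
    "bbox_2d": [120, 80, 340, 260],
    "primary_keyframe_ref": "kf-0003",
}

print(missing_keys(obj_2d_only, REQUIRED_BEFORE))  # ['obb']
print(missing_keys(obj_2d_only, REQUIRED_AFTER))   # []
```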
b76a6ce — ml(perception): scaffold perception harness skeleton (S0-T2a)
Stage 0 perception skeleton: a run(video_path) harness that loads a model, runs it, assembles a scene_facts document, and validates it against the scene-facts contract. Ships with a no-op model that emits a schema-valid but empty scene_facts document. Includes an acceptance script, a Modal volume storage helper, and scratch notes.
Added:
dev/ml/perception/__init__.py, harness.py, storage.py
dev/ml/perception/models/__init__.py, models/noop.py
dev/ml/perception/scripts/acceptance.py
dev/ml/perception/SCRATCH.md
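The shape of the harness described above might look something like this. It is a sketch under assumptions, not the code in harness.py: the PerceptionModel protocol, the infer method signature, the source_video and objects keys, and the placeholder validator are all hypothetical, and the real harness validates against the scene-facts JSON Schema.

```python
from typing import Any, Dict, List, Optional, Protocol


class PerceptionModel(Protocol):
    """Assumed model interface: video path in, list of objects out."""

    def infer(self, video_path: str) -> List[Dict[str, Any]]: ...


class NoopModel:
    """Stand-in model that detects nothing."""

    def infer(self, video_path: str) -> List[Dict[str, Any]]:
        return []


def validate_scene_facts(doc: Dict[str, Any]) -> None:
    """Placeholder for contract validation (hypothetical key set)."""
    for key in ("source_video", "objects"):
        if key not in doc:
            raise ValueError(f"scene_facts missing required key: {key}")


def run(video_path: str, model: Optional[PerceptionModel] = None) -> Dict[str, Any]:
    """Load a model, run it, assemble scene_facts, and validate the result."""
    model = model or NoopModel()
    objects = model.infer(video_path)
    doc = {"source_video": video_path, "objects": objects}
    validate_scene_facts(doc)
    return doc


facts = run("clip.mp4")  # valid but empty document from the no-op model
```

Keeping the model behind a small interface means a real model can later be swapped in without touching the assemble-and-validate path.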
6a6a733 — ml(perception): integrate SAM2 masks + track IDs (S1-T2a)
Integrates the SAM2 video segmentation model into the perception pipeline. Wraps the SAM2 predictor behind a Modal app (grizzlebear-sam2-jh) that loads the model into GPU memory and exposes an infer method. Produces per-frame segmentation masks with persistent track IDs. Includes an eval script that runs SAM2 on stock video and measures mask quality.
Added:
dev/ml/perception/models/sam2_model.py — SAM2 model wrapper (+100)
dev/ml/perception/sam2_app.py — Modal app for SAM2 inference (+176)
dev/ml/perception/eval_sam2.py — evaluation harness (+73)
768a74d — ml(perception): real-clip ingestion for SAM2 eval (S1-T2a)
Adds a clip upload pipeline: extract keyframes from video, upload to Modal volume, and run SAM2 eval against real property inspection footage instead of stock video. Includes a .gitignore for the stock clip directory and a README explaining the ingestion workflow.
Added:
dev/ml/perception/keyframes.py — keyframe extraction (+50)
dev/ml/perception/scripts/upload_clip.py — upload pipeline (+44)
dev/ml/perception/stock/README.md, stock/.gitignore
Changed:
dev/ml/perception/sam2_app.py — support real-clip input (+70/-28)
1d962ea — ml(perception): Cosmos 3 reasoner captions + scene Q&A eval (S1-T2b)
Adds Cosmos 3 (NVIDIA video-language model) as a second perception source: deploys the Cosmos Reasoner NIM as a Modal app, wraps it in a client that sends video frames and scene-specific questions, and runs an eval that compares Cosmos captions against ground truth. Includes a structured question set for property inspection scenes.
Added:
dev/ml/perception/cosmos/__init__.py, cosmos/client.py, cosmos/questions.json
dev/ml/perception/cosmos_nim_app.py — Cosmos 3 NIM Modal app (+74)
dev/ml/perception/eval_cosmos.py — evaluation harness (+108)
dev/ml/perception/scripts/eval_cosmos.py — CLI runner (+88)
4f74c41 — ml(perception): fix Cosmos NIM secret shape for nvcr pull (S1-T2b)
Fixes the Modal secret configuration so the Cosmos 3 NIM container image can pull from NVIDIA's container registry (nvcr.io).
Changed:
dev/ml/perception/cosmos_nim_app.py — secret shape fix (+19/-15)
f47bd16 — ml(perception): deploy real Cosmos 3 NIM + record live S1-T2b eval
Deploys the real (not mocked) Cosmos 3 NIM and records a live evaluation run with artifacts.
Changed:
dev/ml/perception/cosmos/client.py — production endpoint (+4)
dev/ml/perception/cosmos_nim_app.py — production config (+75/-14)
dev/ml/perception/scripts/eval_cosmos.py — artifact recording (+12/-2)
f9d7510 — ml(perception): add show_report viewer for Cosmos eval artifacts
Adds a show_report CLI command that renders Cosmos eval artifacts (question-answer pairs, scores, latencies) as a formatted report.
Changed:
dev/ml/perception/scripts/eval_cosmos.py — report viewer (+23)
472c13c — ml(perception): multimodal LLM-judge hallucination metric (S1-T2b upgrade)
Adds a hallucination judge: a separate LLM call that cross-checks Cosmos captions against the input frames to detect hallucinated objects or spatial claims. The judge score feeds into the eval metrics alongside the existing accuracy/relevance scores.
Added:
dev/ml/perception/cosmos/judge.py — LLM-judge hallucination metric (+51)
Changed:
dev/ml/perception/eval_cosmos.py — integrate judge into eval pipeline (+21)
dev/ml/perception/scripts/eval_cosmos.py — report judge scores (+41/-1)
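One plausible way to turn judge verdicts into a score is the fraction of caption claims the judge marks as unsupported. The sketch below assumes that definition; the function name hallucination_rate, the per-claim boolean verdict, and the keyword-matching stub_judge (standing in for the multimodal LLM call) are all hypothetical and may differ from what judge.py does.

```python
from typing import Callable, List

# A judge maps one caption claim to True (supported by the frames) or False.
Judge = Callable[[str], bool]


def hallucination_rate(claims: List[str], judge: Judge) -> float:
    """Fraction of claims the judge marks as unsupported."""
    if not claims:
        return 0.0
    unsupported = sum(1 for claim in claims if not judge(claim))
    return unsupported / len(claims)


# Stub judge standing in for the LLM call: a claim counts as supported
# if it mentions any object assumed to be visible in the frames.
VISIBLE_OBJECTS = {"sink", "window"}


def stub_judge(claim: str) -> bool:
    return any(name in claim for name in VISIBLE_OBJECTS)


claims = ["a sink under the window", "a red sofa by the door"]
rate = hallucination_rate(claims, stub_judge)
print(rate)  # 0.5
```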
f8c35e8 — ml(perception): merge SAM2 + Cosmos into populated scene_facts (S1-T2c)
Merges SAM2 segmentation masks (object bounding boxes, track IDs) with Cosmos 3 captions (room labels, surface descriptions) into a fully populated scene_facts document that validates against the scene-facts contract. This is the first time the perception pipeline produces real, structured output.
Added:
dev/ml/perception/merge.py — SAM2 + Cosmos merge logic (+105)
dev/ml/perception/scripts/merge_scene_facts.py — CLI merge runner (+113)
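A merge of this kind could be sketched as a join on track ID, with SAM2 supplying geometry and Cosmos supplying labels. The version below is illustrative only: the function signature, the assumption that Cosmos labels can be keyed by track ID, and the room, label, track_id, and keyframe_ref field names are hypothetical, and merge.py may align the two sources quite differently.

```python
from typing import Any, Dict, List


def merge_scene_facts(
    sam2_tracks: List[Dict[str, Any]],
    cosmos_labels: Dict[int, str],
    room_label: str,
) -> Dict[str, Any]:
    """Join SAM2 geometry with Cosmos labels, keyed on track ID."""
    objects = []
    for track in sam2_tracks:
        track_id = track["track_id"]
        objects.append({
            "track_id": track_id,
            "bbox_2d": track["bbox_2d"],
            "primary_keyframe_ref": track["keyframe_ref"],
            # Tracks Cosmos said nothing about keep a placeholder label.
            "label": cosmos_labels.get(track_id, "unknown"),
        })
    return {"room": room_label, "objects": objects}


# Invented inputs: two SAM2 tracks, one of which Cosmos labeled.
tracks = [
    {"track_id": 1, "bbox_2d": [10, 20, 110, 220], "keyframe_ref": "kf-0001"},
    {"track_id": 2, "bbox_2d": [300, 40, 380, 90], "keyframe_ref": "kf-0004"},
]
scene = merge_scene_facts(tracks, {1: "door"}, room_label="hallway")
```

In the real pipeline the merged document would then be validated against the scene-facts contract before being handed to the reasoning track.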
53da940 — ml(reasoning): scaffold reasoning runner + stub LLM (S0-T3a)
Stage 0 reasoning skeleton: loads a data pile (progressive-markdown KB), validates each doc's front-matter against the data-pile contract, answers a question set via a stub LLM, and emits results. Runs purely locally: no Modal, no network. Includes a hand-authored fixture pile and an acceptance script.
Added:
dev/ml/reasoning/__init__.py, runner.py, stub_llm.py
dev/ml/reasoning/fixtures/ — question set + fixture pile (index, rooms, surfaces)
dev/ml/reasoning/scripts/acceptance.py
fa49511 — ml(reasoning): grounded-vs-video-only A/B eval (S1-T3a)
A/B evaluation comparing grounded reasoning (knowledge-base-augmented) vs video-only (no KB context) answers. Uses the same question set against both conditions and measures answer quality via LLM judge.
Added:
dev/ml/reasoning/ab_eval.py — A/B evaluation harness (+67)
dev/ml/reasoning/fixtures/metric_questions.json — metric question set (+9)
dev/ml/reasoning/scripts/ab_eval.py — CLI A/B runner (+103)
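The A/B comparison can be sketched as running the same questions through one answer function twice, once with KB context and once without, and averaging the judge scores per condition. Everything named here is an assumption for illustration: the ab_eval signature, the 0-to-1 judge score, and the stub answerer and judge that stand in for the LLM calls.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

Answerer = Callable[[str, Optional[str]], str]  # (question, kb_context) -> answer
JudgeFn = Callable[[str, str], float]           # (question, answer) -> score in [0, 1]


def ab_eval(
    questions: List[str],
    kb_context: str,
    answer: Answerer,
    judge: JudgeFn,
) -> Dict[str, float]:
    """Score the same questions with and without KB context."""
    grounded = mean(judge(q, answer(q, kb_context)) for q in questions)
    video_only = mean(judge(q, answer(q, None)) for q in questions)
    return {
        "grounded": grounded,
        "video_only": video_only,
        "delta": grounded - video_only,
    }


# Stubs standing in for the LLM and the LLM judge.
def stub_answer(question: str, kb_context: Optional[str]) -> str:
    return "2.4 m" if kb_context else "unknown"


def stub_judge(question: str, answer: str) -> float:
    return 0.0 if answer == "unknown" else 1.0


report = ab_eval(
    ["How high is the kitchen ceiling?", "How wide is the hallway?"],
    kb_context="ceiling_height: 2.4 m",
    answer=stub_answer,
    judge=stub_judge,
)
```

Holding the question set and the judge fixed across both conditions is what makes the delta attributable to the KB context alone.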
593cab3 — ml(SH-0b): add /health endpoint + jh smoke test
Adds a /health endpoint to the ML service for basic liveness checks. Includes a smoke test script.
Added:
dev/ml/scripts/smoke_health.py — health endpoint smoke test (+62)
Changed:
dev/ml/ml_endpoint.py — /health GET endpoint (+16)
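A liveness smoke test of this kind usually amounts to one GET and a status check. The sketch below uses only the standard library and spins up a throwaway local server so it is self-contained; the `{"status": "ok"}` response body, the check_health helper, and the HealthHandler class are assumptions, and the real endpoint in ml_endpoint.py is served by whatever framework the ML service uses.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Throwaway stand-in for the ML service (assumed response shape)."""

    def do_GET(self) -> None:
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args) -> None:
        pass  # keep the smoke test output quiet


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers 200 with status ok."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return resp.status == 200 and json.load(resp).get("status") == "ok"


server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
healthy = check_health(f"http://127.0.0.1:{server.server_port}")
server.shutdown()
server.server_close()
```

Against a deployed service, only check_health would be needed, pointed at the service's base URL.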
98e753f — docs(ml): add development plan (task/test pairs) to jh
Development plan document mapping every ML task to its acceptance test.
Added:
dev/ml/development-plan-task-test-pairs.md (+112)