June 2, 2026

Major ML sprint: laid the full foundation for the three-track investigation system — data contracts, perception (SAM2 + Cosmos 3), and reasoning. 17 code commits.

Commits

`f4d4700` — ml(contracts): add v0 data contracts + validator (SH-0a)

Three JSON Schema contracts keep the three investigation tracks aligned: capture-bundle (Track 1 iOS capture → Track 2 perception), scene-facts (Track 2 perception → Track 3 reasoning), and data-pile / KB (progressive-markdown front-matter validation for the reasoning knowledge base). Includes a validator script that runs every fixture against its schema, expecting valid fixtures to pass and invalid fixtures to be rejected.

Added:

dev/ml/contracts/ — README, validate.py, requirements.txt
dev/ml/contracts/capture-bundle/ — schema + valid/invalid fixtures
dev/ml/contracts/scene-facts/ — schema + valid/invalid fixtures
dev/ml/contracts/data-pile/ — schema + valid/invalid fixtures
Justfile — validate-contracts recipe
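
The fixture check can be sketched roughly as below. This is a minimal sketch, not the contents of validate.py: the `examples/valid` and `examples/invalid` layout is inferred from the one fixture path recorded in this log, and `check_fixtures` and the `validate` callable (which returns a list of error strings) are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable, List


def check_fixtures(contract_dir: Path, validate: Callable[[dict], List[str]]) -> List[str]:
    """Return one message per fixture that behaves unexpectedly.

    Valid fixtures must produce no errors; invalid fixtures must produce at least one.
    """
    failures = []
    for expect_ok, sub in ((True, "valid"), (False, "invalid")):
        # Assumed layout: <contract_dir>/examples/{valid,invalid}/*.json
        for path in sorted((contract_dir / "examples" / sub).glob("*.json")):
            errors = validate(json.loads(path.read_text(encoding="utf-8")))
            if expect_ok and errors:
                failures.append(f"{path.name}: expected valid, got {errors}")
            elif not expect_ok and not errors:
                failures.append(f"{path.name}: expected invalid, but it passed")
    return failures
```

An empty return value means every fixture behaved as expected, which is the condition a `validate-contracts` recipe would presumably gate on.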

`7aaa2f4` — ml(contracts): add shared loader for v0 contracts

Shared loader module for the three contracts. Loads and caches schemas, validates data against them, and parses markdown front-matter for the data-pile contract.

Added:

dev/ml/contracts/__init__.py
dev/ml/contracts/loader.py — schema loading, validation, front-matter parsing (+79)
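
A rough sketch of the two loader responsibilities, schema caching and front-matter parsing, is shown below. It is not the code in loader.py: `load_schema` and `parse_front_matter` are hypothetical names, and the front-matter parser only handles flat `key: value` lines between `---` fences rather than full YAML.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Dict, Tuple


@lru_cache(maxsize=None)
def load_schema(path: str) -> dict:
    """Read a schema file once; later calls with the same path reuse the parsed dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def parse_front_matter(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split a markdown document into (front-matter dict, body).

    Simplification: only flat `key: value` pairs are supported, all values stay strings.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    # Find the closing fence; without one, treat the whole text as body.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, markdown
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])
```

Because `lru_cache` hands back the same dict object on every call, callers in this sketch should treat a loaded schema as read-only.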

`da50e4f` — ml(contracts): scene-facts 0.2.0 — obb optional for 2D-only perception

Relaxes objects[].obb from required to optional and adds optional bbox_2d + primary_keyframe_ref fields, so 2D-only perception (SAM2 on keyframes) validates before Stage-2 depth/pose fusion fills the 3D bounding box.

Changed:

dev/ml/contracts/scene-facts/scene-facts.schema.json — relax obb, add 2D fields (+17/-3)
dev/ml/contracts/README.md — changelog entry (+4)

Added:

dev/ml/contracts/scene-facts/examples/valid/scene_facts.2d_only.example.json (+40)
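
The effect of the relaxation can be illustrated with a small sketch. Only `obb`, `bbox_2d`, and `primary_keyframe_ref` come from this entry; the `id` and `label` fields, the four-number shape of `bbox_2d`, and the `missing_required` helper are assumptions made for illustration, not the real schema.

```python
# Hypothetical fragment of the objects[] item schema after 0.2.0.
OBJECT_ITEM_V020 = {
    "type": "object",
    "required": ["id", "label"],  # assumed field names; "obb" is no longer listed
    "properties": {
        "obb": {"type": "object"},
        "bbox_2d": {"type": "array", "items": {"type": "number"}, "minItems": 4, "maxItems": 4},
        "primary_keyframe_ref": {"type": "string"},
    },
}


def missing_required(obj: dict, required: list) -> list:
    """Return the required keys that are absent from obj."""
    return [key for key in required if key not in obj]


# A 2D-only object as SAM2 on keyframes might produce it, before any 3D fusion.
two_d_only = {
    "id": "obj-1",
    "label": "radiator",
    "bbox_2d": [12, 40, 96, 180],
    "primary_keyframe_ref": "kf-0003",
}

# Passes the relaxed rule, fails a rule that still requires obb.
relaxed = missing_required(two_d_only, OBJECT_ITEM_V020["required"])
strict = missing_required(two_d_only, OBJECT_ITEM_V020["required"] + ["obb"])
```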

`b76a6ce` — ml(perception): scaffold perception harness skeleton (S0-T2a)

Stage 0 perception skeleton: a run(video_path) harness that loads a model, runs it, assembles a scene_facts document, and validates it against the scene-facts contract. Ships with a no-op model that emits a schema-valid but empty scene_facts document. Includes an acceptance script, a Modal volume storage helper, and scratch notes.

Added:

dev/ml/perception/__init__.py, harness.py, storage.py
dev/ml/perception/models/__init__.py, models/noop.py
dev/ml/perception/scripts/acceptance.py
dev/ml/perception/SCRATCH.md
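
The load, run, assemble, validate flow might look like the sketch below. Only `run(video_path)` and the no-op model are named in this entry; the `infer` signature, the document fields, and the stand-in `validate_scene_facts` check are assumptions, and the real harness validates against the scene-facts schema instead.

```python
from typing import List, Optional, Protocol


class PerceptionModel(Protocol):
    def infer(self, video_path: str) -> List[dict]:
        """Return a list of detected objects for the clip."""
        ...


class NoopModel:
    """Detects nothing; exists so the pipeline can be exercised end to end."""

    def infer(self, video_path: str) -> List[dict]:
        return []


def validate_scene_facts(doc: dict) -> None:
    # Stand-in for contract validation; checks only the shape used in this sketch.
    if not isinstance(doc.get("objects"), list):
        raise ValueError("scene_facts.objects must be a list")


def run(video_path: str, model: Optional[PerceptionModel] = None) -> dict:
    model = model or NoopModel()            # load
    objects = model.infer(video_path)       # run
    doc = {                                 # assemble (field names are assumed)
        "schema_version": "0.2.0",
        "source_video": video_path,
        "objects": objects,
    }
    validate_scene_facts(doc)               # validate before returning
    return doc
```

Swapping in a real model then only requires an object with a compatible `infer` method.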

`6a6a733` — ml(perception): integrate SAM2 masks + track IDs (S1-T2a)

Integrates the SAM2 video segmentation model into the perception pipeline. Wraps the SAM2 predictor behind a Modal app (grizzlebear-sam2-jh) that loads the model into GPU memory and exposes an infer method. Produces per-frame segmentation masks with persistent track IDs. Includes an eval script that runs SAM2 on stock video and measures mask quality.

Added:

dev/ml/perception/models/sam2_model.py — SAM2 model wrapper (+100)
dev/ml/perception/sam2_app.py — Modal app for SAM2 inference (+176)
dev/ml/perception/eval_sam2.py — evaluation harness (+73)

`768a74d` — ml(perception): real-clip ingestion for SAM2 eval (S1-T2a)

Adds a clip upload pipeline: extract keyframes from video, upload to Modal volume, and run SAM2 eval against real property inspection footage instead of stock video. Includes a .gitignore for the stock clip directory and a README explaining the ingestion workflow.

Added:

dev/ml/perception/keyframes.py — keyframe extraction (+50)
dev/ml/perception/scripts/upload_clip.py — upload pipeline (+44)
dev/ml/perception/stock/README.md, stock/.gitignore

Changed:

dev/ml/perception/sam2_app.py — support real-clip input (+70/-28)
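
One simple way to choose keyframes is fixed-interval sampling, sketched below. The log does not say which strategy keyframes.py uses, so the interval approach, the `keyframe_indices` name, and its parameters are all assumptions; decoding the frames themselves is left out.

```python
from typing import List


def keyframe_indices(total_frames: int, fps: float, every_seconds: float = 1.0) -> List[int]:
    """Pick one frame index per `every_seconds` of video, starting at frame 0."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    step = max(1, round(fps * every_seconds))  # never sample less often than every frame
    return list(range(0, total_frames, step))
```

For a 10 second clip at 30 fps sampled every 2 seconds, this selects frames 0, 60, 120, 180, and 240.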

`1d962ea` — ml(perception): Cosmos 3 reasoner captions + scene Q&A eval (S1-T2b)

Adds Cosmos 3 (NVIDIA video-language model) as a second perception source: deploys the Cosmos Reasoner NIM as a Modal app, wraps it in a client that sends video frames and scene-specific questions, and runs an eval that compares Cosmos captions against ground truth. Includes a structured question set for property inspection scenes.

Added:

dev/ml/perception/cosmos/__init__.py, cosmos/client.py, cosmos/questions.json
dev/ml/perception/cosmos_nim_app.py — Cosmos 3 NIM Modal app (+74)
dev/ml/perception/eval_cosmos.py — evaluation harness (+108)
dev/ml/perception/scripts/eval_cosmos.py — CLI runner (+88)

`4f74c41` — ml(perception): fix Cosmos NIM secret shape for nvcr pull (S1-T2b)

Fixes the Modal secret configuration so the Cosmos 3 NIM container image can pull from NVIDIA's container registry (nvcr.io).

Changed:

dev/ml/perception/cosmos_nim_app.py — secret shape fix (+19/-15)

`f47bd16` — ml(perception): deploy real Cosmos 3 NIM + record live S1-T2b eval

Deploys the real (not mocked) Cosmos 3 NIM and records a live evaluation run with artifacts.

Changed:

dev/ml/perception/cosmos/client.py — production endpoint (+4)
dev/ml/perception/cosmos_nim_app.py — production config (+75/-14)
dev/ml/perception/scripts/eval_cosmos.py — artifact recording (+12/-2)

`f9d7510` — ml(perception): add show_report viewer for Cosmos eval artifacts

Adds a show_report CLI command that renders Cosmos eval artifacts (question-answer pairs, scores, latencies) as a formatted report.

Changed:

dev/ml/perception/scripts/eval_cosmos.py — report viewer (+23)

`472c13c` — ml(perception): multimodal LLM-judge hallucination metric (S1-T2b upgrade)

Adds a hallucination judge: a separate LLM call that cross-checks Cosmos captions against the input frames to detect hallucinated objects or spatial claims. The judge score feeds into the eval metrics alongside the existing accuracy/relevance scores.

Added:

dev/ml/perception/cosmos/judge.py — LLM-judge hallucination metric (+51)

Changed:

dev/ml/perception/eval_cosmos.py — integrate judge into eval pipeline (+21)
dev/ml/perception/scripts/eval_cosmos.py — report judge scores (+41/-1)
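
As a rough illustration of how judge verdicts could be reduced to a single score, the sketch below computes the share of caption claims the judge does not support. The scoring formula, the idea of pre-extracted claims, and the `hallucination_rate` and `is_supported` names are assumptions; the judge in judge.py is an LLM call, which is abstracted here as a plain callable.

```python
from typing import Callable, Sequence


def hallucination_rate(claims: Sequence[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of claims the judge marks as unsupported by the input frames."""
    if not claims:
        return 0.0  # nothing was claimed, so nothing was hallucinated
    unsupported = sum(1 for claim in claims if not is_supported(claim))
    return unsupported / len(claims)


# Usage with a stub judge that "sees" a fixed set of objects.
visible = {"sofa", "window", "rug"}
rate = hallucination_rate(["sofa", "window", "piano", "rug"], lambda c: c in visible)
```

Here one of four claims ("piano") is unsupported, so `rate` is 0.25.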

`f8c35e8` — ml(perception): merge SAM2 + Cosmos into populated scene_facts (S1-T2c)

Merges SAM2 segmentation masks (object bounding boxes, track IDs) with Cosmos 3 captions (room labels, surface descriptions) into a fully populated scene_facts document that validates against the scene-facts contract. This is the first time the perception pipeline produces real, structured output.

Added:

dev/ml/perception/merge.py — SAM2 + Cosmos merge logic (+105)
dev/ml/perception/scripts/merge_scene_facts.py — CLI merge runner (+113)
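
A minimal sketch of the merge idea follows: derive a 2D box from each tracked mask and attach the caption-derived labels. The input shapes, the `[x_min, y_min, x_max, y_max]` box convention, and every name other than `bbox_2d` and `primary_keyframe_ref` are assumptions; merge.py may differ on all of them.

```python
from typing import Dict, List, Optional


def bbox_from_mask(mask: List[List[int]]) -> Optional[List[int]]:
    """Tight [x_min, y_min, x_max, y_max] around nonzero pixels, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not xs:
        return None
    return [min(xs), min(ys), max(xs), max(ys)]


def merge_scene_facts(tracks: Dict[str, dict], captions: dict) -> dict:
    """tracks: {track_id: {"keyframe": str, "mask": 2D list}}
    captions: {"room_label": str, "labels": {track_id: str}}
    """
    objects = []
    for track_id, track in sorted(tracks.items()):
        bbox = bbox_from_mask(track["mask"])
        if bbox is None:
            continue  # skip tracks whose mask is empty on the chosen keyframe
        objects.append({
            "track_id": track_id,
            "label": captions.get("labels", {}).get(track_id, "unknown"),
            "bbox_2d": bbox,
            "primary_keyframe_ref": track["keyframe"],
        })
    return {"room_label": captions.get("room_label"), "objects": objects}
```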

`53da940` — ml(reasoning): scaffold reasoning runner + stub LLM (S0-T3a)

Stage 0 reasoning skeleton: loads a data pile (progressive-markdown KB), validates each doc's front-matter against the data-pile contract, answers a question set via a stub LLM, and emits results. Pure/local — no Modal, no network. Includes hand-authored fixture pile and acceptance script.

Added:

dev/ml/reasoning/__init__.py, runner.py, stub_llm.py
dev/ml/reasoning/fixtures/ — question set + fixture pile (index, rooms, surfaces)
dev/ml/reasoning/scripts/acceptance.py
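
The runner loop might be sketched like this. It is an illustration only: the required front-matter keys, the in-memory pile shape, and the function names are hypothetical, and the real runner validates against the data-pile schema rather than a key list.

```python
from typing import Callable, List

REQUIRED_FRONT_MATTER = ("id", "kind")  # hypothetical required keys


def validate_pile(pile: List[dict]) -> None:
    """Raise if any document is missing a required front-matter key."""
    for doc in pile:
        missing = [k for k in REQUIRED_FRONT_MATTER if k not in doc["front_matter"]]
        if missing:
            raise ValueError(f"front-matter missing {missing}")


def stub_llm(question: str, context: List[dict]) -> str:
    """Deterministic placeholder answer; no model and no network involved."""
    return f"[stub] {question} (context docs: {len(context)})"


def run_questions(pile: List[dict], questions: List[str],
                  llm: Callable[[str, List[dict]], str] = stub_llm) -> List[dict]:
    validate_pile(pile)  # reject a malformed pile before asking anything
    return [{"question": q, "answer": llm(q, pile)} for q in questions]
```

Keeping the LLM behind a plain callable is what lets a stub stand in for a real model without changing the runner.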

`fa49511` — ml(reasoning): grounded-vs-video-only A/B eval (S1-T3a)

A/B evaluation comparing grounded reasoning (knowledge-base-augmented) vs video-only (no KB context) answers. Uses the same question set against both conditions and measures answer quality via LLM judge.

Added:

dev/ml/reasoning/ab_eval.py — A/B evaluation harness (+67)
dev/ml/reasoning/fixtures/metric_questions.json — metric question set (+9)
dev/ml/reasoning/scripts/ab_eval.py — CLI A/B runner (+103)
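
The A/B structure, same questions under both conditions with a judge scoring each answer, can be sketched as below. The `answer` and `judge` callables, the 0 to 1 score range, and the mean aggregation are assumptions; in ab_eval.py the judge is an LLM call.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional


def ab_eval(questions: List[str],
            answer: Callable[[str, Optional[list]], str],
            judge: Callable[[str, str], float],
            kb: list) -> Dict[str, float]:
    """Mean judge score per condition; grounded answers see the KB, video-only do not."""
    if not questions:
        return {"grounded": 0.0, "video_only": 0.0}
    scores = {"grounded": [], "video_only": []}
    for q in questions:
        scores["grounded"].append(judge(q, answer(q, kb)))      # condition A: with KB
        scores["video_only"].append(judge(q, answer(q, None)))  # condition B: no KB
    return {condition: mean(values) for condition, values in scores.items()}
```

Running both conditions inside the same loop keeps the question set identical across arms, so any score gap comes from the KB context rather than from the questions.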

`593cab3` — ml(SH-0b): add /health endpoint + jh smoke test

Adds a /health endpoint to the ML service for basic liveness checks. Includes a smoke test script.

Added:

dev/ml/scripts/smoke_health.py — health endpoint smoke test (+62)

Changed:

dev/ml/ml_endpoint.py — /health GET endpoint (+16)
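
For illustration, a liveness endpoint and its smoke test can be sketched with the standard library alone. The log does not say which framework ml_endpoint.py uses or what the response body looks like, so the `http.server` handler, the `{"status": "ok"}` payload, and the `smoke_health` name are all assumptions.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep smoke-test output quiet


def smoke_health(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers 200 with status "ok"."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return resp.status == 200 and json.load(resp).get("status") == "ok"
```

A smoke script would call `smoke_health` with the deployed base URL and exit nonzero when it returns False or raises.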

`98e753f` — docs(ml): add development plan (task/test pairs) to jh

Development plan document mapping every ML task to its acceptance test.

Added:

dev/ml/development-plan-task-test-pairs.md (+112)
