Week 23 — Jun 2 – Jun 8, 2026

Summary

Major sprint on the ML investigation system: three new subsystems (contracts, perception, reasoning) were built from scratch to form a three-track architecture for property inspection understanding. Data contracts (v0) define the interfaces between tracks using JSON Schema: capture-bundle (iOS → perception), scene-facts (perception → reasoning), and data-pile (the progressive-markdown KB consumed by reasoning). The perception pipeline integrates SAM2 video segmentation (object masks + track IDs) with the Cosmos 3 video-language model (room/surface captions + Q&A) and merges their outputs into a validated scene_facts document, the first structured perception output. A multimodal LLM-judge hallucination metric cross-checks Cosmos captions against the input frames. The reasoning module scaffolds a grounded Q&A runner over progressive-markdown knowledge bases, with an A/B eval comparing grounded and video-only answers. A /health endpoint was added to the ML service.

17 code commits | 50+ new files | ~+3,200 lines

Highlights

ML Data Contracts (v0)

Three JSON Schema contracts that keep the investigation tracks aligned:

Contract	Flow	Schema
capture-bundle	Track 1 (iOS capture) → Track 2 (perception)	`capture-bundle.schema.json`
scene-facts	Track 2 (perception) → Track 3 (reasoning)	`scene-facts.schema.json`
data-pile / KB	Tracks 1+2 → Track 3 (reasoning)	`kb-frontmatter.schema.json`

All schemas use JSON Schema (draft 2020-12), and spatial fields are expressed in the ARKit world frame (right-handed, y-up, meters). A shared loader module handles schema loading, validation, and markdown front-matter parsing. The scene-facts contract was updated to v0.2.0 to make obb optional for 2D-only perception.
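As a rough illustration of what the shared loader does, the sketch below splits markdown front-matter from a document body and checks one scene-facts object entry where obb is optional. It is a hand-rolled subset using only the standard library, not the real loader: the field names (track_id, label), the function names, and the simple `key: value` front-matter format are all assumptions. Full draft 2020-12 validation would more likely go through the `jsonschema` package.

```python
# Hypothetical, heavily trimmed stand-in for the scene-facts contract (v0.2.0).
# Only "obb is optional" comes from the notes; the field names are assumed.
REQUIRED_OBJECT_FIELDS = {"track_id": int, "label": str}
OPTIONAL_OBJECT_FIELDS = {"obb": dict}


def split_front_matter(text):
    """Split a markdown doc into (front-matter dict, body).

    Assumes a '---' delimited block of simple 'key: value' lines at the top;
    values are kept as strings. A real loader would likely use a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    raise ValueError("front-matter block is not closed")


def validate_object(obj):
    """Return a list of error strings for one scene-facts object entry."""
    errors = []
    for name, typ in REQUIRED_OBJECT_FIELDS.items():
        if name not in obj:
            errors.append(f"missing required field: {name}")
        elif not isinstance(obj[name], typ):
            errors.append(f"{name} must be {typ.__name__}")
    # Optional fields are only type-checked when present (2D-only output omits obb).
    for name, typ in OPTIONAL_OBJECT_FIELDS.items():
        if name in obj and not isinstance(obj[name], typ):
            errors.append(f"{name} must be {typ.__name__}")
    return errors
```

Returning a list of errors rather than raising on the first one makes it easier to report every contract violation in a document at once.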

Perception Pipeline (SAM2 + Cosmos 3)

Two complementary perception models deployed as Modal apps:

SAM2 (grizzlebear-sam2-jh): video segmentation producing per-frame masks with persistent track IDs. Supports real-clip ingestion (keyframe extraction → volume upload → inference).
Cosmos 3 (grizzlebear-cosmos-jh): NVIDIA video-language model (Cosmos Reasoner NIM). Produces room labels, surface descriptions, and scene Q&A answers.

The merge step (perception/merge.py) combines SAM2 bounding boxes with Cosmos captions into a fully populated scene_facts document. A hallucination judge cross-checks Cosmos outputs against input frames via a separate LLM call.
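The merge and judge steps described above might look roughly like the sketch below. It is not the contents of perception/merge.py or cosmos/judge.py: the input shapes, the output keys, and both function names are assumptions chosen for illustration, and the judge is passed in as a plain callable so the example runs without any model call.

```python
def merge_scene_facts(sam2_tracks, cosmos_output, schema_version="0.2.0"):
    """Combine per-track boxes with scene-level captions into one document.

    Assumed input shapes (not taken from the real contracts):
      sam2_tracks:   [{"track_id": int, "bbox": [x, y, w, h]}, ...]
      cosmos_output: {"room": str, "captions": {track_id: str}}
    """
    captions = cosmos_output.get("captions", {})
    objects = []
    for track in sorted(sam2_tracks, key=lambda t: t["track_id"]):
        objects.append({
            "track_id": track["track_id"],
            "bbox": list(track["bbox"]),
            # Tracks without a caption keep a None label rather than a guess.
            "label": captions.get(track["track_id"]),
        })
    return {
        "schema_version": schema_version,
        "room": cosmos_output.get("room"),
        "objects": objects,
    }


def hallucination_rate(claims, judge):
    """Fraction of claims the judge marks as unsupported by the frames.

    judge is any callable claim -> bool (True means supported). In the real
    pipeline this would be a multimodal LLM call; here it is a stub.
    """
    if not claims:
        return 0.0
    unsupported = sum(1 for claim in claims if not judge(claim))
    return unsupported / len(claims)
```

For example, `hallucination_rate(["a sink", "a piano"], lambda c: c == "a sink")` treats one of two claims as unsupported.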

Reasoning Module

Stage 0 reasoning runner: loads a data pile (progressive-markdown KB), validates each doc's front-matter against the data-pile contract, answers questions via a stub LLM. Pure/local — no Modal, no network. Stage 1 adds an A/B eval comparing grounded (KB-augmented) vs video-only reasoning.
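A minimal sketch of the grounded vs video-only comparison, assuming a deterministic keyword-matching stub in place of the real LLM. The names `stub_llm` and `ab_eval` and the result keys are hypothetical, and the video-only arm is approximated here by simply withholding the KB text, since no video input is modeled.

```python
def _keywords(text):
    """Lower-cased words longer than 3 characters, punctuation stripped."""
    cleaned = (w.strip("?.,").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 3}


def stub_llm(question, context=""):
    """Deterministic stand-in for an LLM.

    Returns the first context line sharing a keyword with the question,
    or 'unknown' when nothing matches (including when context is empty).
    """
    wanted = _keywords(question)
    for line in context.splitlines():
        if wanted & _keywords(line):
            return line.strip()
    return "unknown"


def ab_eval(questions, kb_text, llm=stub_llm):
    """Answer each question with and without KB context and record both."""
    results = []
    for question in questions:
        grounded = llm(question, kb_text)
        video_only = llm(question, "")  # assumption: no KB text in this arm
        results.append({
            "question": question,
            "grounded": grounded,
            "video_only": video_only,
            "differs": grounded != video_only,
        })
    return results
```

Keeping the LLM as an injectable callable means the same harness can run fully offline with the stub and later accept a real model without changing the eval loop.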

ML Health Endpoint

GET /health on the ML service returns liveness status. Includes a smoke test script at dev/ml/scripts/smoke_health.py.
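A liveness smoke check along these lines could be sketched as follows. This is not the contents of smoke_health.py: the function names, the injectable `http_get` hook, and the assumption that the endpoint returns HTTP 200 with a JSON body are all illustrative, since the notes do not state the response shape.

```python
import json
import urllib.request


def _http_get(url, timeout):
    """Fetch a URL and return (status code, decoded body)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def check_health(base_url, timeout=5.0, http_get=_http_get):
    """GET <base_url>/health and return (ok, payload).

    ok is True only for a 200 response whose body parses as JSON.
    Network failures and non-JSON bodies are reported, not raised.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        status, body = http_get(url, timeout)
        payload = json.loads(body)
    except (OSError, ValueError) as exc:
        # urllib's URLError/HTTPError derive from OSError;
        # json.JSONDecodeError derives from ValueError.
        return False, {"error": str(exc)}
    return status == 200, payload
```

Passing a fake `http_get` lets the check itself be unit-tested without a running service.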

Daily Breakdown

Jun 2 (17 code commits)

f4d4700 ml(contracts): add v0 data contracts + validator (+988)
7aaa2f4 ml(contracts): add shared loader for v0 contracts (+86)
da50e4f ml(contracts): scene-facts 0.2.0 — obb optional for 2D-only perception (+58/-3)
b76a6ce ml(perception): scaffold perception harness skeleton (+319/-3)
6a6a733 ml(perception): integrate SAM2 masks + track IDs (+399)
768a74d ml(perception): real-clip ingestion for SAM2 eval (+200/-28)
1d962ea ml(perception): Cosmos 3 reasoner captions + scene Q&A eval (+457)
4f74c41 ml(perception): fix Cosmos NIM secret shape for nvcr pull (+19/-15)
f47bd16 ml(perception): deploy real Cosmos 3 NIM + record live eval (+75/-14)
f9d7510 ml(perception): add show_report viewer for Cosmos eval artifacts (+23)
472c13c ml(perception): multimodal LLM-judge hallucination metric (+112/-1)
f8c35e8 ml(perception): merge SAM2 + Cosmos into populated scene_facts (+224)
53da940 ml(reasoning): scaffold reasoning runner + stub LLM (+308)
fa49511 ml(reasoning): grounded-vs-video-only A/B eval (+185)
593cab3 ml(SH-0b): add /health endpoint + jh smoke test (+78)
98e753f docs(ml): add development plan (+112)

Modified Files (key changes)

ML Contracts

dev/ml/contracts/ — new: 3 JSON Schema contracts (capture-bundle, scene-facts, data-pile) + shared loader + validator

ML Perception

dev/ml/perception/harness.py — new: Stage 0 perception harness skeleton
dev/ml/perception/models/sam2_model.py — new: SAM2 video segmentation wrapper
dev/ml/perception/sam2_app.py — new: Modal app for SAM2 inference
dev/ml/perception/cosmos/client.py — new: Cosmos 3 NIM client
dev/ml/perception/cosmos/judge.py — new: LLM-judge hallucination metric
dev/ml/perception/cosmos_nim_app.py — new: Cosmos 3 NIM Modal app
dev/ml/perception/eval_cosmos.py — new: Cosmos evaluation harness
dev/ml/perception/merge.py — new: SAM2 + Cosmos → scene_facts merge
dev/ml/perception/keyframes.py — new: keyframe extraction from video
dev/ml/perception/storage.py — new: Modal volume storage helpers

ML Reasoning

dev/ml/reasoning/runner.py — new: data-pile loader + question answering
dev/ml/reasoning/stub_llm.py — new: stub LLM for Stage 0
dev/ml/reasoning/ab_eval.py — new: grounded vs video-only A/B eval

ML Service

dev/ml/ml_endpoint.py — /health endpoint
dev/ml/scripts/smoke_health.py — new: health endpoint smoke test