Sapiens2 + Cosmos 3 — Summary & Comparison for a Video-Streaming Vision App

Prepared June 1, 2026. Companion to the Cosmos 3 study plan.

The target app: stream video into an endpoint, then surface reasoning / world understanding / object detection / measurement on the processed video. This doc summarizes Sapiens2, then compares it head-to-head with Cosmos 3 against those four capabilities, and proposes an architecture.

Sapiens2 — Summary

Meta AI released Sapiens2 on April 27, 2026 — the second generation of its human-centric vision foundation model. The one-sentence version: it's the best open model for precise, dense, pixel-level understanding of people, and it does nothing for non-human objects.

What it does (five tasks, one frozen backbone + lightweight heads):

Pose estimation — 308-keypoint full-body skeleton, including a dense 243-keypoint face and 40-keypoint hands. This is far beyond typical 17-keypoint COCO pose.
Body-part segmentation — 29 semantic classes (added eyeglasses), sharp boundaries.
Pointmap estimation — per-pixel 3D point in the camera frame (XYZ). This is the "measuring" primitive — but note it's focal-length-normalized + a learned scale scalar, not absolute metric (see "Pointmap limitations" below). Better than relative depth (it's a consistent 3D shape), but not centimeters out of the box.
Surface normal estimation — per-pixel unit normals; 5B hits 6.73° mean angular error (4K variant median just 3.08°).
Albedo estimation — true diffuse color/skin tone independent of lighting.

Scale & training:

Sizes: 0.4B, 0.8B, 1B, 5B params, native 1K resolution; hierarchical windowed-attention variants extend to 4K. The 5B is the highest-FLOPs ViT reported (15.7 TFLOPs).
Trained on Humans-1B (1B curated human images, filtered from ~4B), pretrained with a joint MAE + DINOv3-style contrastive objective so it keeps both fine texture fidelity (for albedo/normals) and high-level semantics.
Benchmarks: 82.3 mAP pose, 82.5 mIoU segmentation; the frozen 5B backbone beats even DINOv3-7B (a larger general-purpose model) across every human task.

Nature of the model — important for app design:

It's an image / per-frame model (a ViT backbone with dense-prediction heads). There is no temporal/video module — for video you run it frame-by-frame (optionally with your own tracking/smoothing).
It's a specialist: humans only. It is not a VLM, not a reasoner, not an open-vocabulary object detector. No text in, no language out.
Open weights (facebook/sapiens2 on HF), code at facebookresearch/sapiens2, paper. The 0.4B is small enough for real-time per-frame inference on a single mid-range GPU; the 5B/4K variants are batch/offline-grade.

Pointmap limitations (important for "measuring")

Two distinct constraints often get conflated — both matter if you're relying on Sapiens2 for measurement:

It's humans-only by training, not by runtime masking. The pointmap head emits a dense per-pixel 3D point over the whole image, but it was supervised entirely on synthetic human assets (the paper defines the pointmap/normal/albedo losses over human-form pixels). There is no segmentation gate inside the model. So on background and non-human objects the output is out-of-distribution and unreliable — the model still emits numbers there, they're just not meaningful. In practice you use the body-part segmentation head as a mask and keep the pointmap only where it says "human." That masking is your job downstream, not the model's.
It's scale-normalized, not absolute metric. Because metric scale is ambiguous without known camera intrinsics, Sapiens2 predicts a focal-length-normalized pointmap P̃(u) plus a learned scalar s (P̂ = s·P̃), and is even evaluated "in focal-length normalized canonical coordinates." So you get metrically consistent 3D shape of the person — correct proportions, limb ratios, surface geometry, self-consistent depth — but not guaranteed centimeters unless you supply known intrinsics or a reference scale.

Consequence: Sapiens2 is excellent for relative/structural human measurement (proportions, pose-derived dimensions, 3D surface, normals), masked to human pixels. It is not a general scene depth estimator and not an absolute-metric tape measure. For room/object measurement — or absolute units even on a person — add a dedicated metric-depth model (Depth-Anything-Metric, stereo/RGB-D) and/or camera calibration.

How the two models actually differ

They're almost orthogonal — different jobs, not competitors.

Dimension	Cosmos 3	Sapiens2
Origin / date	NVIDIA, Jun 1 2026	Meta, Apr 27 2026
Core identity	World foundation model: reasoning VLM + physics-aware video/action generator	Human-centric dense-prediction backbone
Input → output	text/image/video/audio/action → text (reasoning), video, action	image (per frame) → pose / segmentation / pointmap / normals / albedo
Subject scope	General scenes (robots, vehicles, warehouses, indoor spaces, objects, people)	People only
Reasoning / language	✅ chain-of-thought, captioning, Q&A about a scene	❌ none
World understanding	✅ qualitative — causality, motion, "what's happening / what next"	⚠️ only the human geometry within a frame
Object detection	⚠️ can describe/localize loosely via VLM, but not a precise detector	❌ not a detector (segments human parts only)
Measurement	⚠️ qualitative spatial reasoning ("how far is X from Y"), not metric	✅ 3D pointmap + normals — humans only, and scale-normalized (not absolute metric)
Temporal modeling	✅ native video understanding & generation	❌ per-frame; bring your own tracking
Sizes	8B (Nano) / 32B (Super)	0.4B / 0.8B / 1B / 5B (+4K)
Real-time latency	Reasoner is heavy → keyframes/clips, not every frame	0.4B → genuine per-frame real-time
Deployment	NIM container (reasoner today) or HF Diffusers	HF weights + task heads (your own serving)

Mapping to your four app capabilities

Reasoning → Cosmos 3 reasoner. This is its home turf; Sapiens2 can't do it at all.
World understanding → Cosmos 3 for the scene-level narrative; Sapiens2 for precise human-body understanding inside the scene. Complementary.
Object detection → neither is the right tool. Cosmos can loosely describe objects; Sapiens2 only segments human parts. You need a dedicated detector/tracker (Grounding DINO / YOLO / SAM 2 for open-vocab detection + masks + tracking).
Measuring → Sapiens2 for people (the pointmap is the standout feature — but it's scale-normalized human geometry, masked to human pixels, not absolute metric; see "Pointmap limitations"). For arbitrary objects, Sapiens2 doesn't apply and Cosmos is only qualitative — add a general metric-depth model (e.g., Depth-Anything-Metric or stereo/RGB-D), and supply camera intrinsics/a reference if you need absolute units anywhere.

The key takeaway: no single model covers all four. Cosmos 3 and Sapiens2 cover different quadrants well, and "object detection" + "general-object measurement" fall between them — so a production app is an ensemble, not a single endpoint.

Usefulness vs. usability for the streaming app

Usefulness (capability fit)

Cosmos 3 is the most useful for the "reasoning / world understanding" half: it's the only one that can watch a clip and tell you what's happening and why. It also future-proofs you toward action/event prediction. But it's a generalist brain, weak on pixel precision and metric numbers.
Sapiens2 is the most useful for the "measurement / human analytics" half: nothing open matches its pose/geometry precision, and the pointmap is close to what "measuring" wants — as long as the thing being measured is a person, you mask to human pixels, and you treat the output as scale-normalized geometry rather than absolute units.
Object detection is a genuine gap in both; treat it as a third component from day one.

Usability (effort to ship)

Cosmos 3 — higher ceiling, higher cost. The reasoner NIM is a one-command Docker container with an OpenAI-compatible API (very usable), but it's an 8B–32B model: heavy GPU, real latency. You won't run it on every frame — you sample keyframes or short clips. Generator NIM isn't out yet.
Sapiens2 — lower friction for inference, more glue required. The 0.4B runs per-frame in real time on modest hardware, but Meta ships weights + task heads, not a packaged serving container — you build the endpoint, batching, and tracking yourself. Outputs are dense tensors (heatmaps, masks, pointmaps) you must post-process into overlays.
Latency reality for streaming: run the cheap dense models (Sapiens2-0.4B, detector/tracker) on the per-frame hot path; run Cosmos reasoning asynchronously on sampled keyframes/short clips and overlay its narrative a beat behind the live geometry. Don't put a 32B VLM in the frame loop.

Recommended architecture

            ┌──────────────────────────────────────────────────────────┐
  video ───▶│  Ingest (WebRTC / RTSP) → decode                           │
  stream    └──────────────────────────────────────────────────────────┘
                          │ every frame
            ┌──────────────────────────────────────────────────────────┐
            │  FRAME GOVERNOR (temporal-redundancy gate)                 │
            │  • perceptual hash + optical-flow / scene-change detection │
            │  • tags each frame: {keyframe?, motion score, ROI changed} │
            │  NOTE: this is YOUR gate. Cosmos's EVS is internal to the  │
            │  reasoner (token pruning) and is NOT reusable here.        │
            └───┬───────────────────┬───────────────────────┬───────────┘
       every    │          keyframes │             keyframes/│clips
       frame    │          only      │             cadence   │
            ┌───▼───────────────┐ ┌──▼────────────────┐ ┌────▼──────────────┐
            │ TRACK PATH (RT)   │ │ KEYFRAME PATH      │ │ SLOW PATH (async) │
            │                   │ │                    │ │                   │
            │ • SAM 2 mask      │ │ • Detector re-seed │ │ • Cosmos 3        │
            │   propagation     │ │   (G-DINO/YOLO)    │ │   Reasoner NIM    │
            │   via streaming   │ │   → boxes for SAM2 │ │   → reasoning,    │
            │   memory          │ │ • Metric-depth     │ │     captions,     │
            │   (wants temporal │ │   (per-frame,      │ │     events        │
            │    continuity —   │ │    stateless)      │ │                   │
            │    don't starve   │ │ • Sapiens2-0.4B    │ │ (sparser cadence; │
            │    it of frames)  │ │   pose/parts/      │ │  lags ~1 beat)    │
            │                   │ │   pointmap (mask   │ │                   │
            │                   │ │   via seg head;    │ │                   │
            │                   │ │   scale-norm.)     │ │                   │
            └─────────┬─────────┘ └─────────┬──────────┘ └────────┬──────────┘
                      │ masks/IDs           │ boxes,depth,geom    │ narrative
            ┌─────────▼─────────────────────▼─────────────────────▼─────────┐
            │  Fusion + overlay service → annotated video + JSON results     │
            │  (boxes, masks, skeletons, measurements, scene reasoning)      │
            └────────────────────────────────────────────────────────────────┘

Why this shape:

Frame governor owns temporal-redundancy reduction (see the dedicated section below) and decides which frames trigger expensive work.
SAM 2 runs on the track path and propagates masks via its own streaming memory — fed continuous frames, re-seeded on keyframes.
Detector re-seeds object boxes on keyframes (and feeds SAM2's prompts); owns the object-detection gap neither foundation model fills.
Sapiens2-0.4B owns precise human pose/geometry on the keyframe path. Mask its pointmap with the segmentation head and treat it as scale-normalized; convert to absolute units only with known intrinsics or a reference.
Metric-depth model covers measurement of non-human objects (Sapiens2's blind spot) — per-frame and stateless, so it benefits most directly from keyframe gating.
Cosmos 3 reasoner owns the language/reasoning layer asynchronously, so a heavy VLM never blocks live rendering.
All converge in a fusion layer that produces both the annotated video and a structured JSON stream the app can render.

Frame governor: temporal-redundancy reduction (and why EVS doesn't help here)

The instinct to reuse Cosmos's video-frame optimization for your other models is reasonable, but the mechanism doesn't transfer. Worth being precise:

What Cosmos's EVS actually is. "Efficient Video Sampling" (arXiv 2510.14624) prunes temporally static patches — sub-frame spatial regions unchanged across consecutive frames — at the token level, inside the reasoner, before the transformer attends over them. It preserves positional identity, needs no retraining, and cuts time-to-first-token up to ~4×. Crucially, its output is a shorter sequence of video tokens consumed by Cosmos's own transformer — not RGB frames, and not even whole frames. There is no tap to pull frames out and route them to SAM2 or a depth model. Treat EVS as a private latency knob for the reasoner; it is not a frame source.

The idea is reusable; the implementation is per-model. Build your own frame governor (perceptual hash + optical-flow/scene-change detection) that tags each frame. Then each consumer treats it differently:

Consumer	State?	What it wants from the governor
Metric-depth model	Stateless, per-frame	Keyframe gating helps directly — skip near-duplicate frames, run on keyframes, hold/interpolate/smooth between. Do not apply EVS-style patch pruning: depth needs the whole frame for geometric context.
SAM 2	Stateful (streaming memory)	Opposite instinct — feed it continuous frames. Higher frame rates give better tracking stability; aggressive frame-dropping causes drift and lost tracks. Run expensive detection/re-seeding on keyframes only and let SAM2's memory propagate masks on intervening frames. SAM2 is the temporal-redundancy optimization for segmentation.
Sapiens2	Stateless, per-frame	Keyframe gating helps; add your own light tracking/smoothing across frames since Sapiens2 has no temporal module.
Cosmos 3 reasoner	—	Runs on a sparse keyframe/clip cadence anyway; let its internal EVS handle intra-clip token redundancy.

One-line rule: a VLM tolerates patch-level token pruning that pixel models (depth, SAM2) don't — so don't reuse EVS; gate at the frame level upstream, and let stateful SAM2 keep its continuity.

Hosting: Cosmos reasoner as a NIM on Modal/ECS/EC2 (no Kubernetes — see the Cosmos study plan); Sapiens2 + detector + depth as your own GPU services (Modal functions or a Triton/vLLM-style server). Start with Sapiens2-0.4B + a YOLO/SAM2 detector for the real-time path and the Cosmos Nano reasoner on keyframes; scale model sizes up once the pipeline works.

Bottom line

Cosmos 3 = the reasoning and world-understanding brain (general scenes, language, events). Heavier, async, keyframe-driven.
Sapiens2 = the precision instrument for people (pose, parts, 3D pointmap, surface detail). Light, real-time, per-frame — but humans only, masked via segmentation, and scale-normalized rather than absolute metric.
Object detection and general-object measurement belong to neither — budget for a dedicated detector/tracker and a metric-depth model.
Ship them as an ensemble with a fast per-frame hot path and an async reasoning path, not as one model behind one endpoint.

Sources

Sapiens2 launch writeup: https://www.marktechpost.com/2026/04/27/meta-ai-releases-sapiens2-a-high-resolution-human-centric-vision-model-for-pose-segmentation-normals-pointmap-and-albedo/
Sapiens2 paper: https://arxiv.org/abs/2604.21681
Sapiens2 weights: https://huggingface.co/collections/facebook/sapiens2 · Repo: https://github.com/facebookresearch/sapiens2
Original Sapiens (ECCV 2024): https://arxiv.org/abs/2408.12569
Cosmos 3 technical blog: https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/
Cosmos 3 HF launch: https://huggingface.co/blog/nvidia/cosmos-3-for-physical-ai
Efficient Video Sampling (EVS) paper: https://arxiv.org/abs/2510.14624
SAM2 temporal sampling / streaming memory (TSMS-SAM2): https://arxiv.org/abs/2508.05829