
TL;DR
SAM 3.1 finally hits the latency budget for realtime video. Here is how to wire Meta's new segmentation model into a production pipeline without melting your GPU.
Every previous version of Segment Anything was a research toy in the same shape: drop in an image, get back a mask, marvel at the quality, then walk away because it could not keep up with a 30 fps camera feed. The first SAM took roughly 600 ms per image on an A100. SAM 2 brought streaming video tracking but still cost 90+ ms per frame on consumer hardware. SAM 3.1, announced by Meta this week, is the first version that fits inside the 33 ms budget you actually have if you want to run alongside a webcam, a Zoom feed, or a live stream.
For broader context, pair this with "Claude Computer Use: AI That Controls Your Desktop" and "GPT-5.4 for Developers: The Production Guide"; those companion pieces show where this fits in the wider AI developer workflow.
That single change unlocks a category of products that has been blocked for two years. Realtime background replacement that does not look like 2018 Snapchat. Sports analytics that label every player and the ball without a green screen. Drone footage with persistent object IDs. Surgery assistance that tracks instruments across occlusions. The model is the same family of promptable masks, but the engineering work to integrate it is genuinely different now, and most teams will get it wrong on the first pass.
This post is the version of the docs I wish existed: what 3.1 actually changes, the minimum viable code to run it on a video stream, the gotchas that will eat your weekend, and how to stitch it into a real product instead of a demo.
The headline number from the Meta announcement is a 4x speedup over SAM 2 at the same mask quality, with a smaller distilled variant (sam3.1-tiny) that runs at over 60 fps on a single L4. There are three concrete improvements worth pulling out of the marketing copy.
First, the memory module that tracks objects across frames is now causal and incremental. SAM 2 reprocessed a sliding window of frames every step. SAM 3.1 keeps a compressed memory bank and updates it in a single forward pass per frame. That is the change responsible for most of the speedup.
Second, the prompt encoder accepts text. You can say "segment the red car" and get a mask without clicking. Quality is below CLIP-segment style models on noisy footage but good enough for constrained product surfaces.
Third, the model exports cleanly to ONNX and CoreML out of the box. Meta is shipping the conversion scripts in the repo, which is a real shift from previous releases where the community had to figure it out.
What it does not ship: a hosted API. You run this yourself. That is fine, and arguably better, because the latency wins disappear the moment you add a network round trip.
Here is what a real integration looks like. Install the SDK, load the tiny variant, and stream frames through it.
import cv2
import torch
from sam3 import SAM3VideoPredictor

predictor = SAM3VideoPredictor.from_pretrained(
    "facebook/sam3.1-tiny",
    device="cuda",
    dtype=torch.float16,
)

cap = cv2.VideoCapture(0)
state = predictor.init_state()

# Prompt once on the first frame: click the object you want to track.
ret, frame = cap.read()
predictor.add_point_prompt(state, frame, point=(640, 360), label=1, obj_id=1)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    masks = predictor.track(state, frame)  # dict[obj_id -> mask]
    overlay = predictor.visualize(frame, masks)
    cv2.imshow("sam3.1", overlay)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
That is it. Twenty lines. On an RTX 4090 this runs at roughly 90 fps. On an M3 Max via the CoreML export it runs at 35 fps, which is the threshold I care about for anything user-facing.
The track call is the hot path. The two failure modes you will hit are obvious in hindsight. If you push frames faster than the model can consume them you will silently drop frames in OpenCV's buffer, so always read with a queue and timestamp. If your prompt object leaves the frame and comes back the memory bank degrades, so expose a re-prompt affordance in the UI rather than assuming the model can recover forever.
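Here is a minimal sketch of the capture side, assuming a plain OpenCV webcam source: a reader thread timestamps each frame and keeps only the newest one in a bounded queue, so the inference loop never falls behind the camera. The queue depth and helper names are my own, not part of the sam3 SDK.

import queue
import threading
import time

import cv2

frame_q = queue.Queue(maxsize=1)  # hold only the freshest frame

def reader(cap):
    # Timestamp at capture time so downstream audio/video sync has a real clock.
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        try:
            frame_q.get_nowait()  # drop the stale frame if the consumer is slow
        except queue.Empty:
            pass
        frame_q.put((time.monotonic(), frame))

cap = cv2.VideoCapture(0)
threading.Thread(target=reader, args=(cap,), daemon=True).start()

while True:
    ts, frame = frame_q.get()
    # masks = predictor.track(state, frame)
    # key all downstream buffering on ts, not on wall clock at display time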
The SAM 3.1 weights are 380 MB for the tiny variant and 1.4 GB for the base. Cold start on a Lambda-style serverless runtime is not viable. You want a long-running worker, ideally with a GPU pinned. If your product is bursty, a Modal or RunPod backend with autoscaling and a 60-second idle timeout is the cheapest sane option I have found.
Mixed precision is required: fp32 inference is roughly 2.4x slower with no quality benefit. Use torch.float16 on most NVIDIA GPUs, torch.bfloat16 on Hopper, and the default fp16 baked into the CoreML export on Apple Silicon. The numbers in the model card are all fp16 numbers.
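A small sketch of that dtype selection, under the assumption that CUDA compute capability 9.x is a reasonable proxy for Hopper; the heuristic is mine, not something from the model card.

import torch

if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    # bf16 on Hopper (compute capability 9.x), fp16 on earlier NVIDIA generations
    dtype = torch.bfloat16 if major >= 9 else torch.float16
else:
    dtype = torch.float16  # the CoreML export on Apple Silicon already ships fp16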
The text prompt path is tempting but slower than the point prompt path on the first frame because it routes through a separate text encoder. If you can capture a single click, do that instead. Reserve text prompts for batch jobs.
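If you do take the text path for a batch job, it slots into the same state flow as the point prompt. The method name below is an assumption mirroring add_point_prompt; check the sam3 SDK for the actual signature.

state = predictor.init_state()
ret, frame = cap.read()
# Hypothetical text-prompt call; the real sam3 API may differ.
predictor.add_text_prompt(state, frame, text="the red car", obj_id=1)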
Audio sync is your problem. The model only handles video. If you are building a streaming product, every frame you process is a frame your audio pipeline has been waiting on. Buffer audio against frame timestamps, not wall clock, or you will ship something with 200 ms of lipsync drift.
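One way to keep audio honest, sketched with placeholder names (audio_buffer and emit_audio stand in for whatever your streaming stack provides): release audio chunks only up to the timestamp of the frame you just finished segmenting.

def flush_audio_up_to(frame_ts, audio_buffer, emit_audio):
    """Emit every buffered (timestamp, chunk) pair at or before the timestamp
    of the last segmented frame, so audio never runs ahead of video."""
    while audio_buffer and audio_buffer[0][0] <= frame_ts:
        _, chunk = audio_buffer.pop(0)
        emit_audio(chunk)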
Vision models like SAM tend to live in one of two places in a product. Either as a one-shot preprocessing step that turns a video into structured data (timestamped object tracks, bounding boxes, masks) that an LLM agent then reasons about, or as an inline filter inside a realtime UI loop. SAM 3.1 is the first version where the second pattern is actually tractable.
For the preprocessing pattern, you do not need realtime. Run the base model offline, write masks and tracks to a JSON sidecar, and feed that to your downstream agent. This is the workflow we use to chop long-form video into shareable segments inside Clips, our DD product for turning podcast and YouTube footage into vertical clips. The agent reads the track data, picks a focal subject, reframes the crop, and exports. SAM 3.1's speedup means the offline pass takes minutes instead of hours on a typical hour-long source.
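A compressed sketch of that offline pass, assuming track() returns numpy masks, that a single point prompt on the first frame is enough to seed the subject, and that the base-variant repo id mirrors the tiny one (it is an assumption). The sidecar schema is illustrative, not a fixed format.

import json

import cv2
import torch
from sam3 import SAM3VideoPredictor

predictor = SAM3VideoPredictor.from_pretrained(
    "facebook/sam3.1-base", device="cuda", dtype=torch.float16
)
cap = cv2.VideoCapture("episode_42.mp4")
state = predictor.init_state()

# Seed the subject once on the first frame, exactly as in the live demo.
ret, first = cap.read()
predictor.add_point_prompt(state, first, point=(960, 540), label=1, obj_id=1)

tracks, frame_idx = [], 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    masks = predictor.track(state, frame)
    for obj_id, mask in masks.items():
        x, y, w, h = cv2.boundingRect(mask.astype("uint8"))
        tracks.append({"frame": frame_idx, "obj_id": obj_id, "bbox": [x, y, w, h]})
    frame_idx += 1

with open("episode_42.tracks.json", "w") as f:
    json.dump(tracks, f)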
For the realtime pattern, the question is what your agent does with the masks. The interesting answer is usually some form of selective generation: segment the speaker, regenerate only the background, composite. That is a content pipeline, and it is exactly the surface Content is built around — automated B-roll generation, background swaps, and visual consistency checks across long video projects.
If you want a deeper architectural walkthrough of how these vision steps slot into a multi-agent video pipeline, I covered the full system on the Developers Digest YouTube channel.
The non-obvious part of shipping a SAM 3.1-backed feature is not the model. It is the queue, the worker, the cache, and the failure path. Here is the shape that has worked.
A frontend pushes frames or video URLs into a job queue. A worker pool of GPU instances pulls jobs, runs SAM 3.1, and writes mask outputs to object storage as a packed video file (RLE-encoded masks, one per object, codec'd as h264 alpha) plus a manifest JSON. The frontend polls or subscribes for completion. The agent that consumes the masks reads the manifest, never the raw masks, because masks are huge and the manifest is enough for most decisions.
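Here is the kind of manifest I mean, with field names that are illustrative rather than a fixed schema: enough per-object summary for an agent to pick a focal subject without ever touching the packed masks.

import json

manifest = {
    "job_id": "9f3c2a",
    "source": "s3://uploads/episode_42.mp4",
    "mask_video": "s3://masks/9f3c2a_alpha.mp4",  # RLE masks packed as h264 alpha
    "fps": 30,
    "objects": [
        {"obj_id": 1, "label": "speaker", "first_frame": 0, "last_frame": 5391},
        {"obj_id": 2, "label": "slide deck", "first_frame": 120, "last_frame": 5391},
    ],
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)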
Cache aggressively at the input hash level. SAM is deterministic given a fixed prompt and frame, so identical inputs should never run twice. We see roughly 40% cache hits on real workloads because users re-process the same source video with different prompts, and the prompt-conditional cache key catches that.
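The cache key is just a hash of the source plus a canonical serialization of the prompt, along these lines (the exact prompt fields depend on your job schema):

import hashlib
import json

def cache_key(video_bytes, prompt):
    # Same video + same prompt -> same key; a different prompt misses on purpose.
    h = hashlib.sha256(video_bytes)
    h.update(json.dumps(prompt, sort_keys=True).encode())
    return h.hexdigest()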
Re-prompts are a UX problem, not a model problem. Build the affordance for users to correct a track mid-stream early. A model that is right 95% of the time still produces visibly broken output 5% of the time, and there is no amount of tuning that fixes the long tail. The right answer is letting the user click once to recover.
Three things to keep an eye on over the next two months. First, whether the open-source community ports SAM 3.1 to WebGPU. The base model is small enough that browser inference is plausible, and that would collapse the operational story for a lot of indie products. Second, whether Meta releases a finetuning recipe for domain-specific data. The current weights are general-purpose and predictably weak on medical imagery, satellite footage, and underwater video. Third, whether the text-prompt quality improves enough to fully replace point prompts in production. That would unblock a lot of zero-touch automation.
For now, the right move is to take an existing video product, find the place where you said "we cannot do this realtime," and try it. The latency wall is gone. What you build on top of that is the interesting part.