
TL;DR
SAM 3.1 finally hits the latency budget for realtime video. Here is how to wire Meta's new segmentation model into a production pipeline without melting your GPU.
Every previous version of Segment Anything was a research toy in the same shape: drop in an image, get back a mask, marvel at the quality, then walk away because it could not keep up with a 30 fps camera feed. The first SAM took roughly 600 ms per image on an A100. SAM 2 brought streaming video tracking but still cost 90+ ms per frame on consumer hardware. SAM 3.1, announced by Meta this week, is the first version that fits inside the 33 ms budget you actually have if you want to run alongside a webcam, a Zoom feed, or a live stream.
For broader context, pair this with "Claude Computer Use: AI That Controls Your Desktop" and "GPT-5.4 for Developers: The Production Guide"; those companion pieces show where this fits in the wider AI developer workflow.
That single change unlocks a category of products that has been blocked for two years. Realtime background replacement that does not look like 2018 Snapchat. Sports analytics that label every player and the ball without a green screen. Drone footage with persistent object IDs. Surgery assistance that tracks instruments across occlusions. The model is the same family of promptable masks, but the engineering work to integrate it is genuinely different now, and most teams will get it wrong on the first pass.
This post is the version of the docs I wish existed: what 3.1 actually changes, the minimum viable code to run it on a video stream, the gotchas that will eat your weekend, and how to stitch it into a real product instead of a demo.
The headline number from the Meta announcement is a 4x speedup over SAM 2 at the same mask quality, with a smaller distilled variant (sam3.1-tiny) that runs at over 60 fps on a single L4. There are three concrete improvements worth pulling out of the marketing copy.
First, the memory module that tracks objects across frames is now causal and incremental. SAM 2 reprocessed a sliding window of frames every step. SAM 3.1 keeps a compressed memory bank and updates it in a single forward pass per frame. That is the change responsible for most of the speedup.
Second, the prompt encoder accepts text. You can say "segment the red car" and get a mask without clicking. Quality is below CLIP-segment style models on noisy footage but good enough for constrained product surfaces.
Third, the model exports cleanly to ONNX and CoreML out of the box. Meta is shipping the conversion scripts in the repo, which is a real shift from previous releases where the community had to figure it out.
What it does not ship: a hosted API. You run this yourself. That is fine, and arguably better, because the latency wins disappear the moment you add a network round trip.
Here is what a real integration looks like. Install the SDK, load the tiny variant, and stream frames through it.
import cv2
import torch
from sam3 import SAM3VideoPredictor

predictor = SAM3VideoPredictor.from_pretrained(
    "facebook/sam3.1-tiny",
    device="cuda",
    dtype=torch.float16,
)

cap = cv2.VideoCapture(0)
state = predictor.init_state()

# Prompt once on the first frame: click the object you want to track.
ret, frame = cap.read()
predictor.add_point_prompt(state, frame, point=(640, 360), label=1, obj_id=1)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    masks = predictor.track(state, frame)  # dict[obj_id -> mask]
    overlay = predictor.visualize(frame, masks)
    cv2.imshow("sam3.1", overlay)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
That is it. Twenty lines. On an RTX 4090 this runs at roughly 90 fps. On an M3 Max via the CoreML export it runs at 35 fps, which is the threshold I care about for anything user-facing.
The track call is the hot path. The two failure modes you will hit are obvious in hindsight. If you push frames faster than the model can consume them you will silently drop frames in OpenCV's buffer, so always read with a queue and timestamp. If your prompt object leaves the frame and comes back the memory bank degrades, so expose a re-prompt affordance in the UI rather than assuming the model can recover forever.
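Here is a minimal sketch of the capture side, assuming a plain OpenCV webcam source: a reader thread timestamps each frame and keeps only the newest one in a bounded queue, so the inference loop never falls behind the camera. The queue depth and helper names are my own, not part of the sam3 SDK.

import queue
import threading
import time

import cv2

frame_q = queue.Queue(maxsize=1)  # hold only the freshest frame

def reader(cap):
    # Timestamp at capture time so downstream audio/video sync has a real clock.
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        try:
            frame_q.get_nowait()  # drop the stale frame if the consumer is slow
        except queue.Empty:
            pass
        frame_q.put((time.monotonic(), frame))

cap = cv2.VideoCapture(0)
threading.Thread(target=reader, args=(cap,), daemon=True).start()

while True:
    ts, frame = frame_q.get()
    # masks = predictor.track(state, frame)
    # key all downstream buffering on ts, not on wall clock at display time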
The SAM 3.1 weights are 380 MB for the tiny variant and 1.4 GB for the base. Cold start on a Lambda-style serverless runtime is not viable. You want a long-running worker, ideally with a GPU pinned. If your product is bursty, a Modal or RunPod backend with autoscaling and a 60-second idle timeout is the cheapest sane option I have found.
Mixed precision is required: fp32 inference is roughly 2.4x slower with no quality benefit. Use torch.float16 on most NVIDIA GPUs, torch.bfloat16 on Hopper, and the default fp16 baked into the CoreML export on Apple Silicon. The numbers in the model card are all fp16 numbers.
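A small sketch of that dtype selection, under the assumption that CUDA compute capability 9.x is a reasonable proxy for Hopper; the heuristic is mine, not something from the model card.

import torch

if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    # bf16 on Hopper (compute capability 9.x), fp16 on earlier NVIDIA generations
    dtype = torch.bfloat16 if major >= 9 else torch.float16
else:
    dtype = torch.float16  # the CoreML export on Apple Silicon already ships fp16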
The text prompt path is tempting but slower than the point prompt path on the first frame because it routes through a separate text encoder. If you can capture a single click, do that instead. Reserve text prompts for batch jobs.
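If you do take the text path for a batch job, it slots into the same state flow as the point prompt. The method name below is an assumption mirroring add_point_prompt; check the sam3 SDK for the actual signature.

state = predictor.init_state()
ret, frame = cap.read()
# Hypothetical text-prompt call; the real sam3 API may differ.
predictor.add_text_prompt(state, frame, text="the red car", obj_id=1)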
Audio sync is your problem. The model only handles video. If you are building a streaming product, every frame you process is a frame your audio pipeline has been waiting on. Buffer audio against frame timestamps, not wall clock, or you will ship something with 200 ms of lipsync drift.
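One way to keep audio honest, sketched with placeholder names (audio_buffer and emit_audio stand in for whatever your streaming stack provides): release audio chunks only up to the timestamp of the frame you just finished segmenting.

def flush_audio_up_to(frame_ts, audio_buffer, emit_audio):
    """Emit every buffered (timestamp, chunk) pair at or before the timestamp
    of the last segmented frame, so audio never runs ahead of video."""
    while audio_buffer and audio_buffer[0][0] <= frame_ts:
        _, chunk = audio_buffer.pop(0)
        emit_audio(chunk)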
Vision models like SAM tend to live in one of two places in a product. Either as a one-shot preprocessing step that turns a video into structured data (timestamped object tracks, bounding boxes, masks) that an LLM agent then reasons about, or as an inline filter inside a realtime UI loop. SAM 3.1 is the first version where the second pattern is actually tractable.
For the preprocessing pattern, you do not need realtime. Run the base model offline, write masks and tracks to a JSON sidecar, and feed that to your downstream agent. This is the workflow we use to chop long-form video into shareable segments inside Clips, our DD product for turning podcast and YouTube footage into vertical clips. The agent reads the track data, picks a focal subject, reframes the crop, and exports. SAM 3.1's speedup means the offline pass takes minutes instead of hours on a typical hour-long source.
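A compressed sketch of that offline pass, assuming track() returns numpy masks, that a single point prompt on the first frame is enough to seed the subject, and that the base-variant repo id mirrors the tiny one (it is an assumption). The sidecar schema is illustrative, not a fixed format.

import json

import cv2
import torch
from sam3 import SAM3VideoPredictor

predictor = SAM3VideoPredictor.from_pretrained(
    "facebook/sam3.1-base", device="cuda", dtype=torch.float16
)
cap = cv2.VideoCapture("episode_42.mp4")
state = predictor.init_state()

# Seed the subject once on the first frame, exactly as in the live demo.
ret, first = cap.read()
predictor.add_point_prompt(state, first, point=(960, 540), label=1, obj_id=1)

tracks, frame_idx = [], 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    masks = predictor.track(state, frame)
    for obj_id, mask in masks.items():
        x, y, w, h = cv2.boundingRect(mask.astype("uint8"))
        tracks.append({"frame": frame_idx, "obj_id": obj_id, "bbox": [x, y, w, h]})
    frame_idx += 1

with open("episode_42.tracks.json", "w") as f:
    json.dump(tracks, f)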
For the realtime pattern, the question is what your agent does with the masks. The interesting answer is usually some form of selective generation: segment the speaker, regenerate only the background, composite. That is a content pipeline, and it is exactly the surface Content is built around — automated B-roll generation, background swaps, and visual consistency checks across long video projects.
If you want a deeper architectural walkthrough of how these vision steps slot into a multi-agent video pipeline, I covered the full system on the Developers Digest YouTube channel.
The non-obvious part of shipping a SAM 3.1-backed feature is not the model. It is the queue, the worker, the cache, and the failure path. Here is the shape that has worked.
A frontend pushes frames or video URLs into a job queue. A worker pool of GPU instances pulls jobs, runs SAM 3.1, and writes mask outputs to object storage as a packed video file (RLE-encoded masks, one per object, codec'd as h264 alpha) plus a manifest JSON. The frontend polls or subscribes for completion. The agent that consumes the masks reads the manifest, never the raw masks, because masks are huge and the manifest is enough for most decisions.
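Here is the kind of manifest I mean, with field names that are illustrative rather than a fixed schema: enough per-object summary for an agent to pick a focal subject without ever touching the packed masks.

import json

manifest = {
    "job_id": "9f3c2a",
    "source": "s3://uploads/episode_42.mp4",
    "mask_video": "s3://masks/9f3c2a_alpha.mp4",  # RLE masks packed as h264 alpha
    "fps": 30,
    "objects": [
        {"obj_id": 1, "label": "speaker", "first_frame": 0, "last_frame": 5391},
        {"obj_id": 2, "label": "slide deck", "first_frame": 120, "last_frame": 5391},
    ],
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)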
Cache aggressively at the input hash level. SAM is deterministic given a fixed prompt and frame, so identical inputs should never run twice. We see roughly 40% cache hits on real workloads because users re-process the same source video with different prompts, and the prompt-conditional cache key catches that.
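The cache key is just a hash of the source plus a canonical serialization of the prompt, along these lines (the exact prompt fields depend on your job schema):

import hashlib
import json

def cache_key(video_bytes, prompt):
    # Same video + same prompt -> same key; a different prompt misses on purpose.
    h = hashlib.sha256(video_bytes)
    h.update(json.dumps(prompt, sort_keys=True).encode())
    return h.hexdigest()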
Re-prompts are a UX problem, not a model problem. Build the affordance for users to correct a track mid-stream early. A model that is right 95% of the time still produces visibly broken output 5% of the time, and there is no amount of tuning that fixes the long tail. The right answer is letting the user click once to recover.
Three things to keep an eye on over the next two months. First, whether the open-source community ports SAM 3.1 to WebGPU. The base model is small enough that browser inference is plausible, and that would collapse the operational story for a lot of indie products. Second, whether Meta releases a finetuning recipe for domain-specific data. The current weights are general-purpose and predictably weak on medical imagery, satellite footage, and underwater video. Third, whether the text-prompt quality improves enough to fully replace point prompts in production. That would unblock a lot of zero-touch automation.
For now, the right move is to take an existing video product, find the place where you said "we cannot do this realtime," and try it. The latency wall is gone. What you build on top of that is the interesting part.