Transformers.js: Run AI Models Directly in the Browser

TL;DR
Transformers.js lets you run machine learning models in the browser with zero backend. Here is how to use it for text generation, speech recognition, image classification, and semantic search.
Official Sources
| Resource | Link |
|---|---|
| Transformers.js Documentation | huggingface.co/docs/transformers.js |
| Transformers.js GitHub | github.com/huggingface/transformers.js |
| NPM Package | npmjs.com/package/@huggingface/transformers |
| Compatible Models | huggingface.co/models?library=transformers.js |
| Transformers.js V3 Announcement | huggingface.co/blog/transformersjs-v3 |
| WebGPU Browser Support | caniuse.com/webgpu |
Most AI workflows you have seen run on a server somewhere. You send a prompt, wait for a response, and pay per token. Transformers.js flips that model. It runs machine learning models directly in the browser using WebAssembly and WebGPU. No API keys. No server. No per-token billing.
The library is built by Hugging Face and mirrors their Python transformers library. Transformers.js v3 shipped in October 2024 with WebGPU support (which Hugging Face describes as up to 100x faster than WASM in the best case), 120 supported architectures, and over 1,200 pre-converted models on the Hugging Face Hub. V4 is now available with even more models - the community has already shipped browser demos for LFM2.5 1.2B reasoning models, Voxtral real-time speech transcription, and Nemotron Nano.
Under the hood, Transformers.js uses ONNX Runtime to run models. Models with a supported architecture work once they are converted to ONNX format, and Hugging Face Hub has thousands of compatible models tagged with transformers.js.
This guide covers the practical use cases that matter for web developers.
Install
npm install @huggingface/transformers
That is it. No Python, no Docker, no GPU drivers. The models are downloaded as ONNX files and cached in the browser on first use.
The Pipeline API
Every task in Transformers.js starts with pipeline(). You pick a task type, specify a model, and call the resulting function with your input.
import { pipeline } from "@huggingface/transformers";
const classifier = await pipeline(
  "sentiment-analysis",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english"
);
const result = await classifier("I love building with AI tools.");
// [{ label: "POSITIVE", score: 0.9998 }]
The first call to pipeline() downloads and caches the model. Later loads read from the cache instead of the network. Models range from 5MB to 500MB+ depending on the architecture.
Enable WebGPU for Speed
WebGPU gives you GPU-accelerated inference in the browser. Add device: "webgpu" to your pipeline options.
const extractor = await pipeline(
  "feature-extraction",
  "mixedbread-ai/mxbai-embed-xsmall-v1",
  { device: "webgpu" }
);
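WebGPU support is around 70% globally (see caniuse.com/webgpu for current figures). Chrome and Edge support it natively. At the time of the v3 announcement, Firefox required the dom.webgpu.enabled flag and Safari required the WebGPU feature flag; newer releases may have changed this, so check before relying on either. If you omit the device option, the library runs on WebAssembly, which works everywhere. Requesting "webgpu" on a browser that lacks it may not fall back cleanly in every version, so it is safest to detect WebGPU first and pass "wasm" otherwise. Your code then works everywhere - it just runs faster with WebGPU.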
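One way to choose the device at runtime is to feature-detect WebGPU and hand the result to pipeline(). The sketch below is a minimal example under assumptions: pickDevice and NavigatorLike are hypothetical names, not part of Transformers.js. It relies only on navigator.gpu.requestAdapter(), the standard WebGPU entry point, and on "webgpu" and "wasm" as the library's device names.

```typescript
// Minimal shape of the parts of `navigator` this sketch touches (hypothetical type).
type NavigatorLike = {
  gpu?: { requestAdapter(): Promise<unknown> };
};

type Device = "webgpu" | "wasm";

// Returns "webgpu" only when the browser exposes WebGPU AND hands back an adapter.
// Anything else (no navigator, no gpu property, null adapter, thrown error) maps to "wasm".
async function pickDevice(nav: NavigatorLike | undefined): Promise<Device> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}

// Usage in the browser (not executed here):
//   const device = await pickDevice(navigator);
//   const extractor = await pipeline("feature-extraction", modelId, { device });
```

Checking for an adapter, not just for the navigator.gpu property, covers machines where the API exists but no usable GPU is exposed.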
Use Case: Semantic Search
This is the killer feature for web developers. Instead of keyword matching with libraries like fuse.js, you can embed your content and search by meaning.
import { pipeline } from "@huggingface/transformers";
const extractor = await pipeline(
  "feature-extraction",
  "mixedbread-ai/mxbai-embed-xsmall-v1",
  { device: "webgpu" }
);
// Embed your content (do this once, cache the vectors)
const docs = [
  "How to set up Claude Code with CLAUDE.md",
  "Building REST APIs with Express and TypeScript",
  "Running Whisper locally for speech recognition",
];
const docEmbeddings = await extractor(docs, {
  pooling: "mean",
  normalize: true,
});
// Embed the search query
const query = "configure AI coding agent";
const queryEmbedding = await extractor([query], {
  pooling: "mean",
  normalize: true,
});
// Compute cosine similarity and rank.
// The embeddings are normalized above, so a plain dot product equals cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}
const queryVec = queryEmbedding.tolist()[0];
const scores = docEmbeddings.tolist().map((vec: number[], i: number) => ({
  doc: docs[i],
  score: cosineSimilarity(queryVec, vec),
}));
scores.sort((a, b) => b.score - a.score);
// "How to set up Claude Code with CLAUDE.md" ranks first
The user searches for "configure AI coding agent" and the Claude Code article ranks first, even though no keywords match. That is semantic search.
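Once the document vectors are cached, each search is just a ranking over plain arrays. Here is a minimal sketch of that step as a reusable helper. The names topK, dot, and Ranked are hypothetical, and it assumes the vectors were already normalized, as in the example above.

```typescript
type Ranked = { doc: string; score: number };

// Dot product; equals cosine similarity only when both vectors are unit length.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every document against the query vector and keep the k best matches.
function topK(queryVec: number[], docVecs: number[][], docs: string[], k: number): Ranked[] {
  if (docVecs.length !== docs.length) {
    throw new Error("docs and docVecs must have the same length");
  }
  return docVecs
    .map((vec, i) => ({ doc: docs[i], score: dot(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan like this is usually fine for a few thousand documents. Beyond that, you would likely want an index, which is outside the scope of this guide.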
Use Case: Speech Recognition
Run OpenAI's Whisper model in the browser. Users record audio, and you transcribe it without sending anything to a server.
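const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-tiny.en",
  { device: "webgpu" }
);
// The pipeline takes a URL or raw audio samples, so wrap the recorded Blob in an object URL.
const audioUrl = URL.createObjectURL(audioBlob);
const result = await transcriber(audioUrl);
URL.revokeObjectURL(audioUrl);
console.log(result.text);
// "The quick brown fox jumps over the lazy dog"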
The whisper-tiny.en model is 40MB. For better accuracy, use whisper-small.en at 240MB. Both run in real time on modern hardware with WebGPU.
Use Case: Image Classification
Classify images without uploading them to a server. Useful for content moderation, auto-tagging, or building visual search.
const classifier = await pipeline(
  "image-classification",
  "onnx-community/mobilenetv4_conv_small.e2400_r224_in1k",
  { device: "webgpu" }
);
const result = await classifier(imageElement);
// [{ label: "laptop", score: 0.87 }, { label: "keyboard", score: 0.06 }]
The MobileNet model is under 20MB and classifies images in milliseconds.
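For auto-tagging, you will usually want to keep only the labels the model is reasonably sure about. A minimal sketch, assuming the result has the { label, score } shape shown above; confidentLabels and the 0.5 default threshold are hypothetical choices, not library defaults.

```typescript
type Prediction = { label: string; score: number };

// Keep labels whose score meets the threshold, preserving the original order.
function confidentLabels(predictions: Prediction[], minScore = 0.5): string[] {
  return predictions.filter((p) => p.score >= minScore).map((p) => p.label);
}
```

Tune the threshold on your own images; a moderation feature typically needs a stricter cutoff than a tagging feature.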
Use Case: Text Generation
Run small language models directly in the browser. This is not GPT-4 class, but it is useful for autocomplete, content suggestions, and creative features that do not need to be perfect.
import { pipeline } from "@huggingface/transformers";
const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-360M-Instruct"
);
const output = await generator("Explain WebGPU in one sentence:", {
  max_new_tokens: 50,
  do_sample: true, // temperature only takes effect when sampling is enabled
  temperature: 0.7,
});
console.log(output[0].generated_text);
SmolLM2 at 360M parameters is small enough for the browser and smart enough for light tasks. For the Vercel AI SDK, there is a dedicated provider:
import { streamText } from "ai";
import { transformersJS } from "@browser-ai/transformers-js";
const result = streamText({
  model: transformersJS("HuggingFaceTB/SmolLM2-360M-Instruct"),
  prompt: "Explain WebGPU in one sentence.",
});
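When you call the pipeline() generator directly, you still need to pull the text out of the result. The sketch below assumes the output is an array of { generated_text } objects, where generated_text is a string for a plain prompt and a list of chat messages for chat-style input. That shape is an assumption to verify against the docs for your version, and lastGeneratedText is a hypothetical helper.

```typescript
type Message = { role: string; content: string };
type GenerationOutput = { generated_text: string | Message[] }[];

// Returns the generated string, or the content of the final chat message.
function lastGeneratedText(output: GenerationOutput): string {
  const first = output[0];
  if (!first) return "";
  const text = first.generated_text;
  if (typeof text === "string") return text;
  // Chat-style input: the model's reply is the last message in the list.
  return text.length > 0 ? text[text.length - 1].content : "";
}
```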
Use Case: Zero-Shot Classification
Classify text into categories you define at runtime, without any training data.
const classifier = await pipeline(
  "zero-shot-classification",
  "Xenova/mobilebert-uncased-mnli"
);
const result = await classifier(
  "How do I deploy a Next.js app to Vercel?",
  ["deployment", "authentication", "database", "testing"]
);
// { labels: ["deployment", ...], scores: [0.94, ...] }
This is useful for auto-routing support questions, categorizing user feedback, or building smart content filters.
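For routing, you typically act on the top label only when its score is high enough, and send everything else to a catch-all. A minimal sketch, assuming the { labels, scores } result shape shown above; routeByTopLabel, the 0.6 threshold, and the "unrouted" fallback are hypothetical.

```typescript
type ZeroShotResult = { labels: string[]; scores: number[] };

// Pick the highest-scoring label, or the fallback when nothing is confident enough.
function routeByTopLabel(result: ZeroShotResult, minScore = 0.6, fallback = "unrouted"): string {
  // Do not assume the arrays arrive sorted; find the best score explicitly.
  let best = -1;
  for (let i = 0; i < result.scores.length; i++) {
    if (best === -1 || result.scores[i] > result.scores[best]) best = i;
  }
  if (best === -1 || result.scores[best] < minScore) return fallback;
  return result.labels[best];
}
```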
What to Know Before Shipping
Model size matters. A 50MB model download on first visit is fine for a tool page. It is not fine for a landing page. Lazy-load models after the page renders, and show a loading state.
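One way to lazy-load is to wrap the pipeline() call in a loader that starts the download on first use and shares the same promise afterwards. This is a sketch, not a library feature: lazy is a hypothetical helper, and the factory you pass in is whatever creates your pipeline.

```typescript
// Wraps an async factory so it runs at most once per successful load.
function lazy<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = factory().catch((err) => {
        pending = undefined; // allow a retry after a failed download
        throw err;
      });
    }
    return pending;
  };
}

// Usage in the browser (not executed here):
//   const getClassifier = lazy(() => pipeline("sentiment-analysis", modelId));
//   button.onclick = async () => { const classify = await getClassifier(); ... };
```

Concurrent callers all wait on the same promise, so a double click does not trigger two downloads.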
Cache aggressively. Models are cached in the browser's Cache API after first download. Subsequent visits load from the cache instead of the network, which removes the download wait. Set proper cache headers if you are self-hosting models.
WebGPU is not everywhere. Always provide a WebAssembly fallback: detect WebGPU before requesting it, and use the default WebAssembly backend otherwise. Inference will be slower on CPU.
Quantization reduces size. Most models on Hugging Face Hub have quantized variants (q4, q8, fp16). Use the smallest quantization that meets your accuracy needs.
const pipe = await pipeline("feature-extraction", "model-name", {
  dtype: "q4", // Quantized to 4-bit
});
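To get a feel for what each dtype does to download size, multiply the parameter count by the bits per weight. The sketch below is back-of-the-envelope arithmetic, not a library function. estimateWeightMegabytes is a hypothetical helper, and real files differ somewhat because of quantization metadata and layers kept at higher precision.

```typescript
type Dtype = "fp32" | "fp16" | "q8" | "q4";

const BITS_PER_WEIGHT: Record<Dtype, number> = { fp32: 32, fp16: 16, q8: 8, q4: 4 };

// Rough weight size in megabytes (1 MB = 1,000,000 bytes), ignoring file overhead.
function estimateWeightMegabytes(parameterCount: number, dtype: Dtype): number {
  const bytes = (parameterCount * BITS_PER_WEIGHT[dtype]) / 8;
  return bytes / 1e6;
}

// A 360M-parameter model: roughly 1440 MB at fp32, 720 MB at fp16, 360 MB at q8, 180 MB at q4.
```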
Do not replace your API for complex tasks. Transformers.js is excellent for embeddings, classification, and small generative tasks. For complex multi-step reasoning, you still want Claude or GPT on the server. That said, V4 demos are pushing the boundary - Hugging Face's community has shipped 1.2B parameter reasoning models and real-time speech transcription running entirely in the browser.
Practical Architecture
The pattern that works for production web apps:
- Heavy reasoning - Server-side (Claude API, GPT API)
- Search and similarity - Client-side (Transformers.js embeddings)
- Classification and tagging - Client-side (Transformers.js zero-shot)
- Speech input - Client-side (Transformers.js Whisper)
- Image understanding - Client-side (Transformers.js CLIP/MobileNet)
This hybrid approach gives you the best of both worlds: powerful reasoning from cloud APIs and instant, private, zero-cost inference for everything else.
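The split above can be written down as a small lookup that the rest of your app consults. This is only an illustration of the pattern: the task names, TASK_TARGETS, and whereToRun are hypothetical.

```typescript
type Task = "reasoning" | "search" | "classification" | "speech" | "image";
type Target = "server" | "client";

// Mirrors the list above: heavy reasoning on the server, everything else in the browser.
const TASK_TARGETS: Record<Task, Target> = {
  reasoning: "server",
  search: "client",
  classification: "client",
  speech: "client",
  image: "client",
};

function whereToRun(task: Task): Target {
  return TASK_TARGETS[task];
}
```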
Frequently Asked Questions
Does Transformers.js work with Next.js?
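Yes. For in-browser inference, import it in client components ("use client") and load models after the component mounts. Browser-only features such as WebGPU and the Cache API are not available while the page renders on the server, so keep model loading out of the server render path. Use dynamic imports with ssr: false for components that depend on it.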
How big are the models?
Model sizes range from 5MB (tiny classifiers) to 500MB+ (large language models). For most browser use cases, you want models under 100MB. Embedding models like mxbai-embed-xsmall-v1 are around 30MB. Whisper tiny is 40MB. There are over 1,200 pre-converted models on the Hugging Face Hub ready to use.
Is WebGPU required?
No. WebAssembly is the default backend in the browser, so everything works without WebGPU. WebGPU makes inference faster (often 5-10x), and Chrome and Edge support it today.
Can I fine-tune models with Transformers.js?
No. Transformers.js is inference-only. Fine-tune your model using the Python transformers library, then convert to ONNX format using Optimum and load it in Transformers.js for inference. Many models on Hugging Face Hub are already converted and tagged with transformers.js.
How does it compare to TensorFlow.js?
Transformers.js focuses specifically on transformer models from Hugging Face Hub. TensorFlow.js is a general-purpose ML framework. If you want to run pretrained NLP, vision, or audio models, Transformers.js is simpler and has better model support. If you need custom model architectures or training in the browser, use TensorFlow.js.
Further Reading:
- Transformers.js v3 Announcement - WebGPU support, 120 architectures, 1,200+ models
- Transformers.js V4 Demos - Live demos including reasoning models and real-time speech
- Transformers.js Documentation - Official API reference and guides
- Compatible Models on Hugging Face Hub - Browse all models tagged for Transformers.js
- Llama 4 Developers Guide - Server-side alternative for local inference
- Vercel AI SDK Guide - Build AI apps with the Vercel AI SDK (has Transformers.js integration)