Transformers.js: Run AI Models Directly in the Browser

Official Sources

Resource	Link
Transformers.js Documentation	huggingface.co/docs/transformers.js
Transformers.js GitHub	github.com/huggingface/transformers.js
NPM Package	npmjs.com/package/@huggingface/transformers
Compatible Models	huggingface.co/models?library=transformers.js
Transformers.js V3 Announcement	huggingface.co/blog/transformersjs-v3
WebGPU Browser Support	caniuse.com/webgpu

Every AI workflow you have seen runs on a server somewhere. You send a prompt, wait for a response, and pay per token. Transformers.js flips that model. It runs machine learning models directly in the browser using WebAssembly and WebGPU. No API keys. No server. No per-token billing.

The library is built by Hugging Face and mirrors their Python transformers library. Transformers.js v3 shipped in October 2024 with WebGPU support (reported as up to 100x faster than WASM in the best case), 120 supported architectures, and over 1,200 pre-converted models on the Hugging Face Hub. V4 is now available with even more models. The community has already shipped browser demos for LFM2.5 1.2B reasoning models, Voxtral real-time speech transcription, and Nemotron Nano.

Under the hood, Transformers.js uses ONNX Runtime to run models. A model needs to be converted to ONNX format and use an architecture the library supports, and Hugging Face Hub has thousands of compatible models tagged with transformers.js.

This guide covers the practical use cases that matter for web developers.

Install

npm install @huggingface/transformers

That is it. No Python, no Docker, no GPU drivers. The models are downloaded as ONNX files and cached in the browser on first use.

The Pipeline API

Every task in Transformers.js starts with pipeline(). You pick a task type, specify a model, and call the resulting function with your input.

import { pipeline } from "@huggingface/transformers";

const classifier = await pipeline(
  "sentiment-analysis",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english"
);

const result = await classifier("I love building with AI tools.");
// [{ label: "POSITIVE", score: 0.9998 }]

The first call downloads and caches the model. Later page loads read it from the browser cache instead of the network, so start-up is much faster. Models range from 5MB to 500MB+ depending on the architecture.
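Because loading is the expensive step, it helps to make sure it happens only once per page, even if several components ask for the same model at the same time. The sketch below is one way to do that, not something the library provides: lazyOnce and fakeLoadModel are hypothetical names, and fakeLoadModel stands in for a real pipeline() call so the example runs without downloading anything.

```typescript
// Wrap any async loader so it runs at most once; later callers share the same promise.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = load().catch((err) => {
        cached = undefined; // forget the failed attempt so a later call can retry
        throw err;
      });
    }
    return cached;
  };
}

// Hypothetical stand-in for pipeline(...); counts how often the expensive load runs.
let loadCount = 0;
async function fakeLoadModel(): Promise<(text: string) => string> {
  loadCount += 1;
  return (text) => `classified: ${text}`;
}

const getClassifier = lazyOnce(fakeLoadModel);
```

In a real app you would pass something like () => pipeline("sentiment-analysis", modelId) to lazyOnce and call getClassifier() from an event handler or after the page has rendered.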

Enable WebGPU for Speed

WebGPU gives you GPU-accelerated inference in the browser. Add device: "webgpu" to your pipeline options.

const extractor = await pipeline(
  "feature-extraction",
  "mixedbread-ai/mxbai-embed-xsmall-v1",
  { device: "webgpu" }
);

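WebGPU support is around 70% globally (caniuse.com/webgpu has current figures). Chrome and Edge support it natively. Firefox and Safari have required a feature flag (dom.webgpu.enabled in Firefox), though this varies by version and platform. Do not assume that asking for device: "webgpu" degrades gracefully everywhere: depending on the library version, the request may throw when WebGPU is unavailable. Detect support first and use WebAssembly (device: "wasm") otherwise, so your code works everywhere and simply runs faster with WebGPU.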
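If you would rather choose the device explicitly than rely on fallback behaviour, a small feature check is enough. This is a minimal sketch: detectDevice, GpuLike, and NavigatorLike are hypothetical names introduced here, and the only browser API it assumes is navigator.gpu.requestAdapter() from the WebGPU spec, which resolves to null when no adapter is available.

```typescript
// Minimal shapes for the parts of `navigator` this sketch touches.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Device = "webgpu" | "wasm";

// Resolve to "webgpu" only when the API exists and actually yields an adapter.
async function detectDevice(nav: NavigatorLike | undefined): Promise<Device> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure while probing as "no WebGPU".
    return "wasm";
  }
}
```

In the browser you would call detectDevice(navigator) once and pass the result as the device option to pipeline().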

Use Case: Semantic Search

This is the killer feature for web developers. Instead of keyword matching with libraries like fuse.js, you can embed your content and search by meaning.

import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "mixedbread-ai/mxbai-embed-xsmall-v1",
  { device: "webgpu" }
);

// Embed your content (do this once, cache the vectors)
const docs = [
  "How to set up Claude Code with CLAUDE.md",
  "Building REST APIs with Express and TypeScript",
  "Running Whisper locally for speech recognition",
];
const docEmbeddings = await extractor(docs, {
  pooling: "mean",
  normalize: true,
});

// Embed the search query
const query = "configure AI coding agent";
const queryEmbedding = await extractor([query], {
  pooling: "mean",
  normalize: true,
});

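// Compute cosine similarity and rank.
// The embeddings were created with normalize: true, so each vector has length 1
// and the dot product is equal to the cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}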

const queryVec = queryEmbedding.tolist()[0];
const scores = docEmbeddings.tolist().map((vec: number[], i: number) => ({
  doc: docs[i],
  score: cosineSimilarity(queryVec, vec),
}));

scores.sort((a, b) => b.score - a.score);
// "How to set up Claude Code with CLAUDE.md" ranks first

The user searches for "configure AI coding agent" and the Claude Code article ranks first, even though no keywords match. That is semantic search.
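If you cache vectors from different sources, you cannot always assume they are normalized. The sketch below shows the ranking step on its own, using the full cosine formula and returning only the top k results. The names cosine and rankDocs are hypothetical, and the toy two-dimensional vectors stand in for real embeddings.

```typescript
interface ScoredDoc {
  doc: string;
  score: number;
}

// Full cosine similarity: works whether or not the inputs are normalized.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every document against the query and keep the k best matches.
function rankDocs(queryVec: number[], docVecs: number[][], docs: string[], k: number): ScoredDoc[] {
  return docVecs
    .map((vec, i) => ({ doc: docs[i], score: cosine(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: the second vector points the same way as the query, so it ranks first.
const top = rankDocs([1, 0], [[0, 1], [2, 0], [1, 1]], ["a", "b", "c"], 2);
```

For a few thousand documents, a linear scan like this is usually fast enough in the browser. Larger collections may need an approximate nearest-neighbour index, which is outside the scope of this sketch.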

Use Case: Speech Recognition

Run OpenAI's Whisper model in the browser. Users record audio, and you transcribe it without sending anything to a server.

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-tiny.en",
  { device: "webgpu" }
);

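// Assumption: the pipeline takes a URL or raw audio samples rather than a Blob,
// so the recording is handed over as a temporary object URL.
const audioUrl = URL.createObjectURL(audioBlob);
const result = await transcriber(audioUrl);
URL.revokeObjectURL(audioUrl);
console.log(result.text);
// "The quick brown fox jumps over the lazy dog"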

The whisper-tiny.en model is about 40MB. For better accuracy, use whisper-small.en at about 240MB. Both can run in real time on modern hardware with WebGPU.
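Whisper models work on 16 kHz mono audio. If you pass raw samples rather than a URL, you may need to prepare them yourself. The sketch below shows the two steps under simple assumptions: downmixToMono averages two channels, and resampleLinear uses plain linear interpolation with no low-pass filter. Both names are hypothetical helpers, and the resampler is a naive illustration rather than production-quality audio processing.

```typescript
// Average two channels into a single mono buffer.
function downmixToMono(left: Float32Array, right: Float32Array): Float32Array {
  const length = Math.min(left.length, right.length);
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    mono[i] = (left[i] + right[i]) / 2;
  }
  return mono;
}

// Naive linear-interpolation resampler (no anti-aliasing filter).
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const lo = Math.floor(pos);
    const hi = Math.min(lo + 1, input.length - 1);
    const frac = pos - lo;
    out[i] = input[lo] * (1 - frac) + input[hi] * frac;
  }
  return out;
}

// Example: one second of 48 kHz stereo silence becomes 16,000 mono samples.
const mono16k = resampleLinear(
  downmixToMono(new Float32Array(48000), new Float32Array(48000)),
  48000,
  16000
);
```

In a browser, decoding through an AudioContext created with a 16000 sample rate is often a simpler way to get correctly resampled audio. The manual version is mainly useful in workers, where that API is not available.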

Use Case: Image Classification

Classify images without uploading them to a server. Useful for content moderation, auto-tagging, or building visual search.

const classifier = await pipeline(
  "image-classification",
  "onnx-community/mobilenetv4_conv_small.e2400_r224_in1k",
  { device: "webgpu" }
);

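// Assumption: the pipeline takes an image URL rather than a DOM element,
// so pass the src of an already loaded <img>.
const result = await classifier(imageElement.src);
// [{ label: "laptop", score: 0.87 }, { label: "keyboard", score: 0.06 }]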

The MobileNet model is under 20MB and classifies images in milliseconds.

Use Case: Text Generation

Run small language models directly in the browser. This is not GPT-4 class, but it is useful for autocomplete, content suggestions, and creative features that do not need to be perfect.

import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-360M-Instruct"
);

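const output = await generator("Explain WebGPU in one sentence:", {
  max_new_tokens: 50,
  do_sample: true, // temperature only has an effect when sampling is enabled
  temperature: 0.7,
});
console.log(output[0].generated_text);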

SmolLM2 at 360M parameters is small enough for the browser and smart enough for light tasks. For the Vercel AI SDK, there is a dedicated provider:

import { streamText } from "ai";
import { transformersJS } from "@browser-ai/transformers-js";

const result = streamText({
  model: transformersJS("HuggingFaceTB/SmolLM2-360M-Instruct"),
  prompt: "Explain WebGPU in one sentence.",
});

Use Case: Zero-Shot Classification

Classify text into categories you define at runtime, without any training data.

const classifier = await pipeline(
  "zero-shot-classification",
  "Xenova/mobilebert-uncased-mnli"
);

const result = await classifier(
  "How do I deploy a Next.js app to Vercel?",
  ["deployment", "authentication", "database", "testing"]
);
// { labels: ["deployment", ...], scores: [0.94, ...] }

This is useful for auto-routing support questions, categorizing user feedback, or building smart content filters.
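For routing, you usually want a single label plus a guard for low-confidence results. The sketch below assumes the { labels, scores } shape shown in the comment above. The pickLabel helper, the 0.5 threshold, and the "unknown" fallback are illustrative choices, not library defaults.

```typescript
// Shape of the zero-shot result used in this article.
interface ZeroShotResult {
  labels: string[];
  scores: number[];
}

// Return the highest-scoring label, or a fallback when nothing is confident enough.
function pickLabel(result: ZeroShotResult, minScore = 0.5, fallback = "unknown"): string {
  let bestIndex = -1;
  let bestScore = -Infinity;
  result.scores.forEach((score, i) => {
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  });
  if (bestIndex === -1 || bestScore < minScore) return fallback;
  return result.labels[bestIndex];
}

// Hand-written results standing in for real classifier output.
const confident = pickLabel({
  labels: ["deployment", "authentication", "database", "testing"],
  scores: [0.94, 0.03, 0.02, 0.01],
});
const unsure = pickLabel({
  labels: ["deployment", "authentication"],
  scores: [0.3, 0.28],
});
```

Here confident is "deployment" and unsure falls back to "unknown", which you could send to a human or a server-side model instead.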

What to Know Before Shipping

Model size matters. A 50MB model download on first visit is fine for a tool page. It is not fine for a landing page. Lazy-load models after the page renders, and show a loading state.

Cache aggressively. Models are cached in the browser's Cache API after first download. Subsequent visits load from cache in milliseconds. Set proper cache headers if you are self-hosting models.

WebGPU is not everywhere. Always provide a WebAssembly fallback, and test that it actually triggers in the library version you ship rather than assuming it does. Inference will be slower on CPU.

Quantization reduces size. Most models on Hugging Face Hub have quantized variants (q4, q8, fp16). Use the smallest quantization that meets your accuracy needs.

const pipe = await pipeline("feature-extraction", "model-name", {
  dtype: "q4", // Quantized to 4-bit
});
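To get a feel for what each dtype means for download size, you can do the arithmetic on bytes per parameter. This is a back-of-the-envelope sketch: it assumes 4, 2, 1, and 0.5 bytes per parameter for fp32, fp16, q8, and q4, and ignores file metadata, quantization scales, and any layers left unquantized, so real files will differ. estimateSizeMB is a hypothetical helper.

```typescript
type Dtype = "fp32" | "fp16" | "q8" | "q4";

// Approximate storage cost per parameter for each dtype.
const BYTES_PER_PARAM: Record<Dtype, number> = {
  fp32: 4,
  fp16: 2,
  q8: 1,
  q4: 0.5,
};

// Rough download size in megabytes (1 MB = 1,000,000 bytes).
function estimateSizeMB(paramCount: number, dtype: Dtype): number {
  return (paramCount * BYTES_PER_PARAM[dtype]) / 1e6;
}

// A 360M-parameter model such as SmolLM2-360M at each precision.
const sizes = {
  fp32: estimateSizeMB(360e6, "fp32"), // 1440
  fp16: estimateSizeMB(360e6, "fp16"), // 720
  q8: estimateSizeMB(360e6, "q8"), // 360
  q4: estimateSizeMB(360e6, "q4"), // 180
};
```

The same model goes from roughly 1.4GB to under 200MB, which is the difference between unusable and acceptable for a browser download.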

Do not replace your API for complex tasks. Transformers.js is excellent for embeddings, classification, and small generative tasks. For complex multi-step reasoning, you still want Claude or GPT on the server. That said, V4 demos are pushing the boundary - Hugging Face's community has shipped 1.2B parameter reasoning models and real-time speech transcription running entirely in the browser.

Practical Architecture

The pattern that works for production web apps:

Heavy reasoning - Server-side (Claude API, GPT API)
Search and similarity - Client-side (Transformers.js embeddings)
Classification and tagging - Client-side (Transformers.js zero-shot)
Speech input - Client-side (Transformers.js Whisper)
Image understanding - Client-side (Transformers.js CLIP/MobileNet)

This hybrid approach gives you the best of both worlds: powerful reasoning from cloud APIs and instant, private, zero-cost inference for everything else.
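The split above can be written down as a small routing table. This is only a sketch of the idea: the Task and Target names, and the rule of sending client-side tasks to the server when local models are unavailable, are assumptions made for illustration.

```typescript
type Task = "reasoning" | "search" | "classification" | "speech" | "image";
type Target = "server" | "client";

// Preferred execution target for each kind of work, following the list above.
const DEFAULT_TARGET: Record<Task, Target> = {
  reasoning: "server",
  search: "client",
  classification: "client",
  speech: "client",
  image: "client",
};

// Use the preferred target, but send client tasks to the server
// when local models cannot be used (for example, on a low-memory device).
function routeTask(task: Task, clientModelsAvailable = true): Target {
  const preferred = DEFAULT_TARGET[task];
  return preferred === "client" && !clientModelsAvailable ? "server" : preferred;
}
```

Keeping the decision in one function makes it easy to change the split later, for example as larger models become practical in the browser.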

Frequently Asked Questions

Does Transformers.js work with Next.js?

Yes. Import it in client components ("use client") and load models after the component mounts. Server-side rendering will fail since the library needs browser APIs. Use dynamic imports with ssr: false for pages that depend on it.

How big are the models?

Model sizes range from 5MB (tiny classifiers) to 500MB+ (large language models). For most browser use cases, you want models under 100MB. Embedding models like mxbai-embed-xsmall-v1 are around 30MB. Whisper tiny is 40MB. There are over 1,200 pre-converted models on the Hugging Face Hub ready to use.

Is WebGPU required?

No. Transformers.js also runs on WebAssembly, so everything works without WebGPU. WebGPU makes inference faster (often 5-10x, and more in the best cases). Chrome and Edge support WebGPU today.

Can I fine-tune models with Transformers.js?

No. Transformers.js is inference-only. Fine-tune your model using the Python transformers library, then convert to ONNX format using Optimum and load it in Transformers.js for inference. Many models on Hugging Face Hub are already converted and tagged with transformers.js.

How does it compare to TensorFlow.js?

Transformers.js focuses specifically on transformer models from Hugging Face Hub. TensorFlow.js is a general-purpose ML framework. If you want to run pretrained NLP, vision, or audio models, Transformers.js is simpler and has better model support. If you need custom model architectures or training in the browser, use TensorFlow.js.

Further Reading:

Transformers.js v3 Announcement - WebGPU support, 120 architectures, 1,200+ models
Transformers.js V4 Demos - Live demos including reasoning models and real-time speech
Transformers.js Documentation - Official API reference and guides
Compatible Models on Hugging Face Hub - Browse all models tagged for Transformers.js
Llama 4 Developers Guide - Server-side alternative for local inference
Vercel AI SDK Guide - Build AI apps with the Vercel AI SDK (has Transformers.js integration)

