
TL;DR
A fair, sourced comparison of the TTS APIs developers reach for in 2026: OpenAI, ElevenLabs, xAI Grok, and Cartesia. Quality vs latency vs price, streaming, voice cloning policies, and whether to route through an AI gateway or go direct.
Best for
Developers comparing real tool tradeoffs before choosing a stack.
Covers
Verdict, tradeoffs, pricing signals, workflow fit, and related alternatives.
Picking a text-to-speech API is a positioning problem more than a quality problem. Most of the major options sound good now. What separates them is where they sit on the three-way tradeoff between quality, latency, and price, plus two policy questions that decide whether they fit your product at all: streaming and voice cloning.
This is a fair, sourced look at the four APIs developers reach for most in 2026: OpenAI, ElevenLabs, xAI Grok, and Cartesia. Every pricing number below is from a primary source and linked. Where numbers churn, we point you at the page rather than freeze a figure that will drift.
There is no single best TTS API, only the best fit for your latency budget, your quality bar, and your per-character cost ceiling. A voice agent that has to respond in real time cares about time-to-first-audio above all. A batch narration pipeline for articles or podcasts cares about naturalness and price and barely notices latency. Read the comparison through your own workload, not a leaderboard.
OpenAI's current TTS model is gpt-4o-mini-tts, with the older tts-1 and tts-1-hd still available as legacy options. The distinctive feature is steerability: alongside the text and voice, you pass an instructions field like "Speak in a cheerful and positive tone," which shifts delivery without a new voice. Streaming is supported, so you can start playing audio before the full clip is generated. See the text-to-speech guide.
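As a rough illustration of the steerability and streaming described above, here is a minimal sketch using only the Python standard library. It assumes the `https://api.openai.com/v1/audio/speech` endpoint and an `OPENAI_API_KEY` environment variable; the voice name `coral`, the helper names, and the chunk size are illustrative choices, not something this article specifies. Check the text-to-speech guide for the current request shape before relying on it.

```python
import json
import os
import urllib.request

# Assumed endpoint; verify against OpenAI's current API reference.
OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, voice="coral", instructions=None):
    """Build the JSON body for a speech request. No network call happens here."""
    body = {"model": "gpt-4o-mini-tts", "input": text, "voice": voice}
    if instructions:
        # Steers delivery (tone, pacing) without switching to a different voice.
        body["instructions"] = instructions
    return body


def stream_speech(body, out_path, chunk_size=4096):
    """POST the body and write audio bytes to disk as they arrive."""
    request = urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        # Read incrementally instead of waiting for the whole clip.
        while chunk := response.read(chunk_size):
            f.write(chunk)
```

In a voice agent you would hand each chunk to an audio player instead of a file; the file write just keeps the sketch short.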
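For pricing, confirm the current gpt-4o-mini-tts and legacy tts-1 rates on OpenAI's pricing page, since OpenAI has been reshaping its audio lineup and posted numbers move.
ElevenLabs is the quality-and-cloning specialist, with a model lineup that lets you trade latency for richness. Per its API pricing, Flash/Turbo runs $0.05 per 1K characters at around 75ms latency, and Multilingual runs $0.10 per 1K characters.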
xAI shipped standalone Grok Speech to Text and Text to Speech APIs built on the stack behind Grok Voice. The pitch is simple, predictable pricing.
Text to Speech is billed at $15.00 per 1M characters through the /v1/tts endpoint. There is also a grok-voice-latest speech-to-speech option billed at $3.00 per hour for conversational agents.
Cartesia's Sonic model competes on raw speed. It advertises sub-90ms latency and a roughly 40ms time-to-first-audio, is natively multilingual across 40+ languages, and offers instant voice cloning from a short clip.
Whatever you pick, prototype with your real content. Naturalness is subjective and workload-specific, and a 30-second test on your actual scripts tells you more than any spec sheet.
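One way to run that prototype is to time how long the first audio chunk takes to arrive on your own scripts. The sketch below is provider-agnostic: it measures any iterable of byte chunks, and `fake_stream` is a hypothetical stand-in for a real streaming response. The timer starts when iteration begins, so a real client should issue its request lazily inside the generator for the number to include network time.

```python
import time
from typing import Iterable, Iterator


def measure_first_chunk(chunks: Iterable[bytes]) -> dict:
    """Time the first non-empty chunk and the full stream, in seconds."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # Ignore empty keep-alive chunks.
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    return {"first_chunk_s": first_chunk_s, "total_s": total_s, "bytes": total_bytes}


def fake_stream(delay_s: float = 0.05, n: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's streaming audio response."""
    for _ in range(n):
        time.sleep(delay_s)
        yield b"\x00" * 1024


if __name__ == "__main__":
    print(measure_first_chunk(fake_stream()))
```

Run it several times per provider with the same script and compare the spread, not a single reading.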
If you already send chat traffic through an AI gateway like Vercel AI Gateway, a fair question is whether TTS should ride the same rails. In 2026 the answer is usually no. AI gateways today focus on text generation and embeddings, and their unified request shapes are built around chat completions, not audio streams. Text-to-speech providers each expose their own audio endpoints, streaming formats, and voice parameters, so you generally call the TTS provider directly.
The practical pattern most teams land on: route text and reasoning through a gateway for one key, fallbacks, and spend visibility, and keep a thin direct client per TTS provider. That keeps your audio path close to the provider's streaming API, where the latency wins actually live, while your text stack stays consolidated. Abstract the TTS call behind a small internal interface so swapping providers later is a one-file change, not a refactor.
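The small internal interface mentioned above could look something like this. It is a sketch, not a prescribed design: `SpeechSynthesizer`, `SilentSynthesizer`, and `narrate` are hypothetical names, and the silent provider only exists so the example runs without credentials.

```python
from typing import Iterator, Protocol


class SpeechSynthesizer(Protocol):
    """What the rest of the app depends on: text in, audio chunks out."""

    def synthesize(self, text: str, voice: str) -> Iterator[bytes]:
        ...


class SilentSynthesizer:
    """Hypothetical placeholder provider that emits silence, one chunk per word."""

    def synthesize(self, text: str, voice: str) -> Iterator[bytes]:
        for _ in text.split():
            yield b"\x00" * 320


def narrate(tts: SpeechSynthesizer, text: str, voice: str = "default") -> bytes:
    """Application code sees only the interface, never a vendor SDK."""
    return b"".join(tts.synthesize(text, voice))


if __name__ == "__main__":
    audio = narrate(SilentSynthesizer(), "hello there world")
    print(len(audio))
```

Each real provider then gets its own class with the same `synthesize` method, and swapping vendors means changing which class is constructed at startup.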
ElevenLabs Flash/Turbo (around 75ms per its API pricing) and Cartesia Sonic (sub-90ms, with roughly 40ms time-to-first-audio per Cartesia) lead on latency. Grok TTS advertises sub-second latency in its voice docs.
ElevenLabs and Cartesia offer voice cloning: instant cloning from a short clip, and professional cloning at higher tiers. OpenAI and Grok ship fixed preset voices without public cloning. Always get consent for the source voice and check each vendor's terms.
Grok TTS is $15.00 per 1M characters (xAI). ElevenLabs is $0.05 per 1K characters for Flash/Turbo and $0.10 per 1K for Multilingual (ElevenLabs). Cartesia is credit-based from a free tier upward (Cartesia). Confirm OpenAI's current rates on its pricing page.
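To compare those rates on a common basis, it helps to normalize everything to dollars per 1M characters: $0.05 per 1K is $50 per 1M, and $0.10 per 1K is $100 per 1M. The sketch below does that arithmetic using only the figures quoted in this article. The dictionary keys and function name are illustrative, and the rates will drift, so treat the table as a placeholder for whatever the vendors' pricing pages say today.

```python
# USD per 1M characters, taken from the figures quoted in this article.
# These drift; re-check each vendor's pricing page before budgeting.
RATES_PER_1M_CHARS = {
    "grok-tts": 15.00,
    "elevenlabs-flash-turbo": 0.05 * 1000,    # $0.05 per 1K characters
    "elevenlabs-multilingual": 0.10 * 1000,   # $0.10 per 1K characters
}


def estimate_cost(provider: str, characters: int) -> float:
    """Return the estimated USD cost of synthesizing `characters` characters."""
    rate = RATES_PER_1M_CHARS[provider]
    return round(characters * rate / 1_000_000, 2)


if __name__ == "__main__":
    monthly_characters = 200_000  # Hypothetical workload.
    for name in RATES_PER_1M_CHARS:
        print(f"{name}: ${estimate_cost(name, monthly_characters):.2f}")
```

OpenAI and Cartesia are left out on purpose: the article gives no per-character figure for either, and Cartesia bills in credits.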
OpenAI, ElevenLabs, Grok, and Cartesia all support streaming audio, so playback can start before the full clip is generated. That is what makes real-time voice agents feel responsive.
Routing TTS through an AI gateway is usually not worth it today. Gateways focus on text and embeddings, so call TTS providers directly and keep the audio path close to their streaming APIs. See the Vercel AI Gateway guide for the text side.