Text-to-Speech APIs for Developers in 2026: What to Actually Use

Picking a text-to-speech API is a positioning problem more than a quality problem. Most of the major options sound good now. What separates them is where they sit on the three-way tradeoff between quality, latency, and price, plus two policy questions that decide whether they fit your product at all: streaming and voice cloning.

This is a fair, sourced look at the four APIs developers reach for most in 2026: OpenAI, ElevenLabs, xAI Grok, and Cartesia. Every pricing number below is from a primary source and linked. Where numbers churn, we point you at the page rather than freeze a figure that will drift.

The three-way tradeoff

There is no single best TTS API, only the best fit for your latency budget, your quality bar, and your per-character cost ceiling. A voice agent that has to respond in real time cares about time-to-first-audio above all. A batch narration pipeline for articles or podcasts cares about naturalness and price and barely notices latency. Read the comparison through your own workload, not a leaderboard.
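
Since time-to-first-audio drives the real-time case, it helps to measure it on your own scripts rather than rely on advertised figures. The sketch below is provider-agnostic: it times any iterator of audio chunks. `measure_stream` and `fake_stream` are hypothetical names, and the fake stream only simulates network delay; in practice you would pass the chunk iterator from whichever vendor client you are testing.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time_to_first_audio_s, total_s, total_bytes) for a chunk stream.

    Pass a lazy iterator so the request itself happens inside the timed loop.
    """
    start = time.perf_counter()
    first_audio = None
    total_bytes = 0
    for chunk in chunks:
        # Record the moment the first non-empty chunk arrives.
        if first_audio is None and chunk:
            first_audio = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    return (first_audio if first_audio is not None else total, total, total_bytes)


def fake_stream(delay_s: float = 0.02, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider stream: sleeps, then yields silence."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320


if __name__ == "__main__":
    ttfa, total, size = measure_stream(fake_stream())
    print(f"first audio after {ttfa * 1000:.0f} ms, {size} bytes in {total * 1000:.0f} ms")
```

Run it several times per provider and compare medians; a single request says little about tail latency.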

OpenAI

OpenAI's current TTS model is gpt-4o-mini-tts, with the older tts-1 and tts-1-hd still available as legacy options. The distinctive feature is steerability: alongside the text and voice, you pass an instructions field like "Speak in a cheerful and positive tone," which shifts delivery without a new voice. Streaming is supported, so you can start playing audio before the full clip is generated. See the text-to-speech guide.

Voices: a fixed set of preset voices (alloy, coral, and others). No custom voice cloning.
Streaming: yes, real-time audio output via streaming responses.
Cloning policy: none. OpenAI does not offer voice cloning here, which sidesteps consent and likeness questions entirely but limits brand-specific voices.
Pricing: published on the OpenAI pricing page. Confirm the current gpt-4o-mini-tts and legacy tts-1 rates there, since OpenAI has been reshaping its audio lineup and posted numbers move.
Best for: teams already on OpenAI who want good-enough voices, tone steering, and one fewer vendor.
Disclosure: OpenAI's usage policy asks you to disclose to end users that the voice is AI-generated.
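
To make the steerability point concrete, here is a minimal sketch of a request that pairs the text and voice with an instructions field. It uses only the standard library and assumes the `/v1/audio/speech` endpoint and field names from OpenAI's public API reference; check the current text-to-speech guide before relying on them. `build_speech_request` and `synthesize_to_file` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional

# Assumed endpoint; confirm against OpenAI's current API reference.
OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, api_key: str, voice: str = "coral",
                         instructions: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for gpt-4o-mini-tts."""
    payload = {"model": "gpt-4o-mini-tts", "voice": voice, "input": text}
    if instructions:
        # The instructions field steers delivery without changing the voice.
        payload["instructions"] = instructions
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, path: str,
                       chunk_size: int = 4096) -> None:
    """Send the request and write audio bytes to disk as they arrive."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(chunk_size):
            out.write(chunk)
```

Building the request separately from sending it keeps the payload easy to inspect in tests. Calling `synthesize_to_file(build_speech_request("Hello there.", api_key, instructions="Speak in a cheerful and positive tone."), "hello.mp3")` would perform the actual network call and requires a valid key.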

ElevenLabs

ElevenLabs is the quality-and-cloning specialist, with a model lineup that lets you trade latency for richness. Per its API pricing:

Flash / Turbo: $0.05 per 1,000 characters, ultra-low latency around 75ms, up to a 40,000-character limit. This is the tier for real-time voice agents.
Multilingual v2 / v3: $0.10 per 1,000 characters, latency around 250-300ms, tuned for the most natural, expressive output.
Streaming: yes.
Cloning policy: instant voice cloning from a short sample and professional voice cloning are core features. That power comes with consent obligations. Read the current ElevenLabs terms before cloning any voice you do not own.
Best for: products where voice quality or a custom cloned voice is the point, and teams that want to dial latency up or down per use case with the same vendor.

xAI Grok

xAI shipped standalone Grok Speech to Text and Text to Speech APIs built on the stack behind Grok Voice. The pitch is simple, predictable pricing.

Pricing: $15.00 per 1 million characters for TTS (equivalent to $0.015 per 1,000 characters), per xAI's announcement and voice docs.
Voices: 5 expressive voices, with Speech Tags for delivery control and telephony codecs for phone use cases.
Latency: sub-second, on the /v1/tts endpoint.
Streaming: yes, with a separate real-time grok-voice-latest speech-to-speech option billed at $3.00 per hour for conversational agents.
Cloning policy: the public TTS product ships with fixed expressive voices rather than open voice cloning.
Best for: developers who want flat, easy-to-forecast per-character pricing, telephony-ready output, and a single vendor for both transcription and speech.
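
A flat rate makes forecasting a one-line calculation. The sketch below compares monthly cost at a given character volume using only the per-character rates quoted in this article; the dictionary keys and function names are hypothetical, and the rates themselves should be re-checked against each vendor's pricing page before budgeting.

```python
from typing import Dict

# Rates quoted in this article, expressed in USD per 1,000 characters.
RATES_USD_PER_1K_CHARS: Dict[str, float] = {
    "grok-tts": 0.015,                # $15.00 per 1M characters
    "elevenlabs-flash-turbo": 0.05,
    "elevenlabs-multilingual": 0.10,
}


def monthly_cost_usd(characters_per_month: int, rate_per_1k: float) -> float:
    """Cost of synthesizing the given number of characters at a flat rate."""
    return characters_per_month / 1000 * rate_per_1k


def compare_costs(characters_per_month: int) -> Dict[str, float]:
    """Monthly cost per option, rounded to cents."""
    return {
        name: round(monthly_cost_usd(characters_per_month, rate), 2)
        for name, rate in RATES_USD_PER_1K_CHARS.items()
    }


if __name__ == "__main__":
    for name, cost in compare_costs(2_000_000).items():
        print(f"{name}: ${cost:.2f}/month")
```

At 2 million characters a month, these rates work out to $30 for Grok TTS, $100 for ElevenLabs Flash/Turbo, and $200 for ElevenLabs Multilingual. Cartesia and OpenAI are left out because the article gives credit tiers and a pricing-page pointer for them rather than a per-character rate.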

Cartesia

Cartesia's Sonic model competes on raw speed. It advertises sub-90ms latency and roughly 40ms time-to-first-audio, is natively multilingual across 40+ languages, and offers instant voice cloning from a short clip.

Pricing: credit-based tiers per the Cartesia pricing page, from a free tier (20K credits/month) up through paid tiers ($5 for 100K credits/month with instant voice cloning, $49 for 1.25M credits/month with pro voice cloning, and higher). Credits map to generated audio, so model your real character volume against a tier.
Streaming: yes, with time-to-first-audio as the headline metric.
Cloning policy: instant voice cloning on paid tiers, professional cloning higher up. As with any cloning vendor, get consent for the source voice.
Best for: latency-critical, real-time voice applications where time-to-first-audio is the metric that makes or breaks the experience.
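
Modeling volume against a tier can be sketched as a simple lookup. The code below takes a monthly credit estimate and returns the cheapest of the three tiers quoted above. The tier labels are placeholders rather than Cartesia's plan names, the higher tiers are omitted because the article does not list them, and converting characters to credits is left to you because that mapping comes from Cartesia's pricing page.

```python
from typing import List, NamedTuple, Optional


class Tier(NamedTuple):
    label: str               # placeholder label, not an official plan name
    usd_per_month: int
    credits_per_month: int


# Tiers as quoted in this article; confirm against the Cartesia pricing page.
TIERS: List[Tier] = [
    Tier("free", 0, 20_000),
    Tier("$5 tier", 5, 100_000),
    Tier("$49 tier", 49, 1_250_000),
]


def cheapest_tier(credits_needed: int) -> Optional[Tier]:
    """Return the lowest-priced tier covering the volume, or None if none does."""
    for tier in sorted(TIERS, key=lambda t: t.usd_per_month):
        if credits_needed <= tier.credits_per_month:
            return tier
    return None


if __name__ == "__main__":
    for credits in (15_000, 500_000, 2_000_000):
        tier = cheapest_tier(credits)
        print(credits, "->", tier.label if tier else "above the listed tiers")
```

A `None` result means the volume exceeds the tiers listed here, which is the cue to look at the higher plans.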

How to choose

Real-time voice agent: start with ElevenLabs Flash/Turbo or Cartesia Sonic. Both are built for the sub-100ms band. Grok's sub-second TTS is a strong fit when you also want telephony codecs and flat pricing.
Batch narration (articles, podcasts, courses): ElevenLabs Multilingual for maximum naturalness, or Grok and OpenAI for predictable cost at volume where a few hundred milliseconds does not matter.
You need a custom or cloned brand voice: ElevenLabs or Cartesia. OpenAI and Grok ship fixed voices only.
You want one fewer vendor: OpenAI if you are already there, or Grok if you also want its STT.
Predictable cost is the priority: Grok's flat per-character rate is the easiest to forecast; ElevenLabs and Cartesia require modeling character or credit volume against tiers.

Whatever you pick, prototype with your real content. Naturalness is subjective and workload-specific, and a 30-second test on your actual scripts tells you more than any spec sheet.

Routing TTS: gateway or direct?

If you already send chat traffic through an AI gateway like Vercel AI Gateway, a fair question is whether TTS should ride the same rails. In 2026 the answer is usually no. AI gateways today focus on text generation and embeddings, and their unified request shapes are built around chat completions, not audio streams. Text-to-speech providers each expose their own audio endpoints, streaming formats, and voice parameters, so you generally call the TTS provider directly.

The practical pattern most teams land on: route text and reasoning through a gateway for one key, fallbacks, and spend visibility, and keep a thin direct client per TTS provider. That keeps your audio path close to the provider's streaming API, where the latency wins actually live, while your text stack stays consolidated. Abstract the TTS call behind a small internal interface so swapping providers later is a one-file change, not a refactor.
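
The small internal interface can be as little as one protocol and a registry. This is a sketch under assumptions, not a prescribed design: `SpeechProvider`, `EchoProvider`, `PROVIDERS`, and `speak` are all hypothetical names, and the echo provider exists only so the example runs without network access. A real adapter would wrap one vendor's streaming endpoint behind the same `stream` method.

```python
from typing import Dict, Iterator, Protocol


class SpeechProvider(Protocol):
    """What the rest of the codebase is allowed to know about a TTS vendor."""

    def stream(self, text: str, voice: str) -> Iterator[bytes]:
        ...


class EchoProvider:
    """Test double: yields the UTF-8 text in small chunks instead of audio."""

    def stream(self, text: str, voice: str) -> Iterator[bytes]:
        data = text.encode("utf-8")
        for i in range(0, len(data), 4):
            yield data[i:i + 4]


# Swapping vendors means registering a different adapter here.
PROVIDERS: Dict[str, SpeechProvider] = {"echo": EchoProvider()}


def speak(text: str, voice: str = "default", provider: str = "echo") -> Iterator[bytes]:
    """Single entry point for callers; returns a stream of audio chunks."""
    return PROVIDERS[provider].stream(text, voice)


if __name__ == "__main__":
    print(b"".join(speak("hello world")))
```

Returning an iterator of bytes keeps the streaming path intact, so callers can start playback on the first chunk regardless of which vendor sits behind the interface.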

FAQ

Which TTS API has the lowest latency?

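ElevenLabs Flash/Turbo (around 75ms per its [API pricing](https://elevenlabs.io/pricing/api)) and Cartesia Sonic (sub-90ms, with roughly 40ms time-to-first-audio per [Cartesia](https://cartesia.ai/sonic/)) lead on latency. Grok TTS advertises sub-second latency on its [voice docs](https://docs.x.ai/developers/model-capabilities/audio/voice).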

Which options support voice cloning?

ElevenLabs and Cartesia both offer voice cloning: instant cloning from a short clip, and professional cloning at higher tiers. OpenAI and Grok ship fixed preset voices without public cloning. Always get consent for the source voice and check each vendor's terms.

What does TTS cost?

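Grok TTS is $15.00 per 1M characters ([xAI](https://x.ai/news/grok-stt-and-tts-apis)). ElevenLabs is $0.05 per 1K characters for Flash/Turbo and $0.10 per 1K for Multilingual ([ElevenLabs](https://elevenlabs.io/pricing/api)). Cartesia is credit-based from a free tier upward ([Cartesia](https://cartesia.ai/pricing)). Confirm OpenAI's current rates on its [pricing page](https://platform.openai.com/docs/pricing).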

Do these APIs support streaming?

Yes. OpenAI, ElevenLabs, Grok, and Cartesia all support streaming audio so playback can start before the full clip is generated, which is what makes real-time voice agents feel responsive.

Should I route TTS through an AI gateway?

Usually not today. Gateways focus on text and embeddings, so call TTS providers directly and keep the audio path close to their streaming APIs. See the Vercel AI Gateway guide for the text side.