TL;DR
The Realtime API uses WebSockets for two-way voice interaction with function calling and stateful conversations. Here is how to set it up and build on it.
Most voice AI applications follow a multi-step pipeline: record audio, send it to a transcription API, pass the text to an LLM, send the response to a text-to-speech API, and play the audio back. Each hop adds latency. By the time the user hears a response, multiple round trips have happened.
OpenAI's Realtime API removes that overhead. It uses WebSockets to stream audio packets directly between your application and the model. As you speak, tiny audio chunks travel over the socket in real time. The moment you stop, the model already has the full payload and can begin responding. There is no transcription step, no separate TTS call. The model handles everything natively over a single persistent connection.
The result is voice interaction that feels conversational rather than transactional. And because the connection is stateful, the model remembers what was said earlier in the conversation without you manually managing chat history.
The key difference from the standard Chat Completions API is the transport layer. Instead of HTTP request/response pairs, the Realtime API maintains a WebSocket connection. Both sides can send messages at any time.
On the client side, your microphone captures audio and streams small packets to the server. On the server side, a relay process forwards those packets through the WebSocket to OpenAI. The model processes the audio, generates a response, and streams audio packets back. Your application plays them as they arrive.
This architecture means lower latency (no separate transcription or TTS hops), full-duplex communication (either side can send at any time), and conversation state that lives on the server for the duration of the session.
OpenAI provides a reference console application that demonstrates the full Realtime API feature set. Clone it to get started:
git clone https://github.com/openai/openai-realtime-console.git
cd openai-realtime-console
npm install
Create a .env file with your API key:
OPENAI_API_KEY=sk-your-api-key-here
REACT_APP_LOCAL_RELAY_SERVER_URL=http://localhost:8081
You need two processes running. The frontend serves the React app, and the relay server handles the WebSocket connection to OpenAI:
# Terminal 1 - Frontend
npm start
# Terminal 2 - Relay server
npm run relay
Once both are running, open the app in your browser and click "Connect." You should see client and server packet counters incrementing as you speak. Those numbers represent the audio chunks traveling back and forth over the WebSocket.
The relay server sits between your frontend and OpenAI's WebSocket endpoint. You need it because your API key must never reach the browser, and because server-side concerns - authentication, custom instructions, logging - belong between your users and OpenAI.
For local development, the relay runs on port 8081. In production, you would deploy this as a Node.js service behind your API.
The most powerful feature of the Realtime API is function calling. You can define tools the same way you would with Chat Completions, and the model will invoke them mid-conversation based on what the user says.
The console app ships with two example tools: set_memory and get_weather.
// Adding tools to the WebSocket connection
const tools = [
{
type: "function",
name: "set_memory",
description: "Saves important information that the user wants to remember.",
parameters: {
type: "object",
properties: {
key: {
type: "string",
description: "The label for this memory",
},
value: {
type: "string",
description: "The content to remember",
},
},
required: ["key", "value"],
},
},
{
type: "function",
name: "get_weather",
description: "Gets the current weather for a given location.",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "City name or location",
},
},
required: ["location"],
},
},
];
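In the console app, tool registration is wrapped by the client library, but over the raw WebSocket it is a `session.update` event. A minimal sketch of building that event (the event and field names follow the Realtime API; `tool_choice: "auto"` lets the model decide when to call a tool):

```typescript
// Sketch: register tools on an open Realtime session via a session.update event.
// The `tools` argument is the array defined above.
function buildSessionUpdate(tools: object[]) {
  return {
    type: "session.update",
    session: {
      tools,
      tool_choice: "auto", // model decides when a tool call is warranted
    },
  };
}

// Usage over an open WebSocket:
// ws.send(JSON.stringify(buildSessionUpdate(tools)));
```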
When you say "What's the weather in Toronto?", the model recognizes this matches the get_weather tool, extracts the location parameter, and invokes the function. Your code fetches the weather data and sends the result back through the WebSocket. The model then speaks the answer.
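Sending the result back is itself a WebSocket event. A sketch of that round trip, assuming the Realtime API's `function_call_output` item shape (`callId` echoes the id from the model's function-call event):

```typescript
// Sketch: package a tool result as a conversation item the model can consume.
function buildToolResult(callId: string, output: object) {
  return {
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId, // id from the model's function_call event
      output: JSON.stringify(output), // the output field is a JSON string
    },
  };
}

// After sending the result, ask the model to respond with it:
// ws.send(JSON.stringify(buildToolResult(callId, { temperature: 17.4 })));
// ws.send(JSON.stringify({ type: "response.create" }));
```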
The set_memory tool demonstrates persistent state. Say "Remember that I need to buy eggs tomorrow" and the model calls set_memory with the key and value. The stored data renders in the UI and remains available for the rest of the conversation.
Here is how the weather function works under the hood:
async function handleGetWeather(location: string) {
// Get coordinates from location name
const geoResponse = await fetch(
`https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(location)}&count=1`
);
const geoData = await geoResponse.json();
if (!geoData.results?.length) {
throw new Error(`No coordinates found for "${location}"`);
}
const { latitude, longitude } = geoData.results[0];
// Get weather data
const weatherResponse = await fetch(
`https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current_weather=true`
);
const weatherData = await weatherResponse.json();
return {
temperature: weatherData.current_weather.temperature,
windSpeed: weatherData.current_weather.windspeed,
unit: "celsius",
};
}
The pattern is identical to function calling in Chat Completions. Define the tool schema, handle the invocation, return structured data. The difference is the transport - everything happens over WebSockets instead of HTTP.
With the standard Chat Completions API, you manage conversation history yourself. Every request includes the full message array. The Realtime API handles this server-side. The model remembers everything said in the current session.
This means you can have exchanges like:
"What's the weather in New York?"
"The current temperature in New York is 17.4 degrees celsius..."
"How about Chicago?"
The model understands "how about" refers to weather because it has the conversation context. No message array management on your side.
One thing to watch: sometimes function call responses arrive after the model has already started speaking. This is inherent to the WebSocket architecture - the model may begin generating a response before the tool result comes back. If the tool call is slow, you might get an "I'm unable to retrieve that right now" response, followed by the actual data. A second query will work because the data is now in context.
The console app is a starting point. The real value is in the tools you add. Here is a complete example of a tool that fetches stock prices:
const stockTool = {
type: "function",
name: "get_stock_price",
description: "Gets the current stock price for a given ticker symbol.",
parameters: {
type: "object",
properties: {
ticker: {
type: "string",
description: "Stock ticker symbol (e.g., AAPL, GOOGL, MSFT)",
},
},
required: ["ticker"],
},
};
async function handleGetStockPrice(ticker: string) {
const response = await fetch(
`https://api.example.com/stocks/${ticker}/price`
);
const data = await response.json();
return {
ticker: data.symbol,
price: data.current_price,
change: data.daily_change,
currency: "USD",
};
}
The tool definition tells the model what the function does and what arguments it needs. When a user says "What's the Apple stock price?", the model matches that to get_stock_price, extracts AAPL as the ticker, and invokes the function. Your code fetches the data, returns it, and the model speaks the result.
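With more than one tool registered, a dispatch table keeps the invocation handling uniform. A sketch of that pattern (the handlers here are stubs standing in for the real fetch-based implementations above, so the example stands alone):

```typescript
type ToolHandler = (args: Record<string, any>) => Promise<object>;

// Map tool names to handlers; stubbed so the sketch is self-contained.
const toolHandlers: Record<string, ToolHandler> = {
  get_weather: async ({ location }) => ({ location, temperature: 17.4 }),
  get_stock_price: async ({ ticker }) => ({ ticker, price: 0, currency: "USD" }),
};

// The model delivers the tool name plus a JSON string of arguments.
async function dispatchToolCall(name: string, argsJson: string): Promise<object> {
  const handler = toolHandlers[name];
  if (!handler) {
    return { error: `No handler registered for tool: ${name}` };
  }
  return handler(JSON.parse(argsJson));
}
```

Returning an error object rather than throwing lets the model explain the failure in its spoken response instead of the connection surfacing an exception.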
More ideas with high utility:
Calendar integration - "Schedule a meeting with Sarah tomorrow at 2pm" triggers a tool that creates a Google Calendar event.
Database queries - "How many users signed up this week?" invokes a tool that runs a SQL query and returns the count.
Smart home control - "Turn off the living room lights" sends a command to your home automation API.
Document retrieval - "What does our refund policy say?" searches a vector database and returns relevant passages.
Each tool follows the same pattern: define the schema, handle the invocation, return structured data for the model to speak.
You can register as many tools as you need. The model selects the right one based on the user's spoken request. If no tool matches, it responds with natural conversation instead.
The WebSocket connection handles three types of messages:
Input audio - raw audio packets from the user's microphone, sent as they are captured. The console app uses navigator.mediaDevices.getUserMedia() to access the microphone and streams 24kHz PCM audio.
Output audio - audio packets from the model, played back through the Web Audio API. Packets arrive in small chunks, and the client buffers them for smooth playback.
Events - JSON messages for function calls, transcriptions, status updates, and conversation management. These are how tool calls get triggered and results get returned.
The console app's ConsolePage.tsx manages all three streams. The audio handling code is not trivial - it deals with buffer management, sample rate conversion, and playback synchronization. If you are building from scratch, starting from the console's audio utilities saves significant effort.
// Simplified audio capture flow
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 24000 });
const source = audioContext.createMediaStreamSource(stream);
// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet, but keeps this example simple
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (event) => {
const audioData = event.inputBuffer.getChannelData(0);
// Convert Float32Array to Int16Array for the API
const pcm16 = float32ToInt16(audioData);
// Send over WebSocket
ws.send(JSON.stringify({
type: "input_audio_buffer.append",
audio: arrayBufferToBase64(pcm16.buffer),
}));
};
source.connect(processor);
processor.connect(audioContext.destination);
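The snippet above assumes two helpers that the API's audio format requires. Minimal versions might look like this (clamping before conversion avoids integer overflow on loud samples):

```typescript
// Convert Web Audio's Float32 samples (-1..1) to 16-bit PCM for the API.
function float32ToInt16(float32: Float32Array): Int16Array {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to avoid overflow
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}

// Base64-encode a buffer for the JSON event payload.
function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```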
The model detects speech boundaries automatically. When you stop talking, it recognizes the pause, commits the audio buffer, and begins generating a response. You do not need to implement voice activity detection yourself.
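The detection behavior is tunable. A sketch assuming the Realtime API's server-side VAD settings (field names follow the API's `server_vad` turn detection; the values here are illustrative):

```typescript
// Sketch: tune server-side voice activity detection via session.update.
function buildTurnDetection(silenceMs: number) {
  return {
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        threshold: 0.5, // sensitivity of speech detection
        prefix_padding_ms: 300, // audio retained before detected speech
        silence_duration_ms: silenceMs, // pause length that ends a turn
      },
    },
  };
}

// A longer silence window makes the model less likely to cut the user off:
// ws.send(JSON.stringify(buildTurnDetection(700)));
```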
The console app runs locally with your API key exposed to the relay server. For production, you need the key held strictly server-side, user authentication in front of the relay, and rate or usage limits to control spend.
The relay server architecture already gives you the right separation. Your frontend connects to your relay. Your relay connects to OpenAI. Authentication and billing logic live in the relay layer.
The Realtime API is not just "ChatGPT but with voice." The WebSocket transport fundamentally changes what is possible.
For developers building voice assistants, customer support bots, accessibility tools, or any application where natural conversation matters, the Realtime API is the most capable option available right now. The WebSocket architecture means you can build interactions that feel like talking to a person rather than dictating commands to a machine.