TL;DR
The Realtime API uses WebSockets for two-way voice interaction with function calling and stateful conversations. Here is how to set it up and build on it.
Most voice AI applications follow a multi-step pipeline: record audio, send it to a transcription API, pass the text to an LLM, send the response to a text-to-speech API, and play the audio back. Each hop adds latency. By the time the user hears a response, multiple round trips have happened.
OpenAI's Realtime API removes that overhead. It uses WebSockets to stream audio packets directly between your application and the model. As you speak, tiny audio chunks travel over the socket in real time. The moment you stop, the model already has the full payload and can begin responding. There is no transcription step, no separate TTS call. The model handles everything natively over a single persistent connection.
The result is voice interaction that feels conversational rather than transactional. And because the connection is stateful, the model remembers what was said earlier in the conversation without you manually managing chat history.
The key difference from the standard Chat Completions API is the transport layer. Instead of HTTP request/response pairs, the Realtime API maintains a WebSocket connection. Both sides can send messages at any time.
On the client side, your microphone captures audio and streams small packets to the server. On the server side, a relay process forwards those packets through the WebSocket to OpenAI. The model processes the audio, generates a response, and streams audio packets back. Your application plays them as they arrive.
This architecture means lower latency (no separate transcription or TTS hops), full-duplex communication (either side can send at any time), and conversation state that lives on the server for the duration of the session.
OpenAI provides a reference console application that demonstrates the full Realtime API feature set. Clone it to get started:
git clone https://github.com/openai/openai-realtime-console.git
cd openai-realtime-console
npm install
Create a .env file with your API key:
OPENAI_API_KEY=sk-your-api-key-here
REACT_APP_LOCAL_RELAY_SERVER_URL=http://localhost:8081
You need two processes running. The frontend serves the React app, and the relay server handles the WebSocket connection to OpenAI:
# Terminal 1 - Frontend
npm start
# Terminal 2 - Relay server
npm run relay
Once both are running, open the app in your browser and click "Connect." You should see client and server packet counters incrementing as you speak. Those numbers represent the audio chunks traveling back and forth over the WebSocket.
The relay server sits between your frontend and OpenAI's WebSocket endpoint. You need it because your API key must never reach the browser, and because server-side concerns - authentication, custom instructions, logging - belong between your users and OpenAI.
For local development, the relay runs on port 8081. In production, you would deploy this as a Node.js service behind your API.
The most powerful feature of the Realtime API is function calling. You can define tools the same way you would with Chat Completions, and the model will invoke them mid-conversation based on what the user says.
The console app ships with two example tools: set_memory and get_weather.
// Adding tools to the WebSocket connection
const tools = [
{
type: "function",
name: "set_memory",
description: "Saves important information that the user wants to remember.",
parameters: {
type: "object",
properties: {
key: {
type: "string",
description: "The label for this memory",
},
value: {
type: "string",
description: "The content to remember",
},
},
required: ["key", "value"],
},
},
{
type: "function",
name: "get_weather",
description: "Gets the current weather for a given location.",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "City name or location",
},
},
required: ["location"],
},
},
];
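In the console app, tool registration is wrapped by the client library, but over the raw WebSocket it is a `session.update` event. A minimal sketch of building that event (the event and field names follow the Realtime API; `tool_choice: "auto"` lets the model decide when to call a tool):

```typescript
// Sketch: register tools on an open Realtime session via a session.update event.
// The `tools` argument is the array defined above.
function buildSessionUpdate(tools: object[]) {
  return {
    type: "session.update",
    session: {
      tools,
      tool_choice: "auto", // model decides when a tool call is warranted
    },
  };
}

// Usage over an open WebSocket:
// ws.send(JSON.stringify(buildSessionUpdate(tools)));
```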
When you say "What's the weather in Toronto?", the model recognizes this matches the get_weather tool, extracts the location parameter, and invokes the function. Your code fetches the weather data and sends the result back through the WebSocket. The model then speaks the answer.
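Sending the result back is itself a WebSocket event. A sketch of that round trip, assuming the Realtime API's `function_call_output` item shape (`callId` echoes the id from the model's function-call event):

```typescript
// Sketch: package a tool result as a conversation item the model can consume.
function buildToolResult(callId: string, output: object) {
  return {
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId, // id from the model's function_call event
      output: JSON.stringify(output), // the output field is a JSON string
    },
  };
}

// After sending the result, ask the model to respond with it:
// ws.send(JSON.stringify(buildToolResult(callId, { temperature: 17.4 })));
// ws.send(JSON.stringify({ type: "response.create" }));
```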
The set_memory tool demonstrates persistent state. Say "Remember that I need to buy eggs tomorrow" and the model calls set_memory with the key and value. The stored data renders in the UI and remains available for the rest of the conversation.
Here is how the weather function works under the hood:
async function handleGetWeather(location: string) {
// Get coordinates from location name
const geoResponse = await fetch(
`https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(location)}&count=1`
);
const geoData = await geoResponse.json();
if (!geoData.results?.length) {
throw new Error(`No coordinates found for "${location}"`);
}
const { latitude, longitude } = geoData.results[0];
// Get weather data
const weatherResponse = await fetch(
`https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current_weather=true`
);
const weatherData = await weatherResponse.json();
return {
temperature: weatherData.current_weather.temperature,
windSpeed: weatherData.current_weather.windspeed,
unit: "celsius",
};
}
The pattern is identical to function calling in Chat Completions. Define the tool schema, handle the invocation, return structured data. The difference is the transport - everything happens over WebSockets instead of HTTP.
With the standard Chat Completions API, you manage conversation history yourself. Every request includes the full message array. The Realtime API handles this server-side. The model remembers everything said in the current session.
This means you can have exchanges like:
"What's the weather in New York?"
"The current temperature in New York is 17.4 degrees celsius..."
"How about Chicago?"
The model understands "how about" refers to weather because it has the conversation context. No message array management on your side.
One thing to watch: sometimes function call responses arrive after the model has already started speaking. This is inherent to the WebSocket architecture - the model may begin generating a response before the tool result comes back. If the tool call is slow, you might get an "I'm unable to retrieve that right now" response, followed by the actual data. A second query will work because the data is now in context.
The console app is a starting point. The real value is in the tools you add. Here is a complete example of a tool that fetches stock prices:
const stockTool = {
type: "function",
name: "get_stock_price",
description: "Gets the current stock price for a given ticker symbol.",
parameters: {
type: "object",
properties: {
ticker: {
type: "string",
description: "Stock ticker symbol (e.g., AAPL, GOOGL, MSFT)",
},
},
required: ["ticker"],
},
};
async function handleGetStockPrice(ticker: string) {
const response = await fetch(
`https://api.example.com/stocks/${ticker}/price`
);
const data = await response.json();
return {
ticker: data.symbol,
price: data.current_price,
change: data.daily_change,
currency: "USD",
};
}
The tool definition tells the model what the function does and what arguments it needs. When a user says "What's the Apple stock price?", the model matches that to get_stock_price, extracts AAPL as the ticker, and invokes the function. Your code fetches the data, returns it, and the model speaks the result.
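With more than one tool registered, a dispatch table keeps the invocation handling uniform. A sketch of that pattern (the handlers here are stubs standing in for the real fetch-based implementations above, so the example stands alone):

```typescript
type ToolHandler = (args: Record<string, any>) => Promise<object>;

// Map tool names to handlers; stubbed so the sketch is self-contained.
const toolHandlers: Record<string, ToolHandler> = {
  get_weather: async ({ location }) => ({ location, temperature: 17.4 }),
  get_stock_price: async ({ ticker }) => ({ ticker, price: 0, currency: "USD" }),
};

// The model delivers the tool name plus a JSON string of arguments.
async function dispatchToolCall(name: string, argsJson: string): Promise<object> {
  const handler = toolHandlers[name];
  if (!handler) {
    return { error: `No handler registered for tool: ${name}` };
  }
  return handler(JSON.parse(argsJson));
}
```

Returning an error object rather than throwing lets the model explain the failure in its spoken response instead of the connection surfacing an exception.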
More ideas with high utility:
Calendar integration - "Schedule a meeting with Sarah tomorrow at 2pm" triggers a tool that creates a Google Calendar event.
Database queries - "How many users signed up this week?" invokes a tool that runs a SQL query and returns the count.
Smart home control - "Turn off the living room lights" sends a command to your home automation API.
Document retrieval - "What does our refund policy say?" searches a vector database and returns relevant passages.
Each tool follows the same pattern: define the schema, handle the invocation, return structured data for the model to speak.
You can register as many tools as you need. The model selects the right one based on the user's spoken request. If no tool matches, it responds with natural conversation instead.
The WebSocket connection handles three types of messages:
Input audio - raw audio packets from the user's microphone, sent as they are captured. The console app uses navigator.mediaDevices.getUserMedia() to access the microphone and streams 24kHz PCM audio.
Output audio - audio packets from the model, played back through the Web Audio API. Packets arrive in small chunks, and the client buffers them for smooth playback.
Events - JSON messages for function calls, transcriptions, status updates, and conversation management. These are how tool calls get triggered and results get returned.
The console app's ConsolePage.tsx manages all three streams. The audio handling code is not trivial - it deals with buffer management, sample rate conversion, and playback synchronization. If you are building from scratch, starting from the console's audio utilities saves significant effort.
// Simplified audio capture flow
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 24000 });
const source = audioContext.createMediaStreamSource(stream);
// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet, but keeps this example simple
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (event) => {
const audioData = event.inputBuffer.getChannelData(0);
// Convert Float32Array to Int16Array for the API
const pcm16 = float32ToInt16(audioData);
// Send over WebSocket
ws.send(JSON.stringify({
type: "input_audio_buffer.append",
audio: arrayBufferToBase64(pcm16.buffer),
}));
};
source.connect(processor);
processor.connect(audioContext.destination);
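The snippet above assumes two helpers that the API's audio format requires. Minimal versions might look like this (clamping before conversion avoids integer overflow on loud samples):

```typescript
// Convert Web Audio's Float32 samples (-1..1) to 16-bit PCM for the API.
function float32ToInt16(float32: Float32Array): Int16Array {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to avoid overflow
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}

// Base64-encode a buffer for the JSON event payload.
function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```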
The model detects speech boundaries automatically. When you stop talking, it recognizes the pause, commits the audio buffer, and begins generating a response. You do not need to implement voice activity detection yourself.
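The detection behavior is tunable. A sketch assuming the Realtime API's server-side VAD settings (field names follow the API's `server_vad` turn detection; the values here are illustrative):

```typescript
// Sketch: tune server-side voice activity detection via session.update.
function buildTurnDetection(silenceMs: number) {
  return {
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        threshold: 0.5, // sensitivity of speech detection
        prefix_padding_ms: 300, // audio retained before detected speech
        silence_duration_ms: silenceMs, // pause length that ends a turn
      },
    },
  };
}

// A longer silence window makes the model less likely to cut the user off:
// ws.send(JSON.stringify(buildTurnDetection(700)));
```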
The console app runs locally with your API key exposed to the relay server. For production, you need the key held strictly server-side, user authentication in front of the relay, and rate or usage limits to control spend.
The relay server architecture already gives you the right separation. Your frontend connects to your relay. Your relay connects to OpenAI. Authentication and billing logic live in the relay layer.
The Realtime API is not just "ChatGPT but with voice." The WebSocket transport fundamentally changes what is possible.
For developers building voice assistants, customer support bots, accessibility tools, or any application where natural conversation matters, the Realtime API is the most capable option available right now. The WebSocket architecture means you can build interactions that feel like talking to a person rather than dictating commands to a machine.