Add voice to your AI app with Gladia
Gladia is a fast, accurate speech-to-text API with realtime streaming and speaker diarization built in.
Prerequisites
- Gladia account and API key
- Node 20+ or Python
- A short audio file or a working microphone
Step-by-Step
- 1
Get your API key
Sign up at gladia.io and grab the key from the dashboard. Free tier covers most prototyping.
- 2
Transcribe a file
Submit the URL of an audio file, then poll for the result. Pre-recorded transcription returns the full text along with timestamps and confidence scores.
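For readers on the Python side, the submit-and-poll flow might look like the sketch below. It reuses the endpoint and `x-gladia-key` header from the curl request in this step; the `result_url` and `status` field names, the helper names, and the polling interval are assumptions for illustration, so check them against the actual API response before relying on them.

```python
import json
import time
import urllib.request

API_URL = "https://api.gladia.io/v2/transcription"  # endpoint from the curl example


def http_json(url, api_key, payload=None):
    """Send a JSON request (POST if payload is given, else GET) and decode the reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        headers={"x-gladia-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def transcribe(audio_url, api_key, request=http_json, interval=2.0, max_polls=150):
    """Submit audio_url, then poll until the job finishes.

    Assumed response shape (not confirmed by this guide): the POST returns
    {"result_url": ...} and the polled document carries a "status" field
    that ends up as "done" or "error".
    """
    job = request(API_URL, api_key, {"audio_url": audio_url})
    for _ in range(max_polls):
        result = request(job["result_url"], api_key)
        if result.get("status") == "done":
            return result
        if result.get("status") == "error":
            raise RuntimeError(f"transcription failed: {result}")
        time.sleep(interval)
    raise TimeoutError("transcription did not finish in time")
```

The `request` parameter exists only so the HTTP layer can be swapped out in tests; in normal use, call `transcribe(url, os.environ["GLADIA_API_KEY"])` with `os` imported.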
# Double quotes around the key header so the shell expands $GLADIA_API_KEY
curl -X POST https://api.gladia.io/v2/transcription \
  -H "x-gladia-key: $GLADIA_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"audio_url": "https://example.com/recording.mp3"}'
- 3
Enable diarization
Diarization labels each segment with a speaker ID. Indispensable for meeting recordings and podcasts.
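Once segments carry speaker IDs, a common next step is to merge consecutive segments from the same speaker into readable turns. The sketch below assumes each segment is a dict with `speaker`, `start`, `end`, and `text` keys; those names are illustrative and may differ from the actual response.

```python
def speaker_turns(utterances):
    """Merge consecutive utterances from the same speaker into one turn."""
    turns = []
    for u in utterances:
        if turns and turns[-1]["speaker"] == u["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + u["text"]
            turns[-1]["end"] = u["end"]
        else:
            turns.append(dict(u))  # copy so the input is left untouched
    return turns


# Hypothetical sample data in the assumed shape.
sample = [
    {"speaker": 0, "start": 0.0, "end": 1.2, "text": "Hi, thanks for joining."},
    {"speaker": 0, "start": 1.3, "end": 2.0, "text": "Shall we start?"},
    {"speaker": 1, "start": 2.1, "end": 2.8, "text": "Yes, go ahead."},
]
for turn in speaker_turns(sample):
    print(f"Speaker {turn['speaker']} [{turn['start']:.1f}-{turn['end']:.1f}s]: {turn['text']}")
```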
{
  "audio_url": "...",
  "diarization": true,
  "diarization_config": {
    "number_of_speakers": 2
  }
}
- 4
Stream from a microphone
Open a WebSocket, push 16 kHz PCM frames, and receive partial and final transcripts. Latency is under 300 ms.
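Before audio can be pushed over the socket, it has to be cut into frames. A minimal sketch of that framing step, assuming 16-bit mono PCM and a 100 ms frame length (both are assumptions, not values stated by this guide):

```python
SAMPLE_RATE = 16_000   # Hz, the rate this step calls for
BYTES_PER_SAMPLE = 2   # assumption: 16-bit mono PCM


def pcm_frames(pcm, frame_ms=100):
    """Yield consecutive frame_ms-long chunks of raw PCM bytes.

    The last chunk may be shorter if the audio does not divide evenly.
    """
    frame_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * frame_ms // 1000
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]
```

At these settings each frame is 3,200 bytes, so one second of audio produces ten frames. Shorter frames lower latency at the cost of more messages.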
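import WebSocket from 'ws'; // the `ws` npm package provides .on() and the headers option

const key = process.env.GLADIA_API_KEY;
const ws = new WebSocket('wss://api.gladia.io/audio/text/audio-transcription', {
  headers: { 'x-gladia-key': key },
});

// Send the session configuration as the first message once the socket is open.
ws.on('open', () =>
  ws.send(JSON.stringify({ x_gladia_key: key, language: 'en', encoding: 'wav' }))
);

// Log every transcript message as parsed JSON.
ws.on('message', (d) => console.log(JSON.parse(d.toString())));
- 5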
Use the gladia CLI
For local files, the gladia CLI handles upload + polling in one command.
gladia recording.mp4 -o transcript.md --diarization
- 6
Pipe into your app
Take the transcript JSON, feed segments into a summarizer or RAG ingest pipeline, and you have searchable audio.
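One way to prepare segments for a summarizer or RAG ingest is to group them into size-bounded chunks that keep their time range, so search hits can link back to a position in the audio. The sketch below assumes segments shaped as dicts with `start`, `end`, and `text`; the field names and the 500-character budget are illustrative.

```python
def chunk_segments(segments, max_chars=500):
    """Group consecutive segments into chunks of roughly max_chars characters.

    Each chunk keeps the start time of its first segment and the end time of
    its last one. A single segment longer than max_chars is kept whole.
    """
    chunks, current = [], None
    for seg in segments:
        # Flush the current chunk if adding this segment would exceed the budget.
        if current and len(current["text"]) + 1 + len(seg["text"]) > max_chars:
            chunks.append(current)
            current = None
        if current is None:
            current = {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        else:
            current["text"] += " " + seg["text"]
            current["end"] = seg["end"]
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be embedded or summarized on its own, with `start` and `end` stored as metadata.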
Common Pitfalls
- Wrong sample rate (44.1 kHz instead of 16 kHz) tanks accuracy.
- Skipping language hints when the spoken language is known.
- Forgetting to close the WebSocket: you keep paying for the open session.
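The sample-rate pitfall can be caught before any audio is sent. A minimal check for WAV input using Python's standard `wave` module is sketched below; the helper names are hypothetical, and other containers such as MP3 would need a different tool.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, the rate the streaming step expects


def wav_sample_rate(source):
    """Return the sample rate of a WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def needs_resampling(source):
    """True if the WAV file is not already at EXPECTED_RATE."""
    return wav_sample_rate(source) != EXPECTED_RATE
```

If `needs_resampling("recording.wav")` returns `True`, resample the audio to 16 kHz before streaming it.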
What's Next
- Pair with ElevenLabs for full STT+TTS conversational agents.
- Add a summarization step to convert transcripts into action items.
