
OpenAI Enhances Speech Models: New Text-to-Speech & Speech-to-Text Innovations

In today's video, we delve into OpenAI's latest release of three new audio models. Discover the enhanced speech-to-text models that outperform Whisper, and a new text-to-speech model allowing precise control over timing and emotion. Learn how to try these models for free on OpenAI's interface, designed with a distinctive, practical look by Teenage Engineering. Explore various voice types, personality settings, and pronunciation controls. We also compare the new models, GPT-4o Transcribe and GPT-4o Mini Transcribe, against other state-of-the-art models. The video provides cost details and a simple guide to getting started with these models using Python, JavaScript, or cURL in the OpenAI API. Additionally, insights into logging, tracing, and example setups in the OpenAI Agents SDK are shared. Don't miss out on the future of AI voice applications!

Links:
- https://www.openai.fm/
- https://www.youtube.com/watch?v=lXb0L16ISAc
- https://platform.openai.com/playground/tts
- https://platform.openai.com/docs/guides/audio
- https://platform.openai.com/docs/guides/speech-to-text
- https://platform.openai.com/docs/guides/text-to-speech
- https://platform.openai.com/docs/api-reference/introduction
- https://github.com/openai/openai-agents-python/tree/main/examples

Chapters:
- 00:00 Introduction to OpenAI's New Audio Models
- 00:16 Exploring the Interface and Features
- 01:01 Demonstration of Text-to-Speech Capabilities
- 02:21 New Speech-to-Text Models and Their Performance
- 03:18 Getting Started with OpenAI's API
- 04:21 Using OpenAI Agents SDK
- 05:15 Conclusion and Final Thoughts
---
type: transcript
date: 2025-03-20
youtube_id: 7MWBkdzeyJ4
---

# Transcript: OpenAI GPT-4o Speech Models in 6 Minutes

Just today, OpenAI released three new audio models: two speech-to-text models that are considerably better than Whisper, and a new text-to-speech model that gives you the ability to control both the timing and the emotion, not just what to say but how you want it to be said. The first thing to note: if you want to try all of this out, you can do so for free right now at openai.fm.

One aside on the interface: it looks like it was designed by Teenage Engineering, a really great firm that has developed a ton of cool devices over the years. It has a very distinctive look and feel, and it's really practical too. You have a number of examples of the different voice types, and you also have the vibe as well as the script. Effectively, how this new text-to-speech model works is that you have control over the timing and the emotion. The vibe is similar to a system message: you can define the personality, the tone, and the pronunciation, and once you've defined those aspects, you can pass in the script of whatever you'd like it to generate. I'm going to play a sample from a number of the different examples they have within the interface, and I'll also put links to everything I'm showing you in the description if you're interested in checking any of it out.

"The stars tremble before my genius. The rift is open, the energy surging, unstable, perhaps dangerous. Most certainly. Captain Ryland's hands twitch over the controls. Fools, they hesitate. But I, I alone see the future. Engage the thrusters, I bellow."

"Well now, partner, you've made it to tech support. Let's see if we can't get you fixed up. If your internet's giving you trouble, press one and we'll get it back in line. Need help with billing or account details? Press two and we'll sort it out."

"All right team, let's bring the energy, time to move, sweat, and feel amazing. We're starting with a dynamic warm-up, so roll those shoulders, stretch it out, and get that body ready. Now into our first round: squats, lunges, and high knees. Keep that core tight, push through, you got this. Halfway there, stay strong, breathe, focus, and keep that momentum going."

In addition to the new text-to-speech model, they also released GPT-4o Transcribe as well as GPT-4o Mini Transcribe. In the chart here, what we have are the word error rates across a number of different languages, and the lower the error rate, the better the model. This chart compares the latest state-of-the-art models against the previous generation, Whisper large-v2 as well as Whisper large-v3. A similar thing here: this is how gpt-4o-transcribe and gpt-4o-mini-transcribe compare against Gemini 2.0 Flash and Scribe, as well as Nova-2 and Nova-3, some of the other non-OpenAI models on the market right now.

In terms of the cost of these models, pricing is broken out between text tokens and audio tokens, but roughly: the GPT-4o Mini TTS model is going to cost about a cent and a half per minute, gpt-4o-transcribe about six tenths of a cent per minute, and gpt-4o-mini-transcribe about a third of a cent per minute.

Next, hopping back to the example: arguably the easiest way to get started is to go to openai.fm and grab the Python, JavaScript, or cURL script. It's super straightforward: you initialize the OpenAI client, specify your input, specify your instructions, and from there you can generate whatever that voice might be and finally play the audio. It has the ability to both stream audio in and stream audio out, so overall how these are structured within the API seems quite robust.
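The flow just described, initialize the client, pass input and instructions, generate, then play back or transcribe, can be sketched roughly as below. This is a hedged sketch, not the video's exact script: it assumes the `openai` Python package is installed and `OPENAI_API_KEY` is set, and the helper names, voice choice (`coral`), and file paths are illustrative.

```python
# Sketch of the text-to-speech and speech-to-text flows described above.
# Assumes the `openai` Python package and an OPENAI_API_KEY environment
# variable; helper names, voice, and file paths are illustrative.

def build_instructions(personality: str, tone: str, pronunciation: str) -> str:
    """Compose the 'vibe' text: how the model should speak, like a system message."""
    return f"Personality: {personality}. Tone: {tone}. Pronunciation: {pronunciation}."

def synthesize(script: str, instructions: str, out_path: str = "speech.mp3") -> None:
    """Text to speech: `input` is what to say, `instructions` is how to say it."""
    from openai import OpenAI  # lazy import so the pure helper works without the SDK
    client = OpenAI()
    # Stream the generated audio straight to a file.
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input=script,
        instructions=instructions,
    ) as response:
        response.stream_to_file(out_path)

def transcribe(path: str, model: str = "gpt-4o-transcribe") -> str:
    """Speech to text with one of the new transcribe models."""
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as audio_file:
        return client.audio.transcriptions.create(model=model, file=audio_file).text
```

Used together, `synthesize("Well now, partner!", build_instructions("folksy support agent", "warm", "unhurried"))` writes `speech.mp3`, and `transcribe("speech.mp3")` sends it back through gpt-4o-transcribe.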
Additionally, I'll put the links to the documentation for both the new text-to-speech model and the speech-to-text models in the description, where you can read through some of the specifics if you're interested. And in addition to the openai.fm interface, you can also access this directly within the playground: select the gpt-4o-mini-tts model, put in your instructions for the inflection, pacing, whatever it might be, specify a voice, and select the output format, and you'll be able to try it right there in the playground.

They also added examples to the OpenAI Agents SDK, which they just released last week, and there is the ability to see all of the tracing within the OpenAI API dashboard as well. The nice thing is that if you are using the OpenAI Agents SDK, you'll be able to see all of the different pieces that are relevant to whatever your AI voice application is. To give you an idea of what the tracing looks like, you'll see a waterfall of the latencies, how long everything took, but in addition it will also log and store things like the audio files, so you'll be able to inspect and test the different pieces directly within the OpenAI dashboard.

To get started with the OpenAI Agents Python SDK, you can run `pip install 'openai-agents[voice]'`, and then, leveraging some of the examples they have in there, you'll be able to set up your voice agent in just a handful of lines of code.

Overall, that's pretty much it for this video. Kudos to the team over at OpenAI for this release. I definitely like having an option that isn't strictly WebRTC or WebSockets for leveraging these new voice models, and having the ability to define things like the personality, affect, tone, and pronunciation in addition to the text just makes working with these models that much easier. If you found this video useful, please comment, share, and subscribe. Otherwise, until the next one!
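As a recap of the Agents SDK piece, a voice agent can be sketched roughly as below. This follows the patterns in the openai-agents-python examples repo linked above, but it is a sketch under assumptions: the SDK installed via `pip install 'openai-agents[voice]'` plus numpy, an `OPENAI_API_KEY` set, and module paths that may change as the SDK evolves; the persona text and silent stand-in audio buffer are illustrative.

```python
# Rough voice-agent sketch based on the openai-agents voice examples.
# Assumes `pip install 'openai-agents[voice]'`, numpy, and OPENAI_API_KEY;
# module paths follow the examples repo and may change.
import asyncio

def build_agent_instructions(persona: str) -> str:
    # The same personality/tone idea from openai.fm works as agent instructions.
    return f"You are a helpful voice assistant. Persona: {persona}. Keep replies brief."

async def main() -> None:
    # Lazy imports so the pure helper above works without the SDK installed.
    import numpy as np
    from agents import Agent
    from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

    agent = Agent(
        name="Assistant",
        instructions=build_agent_instructions("friendly tech support"),
    )
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    # Stand-in input: 3 seconds of silence at 24 kHz; a real app would
    # capture microphone audio here.
    buffer = np.zeros(24_000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # The pipeline chains speech-to-text -> agent -> text-to-speech; each
    # run also shows up as a trace in the OpenAI dashboard.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # play or save event.data (PCM audio chunks)
```

Run it with `asyncio.run(main())`; the trace waterfall and stored audio described above then appear in the dashboard for that run.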