
In this video, discover how to build your own customized voice AI agents using TEN Agent, an open-source conversational AI platform. Learn to seamlessly integrate top speech-to-text models, large language models (LLMs), and text-to-speech models. Follow the step-by-step guide to set up your AI with OpenAI's Realtime API, Deepgram, Fish Audio, and Dify for voice AI applications. From initial setup and securing API keys to configuring agents and creating interactive applications like chatbots, this comprehensive tutorial provides all the tools and insights you need to develop robust voice AI assistants for various use cases, including smart home assistants and call centers.

Try TEN Agent here: https://agent.theten.ai ⬅️
Star the repo: https://github.com/TEN-framework/TEN-Agent ⭐

00:00 Introduction to TEN Agent
00:27 Key Features of TEN Agent
01:10 Voice Storyteller Demonstration
02:03 Setting Up TEN Agent
02:29 Why Use TEN Agent?
03:38 Getting Started: Prerequisites and Setup
06:39 Configuring Your AI Agent
07:51 Advanced Features and Extensions
10:52 Conclusion and Final Thoughts
---
type: transcript
date: 2025-02-21
youtube_id: YTvbYPTR3Z8
---

# Transcript: Build Your Own Voice AI Agent with TEN Agent: A Step-by-Step Guide

In this video I'm going to be showing you TEN Agent, a comprehensive conversational AI framework designed for real-time multimodal interactions. TEN Agent lets you deliver low-latency, easily interruptible voice responses, supports multiple languages and platforms including C++, Go, and Python, and comes with built-in real-time communication for both audio and video. TEN Agent is an open-source framework that you can use as the starting point to build out your AI agent applications.

Some of the key features built into the platform are its responsiveness, its low latency, and its ease of interruption, which make building AI agents a smoother experience. Just to give you an idea of what you'll be able to do with this: say you want a voice agent with the OpenAI Realtime API. You can easily connect and integrate basically all of the top speech-to-text models, LLMs, and text-to-speech models out there. You can go ahead and select one, and you can even choose your language depending on the setting you pick; in this case I'll just pick English. We'll connect here, and then I can begin to have a conversation. If I just say, "Hello, how are you?" it responds: "Hey, I'm TEN Agent. I can speak, see, and reason from a knowledge base."

Next I'm going to show you the Voice Storyteller with image generation. Effectively, it streams in your voice, and as soon as it detects your intention to generate a photo, it uses that as a function call, invokes it, and puts the result on the screen for you. You can imagine using this for a ton of different use cases where you talk to the computer and, based on what you say, it performs an action, updates the UI, or whatever it might be. If I just go ahead and connect here: "Hey, I'm Storyteller. I can tell a
story based on your imagination. Say hi to me." I'll say: generate a photo of a dragon for a children's story. "In a magical forest, a friendly dragon with sparkling scales and big gentle eyes smiled warmly. Vibrant flowers bloomed around him, and tall, whimsical trees swayed in the breeze. What do you think this dragon's name should be?"

In this video I'm going to show you how to set up TEN Agent and begin building out your own voice AI agents that you can deploy and use for whatever use case you have in mind. First off, head over to the repository, and I always encourage you, for open-source projects, to star the repo. If you're on the GitHub page, you can also access the link I just showed you right there. There's also a Discord community; if you have any questions, you can join it via the link on the GitHub repository.

Why use TEN Agent? Well, if you've tried to integrate something like the Realtime API from OpenAI, or any number of these speech-to-text or text-to-speech models, you'll know there are a lot of different pieces you have to set up to make everything work seamlessly. Effectively, TEN Agent abstracts away a lot of those harder pieces of building out voice AI agents. You can use this for just about any application you're thinking of: a voice AI assistant you talk back and forth with, an emotional companion (we're starting to see those use cases crop up), an agent that leverages computer use, and the list goes on. You could use it to learn a new language, as a smart home assistant, for something like a call center, or whatever it might be. The main thing to consider is that the framework is completely free and open source: you can pull this down and integrate essentially whatever models you want.
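The storyteller demo hinges on function calling: the model decides when your words amount to an image request and emits a tool call instead of plain text. Here is a minimal sketch of that dispatch pattern in plain Python; the names and the keyword-based stub LLM are purely illustrative, not TEN Agent's actual API.

```python
# Toy function-call dispatch: a (stubbed) LLM returns either plain text or a
# tool call, and the app routes tool calls to registered handlers. In TEN
# Agent this wiring is handled for you by the framework.

def generate_image(prompt: str) -> str:
    """Stand-in for a real image-generation call."""
    return f"<image: {prompt}>"

TOOLS = {"generate_image": generate_image}

def fake_llm(user_text: str) -> dict:
    """Stub LLM: detects image intent with a naive keyword check."""
    lowered = user_text.lower()
    if "generate a photo" in lowered:
        prompt = lowered.split("generate a photo of", 1)[-1].strip()
        return {"type": "tool_call", "name": "generate_image",
                "arguments": {"prompt": prompt}}
    return {"type": "text", "content": "Once upon a time..."}

def handle_turn(user_text: str) -> str:
    msg = fake_llm(user_text)
    if msg["type"] == "tool_call":
        return TOOLS[msg["name"]](**msg["arguments"])
    return msg["content"]

print(handle_turn("Generate a photo of a dragon"))  # <image: a dragon>
print(handle_turn("Tell me a story"))               # Once upon a time...
```

In the real system the "tool result" would be rendered in the UI (the image on screen) while the voice response keeps streaming.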
You get pre-built modules for speech-to-text, LLMs, and text-to-speech to quickly enable these conversational experiences and build out your AI agents.

There are some prerequisites for getting started. The great thing is that just about everything you're going to use gives you a number of free minutes or credits per month, and like I mentioned, you can swap some of these services out for whatever you want to use. I'm going to quickly show you how to get all of the different keys and pull this down locally on your machine.

The first one we're going to set up is Agora. As soon as you sign up, you'll get 10,000 free minutes per month. If I go over to project management, create a new project, and call it "ten agent", I can select whatever use case fits. With testing mode you could just grab the App ID, but in this case we're going to use secured mode with an API key and token. We'll go ahead and submit that. Once we've done that, we can go over to the Configure tab, and in just a moment we'll grab the App ID as well as the App Certificate. Next, we can head on over to
platform.openai.com and open API Keys. We can go ahead and generate our API key; in this case I'll just call it "ten agent". We'll create the key and keep it up on screen; we'll grab it in just a moment. Next, we're going to log into Deepgram, and again you'll get some free credits as well. Once you're in Deepgram, you can create your API key just like that. Finally, we're going to grab our API key from Fish Audio: click your account, go to the API tab, open API Keys, and create a new secret.

Once you have those keys, make sure you have Docker installed and a recent version of Node.js, at least Node version 18 or higher. The system requirements are pretty humble, so hopefully you'll be able to run this on your machine. If you have a Mac with one of Apple's silicon chips, open the Docker desktop app and uncheck "Use Rosetta for x86_64/amd64 emulation".

Next, we're going to pull down the repository. You can use whatever IDE you like; I'm going to use Cursor in this example, but VS Code or Vim work too. I'll just git clone the repository and then follow the next steps in the documentation. We can copy the environment variables file over; inside this .env file is where we plug in all of the values we grabbed in the previous steps. Look for each of the different values and plug in your Agora credentials, your OpenAI API key, and finally your Deepgram API key. Once you've put in all your environment variables, we can run docker compose up. This will take a moment to pull everything down.
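The environment file ends up looking something like the fragment below. The variable names here are assumptions for illustration; check the `.env.example` in the repo for the exact names your version of TEN Agent expects.

```env
# Illustrative .env values -- confirm exact variable names against the
# repo's .env.example before using.
AGORA_APP_ID=your_agora_app_id
AGORA_APP_CERTIFICATE=your_agora_app_certificate
OPENAI_API_KEY=sk-...
DEEPGRAM_API_KEY=your_deepgram_key
FISH_AUDIO_TTS_KEY=your_fish_audio_key
```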
Once everything is running, we can go inside the container and run task use. Once the agent has been built, run task run, and now the server is running at the specified port. Once that's done, you can access the playground at port 3000 and start configuring your agent.

If I go to port 3000, I can allow microphone access (it detects my microphone every time I visit), and within here we can begin to configure our agent. For instance, I can select the graph we want to use. If I select the voice assistant, I can go ahead and connect; we can see the agent is connected, and I can say "hello world". Now we can see the agent is actually working: as I'm speaking it transcribes everything I'm saying, and as soon as I stop it responds.

Just to touch on the interface, because this one is slightly different from the hosted interface: within here you can see the different graphs you can choose from. Additionally, you can select the different extensions. For instance, we have Agora set up, and for speech-to-text we have Deepgram. If you want to configure the language, the model, or the sample rate (to change the quality up or down), you can do all of that in here. You can also change things like the greeting and the max tokens, and you can override the model string you set in the environment variables if you'd like. You can configure your text-to-speech here as well: we're using Fish Audio, we have the model ID, and we can configure some of its options too.

Now, the really neat thing about extensions is that this is where you can build your voice AI agent for just about any application.
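Pulling the last two steps together, the local setup looks roughly like this. The container name and task commands follow the repo's documentation at the time of recording and may differ in your checkout, so treat this as a sketch and defer to the README.

```shell
# Sketch of the setup flow; verify commands against the TEN-Agent README.
git clone https://github.com/TEN-framework/TEN-Agent.git
cd TEN-Agent
cp ./.env.example ./.env            # then fill in your API keys
docker compose up -d                # pull images and start the containers
docker exec -it ten_agent_dev bash  # enter the dev container
task use                            # build the agent
task run                            # start the server
# Playground: http://localhost:3000
```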
Within here, there's the example of the weather API tool in Python. If you want to add different extensions, go into the project, then into packages, then extensions, and you'll see all of the pre-built extensions. There are a ton of them; say you want a voice application that can search the internet, you could use something like the Bing search tool.

Next, if I select a voice assistant, I can see the different speech-to-text models we have access to; in this case we're using Deepgram speech-to-text. We can also swap out the large language model: you can use Bedrock, Gemini, Coze, or Dify as well. The great thing with some of these options, if you aren't familiar with Dify for instance, is that it gives you a GUI where you can build out agent workflows that perform whatever functionality you like and return the result for you. That makes it really easy to build AI agents, because you can do it all in a visual interface. Additionally, we have it set up with Fish Audio here, but you can swap that out for ElevenLabs, Cartesia, Polly, CosyVoice, MiniMax, or Azure text-to-speech: basically whatever services you want to use.

The thing to note is that these services are constantly changing; we see new releases from model providers, speech-to-text providers, and large language models all the time. Having a platform that's both flexible and robust gives you a good basis for building these applications, and it's really up to you how you want to configure it.
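To make the extension idea concrete, here is a loose sketch of the shape a tool-style extension takes, modeled on the weather-tool example. The class and registry names are hypothetical; the real extension API (base classes and lifecycle hooks) is defined in the TEN-Agent repo's packages directory.

```python
# Illustrative tool-extension shape. Real TEN extensions subclass framework
# base classes and are discovered from the packages/extensions directory;
# everything here is a simplified stand-in.

class WeatherTool:
    name = "get_weather"
    description = "Look up current weather for a city"
    parameters = {"city": "string"}

    def run(self, city: str) -> dict:
        # A real extension would call a weather API here; we return a stub.
        return {"city": city, "temp_c": 21, "condition": "clear"}

class ToolRegistry:
    """Maps tool names to handlers so the LLM's function calls can be routed."""
    def __init__(self):
        self._tools = {}

    def register(self, tool):
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs) -> dict:
        return self._tools[name].run(**kwargs)

registry = ToolRegistry()
registry.register(WeatherTool())
print(registry.call("get_weather", city="Berlin"))
```

The point of the pattern is that adding a capability (search, weather, image generation) is just registering another tool; the conversation loop doesn't change.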
The other amazing thing is the whole application layer: you can go in and edit all of it. If you just search for the different components, you'll easily find them within the project. Effectively, you can think of this template as a starting point and change it to be whatever you'd like. It could be something where you generate a children's story: every night, for instance, you could be reading a story to your child, have images pop up, and make it an interactive experience that reads the story back to you and generates pictures along the way. There are a bunch of really neat applications here.

Another thing I found really impressive is that they set this up so you can stream in both your camera and your screen. You can select the screen you want to share, and if you have a model like the Gemini models that supports streaming video, you can do that all within this environment as well. Honestly, if you haven't used that feature in Google's AI Studio before, it's very cool and a really great way to learn, especially if you have questions about programs or websites.

Without a doubt, TEN Agent is probably the best place to get started with building a voice AI agent. There are a ton of options; it doesn't restrict you to one LLM provider or one speech-to-text provider, and you can basically use whichever providers you want to build out your AI agent.

Finally, I want to give you a quick visualization of what I demonstrated. This is the architecture of what I showed: on the front end we had the playground, which would be the application layer of whatever you're building. If you're building a user app, it will probably largely fit within the playground. What I
showed you primarily is the frontend component: the web app, which contains the UI layer I showed you. Depending on the request coming from the UI layer, if we're updating the graphs or setting things like the configuration, it sends to this port, where the server internally writes to our property.json file. Alternatively, if we're sending requests where we're actually talking and conversing with the AI agent, it sends to our web server, and the web server is what interacts with the agent architecture. As you saw, that can be set up with a variety of different combinations of speech-to-text, LLMs, text-to-speech, or those real-time APIs. In a nutshell, this is the framework you can build on top of: you can take the Docker image layer within it and ultimately send those requests to the backend architecture that's already set up. It's probably one of the easiest ways to get started with building a voice AI assistant.

That's pretty much it for this video. If you found it useful, please like, comment, share, and subscribe. Until the next one!
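As a recap of the configuration flow described above: the UI's graph and settings changes get written into property.json, which defines which speech-to-text, LLM, and text-to-speech extensions are wired together. The fragment below is only an illustrative sketch of that idea with made-up addon names; the actual schema is defined by the framework, so consult the property.json in the repo.

```json
{
  "predefined_graphs": [
    {
      "name": "voice_assistant",
      "nodes": [
        {"type": "extension", "name": "stt", "addon": "deepgram_asr"},
        {"type": "extension", "name": "llm", "addon": "openai_chatgpt"},
        {"type": "extension", "name": "tts", "addon": "fish_audio_tts"}
      ],
      "connections": ["stt -> llm", "llm -> tts"]
    }
  ]
}
```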