
Repo: https://git.new/ai-pin

Building an AI Assistant Similar to the Humane AI Pin and the Rabbit R1, with Advanced Functionality, from Scratch

This video details the process of creating an AI assistant with capabilities similar to those we are starting to see in devices such as the Humane AI Pin and the Rabbit R1. It highlights the Whisper capabilities now available on Groq. The app handles transcription and LLM responses by leveraging the new whisper-large-v3 on Groq or whisper-1 on OpenAI. It can also handle complex queries such as weather updates, time inquiries, Spotify requests, and photo recognition, with responses from various AI models including GPT-4, Fal AI's LLaVA, and Llama 3. The project also has an option to pull results from the internet and supports OpenAI's text-to-speech (TTS) models. I give an overview of setting up the front end and back end, incorporating various AI models, dealing with bottlenecks in response times, and optimizing the experience. The project also supports LangChain's LangSmith for observability, the Serper API for internet search functionality, and Upstash Redis for rate limiting. The demonstration includes setting up the project environment, configuring API keys, and walking through the codebase, which is built with Next.js.

00:00 Introduction to Building AI Devices
01:12 Breakdown and Analysis
05:12 Diving Into the Coding and Development Process
09:05 Backend Logic and Functionality Explained
19:47 Frontend Development Insights
28:20 Final Thoughts and Future Directions
---
type: transcript
date: 2024-05-07
youtube_id: CXDFGyO2FUI
---

# Transcript: Build an AI Device with Groq, Llama 3, OpenAI, TTS, Whisper, Vision, Vercel AI SDK & Next.js

All right, in this video we're going to do a technical deep dive on how to build out something analogous to the Humane AI Pin or that Rabbit R1 device that just came out. "What was the biggest song from 1972?" "Show me the weather in Vancouver, Canada right now." "What time is it right now?" "How much was GitHub acquired for?" "How much was YouTube acquired for?" "How tall is the Eiffel Tower?" "What's 9,000 divided by 10?" "What's 4,000 divided by 53?" "Tell me something I probably don't know." "The shortest war in history was between Britain and Zanzibar on August 27, 1896, and lasted only 38 minutes." "What is the Rabbit R1?" "The Rabbit R1 is an AI-powered device that can perform various tasks, including answering questions, taking photos, and placing calls, among others." "What is this image?" Before I dive into the coding portion of this, I want to give a quick overview of why I decided to build this out. Before I do that, I want to play this quick clip from MKBHD, or Marques Brownlee: "Look and tell me what this is. Or I'll just do this, I guess. Ah, it's a Cybertruck." "Photo is of a Cybertruck, an electric pickup truck produced by Tesla." So if we break down this clip a little bit: he's holding down the pin, and it's recording at that point. Once he releases the pin, it's going to send that photo as well as that audio to an endpoint. From that point, it's going to transcribe what he said. Once it's transcribed, that request is going to be sent, with the payload of the image, to something like, I would assume, GPT-4 — and the reason I'm assuming GPT-4 is because Humane has a collaboration with OpenAI. When I first tested this, shortly after the video came out, I found that the biggest bottleneck was that vision endpoint from OpenAI. As you can see here, it takes almost 15 seconds to process this simple response, just to get something similar to the response that
Marques had in the video. One thing that I do have to say is that since the time that video was released, this endpoint does seem to be pretty quick. But at times — if you think about it, there's a ton of different requests going to this GPT-4 Vision server. When Humane releases all these devices into the wild and everyone's all of a sudden taking pictures of everything, there's going to be a queue to handle all of those different messages, presumably, right? That was the initial bottleneck that I ran into when these Humane Pins were rolling out. Ever since then, the inference speed from that endpoint is actually not too bad — it still takes a handful of seconds, but it's definitely not the 14 or 15 seconds I saw at first. When I ran into that bottleneck with GPT-4 Vision, I decided: okay, what other models are out there that are able to process vision as well as text? My first thought was the LLaVA model. LLaVA is an open-source model. I reached out to the team over at Artificial Analysis and asked them what the fastest inference speed for a vision model is. They said that they're not tracking it quite yet, but they did point me in the direction of Fal AI. Fal AI offers a ton of different models that you can go ahead and try out. They have a really great website; you can sign up and get some free credits. I've also included their example in this project, so you'll be able to choose between GPT-4V and Fal. The one difference when I was using the LLaVA model versus the GPT-4 Vision model is that the GPT-4 Vision model knew that this was a Cybertruck, whereas when I tested it with the LLaVA model, it would often come back with responses like "this is a futuristic metallic truck" or something to that effect. So there's a little bit of a trade-off: you might be able to get faster inference speed from something like LLaVA, but if you want more accurate results, you might be bound to using something like GPT-4V, or the Opus vision model, or something like Gemini. The other thing that I'm going to be showing you is LangSmith, which is an
observability suite from LangChain. This is a product that you can use — there's a free tier as well — and what it allows you to do is track and have these quote-unquote "traceables" in your application. You can really wrap any function with a traceable: if you're using LangChain, you get all of this functionality automatically, and if you're not using LangChain, you can go ahead and wrap your function and pass in the information that you want to send over to LangSmith. As you can see here, there's a ton of helpful information. My main motivation for using LangSmith was really to try and iron out all of the different latency portions of the application — being able to see how long the TTS takes, how long the image generation takes, how long the transcription takes, and so on and so forth. I really encourage you to check it out, but it is optional, so you don't need to use it if you don't want to. Diving into the code portion: we're going to be using Next.js for the front end and the back end. Within the configuration here, you're going to be able to use Groq or OpenAI, and you'll be able to use all of the different inference models that they have available. Right now we're going to be using Groq, and we're also going to be using Llama 3 8B. You'll also be able to use Whisper on Groq or on OpenAI — I want to thank the Groq team for giving me access to test the Groq Whisper endpoint, which is incredibly fast and which I encourage you to check out. From there, I just have support for OpenAI for TTS right now, but you could swap this portion out for something like ElevenLabs or another provider if you like. Then, for the vision setting, we're going to have support for OpenAI or Fal AI, like I mentioned. Then you're going to be able to pass in the different model for the different provider here, and you're going to be able to provide the function calling provider. Right now, for the function calling
capability, I have OpenAI and GPT-3.5 Turbo; shortly after this video I plan on incorporating Groq as well, so if you want to use them, it'll be able to just be swapped out right here. Then we also have some booleans that you can set for toggling things within the UI: whether to show the response times, whether to show that little cog in the bottom left of the screen. Say in the application you don't want the user to be able to use the speech UI toggle or the internet — you can just turn all these different things off within here. It gives you easy flexibility: if you want to turn off that speech capability, or if you want to turn off the internet results — say you don't want to reach for an API key from Serper, which is what I have set up for the internet results — you can just go ahead and turn them off. And then for the photo: if you don't want an application that's leveraging something like GPT-4V, you can turn that off here as well, and that will remove the functionality from the UI. Then, once we have that set up, we also have optional rate limiting support from Upstash Redis, and then we have the boolean toggle for whether you want to use LangSmith for the tracking of all of those different functions that you saw just a moment ago. So first, the project is going to be within the app directory, just like you see here. The main components for the front end are going to be within the page as well as the components folder, and then we also have the tools folder — these are the conditionally rendered UI components that we get back from the function calling. From there, we're going to be using the Vercel AI SDK and React Server Components, and all of our logic is going to be routed through this actions.
tsx. From there, we're also going to be importing a handful of things from a number of files that we're going to be leveraging within the actions, and we're also going to be incorporating a handful of tools — this is where we set up things like Spotify and time. For the back end, it's going to be the actions.tsx, and then we're also going to have our utilities folder. Within here, we have a handful of things that we're going to import, which we'll be diving into, plus that optional rate limiting package that you can use. Since we're using the Vercel AI SDK as well as React server actions, the way we're going to set this up is with actions.tsx as the entry point and the main portion — you can almost think of it as where you would typically have a POST request. From there, we have logic built out in all of these different components within the utils. Then, the chat completion with tools, which is where we set up all of that logic for the function calling, lives within the tools folder as well as the chat-completions-with-tools file: you'll see all of the different definitions that we pass to the LLM, as well as how to handle the logic to actually invoke those functions once we get the arguments back. At a high level, I'm going to be going over this relatively quickly, but if you have any other questions, please feel free to reach out and let me know, or open an issue on GitHub. I don't think I mentioned this yet, but you will be able to just pull this down from GitHub — feel free to fork it, use it however you want, incorporate it into your project or your company; do what you've got to do. Hopefully you enjoy it, and hopefully you learn something. I'm going to go over our back end: we're going to be importing a handful of things, and we're going to be setting up the env, so if you're not using Bun, you
can just go ahead and include the env setup to reach for our environment variables. We're going to set up the logic for Upstash Redis, which is right here. This is how you can set up rate limiting: say you only want users to have five turns on this application — you can go ahead and sign up for a free account on Upstash Redis, plug in your environment variables, and then you'll be able to plug in the interval here. So if you want 250 requests per user per hour, you can put that in there; more often than not, you'd probably want this a little bit lower, especially on a free tier. From there, we're going to be importing a number of different utilities, which we'll be diving into in just a little bit, but the way this works is we're going to be creating this streamable value. First, we check whether the request needs to be rate limited, and then we get the FormData object — this is going to be those toggles as well as the audio blob that we're getting from the front end, and then, optionally, the image if it exists. Here we're just going to essentially destructure the different portions of the form data, and then we have a little bit of error handling just to make sure that we have an audio blob — since this is an audio-first project, we have to make sure that we actually have that. The first thing we do is transcribe the audio. I'm going to dive into all of these different utility functions in just a moment, but just at a high level, first we'll go through all of this. Once we've transcribed the audio, we send that transcription to the front end and render it on the left-hand side of the screen. Then we set up a variable where we start to build out our response text. If it's using a photo, it's going to go through this path. One thing that I'm
going to mention here is that the photo capability isn't tied to the internet capability. I decided to do that because I didn't think that having a photo as well as the internet necessarily made sense — let me know your thoughts in the comments if you think there could be a use case for that. I found that in the example of the Cybertruck, right, if you're holding that and asking what it is, it doesn't necessarily need to do both an internet search and trigger that inference from the vision model. But there could be some use cases, right? Say you hold that button and ask, "How much is the Cybertruck today?" — then it could go ahead and query that. That would obviously take a little bit longer in terms of response time, but you can definitely tweak this a little bit and include that within this logic if you'd like. Within here, we're basically going to detect whether it's using Fal AI or whether it's using OpenAI — we could even make this a little bit more strict. If it's using OpenAI, what we're saying here is: okay, if that toggle is set to use photos, we're going to make sure that there actually is a photo; otherwise we just respond that you've forgotten to upload a photo. From there, we decide: is it using Fal? Otherwise, just use GPT-4V. From there, it's just going to take that image as well as the transcription of what the user said, wait for the response, and then send that back to the user. But if we're not using that, we go down a slightly different path. We're going to have the result: if you're using the internet, we have this answer-engine utility, where we pass that transcription to a search engine API and do a quick request on all of the different results for that query. If not, we generate a quick chat completion. This is going to be where
we're leveraging that Llama 3 8B model from Groq. From there, we're going to see whether we need to invoke any tools. If there's a query like you saw in the example — show me the time, or play this certain song, or what's the top song in 1995, or whatever it is — we'll be able to break down all of the different function calling capabilities that we have within here. Then you'll also have to set that up within the chat completion with tools. From here, we set up the function calling capabilities, so this is where it detects whether there's an intention: whether the user wants to see weather data, Spotify data, or the time. We have the logic here within the chat completion with tools, and it's also within this tools folder. We have these three different examples — Spotify, time, and weather — and those are the tools that you can build out over time; I'll add in some more tools here that you can go ahead and play around with. If no tools are detected, you can just quickly respond back with the message. Then, from there, we stream that result back — that text that we were building up here, we just send back to the client — and then we send back the result; this is what shows on the palm. And then from there, we say: okay, if it's using TTS, we go ahead and update and send back the audio, calling that generateTTS function. The way I set this up is that it sends back the result first, then goes ahead and generates the audio, so at least you see the text on the palm and you're not waiting — it's not blocked by waiting for that TTS response. Then you just have to make sure that you're calling streamable.done() — that's just how the Vercel AI SDK is set up. If you have errors or long response times, it's going to be showing
you that within the console. And then from there, we're just going to be setting up our initial AI state as well as createAI — this is how we wire up the AI SDK to make it interact with the front end. The different functions are pretty straightforward, but the one thing that I want to call out within them is how I set up the traceables. If I just go into a simple one here — for the generateTTS, I'm importing traceable from LangSmith, and all you have to do to leverage a traceable is wrap your function with it; then you can pass in the name of it as well as a bunch of other options. This is just a really quick example: I'm really just passing in the name of the functions, and I'll be able to see the response time of all of that, as well as what's being returned, in that LangSmith interface that you saw. As for all of these different functions — I'm not going to be diving into every single piece of them, but they're pretty self-explanatory. For transcribeAudio, we detect whether it's using OpenAI or Groq, then send that file to the endpoint, wait for it, and the transcription is returned. Similar thing with processImage: we send that request with the image to GPT-4V — or, if we're using Fal, a similar thing — and essentially just wait for the text response back. With the answer engine, we rephrase the question, then send it to the search engine API, and once we have that response back, that's what we pass within the application, which ends up acting a little bit like a RAG application. We're just getting the top results and the summary — this isn't an advanced search; we're not scraping different web pages here, we're just
getting that high-level overview of what Serper provides. At that top-level request object, we essentially JSON-stringify everything within it and then send that within the payload to the LLM. So once we have that, we have a simple chat completion at the top, and then we also have that chat completion with tools. I do want to take a quick moment and go over the generate-chat-completion-with-tools. Essentially, how this works is that you set up all of the different tool calls within here. So here we have the weather, we have search song, we have get time, and what this will do is take that query — if we head back to the actions.tsx, you'll see within the response text here that we do have that initial response. I briefly want to touch on one of the unique things that I set up within the application. I have a generateChatCompletion, which is just going to be querying that Llama 3 8B model — that comes back really quickly — and then we also have the chat completion with tools. The reason I wanted to break this out is that I wanted each of these to be specialized in exactly what it's intended to do. With the first one, the plain chat completion, all I'm asking for is a quick response back; I'm not asking it for all of the different function invocations that we pass in with our tool calls — I'm leveraging a separate model for that. My rationale is that as I start to scale out the number of functions that I pass into the application, I really want that chat-completion-with-tools invocation to be focused solely on the function calling capability. Right now I'm passing in three functions to be invoked; you could imagine a situation where maybe
you're passing in 12 or 15 functions. If you start to try and pass in all of that, and also hope to get a good-quality chat completion from that same model, you might run into some potential issues. I really just wanted to separate it out so there are these sort of two experts, you can think of it as: one is really good at just responding quickly with a quick piece of text in a sentence or two, and one responds back with the chat completion with tools, focused on the function calling capability. You can dig into some of the backend utilities here and have a look; you can add some in if you just follow the structure. If you wanted to add a new one, you could go within here, set up another condition for a different tool, add it within the chat completion with tools, and then add the tool within the import at the top here and add it below. That's just one way you can set it up — I might do another video in the future on how to layer in other tools, so if you want to add your own, you can see how to do that within how the project is already set up. That's pretty much it for the back end, at least the portion that I'm covering in this video. Check out the GitHub repository, and if you have any questions or any suggestions on how you think you could improve it, open an issue or open a pull request. To go into our API keys just for a moment: if you just want simple transcription plus the text response, you'll be able to deliver that with Groq as well as OpenAI, so I'd recommend at least acquiring those two API keys. In terms of the other ones: if you want to use the search engine functionality, you can get a free Serper API key; if you want to use tracing from LangChain and LangSmith, you can go ahead and get the different keys and values that you need from there; if you want to use rate limiting, similar thing — you can make an account on Upstash, go
on their free tier, and get a free API key. Then, optionally for Spotify, you can include that if you want to use the Spotify component like I showed at the start of the video, and if you want to use the Fal AI LLaVA image model, you'll just be able to plug that in here. I wanted to set it up in a way that's really flexible: if you want to use OpenAI, you can use OpenAI; if you want to use Groq, you can use Groq; and you can swap in the different pieces as you see fit. All right, so to quickly go over our front end: essentially, what we're setting up is some of the usual suspects from the React package, and we're also using the Vercel AI SDK like I mentioned. One thing to note on the Vercel AI SDK: if you're using this within a fresh Next.js application, just make sure that you also set it up within the layout — here you see that I'm importing AI from the actions file, and then I'm wrapping the application within AI. So, back to our page here: we're importing a number of different modules that we made ourselves. We have a screen just to show that it's not mobile-friendly quite yet. We show a number of different components if they're conditionally rendered, such as a clock, that Spotify widget, as well as the weather. We have an attribution component, which is just going to say "this is using Upstash" or LangChain or what have you, if you'd like to include it. Then, from there, we have our input component — this is going to be really largely the first portion of our application. So, to set up the application: the first thing, if you're using a fresh Next.js create-next-app, is to make sure that you import the AI from the actions that we declared in the back end and wrap your application with this AI component. We're going to have most of the application just on this one page here, but the first thing that I actually want to go through is the input component; then we'll hop
back to the page component, because I think the input component is largely the bigger part of the application — if you understand this, you're going to understand 50% of the front end. The first thing that we do within the input component is use something called a dropzone. This is how we handle those optional images: if you want to pass them to something like GPT-4V or Fal AI's LLaVA, that's what we leverage to actually drag them onto the screen and upload them. From here, we're passing down some props: the onSubmit, whether to use TTS, whether to use the internet, whether to use photos, etc. First, we declare how to set up that dropzone — we just specify the different file types that the endpoint accepts. I have it set up for what GPT-4V accepts; you could tweak these for Fal if you'd like. From there, we just set up a simple button to remove the image: if you're done with the image, you can remove it and put on a different one. Then, from there, we set up the handleRecording. We're going to be using the MediaRecorder within the browser, and essentially what we're doing is concatenating that stream of audio coming in from the microphone. Once the recording is done — once we've lifted our finger off of that button — we declare that it has stopped, and we go ahead and submit that data to our action. I'm using FormData here; at this point, we're sending through the toggles and their values — those boolean options: are we using TTS, are we using the internet, etc. — and then we send all of that to the back end. Once that's all set up, we send that audio blob to the back end. From there, we set up the form object, we're going to be appending that audio blob,
we're going to be including those boolean toggles on whether we're using TTS, internet, or photos — and if we're using photos, we optionally include that image — and then we just send that data to our action on the back end. Then here, we just have that simple fun AI pin. You can really swap this out for whatever you'd like, or just change the styling — this is just an example: you can take it, you can fork it, you can do whatever you want with it, right? From there, this is where we actually drag the image onto the screen, and the rest of it is pretty self-explanatory: we have the remove-image button, and that's pretty much it for the input component. To quickly hop through some of the other components: we just have a simple modal that's going to render on mobile and show you that, hey, mobile isn't yet supported — positioning the hand and all of that is not something I've gotten around to; I haven't had time to do that yet. From there, the settings component is also very simple. What we're doing here is looking at that configuration object and then saying: okay, enable the different toggles that we have set up here — whether it's the text-to-speech toggle, the internet toggle, or the photo toggle (and I see a little typo here; I'll fix it after the fact). We just render those, and that's what ends up passed to the parent component and then subsequently sent on that request within the input component. Here we just have a simple attribution component, if you want to include that within your application just to point out all of the different great services that you're using. As for our tools, they're all really simple. We have a clock component: it's going to show that second hand every second — that's what this interval is doing — and then we render the value and pass it in continuously within the clock. From there, we're
going to have a simple iframe button for that Spotify component, and a very similar thing with the weather. The weather component is very straightforward: we have live temperature information. We do have mock icons for the weather — those aren't actually real, whether it's sunny or cloudy or what have you; you could swap those out if you want to put in a different API key. My motivation for putting in these initial tools was that these things are free: the clock is free; the Spotify endpoint is free — you can just make a free Spotify account, get a developer API key, and plug that in; and the weather endpoint is also free — the temperatures and all of that are going to be dynamic. Now, within the main portion of the page, after we go through some of the logic, we're setting up a number of different hooks, and I also have some things commented out that you can comment back in if you'd like to use them — so if you want the response times and all of that, like a total response time of everything within the application, that's something that I'm still working on. Now, within the page — I'm going to run through this pretty quickly — we have the hooks for a number of different things: the boolean toggles, the transcript, the messages, whether it's mobile or not, etc. We have the handleClick — actually clicking on that image — and we set the toggles for things like the TTS toggle, so whether the internet's on or the TTS is on. We obviously have one for the handleSubmit. So then, for our handleSubmit, this is how we actually trigger that action that we have set up on the back end, and then this is the main portion of how we watch for all of the different responses. Here we declare: okay, if the rate limit is reached, or if the UI component comes back as time, or if the transcription comes back — this is essentially how we route all those different requests within our application to
view within the various states that we had previously. This is how, essentially, we write all of the different messages within the front end of the application and within the state of our application. You can play around with this: if you have new components that you want to layer in there, just make sure that you also include them within this streamable value, with the current architecture and how it's set up. From there, we declare all of the different response times — we have different ones just so we can track things like how long the transcription takes, how long the message takes, how long the total response time is, and all of that. Then we have a simple useEffect for checking for a mobile device. You can tweak this as you like: if you want this to work on more of a tablet view, you could bump the breakpoint up a little bit so it doesn't start to break as much. This is just checking whether it's mobile, to render that mobile view that we'll have within the JSX here — if it's mobile, we render that component. Then, from there, we have a simple icon — I didn't use an icon library; I suppose I probably could have. We render the AI pin within here. I also have a class on it just to make sure that the pin doesn't actually move — I found that when you're holding it, had I not added that class and the CSS styles, it would move around a little bit. Then, from there, we have our input component, which we had just gone through, and we have the current transcript, which renders right below the input component. Then, on the right-hand side, this is where we render that hand. We have a lot of things that are just positioned absolute — if you wanted to make a more serious application, I'd obviously encourage you to remove all of that and set it up in a way that makes sense with your UI, but this is just a fun example,
right? So if it's Spotify, we render that embedded Spotify button. The thing that's unique with the Spotify button is that I set it up with its own state, so it will stay there continually, even with additional requests — I thought it'd be an interesting experiment to throw it up there, in case you want to play music while you're asking other things within the application. Then, from there, we put the messages on the right-hand side of the hand, and if things come back like the weather or the time, we render them in place of the message as well. From there, we have the option to turn off that settings toggle if we'd like, and then we also have that optional attribution component, if you want to show attribution to all of the different services that you're using. That's pretty much it for this video. If you'd like to see something like a Gumroad course or a Udemy course, those are all things that I'm potentially considering. For instance, I think with this video I could probably break it out into 20 different segments, right? It could probably span over the course of hours, instead of the sub-one-hour, sub-30-minutes format that I usually go through in my videos, where there are often complex demos that I like to demonstrate. I'm considering different options on what to pursue next — if you have any suggestions on what you'd like to see, leave them in the comments below. Otherwise, that's it for this one. If you found this video useful, please comment, share, and subscribe — and otherwise, until the next one.
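To make the rate-limiting discussion from the walkthrough concrete: the project delegates this to Upstash Redis, but the bookkeeping it buys can be sketched as a fixed-window counter in plain TypeScript. Everything below — class and method names included — is illustrative, not the repo's actual code; Upstash's hosted Redis does the same accounting, but shared across serverless invocations.

```typescript
// Minimal fixed-window rate limiter illustrating the Upstash setup's
// behavior. One window per user key; the count resets when the
// interval elapses. All identifiers here are hypothetical.
type Window = { count: number; resetAt: number };

class FixedWindowLimiter {
  private windows = new Map<string, Window>();
  constructor(private limit: number, private intervalMs: number) {}

  // Returns true if the request is allowed for this user/IP key.
  check(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now >= w.resetAt) {
      // New window: first request always passes.
      this.windows.set(key, { count: 1, resetAt: now + this.intervalMs });
      return true;
    }
    if (w.count >= this.limit) return false; // over the limit: rate limited
    w.count += 1;
    return true;
  }
}

// e.g. the "250 requests per user per hour" example from the video
const limiter = new FixedWindowLimiter(250, 60 * 60 * 1000);
```

In the real project this check runs first in the server action, before transcription, so a rate-limited user costs no model inference at all.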
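The back end described above leans on the Vercel AI SDK's streamable value: the action pushes partial updates (transcription first, then the text result, then the TTS audio) and finally calls done() so the client knows the stream is finished. A minimal plain-TypeScript approximation of that pattern — not the SDK's actual implementation, and with illustrative names:

```typescript
// Toy streamable value: the producer (the server action) calls update()
// and done(); the consumer (the page) subscribes and reacts to each
// partial value as it arrives.
type Listener<T> = (value: T, isDone: boolean) => void;

function createStreamable<T>() {
  const listeners: Listener<T>[] = [];
  let closed = false;
  return {
    update(value: T) {
      if (closed) throw new Error("stream already closed");
      listeners.forEach((fn) => fn(value, false));
    },
    done(value: T) {
      // Mirrors streamable.done(): final value, then no more updates.
      closed = true;
      listeners.forEach((fn) => fn(value, true));
    },
    subscribe(fn: Listener<T>) {
      listeners.push(fn);
    },
  };
}
```

The action's ordering from the video maps onto this directly: update with the transcription (render it immediately), update with the text result (show it on the palm), and only then done with the TTS audio, so speech synthesis never blocks the visible response.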
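LangSmith's traceable, as used in the video, wraps any function so its name and latency get reported to the LangSmith UI. Here is a toy synchronous stand-in that records spans locally instead of shipping them anywhere — a sketch of the wrapping pattern only, with hypothetical identifiers (the real traceable also handles async functions and nested runs):

```typescript
// Local stand-in for LangSmith's traceable(): wrap a function, record
// its name and wall-clock duration, pass the result through untouched.
type Span = { name: string; ms: number };
const spans: Span[] = [];

function traceableLocal<A extends unknown[], R>(
  fn: (...args: A) => R,
  opts: { name: string }
): (...args: A) => R {
  return (...args: A) => {
    const start = Date.now();
    const result = fn(...args); // call through to the wrapped function
    spans.push({ name: opts.name, ms: Date.now() - start });
    return result;
  };
}

// Usage mirrors the video's setup: wrap generateTTS, transcribeAudio,
// etc., passing just a name so each shows up as a timed span.
const double = traceableLocal((n: number) => n * 2, { name: "double" });
```

This is exactly the latency-hunting workflow described above: once every utility (TTS, transcription, vision) is wrapped, you can see at a glance which stage dominates the response time.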
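The tool-calling flow described above — three tools (weather, Spotify song search, time), with a plain-text fallback when no tool matches — can be dispatched roughly like this. The shapes and names are a sketch, not the repo's actual definitions in the tools folder:

```typescript
// Sketch of dispatching a tool call once the function-calling model
// returns a name plus arguments. Real handlers would hit a weather API
// and the Spotify search endpoint; these stubs just show the routing.
type ToolCall = { name: string; args: Record<string, string> };

const tools: Record<string, (args: Record<string, string>) => string> = {
  getTime: () => new Date().toISOString(),
  getWeather: (args) => `weather:${args.city ?? "unknown"}`, // stub for a weather API call
  searchSong: (args) => `spotify:${args.query ?? ""}`,       // stub for Spotify search
};

function invokeTool(call: ToolCall): string {
  const tool = tools[call.name];
  // Mirrors the video's fallback: no tool detected, answer with text.
  if (!tool) return "no matching tool; respond with plain text";
  return tool(call.args);
}
```

Adding a new tool under this structure is exactly the process the video describes: register a handler here, add its definition to the chat completion with tools, and render its UI component conditionally on the front end.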
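Finally, the input component's submission step assembles a FormData payload: the recorded audio blob, the boolean toggles, and an optional image. A sketch of that assembly with hypothetical field names (not the repo's exact keys); Blob and FormData are globals in browsers and in Node 18+:

```typescript
// Build the payload the input component sends to the server action:
// audio blob, boolean toggles as strings, optional image when the
// photo toggle is on. Field names here are illustrative.
function buildPayload(
  audio: Blob,
  toggles: { useTTS: boolean; useInternet: boolean; usePhotos: boolean },
  image?: Blob
): FormData {
  const form = new FormData();
  form.append("audio", audio, "recording.webm");
  // FormData only carries strings/blobs, so booleans are stringified
  // and re-parsed on the server.
  form.append("useTTS", String(toggles.useTTS));
  form.append("useInternet", String(toggles.useInternet));
  form.append("usePhotos", String(toggles.usePhotos));
  if (toggles.usePhotos && image) form.append("image", image, "photo.jpg");
  return form;
}
```

In the actual flow, this runs in the MediaRecorder's stop handler — once the user lifts their finger off the pin, the concatenated audio chunks become the blob passed in here.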