
In this video, I dive into Moshi, a new open-source model from a research lab in Paris called Kyutai. This innovative model competes with GPT-4o's voice capabilities and is the first speech-in, speech-out open-source model. Developed in just six months by a team of eight, Moshi supports 70 different emotions and styles, and allows for low-latency interaction.

Link: https://kyutai.org

00:00 Introduction to Moshi: The Open Source GPT-4 Competitor
00:26 Live Demonstration of Moshi's Capabilities
01:50 Key Features and Technical Details of Moshi
03:22 Training and Deployment Insights
05:38 Future Prospects and Final Thoughts
06:56 Conclusion and Call to Action
---
type: transcript
date: 2024-07-04
youtube_id: xkjGsAIKACQ
---

# Transcript: Moshi: The Open-Source Answer to GPT-4's Voice Model?

We might have our first soon-to-be open-source competitor to the GPT-4o voice capability that was demonstrated about two months ago. A research lab out of Paris, Kyutai, has released this model along with an interface they're calling Moshi, and moshi.chat is the website to gain access to it. Before I dive in, I'll show you exactly what this is. You can plug in your email here and join the queue, and as soon as I click Join Queue it will very quickly begin the interaction. I'll role-play a little bit to demonstrate what it does.

"So, could you tell me a little bit about yourself?" "I'm a neural network with a few billion parameters, which is a type of artificial intelligence. I've been trained on vast amounts of data, making me quite knowledgeable and capable of answering questions." "Oh no, we have to be really quiet. Could you whisper all your responses back to me?" "Okay, let's whisper, let's whisper from here on out." "So tell me more about yourself in detail." "I'm Moshi, a neural network with a few billion parameters. I've been trained on vast amounts of data, making me quite knowledgeable and capable of answering questions." "Okay, could you respond back in maybe a French accent?" "Certainly, what would you like to know?" "Could you tell me a little bit about, let's say, a trip to New York City? Could you give me some ideas on things I could potentially do there?" "Certainly. New York City is a bustling metropolis with many things to do and see. You can visit the Statue of Liberty, take a walk in Central Park, or visit the Museum of Modern Art."

So as you saw there, that's just a really quick example. You can give it a thumbs up if you had a good interaction, or a thumbs down if you didn't, and then you can disconnect to pause it, and then you can
continue it by just clicking the Start Over button. You also have the ability to download the video as well as the audio.

What's interesting with this demo is that it's the first soon-to-be open-source model that is speech-in and speech-out, and you can see the latency; it's almost too fast. They have a keynote on YouTube, which I'll link in the description of the video, so click through and see the demonstrations. The impressive thing, which they talked about in the keynote, is that this is their first research prototype and they built it from scratch in six months with only eight people. The model can support 70 different emotions and styles, like whispering and accents, and you saw a few of those in the demo I just did, where I could ask it to whisper and it was able to whisper back. You can ask it for particular accents, and it can potentially respond if it has that type of accent.

There are a number of exciting things with this model. First, it's shipping: you saw that you can interact with it today, before OpenAI has even shipped GPT-4o's voice mode. Mind you, GPT-4o is going to be delivered to tens of millions of people once it's released. Some of the details I found in a really good Twitter post, which I'll also link in the description of the video. Moshi is a speech-in and speech-out model: it constantly streams out the generated text tokens as well as the audio codec tokens, and the voice is also tunable. You can see that the latency is incredibly short. I inspected the web app, and it's using WebSockets to communicate back and forth; I'd imagine they'll eventually use something like WebRTC for that kind of interaction within an app. Some of the interesting pieces on the training and the RLHF: the model is fine-tuned on 100,000 transcripts that were generated by the Helium model itself, all of which were highly detailed and heavily annotated with emotion and style. The text-to-speech engine is further
fine-tuned with 20 hours of audio recorded by Alice and licensed, and the model can be fine-tuned with less than 30 minutes of audio. In terms of the training, this was trained on 1,000 H100 GPUs, and the model is going to be capable of being deployed and hosted on Scaleway and Hugging Face. It can scale down with both 8-bit and 4-bit quantization, and the backend works on CUDA, Metal, and CPUs.

In terms of the release itself, there's the web app that you can go ahead and interact with, but soon there will be a technical report as well as the open models. The open model release will also include the inference codebase, the 7-billion-parameter model, the audio codec, and the full optimized stack. There's going to be a ton that we'll be able to learn as they begin to release this. In terms of the license, it looks like it's going to be as permissive as possible, but that's TBD, and when they'll actually release the model for all of us to use and deploy on our own infrastructure is also still TBD.

In just the past couple of weeks we've seen some really interesting and compelling competitors come into the spotlight that are beginning to challenge OpenAI in a number of ways. Just a couple of weeks ago we saw Claude 3.
5 Sonnet, which was released and, for all intents and purposes, became one of the most powerful models overnight; it also shipped with that interesting Artifacts feature that I think a lot of us have played around with. Then just recently we had Runway ML release their Sora competitor, and now we see this sort of Scarlett Johansson type of voice that was demonstrated with the GPT-4o demonstration.

They do have a press release that they put out today in tandem with the announcement, so I'll go through it a little bit. The line that really stood out to me is: "Moshi can be installed locally and therefore run safely on an unconnected device." Just imagine what I demonstrated to you, if you were able to have that built into your laptop and have a conversation just like that. You could imagine, maybe instead of a Siri device or a Google Home or what have you, being able to interact with something like this at low latency. The last thing I want to mention here is that the code and weights of the model will soon be freely shared. They're coming out of the gate basically saying, "Hey, look, we're just a handful of people; look what we can do in just a number of months," and they're going to release this widely to the world. I think Mistral did an incredible job: they gave us a ton of really powerful models with permissive licenses, like the Mistral 7B model, and all of a sudden you had a ton of people across the world able to run these models locally on their devices. I think this could be a similar approach or playbook for Kyutai. Imagine if you could have something like this on your laptop and be able to interact in real time, have a conversation, interact with your applications or what have you, and build on top of it. There's a ton of really interesting stuff here, but it's a really incredible release showing what just eight people can do in a number of months. I really commend
the team over at Kyutai, and I hope they have a ton of success from this demo. I hope they become maybe the Mistral of these multimodal models; potentially, who knows. I just wanted to do a quick one to point you to the app, so try it out and let me know your thoughts in the comments below. That's it for this one. If you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one.
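Earlier I mentioned that the web app streams audio over WebSockets. As a rough illustration of what full-duplex audio streaming involves, here's a minimal sketch of framing raw PCM into fixed-size chunks and pacing them out in real time. The sample rate, frame length, and the `ws_send` callable are all assumptions for illustration; Kyutai hasn't published Moshi's actual wire protocol.

```python
import asyncio

# Assumed parameters -- not Moshi's documented format.
SAMPLE_RATE = 24_000   # 24 kHz mono PCM
FRAME_MS = 80          # 80 ms per frame
BYTES_PER_SAMPLE = 2   # 16-bit samples

def frame_audio(pcm: bytes) -> list[bytes]:
    """Split raw PCM into fixed-size frames, zero-padding the final frame."""
    frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE
    frames = [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
    if frames and len(frames[-1]) < frame_bytes:
        frames[-1] = frames[-1].ljust(frame_bytes, b"\x00")
    return frames

async def stream(ws_send, pcm: bytes) -> None:
    """Send frames paced at real time; ws_send is a hypothetical async sender,
    e.g. the send() coroutine of an open WebSocket connection."""
    for frame in frame_audio(pcm):
        await ws_send(frame)
        await asyncio.sleep(FRAME_MS / 1000)
```

The point of pacing frames like this is that the model can start responding while you're still talking, which is what makes the latency feel conversational rather than turn-based.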
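The 8-bit and 4-bit quantization mentioned above is easy to sanity-check with back-of-the-envelope memory math for the 7-billion-parameter model. This counts weights only (no activations or KV cache) and is my simplification, not a figure from Kyutai:

```python
def weight_gib(params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone."""
    return params * bits_per_weight / 8 / 2**30

PARAMS = 7_000_000_000  # the announced 7B-parameter model

# Roughly 13.0 GiB at 16-bit, 6.5 GiB at 8-bit, 3.3 GiB at 4-bit.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gib(PARAMS, bits):.1f} GiB")
```

That 4-bit figure is what makes running a model this size on a consumer laptop plausible, which lines up with the press release's point about local, unconnected deployment.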