In my latest video, I delve into LLaVA, a multimodal model that sets a new standard for local inference. My exploration begins with a hands-on demonstration of how easily LLaVA can describe images and extract details from them. I walk through various examples, including descriptions of photographs, such as one of Queen Elizabeth, and detailed analyses of webpage screenshots. One of the standout features I highlight is LLaVA's ability to run locally, a significant step forward for versatility and accessibility. I demonstrate the model's performance across a range of tasks and show how it compares with other models, and I emphasize its compatibility with various devices, from the latest MacBook models to laptops that are four years old, illustrating its broad applicability. As the video progresses, I provide guides on deploying the LLaVA model on cloud hardware, trying it through Perplexity Labs, and setting it up on a local Ollama server. These tutorials are designed for a wide audience, from developers and designers to hobbyists interested in cutting-edge AI. A crucial part of my presentation is the transformative potential of LLaVA for application design and delivery: I explore how local inference can change the way applications are developed, making them more dynamic and responsive. Lastly, I touch on the integration of LLaVA with LangChain and the possibilities that this combination opens up. Throughout the video, my aim is not only to showcase the technical capabilities of the LLaVA model but also to get my audience thinking about how this technology can be applied in daily life and professional projects.

00:00 Introduction to LLaVA
00:06 Demonstration of the LLaVA Model
01:53 Running the LLaVA Model Locally
02:24 Understanding Multimodal AI Systems
03:18 Future of Multimodal Models
04:57 Exploring the LLaVA Model's Capabilities
06:47 Deploying the LLaVA Model
07:47 Potential Use Cases of the LLaVA Model
08:48 Integration of LLaVA with Other Tools
09:13 Conclusion and Final Thoughts
---
type: transcript
date: 2024-02-23
youtube_id: x9aI2N4Kyco
---

# Transcript: LLaVA: A Multimodal Model with Local Inference Capabilities

"Describe in bullet point form what this image is." So what I'm demonstrating here is the new LLaVA model, and the thing that's pretty exciting with this model is that you're able to run it all locally. I'll run through a few different examples just to show you how this model works. One of the more common requests you can send in is something like "describe this image for me"; that's probably a pretty safe bet for most of the queries you're going to be passing in. In this case you see that I passed in that photo of Queen Elizabeth, and the model responded back with: "This image features a black and white illustration of Queen Elizabeth II. The Queen is depicted with her iconic crown, smiling slightly as she gazes to the side." It's pretty remarkable that we can now run these multimodal models straight on the hardware of our devices. I'm running this on a newer MacBook, but I did test this exact same model on a laptop that's about four years old and had pretty good results; it took a little longer to process, but it was still able to run without an issue.

The next picture I'm going to send in is one with text. I'll say "describe this image in bullet point form", drag it over into my terminal, and run it. This is an image of a blog post from OpenAI, and it responds back with: "This image appears to be a screenshot of a web page, specifically showcasing the design for a new feature of an application or a service. Here are the details visible in the image: the web page has a clean, modern layout with a light color palette; a prominent image depicts memory and a representation of how users can interact with ChatGPT, highlighting features like memory control and more; there is a list on the right-hand side of the page under a heading 'New controls for ChatGPT'; the item lists are not fully visible due to the image resolution and size." If I just pull up the image here, that's the response I got back.

What's great with this model is you can use it to run inference locally for free. Everything I'm using here is completely free: the model is free, Ollama is free, and if you want to deploy this to a server you could do that as well. The thing with Ollama, which is really interesting, is that it's a couple of former Docker employees, and what they're really looking to do is sort of dockerize these LLMs, making it really simple to spin them up locally on our machine or on cloud hardware and easily interact with these models on a simple server. Just to type into the model a bit: this is a multimodal system; LLaVA integrates advanced natural language processing with computer vision capabilities to understand and process both text and visual data seamlessly at the same time. If you're looking for more information, there is the paper, which I'll link in the description of the video along with all of the other websites I'm going to be showing you. This is obviously cutting-edge technology, and the way it works is through the advancements we've seen in deep learning, such as the Transformer models used in ChatGPT, and it also uses convolutional neural networks for the vision tasks.
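If you want to script the same "describe this image" request instead of dragging files into the terminal, here is a minimal sketch against the local Ollama server. It assumes Ollama is installed and the LLaVA model has already been pulled; the endpoint and payload shape follow Ollama's documented REST API, and the image filename is just a placeholder.

```python
# Minimal sketch: ask a locally running LLaVA model (via Ollama) to describe an image.
# Assumes `ollama run llava` has pulled the model; "queen_elizabeth.jpg" is a placeholder file.
import base64
import json
import urllib.request


def describe_image(path: str, prompt: str = "Describe this image in bullet point form.") -> str:
    # Ollama expects images as base64-encoded strings in the request payload.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = json.dumps({
        "model": "llava",
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,  # return a single JSON object instead of a token stream
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(describe_image("queen_elizabeth.jpg"))
```

This uses only the standard library, so it runs anywhere Python is available while the Ollama server is up.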
Now what's great with this model is it's really designed to handle large-scale application data and complex queries. I sent in a number of different images and it was able to respond with reasonable answers for all of them. We're just at the ground floor for these multimodal models, and it's going to be really interesting to see the landscape of different models and architectures over the coming weeks and months. Having access to these multimodal models is going to be a huge unlock for building different applications, and overall it's just a really neat capability. Over the past year we've seen countless examples of chatbots that companies have developed since ChatGPT made its splash onto the scene, and we started to see impressive versions of these multimodal capabilities; it really wasn't that long ago that we first got access to multimodal models at all. You can access them through the GPT-4 Vision API if you want a paid solution, passing in an image and having it do something similar for you. Now, it goes without saying that the performance of this model isn't quite as high as something like GPT-4, though from what I've read it does outperform it on certain tasks, which I'll let you dive into if you'd like. Another great thing with this model is that it has an Apache 2.0 license, so you're going to be able to use it broadly. If you're looking for more information on the model, I'll point you to the GitHub repo; there's also a ton of information on Hugging Face, where you can check out the models, and the Hugging Face Space is another great place to look.

Just to show you some of the results of the model itself: there are two different sizes of these LLaVA models, a 13B variant as well as a 34B variant. If we look across a number of different metrics, you see that there are some benchmarks where it outperforms both Gemini Pro and GPT-4 Vision. It's great that we now have open-source models creeping up on the capabilities of these closed-source models. There are some good examples of results from the model here: one is passing in a picture of Mark Zuckerberg and looking at the response we get back. What I found interesting is a query where you pass in the question "I need to pick up my wife, I live in San Jose, what time should I leave?" along with just an image of the flight information, and it says: "Based on the information provided in the image, the flight is scheduled to arrive at 11:51 a.m. at San Francisco International Airport."
"If you live in San Jose, you should consider the travel time between San Jose and San Francisco, which is approximately 45 to 60 minutes depending on traffic conditions", and it continues on from there. So really remarkable, right? The fact that you could just take a picture of something on screen and quickly send in a query. I think this type of thing is going to become more and more familiar; we're starting to see these different devices pop up, whether it's glasses or the Vision Pro or the Rabbit R1, where you're going to be able to take a picture of something and quickly get a response back. It's going to be a really exciting time for developers to integrate these things into a variety of applications. I know for myself, if I can just pick up my phone, ask a question about what something is, and get a response back almost in real time, that's something I think a lot of people are going to love.

The absolute easiest way to get started with this is on Perplexity Labs. You can head over to labs.perplexity.ai and go to the drop-down in the bottom right-hand corner; there's a ton of different models you can play around with, like the Mistral models or the new Gemma models from Google. If you want to try these two LLaVA models, you can select the model and then upload your photo. The thing with Perplexity Labs is that it has really good inference speed, so if you don't have a fast computer but you do want good performance, you can just pass in your image, the Queen Elizabeth photo here, and we see the tokens per second is over 200.

With that being said, I think what most people are going to want to do with this model is try it out locally, just to see how it performs on their hardware. Ollama is a great option for running these models locally on your machine. Simply install it like you would any program, then check out the different models on the website; we scroll down and select LLaVA. To install it, you'll run `ollama run llava`; if it's the first time, it will pull down the model, which is several gigabytes in size, so it will take a little while depending on your internet connection. Just like you saw at the start of the video, you can pass in text asking a question about what the image is and then drag over the image. You can always type in the path to the image, but I found dragging the image from Finder to be an easier way of interacting with it.

The other thing is, if you haven't used Ollama before, it sets up a local inference server. So if you're developing a desktop application, or you want different tooling on your desktop, maybe something that takes a screenshot and lets you ask a question, you can pass it in and simply ask something like "I'm stuck here, help me figure out what to do" or "what is this?" You can imagine all of the different use cases where you could interact with this. The other thing with Ollama is you can deploy it to cloud hardware, so you can run it on something like AWS or GCP or Azure or what have you. I was playing around with Ollama recently where I put an ngrok proxy server in front of my Ollama server and was able to access it from anywhere; I set up a simple Next.js application, and from my phone I was able to query the different models and get responses back from my MacBook. Even though I'm miles away on my cell phone, you could access this from anywhere in the world.
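To make the "query it from my phone" setup concrete, here is a rough sketch of a client hitting that same Ollama server through a public tunnel. The tunnel URL below is purely hypothetical (you would substitute whatever ngrok or your cloud deployment exposes), and the request shape follows Ollama's documented chat API.

```python
# Sketch: ask a question about a screenshot against a remotely reachable Ollama server.
# OLLAMA_URL is a hypothetical tunnel address; replace it with your own proxy or cloud endpoint.
import base64
import json
import urllib.request

OLLAMA_URL = "https://example-tunnel.ngrok.app"  # placeholder public URL for the local server


def ask_about_screenshot(path: str, question: str = "I'm stuck here, help me figure out what to do.") -> str:
    with open(path, "rb") as f:
        screenshot_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = json.dumps({
        "model": "llava",
        "messages": [
            # The chat endpoint takes role/content messages; images ride along as base64 strings.
            {"role": "user", "content": question, "images": [screenshot_b64]},
        ],
        "stream": False,
    }).encode("utf-8")

    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",  # Ollama's chat endpoint, reached through the tunnel
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]


if __name__ == "__main__":
    print(ask_about_screenshot("screenshot.png"))
```

The same function works against `http://localhost:11434` if you skip the tunnel and only need local access.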
So you just put in that URL and then you're able to interact with it; there's a ton of different ways you can interact with these models, and LLaVA could be another extension of that kind of application. Imagine you're out and about and want to take a picture of something, but you don't have access to something like ChatGPT Plus or Perplexity, and you just want to pass it to your computer running the LLaVA model locally; you could do that by setting up a simple app, and there are a ton of different use cases for stuff like this. I also just wanted to show you that there is an integration in LangChain for Ollama, so if you've built with LangChain before, that's an option for you; there's also an option within LlamaIndex. You're able to easily use these multimodal capabilities straight within LangChain: you see the example here where it's just reading a file of something like a hot dog, then passing the image in as base64, and from there you can pass in your query as well (a minimal sketch of that pattern is included after the transcript). I just wanted to do a really quick demonstration of LLaVA, point you in its direction, and encourage you to check it out. Otherwise, that's pretty much it for this video. If you found this video useful, please like, comment, share, and subscribe, and until the next one.
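For reference, here is a minimal sketch of the LangChain-plus-Ollama pattern described above: read an image file, base64-encode it, bind it to the LLaVA model, and invoke a query. It assumes `langchain-community` is installed and Ollama is serving LLaVA locally; the file name is a placeholder, and the exact import path and binding API may differ in newer LangChain releases.

```python
# Sketch: multimodal query through LangChain's community Ollama integration.
# "hot_dog.jpg" is a placeholder; the bind(images=...) pattern follows the docs at the time of the video.
import base64

from langchain_community.llms import Ollama


def query_image(path: str, question: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    llm = Ollama(model="llava")                    # talks to the local Ollama server
    llm_with_image = llm.bind(images=[image_b64])  # attach the image to the next call
    return llm_with_image.invoke(question)


if __name__ == "__main__":
    print(query_image("hot_dog.jpg", "What food is shown in this image?"))
```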