Links:
- IDEFICS playground: https://huggingface.co/spaces/HuggingFaceM4/idefics_playground
- Multimodal models on the Hub: https://huggingface.co/models?other=multimodal
- IDEFICS blog post: https://huggingface.co/blog/idefics
- Flamingo paper: https://huggingface.co/papers/2204.14198
- LLaMA 65B: https://huggingface.co/huggyllama/llama-65b
- CLIP-ViT-H-14: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- Multimodal Embeddings with LangChain in Node.js with Vertex AI: https://www.youtube.com/watch?v=cxxEsCYt-C0
- Patreon: https://patreon.com/developersdigest
- https://www.developersdigest.tech/
---
type: transcript
date: 2023-08-24
youtube_id: ob6uZhN63-Y
---

# Transcript: IDEFICS: A New Frontier in Multimodal Language Models

In this video I wanted to quickly showcase IDEFICS, a new open-access large visual language model that lets you combine images and text and get a text response back. If you head over to Hugging Face Spaces, there is a playground where you can try it for free. There are a handful of sample image-and-text combinations you can run, but it is more interesting to test it with your own images. So if I upload this image of a dog and say, "This is Archie, tell me a short story about him," one thing that is nice about this model is how fast it is, considering it is handling both images and text. Another thing I found interesting: if you go to duplicate the Space, it shows you the suggested hardware to deploy it, and the cost is relatively cheap; four cents an hour combined is actually pretty reasonable for what this is able to produce.

Back in the playground, notice that I never said this was a dog, and yet the response begins, "Once upon a time there was a happy, energetic dog named Archie," et cetera. The model can also chain responses across the text and image inputs you give it. If I upload another image, one of the Empire State Building, and say, "Archie wants to go for a walk here, tell me about his experience," it responds that Archie's walk in the city was an exciting adventure: he saw tall buildings, busy streets, and lots of people; he sniffed around, et cetera.

I have run these two examples a few times, and sometimes the model names the correct breed of dog, identifies the Empire State Building, and brings in the context of it being New York City. So it is fairly sophisticated about the images you submit. That said, I am using an iconic building and a fairly generic photo of a dog; with a more complex scene where there are lots of things going on, it would be interesting to see how the responses hold up. For examples like these, though, it works quite well.

A couple of things to note about this type of multimodal model: GPT-4 has this capability, but it has not been publicly released yet. I think it is easy to forget that this was demoed when GPT-4 was initially announced, and once it is released to the public, these types of models will become a lot more prevalent. In OpenAI's demo, you might remember, they drew a simple picture of a website and asked the model to generate the HTML, CSS, and JavaScript to make that website, and it was able to do that. So there are going to be a handful of use cases for this combination of images and text, and eventually video and audio, all within one multimodal model, and it is interesting to start exploring what the capabilities and potential use cases are.
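If you would rather script this flow than use the playground, the model is also exposed through the `transformers` library (the `IdeficsForVisionText2Text` class and processor were added alongside the release, as documented in the Hugging Face blog post linked above). Below is a minimal sketch of the Archie demo; `archie.jpg` is a hypothetical local file standing in for the photo used in the video.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, IdeficsForVisionText2Text

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "HuggingFaceM4/idefics-9b-instruct"  # instruction-tuned 9B variant

model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# A prompt is a list that interleaves text and images, which is how the
# playground chains turns. "archie.jpg" is a placeholder for your own photo.
prompts = [
    [
        "User:",
        Image.open("archie.jpg"),
        "This is Archie. Tell me a short story about him.<end_of_utterance>",
        "\nAssistant:",
    ],
]

inputs = processor(
    prompts, add_end_of_utterance_token=False, return_tensors="pt"
).to(device)

# Stop generating at the end-of-utterance token, and keep the special
# image-placeholder tokens out of the generated answer.
exit_condition = processor.tokenizer(
    "<end_of_utterance>", add_special_tokens=False
).input_ids
bad_words_ids = processor.tokenizer(
    ["<image>", "<fake_token_around_image>"], add_special_tokens=False
).input_ids

generated_ids = model.generate(
    **inputs,
    eos_token_id=exit_condition,
    bad_words_ids=bad_words_ids,
    max_new_tokens=256,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Chaining works the same way as in the playground: to ask about the Empire State Building walk, append the model's reply plus a new `\nUser:` turn containing the second image to the same inner list and generate again.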
I thought this one was nice because you can access it right away, you can make your own private version of it if you like, and you can get going. Now, in terms of the details: if you go on Hugging Face and search "multimodal," there are only a handful of models, and this one shot right to the top. There are two variants, an 80B variant and a 9B variant, that you can use. If you are interested in learning more about the model, there is a great blog post on it which I encourage you to read. It is also based on Flamingo, which is interesting in its own right, and it is the child of two models: LLaMA 65B for the language side and a CLIP-ViT model for the image side.

The last thing I wanted to mention is that it might be interesting, when the Vision Pro comes out, to see how these types of models are incorporated into a product like that. Getting familiar with ideas for incorporating a multimodal model into your workflows could be pretty interesting given the timing. Once that product comes out it will obviously be a relatively niche market at first, but just think about it out in the wild: it has cameras that can see your environment, and if you can tie what those cameras see into a model that has the context and knows, say, that you are looking around a room and there is a computer in front of you, or a bottle of water, or what have you, the different use cases start to become interesting.

If you are interested in exploring multimodal models further, I have another video where I explored multimodal embeddings with Google's Vertex AI, which I'll link in the description of the video. Otherwise, that's pretty much it for this video; I just wanted to introduce IDEFICS to you. If you found this video useful, please like, comment, share, and subscribe, and otherwise, until the next one.
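If you do want your own private copy rather than the hosted playground, here is a hedged sketch of loading it locally. It assumes the `transformers` IDEFICS classes and the 4-bit `bitsandbytes` loading path described in the blog post linked above; the checkpoint names are the ones published under the HuggingFaceM4 organization.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, IdeficsForVisionText2Text

# The released checkpoints: a base and an instruction-tuned model at each size.
CHECKPOINTS = {
    "9b": "HuggingFaceM4/idefics-9b",
    "9b-instruct": "HuggingFaceM4/idefics-9b-instruct",
    "80b": "HuggingFaceM4/idefics-80b",
    "80b-instruct": "HuggingFaceM4/idefics-80b-instruct",
}

# 4-bit quantization (requires the bitsandbytes package) cuts the memory
# footprint enough that the 9B variant can fit on a single consumer GPU.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = IdeficsForVisionText2Text.from_pretrained(
    CHECKPOINTS["9b-instruct"],
    quantization_config=quantization_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(CHECKPOINTS["9b-instruct"])
```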