
Repo: https://github.com/OthersideAI/self-operating-computer#self-operating-computer-framework

**Self-Operating Computer Framework**

A framework to enable multimodal models to operate a computer. Using the same inputs and outputs of a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.

Key Features:
- **Compatibility:** Designed for various multimodal models.
- **Integration:** Currently integrated with GPT-4v as the default model.
- **Future Plans:** Support for additional models.
---
type: transcript
date: 2023-11-30
youtube_id: nQor7Weu4LQ
---

# Transcript: Self-Operating Computer Framework in 4 Minutes: Control Your Computer With GPT-4-Vision

In this video I'm going to show you the Self-Operating Computer Framework, a new open source project that enables multimodal models to operate your computer. Imagine you want to accomplish a particular action, say Googling for some information: normally you're using your mouse to open up Chrome and typing in all the commands yourself. This project aims to take out all those steps: it just takes in that prompt, then goes ahead and moves your mouse around and inputs what it needs to input. It's a pretty interesting project. It's also brand new, so there are definitely going to be bugs, but this is very much on the cutting edge of what's possible with GPT-4's Vision API, so I expect this project will get a lot better very quickly over time as more contributions come in.

The first thing you'll need to do to get set up is go over to the GitHub repository; you can find it in the description of the video. We're going to pull down the repo: however you normally pull down your repos, I'm just going to copy the git clone command and paste it into the terminal. I had a completely empty directory here, and I'm going to cd into this folder and do everything from scratch so you can see all the different steps. Once we've done that, we'll run through a handful of steps; I'll make this a little bigger so hopefully you can see it all. First we're going to set up a virtual Python environment. Most of these steps are pretty quick to execute, but it will have to install some dependencies, so it will
take just a second to do that. One thing to note with all of this: you will need Python 3 installed, so if your terminal is yelling at you a little bit when you run the python3 command, just make sure you have Python 3 installed on your machine.

While this is all loading, I'll also mention that you will need an API key from OpenAI. If you don't have an API key and you haven't used their API yet, you'll likely be able to get some free credits to play around with for a little bit, I think about $5 worth, but I'd imagine most people watching this video have probably already played around with their API. So just grab an API key like you typically would. I'm going to expand this here so we can see what's going on, and then we're going to move that example .env file to .env. Then, just like we would in other projects, we put our API key in the .env file and make sure to save it out.

Once you've done all of that, you can go ahead and run the operate command. In here I'm going to say "look up the most recent LangChain release on Google". You can hear the screenshots being taken, and those screenshots are passed to the LLM to be interpreted. Once that response has come back, it actually takes actions on my computer: any mouse movements or inputs from the keyboard are not me operating at this point. I'm hands off, hands in the air, just talking into the microphone. As you can see here, it goes ahead, types that into Google, and takes another screenshot. It's very novel; it's just sort of hard to believe that all of this can do what it's doing. Because it's very new there are definitely going to be some bugs, and in toying around with this the hit rate is definitely not super high,
but from some of those novel examples you can sort of see that we're on the ground floor of this type of thing, and it's likely only going to get better from here. I just wanted to show you this project. Toy with it, see what you have success with, or things you don't have success with, and please leave them in the comments for all of us watching. Hopefully you enjoyed this video. I'll just sort of leave this running in the background here, but that's it for this one. If you found this video useful, please like, comment, share, and subscribe, and otherwise, until the next one.
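The setup steps walked through in the video can be sketched as a single shell session. This is a sketch under assumptions, not a verbatim copy of the repo's instructions: the exact file names (`requirements.txt`, `.env.example`) and the `venv` directory name are assumptions about the repo layout, so check the README for the current commands.

```shell
# Pull down the repo and move into it
git clone https://github.com/OthersideAI/self-operating-computer.git
cd self-operating-computer

# Set up a virtual Python environment (requires Python 3)
python3 -m venv venv
source venv/bin/activate

# Install dependencies (requirements.txt is an assumed file name)
pip install -r requirements.txt

# Copy the example env file and add your OpenAI API key
# (.env.example is an assumed name; the video calls it the "example .env")
mv .env.example .env
echo "OPENAI_API_KEY=sk-your-key-here" >> .env

# Run the framework, then give it an objective at the prompt,
# e.g. "look up the most recent LangChain release on Google"
operate
```

The virtual environment keeps the project's dependencies isolated from your system Python, which is why the video creates it before installing anything.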