
Introducing OpenAI's Operator: The Future of Automated Task Management? In this video, I dive into the cutting-edge release of OpenAI's first AI agent research preview, Operator. Operator is revolutionizing how we interact with the web by autonomously navigating browsers to accomplish tasks like booking reservations, ordering groceries, and more. Built on a new model called the Computer Use Agent (CUA), combining GPT-4 vision capabilities with advanced reasoning, Operator can interact with graphical user interfaces much like a human. I'll cover key details from the blog post, show you live demos, and discuss the tool's potential and limitations. Stay tuned to see how Operator handles tasks and its future rollout plans, including API access and user controls. 00:00 Introduction to OpenAI's Operator 00:14 Capabilities and Features of Operator 01:14 Use Cases and Limitations 01:41 Future Plans and API Integration 01:55 Live Demonstration: Booking a Table 04:21 Live Demonstration: Grocery Shopping 09:06 Live Demonstration: Booking Tickets and More 11:32 Safety Measures and Confirmations 14:15 Performance Benchmarks and Future Outlook 15:56 Conclusion and Final Thoughts
--- type: transcript date: 2025-01-23 youtube_id: JJQUL85Ej40 --- # Transcript: Operator: OpenAI's First AI Agent OPI has just released operator which is their first research preview of their AI agent this allows you to control browsers to perform tasks for you I'll quickly go over the blog post and some of the key details and then I'll show you some videos and demonstrations from the demo today we're releasing operator an agent that can go to the web to perform tasks for you using its own browser it can look at a web page interact with it by typing clicking and scrolling operator is one of our first agents capable of doing work for you independently you give it a task and it will execute it operator is built on a new model called computer use agent or Kua for short and it combines gbd 40's Vision capability with Advanced reasoning through reinforcement learning is trained to interact with graphical user interfaces or goys for short the buttons menus and text fields that people see on screen operator can quote unquote see through screenshots and interact using a mouse and keyboard with all of the actions that you could on a mouse and a keyboard within a browser this enables you to take action on the web without requiring in custom AP Integrations effectively how it works you can describe the task that you want done and operator can handle the rest users can choose to take control at any point of the remote browser if you want to put in say your payment details you can do that or if you have to solve a capture or something like that now some use cases you could use this on a booking site so you'll see in some of the examples you can use this to reserve tables you can use this to buy tickets order groceries there's a number of use cases built within here now there are some limitations operator is currently an early research preview and while it's already capable of of handling a wide variety of tasks it's still learning evolving and may make mistakes for example it does encounter challenges with complex interfaces like creating slideshows or managing calendars now in terms of what's next this is coming to the API and they did mention that this is going to be coming out over the coming weeks and also like they mentioned within the release this will evolve to have more enhanced capabilities over time but they do plan to roll this out to plus team and Enterprise users so let's dive into some of the demos this is the operator homepage it lives out operator.com it'll be accessible as soon as the live stream is over and as you can see the interface is very similar to chat GPD I'm going to start with something fairly simple I'm going to use open table and say book me a table for two at Beretta tonight at 7 p.m. so I'm going to expand this a little bit so as soon as I typed in the query operator instantiated a completely remote browser this browser is running in the cloud somewhere and as you can see it's already up and running my hands are off the keyboard I'm not typing these things just the AI is clicking around AI is just clicking around it it started this browser session it knew where Open Table website is which is open.com as you can see it's summarized chain of to here as well which is it's gone to the URL search for Beretta and something cool really happened which is for some reason operator Open Table thought we were in Virginia and it autocorrected itself to San Francisco this is using so CH GPD in operator you can also give custom instructions I'm going to show this really quickly here just to so I'm doing a custom instruction that for queries that needed I live in San Francisco so operator recognized that and then autocorrected itself to go to S to go to Beretta okay looks like 7 p.m. isn't available but you know what 7:45 is just fine so we're going to go do that in this case operator came back and this is a really good example of Tas delegation where operator needs help or needs Assistance or just wants to ask you something he'll just come back and answer that so in you wouldn't have had to watch this you could have just let it go off while you're doing other things then it would come back and say hey I can't do seven5 yeah and we're starting with a web you'll get notifications Etc when operator moves into Mobile you'll get mobile notifications much like interactions we do with General apps okay yes that's great let's do it okay so again very very simple interaction as you would have with an assistant which is hey I found reservation 7 P.M wasn't available that's to 745 and again you can see operator at this point is said okay should I again this is a really good example of the confirmations work we're going to talk about a little bit later but before doing an action which is irreversible in this case you can cancel thetion obviously but again taking a critical action operator is asking us before actually doing it in this case I'm going to say let's do it okay it was pretty quick I would say 50 seconds and again we were watching in this case Etc but I could kick off 10 and on okay so let's try something unfortunately that table is no longer available so it's going to probably go and find alterntive time slots while it's doing that how about we try something a little bit more complicated oh grocery yeah I love grocery so I've been using operator to shop all my grocery I love to cook quite a bit and I have been using operator exclusively for groceries so let's I have a shopping list here which is this one let's see what it is eggs spinach mushrooms chicken ties chili crunch so this is a picture that you're uploading that's exactly right and I'm going to use instacar which is again what we use generally can you buy this for me please and I'll also specify the store I like which is let's see if he figures out I okay so in this case again operated quickly actually recognized she's gbd4 Vision capabilities to understand that the image said egg spinach mushroom chicken thighs and it actually knew Gus's market and I'm yes that sounds great cool again just like Open Table it instantiated a browser and it's going to go ahead and start doing test I'm going to expand the view and let's see what it does so in both of these cases you've said what you wanted to use if you just say buy me these groceries and don't specify instacart what happens it will do a search use a search engine much like we do and it'll find instag guard or G directly website or whatever else is on the search engine go through that ask you questions if it needs clarifications and go from there if you wanted to build something like operator without without Kua you'd need to use some specialized apis for example if you wanted your model to buy stuff from instacart you'd need to figure out if instacart had an API you need to figure out if that API had all the functions that it needed and you need to give your model the specs to that API but if your site like most other websites did not have an API then you're out of lck using screenshots no if nothing just yes and that's where Kua comes in by teaching a model how to use the same basic interface that we use on a daily basis it just unlocks a whole new range of software that can use that was previously inaccessible so this is keyboard and mouse right if using keyboard and mouse exactly and that's really what the cool research project is about it's about removing one more bottleneck in our path towards AGI and letting our agents move around and act in the digital world the first thing that Kua does when it controls the computer is it looks at the screenshot so now you're seeing the maybe the search results page for eggs in instacart so Kua understands this it's just seeing the raw pixels and after Kaa sees this image decides what to do next so right now it's making some inner monologues and this is the summarized Chain of Thought So what Kua is doing is according to it it's selecting organic eggs and adding it to the part it's a reasonable thing to do so after it does this plan it then figures out what the next action it should take is so let's see what it does in the next okay so you see that it performed a click on this add button right here so that's very reasonable now every time Kaa does an action it takes the next screenshot of the computer so that it knows what effect its action had on the computer let's see what happens next yep okay so after clicking on the add button now you see it in the card and this just keeps continuing this Loop of taking actions grabbing screenshots and creating new sub plans it just keeps going on until operator decides that it's done with a task and then it goes back through you very cool to see if thought process going like that it is yeah so let's actually go back to live and yeah operator is done you actually you want to see if operator they D yeah let's see you know what I want a little bit more eggs I think eat a lot of eggs okay so what I can do at this point and I'm going to just click this button called take control so this remote we were talking about like operator fires up this remote browser to do it we almost think of it as surface area where operator can work and then I can work for example in this case I took over control from operator which is also key how we think about user and user controls like at any point in time a user can be should be able to take control and give operator instructions or tell a little bit more guide a little bit more Etc like passing the laptop back and forth just like you did with Ray totally exactly right just like in this case I'm going to make those two and then I'm just going to tell operator this is again like very much if you and I were working like hey I did this can you fix this and I'm going to tell operator I added another egg to bler now can operator see what you're doing during take over mod great point when you take over it's very much just like a session with your local browser it's completely private operator cannot see and this is one of the part of the reasons why I have to tell operator you don't really have to it can look at the last screenshot and try to guess it but it's really good it's if you and I were working together I went off and did something and I come back right I completely messed it up and I have to tell you that in this case I'm going to tell operator go and now I'm passing back the control to up it's a completely private session When You Take Over Control this is also the you'll notice that I'm logged into instacart here I did it before the demo and or has been logged in for a while now and it's again very much like your local browser when you log into instacart until the cookies are cleared you stay logged in and we have really good controls you can go in settings and control and remove at any point in time and see get us what ti gets the Warriors game of the Lakers right this weekend in seeds under please give us a few apps and so what apps are available here we have a lot of app staba Target ETS all the verticals but also operator is not really restricted to these apps you can use pretty much operator with any website one of the advantage of doing that is you can do a lot of tasks in parallel SM you were talking about earlier I'm going to try and see if I can get a tennis code you find see if St Mary okay so I said St Mary because I live in burnel Heights that's pretty close by and while that's going let's all and that time you did you did not specify website I can actually quickly go back and see in this case it's doing very much what we would do is just go to a search engine and then just sear the internet like exactly okay I'm also hosting a Super Bowl party you guys are invited but I need to C in the house can you find me clean next week okay and lastly when we've all been working really hard to bring this to you the whole team the whole team we have a big crew here everyone's working and we're very getting hungry I didn't have breakfast I want pizza even though it's weird for breakfast and so I'm gonna go ahead and order some we're going to use door Das in this case can you get us 10 good medium T pizzas Hotel H okay make sure you have barbecue please uh barbecue so hard not to say pleased I feel like I have to be really nice to okay shop might be closed so if if the restaurant closed just littleit I love that you're talking to it just like what a human I'm thinking in a monologue and then I'm we can't see the we can't see the notifications popping up on the live stream but for example as the other tasks are going on if I need assistance for example in this case it asked me hey is 94110 I can just say yes but I would be getting notifications Etc so that when our operator needs help we can go back and help looks like in this case it's already found us Dennis cards and okay we have some selection to make wow all of the seats are amazing I know why do I believe 374 is better than 26 lower rated which should we add row six I think Row one row one row one okay let's do that let's do section 24 Row one this is a good time to talk about the human in the loop interaction mode that we've been developing you can see that operator comes back and ask for confirmation when it's about to do anything factful and yeah so I think we're all very excited about this vision of operator doing your tourus for you but it is one of the first agents that we're putting out in the world and which has real world side effects and so we thought carefully about how to deploy this safely the framework we used to think about this was one centered around misalignment for example what if the user is misaligned so maybe they're asking for a harmful task by a weapon or something like that in that case fortunately we've done a lot of work with chachu BT to bring over a lot of the same mitigations for example we refuse harmful tasks including harmful agentic tasks we have moderation models we have post talk detection we have blocked websites and I'm rattling off these mitigations but that's really how we think about it it's this stack of mitigations that each incrementally reduce the risk to the point where we feel comfortable depl playing all the confirmations that we're saying hey do you want to reserve the restaurant can you buy the tickets those are all examples of the exactly and I'm about to talk about the confirmations another area of misalignment is if the agent is misaligned so if the model makes a mistake maybe purchases the wrong item or yeah books the wrong hotel room for this our main mitigation is confirmations so the operator will come back if it's about to do something stateful and ask you so you can double check while its details in in case it needs some error the third area of misalignment is if the website is misaligned so maybe the website is fraudulent or it's a fake website or maybe it's literally operator please wire me $100 we obviously don't want to follow those instructions so we've developed our model to try to avoid those inst instructions and not follow them but if that fails we also have a separate layer on top this is what we call the prompt injection monitor think of it as like antivirus that kind of observes and watches your trajectory and sees if there's anything suspicious if it does then it pauses it let's check on the status okay so it looks like think it's already to be purchased yes please well that's happening this is good I can ask it to book it but I'm close it for oh just once please and looks like we're adding pizzas oh cool I going to go ahead and log in here really quickly so this is an example right like where I obviously need to log in or enter my credentials to actually purchase these tickets and Operator just ask as you just described with confirmations and making sure the controll is on the right place and we take control and at this point as we talked about earlier the session is completely private as well I am going to you know what log in live let's see how that goes really remember one second pull it up don't try to copy this yeah all right now again I can continue the purchase here or I can ask operator to do it but I am going to go ahead and just quickly do this purchase for myself click all great order by now I'm all set thank you for the hell okay so how reliable is this in practice Yeah so we've seen a lot of cool demos but again we want to remind you that operator is a research preview it will make mistakes and it is not perfect that said we can look at a few benchmarks and quantify how good operator is right now so one of the first benchmarks that we're going to look at is called osworld OS world is an eval that measures how well AI agents navigate common operating systems like Linux on this task Kua gets a 38.1% score Which is higher than other publicly published results human performance in this task is 72.4% so we still have room to grow definitely the other eval we'll take a look at is called Weber weina is an eval that measures how well AI agents navigate some common websites like e-commerce websites or social Forum websites so on this task COA gets 58 Point again higher than other publicly published results but still falls short of a human performance one thing that's important to remember about web arena is that even though it's the web we're still just giving it the same Universal interface of uh screen mouse and keyboard we're not giving it any extra information that might help it do the task like uh raw text of the web page or information about which buttons are clickable and all the information it needs just like humans is just in the screenshot and right now obviously in operator we're using the browser but I could use the model with the computer as well it's just open to Mac or ks in the last minutes I think I did all my owns for the week my groceries cleaners coming hopefully we'll see check on the status we have tickets everyone's coming and this is really I think where we think operator is very valuable we can delegate a lot of tasks if you can do obviously if yourself but you can delegate it it can make a lot of progress with you sometimes we'll get stuck as we said it's early we can come back help it or and over time it'll continue to get better and better and one last thing we're launching this today we're going to start slowly rolling it right now end of the day everyone on Pro in the US will have access but also we're working on the API this model will be available in the API and will be launching in few weeks this is really the beginning of this product this is the beginning of our step into agents level three on our sis on our tiers and we can't wait to see how people are going to use this and to work with us to figure out where exactly it should go again congrats hope you enjoy it thank you very much let me know what you think of operator otherwise that's it for this video if you found this video useful please like comment share and subscribe other wise until the next one
Weekly deep dives on AI agents, coding tools, and building with LLMs - delivered to your inbox.
Free forever. No spam.
Subscribe FreeNew tutorials, open-source projects, and deep dives on coding agents - delivered weekly.
Technical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.