
Anthropic's Latest Breakthrough: Automating Computer Operations with Claude 3.5

Links: https://docs.anthropic.com/en/docs/build-with-claude/computer-use

In this video, we explore Anthropic's groundbreaking new capability for automating computer operations through their API. This feature enables running local scripts in TypeScript or Python to control your computer: taking screenshots, moving the mouse, and interacting with applications just like a human user. We delve into the experimental "computer use" capability, comparing it to other open-source projects and highlighting its potential for innovative desktop applications. Additionally, we discuss the upgraded Claude 3.5 Sonnet and new Claude 3.5 Haiku models, emphasizing their enhanced performance on coding benchmarks. This release opens up exciting new use cases and ways to think about interacting with computers and the internet.

00:00 Introduction to Anthropic's New Automation Tool
00:10 Capabilities and Features of the New API
00:42 Innovative Integration and Potential Use Cases
01:49 Detailed Breakdown of the Blog Announcement
02:39 Performance Improvements and Model Comparisons
03:43 Practical Applications and Future Prospects
05:32 Setting Up and Using the New Tool
06:37 Conclusion and Call to Action
---
type: transcript
date: 2024-10-22
youtube_id: eB7E-mtpC18
---

# Transcript: Anthropic Claude Can Now Control Your Computer

Anthropic just put out computer use for automating computer operations. What Claude allows you to do now is effectively control your computer. There is a new capability via their API: you can run a script locally, whether you're using TypeScript or Python, and from there it will start to take screenshots of what you have on the screen. It then has the ability to move the mouse to a position on the screen, to click, and to interact as if it was using your mouse and keyboard.

What's really interesting with this is they're the first frontier lab to offer something this out there and this innovative in terms of this type of offering. There are a number of open-source projects that effectively try to do something similar, but what's interesting here is that they're integrating this directly into the Anthropic API. We can start to think about how to use computers a little bit differently, because right now the internet is really vast. There's no uniform API that we can just interact with to set up all these function-calling capabilities within our LLM applications; how people use the internet and how websites are set up is a very broad space. Having something general that's able to use both local apps and things like our web browser can potentially unlock a ton of new use cases.

Now, I think the thing that's interesting with this is we really have to start to frame and think about how to use it and what the use cases are. With a lot of these LLM applications we've seen a ton of different web app implementations, but this stands to usher in an era of building out desktop applications, because if we're able to have an application that can control our computer like this and be relatively general, it goes without saying that this can unlock a ton of different use cases
that we might not even have an idea for quite yet.

Let's go over the announcement and blog post they put out today: "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." Within the blog announcement they mention they're announcing an upgraded Claude 3.5 Sonnet as well as a new model, Claude 3.5 Haiku. There's no mention of Claude 3.5 Opus, like a lot of people were expecting. The computer use capability is still experimental; it's in beta, and they're releasing an early prototype to get feedback from developers, with the expectation that the capabilities will improve over time. It's available today across the Anthropic API, Amazon Bedrock, and Google Cloud. Claude 3.5 Haiku is going to be released later this month, so within a week or so.

Now, in terms of capabilities, a lot of people are probably going to be focused on the Claude 3.5 Sonnet model. This is arguably the best model for a ton of different tasks, including coding. On the HumanEval benchmark, just to give you an idea, Claude 3.5 Sonnet was previously at 92% and today it's at 93.7%, and you can see improved performance basically across the board, including agentic coding, which is up to 49% from the previous 33.4%. One notable thing they mention in the blog post is that this score on the agentic coding evaluation benchmark is higher than all publicly available models, including OpenAI's o1-preview model.

As for Claude 3.5 Haiku, this is an excellent model if you haven't used it before. Another piece of this release is that even the Claude 3.5 Haiku model has better performance on that agentic coding benchmark than the previous version of Claude 3.5 Sonnet. It's both a considerably cheaper and a considerably faster model that now outperforms what we had, effectively just yesterday, from Claude 3.5 Sonnet on agentic coding.

While there were updates to the models, the big use case is going to
be the computer use capability that I mentioned and demonstrated earlier in the video. This is going to make a ton of different applications available to us. It's very similar to something like Open Interpreter, which is effectively a framework that allows you to control your computer by leveraging essentially whatever model you like, and it's also similar to something like MultiOn, which is a company and an offering really focused on navigating the web and being able to control and interact with it, similar to what Anthropic put out today.

One of the key pieces here is that they mention they're teaching Claude general computer skills, allowing it to use a wide range of standard tools and software programs designed for people. I think this is a really important point to sit with for a moment. Before something like this, a lot of the agentic tools we'd build required function calling or some framework able to perform API calls and interactions, and it would often become relatively cumbersome. Whereas if you're able to send off an agent with general capabilities, one that can navigate the web, research, and use your computer and the programs you have, it opens up a new way to think about this. It seems, from this release, that Anthropic's opinion might be that agents are going to use the internet much as humans do, rather than us wiring up proprietary models that call APIs and building applications around that agentic workflow.

In terms of the actions it can perform, they're essentially the same as how you would use a computer: scrolling, dragging, zooming, typing, clicking, basically all of the interactions you're used to. The way it works is it takes a screenshot and sends a response back, essentially being able to control your
computer as if it was a human.

Now, in terms of actually setting this up, it looks very similar to what you would already use within the Anthropic API, but the notable difference is that the tools you pass in are what's used to control your computer. You can see the different tool objects here: computer, text editor, as well as bash, and here is the instruction "Save a picture of a cat to my desktop." There's also the beta flag, computer-use-2024-10-22. Take a look at the Python, TypeScript, or shell-script example; you should be able to just run it and paste in your API key to try this out. Now, to build a turn-based application, you will definitely have to set up some form of loop to use this effectively.

For getting started with computer use, there's good documentation here: provide Claude with the computer use tools and a user prompt, then Claude decides whether to use a tool, and then steps three and four, like they mention here, are what they refer to as the agentic loop. We extract the tool input, we evaluate the tool on the computer, and then we return the results, and it goes through that loop until the task is complete.

Like I mentioned, I'll put a link in the description of the video. If you're interested, let me know in the comments below whether you want a video on setting this up from scratch, and I'll go through these steps one by one and make an example you can use as a resource to build out an application with this. If you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one.
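That agentic loop can be sketched roughly as follows. This is a minimal sketch, not the official reference loop: the `model` callable stands in for a real call to the Anthropic client (the documented entry point is `client.beta.messages.create` with the `computer-use-2024-10-22` beta flag), and `run_tool` is whatever actually executes an action on your machine. The `tool_use`/`tool_result` block shapes follow the Messages API.

```python
from typing import Any, Callable, Dict, List

def agentic_loop(prompt: str,
                 model: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
                 run_tool: Callable[[Dict[str, Any]], str],
                 max_turns: int = 10) -> List[Dict[str, Any]]:
    """Run the extract-input / evaluate-on-computer / return-results loop
    until the model stops requesting tools or max_turns is reached."""
    messages: List[Dict[str, Any]] = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = model(messages)                       # one API round-trip
        messages.append({"role": "assistant", "content": reply["content"]})
        tool_uses = [b for b in reply["content"] if b["type"] == "tool_use"]
        if not tool_uses:                             # no tool call: task done
            break
        # Evaluate each requested tool locally and send the results back.
        results = [{"type": "tool_result",
                    "tool_use_id": b["id"],
                    "content": run_tool(b["input"])}
                   for b in tool_uses]
        messages.append({"role": "user", "content": results})
    return messages
```

Swapping the stub for the real client means taking a screenshot or performing the input event inside `run_tool` and returning the result (for screenshots, an image block) as the `tool_result` content.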
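As for the "evaluate the tool on the computer" step, one way to picture it is a small dispatcher that maps the model's action names to local input events. This is an illustrative sketch: the action names shown (`mouse_move`, `left_click`, `type`, `screenshot`) are a subset of the computer tool's schema, and `backend` is a hypothetical stand-in for a real input library such as pyautogui; check the documentation for the exact action set and parameters.

```python
from typing import Any, Callable, Dict

def dispatch_action(action: Dict[str, Any],
                    backend: Dict[str, Callable[..., Any]]) -> Any:
    """Route one tool-use action dict to the matching backend function."""
    name = action["action"]
    if name == "mouse_move":
        x, y = action["coordinate"]     # pixel coordinates on the screenshot
        return backend["move"](x, y)
    if name == "left_click":
        return backend["click"]()
    if name == "type":
        return backend["type"](action["text"])
    if name == "screenshot":
        return backend["screenshot"]()  # returned to the model as tool_result
    raise ValueError(f"unsupported action: {name}")
```

With pyautogui, for example, `backend` could wrap `pyautogui.moveTo`, `pyautogui.click`, `pyautogui.write`, and `pyautogui.screenshot`; the point is that the model only ever emits action dicts, and your script owns the actual input events.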