
In this video, I take a look at SWE-1, a new family of models introduced by the Windsurf team. Windsurf has moved quickly, recently releasing their agentic IDE and drawing interest from OpenAI for a reported $3 billion acquisition. The video covers the three models in the family: SWE-1, SWE-1-lite, and SWE-1-mini. These models are not solely focused on coding; they support the entire software engineering process, including directory traversal, terminal commands, and more. Benchmarks show the SWE-1 models performing close to, and in some cases better than, models like Claude 3.5 Sonnet and Claude 3.7 Sonnet. I demonstrate building a full-stack application using these models, highlighting both their capabilities and their current limitations, like occasional errors and slower processing times. Despite some flaws, the models show promise, leading to a discussion of their potential and the future direction of competitors like Cursor.

Firecrawl: https://www.firecrawl.dev/

Chapters:
- 00:00 Introduction to SWE-1 Models
- 00:46 Overview of SWE-1 Models
- 01:40 Benchmark Performance
- 03:00 Demonstration: Building a Simple Application
- 05:16 Creating a Website-to-Image Generator
- 08:18 Testing and Feedback
- 16:49 Final Thoughts and Conclusion
---
type: transcript
date: 2025-05-20
youtube_id: DQ-MVnJ6x64
---

# Transcript: Exploring SWE-1: Windsurf's New Models for Software Engineering

In this video, I'm going to be taking a look at SWE-1, a family of models just announced by the team over at Windsurf. Windsurf is moving very fast. They released their agentic IDE just last fall, and only a number of months later, OpenAI is reportedly interested in buying Windsurf for $3 billion. I think these models are a great example of the value that's within Windsurf. Windsurf, as well as companies like Cursor, has a tremendous amount of data: they can see all of the different pieces of code that were generated, which ones were actually accepted or merged, and whatever other metrics they decide to track.

I'll briefly go over the blog post and some of the benchmarks, then dive into an example where we'll try to build out a little application and see how it does.

Per the announcement: today, they're launching their first family of models, dubbed SWE-1, optimized for the entire software engineering process, not just coding. There are three models. SWE-1 is approximately Claude 3.5 Sonnet-level at tool-call reasoning while being cheaper to serve, and it will be available to paid users for a promotional period. SWE-1-lite replaces Cascade Base with better quality and is available for unlimited use to all users, free or paid. SWE-1-mini is their extremely fast model that powers Windsurf Tab, so all of the inline suggestions as well as the ghost text.

One thing to know about these models is that they're not just good at writing code. A large part of what they do isn't coding at all: it involves things like traversing directories, knowing when to run certain terminal commands, and knowing when to install things, build things, and test things.
In other words, they lay out that these models are not just for writing code, but for that whole agentic process as well.

In terms of benchmarks, they have a handful in the blog post. The first is conversational SWE tasks. On that benchmark, SWE-1 ranks just shy of Claude 3.5 Sonnet, and above other open-weight models like DeepSeek V3. On the end-to-end task benchmark, it plots just above Claude 3.5 Sonnet and just shy of Claude 3.7 Sonnet.

One interesting metric they have — and this goes with the insight they get from owning the editor and being able to track these things — is the average number of lines written by Cascade and actively accepted and retained by the user over a fixed period. In other words, this is a benchmark that acts as a piece of feedback on whether a user becomes a repeat customer. If you've used different LLMs, you know that some give much better suggestions than others, and those are obviously the ones you gravitate towards. On this metric, both SWE-1 and SWE-1-lite sit above Claude 3.5 Sonnet, while still a reasonable distance from Claude 3.7 Sonnet.

Finally, they have Cascade contribution rates for files edited at least once by Cascade: the percentage of changes made to those files that come from Cascade. This is a measure of helpfulness, normalized for how frequently a user wants to use the model and how willing the model is to contribute code.

There's quite a bit in this blog post; I'll link it in the description of the video if you're interested in checking it out. The main thing I want to do in this video is actually demonstrate an example. First, I'm going to start with a simple Next.js template, and we'll go through some prompts with SWE-1 and see how it performs.
First, I'll say: I want to create a beautiful header as well as a footer, add in some Lorem Ipsum text, and create a beautiful SaaS landing page on the homepage of the project. My plan is to go through a number of different prompts — nothing too crazy, but I want to try to build out a simple full-stack application leveraging a couple of services and see if I run into any issues.

What we see here: it analyzed the project directory, and we can see the structure of the project. The agent noticed it was a Next.js project, went in, and found the page where I want the edits. It also says it's going to check globals.css to understand the current styling. The first thing it decided to do was make some edits to globals.css — it looks like it added some colors for us. Now we can see it's editing page.tsx with a modern SaaS landing page. As a funny aside, I do notice that the dictation software I'm using seems to have introduced a bit of a typo here.

One thing right off the bat: this model definitely isn't as fast as Gemini 2.5 Pro. What I've noticed — and I don't know if this is just related to demand for the model, given that it's free and these just came out — is that some requests take quite a long time. Mind you, I did ask for a modern SaaS landing page, which could be very detailed, potentially several hundred lines of code, and it took about a minute or so to generate. Scrolling through, I see about 300 lines of code, and I'll accept that. We can see it installed the icons for us, created a grid.svg, and gave us a summary of all of the different pieces it did. Now, if I go in here, this is what it spun up for us from a first one-shot prompt.
Obviously, there are some little things I can nitpick, like the padding around certain elements — it does look like there should be more spacing around some areas, especially around these tiles. But overall, as a very barebones starting point, this is definitely better than a lot of other models. I wouldn't say it compares to something like Claude 3.5 Sonnet or Claude 3.7 Sonnet in terms of UI design, but it is better than a lot of other models.

Now I want to try to build out a full-stack application. I'm going to feed it the context of two different services: Firecrawl, in combination with GPT Image 1, which just came out from OpenAI. If you saw all those Ghibli-style images on X or wherever else they happened to pop up, this is the model that generates those types of images. The idea is that I want to be able to convert web pages into these images or infographics, starting from a simple Next.js boilerplate that we can work with.

The first thing I'll do is grab the context from the Firecrawl documentation and put it in the instructions for SWE-1. Then I'll copy this block of code to get all of the syntax into the context for the agent. Now, I want to replace the entire homepage. Instead, I want a heading that reads "Website to Image Generator" and just a URL input. When a user sends in a URL, I want to make sure the URL is valid — let's do some front-end validation so it doesn't get sent to the server without being validated. Once that URL is sent to the server, I want an app-router API endpoint, and within that endpoint I want to take the URL, use the Firecrawl scrape method, and return the markdown of the page.
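As a minimal sketch of that front-end validation step — the helper name `isValidUrl` is my own, not something from the video — it could look like this in plain TypeScript:

```typescript
// Hypothetical front-end helper: validate a URL before sending it to the API.
// Uses the standard URL constructor, which throws on malformed input.
function isValidUrl(input: string): boolean {
  try {
    const url = new URL(input.trim());
    // Only allow http(s), so inputs like "javascript:..." or "ftp:..." are rejected.
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false;
  }
}
```

In the submit handler, you'd only POST to the API route when this returns true, showing an inline error otherwise.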
Once I have the markdown, I want to send that entire string in my request to OpenAI's GPT Image 1 generator and have it generate an image. Now, I'll go ahead and send in this prompt. As you can see, it's quite a bit of context: I pasted the entire page from the Firecrawl documentation as well as that snippet from the OpenAI documentation. Going through it, I can see the agent working through the task: "I'll help you create a website-to-image generator that takes in a URL, scrapes its content using Firecrawl, and generates an image using OpenAI's DALL·E model based on the scraped content." One interesting thing: even though I specifically asked it to use OpenAI's GPT Image 1 model — I even specified it right here — it still said it was going to use a DALL·E model.

From there, it installed the packages and correctly referenced the SDKs as they should look. It's also going to include Zod. It went to create the route, and since there wasn't a directory, it made one. It set up the route, we see it's going to have our URL validation, and it set up a utils folder for that. One of the main tasks is replacing the SaaS homepage with this example: we have the main content and all of the JSX for the page. We can even see it resolving some errors — "let's fix the Firecrawl API usage in the route handler" — and it goes through, analyzes everything, iterates a few times, updates some CSS, and finally gives us some final steps. Since it couldn't create the .env.local, I'll just plug in my Firecrawl API key and grab my OpenAI API key. I'll also put links to both Firecrawl and OpenAI in the description of the video. All right, here is our application.
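The server-side flow described above (scrape to markdown, then generate an image) could be sketched like this. This is my own rough sketch against the HTTP APIs rather than the code the agent generated; the endpoint shapes are assumptions based on the Firecrawl and OpenAI docs, and all function names and the character cap are illustrative:

```typescript
// Arbitrary cap so the image prompt stays bounded for long pages (assumed value).
const MAX_MARKDOWN_CHARS = 4000;

// Pure helper: turn scraped markdown into an image-generation prompt.
function buildImagePrompt(markdown: string): string {
  const trimmed = markdown.slice(0, MAX_MARKDOWN_CHARS);
  return `Create a clean infographic-style image summarizing this web page:\n\n${trimmed}`;
}

// Scrape a page to markdown via Firecrawl's scrape endpoint (shape assumed from docs).
async function scrapeToMarkdown(url: string, apiKey: string): Promise<string> {
  const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ url, formats: ["markdown"] }),
  });
  if (!res.ok) throw new Error(`Firecrawl scrape failed: ${res.status}`);
  const json = await res.json();
  return json.data.markdown;
}

// Generate an image from the markdown with GPT Image 1 (shape assumed from docs).
async function generateImage(markdown: string, apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/images/generations", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model: "gpt-image-1", prompt: buildImagePrompt(markdown) }),
  });
  if (!res.ok) throw new Error(`Image generation failed: ${res.status}`);
  const json = await res.json();
  return json.data[0].b64_json; // base64-encoded image data
}
```

In the app-router endpoint, these two calls would run in sequence, with the base64 image returned to the client.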
It definitely is not great in terms of front-end design — this is just based on a couple of prompts, and I can see this button is even hidden. For front-end design, Claude 3.5 Sonnet and Claude 3.7 Sonnet still seem like the models you'd want to lean towards. Just to test this out, I'll try it with the Firecrawl website, and I'll click this hidden button that I know is there even though it's not showing. One thing I don't see is a loading state. I've tried similar types of applications with something like Claude 3.5 Sonnet, and most of the time it actually adds a loading state without being asked, but in this case it didn't. In here we have the instructions the LLM generated for us, and we also have the contents of the page.

Now I want to prompt it to see if it can improve the design a little. I'm going to say: I want to make the website a lot more fun. Let's give it a neo-brutalist theme, update the navigation, and add in a footer. Also, for all of the internal pieces on the main page, make sure there is an appropriate contrast ratio for the text and for anything overlaid on a button or the navigation. I'll send in that design-specific prompt, and hopefully it can improve the overall look and feel of our application — it will be an interesting test of its design sensibilities. One thing I personally haven't seen an LLM be too great at is really writing out a lot of raw CSS. A lot of LLMs are great with things like shadcn/ui or Tailwind, but when it comes to the nitty-gritty of actually writing out these styles, I've never seen an LLM that excels at CSS to the level of a human. But for things like structural elements and setting up endpoints — all of that middle-of-the-road type of task — you can see the LLM basically does this perfectly.
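For reference, "appropriate contrast ratio" has a precise definition in WCAG 2.x, and it's easy to check programmatically. This sketch computes the ratio between two hex colors (helper names are illustrative; the formula itself is the standard WCAG relative-luminance definition):

```typescript
// sRGB gamma expansion for one 0-255 channel, per the WCAG luminance definition.
function channelToLinear(c: number): number {
  const s = c / 255;
  return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
}

// Relative luminance of a "#rrggbb" color.
function relativeLuminance(hex: string): number {
  const n = parseInt(hex.replace("#", ""), 16);
  const r = channelToLinear((n >> 16) & 0xff);
  const g = channelToLinear((n >> 8) & 0xff);
  const b = channelToLinear(n & 0xff);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(fg: string, bg: string): number {
  const l1 = relativeLuminance(fg);
  const l2 = relativeLuminance(bg);
  const [lighter, darker] = l1 >= l2 ? [l1, l2] : [l2, l1];
  return (lighter + 0.05) / (darker + 0.05);
}
```

WCAG AA asks for at least 4.5:1 for normal body text; black on white is the maximum, 21:1.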
We can see that this is generating with DALL·E 3. That's not what I specifically wanted, but I can easily go in and update something like that. I always encourage you to read through everything. For a low-stakes application like this it doesn't matter too much, but for a higher-stakes application you really have to read every single line the AI generated — you can only trust so much of what it generates for you.

One interesting thing about the model is that it will do a first pass and then an iteration on something like the global CSS. In here we can see that it did add some code in an area where it shouldn't be. It was a similar case for one of the earlier prompts: it did approximately the correct thing the first time around, with some syntax errors, and then as soon as the linter flagged those errors — since it has the context of those things — it went in and resolved them. Don't get me wrong, being able to resolve that is a really great feature. But the fact that it can take a couple of iterations is something to be mindful of: we see 118 insertions the first time around, and by the time we got to the bottom of fixing that error, we also have a number of deletions and updates as well.

Now we see the neo-brutalist theme, and this is not something I would compare at all to Sonnet 3.7 or Sonnet 3.5. Let me try something really simple. I'm going to say: I want to create an index.html, a JavaScript file, and a CSS file, and within each of those I want a neo-brutalist-themed SaaS landing page. For this, I want to test out SWE-1-lite to see how well it performs with front-end design and those types of tasks. In here, I can see it generated some really nice, clean HTML for us. It still has that "SAS" typo from my dictation software, so I'll just go and update that.
Now, what's interesting is that even the smaller model — without the potential complexity of the Next.js project, the Tailwind setup, and all of the subtle intricacies to navigate there — gives us a pretty reasonable website with just simple HTML, CSS, and JavaScript files. Now, let's give SWE-1 an opportunity to redeem itself on front-end coding; we've set it up with a good overall structure. I'm going to say: now let's make the website ten times more beautiful. I want it to be neo-brutalist, with fun colors. I also want some animations as I scroll down the page. Let's add in a beautiful pricing section, testimonials, sliders and interactive elements, and all of the engaging things you might typically find on a SaaS landing page.

This is a bit more of an involved prompt. We're asking for some ambiguous things — make it ten times more beautiful — but also some pointed things, like a pricing section, testimonials, and sliders and interactive elements. It will be interesting to see how well it performs given the simpler structure, because even though this website is relatively simple, what this tells me is that the model potentially has a pretty good understanding of the coding portion; where it maybe fell apart was just not understanding the nuances of what to update and where within a Next.js project, and the subtleties of Tailwind. That could be a whole other problem in and of itself: because these frameworks and libraries change often, a model will try to write for one version of a library or framework, start to break things, and then have to resolve them.
That piece is something I've seen in all of the agentic IDEs — no one, in my opinion, has figured out the portion where the model really plays nice with understanding all of the different libraries. There are definitely some ways to mitigate that by passing in context for different libraries; even using something like Context7 is potentially a good option, which I'll link in the description of the video if you're interested.

Now, one quick aside. What I've noticed is that when it tries to make a larger edit to a file — this is a 73-line file, and it's making edits to make it larger — I get this: "deadline exceeded. Encountered retryable error from the model provider. Context deadline exceeded." It tried again: same message. Now it's trying the exact same thing a third time. That adds a bit of friction. I'm not quite sure, but it could be related to the model being free and there being a fair bit of interest right now — maybe there's a capacity issue in handling the number of requests. Well, it looks like the third time's the charm: we now have a much longer HTML file, 268 lines of code. I'll accept that while it's still writing out the CSS for us.

Then again, I ran into the same errors when it tried to edit style.css. It failed the first time, and it's taking quite a long time the second time as well. This is a file with about 157 lines of code. I'm not sure exactly why it keeps failing — I just assume it's probably related to some infrastructure issues. And I do see here that this Cascade error failed once again. If I refresh the page to see what at least the HTML added for us, I see this animation of the text, some nice icons, and some hover effects as well.
What I assume it's trying to write out right now is all of the styling for this section here — we can see it's all just left-aligned. It's taking several minutes just to generate that CSS, and this is within a very basic structure. While it's still working through the CSS file, I want to show the actual chat conversation. It was able to initially create the HTML, style.css, and the scripts without an issue. But when I asked for edits within those three files, the HTML errored out twice before it was updated on the third attempt. And for style.css, I see it error out once, then twice, and now it's on a fourth iteration of trying to update it.

In complete transparency, that's just one thing I want to mention — one issue I'm running into. This isn't a knock against Windsurf; I love the product and definitely think it's a really good agentic IDE, but this is just to highlight my overall experience, and I'd imagine others will run into something similar. Hopefully someone on the team sees this, takes it as feedback, and irons out the issue so others won't hit this type of problem. And finally, after the fourth attempt, it looks like it gave up entirely.

That's pretty much it for this video. Kudos to the team for actually releasing models and taking a stab at the frontier within their editor. It will be interesting to see whether Cursor takes this path of having their own frontier models. From what I understand, they do have models for certain tasks, like applying edits to different sections of your codebase. What I'm really curious to see is whether Cursor takes a similar path, or continues on the route of leveraging the latest and greatest LLMs within their platform.
Overall, I'm really curious about everyone else's experience — I don't know if what I just demonstrated here is a one-off. One thing I will say is that I tried to record this type of demo twice and ran into issues both times. It wasn't as seamless as leveraging Sonnet 3.5 or Sonnet 3.7 for this type of task; had I used something like Claude 3.5 Sonnet, I probably would have gotten to a reasonable-looking application pretty quickly, despite what some of the benchmarks here suggest. The model might be okay at certain types of programming tasks, but overall, you got a demonstration of my impressions of it. That's it for this video. If you found it useful, please comment, share, and subscribe. Otherwise, until the next one.