Every time a new AI model comes out, I try to code with it. Sure, I do the fun little one-shots, or three-shots, however many shots it takes, but then I use it the way you actually would as a software engineer, and that is in Windsurf or Cursor or whatever you use: something that is integrated into your development environment. Huh, somebody should really coin that term. "Integrated development environment" has a ring to it.

So I figured I'd do all of that with the five most popular AI models for coding today. That means going over everything I've experienced coding with these in real codebases, refactoring some simple code in the browser interfaces, and seeing if each one can one-shot a p5.js game. We'll discover their strengths and their weaknesses, as well as which one is better for which tasks. What you will be seeing is quintessential vibe coding, because even though I do like Tab Tab Tab, that does not use the model you choose over here; it uses some Windsurf built-in thing. So in order to get an actual test of the models, we'll need to be prompting over here and seeing how each one integrates into the codebase.

Starting things off with Claude 3.5 Sonnet. Honestly, this was my first foray into AI coding, and it was awesome. It is incredibly precise. It executes exactly what I ask it to with minimal wandering, almost no wandering in my experience, but it also gets full context of everything that it needs. You ask it to do one thing in a particular code file, and it'll see, "Oh, this is pulling in that file over here and that file over there, so let me analyze everything that is called within this file so I have full context, and then I can write the best code possible." It also keeps very good context. I don't have to reiterate something I mentioned five messages prior, like with some of these other models, because it remembers. It is kind of slow, but I'd rather have it work slower with less debugging on my end than work faster and then have me spend a whole crap ton of time debugging that code. Even though this is an older model, it is still one of the best. If you want something that is very precise and ideal for tasks that need careful, accurate execution, this is your model.

I do want to be very clear about something: this video is not sponsored by Windsurf or any of these AI models. It's sponsored by Micro Center, which, as you may know, carries basically every monitor you could imagine, keyboards, mice, pre-built computers, computer components so you can build your own, cables, switches, anything you could imagine. If you've ever wanted an Apple product, now's probably the time to get one, because anything from Mac Studio to Mac Mini to MacBook Air to MacBook Pro, all of it is on sale right now at Micro Center. And I do want to show you that these are their recent products. These aren't old products they're trying to get off the shelf; this is early 2025 and late 2024, with the new Apple M4 chip. If you're one of the lucky ones who has a Micro Center near you, definitely go in store. There are a lot of very helpful individuals there. When I was there earlier getting my Minisforum mini PC
to set up NixOS, I was trying to figure out how I wanted to set it up with my other PC, for content and for coding, and then my KVM switch. I didn't really know how to go about the KVM switch stuff, and they were a huge help. But if there isn't one near you, click the links in the description below and you'll be able to see all the sales and search for everything you want. And if you live in Santa Clara, California, there's actually a Micro Center store coming to you soon, so make sure you check it out.

All right, back to the video. However, 3.5 Sonnet does tend to play it a bit safe. It may analyze all of the files related to this file, but it won't refactor those files where it sees something has gone wrong or could be improved. That is a good thing because it stays on task, but it's bad because it doesn't improve things where there could be improvements, like Claude 3.7 Sonnet does.

3.7 Sonnet is, I would say, overly ambitious. It reads more than it needs for that specific file, it feels like, and then with every single file it reads, it's like, "Oh, this could be refactored a little bit, or this function can be deleted, or this over here and this over there." And it just sticks its arm into everything, to the point where you have five or six diffs that you have to review before you can accept the code when you only asked for one. So it is like 3.5 with a bit more horsepower, if you will, but it's not as focused. That ambition often leads to overreaching, where maybe it will delete a function over here but forget that it's supposed to replace it with something else, or you're looking at it like, "Why did it delete that function? I need that function for this file over here." That's not a good thing. And when it comes to the extended thinking mode, I don't like it. It hallucinates too much, it takes too long, it's too expensive, and it tries to be excessively complex, so thinking is not really an option for me. As for 3.7 itself, I wouldn't really recommend it to anybody, because it feels like just a worse version of the new Gemini 2.5 Pro.

Gemini 2.5 Pro feels like all the best parts of 3.5 and all the best parts of 3.7 combined. It's just as accurate as 3.5, if not more, and it has amazing breadth like 3.7, but it doesn't touch as much unrelated code. Like I said, 3.5 will analyze all of these files but only write code where you wanted it to write code, and maybe wander a little bit. 2.5 Pro, unless you explicitly tell it otherwise, will actually recommend revisions for those files, whether it's refactoring or things of that nature. And it has such a large context window that it doesn't delete a function over here and forget about it when it was really supposed to replace it, like 3.7 does. It remembers everything that you asked it to do and everything that it has seen, thanks to that large context window. I've found that the mistakes it makes are much more minor than what you will see in a lot of other models as well. So sure, sometimes you don't want the AI model to touch code that you didn't tell it to touch, but if there is one that does it and does it right, it's Gemini 2.5 Pro. So if you have a large codebase, a lot that needs to get done, a huge refactor in mind, something a bit more complex or high stakes, this is the model I would recommend. Oh, and real quick: Gemini 2.5 Pro is my go-to right now, even over 3.5 Sonnet, because even though it is, again, a little bit broader, the code quality for all of it appears to be better for me.

Then we have the model that feels like the opposite of 3.7 Sonnet, and that is o3-mini, medium reasoning, in Windsurf. Where 3.7 Sonnet likes to reach everywhere and touch every piece of code, o3-mini does not like that at all. It barely even writes all of the code that you ask it to. It writes most of it, and then you have to do manual iterations: "Oh, you need to add this." "Okay, let's add just that one line right there." "Oh, and then you need to add this." "Let's add just another line or two there." Eventually you get code that is precise and typically accurate, but only over many manual iterations and without context of the larger codebase, because it doesn't really analyze much of the codebase around it at all. So if you want more control, more precision if you will, over exactly what is happening than something like 3.5 gives you, then o3-mini may be your best bet. But at that point I would just be inside the code file using Tab Tab Tab. That's what it feels like: a less convenient version of Tab Tab Tab, because you have to prompt.

I don't know if y'all saw this in the B-roll, but this was the craziest one. This was the last time. It wrote some code and then it said, "I've updated this. Please test the button." Okay. I said it worked, now let's store the data, and it says, "I'll update this." I say yes, let's apply these changes. "I'll now update this." Wait, you didn't do the previous changes. "I'll update this." I say okay, yes, code these changes in. "I'll now apply these changes." "I'll now apply these changes." Okay, so we currently have this, and I want to do an actual prompt, and it's like, "I'll now apply these changes." Okay, please do. And then it wrote the code as a diff and didn't add it to the codebase. So I said, add it to the codebase, and then it says, "I'll now apply these changes in a single edit." That's o3-mini in Windsurf. I don't know how it is in Cursor, but that is a horrendous user experience if I've ever seen one.

And finally we get to GPT-4o, which is supposed to be one of the best coding AI models out there according to this benchmark, with its new, I think, March 26th update. You know, the update that also came with the image generation update, with all the Studio Ghibli this and that. Yeah, they also updated its coding ability. What it feels like is that it's trying to be Claude 3.5, but it's just not as good, as accurate, or as precise, and it has more hallucinations. Something it really likes to do, for whatever reason, is overwrite a lot of code with the exact same code. The only thing it's better at than 3.5 is that it's faster, but if something's going to be faster and way more wrong, then I'd rather the other one take longer. You really need heavy code review. Or in other words, don't use 4o for coding. Use it for chat. It's a wonderful chat companion when I just want to bounce some ideas off of it and things of that nature. It'll tell me, like, "Bro, you're cooking now. That's low-key fire, dude." Or, I don't know what the lingo is nowadays, but that's how it tries to talk. It tries to be very personable, but that has nothing to do with coding.

Now let's see which one is the best of all. Unfortunately, Claude wants to charge money for 3.5 Sonnet, yet I can use 3.7 Sonnet for free, so we're going to skip 3.5 and go straight to 3.7, which in theory, in other tests that I've done previously, is better at trying to one-shot things. But spoiler alert: it's not the best. Let's see how it does. So I'm entering this prompt: "Make an addictive launch-style game like Kitten Cannon. Use p5.js only, no HTML. Show instructions on screen. I like pixelated animals, funny physics, and random obstacles that send you flying or stop you cold." After about a minute and 40 seconds, this is what we have. I'm going to open up the p5.js web editor, throw it in, and launch my... Yeah, so that wasn't really expected, but that should be an easy fix. So if I just say, "It works, as in there are no errors, but the screen isn't following the character once launched. I also need to be able to aim up and down to adjust trajectory." Well, that should fix everything, right? Well, it fixed those aspects, but now we have floating obstacles that are... I don't even know what to say. I'm sure we could have fixed it, but... yeah, I'm sure I can fix it: "The obstacles are floating up and down in odd ways. They should stay where they are." And if we try it again... okay, that didn't fix it. We're done. Wait, are we going backwards now?

Let's move on to the next one, that being Gemini 2.5 Pro, which gave me an error. Then I tried to fix it, and the game worked; however, I got a collision error, which it was able to fix with another prompt. And then this is the game. Pretty dang good if you ask me. It wrote all of the code and just needed a couple of error fixes, which it did itself, in true vibe coding fashion. Definitely better than 3.7.

ChatGPT 4o: I forgot to hit record at the very beginning, but I did give it the same prompt. All of these are the same prompt. And it did in fact work on the very first try, depending on what you consider working. My issue here is that there were too many things wrong. There was no charge, there's no aiming, the camera didn't work properly, it didn't launch it very far, and the pixels are floating above the ground, but at least they're not moving like in 3.7. So in all honesty, I didn't give this as many shots as the others, but I don't think it deserved them.

And I did get o3-mini-high to give it a shot. However, I forgot that it had memory of other chats, which I think is why it looks kind of similar to what 4o gave me. It does have an interesting launch system, but it's not a very powerful one. I actually need to try this again. Oh, that sends it way further. I thought it was just bad, like low power, but it just depends on how far you drag this. It looks like it doesn't have any obstacles past a certain point. Yeah, interesting. So it did not generate anything infinitely. And red ones slow you down, green ones speed you up. That's actually a very cool mechanic, I ain't going to lie. However, that means it didn't listen to the prompt. I said random obstacles that send you flying or stop you cold. So I guess the green doesn't send you flying but gives you a boost, and the red doesn't stop you cold, it only slows you down. So while cool and unique, it didn't exactly follow the prompt.

So what that means is Gemini 2.5 Pro, even though it took three iterations, where it wrote most of the code at first and then I had to fix two errors, produced the best game and the one most accurate to the prompt. o3-mini comes in at number two. Even though it didn't listen exactly to the prompt, it did one-shot 200 lines of code, had some pretty cool mechanics, tried to put its own unique spin on the game, I suppose, and it worked. Whereas, I don't know, 3.7 and 4o, I don't even think they deserve third and fourth place, because those were kind of trash.

And now for the Rust refactoring. What all four AIs got right was changing the Vec in is_safe to a slice, which just avoids unnecessary cloning. And they all changed windows(2).next() to windows(2).all(), which is just more efficient, more readable, and more idiomatic. It's better. Then Claude, GPT-4o, and o3-mini-high used expect instead of unwrap, which gives a better message but still panics, whereas Gemini 2.5 Pro used Result here, this operator, and match logic, so it logs bad lines and just keeps going. So they all work, but it seems like 2.5 Pro is just quite a bit better. Those same three cloned the full vector and called remove(i), like this, which is just inefficient, whereas Gemini builds a new vector while skipping one index, using .filter_map or slicing, which is more efficient, with less memory churn. And then what's interesting is that Gemini and Claude both had report.len() less than 2 return true, which, I mean, is just logical, but the OpenAI models returned false, which is technically incorrect. Claude and o3-mini both used .map().sum() chains, which is nice. It works perfectly fine, with limited error handling. 4o and Gemini used for loops for this specific thing, which is not as elegant, if you will, but does allow better error handling, and if you need more control, it allows for that too. But is it necessary in this instance? I'll let y'all be the judge.

So what we have here is 2.5 Pro appearing to be a lot better. 3.7 Sonnet would come in second place because it did some of the things that 2.