Hi, welcome to another video. A new AI agent has come to the market, and it's called UI-TARS. It's apparently the best AI agent that can control a whole computer, so let me tell you everything about it. But before we do that, let me tell you about today's sponsor, NinjaChat. NinjaChat is an all-in-one AI platform that gives you access to more than 10 models, like Claude 3.5 Sonnet, GPT-4, and Gemini, plus image generation models like Flux, video generation models like Kling, and much more, all in one place, for a price that's even cheaper than one ChatGPT membership, starting at only $11. Not just that, they have a bunch of AI tools that help you use these models in intricate ways. They've also recently added an artifacts feature to their platform that lets you generate code, preview it, and share it with others using preview links, which is great; it can even run Python code and create charts. You can check them out through the link in the description, and make sure to use my coupon code king2 to get an additional 25% off these
already great deals. Now, let's get back to the video. UI-TARS is a model that you can pair with any implementation that can control a computer, like Browser Use, or with its own implementation, since they have shared one of those too. There are 2B, 7B, and 72B models. These models are specially trained for computer-control tasks: detecting what's on the screen and predicting what should be done in the next step. It's a vision model, which is expected, and if we look at the benchmarks, it beats the previous SOTA models with just the 7B model. Not just that, the 72B model takes an even bigger leap, which is quite good. It also beats Claude 3.5 Sonnet at computer use, which is impressive. This model is fine-tuned from Qwen2-VL, which is good to see, as that's already one of the great models for vision tasks. It works very similarly to Anthropic's Computer Use: it takes in the image, or the screenshot, thinks about the next step, and then gives back the exact coordinates where the mouse should click, or what the keyboard should type, and so on. It uses a kind of tool calling to do these tasks; I believe it uses PyAutoGUI to control the computer, similar to most of the implementations you'll see. It's compatible with Midscene as well, since that's from the same company, so that's good too. The 2B model is also nice for people who want super fast speeds locally on a not-so-powerful computer, although the 7B or 72B model is more recommended for use now.
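The loop just described (screenshot in, an action with exact coordinates out, executed by PyAutoGUI) can be sketched roughly like this. Note that the action-string format below is an assumption for illustration; the exact grammar the model emits is defined by the prompt templates in the UI-TARS repo:

```python
import re

def parse_action(action_text):
    """Parse a model action string such as "click(start_box='(321,654)')"
    into (action, x, y). This regex is only an illustration; check the
    UI-TARS prompt templates for the real output grammar."""
    m = re.search(r"(\w+)\(start_box='?\((\d+),\s*(\d+)\)'?\)", action_text)
    if not m:
        return None  # model replied with something that isn't a box action
    return m.group(1), int(m.group(2)), int(m.group(3))

# A controller would then hand the coordinates to PyAutoGUI, e.g.:
#   import pyautogui
#   action, x, y = parse_action(model_reply)
#   if action == "click":
#       pyautogui.click(x, y)

print(parse_action("click(start_box='(321,654)')"))  # → ('click', 321, 654)
```

This is the same observe-think-act cycle the video describes: the only model-specific part is parsing the action text back into concrete mouse and keyboard calls.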
You can use it with anything, like Midscene or Browser Use, but they also have their own implementation that we can use locally, and that's open source as well, so let me show you how to use it. First of all, you'll need to get the model up and running, which you can do quite easily with vLLM: just do pip install vllm and then launch the model with its serve command. I'd recommend the 7B model, or if you're a little too limited, the 2B model; if you want better performance, you can use the 72B model. If you wish to use it for free, you can also run it on glhf.chat and use it via an API from there. Anyway, now that we have it running, we can go to the repo, and this is the implementation. It looks very similar to the Claude Computer Use implementation they showed in their demos. To use it, you'll first have to clone it to your computer. Once you have done that, get into the folder, run pnpm install to get the packages installed, and then run pnpm run dev to get everything running. This is what it looks like; it's pretty simple. You can just send a prompt here and it will start controlling everything accordingly, but we'll also need to set up the model through the settings option: just enter the base URL of Ollama, vLLM, or whatever you're using, along with the model name and so on. You can probably also use it with any other model that supports vision, if you wish. Now let's test it out with some tasks. Let's ask it to simply go to Google and search for AI Code King. I'm starting it from the desktop. Also, I'm in a virtual machine, and I'd urge you to mostly use a virtual machine too, so that it doesn't delete all your stuff or something along those lines. Anyway, you can see that it takes a screenshot of your whole setup
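To recap, the setup steps just described look roughly like the shell commands below. The Hugging Face model ID and the GitHub repo path are my assumptions based on the official release names; double-check both against the project's README before running:

```shell
# 1) Serve the model locally with vLLM (exposes an OpenAI-compatible API)
pip install vllm
vllm serve bytedance-research/UI-TARS-7B-DPO --port 8000

# 2) Clone and run the reference desktop implementation
git clone https://github.com/bytedance/UI-TARS-desktop
cd UI-TARS-desktop
pnpm install
pnpm run dev
```

The base URL you enter in the app's settings would then be the vLLM endpoint (for example http://localhost:8000/v1), plus the model name you served.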
here. If we wait a bit, you can see that it first marks the place where it needs to act; it's currently clicking on the Microsoft Edge icon, and after that it does each step one by one. It's also pretty fast, which is great, and it can be fully local as well if you use Ollama. Anyway, if we wait a bit, it's now done, and it did this pretty well. As I said, it goes step by step, and it worked well, which is what's needed. It's as good as Sonnet at these tasks, which makes it super useful when you need an agent to control a browser or something, and you can use the model with anything, because it's basically just a model. Now let's try something more complex as well: let's ask it to go to Google Flights and search for a flight. If we send it over, you'll see that it again starts going through everything: it first opens up the browser, then goes to Google, searches for flights between the cities, clicks the Flights tab, and then just stops there. So it did
this pretty well; it works surprisingly well. I didn't think it would perform this well, but it does, which is great. I have tried it on some other tasks as well, and it performs amazingly. If you're using something like Claude for these computer-control tasks, I'd recommend you use this instead, because it will be insanely cheaper than Claude, and it performs better at these tasks. You can also use this model in any contraption you want, whether that's Browser Use, Midscene, or anything of that caliber, and you can run the model through glhf.chat as well. I'd like some providers to start adding these models to their APIs so that we can use them more effortlessly. I think this model is really good, and at agentic tasks it's pretty amazing; overall, it's pretty cool. Anyway, share your thoughts below and subscribe to the channel. You can also donate via the Super Thanks option, or join the channel and get some perks. I'll see you in the next video. Bye!