So OpenAI have just gone ahead and released their first AI agent, and this is truly a game-changing product, as it sets the benchmark for what will be a glorious future of many AI agents doing various tasks for us in the background while we get on with real productive work. I think this is going to completely change the game, and there are many things that most people did miss, so let's dive into the first demo, and then I'm going to explain to you guys more about this AI agent.

Okay, so this is the Operator homepage. It lives at operator.chatgpt.com. It'll be accessible as soon as the live stream is over. As you can see, the interface is very similar to ChatGPT: you can type in a prompt and Operator will try to execute the task to the best of its capability. You'll also see we have a list of pre-filled prompts here. These are not really meant to be recommendations; they're meant to give you an idea of what Operator can do. We have also collaborated with various brands like OpenTable, Allrecipes, StubHub, Uber, Thumbtack, DoorDash, eBay, and Target to make sure Operator really works well on these websites, but also because we think users will find Operator very valuable in interacting with these platforms. So with that, let's jump in with the demo.

Okay, so I'm going to start with something fairly simple. I'm going to use OpenTable and say, "Book me a table for two at Beretta tonight at 7:00 p.m."
Okay, and so you specifically chose OpenTable?

Yeah, in this case I'm asking Operator to use OpenTable to book a table for two at Beretta at 7:00 p.m. Beretta is a restaurant in San Francisco; it's great, you should try it out. I'm using OpenTable in this case, but I could have easily just said "Beretta" and it would probably have gone to a search engine and figured out how to make a reservation as well. But let's see what it does.

So can you explain what's happening in this?

Yeah, right. I'm going to expand this a little bit. As soon as I typed in the query, Operator instantiated a completely remote browser. This browser is running in the cloud somewhere, and as you can see, it's already up and running. My hands are off the keyboard; I'm not typing these things.

So this is just the AI clicking around?

The AI is just clicking around. It started this browser session and it knew where the OpenTable website is, which is opentable.com. As you can see, it has summarized its chain of thought here as well: it's gone to the URL and searched for Beretta. And something really cool happened: for some reason OpenTable thought we were in Virginia, and Operator autocorrected itself to San Francisco. Like ChatGPT, in Operator you can also give custom instructions, so I'm going to show this really quickly here. Okay, so I've given a custom instruction that, for queries that need it, I live in San Francisco. Operator recognized that and then autocorrected itself to go to Beretta.

Okay, looks like 7 p.m. isn't available, but you know what, 7:45 is just fine, so we're going to go do that. This is a really good example of task delegation: where Operator needs help or assistance, or just wants to ask you something, it'll come back and you answer that question.

In practice you wouldn't have had to watch this. You could have just let it go off while you're doing other things, and then it would come back and say, "Hey, I can't do seven."

Yeah, and we're starting with a web app, so you'll get notifications, etc. When Operator moves into mobile, you'll get mobile notifications, much like the interactions we have with general apps.

Okay. Yes, that's great, let's do it. So again, a very simple interaction, as you would have with an assistant: "Hey, I found reservations; 7 p.m. wasn't available, let's do 7:45." And again you can see Operator at this point has said, "Okay, should I?" This is a really good example of the confirmations work we're going to talk about a little bit later. Before doing an action which is sort of irreversible (in this case you can cancel the reservation, obviously), before taking a critical action, Operator is asking us before actually doing it. In this case I'm going to say, "Let's do it." Okay, it was pretty quick, I would say like 50 seconds.

So right there we got to see the first case of the OpenAI Operator agent. Now what's crazy about the Operator agent is that it is completely autonomous, meaning you can literally give it a task and you don't actually need to be in the web browser monitoring the agent. I think that is an important feature, because most agents that we're going to have in the future, and most of the ones that are actually useful, are ones where we can say, "You know what, go off and do this," whilst we do something else, and that is of course where the leverage is. With this agent, one thing in particular that I did find was that at certain checkpoints it did actually ask you for more permission, which is of course really good, but I do know that in future it's quite likely that these models will be able to reason by themselves and figure out how much autonomy you want to give them. Now let's take a look at what happens when you actually want to stop Operator in the midst of doing something and intervene with the task, why this is important, and how you should actually do that.

...I can do at this point, and I'm just going to click this button called "Take control". So, as we were talking about, Operator fires up this remote browser to do it. We almost think of it as a surface area where Operator can work and then I can work. For example, in this case I took over control from Operator, which is also key to how we think about users and user
controls: at any point in time, a user should be able to take control and give Operator instructions, or tell it a little bit more, guide it a little bit more, etc.

It's like passing the laptop back and forth, just like you did with Ray.

Totally, exactly right. In this case I'm going to make those two, and then I'm just going to tell Operator. This is again very much like if you and I were working together: "Hey, I did this, can you fix this?" And I'm going to tell Operator, "I added another egg. Good to place order now."

Can Operator see what you're doing during takeover mode?

Great point. When you take over, it's very much like a session with your local browser. It's completely private; Operator cannot see. This is part of the reason why I have to tell Operator. Or you don't really have to: it can look at the last screenshot and try to guess, but it's really good to tell it. It's sort of like if you and I were working together, and I went off and did something and came back: "Ray, I completely messed it up, can you fix this?" I have to tell you that. So in this case I'm going to tell Operator, "Hey, go ahead," and now I'm passing control back to Operator. It's a completely private session when you take over control. You'll also notice that I'm logged into Instacart here. I did it before the demo, and it has been logged in for a while now. It's again very much like your local browser: when you log into Instacart, you stay logged in until the cookies are cleared. And we have really good controls: you can go into settings and control and remove that at any point in time.

Now what we actually need to take a look at is how the Operator agent truly works. Most people don't realize that this is an agent that works just like a human: it looks at the screen, it analyzes the pixels, and then it reasons about where it should click next and what action it should take next, based on the end goal and the prior
steps taken. The reason I think this is really important is that it means this kind of tool is going to be able to generalize across a wide range of different tasks in the future, meaning that these simple tasks we're now doing in the browser are going to become a lot more complex as time goes on and as the reliability of the model increases.

Let me talk a little about the research behind it. Operator is based on a new model we've trained at OpenAI, which we're calling the Computer-Using Agent, or CUA for short. CUA is a model built off of GPT-4o, but it's also trained to use and control the computer in the same way that humans can, by just looking at the screen and using a mouse and keyboard to control it. Before, if you wanted to build something like Operator without CUA, you'd need to use some specialized APIs. For example, if you wanted your model to buy stuff from Instacart, you'd need to figure out if Instacart had an API, you'd need to figure out if that API had all the functions that it needed, and you'd need to give your model the specs for that API. But if your site, like most other websites, did not have an API, then you're out of luck.

This is just using screenshots? No API, nothing?

Just no API, yes. And that's where CUA comes in. By teaching a model how to use the same basic interface that we use on a daily basis, it unlocks a whole new range of software that was previously inaccessible.

And so this is keyboard and mouse, right? It's kind of just using keyboard and mouse?

Yes. And that's really what the CUA research project is about: it's about removing one more bottleneck in our path towards AGI.

So you can see right there, they actually just spoke about how that is of course one of the bottlenecks towards AGI, and they are trying to remove it. Now this is a system that is actually quite good at browsing the internet, but it isn't as good as humans just yet, and you might be
wondering, "Okay, well, if that is the case, where do I stand in terms of my job security? How well can it use a computer compared to an average human?" So they actually did a benchmark where they pitted the Operator agent against humans. We can see the results of these tests, and exactly where humans are in comparison to these models. I think it's going to be super interesting to see how these benchmarks change over time and how these models manage to rapidly get better.

That said, we can look at a few benchmarks and kind of quantify how good Operator is right now. One of the first benchmarks we're going to look at is called OSWorld. OSWorld is an eval that measures how well AI agents navigate common operating systems like Linux. On this task, CUA gets a 38.1% score, which is higher than other publicly published results. Human performance on this task is 72.4%, so we still have room to grow, definitely.

The other eval we'll take a look at is called WebArena. WebArena is an eval that measures how well AI agents navigate some common websites, like e-commerce websites or social forum websites. On this task, CUA gets 58.