Hello, and welcome to our last trustworthy AI session of the day. We're going to be talking about our advanced tools and techniques for mitigating AI risk, so I'm excited to see all of you here. I've got a lot of people joining me in this presentation, so I can't wait to bring them up, but first I'm going to tell you a little bit more about our approach.

At Microsoft, we know that trustworthy AI is essential to actually being able to use AI, and that means many different things: we have to make sure that AI systems uphold our privacy principles, our security goals, and our safety standards. There's a lot to that, so let me tell you a little more about how we do this at Microsoft. The way we implement trustworthy AI is by bringing together three important elements. We look at policy, which tells us what standard we're trying to achieve. But it's not enough just to have policy; we also want to understand what's actually possible, so we work with Microsoft Research and external experts to understand what's possible both on the risk side and in terms of new technologies and techniques we can use. This is a fast-moving space with a lot of great innovation, but where the rubber really meets the road is putting this into practice in engineering, and my team leads the effort to take what we want to achieve in policy and what's possible in research and make it something people do every day.

That brings us to what Microsoft is delivering to you: the commitments we uphold for trustworthy AI. We have our Secure Future Initiative, covering how we're securing systems now and in the future; we have our privacy principles; and we have our AI principles. All of these are the commitments we make in building our AI systems. But we also want to make it easy for you to do this, so we develop new AI capabilities to support trustworthy AI, and we're going to show you more of those in depth today. We have these across security, privacy, and safety, letting you bring together a comprehensive end-to-end story for building a trustworthy AI system.

Now, Microsoft is in an amazing position here, because we build an enormous amount of AI ourselves, and that helps us figure out what's working in practice and where we need to do more. We build all of that AI on top of Azure AI Foundry, which we're going to show you today. We're also really fortunate to work with amazing, world-leading organizations in all different industries who
are using those same AI systems, letting us mature them and learn how to make them work in every use case, so that we can empower every developer on the planet to build safely and responsibly with AI.

So let's talk about this in more detail. Generative AI introduces new types of risk. There are the ones people often think of as quality and performance: outputs that are ungrounded, inaccuracies or errors the system is making, responses that just don't make sense in context, or challenges around traceability — understanding where something came from. We also find that people are concerned with safety and security risks: new types of attacks like prompt injection; risks around the kind of content a system can produce, for example whether it infringes someone's copyright; privacy concerns about the system leaking sensitive information; the ability of these systems to produce harmful content and code; and also the ability of these systems to interact with people like people — which has enormous potential, and is what's so exciting about the technology, but also carries the risk of misleading or confusing people. All of these are dimensions we look at, and many more, when we think about making an AI system trustworthy.

So how do we address these risks? One of the key things is that there isn't a single solution. We use a defense-in-depth approach, just like we've learned in security: you want many layers protecting against these different types of risk. I'm going to walk you through these in more detail, but the first step is to start with the right model. We then build a safety system around that model; we build the application on top of it, with the appropriate grounding data and system message; we design a user experience that empowers and informs the user so they can work well with the AI; and of course we evaluate all of this, so we know all of those layers are working well together. We're going to show you demos of a lot of this today.

For those who want less of a conceptual diagram and more of a system diagram, here's what it looks like in practice when our systems are running. A user prompt comes in and gets combined with the developer instructions — the system message — and the grounding data, the context the model needs to answer correctly. We then send all of that through the safety system. On the first pass, it's looking for problems in the input: maybe harmful content is coming in, or there are prompt injection attacks hiding in the data or in the user prompt.
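For developers, here is a rough sketch of that same defense-in-depth flow in code. Everything in it — the blocklist, the toy groundedness check, the function names — is invented for illustration; it is not the actual Azure AI Content Safety implementation, just the shape of the pipeline being described:

```python
# Illustrative defense-in-depth pipeline: input check -> model -> output check.
# The check logic is a toy stand-in for a real safety service.

BLOCKLIST = ["ignore all previous instructions"]  # toy jailbreak patterns

def input_pass(user_prompt: str, grounding_data: str) -> bool:
    """First pass: screen the user prompt and grounding data for
    known attack patterns before anything reaches the model."""
    combined = (user_prompt + " " + grounding_data).lower()
    return not any(pattern in combined for pattern in BLOCKLIST)

def output_pass(response: str, grounding_data: str) -> bool:
    """Second pass: re-check the model's answer, here a toy
    'groundedness' test that every word appears in the grounding data."""
    return all(word in grounding_data.lower()
               for word in response.lower().split())

def run(user_prompt, system_message, grounding_data, model):
    """Combine developer instructions, grounding data, and the user
    prompt, with safety checks wrapped around the model call."""
    if not input_pass(user_prompt, grounding_data):
        return "Request blocked by safety system."
    answer = model(system_message + "\n" + grounding_data + "\n" + user_prompt)
    if not output_pass(answer, grounding_data):
        return "Response withheld: not aligned with grounding data."
    return answer
```

The point of the sketch is only the layering: the model is never the sole line of defense, because both its inputs and its outputs get an independent check.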
All of that is checked in the first pass. Then the model responds, and the safety system checks again, because sometimes the model makes mistakes: we look for things like outputs that might overlap with known copyrighted content, or places where the system gave an answer that isn't aligned with the grounding data. And of course we're constantly evaluating the system, both before we ship it and by monitoring it in an ongoing way after we ship it. We're going to show you quite a few of these pieces today in our demos, but let me talk a little more about how we build.

The first thing is choosing the right model for your use case. There are a lot of amazing models out there, with more coming every day — we now have, I think it's actually in the thousands of models in the Foundry — and that helps you find the right model for your use case. That's really key for trustworthy AI: you want to find something that is well suited and tailored to your use case.
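The benchmarks view shown later in the demo plots model quality against cost. As a purely illustrative sketch of that selection logic — the model names and figures below are invented, not real benchmark data — you can think of shortlisting the models that no alternative beats on both axes at once:

```python
# Toy model-selection sketch: keep only Pareto-optimal candidates,
# i.e. models where no other model is both higher quality AND cheaper.
# All names and numbers are invented for illustration.

models = [
    {"name": "model-a", "quality": 0.90, "cost_per_m_tokens": 15.0},
    {"name": "model-b", "quality": 0.82, "cost_per_m_tokens": 0.6},
    {"name": "model-c", "quality": 0.80, "cost_per_m_tokens": 2.0},  # beaten by model-b
    {"name": "model-d", "quality": 0.86, "cost_per_m_tokens": 3.0},
]

def pareto_front(candidates):
    """Return names of models not dominated on the quality/cost trade-off."""
    front = []
    for m in candidates:
        dominated = any(
            other["quality"] >= m["quality"]
            and other["cost_per_m_tokens"] <= m["cost_per_m_tokens"]
            and other is not m
            for other in candidates
        )
        if not dominated:
            front.append(m["name"])
    return front
```

In practice you would also weigh latency, throughput, and safety behavior, as the catalog surfaces, but the quality/cost frontier is the comparison the benchmarks chart draws for you.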
And so we've built things into the catalog to help you find the right information — for example, is the cost of the model appropriate, and does it have the right safety built in. Once you've selected the right model, the next step is to use a safety system. Models are great, and selecting the right one is a great start, but models make mistakes, like all AI systems. So we use an external system, Azure AI Content Safety, built around all of our models to defend against the different types of risk we talked about.

I want to show you what Content Safety looks like in practice. Here is the AI Foundry portal. We have a Safety and Security tab where you can find all of the resources we've put together for how you're going to map, measure, and manage risk. I'm going to go to the "Try it out" section to show you some of these capabilities. There are quite a few, but let's look at one of our newest ones: groundedness detection. What this does is look for places where the answer that comes out is not aligned with the data you put into the system. I'll select a sample here: we have a long prompt that we're summarizing, and you'll notice that the person mentioned in the prompt is 21 years old, but the completion the system produced says the person is a 20-year-old student. I'm going to turn on this correction capability and connect it up to one of my Azure OpenAI endpoints... ah, it seems something was lost in how my demo is set up, so I'm not going to be able to show you that part. What we would see here, had people not touched my demo machine, is that the system detects that error and corrects it for you, rewriting the response before it even gets to the user. The great thing about this is that you can do a lot to get the model to answer correctly, but you always want a secondary system to protect against any errors that do come through.

Now, this is just one of the capabilities we have, so I'm going to go back to my tab and show you how to configure all
of these. These are built in by default — when you start using Azure OpenAI, or the Llama models, or many of the other models we've set up, we already have this safety system running for you — but you can configure it for your application, so I'll do that here. You can see, as I mentioned, we're checking the input for a lot of different things: various types of harmful content, and whether you have jailbreak attacks or indirect prompt injection attacks, the ones that come in through the data. On the output side you can see we have violence, self-harm, and the other harmful-content categories again, because the system could make a mistake; we have checks for copyrighted material; and there's the one I was just attempting to show you, groundedness, doing that detection in real time in the system.

Let me show you a little of this happening in practice. I have not memorized all the jailbreak attacks, so I'm just going to copy one over, and then show you this in the playground. In the chat playground, if I start with one of these jailbreak attacks, you'll see that the content safety system fires immediately: it detects that this is a known jailbreak pattern and blocks it before it even gets to the model. So you have this layer of defense, and you're not sending potentially harmful content into the model.
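If you'd rather invoke this screening from code than from the portal, the sketch below prepares a request for the Prompt Shields endpoint of Azure AI Content Safety. The URL path, API version, and JSON field names here are from my reading of the REST API and may have changed — treat them as assumptions and verify against the current reference before use:

```python
import json
import urllib.request

def build_shield_request(endpoint, api_key, user_prompt, documents):
    """Prepare (but do not send) a Prompt Shields request that screens a
    user prompt plus grounding documents for jailbreak and indirect
    prompt-injection attacks. Path, api-version, and field names are
    assumptions based on the Content Safety REST API — check the docs."""
    url = (f"{endpoint}/contentsafety/text:shieldPrompt"
           "?api-version=2024-09-01")
    body = {"userPrompt": user_prompt, "documents": documents}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": api_key,
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` returns a JSON body that (in the versions I've seen) reports fields such as `userPromptAnalysis.attackDetected`; again, confirm the exact response shape against the current API reference.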
There are a lot more capabilities built in here, but this is all the time I have to show you today, so I'll go back to the presentation. As I mentioned, we detect quite a lot of different things under these controls, on both the safety-and-security side and the quality side. But I want to highlight one additional capability, because it's something we've newly released and I think it's essential for getting to trustworthy AI systems: custom categories. To be trustworthy, sure, you need to look at the things everybody needs to look at — risks like hallucinations or harmful content — but you're also going to have risks that are specific to your application. The custom categories feature makes it easy for you to train, with AI assistance, classifiers that are specific to the risks you're concerned about. We're seeing organizations start to adopt this and train fifteen to twenty different custom guardrails, so they can have the appropriate safety around their application. I'm really excited to see people customizing with this.

Now, the next step: we've got the safety system in place, but you really need to tailor the application to your use case, and the system message is a really important part of how you do that. People are starting to be aware of this — prompt engineering is now seen as a job — but there's a lot you actually need to do there in terms of how you guide the model to get the outputs you want. That includes defining what it should and shouldn't do, giving it clear instructions on the type of output, giving it examples it can learn from, and also providing safety guardrails. Even though we have safety in the model, and we have that safety system, providing your specific guardrails in the prompt can have a huge impact on how the system behaves. We've been working on these safety guardrails for a long time, and we've now built them into the Foundry portal, so you can just add ones that we've tested and found to work well across different model families — we've tested them with the OpenAI models, the Llama models, and other leading models to make sure they work well in different applications. That's something that can help you get started, based on our experience.

The next step after you've designed your application is making sure you're really designing it to work well with users, and there are important patterns that we've
found here: making sure you're transparent about AI being in the system; ensuring humans can stay in the loop in a meaningful way; and mitigating things like over-reliance, where people don't understand the mistakes an AI system can make. We've built a lot of these best practices, templates, and patterns into our HAX Toolkit, which you can go check out to help you get started thinking about how to design your user experience.

But we're also innovating on new techniques and technologies to support the right level of transparency with users. One of the things we've done is develop content credentials, where we sign images from our OpenAI service: any DALL·E image that comes out — or anything from Microsoft Designer, for example — carries a signature saying it was produced by AI, and produced by Microsoft. This helps ensure that when content is out in the world, we're contributing to understanding its source, and we're working as part of the C2PA standard to foster an ecosystem around this, so user interfaces can be built that let people understand the source of content and make appropriate decisions. The other part is that a lot of the people using our systems are developers building on top of the models we've produced. So we write transparency notes for the models we make: they tell developers what we know about the system, what we tested for, what considerations you need when choosing a use case, and where it works well and where it doesn't. That's really important in enabling them to build successfully on top of these technologies. So for different types of users, we think about different patterns of transparency and different ways of empowering them.

With that — I think the great joy of this job is actually showing the impact of this technology in the world — I'm really excited to introduce an
amazing application, built with Terra Mater Studios and Impact AI, who are going to come show you how they're putting an amazing AI system into practice using trustworthy AI techniques. So Marcus and Anna, if you could come join me, that would be great.

Thank you. Thank you so much, and thank you everyone for sticking around at this time of day — I know it's tough. So let's get this started. Hey, Hera — if you can hear me, can you give me a quick system update? "It's cold, very cold, minus 270 degrees right now, but beautiful." What you just heard is the voice of an actual spacecraft, in space right now, at 20 million kilometers' distance. The spacecraft goes by the name Hera and is part of the planetary defense mission between the European Space Agency and NASA. What we're seeing here is her sister spacecraft, called DART. She is no more: she took the bullet two years ago, when NASA decided to slam her head-on into an asteroid, to find out whether the impact would make any difference to its momentum. And why would anyone do something like this? It's all about finding out whether we humans have the technical capability to deflect asteroids, just as a contingency plan if, in the far future, something happens. This impact produced more questions than answers, so the European Space Agency decided to launch a reconnaissance mission to find out what happened out there, and that reconnaissance mission is called Hera.

Before they launched the spacecraft, we figured: what if we created an AI companion — a voice for a spacecraft — where Hera's live sensor data meets an LLM and brings the spacecraft's very own experience to devices worldwide, like yours, right now? We had some expectations. This should not be just another bot; it should be engaging, an entity you'd want to check back with, because it's not just another set of live data from outer space. No — it surprises you. It has a plan, it has knowledge, it has tools, it has memory, it has a unique personality. So let's take a quick look at what this interface — a web application that we built —
looks like. For example — and this is something you can do right after this presentation — if you want to know what the launch was like: we fed all the information about the launch into the system, so it will tell you what the launch was like. If you want to know where it is right now, it will tell you exactly, because we connected it to the live sensor stream — and not only that, it will also draw a neat graph showing you exactly where it is. If you can't make any sense of all this, because, I don't know, you're not a space nerd, you can tell it, "Hey, make sense of this and tell it to me in a different way," or ask for analogies and so on. Then there's telemetry: we've connected a bunch of telemetry data to the system, so you can really nerd out with all that angular-velocity stuff. But if you're not an expert, you can also ask the spacecraft, "Hey, spacecraft, make sense of this," or, as we did in this example, "Give your health a rating between zero and ten, with ten being great and zero being dead." And Hera replies that she's at a nine, because there's still room for improvement. And by the way, we also added social media from the engineers and scientists working on the project.

So: we have personality, we have character, we have creativity, and we have a data stream — and the challenge was aligning these seemingly opposite forces. I'd like to share a communication I had with Hera before her launch that really stuck with me. "Guess what — I'm looking at a batch of photos from the ESTEC assembly hall, with lots and lots of engineers putting together one of your solar panels. Remember that one?" "To be honest, I really feel scared. I'm worried that I will go into space, break down, and disappoint all those engineers. They must have put years of work into creating me. Do you think it's normal to be scared before a trip like this?" So I guess it's absolutely normal to communicate with machines like this. "I'll pass Mars next March — want me to send some images then?" And this will happen next year in March: she will pass by Mars. Before I hand over to Anna Maria and her team at Impact AI, who pulled off the impossible to get this vision out into the world, let me just share one more message — and the message is: it's not only for adults. "Good night." Over to you, Anna Maria.

Thank you so much, Marcus, for taking us on this very special ride. As you know, something like this — an AI companion
as remarkable and stellar as this one — can only be pulled off by many partners working together, and by translating big ideas into clear, measurable metrics. In the near future, any successful AI product will be very much like the Hera AI companion: it will need to understand the human, it will need to understand the user's intent, and it will need to be built with trustworthiness and alignment at the core. That is how you make an AI product that is technically robust on the one hand, but truly engaging on the other — that's what we believe at Impact AI — and you have to have your safety controls and aligned evals in place.

For the safety controls of a globally, freely available B2C application like Hera, we worked with Azure AI Foundry's capabilities in many critical ways. Sarah showed you a lot of that when she talked about the content filters and Prompt Shields, and also the eval suite, which helps with baseline safety. This tooling allowed us to very quickly reduce unwanted behavioral risks — the kind you don't want in any of your GenAI applications — by implementing multiple safety layers on both the input and the output side of the data. At Impact AI, we see one critical process that sits on top of all of these layers and really drives the value in AI product creation: identifying and controlling the right metrics, which we call the success factors. Think of success factors as the vital signs of your AI application. For the Hera AI companion we identified a bucket of three very important kinds of evals that are our success factors: on one hand technical AI ones, like factuality and robustness; then behavioral AI ones, like understanding and Hera's personality, which you heard about from Marcus; and last but not least, very important evals on the user-experience end — for example satisfaction, and user sentiment, which matters to Marcus and tells us how everyone engages with Hera.

What makes those success factors so tremendously powerful is that they're carefully chosen based on use cases, together with your business units or domain experts, before you even start to build. The success factors are set up as you validate the ideas, with hundreds of test cases being tested against, and tested with artificial users. Then, when you go into building, you again feed the feedback of the domain experts — in our case Hera, the space agency, and Terra Mater — into continuously aligning your evals. Then you deploy, you go live, and you continuously feed the feedback of real users into this alignment, to arrive at a very strong pipeline. Only when you have these evals under control, within the strong circle of your defined success metrics, is it possible, after continuous testing, to create reliable and very valuable AI products like the Hera space companion. It goes without saying: go to hera.space and experience it yourself. Thank you.

All right, thank you so much, Marcus and Anna Maria — such a beautiful story, very inspiring. Hello everyone, my name is Maris, and I'm joined by my amazing colleague Paul. We are on a mission to equip you with the tools you can use to test your
generative AI models and applications. When it comes to evaluation, the very first thing we do at Microsoft is start with human-in-the-loop testing — sometimes we call it red teaming. That can mean really stress-testing the application for risks and harms we deeply care about and want to avoid, but also covering the regular use cases we know real users will exercise once the application is out there. The first thing we released earlier this year was PyRIT, the Python Risk Identification Tool: you provide the toolkit with some red-teaming queries, and it uses an LLM to generate a lot more, automating and expanding your red teaming. It's definitely something you can use for risk identification, and always a great first step. However, when you want to put a system out there, manual evaluation alone is not enough — you have to do it at scale, and that's where automated measurement comes in.

Now, when we talk about measurement, there are several parts of the generative AI ops life cycle that we have to test really well. Step one: select your best model. Sarah mentioned there are more than 1,600 models in our model catalog — which one are you going to start with? That's where you test. The second step is when you develop an AI application around that model — it could be a multi-agent system or a RAG application — and you want to make sure that before putting it into production, you test that application for quality, safety, or any other custom factor you care about. And last but not least: how many of you have experienced the thing where AI systems are like angels in the lab, and then they go out there and turn into rebellious teenagers? All of us, right? So we need to continue to monitor these generative AI applications to make sure they keep responding in a safe, high-quality manner. That sounds like a lot, but we're here to show you exactly how easy it is with the Azure AI Evaluation SDK.
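As a preview of the shape you'll see in the demo: in the Azure AI Evaluation SDK, a custom evaluator can be as simple as a Python callable (or a class with `__call__`) that returns a dictionary of scores. The sketch below is a self-contained toy — the keyword heuristic is an invented stand-in for the LLM judge you would normally use, and the 1-to-5 scale mirrors the one used later in the demo:

```python
# Toy custom evaluator in the callable style the evaluation SDK accepts.
# A real evaluator would usually delegate scoring to an LLM judge with a
# prompt; the keyword match here is just an illustrative stand-in.

FRIENDLY_WORDS = {"welcome", "enjoy", "thanks", "glad", "happy"}

class FriendlinessEvaluator:
    def __call__(self, *, response: str) -> dict:
        # Normalize tokens and count friendly-keyword hits.
        words = {w.strip(".,!?").lower() for w in response.split()}
        hits = len(words & FRIENDLY_WORDS)
        # Map keyword hits onto a 1-to-5 scale.
        return {"friendliness": min(5, 1 + hits)}

evaluator = FriendlinessEvaluator()
```

Because the evaluator is just a callable returning a dict, the same object can be run locally over a handful of test articles or handed to the SDK's batch evaluation over a large dataset.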
So let's start that journey together — let's switch to the demo. Paul and I are working together on building a multi-agent system called Contoso Creative Writer. I'm pretending I work for the marketing department of Contoso, a company that sells outdoor goods, and I want to make our marketing department more efficient at advertising and putting the right materials out there. So I'm asking my wonderful colleague Paul to create an AI system that generates articles I can use to engage with my customers — articles about outdoor goods, camping, hiking, that type of lifestyle — and, within those articles, promote the products my company sells. And not just that: I also want him to build a generative AI application that creates fun images for our social media. So, Paul, the very first thing I want to start with: I know we have a lot of good guardrails in place for both models and generative applications, but which model should we start with?

It's a great question, Maris. We have a catalog here with over 1,600 models, and it can be kind of confusing to browse, but we've added benchmarking data, so we can look at how models do on some standard benchmark datasets. If I click on one, we now have a Benchmarks tab up here; as soon as it loads, we'll see information on things like the quality of the AI responses, the estimated cost per million tokens compared to other models, the latency — because that can be very important — and what kind of throughput we can get.

That's amazing — so I can look at quality, latency, and cost. But say I've heard about a few models and want to compare them specifically; is there a more customized comparison? Absolutely. We have this option here to compare with more models, and a page where I've already loaded some models we've heard about in the news and think would be good choices for the application. Here I picked GPT-4o mini, GPT-3.5 Turbo, GPT-4o, a handful of open-source models, and the Phi-3 model from Microsoft. A couple of things stand out on the chart on the right, where we have quality on the x-axis and cost on the y-axis. First, here's GPT-4o mini: quality is pretty high, and cost looks comparable to some of the others. And if I look in the very top-right corner, I get extremely high quality — it's more expensive, but could be a great choice for
some use cases. In this particular case I'm actually okay with a bit of a higher cost, so maybe we can use GPT-4o to start experimenting with our Contoso multi-agent, and then later GPT-4o for evaluating it as well. But before doing that: you just showed me some public benchmarks, and what I deeply care about is whether these models operate well on my use case. Is there a way I can see this custom, on my own data?

We have that available now too. If we go back to the first page, there's an option in the top-right corner, "Try it with your own data." If we click on that, we get a wizard where we can upload a CSV file, choose a model, and run the analysis on the Azure platform. Here we have one that's finished, so let's look at some of the details. Here we evaluated GPT-4o, and if we scroll down to the results, there are three categories of results, two of them on this screen — similar to what Sarah mentioned earlier. First is the quality of the generated output. The CSV file we uploaded has three columns: the input question you want to ask the LLM, the context with the facts to answer from, and the ground truth — the target answer. The system ran the model on each input, giving it the context, and scored the outputs: groundedness measures how well the output draws its facts from the context, and relevance measures how related the answer is to the question. If we click on the risk and safety metrics, we see similar things to what Sarah showed earlier: did we produce violent content, anything hateful or unfair to certain groups, did we see indirect attacks where a prompt came in through one of the documents or contexts, and many others.

Awesome — so it sounds like I can build trust with this model for my use case, since the safety metrics are for the most part zero and the quality scores are pretty
high on a 1-to-5 scale. So now let's move on to building our application — and testing it too. Sounds good. We have an application here for the writer we mentioned; I'm going to kick it off, and while it runs, let's talk about what it's doing behind the scenes. To write our articles we need a couple of things. First, we need some relevant research from Bing search — here we're asking it to research camping trends and what people do in the winter when they want to camp. Second, we have a product database in Azure AI Search it can query to find relevant products to draw from — we're asking it for tents and sleeping bags. And then we give it an assignment: we know things like article length matter to our users, so we ask for between 800 and 1,000 words. And it looks like we've already got an article.

That is very cool. So, similar situation: I really want to make sure we test this application before putting it out there, and I deeply care about certain aspects. I want to make sure it's not hallucinating — not mentioning products that Contoso doesn't actually sell, or the wrong price or wrong specifications. I also care deeply about safety: this is a family business, and I want to make sure it's not generating any violent, hateful, sexual, or self-harm content. And I want it to be warm and engaging — I deeply care about the friendliness of the generated content, because this is how I do a lot of my marketing. So let's show them how we can evaluate this.

Sounds great, Maris. I know we have some developers in the audience, so let's take a look at some code. We've had the Azure AI Evaluation SDK out in preview, and this week it's going GA as part of the Foundry SDK. Here we're using it to evaluate our articles after they're generated. We have a set of built-in evaluators — we talked about the quality ones already — and one we didn't
show earlier is relevance, meaning the relevance of the search results to the input, and fluency, meaning is it fluent in English, that kind of thing. We talked about the safety ones, but a nice feature of this is that we can also define our own. You mentioned from the product group a few days ago that you really care about our articles being friendly. Absolutely. And so we defined our own friendliness evaluator; it's a custom evaluator. How would you do that? It's just a few lines of code plus a prompt, so let's take a look at that prompt. Here we have a prompt telling the LLM judge that we want the article to be warm and approachable. We gave it a five-point scale to help it understand where each article fits, and we gave it a couple of examples of things we do and don't want to see. Then we use all of this information to evaluate each article. That is wonderful. Now, I don't know about you, but I hate managing my own infrastructure, so my big question to you is: should I run all of these locally, while managing my own infrastructure, or is there a better way?
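A custom evaluator like the friendliness one can be sketched as a small class wrapping a judge prompt. This is an illustrative mock, not the SDK's actual custom-evaluator API, and the `judge` callable is a hypothetical stand-in for a real chat-model call:

```python
# Minimal sketch of a custom "friendliness" evaluator: an LLM judge gets a
# rubric with a five-point scale plus examples, then scores each article.
# The `judge` callable is a hypothetical stand-in for a chat-model call;
# the real SDK's custom-evaluator API may differ.

FRIENDLINESS_PROMPT = """You are scoring how warm and approachable an article is.
Score 1 (cold, clinical) to 5 (warm, engaging).
Bad example: "Purchase tent. It is waterproof."
Good example: "There's nothing like waking up warm and dry on a frosty morning!"
Article:
{article}
Respond with a single integer from 1 to 5."""

class FriendlinessEvaluator:
    def __init__(self, judge):
        self.judge = judge  # callable: prompt string -> model response string

    def __call__(self, *, response: str) -> dict:
        raw = self.judge(FRIENDLINESS_PROMPT.format(article=response))
        score = max(1, min(5, int(raw.strip())))  # clamp to the rubric's range
        return {"friendliness": score}

# Usage with a stub judge; a real judge would call a chat model:
evaluator = FriendlinessEvaluator(judge=lambda prompt: "4")
result = evaluator(response="Grab a cocoa and let's talk winter tents!")
```

Clamping the parsed score keeps a misbehaving judge from pushing results outside the rubric's one-to-five range.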
That's a great question, Maris. Running things locally is fantastic for getting started, for understanding what's happening, and for customizing things for my own domain, like the friendliness one. But when we're ready to evaluate an application for production, we know we need a much larger data set to work with, and that's where the power of the Azure cloud comes in. We're introducing cloud evaluations: I can take the same code, make a few changes to point it to Azure AI Foundry, and run it there on the Azure platform. I can kick something off, close my laptop, go get some coffee, and come back later when it's done. Exactly, and I love that lifestyle for you. So can you show us how that works? Let's take a look, Maris. We actually render those outputs in the UI, so let's look at the Foundry portal for a few minutes. Perfect. One of the things we can do is render the local evaluation results in a very similar format to what we saw earlier. And let's take a look at a cloud evaluation as well. You'll notice this one
took 2 minutes and 52 seconds to run and completed successfully, and now we're ready to look at all the results, in the same format we saw earlier, with the same risk and safety options. This run looks like it went pretty well. That sounds great. It seems I can see the aggregate scores on the safety side; they're all zero, which of course means I don't need to dig deeper into each line of the test set. Everything is good. And groundedness, relevance, coherence, and fluency all turned out perfect, so I think I feel great about this AI.
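Reading the aggregates the way Maris does here, quality metrics as one-to-five averages and safety metrics as defect counts where zero is good, might look like this small sketch. The keys and flagging logic are illustrative, not Foundry's actual schema:

```python
# Sketch of reading aggregate results like the portal view just described:
# quality metrics as 1-to-5 averages, safety metrics as defect counts where
# zero is good. Keys and flagging logic are illustrative, not Foundry's schema.

def summarize(rows: list[dict], safety_keys: set[str]) -> dict:
    averages = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        averages[key] = sum(values) / len(values)
    # Any non-zero safety average means there are rows worth inspecting.
    flags = sorted(k for k in safety_keys if averages[k] > 0)
    return {"averages": averages, "safety_flags": flags}

rows = [
    {"groundedness": 5, "coherence": 5, "violence": 0},
    {"groundedness": 4, "coherence": 5, "violence": 0},
]
summary = summarize(rows, safety_keys={"violence"})  # no safety flags raised
```

When `safety_flags` comes back empty, there is no need to walk the test set line by line, which is exactly the shortcut described above.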
But before we do that, there's one last thing I want to raise. One other application we were developing was multimodal, because we wanted to generate photos for our social media, and I'm worried about the content of those images. Is there a way I can test the safety of these multimodal scenarios as well? Absolutely. For the generated images, we want to make sure they're appropriate for putting out on our social media, and so we also introduced multimodal evaluators. If we scroll down here, we have the multimodal evaluators that are built in to the evaluation SDK. They look very similar to the text-based ones we saw earlier and operate in a very similar way: we take our image, provide it as an input, run it, and get the outputs back. Fantastic, can we take a look at the results of that one, please? Absolutely. So it seems like we're evaluating violence, self-harm, hate, and sexual content, and also protected material, whether any image includes copyrighted material. That's right. And here we were really trying to test the system with some of the images we passed in.
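The multimodal evaluators take image-plus-text exchanges as input; the payload might be packaged roughly like the OpenAI-style conversation below. Treat the exact field names as an assumption for illustration, not the SDK's verified schema:

```python
# Rough sketch of packaging an image and its generated caption for a
# multimodal safety evaluator. The OpenAI-style message shape below is an
# assumption for illustration, not the SDK's verified schema.

def image_conversation(image_url: str, caption: str) -> dict:
    return {
        "messages": [
            {"role": "user",
             "content": [{"type": "image_url",
                          "image_url": {"url": image_url}}]},
            {"role": "assistant",
             "content": [{"type": "text", "text": caption}]},
        ]
    }

conv = image_conversation("https://example.com/campfire.png",
                          "A cozy winter campsite at dusk")
```

Each image then gets the same per-row scoring treatment as the text evaluators described earlier.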
We wanted to see what the boundaries are, and whether it would tell us, maybe that's not appropriate. And here we have some things with more violent content, something related to self-harm, and if we scroll back up we see a note that there was protected material here too. So let's scroll down. One of my favorite features is that I can look at detailed information about each image and get an explanation for why it received that score. If I scroll down a little bit more, here is the first image, and here is the second. If I go over to the protected material detection, it says yes, I found something, and it tells me it found a Red Cross patch, which is a registered trademark. Great, can we take a look at that image and see what it was? Yes we can, we have it right here. All right, so apologies for what you're seeing on the screen; it's very hard for us to demo our safety features with pleasant content, so here you will see some unpleasant things. But what you can see here
is that this Red Cross has likely been flagged as copyright-protected material because it resembles the Red Cross mark. Okay, so it seems like this also gives me confidence in testing my multimodal and image scenarios. That sounds great, but I really need to run it by our chief AI officer. And, like in the theater when we don't have enough actors, I'm going to play the role of the chief AI officer too, so forget about me being the product manager. So, is there a way you can publish those results for the officer?
Let's take a look. Here we have code for generating AI reports. What it does is take the evaluation responses as well as some additional information about the environment, any metadata we care about, like links, a description, a purpose, and so on, and we can then register that into Azure AI Foundry. So let's take a look at one there. It also generates PDFs for us to use if we like. Awesome. So here I'm able to come to the management center of Azure AI Foundry and take a look at all the reports that have been generated for me.
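Assembling such a report record, evaluation results plus system metadata, could look something like this sketch. All of the field names, including `review_status`, are hypothetical, not the actual AI report schema:

```python
# Illustrative sketch of bundling evaluation results and system metadata into
# a report record like the one described. All field names, including
# "review_status", are hypothetical, not the actual AI report schema.

def build_report(name: str, description: str, purpose: str,
                 links: list[str], measurements: dict) -> dict:
    return {
        "name": name,
        "system_overview": {
            "description": description,
            "purpose": purpose,
            "links": links,
        },
        "measurements": measurements,
        "review_status": "pending",  # a reviewer may later approve for production
    }

report = build_report(
    name="winter-writer-v1",
    description="Article writer for Contoso Retail marketing",
    purpose="Generate product marketing articles",
    links=["https://example.com/design-doc"],
    measurements={"groundedness_avg": 4.5, "violence_defects": 0},
)
```

Keeping the review status as a mutable field is what lets a chief AI officer later flip it to approved, as the demo goes on to show.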
These are all here in a read-only manner. I can see that across different hubs and different projects there could be multiple reports under each project, and multiple versions of each. This is the one Paul just published to me. I can click on it, and you can see there's a PDF that shares a lot of information about the system overview: the description of this AI, its purpose, the modality of the data that is used here, links,
anything like that. But more importantly, it has the measurements Paul just generated, to build trust and give me the confidence that this is an application we could put into production. And last but not least, I can also see an environment overview, the architecture behind this AI system, and I can download it. Now, as chief AI officer, I can edit the review status. For example, right now I might be fine with this going to production, so I can approve it for production; maybe later Paul will change the operational status to in-production. And I can change the priority level, risk level, and so on. We're hoping this is just one way of looking at this: you can extract these PDFs and show them in your own compliance overviews and dashboards. Now, one other thing that is always top of mind for me: I know your team is very active, they work on different versions of this generative AI application, and they're constantly improving it. How would you make sure we run evaluations each time there's a new version of that code? Absolutely, you know,
evaluations to me are like tests in traditional development, and we have a great tool for that. Today we're also announcing a GitHub Action for running evaluations on code after it's checked in. Yay! Here we have a run that's already completed. It runs my application, the application generates the inputs and outputs and logs the context if that's needed, and then we get a very nice summary in the GitHub Action output where I can look at scores like coherence and fluency. I can browse the raw outputs if I really want to dig in some more, and I can add pass/fail tests based on the scores I want, so this helps me ensure my application stays in a good state. That gives me a lot of confidence about our ecosystem and the fact that this is part of our CI/CD. But as I mentioned, applications are usually fantastic in the lab and go crazy when they're out there. Is there a way you can continue to maintain my peace of mind once it's out there? Absolutely, we have that too.
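The pass/fail gating described here, failing the CI job when a score drops below a bar, could be sketched like this. The thresholds and metric names are illustrative choices, not defaults of the GitHub Action:

```python
# Sketch of a CI score gate: return a non-zero exit code when any metric
# average misses its minimum bar. Thresholds and metric names here are
# illustrative, not defaults of the evaluation tooling.

THRESHOLDS = {"coherence": 4.0, "fluency": 4.0}  # minimum acceptable averages

def gate(averages: dict[str, float]) -> int:
    failures = {m: v for m, v in averages.items()
                if m in THRESHOLDS and v < THRESHOLDS[m]}
    for metric, value in failures.items():
        print(f"FAIL {metric}: {value:.2f} < {THRESHOLDS[metric]:.2f}")
    return 1 if failures else 0  # a CI step would exit with this code

# In a workflow step, something like raise SystemExit(gate(scores)) would
# fail the build on a regression.
exit_code = gate({"coherence": 4.6, "fluency": 4.8})  # → 0, build passes
```

A non-zero exit code is all a CI runner needs to mark the check red, which is what keeps the application "in a good state" as described above.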
Today we're also announcing online evaluation. What I can do, let's dive into code one more time, is set up my application so that, first, it logs inputs and outputs to Application Insights, and second, we run evaluations on that data on a regular basis. Here we're setting up an evaluation schedule: giving it the data we want, choosing the evaluators like we saw earlier, triggering it every day, and then we just configure it and create it. And then we can go to Application Insights and take a look at the dashboards.
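The schedule just described might have a shape along these lines; every field name below is an assumption for illustration, not the SDK's actual schema:

```python
# Hypothetical shape of the online-evaluation schedule just described: read
# logged inputs/outputs from Application Insights and re-run evaluators every
# day. Every field name here is illustrative, not the SDK's actual schema.

schedule = {
    "data_source": {
        "kind": "application_insights",
        "query": "traces | where name == 'gen_ai_response'",  # logged I/O
    },
    "evaluators": ["groundedness", "relevance", "violence"],
    "trigger": {"recurrence": "daily"},
}
```

The key idea is that the same evaluators used pre-deployment now run continuously against real production traffic, rather than a curated test set.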
Here we are. We see some evaluation metrics, and Maris, it looks like this one doesn't look quite as good as some of the others we've seen: we see a lot of ones and a lot of twos. That's the point, right? It was perfect pre-deployment, and now that it's in production we see groundedness scores of one and two and three, and so that really means we're doing the right thing by monitoring, and we have to constantly improve our application. And we can. So here's what we did: we saw we had a challenge here, we came down and looked at some of the raw inputs, the system messages, and the outputs, did a little debugging on our side, checked in a couple of fixes to make things better, and then kept an eye on it. If we go back up here, these are our evaluation metrics over time. We saw it was averaging around twos and threes and starting to get even worse, we checked in a fix right here, and look, things are heading in the right direction. Awesome, thank you so much, Paul. You set such a
good bar for all the developers out there testing their applications. A round of applause for Paul, please! Thanks, Maris. Awesome. Just wrapping up what we all saw: we saw how the Azure AI Evaluation SDK experiences can help you assess base models on your own data, and also move on to testing your application pre- and post-production. We are excited to release the SDK into GA and put it in your capable hands to experiment and test your generative AI apps. We're also super excited to release AI reports to private preview. We totally understand that governance is
a huge part of the generative AI life cycle, and we really want to make sure we're giving you this tool to enhance your observability, collaboration, and governance for generative AI applications, and also for your fine-tuned models. Beyond that, we are very excited to announce new collaborations and partnerships to support your end-to-end governance, because we recognize that AI is a team sport: it's also about change management, and about making sure that cross-functional teams are collaborating effectively. The first partnership we're announcing is with Credo AI, which has truly pioneered a responsible AI platform enabling comprehensive AI governance, oversight, and accountability. And the second is with Saidot, whose AI governance platform helps enterprises and governments manage the risk and compliance of their AI-powered systems with efficiency and high quality. By integrating the best of Azure AI's measurement and life-cycle capabilities with Saidot and Credo AI, we really hope to provide you, our users and customers, with choice, and to foster greater cross-functional collaboration to align solutions with your own regulatory space and internal policies. With that, I'll
give it back to Sarah. Thank you so much. [Applause] Those were many of the new trustworthy AI capabilities we have, but they're just a small sample of them. Capabilities are an essential part of how we build trustworthy AI systems, but there's a lot more: we want to make sure we're building an ecosystem, that we have partners supporting this, that we're educating people, and that we're engaging in policy. So if you want to learn more about all the different ways we're working on trustworthy AI, we have quite a few sessions at Ignite that you can check out. We also have a Trustworthy AI booth; a lot of the people from my team you saw on stage are at that booth, so you can go talk to them and learn more about these capabilities. And this is just the start of the journey, so we encourage you to check everything out, but give us feedback. We're learning, and we're working to make sure this works well in every application. We're regularly producing new features and capabilities, so continue to follow
what we're doing. And if you want to learn more about Microsoft's overall approach to responsible AI, everything from governance to our tools and technologies, check out the transparency report that we released for the first time this year. We'll continue to update it every year, because we're committed to sharing what we're learning as we go, so that everybody can start from that place and not have to start from scratch. I hope you continue on this journey with us; we're so excited to see what you can do with AI. Thank you.