Welcome to Real Terms for AI, with Jason and Aza. In most of our previous videos, we've talked about how you should try a particular LLM, token count, or RAG technique and see if it works for your data, but we haven't actually talked about how you can tell whether something works. I believe the AI people call this "evaluation." It would take an entire season of videos to go through all the details and techniques you could use for evaluation, but here we're going to hit on a few things that we're thinking about. So, starting at the basics:
I've heard of evaluation before, and I've heard a bunch of stuff about benchmarks and leaderboards for LLMs. Is that what we're talking about here? No. Think of it this way: you can evaluate a model, or you can evaluate the application that you build on top of the model. We're going to talk about the applications you build and how you evaluate those, because that's actually more relevant to most developers out there. Yeah, makes sense. But wouldn't a model with better benchmark scores make my application that uses that model also perform better? Not necessarily: often
a better model will help, but there are other factors, like your prompt templates, your approach to RAG, and any fine-tuning that you've done (such as your algorithmic approach or quantization), that can change your actual quality. Okay, so today we're talking about evaluating the code we write, not the code that is in the model we're using. That makes sense. And if we're going to evaluate how good an AI application is, we need to start with some examples of good responses to the various inputs that our LLM may receive. You may also hear these called "ground truths." I think I've also heard it called a "golden data set." Simile or analogy, one of the two. Anyway, we can compare our AI app's outputs with known-good data to make sure that our AI's responses are both accurate and truthful to that golden data set. In addition to accuracy and truthfulness, we can also use known-good responses to make sure that the AI is stylistically responding in the way we want: for example, we can evaluate whether the AI is polite enough, or whether the responses are an appropriate length. But how do we get that data to evaluate our app against?
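Before answering that, here's a tiny Python sketch of what "comparing against a golden data set" can look like in practice. Everything here (the data, the `evaluate_against_golden` helper, the politeness word list) is made up for illustration, not from any real evaluation library:

```python
from difflib import SequenceMatcher

# A tiny, hypothetical golden data set: known-good responses for known inputs.
GOLDEN = {
    "How do I renew a book?": "You can renew a book online or at the front desk. Happy reading!",
}

def similarity(a: str, b: str) -> float:
    """Rough textual similarity between a candidate and a ground-truth response (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def evaluate_against_golden(prompt: str, response: str) -> dict:
    """Score one response against the golden data set on accuracy and style."""
    truth = GOLDEN[prompt]
    return {
        # Accuracy: how close is the response to the known-good answer?
        "accuracy": similarity(response, truth),
        # Style: is it polite, and is it an appropriate length?
        "polite": any(w in response.lower() for w in ("please", "happy", "thanks")),
        "length_ok": 20 <= len(response) <= 300,
    }

scores = evaluate_against_golden(
    "How do I renew a book?",
    "Renew your book online or at the front desk. Happy reading!",
)
```

Real evaluation frameworks use far more sophisticated accuracy measures than string similarity, but the shape is the same: known-good data in, per-metric scores out.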
Well, there are actually a number of ways we could do it. One potential source is directly from your users: if your policies allow it, you can use logs of interactions with your AI, or with other agents, and the ratings of those interactions, to find the ones that were the most helpful. You may also be able to manually create a data set using your knowledge of your users and of the domain the AI application is designed to work in. I think the most interesting option in this space, though, is
actually generating an evaluation data set using AI. So how does that work? Wouldn't the AI just generate output that it's going to do well on? You would think that, but no. You generate your golden data set using carefully constructed prompts, and potentially examples of the types of data that you actually want. For example, you can ask an AI to generate 50 examples of customers who want to sign up for a library card, based on a few handwritten examples that you provide. But what if the AI just spits out a bunch of bad examples, or a bunch of examples that are all exactly the same?
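Setting that question aside for a second, the generation step itself is mostly prompt construction. Here's a hedged sketch of the library-card example; `call_llm` is a placeholder for whatever model client you actually use, and the seed data is invented:

```python
import json

# Handwritten seed examples of the data we want more of (library-card signups).
SEED_EXAMPLES = [
    {"name": "Priya", "request": "Hi, I'd like to sign up for a library card."},
    {"name": "Marcus", "request": "Can I get a card? I just moved to the area."},
]

def build_generation_prompt(seeds: list, n: int = 50) -> str:
    """Construct a careful prompt asking a model for n new synthetic examples."""
    return (
        f"Generate {n} new, diverse examples of customers asking to sign up "
        "for a library card, as a JSON list of objects with 'name' and 'request' keys. "
        "Vary names, phrasing, and tone. Here are some handwritten examples:\n"
        + json.dumps(seeds, indent=2)
    )

prompt = build_generation_prompt(SEED_EXAMPLES, n=50)
# In a real pipeline you would now send `prompt` to your model of choice,
# e.g. synthetic = json.loads(call_llm(prompt))   # call_llm is hypothetical
```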
Do you remember when we talked about LLMs as judges? This is why it's important to validate any synthetic data that you generate with AI. You can have a human evaluate it, potentially scoring the examples on how good they are. You could also ask a different AI to evaluate the generated data set (again, LLMs as judges). But even if you use a model to validate the generated output, you still need a human to look it over to make sure that the synthetic data represents the way you want your AI application to behave.
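That validation step can be sketched in a few lines. Here, `judge_example` is a stand-in for a real LLM-as-judge call (and its toy heuristic is purely illustrative); the point is the pipeline shape: dedupe, score, filter, then hand the survivors to a human:

```python
def judge_example(example: str) -> int:
    """Stand-in for an LLM-as-judge call that scores an example from 1 to 5.
    In practice you'd prompt a *different* model to do this scoring."""
    # Toy heuristic just for this sketch: very short examples score low.
    return 5 if len(example) > 20 else 2

def validate_synthetic(examples: list, min_score: int = 4) -> list:
    """Drop duplicates and low-scoring examples; survivors still need a human pass."""
    seen, kept = set(), []
    for ex in examples:
        if ex in seen:                 # reject exact duplicates
            continue
        seen.add(ex)
        if judge_example(ex) >= min_score:
            kept.append(ex)
    return kept

raw = [
    "Hello, I'd like to sign up for a library card for my family.",
    "card please",                                                    # scores low
    "Hello, I'd like to sign up for a library card for my family.",   # duplicate
]
clean = validate_synthetic(raw)
```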
Okay, so let's say I have that golden data set. Now what? How do I evaluate whether my AI application is any good? Well, what we need to do is actually define the specific metrics that we care about for our application. For example, if your AI application involves answering user questions, you may care about things like relevance, accuracy, politeness, fluency, and even response length. You might also care about how long it takes your application to generate a response, if it's particularly latency-sensitive, or maybe you've got some
metrics around the number of tokens sent or received. There are literally dozens of possible metrics, and for each metric there may be several different algorithms or techniques for evaluating a prompt-response pair on that metric; we call this "pointwise evaluation." Now, you may want to define your own metrics as well, based on your knowledge of your system, the domain, and your users. If you do this, you'll need to come up with clear criteria to evaluate your responses and, most importantly, have a consistent approach to your scoring. Okay, so let's do an example: maybe
you want to define your own metric and evaluation for how well your AI is personalizing its responses to the specific user. You could evaluate responses on whether they include accurate personalizations, and potentially also on how many personalizations they include, like the user's name or a preference the user has. So: if we have an AI application, a golden data set, and some metrics that we care about, what we need to do now is actually run our evaluation. And there are off-the-shelf tools that you can use to do this, including evaluation services.
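That personalization metric from a moment ago could be written down as a pointwise scorer. This is just one possible rubric, invented for illustration; the important part is that the criteria are explicit, so scoring stays consistent from run to run:

```python
def personalization_score(response: str, user: dict) -> float:
    """Pointwise metric (0..1): what fraction of the user's known attributes
    does the response actually mention?"""
    facts = [user["name"]] + user.get("preferences", [])
    mentioned = sum(1 for fact in facts if fact.lower() in response.lower())
    return mentioned / len(facts)

user = {"name": "Dana", "preferences": ["mystery novels"]}
score = personalization_score(
    "Hi Dana! Based on your love of mystery novels, try 'The Thursday Murder Club'.",
    user,
)
```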
Vertex AI's generative AI evaluation service is one example. From my research, most of these tools are pretty similar: you select the metrics you want to evaluate on, you provide an evaluation data set, and then you ask the tool to run the evaluation and give you the results. One of the advantages of using an evaluation service is that you can easily try several variations of your prompts, model parameters, or even the base models that you call, to see which ones actually give you the best results for your use case. And these tools often integrate with
other tools to provide historical records of all the different experiments you've tried. If you want to use Vertex AI's generative AI evaluation service, we've linked to a page with many different tutorials in the description box below, so you can try out a bunch of that product's features. And of course, you could also build your own evaluation pipeline, where you set up your own set of sample prompts, evaluate the responses on your metrics, and then summarize the results and maybe materialize them somewhere else.
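A homegrown pipeline really can be that small. Here's a minimal sketch, where `app` and `score_response` are stand-ins for your real application and your real metric functions:

```python
from statistics import mean

def app(prompt: str) -> str:
    """Stand-in for your actual AI application."""
    return f"Thanks for asking! Here is an answer about: {prompt}"

def score_response(prompt: str, response: str) -> dict:
    """Stand-in metrics: politeness (0/1) and response length."""
    return {
        "polite": 1.0 if "thanks" in response.lower() else 0.0,
        "length": float(len(response)),
    }

def run_eval(prompts: list) -> dict:
    """Run every sample prompt through the app, score it, and summarize."""
    rows = [score_response(p, app(p)) for p in prompts]
    return {metric: mean(r[metric] for r in rows) for metric in rows[0]}

summary = run_eval(["renewing a book", "library hours"])
# `summary` could now be materialized to a dashboard, a database, or CI logs.
```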
So that's all awesome, but one of the questions I've had is: when do you do this evaluation step? Do you treat evaluation as a one-time thing you do right before your initial launch, to make sure your app works? Do you run the evaluation regularly, like on every check-in, because it kind of looks like an integration test? Or do you run evaluation on every single response that your application outputs, in real time? Yes! There are many ways we can do this, right? And today we've actually been talking mostly about offline evaluation. With offline evaluation, you can run this while you're
building your application, to improve it and to ensure that your application meets your quality bars before you deploy it. You can also use offline evaluations as part of your machine learning operations (MLOps) pipelines, to ensure that any changes you make to your application code don't negatively impact the metrics you care about. That leaves evaluating responses in real time, as the application generates them. Of course, you can do this too, and we touched on it briefly in our video on advanced RAG architectures. One thing you can do here is evaluate an LLM's response before you send it to the user. You can do this either with simple metrics, or by actually sending the response back to an LLM with another prompt, to evaluate the response on things like relevancy and politeness. And if the response doesn't meet your bar, you can always try generating a new response. There are a number of other techniques you can use for online evaluation that can improve a fine-tuned model, or improve your app iteratively in real time, as well.
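That check-then-retry loop can be sketched like this. `generate` stands in for your model call, and the bar here is a deliberately simple heuristic; in practice it could be a second LLM prompt judging relevancy and politeness:

```python
import random

def generate(prompt: str, rng: random.Random) -> str:
    """Stand-in for a model call; randomly good or bad for demonstration."""
    return rng.choice([
        "No.",
        "Sure! The library opens at 9am; please bring photo ID.",
    ])

def passes_bar(response: str) -> bool:
    """A simple pre-send check; a second LLM prompt could do this instead."""
    return len(response) > 10 and "please" in response.lower()

def respond(prompt: str, max_attempts: int = 5, seed: int = 0) -> str:
    """Evaluate each response before sending; retry if it misses the bar."""
    rng = random.Random(seed)
    response = ""
    for _ in range(max_attempts):
        response = generate(prompt, rng)
        if passes_bar(response):
            return response
    return response  # last attempt; in production you might fall back instead
```

The trade-off, of course, is latency: every retry costs another model call, so you'd cap attempts and have a fallback.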
So let me do a quick summary to make sure I've tracked everything we talked about today. Evaluation is a technique that makes sure our AI-powered application is producing results that meet our needs. To do evaluation, you need some metrics, the things you care about, like accuracy, latency, tone, and helpfulness. You also need some evaluation data: either prompts with known-good responses as ground truth (a golden data set), or even potentially just a set of test prompts. Then you can use off-the-shelf tools, or roll your own method of evaluating your application. Finally, evaluation can be helpful at
many stages of the application life cycle, from initial development through to that application running in production. I think that summarizes it pretty well. Hope everyone out there has a great time putting evaluation into your apps and into your development life cycles. Until next time, this is Aza and Jason. Happy prompting!