It is my amazing privilege, and I will have to read this a little bit because it is amazing and I don't want to mess it up, to share that our speaker is an assistant professor in the School of Information at UC Berkeley, go Bears. She studies human-computer interaction, with her research spanning education to healthcare to restorative justice, which is an amazing breadth. Her research interests are social computing systems, human-centered AI, and more broadly human-computer interaction. Her work has been published and received awards in the premier venues, and our
speaker gave me all the acronyms, but I'm going to spell them out because they're so impressive and it's important: the Association for Computing Machinery Conference on Human Factors in Computing Systems, ACM CHI; the Conference on Computer-Supported Cooperative Work, CSCW; and Empirical Methods in Natural Language Processing. These are amazing; I did some reading on this, and I encourage you to look these things up, absolutely amazing. And if that wasn't enough, she's been covered in, you know, little things like VentureBeat, Wired, and The Guardian, you know, those ones that are a little harder to remember. She is a William T. Grant Foundation scholar for her work on promoting equity in student assignments and algorithms, and a member of the advisory board on generative AI at this tiny little startup named NVIDIA, I'm not sure if anyone's ever heard of them. And even though we recognize that, you know, she does go to Cal now, which is a good thing, she teaches for Cal, or has taught lately, she has this little PhD thing from Stanford. So please join me in welcoming
Dr. Niloufar Salehi. [Applause] Thank you, Jerry, that was very kind and very funny. Thank you all for spending your afternoon with me. We saw some really cool demos, and that's really what I'm going to talk about. I'll take a step back, as academics, that's what we like to do, and just think a little bit more broadly about what's going on here, what's happened so far, and where we are headed. So what I'm going to talk about is large language models, it's been about two years since ChatGPT
came out and we've all tried them, and what they mean for work and what they can mean for product management. So like Jerry said, my name is Niloufar Salehi, I go by Nilou. I've been a professor at UC Berkeley since 2018, and I've been doing a lot of work on thinking about human-centered AI and what that means. I'm going to come back to this quote, but this is something that has just stuck in my mind since the first time I read it; it's a quote by Russell Ackoff. The
quote is: managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other; I call such situations messes. And this will be very familiar to anyone who's done product: you're always dealing with messes, and we're going to talk about what AI can mean for that mess. There are some estimates that about 80% of data in all organizations is messy, unstructured, and rarely used, and product people are expected to always know what that data is, what the trends are,
what we should be doing with that data. So we'll get into that, but first, this is what we're talking about. I really like this XKCD comic because I think it sets the stage; whenever I'm talking about AI, I always bring up this comic. There's this guy who goes, this is your machine learning system, and the other one's like, yeah, you pour the data into this big pile of linear algebra and then collect the answers on the other side. And the big question is, what if the answers are wrong, and the person says,
well, you know, just stir the pile until they start to look right. And that's what we're dealing with: at the most basic level, we're dealing with linear algebra that you can keep stirring until it starts to look right, but it becomes profoundly powerful as it starts to get right. That's something we have to be thinking about, and it's something that people across different kinds of organizations, and even venture capital, are starting to look at as well. There was this article that came out about the second
wave of AI, which is less about generating new stuff and more about synthesizing what we already have. That's something I think is really, really important, and something we will see more and more of as we move past the initial stage, when we were just starting to learn what models can do, and into the stage where they get really deeply embedded in everyday workflows. So I believe we are going from the internet era, where connectivity and information sharing led to an exponential growth of information
and of our ability to do things with information, to a new era where it's not just that we have unlimited amounts of information, but now we finally have the tools to also be able to process and synthesize that information. And that's why I think everything is going to change, but it's also going to change really slowly, and in really weird ways, and it's going to take a lot of work to get there. So I'm going to share one project that my students and I have worked on at UC Berkeley, and I'm going to
turn this into more of a conversation than me just presenting the work. About a year ago, we were starting to look into what it would look like if we used AI more for data synthesis than just for generation. What if we took these models, and we took all this unstructured, messy data, and we tried to use the models to make sense of that data? I'll pause here and see if anyone's tried to do that. Show of hands if you've tried to upload any of your own data, or run a model
over any of your own data. All right, so about half the room, and we saw some demos on how to do that today as well; you can go try them out. But I want to hear from the people who have tried it: what has your experience been like, what's gone well, and what has not worked? [Audience]: It's always structure; you have a data set that's not structured in the way the model expects. [Speaker]: Okay, so the model needs to know what the structure of the information is, otherwise it doesn't understand it, and then the output doesn't make
sense. Okay, what else? Could you say more? [Audience, partially inaudible]: You're expecting magic, but it's not really working out, so you have to iterate a lot. [Speaker]: How do you know it's not working? So the output doesn't make sense. What else, what have other people's experiences been? [Audience]: Setting up a testing framework to answer that question turns out to be the hardest part, because you're literally asking fundamental product questions and trying to evaluate the effectiveness of the tool. [Speaker]: So evaluation is really hard, yeah.
[Audience, partially inaudible]: The answers are different each time, and sometimes that's not what you were expecting; then you use the same prompt for a different thing and it doesn't work. Say I upload data and ask some questions; sometimes it gives me the right answer, and sometimes I ask the same question and it doesn't. [Speaker]: Right, that's a really good point. Because these models are non-deterministic, or stochastic, they don't necessarily do the same thing again, and I think that's a really big key to how this will all play out. I think we're moving past the era of software where the software was always deterministic, where you knew exactly what you would get out; if you ran
it many times, you would get the same thing, into an era where a lot of things are going to be really noisy and evaluation is going to be really hard. A colleague of mine, Joey Gonzalez, and another colleague, Joe Hellerstein, started a company called RunLLM, and they wrote blog posts about a year ago saying, you know, everyone's evaluating their LLMs off of vibes, and this can't work; we have to have some kind of benchmarking here. And a lot of people have
tried to do LLM evaluations and benchmarking, and just a week ago I saw a blog post come out of their company, and the title of the blog post was In Defense of Vibe-Based Evaluations. After a year of trying to create benchmarks, they had realized that even if they put a lot of effort into a good benchmark, it still didn't mean that their end users felt this was a better LLM. What the end users were doing was going in, asking a few questions, and seeing how it looks and
how it feels, just the vibes. And the blog post was saying, if that's how the users are evaluating it, then maybe that's not so bad; maybe we should be thinking more about that. So I think there are some fundamental things that are going to change. Yes? [Audience]: Another big problem is that you may get situations where 95% of the output is accurate and 5% is not, and it's easy, if you don't really understand the domain, to just think it's 100% accurate, because you're seeing a bunch of things that look accurate, and detecting
which parts are not, hallucinations, is another problem. So I think the veracity of what you receive, and whether you can fully trust it, matters; in some domains you can't afford to have 85% confidence or even 90% confidence, you need 98, 99, 100. So I think that's another big problem. [Speaker]: That's exactly true. Even if I know that I'm correct 99% of the time, I don't know which 1% is wrong, and that's another really big problem, because our brains are just not used to tools that work that way.
If I drive a car around LA and it's working pretty well, I expect that if I come to San Francisco it'll work just as well; that's how our brains function, and how we think about tools. And now we're entering an era where, well, I really don't know why these models keep using the phrase "in the realm of" something; whenever I see an email with "in the realm" in it, I know it's AI-generated, because no person uses that phrase, but I don't know why the models keep using it. So it's going
to be weird, and we're not going to be able to understand why. We have a lot of experience with this. About four years ago, I started working on models used in really high-stakes settings, because I was thinking, if these models are going to keep getting better and more embedded, we have to know how to make them reliable. So I thought to myself, let's take it to the extreme: what is a setting in which you absolutely cannot be wrong? And I thought, well, hospital settings, because if it's life or death, then you
absolutely cannot be wrong, no matter how good the model is. So we started looking at language models, early language models, translation models, used in hospital settings, and one thing we learned was that Google Translate was widely used in hospitals across California. So we started this study. Let's actually try this out with this group. Say you're a doctor, and your patient comes in, and you want to tell your patient: hold your kidney medicine until you've had a chance to speak to your kidney doctor. You just got
their lab results back, and you have to give them this piece of information. Now your patient comes in, and they don't speak English; they only speak Mandarin. You have a choice here: you can put that information into Google Translate and hand the output to them, or you could go try to find a Chinese interpreter. Now let's do a show of hands: if you were that doctor, would you give the Google Translate output to your patient? Show of hands if that's a yes. Okay, I think about 30%.
And what if it's a no? Okay, 70%. This is really interesting for me, because I usually talk about these things to a technical audience, and when I speak to a technical audience, almost no one raises their hand for a yes; everyone's an absolute no. But when I go and talk to doctors, because a lot of my collaborators are doctors, all of them do this routinely. So I always use this as an example to say that the doctors are being super practical, and the technologists are thinking about it in a zero-to-one, numbers
way, and it's interesting to see that product folks fall somewhere in the middle, where we understand the tech, and yet we know the reasons why we need to be practical as well. Yes? [Audience]: Have you had a doctor put something like that into Google Translate, and then had somebody who speaks Chinese translate it back into English for you? [Speaker]: So my colleague Dr. Elaine Khoong, who's at UCSF, has a lot of Chinese patients, and she does this all the time; she puts all her notes into Google Translate. And a lot of the time, the
patient comes in with someone who may know some English, so yes, they do that. But we did run an experiment where we actually tried using Google Translate to back-translate; I'll talk a little bit more about that later, but yes, sometimes people do things like that. [Audience, partially inaudible]: And if the meaning changed, something bad could happen. [Speaker]: Yes, liability is a big issue, absolutely. [Audience]: My question was about how accurate the translation might be. I personally wouldn't directly translate; I
would translate it, then, without showing the patient, translate it back into English to see what it came out as, and then, if it seems reasonable, perhaps I would show the original translation. [Speaker]: Yes, so there's a button in the middle of Google Translate for flipping it around, and some doctors do that and some do not. We actually found a hospital somewhere in California where the guidance to doctors was to do exactly that, to back-translate. But you're completely right: there is absolutely no written rule anywhere, and the liability lies completely with the physician
who did that. [Audience, partially inaudible]: One angle is that doctors rely on their nurses and their staff to actually provide the information, so even if they use Google Translate, it's dependent on the nurses to make the final translation for those patients; the liability might be slightly lower in an actual hospital or clinic setting, where nurses are much more involved. [Speaker]: Well, I guess in this scenario, the liability would be with the person who gave the incorrect information. So,
the example that I gave you was a bit of a trick question. It's one where Google Translate, at least up until 2021, was flipping the meaning. "Hold your kidney medicine until you've spoken to your doctor" was an actual sentence that someone was given in hospital discharge instructions; my colleague had a big database of things she had said to patients. We ran them through Google Translate, then had someone who understands that language look at and rate the output, and about 8% of the sentences were not
only incorrect, but incorrect in ways that could cause clinically significant harm, and the sentence that I said was one of those: it got translated to "keep your kidney medicine." So this is an example of the sort of errors that are very hard for us to identify, because we don't know when the model is going to be right and when it's going to be wrong, and even if I know that it's wrong 8% of the time, that still doesn't help me on an individual basis when I'm working with these models.
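The back-translation check that doctors described, translate, translate back, and only trust the output if the round trip survives, can be sketched in a few lines. This is a toy sketch, not any system from the talk: `toy_translate` is a hypothetical stand-in for a real machine translation API, deliberately rigged to reproduce the "hold" to "keep" flip, and the word-overlap score is a crude surface proxy for meaning.

```python
def toy_translate(text, src, dst):
    """Hypothetical stand-in for an MT API. It mimics the failure from
    the talk: 'hold' comes back from the round trip as 'keep'."""
    if dst == "zh":
        return "|".join(text.split())  # fake target-language encoding
    return " ".join("keep" if w == "hold" else w for w in text.split("|"))

def round_trip_check(sentence, translate, threshold=0.9):
    """Translate EN -> ZH, back-translate ZH -> EN, and flag the output
    for human review if the round trip drifts from the original.
    Jaccard word overlap is a crude check; a real system would compare
    meaning, e.g. with sentence embeddings."""
    forward = translate(sentence, "en", "zh")
    back = translate(forward, "zh", "en")
    a, b = set(sentence.lower().split()), set(back.lower().split())
    overlap = len(a & b) / max(len(a | b), 1)
    return {"forward": forward, "back": back,
            "flag_for_review": overlap < threshold}

result = round_trip_check("hold your kidney medicine", toy_translate)
```

Note that the meaning-flipping error changes only one word out of four, so only a very strict threshold flags it; a check like this lowers risk but cannot certify safety, which is exactly why the liability question above has no clean answer.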
And so I think the next generation of tools we use is going to be completely different, and everything about them is going to be different, and that means a lot for product managers. We're going to have tools that are stochastic, that are not always right, that you can't even really run an evaluation on, because each time you run it, something different is going to happen. I think that's going to touch every aspect of the product. I think it's even going to touch pricing, because so far
we're so used to a per-license or per-seat pricing model, because the cost of running software was pretty much the same for each person who joined, right? But now we're at a point where it depends on how much they use it; the models are actually the thing that's really expensive. So we're seeing some startups do token-based pricing, and there's even one startup doing this really innovative thing, outcome-based pricing: they're saying, we'll help you answer your support tickets, and we
will take a percentage based on how many support tickets we close. I think we're going to see more of that, because we're going to see more and more tools that are involved in the actual doing of the work, rather than just being a tool you use to do the work. And I think we're going to see less of those things that have become so standard, you know, those tables that every company has, with our features and the competitor's features,
and there are ticks in them, like, we give you 20 gigabytes of storage and the other company gives you 10. I think those are all a product of software that was very deterministic and very easy to measure, and we're going to see less and less of that. You know, ChatGPT and Gemini have been around for a while, and there isn't a single table that compares them like that; most people just go off of vibes, or they try them a few times. I've heard some people who really like Gemini,
and some people who really like ChatGPT, and I'm sure some people love other models too. So I think we're going to see a whole new way of how people evaluate software, make decisions on what software to buy, and how to use it, and there are going to be really, really important questions about reliability here, because these are non-deterministic, or stochastic, systems. I think reliability is going to be the most important thing, and the way I think about reliability is two things: accuracy and verifiability. So going back to
this project that we started about a year ago, this is me and my students. We started to think about, okay, what is it going to look like if we use models for what we believe models do really well, which is go through lots and lots of data and find patterns and insights, or even take actions on behalf of the person? I teach HCI at Cal, so a lot of my students end up going into product roles or UX research roles, and so we said, let's try it out on something that we
do all the time, which is go through user data and try to understand what people's pain points are, or what we should be building next, those kinds of things. Now, I think conversational interfaces are great; they come very naturally to people. But I don't think the future will be a lot of one-shot interactions with models. Models are good at certain things, but for a lot of the workflows we see people actually do, it's going to take longer forms of engagement with multiple different models or different kinds of
algorithms. So we said, okay, let's build three template workflows, and we call them widgets. What each widget does is take some unstructured data and solve a problem for a potential product manager. The first one is a Q&A widget: you have your data, and you can ask questions about it. The second is a pain point tracker. Let's say you have all of your data, say reviews from customers, and you want to know what the main pain points are and how those pain points have changed over
time; so we made a widget that you can drag and drop over data, and it draws a graph for you: here are the pain points your customers mentioned, and here's how they've changed. And the third one we call the user insights widget. It takes a lot of qualitative data, and you can either ask it for specific things you want out of that data, or you can just say, find the main trends here, and it goes and searches through all that data and pulls it
out and gives you a whiteboard, with each trend being one of those squares, and it puts sticky notes in it. So we thought of these three workflows as things we think would be really helpful for product managers, and I'm going to show you a video of the thing we built. Mind you, this was a research project, mostly coded by students. Can we start playing it? Okay, I think we need to get rid of that thing. Awesome, thank you. So this is what we ended up building.
There's a space here, and you can pull in data. We ran this study with Notion reviews: we scraped App Store reviews for Notion that had three stars and below, to test the pain point tracker widget. Each one of these data points is a review that someone wrote on the App Store for Notion, and the task here is, it's impossible for someone to read all of these, but what if there was a widget? The way it works is that whatever data is in
the middle of the canvas is what these widgets will be run on. You give the widget a name, and you can either give it specific pain points you want it to look for, or you say, find the top three. When you run a widget over the data, it goes through all the data, analyzing it with an algorithm, and the output is a spreadsheet: here are the top five pain points everyone mentioned, and here's how those pain points have changed over time. So this is how many times each pain point was
mentioned each month, and there's a graph here. You can see "laggy and glitchy on iPad" was mentioned five times in August and then went down, so it probably got resolved, but another pain point, "lack of intuitive design," is going up in Notion's App Store reviews. This is the Q&A widget I talked about, where you can ask questions of the data, like, what are the biggest challenges with Notion AI? It goes through the data, puts together a response for you, and tells you its sources; that's where the verifiability comes in. And you
can keep running these widgets. We tried to make it really flexible, because our goal was to see what people will do with these kinds of things. This is another example of a widget; this one is more for qualitative interviews. We actually got these off of YouTube: they're reviews that people did of the Notion app, so these are YouTube video transcripts. We took the transcripts, and these were also very long, so we took tasks where actually doing it manually would take hours and hours, and we built
widgets. I'll do the user insights one. You give it a name, and you can give it specific topic areas you want it to look for in the data, like, I want to know more about the calendar app, and I want to know what people are saying about the AI output, because Notion added an AI bot. It goes and analyzes the data and puts it all together, and the output looks like this: it makes a whiteboard where there's a box for each user, a little summary of that interview, and
then, for all the questions that you had, it makes little sticky notes for each of those things, or anything else it finds in that data. So this is, for user two, what they said, and then there's a summary table at the end: for each of the things I was interested in, and each of the users, what did I hear? All right, so this is what we built, and our goal was to see: how do people use this, is it useful, what can we learn from giving it to some product managers and having them
use it for something that resembles a real-world task? So, if we could go back. [Audience, partially inaudible]: Did you build this from scratch? [Speaker]: We just built it from scratch; it's just a web app. [Audience]: And how does the data get in, like the different comments about Notion? [Speaker]: Yeah, so my students built this web app, and there were two ways to bring data in. One was that you could just upload data; the other was they wrote a script that could pull reviews from the Apple App Store API, so it can scrape that, or
it can take YouTube videos, things like that. Any other questions about the tool? I'm going to share what we learned from running some studies with it. [Audience]: How long did this take? [Speaker]: About six months. So we tried a lot of different things, and they're not all LLMs. Our insight was that it would be more interesting if, for each workflow, we designed an algorithm for that workflow. For instance, for the pain point tracker, there are multiple things that need to happen. First, you have all this data, and you need
to pull out what the main pain points are; then you somehow need to cluster them, because we need some understanding of what the groupings are; then, for each cluster, you need to count how many times it comes up. So we designed an algorithm that does that, and we tried a few different things. The first thing we tried was, for each review, let's pull out the pain points, one sentence per pain point. Now we have a long list of sentences; let's use a clustering algorithm to cluster them, and then, for each
cluster, let's figure out what that cluster represents. What ended up happening was that there was so much information loss, because we did the summarization too early, that it didn't really work. So we tried something else: we chunked the text, so each chunk was about a paragraph, made a vector for each of those paragraphs, and then clustered the vectors. Now we had a bunch of clusters; some of them represented pain points, some didn't, and the way we figured out which ones were pain points was,
for each cluster, we would sample two or three of those chunks and give them to an LLM, I think we used ChatGPT, and say, hey, do these sentences represent a pain point? If they did, then we would sample a few more and figure out what that pain point was, and if they didn't, we would just throw away that cluster. And that worked better. So that's part of my point here: I think it's going to take a whole lot of work to figure out how to develop
these workflows. It's going to take lots of algorithms, it's going to take different kinds of models that are good at different things, and we're going to move away from one-shot interactions with a single model, I think, toward figuring out what needs to be done and designing an algorithm that does that really well. Does that answer your question? Cool. For the user insights widget, we did a similar algorithm. There were two paths: one was if the user asked specific questions, the other
was if the user just said, find the top trends, and we developed an algorithm for each of those. The Q&A widget was the easiest; that was just basic RAG, retrieval-augmented generation. Any other questions? Yes? [Audience question, inaudible] [Speaker]: Yes, so the reviews were pulled with the App Store API, those were scraped that way; the YouTube videos, I think they just did manually, they just took the transcripts and uploaded them. So we did a user study on this: we had 16 product managers spend about two hours doing a pretty realistic task with this.
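The chunk, embed, cluster, then ask-an-LLM pipeline described a moment ago can be sketched roughly as follows. To be clear, this is a toy illustration, not the students' code: the "embedding" is a bag-of-words count over a made-up vocabulary, the "clustering" just groups chunks that mention the same terms, and `llm_says_pain_point` is a hypothetical stand-in for the ChatGPT call that decides whether a sampled cluster describes a pain point.

```python
import re
from collections import defaultdict

VOCAB = ["slow", "lag", "crash", "sync", "design", "confusing"]  # made up

def chunks_of(review):
    """Split a review into paragraph-sized chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", review) if p.strip()]

def embed(chunk):
    """Toy embedding: counts over a tiny fixed vocabulary.
    A real pipeline would call an embedding model here."""
    words = chunk.lower().split()
    return tuple(words.count(w) for w in VOCAB)

def cluster(vectors):
    """Toy clustering: group chunks whose nonzero vocabulary terms match.
    A real pipeline would run k-means or similar on dense vectors."""
    groups = defaultdict(list)
    for i, v in enumerate(vectors):
        groups[tuple(j for j, x in enumerate(v) if x)].append(i)
    return list(groups.values())

def llm_says_pain_point(samples):
    """Stand-in for the LLM call: 'do these sentences describe a pain
    point?' Here we just look for complaint words as substrings."""
    complaints = ("slow", "lag", "crash", "confusing")
    return any(w in s.lower() for s in samples for w in complaints)

def pain_point_clusters(reviews):
    chunks = [c for r in reviews for c in chunks_of(r)]
    vectors = [embed(c) for c in chunks]
    kept = []
    for idxs in cluster(vectors):
        samples = [chunks[i] for i in idxs[:3]]  # sample 2-3 chunks
        if llm_says_pain_point(samples):         # keep pain-point clusters
            kept.append(samples)
        # other clusters are thrown away, as described in the talk
    return kept
```

The "basic RAG" behind the Q&A widget has the same retrieve-then-ask shape: embed the chunks once, pull the chunks closest to the question, and hand them to the model alongside the question, which is also how the cited sources stay verifiable.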
So we said, imagine that you're the product manager at Notion and you have all this data, and we want you to end up with basically a document of next steps for Notion: what should they focus on next? We gave them access to all three widgets, did a quick overview of how to use them, the data was already there, and we started to look at how they used it. What we found was that they used it to navigate large amounts of unstructured data. Most
people used all three widgets, and we thought that was really interesting, because with the amount of time they had, about an hour, there was absolutely no way they could actually go read all that information, but they made use of the widgets to navigate that large amount of information, find the patterns and what's important, and they used that to write up a document of next steps. We found that they had really different workflows, and I thought that was really interesting; I think a big takeaway here is that this next generation of tools
is going to require a lot of customization. This is a few of the workflows people did. The blue ones here are any time someone used the user insights widget, the red ones are when they used the pain points widget, Q&A is orange, and gray is when they tried to do it manually, just opening a few reviews and reading them to get a feel for it. Each one of these columns is one person's flow. So this first one was someone who read the interviews, then picked a
few of them, then brainstormed, and then just did the Q&A once to ask, what are the metrics? The second one started with user insights to get an overview of the data, then went to pain points and got an overview of that data, so they did a qualitative analysis followed by a quantitative one; then they went back to user insights and ended with two questions. And then there were some people who did really unusual things: they did loops and went back and forth. But each person had a completely different flow, and I thought
that was really interesting, because I think it teaches us something about what people do well and what models do well. The PMs relied a lot on their intuition and what they knew contextually about the company, the business, the technology, and what needed to be done, and they used these as tools in different kinds of configurations. Another thing we found was that they did a lot of cross-verification. Any time you see one of these arrows that goes back and forth, they would run one widget and then do the same
thing on the same data, or ask the question slightly differently, and try to understand if it was right or not. So I think we're starting to pick up on the behavioral patterns we need to be able to work with these stochastic models: a lot of verification, a lot of trying things multiple different ways to see what happens. We found that there were two main modes of thinking that people used Yodi for, and I forgot to say, the tool was called Yodi. They
did divergent thinking, where they were branching out, generating multiple different kinds of solutions, and they did convergent thinking, where they were trying to bring everything together and prioritize. We thought they would probably use the widgets only for convergent thinking; our initial hypothesis was that people are good at divergence and creativity, and models are good at synthesis and pattern matching and bringing everything together. But what we found in practice was that they used the widgets for both of those things, so it was a lot more mixed and iterative than we expected.
We found that they did use it a lot for brainstorming: they used it to come up with new ideas, to surface assumptions they might have missed, and to discover new insights. And we found that they did a lot of trust building through transparency. One of the things we did right off the bat was that anything that was generated was generated with its source, so they could always validate, and they really appreciated that. But they also did other creative forms of validation, where they would ask one widget
to do something, then ask another widget to do the same thing, and cross-validate that way. There were some challenges we didn't expect, though, and I think those are really interesting. Our initial hypothesis was that the person has to be in charge: they have to decide which widgets to use when, and they have to configure the widgets. For each widget there were some things you could configure, like what questions to ask of this data, or what data to even run it on,
and when to use the widget. We thought this would be great, because we wanted the person to be in the driver's seat. But in the follow-up interviews, people did talk about confirmation bias. They said, if I have a hypothesis that something is going on in this data, and I ask that question, the model will go find the evidence for the thing I'm trying to find, right? Because there's so much data, it has to be somewhere in there. And so they actually sometimes wanted the widgets to push
back, for there to be something questioning their assumptions and stopping them from over-relying. They also worried about overreliance on these AI-generated insights: if someone else on the team came and said, hey, this is what the Yodi widget said, now there's no conversation, we have to go by whatever the AI said. And they worried that might lose some of the depth of understanding you would get from actually spending hours reading this stuff and really getting deep into it. They also worried about decision-making in high-
STI settings um requiring you know human oversight and um one thing that I think is really interesting was that they were talking about how I as the PM know things about the business that the AI cannot know because it's not even in the data most of what we know contextually and through conversations is actually never recorded anywhere so there's no way for it to be in any data that could end up in a model and so there are these things sometimes we call them human intuition sometimes uh it's you know just the art of say
product, that models just don't have access to, and that's really important if there's over-reliance on these models for decision-making. So there were three key takeaways for what people actually wanted. User control was of course one, but they also talked a lot about accountability and ethical considerations. One thing that they loved was that with the widgets they could look at the same data at different levels of granularity. The fact that they could pick which widget to use and how to run it, and go really deep into the data with question answering, or get a high-level overview with just the pain-point tracker, is something that they really liked, and I think it's a good design pattern as well for other people building tools. If you're incorporating these kinds of models in your tools, people really like having that granularity of options. They talked a lot about creativity versus objectivity. I think we're starting to see this a little bit, where tools like Grammarly and some of the other generation tools now give you the option of picking your tone, or some of them give you the option of picking the temperature. Now, temperature is something really technical in the model. People sometimes say higher temperature means higher creativity, but what it really means is that higher temperature is higher randomness, and sometimes that leads to creativity. So I think we'll see more and more of these sorts of granular options and adaptations made available to end users. And they really cared about it being customizable. I think that's really important, because each person did something completely different, and so I think that's a big design lesson as we're building more of
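The temperature point can be made concrete with a small sketch. This is not code from the talk; the logits and values are purely illustrative. It shows how dividing a model's logits by the temperature before the softmax changes the sampling distribution: high temperature flattens it toward uniform randomness, low temperature sharpens it toward the single most likely token.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index after temperature scaling.

    Dividing logits by the temperature before the softmax flattens the
    distribution when temperature > 1 (more random) and sharpens it when
    temperature < 1 (more deterministic). "Creativity" is a side effect
    of the added randomness, not a separate knob.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs

# Three candidate tokens with logits 2.0, 1.0, 0.1:
logits = [2.0, 1.0, 0.1]
_, cold = sample_with_temperature(logits, temperature=0.1)   # near-deterministic
_, hot = sample_with_temperature(logits, temperature=10.0)   # near-uniform
```

With temperature 0.1 almost all probability mass lands on the top token; with temperature 10 the three tokens become nearly equally likely, which is where the unpredictability comes from.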
these tools. They cared a lot about accountability, and I think that's really important. They wanted to know where the outputs come from. Has anyone had experiences with this, where you have an AI output and you kind of want to know, okay, but how did it come up with this output or this idea? Have you seen examples of it working well? Can you say more? [Audience: It gives you citations, and you can look at the sources, and in certain cases that's when you think, okay, now I get it.] Right, yeah, so citations have been very successful. [Audience: The citations are the reasoning. As a user, for example, when Gmail tells me this is spam, I don't want to just believe that; I want to understand why you marked it as spam. Maybe it's because I deleted the last five emails that looked like it. Give me that clarity, and then okay, you're doing a good job, because you're explaining to me what you're doing and why you're doing it.] Right, yes. So there's been a lot of research
into explainability: when a model produces an output, can we explain how it came up with that output? Sometimes we can, for certain kinds of models, but for a lot of models, and especially the newer black-box ones, no one actually knows how it came up with that idea. Now, there are some approaches called post-hoc explanations, which go like this: I know that this really great, really powerful model did this thing, and now another model is going to come up with an explanation for how it got there. The problem is that the other model might be wrong, and so there's been a lot of debate about whether that's even useful. All right, any questions? [Audience: How did this model come up with the questions for the users, or what kinds of questions do the users ask to extract the right information? How does it work? And is the algorithm that's being used here currently available on the market?] Ah yes, of course, thank you for reminding me. So the question is, are there any
tools on the market that can do that kind of thing, of, am I getting it right, scraping the data and then doing that kind of analysis. I think there are a few on the market; I think Cley is one, and there are a few startups doing this kind of thing. This was of course a research project, so it's actually on GitHub if you want to go run it yourself. Explainability was something that everyone wanted, but I'm not sure how much we can actually get there. We did the citations thing, and I think that's always useful and fairly easy to do: as long as you know where the data came in, you can just cite it. [Audience: Some of the LLMs, like, I recently did this with ChatGPT: I asked it to explain how it came to its answer, but then it comes back to veracity. Even within the same model, how trusting can you be of the reasoning, especially with temperature and other things? Unless there's history, it hasn't really saved the execution plan, for lack of a better word, that it went through. How accurate is it going to be able to be at explaining what it did?] Zero percent accurate, because ChatGPT is using transformer models; it's billions and billions of parameters, and no one knows how it got there. So asking it post hoc to explain what just happened, it just doesn't have access to that data, so it will come up with something, and that's the big problem, right? It always comes up with something that kind of
sounds reasonable, and our brains just can't deal with that, because it's kind of mimicking another person, and we're socialized and sort of hardwired so that when another person tells us something that sounds reasonable, we just accept it. That's how we've been able to live as social beings for millions of years. And so I think that's a really big design question: how do we as product managers and designers create these interactions that explain to the person what is actually going on, in a
way that they can use it reliably, but that doesn't become misleading. [Audience: A lot of the practice I've had is being able to debug these issues when they come up. I've had issues where I monitor a custom AI assistant and saw a response with no idea how it could have possibly produced that, and I try to recreate the exact scenario so we can trace the outcome and fix the issue. One bad direction is the bad review in the App Store.] Exactly, that's the sort of vibe-based evaluation, right? I always ask people what their favorite LLM chatbot is, and it's always a mix: some people love Gemini and some people love ChatGPT or Perplexity. And I always ask them why, and it's always, "this one time I asked it to do this thing and it did it wrong, and I asked the other one and it did it right, and then I completely switched." And that's going to be a big problem for product managers, right? Because you're the one who has to then explain to management why you lost this customer because of vibes, and then you have to go tell the ML engineer to fix the vibes. But I think transparency is something that people care a lot about; it builds trust. So maybe you don't give the best model to the person right off the bat; maybe you build trust with them first with a model that is maybe less powerful but makes fewer mistakes. There have been some people who have advocated for not using black-box models at all in really high-stakes settings like medicine. So we can
have a deep learning model, which we know nothing about how it works, and it can assign medicines with, say, 80% accuracy, or we can have a deterministic algorithm that is maybe a little less accurate, but we always know why it did the thing, so a doctor can go look. One of my friends, Hima Lakkaraju, who is a professor at Harvard, did a study where they built a system that takes the inputs and outputs of a big black-box model and builds a deterministic algorithm that more or less does the same as what the big model does. These types of algorithms are very common in medicine: if you go to the emergency room and you present with certain symptoms, usually it's an algorithm. What is the age of the person, and then it branches off in a flowchart, what are their symptoms, and then they end up with what they should do with you. And so she built a system that builds a flowchart like that off of how a big model that was trained on a lot of data acts. And it was really cool, because then you can go look into it and see, oh, it's discriminating against women on heart attacks, because you can see how it's coming up with that thing. But the thing it came up with isn't exactly what the big model is doing; it's just an approximation, how we think it's doing it. And there are people who say that in high-stakes settings we shouldn't be using the big black-box models at all, and we should stick to things that are deterministic. Now, for products it's a little bit of
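The approach described here, fitting a simple, inspectable model to the inputs and outputs of a black-box one, is often called a surrogate model. A toy sketch follows; it is entirely illustrative, not the system from the study. The black-box function, the features (age, cholesterol), and the thresholds are made up, and a real surrogate would be a full flowchart rather than a one-rule stump.

```python
def blackbox(age, cholesterol):
    """Hypothetical stand-in for an opaque model we cannot inspect."""
    return 1 if 0.03 * age + 0.01 * cholesterol > 4.0 else 0

def fit_stump(inputs, labels):
    """Fit a one-rule surrogate: the single feature/threshold split that
    best reproduces the black box's outputs. Unlike the black box, the
    resulting rule can be read and audited, but it is only an
    approximation of what the big model actually does."""
    best = None  # (fidelity, feature index, threshold)
    for f in range(len(inputs[0])):
        for t in sorted({x[f] for x in inputs}):
            preds = [1 if x[f] > t else 0 for x in inputs]
            fidelity = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or fidelity > best[0]:
                best = (fidelity, f, t)
    return best

# Probe the black box on a grid of inputs, then fit the surrogate to its answers.
grid = [(a, c) for a in range(20, 90, 5) for c in range(150, 300, 10)]
labels = [blackbox(a, c) for a, c in grid]
fidelity, feature, threshold = fit_stump(grid, labels)
```

The auditable rule is where you would spot something like the heart-attack bias, and the gap between `fidelity` and 100% is exactly the "it's just an approximation" caveat from the talk.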
a different story, because sometimes the big black-box models are so good that they can lead to, you know, a ChatGPT moment. But then you have to think about what is the worst thing that could happen to a user with this model, and how do I build trust with them when they've just joined? And I think those are all things that we're going to see more and more of. Audit trails were something that people asked for, especially in collaborative settings. That's exactly the sort of "I did the same thing twice and I got two different things" situation, and that becomes a problem if your colleague did it and got one thing, and you did it and got a different thing. So I think we'll see more and more of these sorts of accountability measures on top of models. Logging inputs and outputs is something that a lot of companies are doing, and I think we might see more of that being made transparent to end users. And the last one was that people wanted explainability of what the limitations of the tools are, just like more, you
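The audit-trail idea can be sketched as an append-only, hash-chained log of model inputs and outputs. This is a hypothetical minimal design, not the system from the study; class and field names are invented for illustration.

```python
import hashlib
import json
import time

class AuditTrail:
    """Append-only log of model inputs and outputs.

    Each entry records who asked what and what the model returned, with a
    hash chained to the previous entry so tampering is detectable."""

    def __init__(self):
        self.entries = []

    def log(self, user, prompt, output, model="example-model"):
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        record = {
            "user": user,
            "model": model,
            "prompt": prompt,
            "output": output,
            "timestamp": time.time(),
            "prev": prev_hash,
        }
        # Hash the record (which includes the previous hash) to form a chain.
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self.entries.append(record)
        return record

    def verify(self):
        """Recompute every hash; returns False if any entry was altered."""
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

With a shared trail like this, two colleagues who ran the same prompt and got different outputs can at least point to the logged entries instead of arguing from memory.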
know, end-user education on what it can and cannot do. And I'll end with some ethical considerations that our participants talked about. Privacy and confidentiality was of course a big one; any kind of enterprise tool is going to deal with that. I think companies have gotten much more privacy-aware, especially with some of the things in the news about OpenAI or other companies training on people's data; they've just become a lot more conscious of that, and that's important. Ways to mitigate the confirmation bias of what the model tells people came up. And they talked a lot about balancing the benefits they saw coming from automation with human insight. They said things like, this automation really increases my efficiency, but it needs to be balanced with me and my intuition and my central role as the decision maker. So AI tools should be enhancing, not replacing, human judgment. I'll end there with just three key insights. One thing we learned was that LLMs are super flexible and very unpredictable, so there's a
big challenge ahead of us in how we build products around these models that make use of how powerful the models are, but lead to users actually trusting them and using them reliably. And then we did some comparisons to more deterministic ways to keep data, and one of those is knowledge graphs. You might have heard that a lot of companies are now incorporating different kinds of knowledge bases in addition to LLMs. I think we'll see a lot more of that kind of duality, where LLMs, or models in general, are very good at language generation and very bad at being reliable, and at the other end of the extreme are reliable, deterministic knowledge sources like knowledge graphs. So I think we'll see more and more combinations of these kinds of things, and user-guided combinations of these kinds of things. I'll stop there. Thank you so much for listening. [Applause]
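To illustrate the closing point about pairing LLMs with deterministic knowledge sources, here is a hypothetical sketch (not from the talk; all names and facts are made up). A knowledge graph answers the factual part of a query, and the language model, stubbed out below, only phrases the answer, so facts stay deterministic while generation stays fluent.

```python
# (subject, relation) -> fact: a toy stand-in for a real knowledge graph store.
KNOWLEDGE_GRAPH = {
    ("WidgetCo", "founded"): "2019",
    ("WidgetCo", "headquarters"): "Oakland",
}

def lookup(subject, relation):
    """Deterministic retrieval: the same query always returns the same fact."""
    return KNOWLEDGE_GRAPH.get((subject, relation))

def verbalize(subject, relation, fact):
    """Stand-in for an LLM call whose only job is phrasing, not recall."""
    return f"{subject} ({relation}): {fact}"

def answer(subject, relation):
    fact = lookup(subject, relation)
    if fact is None:
        # Refuse rather than let the generator guess: the graph is the
        # source of truth, and the model never fills gaps on its own.
        return "I don't have that in the knowledge base."
    return verbalize(subject, relation, fact)
```

The design choice is the refusal branch: the fluent but unreliable component is never allowed to invent a fact that the reliable component cannot supply.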