Hello Spark fans, welcome back to Advancing Spark, brought to you by Advancing Analytics. This is the third of our little Databricks AI/BI Genie miniseries. In the first video I showed you what a good experience with using Genie should look like. The second was more about the steps for actually building it: click here, click here, click here, this is how you produce a thing. In the third one I want to step through some best practice guidelines: more about how to think about building it, and more about the
Genie strategy you should be using if you're rolling this out in your business. Now, this is going to be based on some documentation, literally just the Databricks guidelines, here, for some best practices. It's kind of best practices, but it's also kind of strategic: how you think about it, how to get the most out of it. So I want to just walk through some of those steps and how we've seen it start to evolve. Obviously this is all new, and we're going to see these evolve and change as businesses get
used to having this kind of chat experience as just part of that analytics tool set. But yeah, let's talk about how we kind of figure these things out. As always, if you're new around here, don't forget to like and subscribe. And let us know if you're already using Genie, let us know if your business users are clamoring for some kind of chat interface to their data, or if you're struggling to get them to think about it and they're like, "no, just give me an Excel, because that's what I do." And
you know, one day, one day that'll change. Today's not that day. Cool, so let's get started. I've got this page up: this is from the Databricks documentation, and this one is the Azure flavor, but it's on all flavors. It's just the best practices page of the Genie section. It's a bit of a weird one, because it's not just best practice, it's also how you go about rolling it out, and that's the important thing: who is the person rolling it out, who do you
roll it out to, how much do you roll out at a time? Those are the kind of questions it's trying to ask, and that's what I want to talk about today: trying to shape that, trying to get that in the right place in your head if you're thinking about using this thing. So okay, let's start having a little look through. Number one, most important of all things, is start small. Now, this is a mistake I made when I first started using Genie. I just collected as many
tables as I could actually physically get it to have, and then had it as this really open space where I could ask any question that popped into my head about the company's data and just get a response. That was my dream, that's what I was thinking of, and that's not what Genie is. That is not how to think about Databricks AI/BI Genie. Genie is like having a team of experts, and you go: right, you're the expert in this, I'm going to ask you a question; oh, you're the expert in this, I'm
going to ask you a question. That's how to think about Genie: it is having a selection of tightly focused and scoped chat rooms, so you can ask a certain thing about a certain thing and it will know what you're talking about. That is the positioning we're trying to go for. So if you're coming from things like Microsoft's Copilot, you know, all these ChatGPT-style, super wide, general knowledge LLMs, it's a different experience. You can't just go to it and ask literally anything that
comes to mind. You go to the Genie space for the question, the domain expert you're trying to get, and it will answer those questions for you. So keeping it focused is huge. There's a limit to how many tables you can actually get into Genie, but it's going to perform better the fewer tables it has. They say five or fewer; I'd go a couple more, depending on what you're doing. I mean, most of the time we've already done some data modeling, we've understood the business problems, we've built this
into a star schema. So essentially, scope it around a star schema: a big old fact table, a few dimensions, and maybe you might have a couple of big aggregated summary tables to use instead, but that kind of thing. So if you're doing a big rollout of data across your enterprise, and you've actually done a load of groundwork to design data products that are targeted at specific different domains, you've modeled this process, that process, that process, then building Genie rooms around each of those particular data products
is kind of how it makes sense. That's how we're starting to think about it. We're not trying to build one Genie chat room that can answer any business question; it is a bunch of experts around different areas. So keep it focused, keep it small. And when it says a small amount of data, that does not mean number of rows. That means just a small number of data objects, which can have billions and billions of rows in there. Planning to iterate: absolutely, this is an iterative process, much in the same way that if you
didn't speak to your business users, and you designed a dashboard and you wireframed it and you went through a detailed design, then you finally build it and you show it to them, they go "oh, that's not really what I need." You should iterate on all of these things. It's very much the same in Genie: they're going to ask questions you hadn't preempted, they're going to try and use it in a way that you hadn't even dreamed of. So that process of going: do some prepping, do some planning, try and get it good, but then
put it in front of people and iterate, has to be the way you build it. It is going to be a living product that slowly grows as people actually onboard to it and evolve it and go with it. Building on well-annotated tables is an absolute no-brainer. You have to do this on something that actually has the real business context in there. What does that mean? That means Unity Catalog. It means good column descriptions, it means tagging, it means foreign keys and primary keys, it means doing your data modeling and
your data governance well. The way it works: you've got your instructions, that is the big free text space that gives general instructions to that Genie room, but it's also got the Unity Catalog metadata. So whenever you ask a question, it grabs those instructions and the metadata and passes that through to a large language model that knows nothing else about your data. Think about it: each time it's fresh. It's just being given the context of the last questions you asked in that same chat, any instructions, and any metadata. If it's going to be able to know
well, that column name is about that, and that column name is about that, it has to have more context, and it gets that from your column descriptions. So your tables have to have a good description of what that table's for, and your columns have to have good descriptions: this column contains information about this, this column contains information about this. Really important to know: Genie does not see the actual content of the data. It can never go: well, actually, I'll just run a little query, oh, there's only 10 different values in there, I'll use that
to work things out. It doesn't have that context. At no point does the actual content of your tables get sent to that large language model; it is purely the metadata about it that is sent as part of your prompt. So you have to make sure your governance is there, you have to make sure your column descriptions are good. If things are in a certain style, a certain format, it's good to have that in the comment, so it knows how to ask that question, knows how to build your WHERE clauses or whatnot. Really, really
really, really important. So again, yes, data governance is cool again. There's now suddenly a driver to spend a lot of time making sure your governance is up to scratch, because then any kind of AI that's going to be writing queries and asking questions of your data is going to be better. So yeah, kind of makes sense. Have a domain expert define the space: well, yeah, I mean, it's the age-old problem, right? We've seen it so many times: a data team, a team of data engineers, has been told, okay, you need to write, we
need a new data model about this particular process, go and build something about, I don't know, a manufacturing line and the efficiency of it. And they'll go and root around the source system, find where that information is, build a data model, and name things based on what they're called in the source system. And the business won't have a clue what that means: it's in different terminology, it's in the language of the system, not the language of the people. Same thing here, right? You need someone from the business actually defining and curating and working with
it. If there's an embedded analyst who answers all their business questions, bring them into the story as the curator. Allow them to define the different questions, allow them to define a lot of the instructions and personality of it. If you are not talking to the business a lot, you're going to have a real hard time building out a Genie chat room that actually captures their interest, that is asking the right questions, that is preempting the terminology and the ways of asking questions that they're actually going to use. So absolutely have someone who's
on board doing that. Now ideally that first person should be really proficient in SQL, because they're going to be eyeballing the SQL generated: I ask that question, this is generated, that doesn't seem right; I ask that question, this is generated, oh no, that's not right. The person first building it needs to do that sense checking: have you set it up in the right way, are your descriptions right, is it inferring the wrong information? You need someone to be doing that and nudging it in the right direction. Once it's there, you can start to widen
and then you start to adjust it and you work with it. Defining the purpose of the space: absolutely. That whole thing of, if you have people log into it and it doesn't have any sample questions, it doesn't have any context, it's really hard to get them to ask the right questions. They're going to think, "oh, it's like ChatGPT, I'm just going to ask any question that comes to my mind, I'm going to try and get it to generate an image for me." That is not how Genie works. Whereas if you have a really
good description and you have a load of pre-seeded sample questions that whet their appetite for the kind of questions it can answer, and maybe prompt them with some things they haven't really thought about, then they're going to be scoping, they're going to be thinking, "okay, I can ask questions about that, oh, what about this," and start doing it that way. So when you go to a particular room, there's a particular flavor, a particular scent, some particular prompts that nudge people in the right direction so they know how to
actually work that. "What kind of questions should I ask you?" is something you can pre-seed as a prompt, so it says, "well, I'm an expert in these things, why not ask me these questions?" Build that into your instructions so it can actually guide them in the right way. Testing and adjusting: yeah, absolutely. So you've got a curator, you've got a champion analyst who's had a go themselves. So ideally it's: I've built this, I've logged in, I understand the domain problem, I'm asking it questions myself, I'm looking at the
SQL going "oh, that's not quite right, that's not quite right." I'm going through that process, and I'm getting it to the point where the questions I think of off the top of my head, or my domain champion thinks of, it's starting to answer quite well. So at that point, I've asked it a few things and it's not quite got the terminology, so I've added details into my instructions to go: if someone asks about year, they mean fiscal year; if someone asks about X, they mean Y,
that kind of thing. I've got some translations in there. There's one or two queries that it did okay, but actually there's a way we normally write that query, and I've added that as a pre-seeded bit of SQL, and I'm pretty happy about how it's working. Well, okay, this works for me. Next, before just hitting a button and putting it live across everyone, you should kind of almost conduct a focus group: essentially get a small selection of tame users, I guess, people who have been prepared, people who know that it's
not going to be perfect. That's a real challenge in some companies. We talk to a lot of clients who are like, "no, we can't show anyone the dashboard until it's fully complete, because if there's mistakes in it they just won't trust the data ever again." And that's really hard, because what you really need is an engaged user base that is working with you. You go: right, okay, we've got our first alpha release, it's going to be okay at some stuff, it's going to be rubbish at other stuff, can you give it a go and flag it
good or bad? And as soon as you flag something's bad, we'll go and switch things around, we'll fix it, and you'll see it improve, and you can be part of this journey, you can help us get to where we need to be. That's the ideal, right? So rather than just pushing out to everybody and then dealing with the wave of problems, go: okay, you four or five people who understand the domain and have actual questions, I've kind of got it to where I need it to be. But you will never have a scenario where you have preempted
all of the things a user will possibly do. Anyone who's done any kind of UI/UX work will know you cannot predict how a user will use your system. You can have a stab, but the best way to actually know is to let users use it and go, "oh, I didn't think they'd click that button." Okay, same here: they're going to ask questions you haven't thought about, they're going to do chains of questions, "oh well, that's interesting, what about this? Oh well, can you show me this?" and dig into stuff. They're going to
work with it in a way that you're not expecting. Now, Genie has this whole log of: this is the question, this is the response, this is the question, this is the response. And it has a feedback mechanism of thumbs up or thumbs down. So if you've actually prepped your users and said: look, I've given this a good stab, I think it can do a lot of things, but you guys are the experts, can you try and ask this, and anytime it's good, pop a thumbs up, anytime it's not quite right, even if it's just slightly
off, can you give it a thumbs down? And then you're in this repeated thing where they do that for a couple of hours in the morning, and in the afternoon you go in, get all of the responses that had a thumbs down, and just go: right, okay, we can improve the docs for that one, we can improve the comments in Unity Catalog, we can add an instruction for that, oh, I need to add a SQL query for that. You just go through that iterative process and then get them to retry that stuff, and
they'll get a better experience, and they go, "oh, this is fixed." It's not a long turnaround, it's not "oh, I'll put a request in and then three weeks later I might get a change made." It's a real quick iteration: find a problem, it's better; find a problem, it's fixed; find a problem, it's fixed. They get engaged, they ask better questions, and they get to the point where, oh, actually, now it's answering all of their questions quite well. Then it's ready to be pushed out to the general public, the general business,
to start asking questions on it. So it's that little journey of: get it right yourself, get a load of other people to ask a load of questions of it, then roll it out. That's the kind of journey that we're looking at. It's not the same as a very formal application process where you're going to try and preempt and predict everything. It doesn't need to be. It's better that it's actually just tested and tried and evolved with how the business actually uses it. That's kind of how we're looking at things. So you've gone through
that process, you've got it rolled out, and you've got some thumbs downs coming from your various different users, whether it's live or this is your trial test bed of beta users. What are the problems you can have? So, misunderstanding business jargon: absolutely. If people use shorthand, and again the example here is if they say year, but actually whenever we say year in my organization we mean fiscal year. If you've got any acronyms, if you've got any shorthand, if you've got any terms where you say profit and profit is actually defined by
a specific calculation, and that's not documented anywhere, well, that needs to be documented for Genie to be able to understand it. And that is going back and adding into your instructions: well, if someone says year, they mean fiscal year; if someone says profit, they mean this calculation. You can add all of those as instructions. Now, last year Databricks announced that metrics are a thing that's coming, so I'm hoping that's going to replace having that embedded into instructions, so we can just start to add these things into
Unity Catalog. We've not seen that yet, and we don't know how much Genie is actually going to work with that. So for now, it's adding it into your instructions pool, and then Genie will have that logic as part of its prompt and use that as it goes forwards. So, business jargon: you will see people try and use it and it doesn't answer right. Make sure they are prepped to give it a thumbs down, then you can pick all those up, translate that jargon, and start to collect it. Okay, incorrect table or column usage: so the SQL
that it writes is not correct. What's going on there, and how can we fix it? One, it probably means your Unity Catalog definitions are not right. It probably means that it might just have very vague, fluffy definitions that overlap between certain columns, which isn't particularly good. So go in and refine those comments, adding more detail into those column-level descriptions. Maybe if you've only got table-level descriptions and you've not done it on the column level, you need to go through and finish that. If it doesn't have foreign keys, it's going to help fix
it if you actually put those things in. If that still doesn't work, so just having the pure docs right doesn't give enough context, well, adding example queries works. Now, you can only have a limited number of those example queries, and you don't want to get to the point where you have literally every question with an example query. But if there's a specific thing that it's trying to do and it just doesn't get it from the context, yeah, you need to go and add that in. Maybe there's a
particular concept where one thing needs to be rolled up as a subquery, and it's always used in a certain way. Sure, put an example in there. And sometimes you might just have too much data in there. You might have, actually, we've got two big fact tables and we've got a load of dimensions, kind of like a constellation thing. Maybe you want to split those out into two different Genie rooms and be very explicit about the kind of questions they cover, in which case you're not going to get as much overlap, and it's not going
to get confused about which tables to use for a certain thing. Preferably you don't have to get there, but certainly you can go and do that. Filtering errors: if it's having problems with the WHERE clause. The WHERE clause is the interesting one, because usually it's going to try and put a specific value into that WHERE clause: where something equals, and then a literal string that it passes down. But as we said earlier, it never sees the data, so it doesn't actually know "here are the 10 different strings that are stored in this column." It has to kind of
infer that from a bunch of other stuff. So you have to, if you can, put enough information in your descriptions to try and give it that info. If you can say, as it says here, "oh, by the way, this is stored as a three-letter ISO code," it knows the ISO codes. Whereas if there's a categorization, you know, it could be one of three segments, you can include that in your instructions, saying this column can be one of these three values, and then when it constructs the query it'll put
that in. Now, you can't do that for everything, but be aware of that: it literally does not know what that data contains. It can only infer it based on what your description of that column is and any context that it's got. So yeah, that's one of those ones: if you have real specific formats of things inside your columns, and you're expecting it to be able to write SQL queries that filter that way, you need to preempt that, you need to try and work around
that. Incorrect joins: as we mentioned, you can have foreign keys in there, and that's going to help. And again, the less complex the joins you have, the better. Now, this is one of those where, when Genie first came out, I saw lots of people do it, and one of the first things I did was try to use unmodeled data. I tried to take data that was in the silver layer of my lake, still in its source-system shape, like third normal form, coming from whatever garbage I got in, and I was
trying to say, "yeah, just write a query on that, figure it out, turn it into an analytical model on the fly," and it kind of struggled. If it needs to understand a complex relationship that then needs to boil down into an aggregation, it's going to try, but it's not going to be 100% successful. So the simpler and more standard your data model, the better; the closer to a straightforward star schema, the better; if you have aggregate tables, even better. So the simpler the thing, the better
chance you're going to have, but obviously the more work you have to do. So it's a nice balance of going: well, this is the data product I put into my semantic model, I'm going to put a Genie room on top of that, that works for me. And as it says, if it's really a problem, you can just put a view over the top and then build things on top of it, if it has particularly complex things it's trying to do. Again, metrics we kind of talked about, so we have
that; just make sure they're actually baked into things. If you can have anything pre-computed as part of the fact table, do it. As always, it's the tradition of: calculate everything as far upstream as possible, but as far downstream as necessary. Roche's Maxim. We need to actually make sure that is plugged in. So if you can pre-calculate anything in your data, do it. If you can bake anything into your views, do it. Otherwise, make sure it's in your instructions, so it actually has the context of how to do it. Time-based things: if you
have a whole mix of time zones, it might not have that context. So again, as part of your data model, make sure things are in a standard time zone. Keep things in UTC, so it's nice and simple, or if it's not UTC, make sure you have the time zone actually properly in there, then it'll actually understand it. Again, try and actually reference things. It's things like, if someone says "how much have we made this year?", making sure that year is an understood measure: well, what is my financial year, when
does my financial year start, am I talking calendar year, am I talking financial year? You can plug some of that context in; you just need to preempt it and make sure it's actually in there. You might find that it ignores some instructions occasionally. If you've got a load of comments and things in there, it might just not get it. So yes, your example SQL will help. Keeping views simple and keeping your instructions quite simple is a fairly big thing. It's always tempting in that instructions part to give long, verbose sentences, if you were
as if you were explaining things to a junior developer, where you'd give quite a verbose explanation, paragraphs explaining these different things. It doesn't need that. You want to keep your instructions short and sharp: just a bullet point list of the different things it's trying to do, to keep it very, very simple. There's no room for misinterpretation. It doesn't need extraneous words and vocabulary to make yourself sound smart. Just say: this is this, this is this, this is this. You don't need to say please in your system instructions, because it's just going to
make it harder for it to actually understand. And yeah, try to keep things tight. And again, if it is struggling to actually understand what you're talking about, it's probably because you've got the scope of your Genie room too wide; you've probably got too many different domains you're trying to wrap into a single one. Keep it focused. If you are starting small and going, here's a fact table, a few dimensions, example questions around it, you shouldn't have problems. It's only if you get into real complex things you're trying to just jam into a single room, then,
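
To make the "short, blunt instructions" point concrete, here's a minimal sketch of keeping your jargon translations in one place and generating the bullet list from them. It's just Python building a string you could paste into the instructions box; it doesn't call any Genie API. The glossary entries (the fiscal year start, the profit formula), the word cap, and the function name are all made-up examples, so swap in your own.

```python
# Hypothetical glossary: business term -> what it means in this Genie space.
# These entries are invented examples, not from any real organization.
GLOSSARY = {
    "year": "fiscal year, starting 1 April",
    "profit": "revenue minus cost of goods sold",
}

MAX_WORDS_PER_LINE = 15  # arbitrary cap to keep each instruction blunt


def build_instructions(glossary: dict[str, str]) -> str:
    """Render the glossary as a terse bullet list, one rule per line."""
    lines = []
    for term, meaning in glossary.items():
        line = f'- "{term}" means {meaning}'
        # Reject wordy rules instead of silently accepting a paragraph.
        if len(line.split()) > MAX_WORDS_PER_LINE:
            raise ValueError(f"instruction for {term!r} is too wordy")
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_instructions(GLOSSARY))
```

The word cap is only there to force the "this is this, this is this" style; nothing about Genie requires that particular number.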
yeah. Performance: if you're trying to do really, really long queries, or it's struggling to get there, you can do things like trusted assets. You can actually plug in, well, here's just a Python function to go and run and pull results back. You can pre-can some stuff, and you can plug some performance tuning in if you need to. Generally you can avoid it. Unreliable responses: yeah, I mean, if something is absolutely, incredibly ironclad and important, then yes, you can just pre-can that SQL. Again, similar story: you can
only do it so much. And then one of the final ones is this token limit. Now, this is the balance. The answer that we've been given all the way through this is: make sure Unity Catalog has great descriptions, make sure the tables and columns have great descriptions and you've got metadata in there. If all else fails, make sure that the instructions inside your Genie room have loads of examples, loads of context, loads of information; make sure you've got lots of example SQL queries. However, there is a token limit. So like any large
language model, they all have a token limit. The token limit is essentially how much text you can pass in as part of your prompt, counted in tokens, which are chunks of words, rather than characters. And your prompt isn't just the question your user is asking: it's the question they're asking, the context within that same chat window, plus the Unity Catalog metadata, plus that system instruction prompt that we put in there. Jam all that together, and that is the overall prompt that gets sent. And if you think about going to ChatGPT and just writing an essay every single time you ask a question, that can
get quite verbose; it can get quite big, what's going in there. You might reach a token limit. If it does, it's going to try and handle it itself by prioritizing certain things that it passes through. So it's not going to pass the whole schema, just the schema it thinks is useful. It's not going to pass all the instructions, just the bits that are relevant. And there's the chance there that, if it's actually making a decision about how much of that to send through, maybe it's not going to send everything that it needs to.
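
If you want a feel for where that budget goes, here's a rough sketch. It assumes about four characters per token, which is only a common rule of thumb for English text (real tokenizers vary), and the limit you pass in is a number you choose yourself; Genie's actual limit and the way it trims the prompt aren't modeled here. The class and function names are made up for illustration.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # rule-of-thumb assumption, not a real tokenizer


@dataclass
class PromptParts:
    """The four pieces described above that get combined into one prompt."""
    question: str
    chat_history: str
    table_metadata: str
    instructions: str


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: character count divided by 4, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def budget_report(parts: PromptParts, limit: int) -> dict:
    """Estimate how many tokens each piece contributes and flag an overrun."""
    sizes = {name: estimate_tokens(value) for name, value in vars(parts).items()}
    total = sum(sizes.values())
    return {"sizes": sizes, "total": total, "over_limit": total > limit}


if __name__ == "__main__":
    parts = PromptParts(
        question="How much did we make this year?",
        chat_history="",
        table_metadata="fact_orders: one row per order line. " * 50,
        instructions='- "year" means fiscal year',
    )
    print(budget_report(parts, limit=400))
```

Run it against your own descriptions and instructions and you can at least see which of the four pieces is eating the space before you start pruning.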
Maybe that's why it's not listening to some of your instructions, maybe that's why it's not getting the right context for the table: because you just tried to get too much in there. So yes, you can remove unnecessary columns, which reduces the amount of metadata it's sending through. Streamlining column descriptions: just keep it real, real simple, and yeah, it's going to make it easier. Now, it's slightly tricky there, right? Because the whole point of those column descriptions is so a person can read them and understand what's going on. And when we're now saying
well, actually, we're taking the person out of the loop and we're passing this to a large language model, it doesn't need nice, flowery, verbose language; it needs "this is what it actually refers to." And that's tricky, because Unity Catalog should be both. We've got two things we're using it for: one, as a data dictionary for actual people to discover stuff; two, to pass to the large language model, which doesn't need the extra context. So it's slightly tricky what you want to prioritize in your Unity Catalog descriptions. Simplify instructions: just keep things
blunt, easy. No one else is going to read that, so make it super simple. And yes, prune the number of different SQL statements. So that's the tricky bit, right? All of the advice so far has been: add more context, add more descriptions, add more examples, add more stuff. And then you'll get to a point where you've got too much stuff in there and you need to start paring it down and streamlining it. That's why starting small, starting really tightly focused, starting with "we are answering questions about this," is going to set you up for
success, so you're not going to get to this point where you then have to go, "oh, but we fixed all these things, and now we're going to have to squeeze it down and maybe make compromises." That's the kind of thing that we're trying to do. And then yes, finally: if you're not enabled for cross-geo processing, you're going to have a problem, and you might reach throughput limits if thousands of people are trying to ask things. So 20 questions per minute across all of your Genie spaces actually isn't that high if you're pushing this
out to hundreds of users. So it has a little throughput limit for now, which I'm sure is something that will change as it gets more mature. It only went GA last month, yeah, so these are very, very small things. And that's it: that is all of the advice that we have coming from Databricks about how best to think about Genie. Now, there's a lot in there about specific things you can do to fix stuff, but in reality it's a fairly easy strategy, right? Think about a particular problem you're trying
to solve. Prepare a load of questions and how you want to answer them. Make sure it can answer your own questions. Get a bunch of trusted users who understand that this is an early model that's going to have problems, that won't do everything. Get them into a very tight, very reactive feedback cycle. Fix all their problems. Get it to the point where they can ask questions and they trust it, and they start championing it, they start using it. Push it out to a wider user base, and then just look after it and iterate on it. Keep
looking back at the thumbs downs. Don't just put it live and then forget about it. Have a weekly review, set up an alert, go and look, and ask: am I still getting any thumbs downs, are there any problems? Now, some of those thumbs downs might be from users who just didn't quite get it and were asking the wrong questions, or were asking questions outside the domain scope of that thing. That's an education piece: reach out to them, speak to them, go: hey, I just want to talk to you about how this should
work. Or maybe you need to improve your documentation or your onboarding. Otherwise it's a case of: oh, we need to add more things to the instructions, we need to add some example SQL, bearing in mind that there is a limit and we might need to pare things down if it gets to that point. If you're doing those things, if that's how you're thinking about stuff, and you're rolling out these tightly scoped, focused Genie spaces, each with a champion, each with a user group that's committed to making it better, it can be an actually really, really effective
part of your analytics story. It's not replacing things like dashboards. If I just want to see 20 KPIs at once and have a glance at whether any of them are bad, I'm still going to look at a scorecard for it; I'm not going to go and ask that question every day. But as soon as I want to go, "wait, what's the difference between those?", where I'd otherwise have to ask someone that question, that's what I'm going to fall back to Genie for. That's where we need to start building this into our workloads. So yeah,
interesting times. It's going to be interesting to see how this actually works in production, as we start to see people with it live for three months, six months, nine months. What do the Genie spaces actually look like once they've been out in the wild for months with that feedback loop going? How well structured are people's instructions? Is it just a chaos of little organic things, or are people actually managing it properly? And how much do we then rip that out and replace it with things like metrics when they come out? Loads of questions in there,
lots of things we're looking at, but yeah, certainly interesting times. Hopefully that helps, and yeah, let me know if you're actually getting on board with Genie and actually moving it forward; it'd be super interesting to see. All right, that is everything I want to go through. As always, don't forget to like and subscribe, and I'll catch you next time. Cheers.
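
Going back to the point about well-annotated tables: here's a minimal sketch of what that groundwork could look like, generating the Databricks SQL statements for table comments, column comments, and a foreign key. It only builds the SQL text; you'd run the statements yourself in a notebook or the SQL editor. The table names, column names, and descriptions are invented examples, the helper names are made up, and the backslash escaping assumes the default Spark SQL string-literal behavior. Identifiers are not quoted or validated, so treat this as illustration rather than something to point at untrusted input.

```python
def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal (backslash escaping assumed)."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def comment_statements(table: str, table_comment: str,
                       column_comments: dict[str, str]) -> list[str]:
    """Build one table-level comment plus one comment per column."""
    statements = [f"COMMENT ON TABLE {table} IS {sql_literal(table_comment)}"]
    for column, comment in column_comments.items():
        statements.append(
            f"ALTER TABLE {table} ALTER COLUMN {column} COMMENT {sql_literal(comment)}"
        )
    return statements


def foreign_key_statement(table: str, column: str, ref_table: str,
                          ref_column: str, name: str) -> str:
    """Build a foreign key constraint; the parent table needs a primary key first."""
    return (
        f"ALTER TABLE {table} ADD CONSTRAINT {name} "
        f"FOREIGN KEY ({column}) REFERENCES {ref_table} ({ref_column})"
    )


if __name__ == "__main__":
    # Hypothetical star schema: one fact table, one dimension.
    for stmt in comment_statements(
        "sales.fact_orders",
        "One row per order line",
        {
            "country_code": "Three-letter ISO country code of the customer",
            "order_ts": "Order timestamp in UTC",
        },
    ):
        print(stmt + ";")
    print(foreign_key_statement(
        "sales.fact_orders", "customer_id",
        "sales.dim_customer", "customer_id", "fk_orders_customer",
    ) + ";")
```

Notice the descriptions follow the same advice as the instructions: state the format (ISO code, UTC) bluntly, because that's the only thing the model gets to see about the column.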