hey everyone welcome back to the channel today we're continuing our DP-600 exam preparation course and we're up to video 8 we're making very good progress here on the course plan today we're going to be looking at optimizing performance and specifically we're going to be covering these bullet points in the DP-600 study guide so we're going to be looking at mainly identifying and resolving performance issues right so when you're loading data or when you're querying data or transforming data and specifically we'll be looking within the data flow, the notebook (so the Spark engine) and also SQL
queries as well then we're going to look at Delta tables in a bit more detail how we can identify and resolve issues within our Delta tables as you know fabric is built on top of the Delta file format so that's a really important topic to understand and as part of that we're going to be looking at file partitioning as well so what that is what that looks like why you might want to implement file partitioning in your Lakehouse so as ever at the end of the video we'll be testing some of your knowledge from the
topics that we cover in this lesson and as ever I've got some quite detailed notes that you can use to enhance your revision available in our school community I'll leave a link to that in the description box below so most of this video I'm going to be diving into Fabric and going through performance optimization in a number of different places in fabric but I just wanted to start by framing what we mean really by performance optimization in fabric now as you know fabric is a very diverse tool so when we talk about performance really
we need to get a bit more specific about what we're talking about it could be data flow performance it could be a SQL script in your data warehouse or we could talk about the Delta files and optimizing how they get written and read in our OneLake and in our lakehouses as well and I would make the distinction here between identifying performance issues and then resolving them it's kind of like a two-step process first we need to know how to identify performance issues in each of these tools and then we need to think about
how we can possibly resolve these issues after we identify them and you'll notice for each of the fabric items how we identify performance issues is going to be a little bit different right so for the data flow we're going to be looking at the refresh history and the monitoring Hub and the capacity metrics app and you'll notice that some of these actually repeat so the monitoring Hub and the capacity metrics app are kind of like generic places where you can do lots of performance monitoring across Fabric in the data warehouse we have query insights
and DMVs (dynamic management views) with the Spark engine we have the spark history server and we also have access to quite detailed monitoring in the monitoring Hub as well and when we're identifying performance issues in Delta files there's a number of places you can do that one of them that we're going to be looking at is the DESCRIBE DETAIL command then when it comes to resolving some of these issues well that's where it gets a bit more difficult to define right because it's normally going to involve some element of refactoring so using different operations in your data flow
for example or refactoring your SQL code or refactoring your spark jobs as well now in the data flow we have some specific performance optimization features that are worth going through so we'll be talking a little bit about staging and fast copy as well then when it comes to Delta file optimization we're going to go into a bit more detail about V-Order optimization file partitioning and also vacuum and optimize which are two Delta table functions that we can implement to improve performance with Delta files so that's a bit of an overview of what
we're going to be discussing in this lesson now let's dive into Fabric and we're going to begin by looking at some of the generic tools like the monitoring Hub and the capacity metrics app before diving into the data flow data warehouse and the spark notebook in more detail so let's begin okay so just before we jump into fabric for this tutorial I'm just going to start in the school Community here and talk through some of the notes that we have for optimizing performance so here is video 8 optimizing performance we've got this framing performance optimization
chart that we spoke about previously but to get us started I want to speak generally about performance monitoring there's a few tools that are quite General they apply across different workloads that are useful to know for the exam and also in fabric generally so we're going to be talking about the monitoring Hub and the capacity metrics app and these are two tools that can be used to monitor performance for a wide variety of operations in fabric so the monitoring Hub is the first one so let's start by looking at the monitoring Hub and if I
just flick over to Power BI here obviously to access the monitoring Hub you've probably seen it here it's in the left hand toolbar here you've got this big button for the monitoring Hub and within here we can basically have a look at the runs of a lot of different item types right so you can see semantic model refreshes notebooks so this is going to be a spark session and for each item type we have different logging that gets exposed in this monitoring Hub now the notebook we're going to have a look at in a bit more
detail because that's exposing spark log information you can also see data flow Gen 2 runs you can have a look at whether those runs have succeeded or not and table loading information in a lake house for example and we can obviously click through into specific items that we care about and we get more information right so this is showing a table load into the bronze lake house and it's showing you the different jobs because this is a lake house these are spark jobs right so it's also going to tell us in the monitoring Hub whether runs have
been successful or failed so this particular data flow run we can see that it actually failed right so on the 1st of May at 12:46 I tried to refresh this data flow and it actually failed so you can click on view detail and you can get a few more details about what happened here but you can't really diagnose what went wrong with that particular data flow to actually get the details of the data flow you have to actually go into the data flow itself which is this one here and then we can click on these three dots
here click on the refresh history so for data flows if you want to actually debug what went wrong (this is obviously a pretty poor data flow) you can click on the individual runs and get more detailed information about what's going wrong here so here we can see it's actually this activity here that's failed we can click on that and then get more information about why specifically that column or that data set can't be refreshed so the next general tool that we can use to monitor performance and resource consumption within our fabric capacities is the
capacity metrics app and I'll leave a link you can obviously get the install instructions here if you have never installed this before this is a Power BI app that you install within your fabric environment you give it your capacity ID and again the capacity settings is where you'll find that capacity ID and then it's going to bring you through to this capacity metrics app now the capacity metrics app is split into two sections we have compute and storage and this obviously lines up with how fabric is billed right you're
billed partly on storage so the amount of storage you have in fabric plus the resources that you consume during compute so if you've already installed the capacity metrics app you can find it in the Power BI experience go to apps and then you should see it there the Microsoft fabric capacity metrics app now as I mentioned there's two tabs to this report compute and storage on the compute page so starting at the top left we can see the capacity unit spend for particular item types in Fabric and we can also break it down by duration
different operations and by user as well we also get this time series of capacity usage over time as a percentage of the total capacity that you have available based on your SKU so currently I'm on this trial capacity so this is going to be an F64 and as you can see I'm barely using any of this capacity we've also got these other interesting graphs around throttling so if you're using more than 100% of your capacity usage on that particular capacity this is going to show you where you're throttling and it's also going
to show you rejections so if you've got workloads that are being rejected on your capacity because you're again over 100% and it's currently rejecting workloads then that's going to be exposed here we've also got a graph on overages so again if your capacity is throttled an overage is basically you repaying that capacity usage from your future spend right and for each of these obviously I haven't actually been throttled and there's no overage on my actual capacity but there is this explore button that you can drill through to specific events that you want to
explore in more detail if that is something that's happening on your capacity down below you've got a table of all of the different items in your fabric capacity and the capacity unit seconds that are being used by that specific resource so this is a synapse notebook and we can see that that is the most resource intensive it's used up the most of our capacity unit seconds and it's also got this tool tip where you can look at specific activities and runs of that notebook to dig into a bit more detail there on the storage tab
obviously this focuses on the amount of gigabytes of storage in this fabric capacity we can see how it's changing over time we can look at the specific storage by date and we can also look at the top 10 workspaces by billable storage once that's loaded that's what it brings you to okay so next I just wanted to talk about data flows and just to summarize what we looked at previously well if we want to monitor the performance of a data flow at a high level we can do that within the monitoring Hub
but if you want lower level data and to understand what's happening within a particular data flow you're going to want to look at the refresh history as I showed you previously for a particular data flow here you can inspect the error messages you get a breakdown of the different sub activities in that load for a data flow so that's going to be really important for you to diagnose what's going wrong in a particular data flow if it's not refreshing correctly that's where you're going to go to have
a look there now there's a couple of features that you need to be aware of in terms of optimizing performance specifically when we're talking about data flows right so the main one is staging now staging is probably best described using this diagram here so this diagram actually comes from this link here it's the spotlight blog on data flows and it's got some top tips for improving the performance of your data flows so I think to understand what's going on with staging this diagram gives you a pretty good idea so let's start by talking about
when staging is disabled so this bottom diagram here right so if staging is disabled all of your transformation in a data flow is going to be done by the data flow engine otherwise known as the mashup engine right and if you've got a really big data set or you're doing lots of transformation that might not be the most efficient way of doing it so a feature we have available to us to try and improve the speed of doing all these transformations within a data flow is staging so at the top here when
staging is enabled what it's going to do is it's going to read in the data from the data source and then it's going to immediately write that data into a Lakehouse staging table then it's going to use that Lakehouse to perform the transformation right so leveraging the Spark engine rather than the mashup engine to do your transformation then it's going to read the data back into the mashup engine and write it into the destination wherever that might be so a few things to bear in mind here if you've got a lot of data Transformations or
you've got very large amounts of data staging is probably going to be a lot more efficient now if you've got small data sets it's probably going to be less efficient right because you're going to have to write the data into a lakehouse transform it in the lakehouse and then write it then the mashup engine picks it up again and writes it to your output destination so there's lots of reading and writing here so on small data sets you probably want to disable staging or not enable staging it's only really when the
data set becomes large or you're doing lots of transformation on that data set that we want to enable staging that's what I've kind of summarized with this sentence here there's a bit of an overhead when you're performing staging and so it doesn't work in all cases only really when your data set is large or you're doing lots of transformations or both now another feature that they recently announced and therefore might not actually be in the exam yet but it's good to know that it exists is fast copy and the way that
I think fast copy works is that under the hood it uses the same technology as the data pipeline copy data activity rather than the data flow technology basically as I mentioned it's still a preview feature and it's relatively recent it might not actually be in the exam yet but you know if you're using data flows in the real world and you're struggling with performance it's worthwhile enabling fast copy just to give it a go see how it impacts the performance in your data flows okay so next up we're going to move on from data flows
and now we're going to focus on SQL so we have the SQL engine within the data warehouse and also the SQL endpoint of the Lakehouse as well and we're going to have a look at how we can diagnose and then optimize the performance of SQL scripts and as we mentioned before the capacity metrics app can give you a good high level overview of the resource consumption of specific operations that you're doing within your data warehouse but we have a lot more functionality within the data warehouse to actually explore and identify things like long
running queries and frequently used queries next we're going to talk about dynamic management views or DMVs and if you take a look in the data warehouse under the sys schema there's obviously a lot of different views in there that we can use for database management in general and there's three main ones really for understanding the live SQL query life cycle okay so things that are currently going on in your data warehouse that you need to be aware of and getting some insights about what's happening there and these are sys.dm_exec_connections sys.dm_exec_sessions and sys.dm_exec_requests and these
are related in this way here so we've got a bit of a data model here so whenever you start a query execution in the data warehouse it's going to start up a session the session is normally going to have a one-to-one relationship with a connection right so it's going to create a connection between your data warehouse and the underlying SQL engine so that's what a connection is going to show you and then you're normally going to have many requests for each of these connections right so using these three DMVs we can begin
to build a bit of a picture about who is querying your data warehouse and how and these DMVs are going to help you answer questions like who is the user running the current session when was the session started by the user what's the ID of the connection to the data warehouse that is running a particular request how many queries are actually currently active and which queries are long running so you begin to build a picture of how your data warehouse is being queried who's querying it and what they're doing.
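as a sketch of the kind of query those three DMVs support, here's an example you could run in the warehouse's SQL editor (held in a Python string so it's easy to reuse from a notebook; the column names follow the usual SQL Server DMV conventions, so verify them against your own warehouse):

```python
# Illustrative only: join the three exec DMVs to see who is running what.
# sys.dm_exec_* are real DMV names; exact columns may differ in Fabric.
active_requests_sql = """
SELECT s.session_id,
       s.login_name,          -- who is running the session
       s.login_time,          -- when the session started
       c.connection_id,       -- ID of the connection serving the request
       r.status,
       r.command,
       r.total_elapsed_time   -- large values flag long-running queries
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
  ON r.session_id = s.session_id
JOIN sys.dm_exec_connections AS c
  ON r.session_id = c.session_id
WHERE r.status = 'running'
ORDER BY r.total_elapsed_time DESC;
"""
```

counting the rows this returns answers "how many queries are currently active", and the ORDER BY surfaces the long-running ones first.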
you can also gauge the performance of those queries now the DMVs are quite a low-level view right there's lots of information here and we can merge these tables in different ways to get more and more information so alongside the DMVs Microsoft also exposes query insights so query insights is in a different schema so if we have a look here at this particular example of a data warehouse we've got the sys schema which contains our DMVs as well as other system generated views and we've also got queryinsights so in here we've got these
four views that give us basically more user-friendly abstractions over the DMVs right so it's going to expose things like frequently run queries and long running queries and you don't have to actually perform those joins of the underlying DMVs to get this information it just exposes them right here and so we're just going to focus on three of these query insights views so exec_requests_history is going to return information about each completed SQL request on that particular data warehouse frequently_run_queries is obviously going to give you information about the most frequently run queries and long
running queries (the long_running_queries view) is basically going to return information about queries by execution time so this view is going to be really useful to as the name suggests identify queries that are running for a long time and that might be causing performance issues in your data warehouse now one thing to note here is that if you're coming from a SQL Server background when we're talking about performance optimization a really important tool there is the query plan right so currently I don't think it's possible to expose the query plan for a particular SQL query but I
do think they are planning to support that in the future next up I want to talk about identifying performance issues with the Spark engine and you'll notice that we're back in the monitoring Hub and I just want to look at one of the item details here for this specific notebook so I've been running this notebook it's called Delta optimization and when we click through on this item we get a really detailed analysis of the different jobs that have been run in this notebook we can see that all of these have succeeded you can see
the duration of a particular job and the data that's been read and written for that particular job so this is a really good place to go if you want to understand what's actually happening under the hood when you click run in a spark notebook and the success or failure of each of the individual spark jobs now within the monitoring Hub we've also got this link through to the spark history server and as you can see from the UI here we've actually switched from a fabric tool to a generic spark tool right so here you're going
to get a lot more detailed information about the specific spark jobs that you're running you can look at graphs so if you do want a bit of a query plan look at the different stages in the execution of a particular spark job you can have a look at that here so if you're looking for really fine-grained control and analysis of what's going on in your Spark engine you're going to come to the spark history server now to actually interpret and understand what's going on here would be a whole series in itself I don't think you need
to know the nitty-gritty details of what's actually going on in the spark history server for the DP-600 exam just understand what's possible in the spark history server what does it log what can you monitor there I think that's good enough for the DP-600 exam so finally in this tutorial I just want to focus on Delta table optimization now as you probably know fabric is built on top of Delta tables so this is a really important topic to understand and the Delta file format is great but it can lead to poor performance
and bloated storage sizes if we're not managing those Delta files correctly now this is a topic that can run very deep like a lot of the topics mentioned today we're just going to go through some of the basics of what you need to know for the exam and to do that we're going to be going through a spark notebook so this is the spark notebook that we're going to talk through now I'm going to start by just exploring the problem in a little bit of detail and I think a really good way of understanding what
the problem or a problem that can arise with Delta files is this visual here and this visual comes from this blog post here by Salida and it's around efficient data partitioning with Microsoft fabric best practices and implementation guide and I've left a link to that in this notebook the notebook is obviously available in the community here as well and I've also left a link to it here efficient data partitioning here as well so if you have one big file that's 10 GB one parquet file then the Spark engine is going to struggle to process that
right because as you know spark is a distributed processing engine which means it works best when it splits your parquet files into smaller chunks and then processes these chunks in parallel but to make that possible we need to partition our data and partitioning is basically the process of converting one really big file or several really big files into more manageable chunks so that our data can be transformed in parallel by the Spark engine so in this notebook we're going to start by looking at file partitioning and then look at some other methods for
optimizing Delta tables in doing so this can help improve the read and retrieval performance so there's no need to scan through millions and millions of rows in your really big parquet file if you've got a good partitioning system in place it's going to speed up the read performance of your queries and it's also going to improve your transformation performance right as we mentioned before because partitions can be transformed in parallel so file partitioning I'm going to walk through a bit of a demo here and I've used a demo file and it's available
on this website here but it's just a parquet file what I've done is I've just put it in the files location this one here is called flights 1M parquet and before we get started just bear in mind the best practices on Delta Lake partitioning right and I've left a link here this is on the Delta Lake website where they give some general best practices about managing Delta files okay and one of them is around choosing the right partition column now most commonly it's done by date so if you've got time series data you've got some
dates in your data set and it's common to use your date for partitioning but there's two kind of rules of thumb to bear in mind when we're talking about partitioning and deciding what columns you want to partition on so in general we don't want to choose our partitioning column to be something of really high cardinality so say for example you have a column of user IDs well that's going to be really high cardinality right every row is basically going to be unique so if you've got a million rows that's going to be a really bad
partitioning strategy right because you're going to get a million different partitions and they're all going to be really small and as a general rule of thumb the amount of data that should be in each partition should be around one gigabyte that's kind of the good balance of what you should be aiming for with each partition so let's take a look at file partitioning and how to actually implement it in Fabric in a spark notebook so I'm going to begin by just reading in that parquet file our flights 1M parquet file I'm just reading
it into a data frame and I'm just displaying it here and you notice we've got this date column here that's going to be useful for our partitioning strategy we're going to be partitioning on date and it's got some other columns here that we don't really care about for this tutorial so before we do some partitioning and write this file into our Lakehouse using partitions we're going to do a bit of preparation and specifically we're going to add some columns to our data frame we're going to use df.withColumns to add more than one column
at a time and we're going to pass in this dictionary object here and each key of that dictionary is going to be the new column name and the value is going to be how we're actually computing that value so here I'm just importing some PySpark SQL functions to extract the year the month and the day of month of that date field right that we looked at previously so if we just run that and if we just display the results here okay so now we've got this transformed data frame object right
and so you can see here that it's added in this year month and day column and we can use these in our partitioning strategy now so you can see here from our code cell that we've actually written three different write modes here so the first one is going to write without partitions so this is just the normal saving of the table into a Lakehouse table from our parquet file we're going to put it in tables and we're going to call the table flights not partitioned the second one we're going to do is we're going to
call .partitionBy and we're going to partition by the year and the month and we're going to save this one into another table called flights partitioned and then as a third example we're going to write with some more partitions so we're going to be partitioning our data into smaller partitions here and again we're calling partitionBy but this time we're passing in the year the month and the day so these are going to be more fine-grained and I'm going to save that into a table called flights partitioned daily let's just run that okay
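since those write calls flash by quickly on screen, here's a sketch of them (table names as spoken in the demo), plus a tiny plain-Python model — not the Spark API — of how partitionBy maps distinct key tuples to partitions:

```python
from datetime import date, timedelta
from collections import defaultdict

# The PySpark calls from the demo look roughly like this (names are
# the spoken ones from the lesson, rendered as snake_case):
#   df.write.mode("overwrite").saveAsTable("flights_not_partitioned")
#   df.write.mode("overwrite").partitionBy("year", "month").saveAsTable("flights_partitioned")
#   df.write.mode("overwrite").partitionBy("year", "month", "day").saveAsTable("flights_partitioned_daily")

def partition_counts(rows, key):
    """Group rows by their partition key, as partitionBy does with folders."""
    parts = defaultdict(list)
    for r in rows:
        parts[key(r)].append(r)
    return parts

# 59 consecutive days of data: 1 Jan .. 28 Feb
rows = [date(2024, 1, 1) + timedelta(days=i) for i in range(59)]

no_partitions = partition_counts(rows, lambda d: ())            # one big partition
monthly = partition_counts(rows, lambda d: (d.year, d.month))   # partitionBy("year", "month")
daily = partition_counts(rows, lambda d: (d.year, d.month, d.day))

print(len(no_partitions), len(monthly), len(daily))  # 1 2 59
```

with roughly two months of daily data, year/month gives 2 partitions and year/month/day gives 59 — one folder per distinct key tuple.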
so now our cell has been executed and all of our spark jobs have concluded let's just have a look at this so yeah now you can see in our lake house tables we've got flights not partitioned flights partitioned and flights partitioned daily so we've got three different tables here all with different partitioning strategies so let's inspect those and see what's going on okay so I've got three different cells here and you'll notice we're using spark SQL here and we're calling DESCRIBE DETAIL on that particular table it's going to inspect this table
it's going to describe what's going on there and some of the results give us a bit of a picture as to how this table has been written into OneLake so if we inspect the results here and this is our flights partitioned we can see that the partition columns here are year and month which makes sense because this is the year and month one and we can see the number of files it's created so the number of parquet files we've got here is two next if we compare that to what we get
when we look at flights not partitioned we can see that we've got zero partition columns and we've got one file so everything's just been written into one file in that instance and for the daily one again if we inspect what's going on here we can see that we've got year month and day partition columns and here we've got 59 files so here it's been partitioned we've broken up that data into 59 smaller chunks of data now that might be too fine-grained because if we scroll back up to the top here one thing that
I did mention is that it's a bit of a balancing act this because if your files are too big the spark engine is going to have performance issues but there's also the small file problem if your file sizes are well under that gigabyte then that's also going to cause issues the engine is going to have to work harder you're going to have to go through the operation 59 times rather than two so again it's a bit of a balancing act trying to get a good partitioning strategy for your Delta tables but for the purposes
of the exam I think it's worthwhile understanding the syntax for creating partitions like so and then analyzing different partitions using this DESCRIBE DETAIL spark SQL method okay so next I just want to talk about V-Order optimization now V ordering is a Microsoft proprietary algorithm and it basically changes the structure of your parquet file and what I've put here is it's kind of providing a bit of special sauce so it does some special sorting compaction and compression of that parquet file right ultimately to improve the read performance of these parquet files across all of the different engines
in fabric now the actual algorithm is proprietary it's only used by Microsoft but the output the parquet file itself is fully open source it aligns to the traditional parquet standards so you can actually read V-Ordered parquet files wherever you can read normal parquet files that's not a problem at all now by default V ordering so this algorithm that's used to write parquet files is enabled by default in the fabric spark runtime and you can check that by running this spark.conf.get so we're looking at the configuration and you can see that here
it's actually returning true now it can actually be manually disabled if you want so we can set the spark configuration by passing in this specific property here spark.sql.parquet.vorder.enabled to false and then we can re-enable it by doing the opposite right setting it to true again so for the exam you might be asked how you know whether a particular notebook or a particular spark environment has V-Order enabled well that's this one and how you enable it or disable it in a spark notebook as well
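as a sketch (the property name is the one spoken in this lesson — verify it against the current Fabric documentation), a small pair of helpers makes the check and the toggle explicit; `spark` is assumed to be the active session in a Fabric notebook:

```python
# Assumption: property name as described in this lesson; confirm in Fabric docs.
VORDER_PROP = "spark.sql.parquet.vorder.enabled"

def vorder_enabled(spark):
    # spark.conf.get accepts a default; Fabric runtimes default this to true
    return str(spark.conf.get(VORDER_PROP, "true")).lower() == "true"

def set_vorder(spark, enabled):
    # Session-level toggle, equivalent to spark.conf.set(VORDER_PROP, "false"/"true")
    spark.conf.set(VORDER_PROP, "true" if enabled else "false")
```

so in a notebook you'd call `vorder_enabled(spark)` to answer the "is it enabled" exam question, and `set_vorder(spark, False)` to switch it off for the session.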
that could be a common question that you might get asked okay just finally I wanted to mention a few more Delta table maintenance and optimization techniques so there's a few that come from the actual Delta format itself so we have this function called optimize which is a Delta Lake method that performs bin compaction it can basically improve the speed of your read queries so if you've got multiple small files it's basically going to coalesce them coalescing basically means joining small files into larger files vacuum is another function that we can run and it basically involves
removing files that are no longer referenced by a Delta table and then on the spark side there's two that you might at least want to be familiar with the first is coalesce so as we mentioned when we're talking about optimize what optimize is basically doing under the hood is calling coalesce and coalesce is a spark method for basically reducing the number of partitions in your Delta table so if you've got 100 partitions for a particular file you can coalesce that Delta table into 10 partitions so coalesce is a pretty efficient way of grouping partitions
into a smaller number of partitions right and I say it's quite efficient because it doesn't require a shuffle of the data it's just grouping partitions together it doesn't actually reorganize your data within particular partitions now repartition is similar to coalesce but it's actually less efficient because it involves breaking up your existing partitions and then creating new partitions and because of this you can create either more or fewer partitions basically restructuring how your partitions are created and it does involve some shuffling because it breaks up your existing partitions and repartitions them.
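to make that distinction concrete, here's a plain-Python model (illustrative only, not Spark's implementation): coalesce merges whole existing partitions into fewer buckets without moving rows between groups, while repartition flattens everything and deals it back out, which is the shuffle:

```python
def coalesce(partitions, n):
    # Merge existing partitions into n buckets; each original partition
    # lands intact inside one bucket, so no row-level shuffle is needed.
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        buckets[i % n].extend(part)
    return buckets

def repartition(partitions, n):
    # Flatten every row and redistribute one by one: a full shuffle.
    rows = [row for part in partitions for row in part]
    buckets = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        buckets[i % n].append(row)
    return buckets

# 100 tiny partitions, like the example in the lesson
hundred_small = [[i] for i in range(100)]
print(len(coalesce(hundred_small, 10)))    # 10 buckets, rows stay grouped
print(len(repartition(hundred_small, 10))) # 10 buckets, rows redistributed
```

on the Delta side the maintenance commands mentioned above run as SQL in a notebook cell, e.g. `OPTIMIZE flights_partitioned` and `VACUUM flights_partitioned` (table name from the demo).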
thinking what's the difference between the spark functions and the the Delta optimize well the Delta optimize has a few kind of things working under the hood that makes it more efficient for number one is item potent so if you run it repeatedly it's not going to reoptimize files that have already been optimized whereas repartition is going to always repartition your files you can keep on running this again and again and it's never going to get more efficient it's always going to repartition the files so that's one difference between repartition and optimize and you can also
run OPTIMIZE on specific partitions in your dataset, whereas with repartition it's an all-or-nothing approach: you have to repartition your whole table in one go. So if you're looking for more fine-grained optimization, you're going to want to use the Delta OPTIMIZE. Okay, so let's round off the video here by testing some of your knowledge of the topics that we've covered in this lesson. Question one: a client you're working with wants to reduce the SKU of their Fabric capacity from an F16 to an F8 to save some
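The idempotence point can be sketched in miniature. This toy compaction (sizes in MB, threshold hypothetical, not the real bin-compaction algorithm) leaves already-large files alone, so a second run finds nothing left to merge; a repartition-style operation, by contrast, reshuffles everything every time.

```python
def compact(file_sizes_mb, target_mb=1024):
    """Toy compaction: merge small files into bins up to target_mb.
    Files already at or above the target are left untouched, which is
    what makes the operation idempotent."""
    big = [s for s in file_sizes_mb if s >= target_mb]
    small = sorted(s for s in file_sizes_mb if s < target_mb)
    merged, current = [], 0
    for s in small:
        if current + s > target_mb:
            merged.append(current)
            current = 0
        current += s
    if current:
        merged.append(current)
    return big + merged

files = [100] * 10 + [2048]   # ten small files plus one already-large file
once = compact(files)
twice = compact(once)
print(once == twice)  # True: re-running does no extra work
```

The real Delta OPTIMIZE can also be scoped with a `WHERE` clause on partition columns, which is the fine-grained, per-partition behaviour mentioned above.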
money. They want to find the most resource-intensive workloads and optimize them to use fewer capacity unit seconds. Where should they look to find this information? Is it A, the Monitoring Hub; B, the Capacity Metrics app; C, Query Insights; D, the Spark history server; or E, the OneLake Hub? Pause the video here, have a little think, and I'll reveal the answer to you shortly. Okay, so the answer here is B, the Capacity Metrics app. We're talking about resource-intensive workloads, and our capacity is where we're going to get those resources, and specifically it's going to tell
you which workloads are the most resource intensive, i.e. the workloads that are going to use more of your capacity unit seconds, right? So the answer is going to be your Capacity Metrics app. We can look at all of our Spark jobs, our data warehouse operations and our data flows, and we can come to conclusions about which of these are good candidates for refactoring or optimization. Now, all of the others might be useful for understanding the performance of specific workloads within Fabric, but the Capacity Metrics app is the only one here that converts that into
capacity unit seconds, right? And that's the important part of the question to understand. Question two: you notice one of your Dataflow Gen2 runs failed to refresh last night. Where would you go to find out why a particular run of a dataflow might have failed? Is it A, the Capacity Metrics app; B, the Monitoring Hub; C, the Power Query error hub; D, the dataflow refresh history; or E, the data pipeline run history? So the answer here is D: the dataflow refresh history is where you're going to go to analyze error messages and debug particular runs
of a dataflow. Now, the Power Query error hub doesn't actually exist, I made that up. As for the data pipeline run history, we're not talking about data pipelines here, so it's not going to be that. The Monitoring Hub will give you some information, so it will tell you whether a particular run has failed or succeeded, but it doesn't give you more detailed information about error messages and things like that, and the Capacity Metrics app is not going to tell you that either. So the answer here is D, dataflow refresh
history. Question three: when talking about Delta table optimization, which of the following operations removes old files no longer referenced by a Delta table log? Is it A, V-Order optimization; B, Z-ordering; C, vacuum; D, optimize; or E, bin compaction? So the correct answer here is C, vacuum. As we mentioned previously, the VACUUM command does exactly what it says there: it removes old files that are no longer referenced by a Delta table log. So the correct answer here is C. Question four: which of the following statements about V-Order optimization is false? A, V-Order
optimization is enabled by default in the Fabric Spark runtime; B, V-Order can be enabled during table creation using table properties; C, a table can be both V-Ordered and Z-ordered; D, V-Order improves the read performance of Parquet files; E, V-Order speeds up the write time of a Parquet file. So the correct answer here is E. The question asked which of the following statements is false, and E is actually false: V-Order does not speed up the write time, it actually increases the write time of a Parquet file. The
benefit of V-Ordering comes in the read performance, right? That's why we do it: it takes a bit longer to write these files, but it massively improves the read performance across any of the engines that you might want to use in Fabric. All of the other options are true. So A, it is enabled by default; B, if it's not already enabled in your Spark environment, it can be enabled for specific tables during table creation using table properties; C, you can actually apply both V-Ordering and Z-ordering to a particular table, or a Parquet file
within that table, at the same time; and D is also true, because we mentioned that V-Ordering does improve the read performance for Parquet files, which is why we do it. Question five: you want to analyze long-running queries in a Fabric data warehouse. What's the minimum workspace role you need to run the following query: SELECT * FROM queryinsights.long_running_queries? Is it A, admin; B, member; C, contributor; or D, viewer? So the answer here is C, contributor. To be able to run a Query Insights query, you know, those autogenerated views that
give us information about long-running queries in this example, you need to have a workspace role of contributor. Obviously the question asked for the minimum workspace role; you can also run these queries as an admin or member, but the minimum workspace role would be contributor. So if you have a viewer role, you can't run these queries in a data warehouse. Congratulations, you've completed the biggest section of the exam, we're well over halfway now. In the next lesson we're going to be starting the third section of the exam, which is all about building some semantic
models. Well done, and I'll see you in the next video.