Hey everyone, welcome back. This is the next video in our DP-600 exam preparation course, video 7, and we're going to be looking at transforming data. There's a lot to get through here, and it can feel a little overwhelming when you see everything we're going to cover, but don't worry: I'm going to break it down into a number of sections to hopefully make it a bit more digestible. First, data cleansing: how can we implement a data cleansing process? We'll look at resolving some common issues we get with data, so duplicates, missing data, null values, conversion of data types and filtering data, and we'll look at how we can do those things using the dataflow, T-SQL and Spark. Then we'll move on to data enrichment: under this category we'll look at merging and joining different datasets together, and also enriching the data we've already got, so adding new columns and new tables based on our existing data. Finally, we'll take a look at data modeling: the star schema and what it is, type 1 and type 2 slowly changing dimensions, the bridge table, what problem it solves and how we can implement the solution using T-SQL, data denormalization, and aggregating and disaggregating data. As ever, we'll be testing your knowledge at the end of this video, so we've got some sample questions, and all of the key points for this section of the exam, plus links to further learning resources if you want to dig into any of these topics in more detail, will be posted on the Skool community; I'll leave a link in the description below.

So first, let's take a look at the data cleansing process. To give a bit of structure to what we're looking at here, I'm going to be talking through the lens of the medallion architecture, whereby we have bronze, silver and gold areas in our data processing pipeline. When we talk about data cleansing, this normally takes place anywhere between bronze and silver, and maybe within silver as well: we're going to get some raw data landing in bronze, but it's going to be messy, so we want to be doing all of our data cleaning steps, normally, between bronze and silver. The way I'm going to do this is to walk you through common data cleansing routines and operations in all three of the tools listed here: starting with the dataflow, then T-SQL in the data warehouse, and finally a Spark notebook. So let's jump into Fabric and start with data cleansing in a dataflow.
Okay, so for this part of the lesson we're jumping into Fabric, and we're going to be transforming a particular dataset: a car dealership sales dataset. The dataset itself comes from the Tableau Server Guru blog, so thanks very much to that author, who shares some sample datasets on their website. Don't worry, we're not going to be talking about Tableau Server or anything like that, but they do have this nice car sales dataset, which I downloaded, and it comes in a pretty good star schema. When you download it, it actually comes as an xlsx file. The dataset itself is quite clean, so I have made some modifications to make it a bit dirtier, so that we can actually perform some data cleaning on it. I'll leave a link to the "dirtified" datasets, if you can call them that, in the Skool community, so head over there to get the source data files I'm using here. To get us started, I've put them into a Lakehouse Files area as a number of CSVs, and then simply loaded them into a number of tables; these will be our bronze Lakehouse tables for this part of the tutorial. So here we are in Dataflow Gen 2, and I'm going to show you how to do some data cleansing within the dataflow itself.
I've pulled in one of those tables from our Lakehouse area, the Revenue table, and this is what it looks like: basically completely untransformed. You'll notice I've introduced some null values, and a few duplicate rows as well, so we're going to step through the transform steps to clean this dataset within the Power Query experience in the dataflow. The first thing we might want to do is remove any duplicate values. I know there are duplicates in here because I introduced them, so to remove any duplicates in the Power Query engine, we highlight all of the columns (clicking on the left-hand column, holding Shift, then clicking on the right-hand column selects them all), then right-click on the header and remove duplicate rows. This is important because we want to check that the whole row is a duplicate, so we need to select all of the columns before removing the duplicates, and you can see the step has been added to the Applied Steps list. Next, if we take a look at our Revenue column, you can see that we do have some null values in this dataset. Obviously it depends on your data cleansing use case, but if this Revenue value is really important, and it's the main thing you care about in the dataset, then you might want to remove these nulls altogether from the data model; again, it depends very much on your use case. To remove the null values in this Revenue column, we click on the column filter itself, deselect the null value and press OK; that removes those null rows from the dataset. Another thing we can do is change a column's type. To change a type in the dataflow, simply click on Change Type. It might not make sense for this particular column, because it already looks like an integer value and it already has a whole-number type.
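The two dataflow steps so far, removing full-row duplicates and then dropping rows with a null Revenue, can be sketched in plain Python. This is not Power Query M, just an illustration of the same logic on some invented rows:

```python
# Sketch of the two cleansing steps: (1) remove rows that are duplicated
# across *all* columns, (2) drop rows where the revenue value is null.
# The rows and column names are invented for illustration.

rows = [
    {"dealer_id": 1, "date_id": 20240101, "revenue": 5000},
    {"dealer_id": 1, "date_id": 20240101, "revenue": 5000},  # full-row duplicate
    {"dealer_id": 2, "date_id": 20240102, "revenue": None},  # null revenue
    {"dealer_id": 3, "date_id": 20240103, "revenue": 7500},
]

def remove_duplicates(rows):
    """Keep only the first occurrence of each fully identical row."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # the whole row, like selecting every column
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def drop_null_revenue(rows):
    """Drop rows whose revenue is null, mirroring the column filter."""
    return [r for r in rows if r["revenue"] is not None]

cleaned = drop_null_revenue(remove_duplicates(rows))
print(len(cleaned))  # 2
```

Just like in Applied Steps, each operation works on the output of the previous one: deduplicate first, then filter.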
But say, for example, this was a text value that you know deep down is actually an integer: then you can just right-click on it and change the type to any of the types listed. Another thing we can do in Dataflow Gen 2 is add new columns. You can see the Add Column option, and under it we can add a custom column; maybe we want to do some sort of revenue bin, adding logic to group these revenues a little differently. I'm going to do a simple one here, revenue is greater than 1 million, and give it a True/False data type. If we click OK, that gives us this True value; just to check, some of them are False, so that's working correctly. Now maybe we want to rename it to something a bit more descriptive, like "Revenue over 1 million", which is probably a better description of what that new column is. Another way we can filter these datasets: maybe you don't actually want all of these units sold, and for this particular piece of analysis you want to remove rows where units sold is over one. Again, I'm just showing the functionality here; it really depends on your business use case which data cleansing steps make sense. So that's a bit of an overview of Power Query and how we can do data cleansing in Dataflow Gen 2. Now let's jump over to the data warehouse and look at how to implement a similar data cleansing process using T-SQL.
Okay, I've jumped over to the SQL analytics endpoint of that same bronze Lakehouse. We've got the same tables here, but now we're in the SQL endpoint experience, so we can write some T-SQL. To begin with, let's have a look at our dataset and make sure it's coming through okay; yep, it looks exactly the same as we had in the dataflow, so that's good. The first step we're going to look at is identifying duplicates in this table. There are many different ways to identify and remove duplicates from a table using T-SQL; the method I'm using here is GROUP BY. We group by something that I know is unique, or should be unique, for every row in this fact table, and we use HAVING with a count of more than one. For each of these groups there should be exactly one row, so if a group returns a count of more than one, we know those rows are in fact duplicates. Looking at the results, for these two dealer IDs and these two dates there are actually multiple rows; if we add the row count to the query to make that a bit clearer and rerun it, we can see the row counts are three and four. So how do we go about removing those duplicates? One way is to use GROUP BY again: we group by the same two columns, the dealer ID and the date ID, and bring back an aggregate function, here MAX of the revenue. It's worth checking before you do this that the revenue figures for each of those duplicate rows are actually the same, so that MAX versus MIN makes no difference; we're just bringing through whatever that value is. The result is the same dataset but with the duplicates removed, and you can save that into another table or use it in whatever downstream activities you like. Next, let's look at missing data: nulls, removing nulls, filtering, that kind of thing. There are some null values in this Revenue column, so first things first, we use WHERE Revenue IS NULL to identify them; you can see we've got six rows where the revenue is null. To remove those rows it's pretty simple: we use IS NOT NULL, and that returns everything where the revenue is not null. One way to quickly verify that this has actually worked is to bring the query into a CTE, called remove_nulls here, then do a SELECT COUNT(*) from remove_nulls and a SELECT COUNT(*) from the original table, and compare the two values (oh, and I have to actually call it). Result one is 1855 and result two is 1861, so the row count has changed by six, and those are the six null values we saw when we did WHERE Revenue IS NULL; it's worked correctly. Another thing we need to bear in mind for the exam is T-SQL type conversion. Here we're using the CAST function to cast the values in the Revenue column, which is currently an integer, as a FLOAT, basically a decimal number.
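The T-SQL patterns covered in this section (GROUP BY with HAVING to find duplicates, MAX to collapse them, IS NOT NULL to filter, CAST to convert types) can be tried out against an in-memory SQLite database. The syntax is close to T-SQL but not identical (SQLite uses REAL where T-SQL has FLOAT), and the table below is invented rather than being the exam dataset:

```python
# The same SQL patterns, sketched against in-memory SQLite so you can run
# them anywhere. Table and column names are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE revenue (dealer_id INT, date_id INT, revenue INT)")
con.executemany(
    "INSERT INTO revenue VALUES (?, ?, ?)",
    [(1, 20240101, 5000), (1, 20240101, 5000),   # duplicate pair
     (2, 20240102, None),                         # null revenue
     (3, 20240103, 7500)],
)

# 1) Identify duplicates: group on what should be unique, keep groups > 1.
dupes = con.execute("""
    SELECT dealer_id, date_id, COUNT(*) AS n
    FROM revenue
    GROUP BY dealer_id, date_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [(1, 20240101, 2)]

# 2) Remove duplicates: group by the key and take MAX(revenue)
#    (safe only when the duplicated rows carry identical values).
deduped = con.execute("""
    SELECT dealer_id, date_id, MAX(revenue) AS revenue
    FROM revenue
    GROUP BY dealer_id, date_id
""").fetchall()

# 3) Filter out nulls, and 4) cast the integer revenue to a float.
clean = con.execute("""
    SELECT dealer_id, CAST(revenue AS REAL) AS revenue_float
    FROM revenue
    WHERE revenue IS NOT NULL
""").fetchall()
print(clean)
```

In the real warehouse you'd run the equivalent T-SQL against the SQL analytics endpoint; the point here is just the shape of the queries.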
By doing that, you can see that when we bring both of these columns through, we've got the original Revenue column, which is an integer, and next to it the same values with the data type changed to a float using this CAST functionality. Another thing we've done here is add another column: when you're using T-SQL, you can just add another expression into your SQL script and compute whatever you want. As a simple example, I've divided the revenue by two and given it the name half_revenue; that's just an example of adding new columns in T-SQL. Okay, to round off this part of the tutorial, next we're going to look at transforming data using PySpark notebooks, and specifically at the data cleansing routines we've been looking at in the dataflow and the T-SQL engine, but implemented in a Spark notebook. I did record a 3.5-hour tutorial that goes into a lot more depth on different data cleansing operations; I'll leave a link to that video in the Skool community if you want to look at it in more detail, but here we're just going to do a quick run through some of the most common operations. I'll also leave this notebook in our Skool community, so if you want to play along at home, you can do that as well. We're using the same bronze Lakehouse, and we've got our tables here.
I'm going to have a look at this Revenue table, our fact table, in a bit more detail. I've begun by reading it into a Spark DataFrame and displaying the results, just to check that everything's loading okay. You'll notice that I haven't actually written back any of the changes we made in the dataflow or the T-SQL engine, so this is still reading the raw data from the bronze layer. Let's start by looking at duplicate data. To identify duplicates, we can use a similar pattern to the one we used in T-SQL: group by the key columns and look for a count of more than one. You can see we've identified the rows where we have duplicates in the dataset, and luckily it matches what we saw in the T-SQL engine: the same branch ID and date ID combinations have counts of four and three respectively. How do we go about dropping those duplicate values? In Spark we have the dropDuplicates method, so I'm going to start by counting the rows in the original dataset, perform dropDuplicates, save the result into a new deduped DataFrame, and then count the rows in that deduped DataFrame. I've printed out the result, and we can see the process has removed five rows from the dataset; that's just the difference between the original dataset and the deduped one. That makes sense: we had seven rows across those duplicate groups originally, and two of those we want to keep because they are actually valid data, so five of them were duplicates and five rows were removed. To verify, we can perform the same check as before, grouping by those two fields and looking for counts of more than one, and we get an empty DataFrame back, so that looks like it's worked correctly.
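That group-and-count duplicate check follows the same pattern regardless of engine. Here it is in plain Python with collections.Counter, since running actual PySpark needs a Spark session; the keys below are invented:

```python
# The groupBy-then-count pattern for spotting duplicate keys, sketched in
# plain Python. Each tuple stands in for a (branch_id, date_id) pair.
from collections import Counter

rows = [
    ("branch_1", 20240101),
    ("branch_1", 20240101),
    ("branch_1", 20240101),  # this key appears three times
    ("branch_2", 20240105),
]

counts = Counter(rows)  # group by the full key and count occurrences
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # {('branch_1', 20240101): 3}
```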
Next we're going to take a look at missing data and how we handle nulls in Spark. First off, identifying missing values: it's a good idea to inspect and interrogate your data a bit before you go ahead and drop things, to have a look at what you might be deleting. Let's run this and then talk it through. We've got our original DataFrame, and we're filtering on a specific column by passing in df.Revenue.isNull(); isNull is a method we get on a Spark column, and it returns the six rows that we know are null. Another way to do the same thing: I've changed two things in this second example. It's using df.where (df.filter and df.where are basically synonymous in Spark), and this time we've passed in col(), one of the Spark SQL functions, which is just another way of writing and obtaining that column, and again we call isNull() on it. We save that into a second variable and display it, and we can see the two results are exactly the same: both methods return the same null rows, which is good. So now let's look at dropping some of those null values using the dropna method. This method has a few parameters: how, thresh, and subset, which we're using in this example. If you pass "any" into the how parameter, it's going to drop the row if any column's value is null; if you pass "all", it drops the row only if all of the values are null. Thresh gives you a threshold for the number of non-null values a row needs in order to be kept, and if you pass it, it overrides the how parameter. The other one we're using is subset: passing subset limits the columns the drop looks at, and in our example we only want to drop a row if there's a null value in the Revenue column. You can pass subset as a list or as a string; in our example we've only got one column in that list. We save the result into a new DataFrame, and then we print out the before-and-after row counts, just to check that we have actually removed some rows.
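To make the interaction of how, thresh and subset concrete, here's a small plain-Python function that mimics dropna's rules on a list of dicts. The real method is pyspark.sql.DataFrame.dropna; this sketch just reproduces the semantics described above on invented rows:

```python
# A plain-Python mimic of dropna(how, thresh, subset) semantics.

def dropna(rows, how="any", thresh=None, subset=None):
    cols = subset if subset is not None else list(rows[0].keys())
    kept = []
    for row in rows:
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            keep = non_null >= thresh        # thresh overrides how
        elif how == "any":
            keep = non_null == len(cols)     # drop if ANY checked column is null
        else:  # how == "all"
            keep = non_null > 0              # drop only if ALL checked columns are null
        if keep:
            kept.append(row)
    return kept

rows = [
    {"revenue": None, "units": 3},
    {"revenue": None, "units": None},
    {"revenue": 900,  "units": 1},
]
print(len(dropna(rows, how="any")))           # 1
print(len(dropna(rows, how="all")))           # 2
print(len(dropna(rows, thresh=1)))            # 2
print(len(dropna(rows, subset=["revenue"])))  # 1
```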
We've removed those six null-revenue rows from the dataset, so that looks like it's worked correctly. Next, let's look at type conversion, and also at adding new columns to a Spark DataFrame. We can call df.printSchema(), and it gives us a look at the initial schema for this dataset, including the data type of each column. What we're going to do is get a column with df.units_sold, which is currently an integer, and for the purposes of this demo call the cast method on it with a data type of string. Then we print the schema again, just to check what it looks like, and we can see this units_sold_converted column (that's the new column name we supply to the withColumn function) now showing a data type of string, so we've successfully cast that value from an integer into a string. Next, a quick look at filtering. We've already seen some filtering previously in this lesson, but let's go over it again: for the filtered DataFrame, we take the original DataFrame and call df.where, passing in the Revenue column and filtering only for rows where revenue is more than 10 million. We can see this filter has removed 1,19 rows from the dataset. As I mentioned previously, you can use df.where or df.filter; if I change this to df.filter, it works exactly the same, and there you go. So now let's look at data enrichment, that is, adding new columns. We've already seen this one as well: we use df.withColumn (and you can also use withColumns if you want to add more than one at a time; we're just adding one here). We take the original DataFrame and call withColumn to add a new column.
We give it the column name half_revenue and an expression to compute for that new column; in our example we're just halving the revenue, df.Revenue divided by two, and then displaying the results in our enriched DataFrame. If we run that, we get our new half_revenue column, which is the revenue divided by two. Finally, let's look at joining and merging DataFrames in Spark. For this one we're using some different tables: I've pulled in the dealers DataFrame, from the dealers table here, and also the countries DataFrame. You'll notice these do actually have a joining key: they both have a country ID, so each dealer has a country ID, and what we want to do is pull through the country name. Maybe we've got a normalized data model and we want to denormalize it; we'll look at what that means in more detail later in this tutorial, but for now let's just join and merge these two datasets together. The basic syntax for a join in Spark is: take one DataFrame, here the dealers dataset, and call dealers_df.join; that's the first part of the join. Then we pass in the second DataFrame we want to join to, in our case countries_df, and then we specify which column to join on: dealers_df.country_id is equal to countries_df.country_id. On the next line we specify a select; we've wrapped everything in brackets because we're chaining more than one method together, and in that second line we select the few columns we want returned in the joined DataFrame: the dealer ID, the country ID and the country name, which is all we care about in the resulting DataFrame. Running this gives "dealers_df is not defined"; that's because we haven't defined it, so let's read that data into a Spark DataFrame first and then run the join again. Okay, now we've got dealer ID, country ID and country name in the same DataFrame. One thing you'll notice is that the country name has pulled through a few interesting characters; that's something you might want to change later in your data processing workflow. It might be that I saved the CSV file in an incorrect format, and maybe that's something we'd also want to clean in our data cleansing process, but for now we're just going to leave it like this.
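The join itself is easier to reason about if you strip away the DataFrame machinery. Here's the same dealers-to-countries lookup in plain Python, with invented table contents:

```python
# Build a lookup on the join key (country_id), then select only the
# dealer_id, country_id and country_name columns, mirroring the Spark join.

dealers = [
    {"dealer_id": 1, "country_id": "US"},
    {"dealer_id": 2, "country_id": "DE"},
]
countries = [
    {"country_id": "US", "country_name": "United States"},
    {"country_id": "DE", "country_name": "Germany"},
]

lookup = {c["country_id"]: c["country_name"] for c in countries}

joined = [
    {"dealer_id": d["dealer_id"],
     "country_id": d["country_id"],
     "country_name": lookup[d["country_id"]]}
    for d in dealers
]
print(joined[0])  # {'dealer_id': 1, 'country_id': 'US', 'country_name': 'United States'}
```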
So finally, we're going to take a look at data modeling. Typically this takes place within that gold layer, because these are the analytical models, the data models we're going to bring into our semantic layer. Let's start with the star schema. This is an example of a star schema: in the middle we've got a fact table, and this fact table represents sales, revenue for a particular company, which in this example is car sales, the number of cars sold from different dealerships. Around it we've got a dealership dimension table, a dim date, a dim model (the type of car sold) and the branch it was actually sold at. It's obviously called a star schema because we have our fact table in the middle and multiple dimensions all linking to that fact table via some sort of key. The reason we prefer star schemas is BI, Power BI basically, because the Power BI engine is most efficient with this star schema shape. A pretty common piece of data modeling you might have to do as an analytics engineer is to prepare this data model in the gold layer of a data warehouse, for example, so that your Power BI developers can just pick up the data model and write their measures on top of it. Normally, when we're modeling in this star schema style, the fact table is append-only: we're not really going to be updating many values in the fact table, it's just going to be new sales added onto the end of it. Normally your fact table is a very long list, and you create relationships from that fact table to the other dimensions, so if you want to know more about a particular sale in the fact table, you'll find that information in the dimensions. The dimension tables, though, can change over time. Take dim branch: in that dimension table you'd expect to see details about the different branches that exist for this car manufacturer, and some of those details can change over time.
Details about a particular branch might change: they might change address, they might change name, and certain details about each dimension member might change over time. One way of dealing with that in data modeling is to introduce the concept of slowly changing dimensions. With our fact table, as we mentioned, we're not really updating anything over time, we're just appending new rows; with our dimensions, we need to be able to handle changes to the dimension table over time. We call them slowly changing dimensions because these are not going to be big, frequent updates: they might happen once a month or every week, a much slower frequency than data being added to the fact table. In data modeling there are a lot of different ways we can model slowly changing dimensions. Here's an example, using a different dataset: an employee table, with employee ID on the left-hand side, then employee name and department. Now, it's not uncommon for an employee to change departments, and on the right-hand side you can see that this dimension has actually changed: Danny Walker, employee ID number two, has moved from the marketing department into the sales department. We can model this in a number of different ways. The first way, which you really need to be aware of for the exam, is the type 1 SCD, or slowly changing dimension. In a type 1 SCD, we're going to be overwriting with any new data: every time we query that source data again, we just overwrite whatever is in the table, and we don't store any sort of history. We implement it with the overwrite write mode, and you can do that in a data pipeline, a dataflow, or in PySpark. Now, if you're using a Lakehouse as your data store, it's important to bear in mind that the history can still actually be retrieved as at a point in time using the Delta log: in the Fabric Lakehouse you can access those Delta logs, so just because you overwrite the table doesn't necessarily mean that the previous data isn't stored.
The second way we can deal with this sort of change in a dimension is the type 2 slowly changing dimension. In this example we're going to need a few extra columns: you can see at the top that we've added valid_from and valid_to, and optionally an is_current flag as well, which is also quite useful for BI purposes. When we first write a record into this table, we populate valid_from, which is when this row becomes valid in the dataset, and valid_to is set to some date far in the future, normally the year 9999; we also set the is_current flag to 1, or true. Now, when we get an update to that dataset, a brand new set of data, in the type 2 SCD we add new rows based on any incoming changes. So in this type 2 example we're going to be adding a third row: it's Danny Walker again, but this time the department is Sales. We have to do a few things here with the valid_to and valid_from dates: when you write that third row into the dataset, you need to update row two to populate its valid_to column and set its is_current to 0, or false. The new third row will obviously have department Sales, a valid_from of the date you're making the change, a valid_to of the year 9999, and is_current of true.
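The type 2 steps just described (expire the current row, then append the new version) can be sketched in a few lines of Python; the dates, names and column names here are illustrative only:

```python
# Minimal sketch of a type 2 SCD update: close off the current row
# (set valid_to and is_current = 0) and append the new current row.
from datetime import date

FAR_FUTURE = date(9999, 12, 31)

dim = [
    {"employee_id": 2, "name": "Danny Walker", "department": "Marketing",
     "valid_from": date(2023, 1, 1), "valid_to": FAR_FUTURE, "is_current": 1},
]

def apply_scd2(dim, employee_id, name, new_department, change_date):
    """Expire the current row for this key, then insert the new version."""
    for row in dim:
        if row["employee_id"] == employee_id and row["is_current"] == 1:
            if row["department"] == new_department:
                return dim  # nothing changed, nothing to do
            row["valid_to"] = change_date   # close off the old version
            row["is_current"] = 0
            break
    dim.append({"employee_id": employee_id, "name": name,
                "department": new_department, "valid_from": change_date,
                "valid_to": FAR_FUTURE, "is_current": 1})
    return dim

apply_scd2(dim, 2, "Danny Walker", "Sales", date(2024, 6, 1))
current = [r for r in dim if r["is_current"] == 1]
print(len(dim), current[0]["department"])  # 2 Sales
```

The history row and the current row now coexist, which is exactly what lets you query the dimension's state at any point in time.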
By doing this, we're storing a history of how our dimension is evolving over time, and on the back of it you can create some quite sophisticated queries about the state of that dimension table at any given point in time. Now, when we're implementing the type 2 slowly changing dimension, there are a few things you need to bear in mind. If you're using the Lakehouse and the Spark engine, then remember that, as we mentioned previously, the Delta log actually stores a history of all the writes to a particular table, so that might be a simpler option; depending on how you want to use the history downstream in your analysis, it might be good enough for some use cases. If not, then you can use MERGE INTO, which is part of Spark SQL, to update the underlying Delta tables. If you're using the data warehouse and the T-SQL experience, then one method is to load your new data into a staging table and then use a stored procedure to perform the checks around the updates, the inserts, and the valid_to and valid_from calculations. Unfortunately, the MERGE operation is not currently supported in the warehouse's T-SQL surface area, so you have to implement this manually if you want to do it in the data warehouse at the moment. One implementation trick is row hashing: say, for example, you want to check which rows have changed from one update to the next; one method you can use to do this a bit more efficiently is to implement row hashing.
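Here's a minimal sketch of that row-hashing idea using Python's hashlib; the exact hash function and row layout are up to you, this just shows the compare-digests pattern on invented data:

```python
# Hash every row in the existing and incoming data, then compare digests:
# only keys whose hash changed need the SCD update logic at all.
import hashlib

def row_hash(row: dict) -> str:
    """Hash a whole row; any attribute change produces a different digest."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

existing = {1: {"name": "John Smith",   "department": "Finance"},
            2: {"name": "Danny Walker", "department": "Marketing"}}
incoming = {1: {"name": "John Smith",   "department": "Finance"},
            2: {"name": "Danny Walker", "department": "Sales"}}

changed = [key for key in incoming
           if row_hash(incoming[key]) != row_hash(existing[key])]
print(changed)  # [2]
```

Comparing one digest per row is much cheaper than comparing every column of every row, which is why this trick scales well on wide dimensions.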
You hash the value of an entire row, both in your existing dataset and in the new dataset you're checking, and then you compare the two hashes. Obviously, if the hash values match, you know that record hasn't changed, but if the two hash values are different, you can go ahead with the update logic you need to update that table. Next we're going to talk about bridge tables. Imagine a company has many different projects running, and employees are assigned to one or many projects. What does that look like if you want to do a bit of data modeling and maybe build a Power BI report off it? Well, in this scenario, many employees can work on many different projects at the same time, so there's a many-to-many relationship between our dim projects and dim employees tables, and that can cause quite a lot of issues when it comes to Power BI and efficiency in very large data models. One thing we can do to resolve this is to implement a bridge table, which is basically a table with one row for every project-participant combination. This is really useful because it turns our many-to-many relationship into two one-to-many relationships. Let's have a look at how you can implement that in a T-SQL data warehouse, just to show you what it looks like in real life. Okay, let's explore bridge tables in a bit more detail. Here we're in the data warehouse experience; I've built a gold data warehouse, and we've got two tables: a projects table and an employees table.
It's just a simple demo, to show you what a bridge table might look like and how you can implement it in SQL. Our projects table has the project ID and the project name, plus a project_participants column, and if we have a look at the data, that column holds a bit of a messy string of separated values: each project has multiple project participants, and some of them overlap. These are employee numbers that come from the employees table: employee 101, John Smith, is actually working on multiple projects. So in our data model we've got a many-to-many relationship, and we can't really implement it at the moment because of that string concatenation in the project_participants column. How do we get around this? Well, I've got the T-SQL here, and I'll leave it in the Skool community as well if you want to have a play around with it yourself: I've created the projects table, created the employees table, and inserted some values. One way to resolve that many-to-many relationship is to create a bridge table between the projects and employees tables, and in this example I'm implementing it as a SQL view. What I'm doing is using the CROSS APPLY operator in T-SQL along with STRING_SPLIT. STRING_SPLIT looks at the project_participants column within dbo.projects, that messy comma-separated column, and CROSS APPLY and STRING_SPLIT together separate out all those values based on the separator, the comma. What this does is create one row for every single project and project-participant pair. If we run it and compare against the original projects table: initially it had two rows, project one with two participants and project two with three participants, and the view has separated out all of those comma-separated values and made one row for each, so now we have five rows, one for every combination of project and employee. We can then create a view with this logic, which I've called dbo.view_bridge_project_participants, and that makes it available in our data model.
available in our data model now so let's just have a look at this so now we've got this view bridge table in our data model and what we can do is we can now connect the project ID to the project ID in this instance it's actually going to be one too many because this is a dimension table this is our Bridge table so it's going to be one to many and we do the same on this side here so this time it is going to be many to one so what we've done is we've transformed
a many to many relationship into two on to many relationships and this is going to make it a lot easier when you implement this stuff in powerbi and that's one way that you can resolve a many to many relationships using a bridge table now we've implemented this in a SQL view just because it's quite easy in reality if you wanted to use direct Lake mode obviously you can't use a view with direct late mode so if you want to use direct late mode then you might want to use a stored procedure to materialize out the
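If you want to see the shape of that transformation outside of T-SQL, here's a minimal sketch in plain Python of what the CROSS APPLY plus STRING_SPLIT view is doing — the sample rows here are made up for illustration, not taken from the demo:

```python
# Each project row carries a comma-separated list of participant IDs,
# mirroring the messy projects table in the demo (sample data is illustrative).
projects = [
    {"project_id": 1, "participants": "101,102"},
    {"project_id": 2, "participants": "101,103,104"},
]

# The bridge table is one row per (project, participant) combination --
# the same one-to-one mapping the CROSS APPLY / STRING_SPLIT view produces.
bridge = [
    {"project_id": p["project_id"], "employee_id": int(emp)}
    for p in projects
    for emp in p["participants"].split(",")
]

print(bridge)
# Two projects with 2 and 3 participants respectively -> 5 bridge rows
```

Relating each half of the bridge back to the projects and employees tables then gives you the two one-to-many relationships described above.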
Okay, so the next thing we want to talk about here is normalized and denormalized data. Let's go back to our data model where we were looking at car sales: we have a fact revenue table in the middle and some dimensions that give us more information about each sale — the branch it was at, the model of car that was sold, the date it was sold, and the dealership. What we've done here is add another dimension onto dim dealers, so there's now a relationship between dim dealers and dim cities that basically tells us the city of each dealership. This is what's called a snowflake schema. It can be a very efficient way of storing really large data models, because you'll notice that in the dim dealers table we're only storing the city ID — we're not bringing any other city data into dim dealers. That can be beneficial in some instances, as mentioned, for efficiency, but when we move this kind of model into the Power BI world it can bring some limitations as well. One way to get around this is with a denormalized data model, which is what this looks like here: it effectively turns our snowflake model back into a star schema, which we know can be a lot more efficient for very large data sets. Now, this does introduce some redundancy into the dim dealers dimension, because instead of storing just the city ID we're repeating the city ID, the country ID, and the region for every row in dim dealers. But by denormalizing the data model we get a number of other benefits: everything now lives in the dim dealers table, which makes it a lot easier to build things like filters on top of that dimension table, and we can also build hierarchical filters — looking at dealerships by city, country, and region — which isn't possible when that data is split across two dimension tables.
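To make that denormalization step concrete, here's a small plain-Python sketch — with made-up dealer and city rows — of folding the city attributes into the dealers dimension, trading a little redundancy for one flat star-schema table:

```python
# Snowflaked dimensions: dim_dealers only stores a city_id foreign key,
# and the city attributes live in a separate dim_cities lookup.
dim_cities = {
    1: {"city": "London", "country": "UK", "region": "EMEA"},
    2: {"city": "Austin", "country": "USA", "region": "AMER"},
}
dim_dealers = [
    {"dealer_id": 10, "name": "North Dealer", "city_id": 1},
    {"dealer_id": 11, "name": "South Dealer", "city_id": 2},
]

# Denormalize: copy the city attributes onto every dealer row,
# turning the snowflake back into a single flat dimension table.
dim_dealers_flat = [
    {**dealer, **dim_cities[dealer["city_id"]]} for dealer in dim_dealers
]
```

The redundancy is visible in `dim_dealers_flat` — country and region repeat on every row — but the city/country/region hierarchy can now be built from one table.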
Finally, we're going to take a look at data aggregation. The word aggregation can mean a few different things in data analysis and data modeling, and I'm not 100% sure what Microsoft expects for the data aggregation and deaggregation mentioned in the study guide, so I'm going to talk through both meanings so that you know them both — and if you have any insight into what Microsoft means in the study guide when they talk about aggregation, let us know in the comments. The first possible meaning of data aggregation is when we have different slices of the same data set spread across different files. Here we have one file with the UK data, one file with the USA data, and one file with the Canadian data, and in this context data aggregation basically means combining these data sets into one long table of transactions — in this case showing revenue across all of your different source data sets. We can implement this in a number of different ways in Fabric: in a dataflow we can use the append functionality, and in the T-SQL experience and in Spark SQL we can use UNION and UNION ALL — union is also a method in PySpark. The difference between UNION and UNION ALL is that UNION removes any duplicate rows from the resulting data set, whereas UNION ALL just appends all of the data sets on top of each other without checking for duplicates.
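Here's that UNION versus UNION ALL distinction sketched in plain Python — the transaction tuples are invented sample data, with one row deliberately duplicated across the two extracts:

```python
# Two slices of the same transactions data set; ("T2", 500) appears in both.
uk = [("T1", 100), ("T2", 500)]
usa = [("T3", 200), ("T2", 500)]

# UNION ALL: append everything, duplicates included.
union_all = uk + usa                     # 4 rows

# UNION: append, then remove duplicate rows
# (dict.fromkeys de-duplicates while keeping first-seen order).
union = list(dict.fromkeys(uk + usa))    # 3 rows
```

Same idea in the warehouse: `SELECT ... UNION ALL SELECT ...` keeps all four rows, while `UNION` would collapse the duplicate down to three.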
Also mentioned in the study guide is data deaggregation, and again the word deaggregation can have different meanings. Assuming we mean the first meaning of aggregation, deaggregation is just the opposite: splitting one large file into multiple different categories of data. Another possible definition of data aggregation is when we transform a data set from a more granular form into a less granular one. On the left-hand side, for example, we have all of our transactions, with the revenue for each transaction and the country it was made in. If we aggregate that data by country, we get some sort of aggregated metric for each country — here we're showing the total revenue for each country, so the UK has 600 (that's 100 plus 500), the US has 200, and Canada has 800. This is typically implemented using the GROUP BY functionality, and it can be done in a dataflow, in T-SQL, or in Spark. When we build this kind of aggregate transformation we also need an aggregation function — normally one of count, sum, max, min, or average — which defines how all of the values within each group are combined.
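That grouped-sum example can be sketched in plain Python like this — the country and revenue figures mirror the numbers discussed above:

```python
from collections import defaultdict

# Granular data: one row per transaction (country, revenue).
transactions = [
    ("UK", 100), ("USA", 200), ("UK", 500), ("Canada", 800),
]

# GROUP BY country with a SUM aggregation function:
# every row's revenue is folded into its country's running total.
revenue_by_country = defaultdict(int)
for country, revenue in transactions:
    revenue_by_country[country] += revenue

print(dict(revenue_by_country))
# {'UK': 600, 'USA': 200, 'Canada': 800}
```

Swapping `+=` for `max(...)` or dividing by a count would give you the max or average variants of the same transformation.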
In our example in the top right-hand corner we used a sum — we just summed up all of the revenue numbers for each country — but you could take the average revenue or the max revenue; it depends on your use case. Okay, let's round up everything we've gone through in this lesson and test some of your knowledge. Question one: when using df.join in a PySpark notebook, the default join type is (a) a full outer join, (b) an inner join, (c) a left join, (d) a right join, or (e) an anti join? Pause the video here, take a moment to think about the answer, and I'll reveal it shortly. The default join type in PySpark — in any Spark notebook — is the inner join. Not much to say about that one; it's just something you need to know, and all the other options are incorrect. Question two: you're looking to migrate a data transformation workload that's currently done using a Dataflow Gen2 and convert it into a T-SQL script. The dataflow appends two data sets together and removes any duplicate rows. Which T-SQL command can you use to implement this transformation: (a) a left join, (b) UNION, (c) CONCAT, (d) APPEND, or (e) UNION ALL? The answer here is UNION. We mentioned when we were talking about unions that UNION removes duplicates, and the important sentence to pick out in this question was that the source dataflow — the thing we're trying to convert into a T-SQL script — currently appends two data sets together and removes any duplicate rows. The T-SQL equivalent of that is UNION. It's not going to be UNION ALL,
because that won't remove the duplicate rows, and it's not going to be the left join, because that won't do what we want either. APPEND is not a T-SQL function, and CONCAT is how you could achieve this in pandas, but not in T-SQL. So the answer here is UNION. Question three: you have a Spark DataFrame called df, and your goal is to remove rows that contain a null value in the transaction date column. Which of the following will help you achieve this: (a) df.dropDuplicates(), (b) df.dropna(how="all"), (c) df.filter(df["transaction date"].isNull()), (d) df.dropna(how="any"), or (e) df.dropna(subset=["transaction date"])? The correct answer here is (e): we want to use dropna and pass in subset equal to the transaction date column. The question asks us to remove rows that contain null values in a specific column, and that specific column is the important part of the question. We want dropna because it removes rows, and passing subset equal to the transaction date column tells it to only look at that column. Maybe that's a really important column in our data set and we want to be absolutely sure there are no null values in it, because any nulls might make our analysis completely redundant or ruin all of our downstream analysis, so we want to remove those rows entirely. It's not going to be (b) or (d), because in those options we're not actually specifying that subset: if you were to implement (d), it would drop the rows where there's a null value in the transaction date column, but it would also remove rows with null values in any other column, and that might be too much based on your requirements — probably not what you want to be doing. The filter option is just going to filter the data set for null values — it returns all of the rows where transaction date is null — so that's the opposite of what we want. And dropDuplicates isn't what we're trying to achieve here, so that's not the answer.
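Here's that subset behaviour sketched in plain Python over a made-up list of rows (using an illustrative `transaction_date` key): only a null in the subset column removes a row, while nulls elsewhere survive —

```python
# Made-up rows: row 2 has a null transaction_date, row 3 a null revenue.
rows = [
    {"transaction_id": 1, "transaction_date": "2024-01-05", "revenue": 100},
    {"transaction_id": 2, "transaction_date": None,         "revenue": 250},
    {"transaction_id": 3, "transaction_date": "2024-01-07", "revenue": None},
]

# The equivalent of dropna with subset=["transaction_date"]:
# drop a row only when transaction_date itself is null --
# nulls in other columns (like revenue on row 3) are kept.
cleaned = [r for r in rows if r["transaction_date"] is not None]
```

With a `how="any"` style drop and no subset, row 3 would be removed as well, which is exactly why option (d) is too aggressive here.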
Question four: a classical star schema data model consists of which of the following? Is it (a) one central fact table and multiple dimension tables, (b) one dimension table and one or more fact tables, (c) one fact table and multiple dimension tables, some with dimension-to-dimension relationships, or (d) one big fact table, fully denormalized, without any dimension tables? The answer here is one central fact table with multiple dimension tables. (b) is obviously wrong because it has one dimension table and many fact tables — that's not a star schema. (c) is a snowflake: one fact table with multiple dimension tables, some with dimension-to-dimension relationships, so that's not what we're looking for in a star schema. And one big fact table, fully denormalized without any dimension tables, isn't really a classical star schema either. So the answer is (a). Question five: you inherit a data project and you're inspecting the tables in the data warehouse. One of the tables is a dimension table, dim contacts, with the following columns: contact ID, contact name, contact address, effective date, and effective until. Make an assumption about the type of data modeling being implemented in this dimension table — is it (a) a type 0 slowly changing dimension, (b) a type 1 SCD, (c) a type 2 SCD, or (d) a type 3 SCD? The answer is a type 2 SCD. We can see from inspecting the columns that there are two date fields — or we can at least assume they're dates. Effective date is probably when that contact record was entered into the system, and effective until is basically the same as a 'valid to' column. We can't actually see the data here, but we can assume that's what those two columns are doing. A type 0 SCD is a fixed dimension — something that never changes. With a type 1 you're obviously not going to be tracking those effective date and effective until dates; the data just gets overwritten. And a type 3 is another slowly changing dimension type that isn't actually asked about in the exam, but it basically stores the previous value for each of your columns — or at least the columns you're interested in tracking. So, for example, alongside contact name you might also have a previous contact name column, and when the data gets updated you update both the previous contact name and the new contact name. That's a type 3 SCD — I don't think it's in the exam, but just something to bear in mind. So the answer here is (c), a type 2 slowly changing dimension. Congratulations — you've now completed the third part of section two, preparing and serving data. In the next lesson we're going to be looking at performance monitoring and optimization of all of our data processing workloads in Fabric, so make sure you click here to join us in the next lesson. I'll see you there.