hello Spark fans and Fabricators and pretty much anyone trying to build a lake right now. This is Advancing Spark, brought to you by Advancing Analytics, your friendly neighbourhood engineering, AI and analytics consultancy. We've been building lakes for a while, and suddenly we're at a state where everyone wants to talk about the Medallion architecture. Fabric's come along and said: hey look, everything's a lake now, we're building lakes, and one of the well-known patterns for building lakes is the Medallion architecture, so you should use that. So there's a massive influx of people who didn't really get involved in lakes before, now trying to build them, and now discussing what the Medallion architecture means. Now, that is great. I am so happy that so many more people are entering the lake-based ecosystem. Come on in, the water's lovely. We make lots of water-based puns here in the lake world. But actually, how you structure a lake, and how you work with data moving through zones of curation until you get something you're ready to show the business, is something we've been doing for a long time. It's actually very well understood. However, the Medallion architecture is a very simple way of explaining it, which doesn't quite fit every scenario and doesn't fit every business. So of course there are going to be lots of arguments going: no, that's not what we do at the bronze layer, we do that in the silver layer; no, no, silver means this. People are arguing about what we do at the different stages of this curation, when actually it's fairly well known, just not if you take it as bronze, silver, gold and that's all we ever do. So all I want to do today is take that concept, talk about how it actually works, talk about how we've been building lakes for the best part of a decade, and then you can take that and do whatever you want. Because realistically, the main thing is you do not have to follow any prescribed way of building a lake. You don't have to call the layers of your lake anything in particular; call them whatever makes sense for your business. The point is that people using the data understand what the data in that particular zone is, how it can be read, how it should be processed, and how far it should be trusted. If you get that communicated well, you can call the zones of your lake 1, 2, 3, 4, 5, I do not care, but make sure you understand it, you have an operating model, and your business and everyone using your lakes understand what each zone means. That's the most important part. If you decide to call it bronze, silver, gold, great, go for it, just make sure you have decided what you're doing. But I want to give you the context of how we do things in all the different layers, so you at least make an informed decision.
That is the point of today. If you're new around here, don't forget to like and subscribe, and let us know what you've called your lake zones, because I've heard some really interesting ones and some really fun ones, and that's great, make it fun. Okay, so the traditional Medallion architecture: we're talking bronze, silver, gold. We're getting data from a load of different sources in the business. It could be scraped from an API, it could be sent to us from a third-party vendor, it could be the business uploading their own data, it could be a nice regular ETL job pulling data from a SQL Server. We get data from loads of different places and it enters our lake-based environment. The bronze data is untrusted, it's uncleaned, and you have to work hard to actually draw insights out of it. The data might be wrong, it might have test data and dummy data, it might have missing data and be incorrect. Bronze comes with a pinch of salt: ooh, can you trust it? Silver is then validated and cleansed. We've tidied it up, we've got rid of any bad data, we've performed some data cleansing exercises. You can go and use it and query it, but we've not made it nice to understand; it's probably still in its original schema. So if it came from some kind of application, an OLTP kind of server, it's going to be very normalised, and you're going to have to figure out how all this stuff joins together. We've not applied any business calculations to help you work out what the profit KPI means or anything like that. It's good, clean, valid, trustworthy data that has not been made ready for the business to understand. Gold is your analytical tables; it is aggregated data. Now, that doesn't just mean it's been summarised; it could mean data from various different things has been pulled together and made ready for the business to consume. It's been modelled in such a way that the business can understand it; it's been turned from "here's a load of system-named tables" into "here's a load of things named in such a way that the business user will understand what they mean". We've performed some data modelling, we've made it business-facing. Now that makes
sense. People were like: yeah, okay, easy, fine, great, we'll do that. But actually, each of those layers probably breaks down into several additional layers, and that's where all the confusion is coming from. So where did this come from? Well, Databricks, when they put out Delta Lake, were trying to explain how this stuff works, and they were going very much for the data science, machine learning style of doing things, and they were pushing this idea of bronze, silver, gold. Now, personally, I hate that naming. We work with a ton of data governance people, and bronze, silver, gold is well understood in data governance as describing how much you should trust the provenance of that data. And whilst we're talking about a very similar concept here, we're talking about how well curated the data is, not how trustworthy it is. If the business made their own data set, yeah, we can clean that, we can present it, but is it trustworthy, given it actually came from a user and not from a formally managed source? The trustworthiness of the data is different to how curated that data is. So I've already got a definition in my head for bronze, silver, gold, which is why I've struggled with the Medallion architecture. But if it makes sense to you, if you're not already using those terms, cool, and we're following the same pattern: raw, integration-style data coming in; filtered, cleaned, augmented data in our silver layer; and then we're putting out business-facing, aggregated data (again, aggregated does not just mean summarised). But it still doesn't quite work for me, and it is not the be-all and end-all. Actually, if you dig into some of the documentation, if you go and look at the recommendations for how to do something like Data Vault in a Medallion architecture, well, the silver layer just has several different layers to it. There are sub-layers inside these things, and that's what gets missed. So, ignoring Databricks, ignoring Microsoft recommendations, ignoring everybody else, why don't I just tell you how Advancing Analytics builds lakes, how we think about the different stages of processing we go through, and then you can call it whatever you want to call it, give
it your own names, I don't care, but this is what we normally see. So firstly, we have some raw files. Even if we're going off and getting the data ourselves, we still just land it as flat files. And this is an argument people have had for decades around ETL: if I'm pulling from a structured source, why would I put it somewhere unstructured? Madness, I say, you're going to lose accuracy. Well, actually, we disagree, because these days we store things as either JSON or Parquet, depending on whether we want things really nested, with good structs, nested arrays, all types of complex data types in there, which is quite nice. A lot of the time we receive JSON because we're hitting a REST API, so it just comes out as JSON, and we keep the data in the format we were given. If we're extracting the data ourselves, we write it as Parquet, and then it's structured and it's typed, it's still accurate, and it's in Spark-ready types, so it just makes sense to do that. So: raw files, as we received them. The main reason we do that is because we want a little checkpoint that divorces us, abstracts us, from the source system. Because the age-old reason people run business intelligence ETL loads overnight is that they aren't allowed to go back and query the application server during the day; they don't want to put unnecessary load on it. Now, in the world of the cloud we have separation of compute and storage for lots of things, but that doesn't tend to be the case for lots of the systems we deal with.
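That landing step can be as simple as writing whatever we were given, byte for byte, into a dated folder. Here's a minimal plain-Python sketch; the path convention, function name and source/entity names are just illustrative (in practice this is usually a copy activity or a Spark job writing to blob storage):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def land_payload(payload: bytes, source: str, entity: str, suffix: str, root: Path) -> Path:
    """Write a payload into the landing zone exactly as received.

    Hypothetical path convention: <root>/<source>/<entity>/<yyyy>/<mm>/<dd>/<file>.
    No parsing, no cleansing -- the whole point is a byte-for-byte checkpoint
    that divorces us from the source system.
    """
    now = datetime.now(timezone.utc)
    target_dir = root / source / entity / f"{now:%Y/%m/%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{entity}_{now:%Y%m%dT%H%M%S%f}.{suffix}"
    target.write_bytes(payload)
    return target

# Example: land a JSON body from a (made-up) CRM API pull, exactly as received.
body = json.dumps({"customer_id": 1, "email": "a@example.com"}).encode()
landed = land_payload(body, "crm", "customer", "json", Path(tempfile.mkdtemp()))
```

Nothing clever happens here on purpose: once the bytes are in the landing area, every later fix-up loop starts from our own copy rather than from the source system.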
But it still happens occasionally, and there are still plenty of network IO problems with going back to the source system. So, as a default standard, to protect ourselves, so we can do whatever we want, fix our own problems and be masters of our own destiny, we just land everything we load into the lake in a landing area as raw files, and then it's ours to do with as we please. We then pick those files up and check them against our expected schema: I am expecting this file to look like this; oh, that's the wrong file type, or it's the wrong shape, it's missing a load of data, it's got a load of extra data. We have schema validation that we are doing, and that might mean we chuck whole files away, it might mean we break the load and throw errors, it might mean we reject a couple of rows because they just did not meet our structural validation checks. And that's again when we want to be able to go back to our raw files and check, and we can have that little loop of fixing data at that point. Now, that is different to the silver-layer job of cleansing data: applying business logic to clean data, fixing dates, removing special characters from strings, all that data quality stuff. That's not what we're talking about here; we're asking, can this actually fit in the structure I need it to? So we have two layers, both of which could be considered bronze, and people argue the definition of bronze: no, no, it's files in the original format; no, no, it's files that have been structured into a Delta table but might still have some problems in there. And the answer, for us, is it's both. We do both of those things, as two separate layers in our lake, but you could still call both of them bronze, and that's why these arguments exist. Now, that schema-validated layer is append-only. We're just putting data in there; we're not checking whether a record already exists. It is a pure record of the structurally valid rows that have come in, but they are still raw: they might be corrupt, they might be test records, they might be utter garbage, but they fit the structure we're trying to get them into. So those are the first two layers. We then pick the data up and apply data cleansing, we apply our cleaning rules. Now, I don't want to be fixing dates, applying primary keys and doing all that change management stuff at the same time as asking whether the file even makes sense, whether it's even a valid JSON or Parquet file. They are two separate actions and two different recovery points that I need to
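To make that structural check concrete, here's a hedged plain-Python sketch (the schema and feed are made up; in a real lake this would run in Spark against the expected Delta table schema). Shape only, no data-quality rules:

```python
# Hypothetical expected schema for a customer feed: column name -> Python type.
EXPECTED_SCHEMA = {"customer_id": int, "email": str, "created": str}

def validate_rows(rows):
    """Split incoming rows into structurally valid and rejected.

    This is shape-checking only: right columns, right types. Values that are
    garbage but *fit* the schema still pass through -- data quality rules
    belong to a later, separate layer with its own recovery point.
    """
    valid, rejected = [], []
    for row in rows:
        fits = set(row) == set(EXPECTED_SCHEMA) and all(
            isinstance(row[col], typ) for col, typ in EXPECTED_SCHEMA.items()
        )
        (valid if fits else rejected).append(row)
    return valid, rejected

# The raw layer is append-only: we never merge here, just accumulate valid rows.
raw_log = []
valid, rejected = validate_rows([
    {"customer_id": 1, "email": "a@example.com", "created": "2024-01-01"},
    {"customer_id": "oops", "email": "b@example.com", "created": "2024-01-01"},  # wrong type
    {"customer_id": 2, "email": "test@test"},  # wrong shape: missing a column
])
raw_log.extend(valid)
```

Note that the rejects are kept too, not silently dropped: they are what you go back to the landing files to investigate.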
work with, so we separate those out as layers. So, my next layer: yes, classically that is silver, but what it's doing is applying cleansing, applying validation, and probably doing a merge, so you've got a single record for each unique row we're talking about. You might be tracking history in here, you might make every table slowly changing, you might be doing some kind of Data Vault-y thing (not my jam, but people do it). That is one layer of our cleansed and validated data. However, we work
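That merge step, getting down to a single record per business key, can be sketched like this in plain Python (latest-record-wins; the key and column names are made up). Real implementations would typically use Delta Lake's MERGE INTO, and a slowly-changing-dimension pattern would append version rows rather than overwrite:

```python
def merge_latest(existing, incoming, key="customer_id", ts="loaded_at"):
    """One record per business key, latest load wins.

    A plain-Python stand-in for an upsert (Delta Lake MERGE INTO). If you
    were tracking history (SCD type 2), you'd append a new version row with
    validity dates instead of overwriting the old one.
    """
    merged = {row[key]: row for row in existing}
    for row in incoming:
        current = merged.get(row[key])
        if current is None or row[ts] >= current[ts]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

existing = [{"customer_id": 1, "email": "old@example.com", "loaded_at": "2024-01-01"}]
incoming = [
    {"customer_id": 1, "email": "new@example.com", "loaded_at": "2024-02-01"},
    {"customer_id": 2, "email": "b@example.com", "loaded_at": "2024-02-01"},
]
merged = merge_latest(existing, incoming)
```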
with loads of companies who have four or five different business units, and they've all got their own ERP system or CRM system, and they need to conform that clean, valid data into a single set: right, this is what we mean when we say product; these are all of our customers; these are all of our things. And that's not analytical data yet. They've not remodelled it to fit into a star schema or One Big Table so it can go into some kind of report. It's still silver-layer data. So we have this conformed, nice homogeneous schema that is still silver data. Again, we've got multiple layers in silver that we tend to think about, and again people argue over which one of those is actually the silver layer, and the answer is: both. And then, finally, we do some data modelling. I tend to do things mostly in the Kimball style, so we're making facts, we're making dimensions. We might have several different layers of aggregation here: I might have my big transaction-level fact table, then an aggregated summary fact table, then some kind of accumulating snapshot. That's all still just analytical models, in that same part of the gold layer. Now, when we get to things like Power BI, and we have BI tools that do contextual cutting and slicing, with calculations that are filter-context dependent, that's technically a different layer again. So we have a reporting layer, which has calculations in it, sitting on top of the star-schema, Kimball-style layer, which has its own calculations in it. So again, we think about those as two separate layers. You've got a layer where anyone can go and query that star schema, and it's materialised down. Yes, you could do it as a logical layer, but if you're doing things like surrogate keys, or you're managing slowly changing dimensions, it's a hell of a lot easier to materialise that down in the lake, and then there's the performance benefit of people reusing it, and using it a lot. So that analytical model, for us, is a materialised star schema and a bunch of analytical tables we might have on top of it, and then a semantic layer on top of that, doing your cut-and-slice-style, on-the-fly calculations. So yes, you could say we're using the Medallion architecture; we just happen to have a couple of bronzes, a couple of silvers and a couple of golds. Now, I don't see the need to squeeze what we're doing into that definition. If it's going to help your users understand context, great. If your business has already bought into the Medallion architecture, and they want to come and talk to you and go "right, is this from the gold layer?", "no, it's from the silver layer", and they want to use that paradigm, those semantics, fine, great, label your layers that way. But don't feel like you're only allowed one step in each of those three layers, because it doesn't always make sense that way. Companies are different. Loads of companies we work with don't have that conformed, clean layer, because they don't have multiple source systems talking about the same thing; they don't need to conform data across multiple sources. That's cool, just don't have that layer. It is not a fixed, prescribed way of doing things; it's a data model that should fit your
business. So, what do we call it? We have different names for these things, because we were building this long before people started talking about the Medallion architecture. That first initial file dump we just call the landing zone: we have our landing blob, maybe multiple landing blobs aligned to different source systems, however you want to build it, a landing area of the raw files as we received them. We then have the raw layer, which is our schema-validated, append-only log of all the records we received, not yet quality checked. Now, the one people tend to have a problem with: Base. You could call it standardised, call it cleansed, I don't care what you call it; we call it Base, and it's our clean, validated raw. You can treat it almost like an operational data store: basically, if everything's going really, really well, our Base layer shows the same data that's in the source system, with some auditing and logging tacked onto it, cleaned of any garbage data it had in there. That's how we can check that things are good and happy. Enriched is what we call our conformed layer. We have a bunch of clients doing different things with it: maybe they just use it as a glorified staging area, maybe they're doing some pre-calculations before turning it into an analytical model, maybe they're going for a full canonical model, this one thing they're trying to get all the different source systems into. Again, it's a flexible layer that we use as makes sense for how each client we're working with needs to use that data. Curated is what we call our actual data model, and again, there are multiple different layers inside there. And the final layer we tend to refer to as a semantic model; if you're being new and trendy you can call it your metrics layer, because technically your curated layer is already using the semantics of the business, so it's the metrics layer that's doing all the calculations. I don't care. The important thing is that the users, the data personas across your entire data estate, get context. When you say to them, hey, here's some data, just look at this table, it's base.customer, they go: Base, okay, cool, so I understand you've not applied any business calculations for me, but I can trust that the data has been cleansed, great. They've got context from the layer of the lake. That's all we're trying to do. If you give someone something from curated.fact_sales, they go: great, curated, that means someone's applied calculations and business logic; to me that means I can go and look up the definitions of these different fields, because it's been well modelled and properly documented, because everyone documents their data. If you say to someone, yeah, sure, just go look in this table in raw, they go: oh, okay, so one, I need to know how to query that data, I probably need to dedupe it, most likely there's going to be some garbage data in there, maybe some of it is partly malformed, and there are fields that don't make any sense, because it's raw data that's not been cleaned yet. That's all it is. It's about giving context, it's about having a process, it's about building an operating model for how data gets from its original raw state through to being business-ready
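Written down explicitly, that operating model might look something like this: a lookup that answers the "how should I treat this table?" question per zone. The zone names are the ones used here; the guarantee wording is just our sketch of the contract each layer offers a consumer:

```python
# The zones described above, in processing order. The guarantee strings are
# a sketch of the contract each layer offers someone consuming the data.
ZONES = [
    ("landing",  "files exactly as received; format varies per source"),
    ("raw",      "schema-validated, append-only; may still contain garbage values"),
    ("base",     "cleansed and validated; source-shaped; one record per key"),
    ("enriched", "conformed across source systems into one business schema"),
    ("curated",  "modelled for analytics: facts, dimensions, aggregates"),
    ("semantic", "on-the-fly, filter-context calculations for BI tools"),
]

def zone_guarantee(zone: str) -> str:
    """Answer 'how should I treat data from this zone?' for a given zone name."""
    return dict(ZONES)[zone]
```

Swap in whatever zone names your business uses; the value is that the mapping is agreed and written down, not what the keys are called.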
in a trustworthy, repeatable way. That's all we're trying to do. If what you're doing in your business doesn't need all of those layers, great, combine some of them and do it in one pass; you're making that decision about how to do it. If you're saying, I want to go from essentially a raw state all the way through to a curated state, and you're willing to accept the risk that if something goes wrong you need to reprocess data, fine, cool, that is a business decision; that's how you're designing your architecture. If you're saying, we need to go through every single state, because I want to capture the data and be able to replay it at any point, great, those are the layers we tend to go through. You may want to break it up even more; that's up to you. Just don't feel that you have to squeeze it into bronze, silver, gold, and don't feel that you have to use someone else's definition of bronze, silver, gold for how that should work, because it's more complicated than that. Honestly, if you want to extend the Medallion architecture and say, well, it's not bronze, silver, gold, it's bronze, silver, gold and platinum, or diamond to represent my semantic layer, or ingot to represent your landing layer, that's cool, call things whatever you want. The best examples I have seen are where people have taken these zones of the lake and renamed them something that made sense inside their business. You could be a gaming company that names them after the brands of the games you sell; you could be a manufacturing company that names them after the different stages of your manufacturing, as things go from warehouse through to distribution. Honestly, that last one I wouldn't tend to do, because it's going to get really confusing: the warehouse data, that's in the warehouse layer, and then we can get the warehouse table in the warehouse layer... that gets really confusing. But call it something that makes sense, so your users understand it and actually know what they're getting. That's all you need to do. The final thing: it's not an architecture. I hate that it's called the Medallion architecture, because it's not an architecture, it's a data layout approach; it doesn't actually involve any infrastructure or solution design. Okay, and honestly, that is all I wanted to talk about today. So, if you're trying to build a Medallion architecture, you're trying to build a lakehouse and you're trying to figure out what steps to go through, whether you're using Databricks, whether you're using Fabric, whether you're building your own using Spark somewhere else, wherever you happen to be, those tend to be the different layers we go through; those tend to be the different stages at which we materialise data
down into the lake. We tend to keep all of the data that goes through those layers, and that's an important difference. Honestly, this hasn't changed that much from the days of warehousing, when we had a staging schema and a clean schema and then the warehousing database, and we just moved things through them. The difference is that in the traditional warehousing world we tended to truncate the earlier parts on each load, because we were worried about disk capacity on the various bits of tin we were building our warehouses on. Lakes tend to be a lot cheaper, and they tend to be so scalable that we don't really care, so we keep the data in these different layers, and there are a load of different use cases for it. Your data scientists might go and query the raw layer; it's a great place to find signals for things like insurance fraud, signals which are lost once you've cleaned your data. There are use cases for each of those different layers: why people might go and query that data, and how long you should keep the data there. There's a lot more depth that goes into managing all of this stuff, but for that first moment when you sit down and go, right, I'm going to build a lakehouse, how should I structure it: you don't have to stick to bronze, silver, gold; you can come up with your own definition of what that means to you. There may be way more steps than three involved in your eventual journey from your file format through to the end data model being used by the business. Just call it what makes sense to you; that's the whole point. Cool, so that is the end of my rant. That is all I wanted to share, all the frustrations I needed to get off my chest as someone who's been building lakehouses for quite a long time now, seeing the sudden rush of people arguing about what these layers actually mean. That's what they mean; that is how we build them. And yeah, if you need any help building a lakehouse, feel free to get hold of Advancing Analytics; otherwise, don't forget to like and subscribe, and I'll catch you next time. Cheers.