What is a data lakehouse? In this video I want to give you a super simple explanation that anyone can understand.

Data lakehouses are now used to store data, but first we had data warehouses as a way to store structured data for specific business intelligence and reporting purposes. These date back to the 1980s, and they served businesses well for several decades, until the dawn of big data. This is when businesses started to work with unstructured data: the messy, raw information that might come in the form of pictures and videos and sound
recordings. In fact, 80 to 90 percent of all the data we have in the world is unstructured, and it offers huge value. Just think about the value contained in years of customer emails or conversations, or all the information contained in medical records and medical scans. Unfortunately, it doesn't really fit well with data warehouses and their structured and ordered way of storage.

So this led to the development of a different type of architecture, known as the data lake, where unstructured information is stored in its raw format, ready for whatever users
may be able to find in the future. A data lake is undoubtedly a hugely powerful and flexible way of storing data. However, it also has some issues. For a start, it can get really messy when people just dump all of their data into the data lake, and if you're not careful you can end up with something that resembles a data swamp more than a data lake. This can also create governance and privacy issues, as well as technical complexities involved with creating systems that are able to ingest data in myriad different
schemas and formats.

Which brings us to the architecture of a data lakehouse. This is basically a hybrid approach that takes the best concepts from both the data warehouse and the data lake models and puts them together, while trying to eliminate the downsides of both. Most companies today use both: a data warehouse for analytics and a data lake for the more data-science-y, machine-learning type of applications. The data lakehouse enables structure and schema like those used in a data warehouse to be applied to unstructured data that is typically stored in
a data lake. This means that data users can access the information more quickly and start putting it to work. Those data users might be data scientists or, increasingly, workers in any number of roles who are seeing the benefits of augmenting themselves with analytics capabilities.

So a data lakehouse makes use of intelligent metadata layers that basically act as a sort of middleman between the unstructured data and the data user, helping to categorize and classify the data. By identifying and extracting features from the data, it can effectively be structured, allowing it to
be catalogued and indexed just as it would be in the nice, tidy structure of a data warehouse. So, for example, part of this metadata extraction might be using computer vision or natural language processing algorithms to understand the content of images, text, or voice files that are dumped as raw, unlabeled data into the data lakehouse. This helps companies move from business intelligence to artificial intelligence and enables them to bring unstructured data into their decision making.

So, for example, if you are a retailer and you want to count the number of customers coming
into your shop, you can simply count them and have a nice structured number. But video is so much richer: you can then look at whether they're male or female, you can assess their age, you can even look at how they're dressed and even assess their mood. And yes, you could dump all of this information into a data lake. However, there would be important issues of data governance to address, such as the fact that you're dealing with personal information here. A data lakehouse architecture could address this by automating
compliance procedures, for example by anonymizing data where it is needed.

Unlike data warehouses, data lakehouses are inexpensive to scale, basically because the integration of new data sources is automated and you don't need to manually fit them to the organization's data formats and schemas. They're also open, meaning that the data can be queried from anywhere using any tool, rather than being limited to access through applications that can only handle structured data, such as SQL.

So the data lakehouse approach is likely to become increasingly popular as
more organizations begin to understand the value of using unstructured data together with AI and machine learning. With more mainstream data infrastructure vendors like AWS and Databricks offering this architecture, and open source tools like Delta Lake growing in popularity, the data lakehouse is a term we will be hearing a lot more of in the coming years.
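
To make the metadata layer idea from earlier a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how any real lakehouse product works: `MetadataLayer`, `CatalogEntry`, and `extract_text_features` are hypothetical names, the keyword matcher is a stand-in for a real natural language processing model, and the `customer_email` field is an invented example of personal information. The point is just the shape of it: raw data is kept untouched, features are extracted into a catalog, personal fields are masked automatically, and users query the catalog rather than the raw files.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


def extract_text_features(raw: bytes) -> Dict[str, object]:
    """Hypothetical stand-in for an NLP model: tags text by simple keywords."""
    text = raw.decode("utf-8", errors="replace").lower()
    topic = "complaint" if ("refund" in text or "broken" in text) else "general"
    return {"kind": "text", "topic": topic}


@dataclass
class CatalogEntry:
    path: str
    features: Dict[str, object]


@dataclass
class MetadataLayer:
    # Maps a file suffix to a feature extractor (assumed design, for illustration).
    extractors: Dict[str, Callable[[bytes], Dict[str, object]]]
    # Field names treated as personal information (invented example).
    personal_fields: frozenset = frozenset({"customer_email"})
    raw_store: Dict[str, bytes] = field(default_factory=dict)
    catalog: List[CatalogEntry] = field(default_factory=list)

    def ingest(self, path: str, raw: bytes,
               extra: Optional[Dict[str, object]] = None) -> CatalogEntry:
        # Keep the raw bytes untouched, as a data lake would.
        self.raw_store[path] = raw
        # Pick an extractor by file suffix; unknown formats are still catalogued.
        suffix = path.rsplit(".", 1)[-1]
        extractor = self.extractors.get(suffix)
        features = extractor(raw) if extractor else {"kind": "unknown"}
        features.update(extra or {})
        # Automated compliance step: mask any field marked as personal.
        for name in self.personal_fields:
            if name in features:
                features[name] = "<redacted>"
        entry = CatalogEntry(path, features)
        self.catalog.append(entry)
        return entry

    def query(self, **conditions: object) -> List[str]:
        # Warehouse-style lookup over the extracted metadata, not the raw files.
        return [
            e.path for e in self.catalog
            if all(e.features.get(k) == v for k, v in conditions.items())
        ]


layer = MetadataLayer(extractors={"txt": extract_text_features})
layer.ingest("emails/001.txt", b"My order arrived broken, I want a refund",
             {"customer_email": "a@example.com"})
layer.ingest("emails/002.txt", b"Thanks for the quick delivery")
layer.ingest("scans/003.png", b"\x89PNG")

complaints = layer.query(kind="text", topic="complaint")
print(complaints)  # ['emails/001.txt']
```

In a real system the extractors would be computer vision or language models and the catalog would live in a proper table format, but the division of labor between raw storage and a structured metadata layer would look broadly similar.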