Are you ready to dive into the world of data science, one of the fastest growing and highest paying careers of the next decade? Everywhere you look there is demand for data science, and with the field surging by 35% it offers a lot of opportunities, not just in terms of monetary compensation but also in the challenges and the innovative projects you can work on. Welcome to our 8-hour hands-on data science crash course. This is a great starting point for those who want to know what data science is, who want to get practical when it comes to data science and traditional machine learning, and who want to hear a lot of career insights from this field.

Here's what we are going to cover as part of this crash course. First, you will learn the data science roadmap: what it takes to become a job-ready data science professional, all in one place. In an organized way you will learn from scratch what you should learn and which skills from different fields you must have in order to become a job-ready data science professional, but also what it means to be a data scientist, what this buzzword covers, which fields data science can change and where it can be implemented, what you can expect from a career as a data scientist, and what the different levels and directions are that you can take once you get into this field. This will be really helpful for maximizing your learning, especially if you plan to self-study, and it will be great for anyone who wants to get into this field but has no idea what data science is.

Next up, we will get into the AB testing crash course: learn how tech giants use AB tests, or experimentation, what we also refer to as online testing, to experiment with different ideas, optimize products and maximize impact. Once you get to the point of interviewing for data science jobs, you will notice that AB testing, experimentation and data analytics come up time and time again during your interviews, so this part is specifically for getting you to that point.
Through a real-world case study we will then go from theory to practice, and you will learn and master how to perform data collection, data analysis and AB testing end to end, and then how to make a decision based on your data using your Python and AB testing skills.

Then we're going to get into traditional machine learning, the fancy part: we are going to predict California house prices using techniques like linear regression and causal analysis from econometrics, while learning how to clean data, how to detect outliers using statistics, and how to interpret the results very strategically and very technically. This will be ideal for those who want to see traditional yet simple machine learning in practice. We're going to cover the business transformation and conduct the case study end to end, such that you see the entire process from a data science perspective; I think this will be ideal for people who don't just want to hear the buzzwords about what data science is, but want to actually see it and get a taste of it.

Then we are going to get into another case study: building a movie recommender system, which will help us bring our machine learning and data science skill set to the next level. Here Vahe, my co-founder, will help you create a practical machine learning powered app to enhance your skills in this field and to productionize this recommender system.
This process, which we often refer to as MLOps and which is the continuation of machine learning, is these days a fundamental part of the data science job, so we will give you insights into it as well. In the next step we are going to finalize this crash course with career accelerator insights: a high-level conversation between me and Cornelius, a data science manager at Allianz and an expert in the field of data science. He will tell you about strategies for breaking into data science even without a tech degree, which I'm sure applies to many of you watching us today, and he will help you understand how to build a standout portfolio, how to get into this field, how to prepare for your interviews, and how to get promoted and build trust and authority as a data scientist. I'm sure this will be really interesting for those who want to get started with their data science career but just need that extra notch of confirmation.

If you're serious about becoming a job-ready data scientist, this course will be just a starting point; you will need to spend much more time to get to that level. That is why we have opened applications to our most in-demand program, the Ultimate Data Science Bootcamp. All you need to do is go to lunartech.ai, look for the LunarTech Academy section, pick the Data Science Bootcamp, scroll down, and apply. You will need to take a prerequisite test to ensure you have the required skill set, although we don't ask for much, since we are going to cover everything from scratch; there are not too many prerequisites from our side. For those following any of our free or paid courses, consider also subscribing to the LunarTech Academy page, because there you can also get a certification for the courses you are following.

This course is more than just theory: it's an introduction to the tools and skills you can use to turn ideas into data-driven, impactful results. And remember, this is just the beginning. Let's get started on your journey to becoming a data scientist. Today I will tell you how to get into data science in 2024 in the most effective and convenient way.
in the most effective and convenient way I've been in data science and in AI for the last 5 years have worked across many top tech companies in Europe and North America I've also co-founded the data science nii company run Tech where we are making data science and AI accessible to everyone people often ask me how how they could get into data science if they would learn data science day they often asked me to become their personal mentor And to guide them and they are willing to pay significant amount of money for that but I don't
want to charge you today and instead I'm going to give you that information for free I'm going to tell you everything that you need to know in order to get into data science in 2024 in this video I'm going to tell you what data science is what are the common applications and how you can use data science in business es what you can expect as a data scientist from your Career and what are the common career paths and entry position you can use in order to get into the field of data science and artificial intelligence
what kind of salary you can expect and what are the must have skills that you must learn in order to get into data science and at the end I'm also going to provide you ton of resources and courses and wood camps you can follow to become a dual data scientist in 20 24 so without further Ado let's get started so data Science is all about mixing data with mathematics statistics machine learning to find out cool things spot Trends and make predictions think of data science as a full of this it's everything that we need together
In the business world it's pretty easy to tell who is actually using data science: those companies are often ahead of the game, making smart moves based on what the data tells them. Whether it's healthcare, using data to better predict patient outcomes, or marketing, understanding customer behavior to serve customers better and increase their satisfaction, data science is making a huge difference. Why does this matter? Well, for starters, about 85% of companies are hunting for people with data science skills. They know and understand that data can lead to making better decisions, being more competitive and, yes, making more money. In some industries this can mean a difference of millions if not billions: cutting costs, increasing revenue, increasing customer satisfaction, knowing what product to launch, optimizing operations to make them more effective, or ensuring that the business is profitable can make or break the business.

This video is sponsored by LunarTech. At LunarTech we are all about making you ready for your dream job in tech, making data science and AI accessible to everyone, whether it is data science, artificial intelligence or engineering. At LunarTech Academy we have courses and bootcamps to help you become a job-ready professional. We also help businesses, schools and universities with top-notch training, curriculum modernization with data science and AI, and corporate training, including the latest topics like generative AI. With LunarTech, learning is easy, fun and super practical; we care about providing an end-to-end learning experience that is both practical and grounded in fundamental knowledge, and our community is all about supporting each other and making sure you get where you want to go. Ready to start your tech journey? LunarTech is where you begin. Students and aspiring data science and AI professionals can visit the LunarTech Academy section to explore our courses, bootcamps and programs in general. Businesses in need of employee training, upskilling or data science and AI solutions should head to the Technology section on the lunartech.ai page. Enterprises looking for corporate training, curriculum modernization and customized AI tools to enhance education should visit the LunarTech Enterprises section at lunartech.ai for a free consultation and a customized estimate. Join LunarTech and start building your future, one data point at a time.

So let's dive deeper and explore a couple of applications of data science in businesses, and the power of data science across various industries, adding a bit more detail and introducing an example from the energy sector. Data science is the art of turning data into insights and predictions using a mix of statistics, computer science, machine learning and, definitely, deep domain knowledge. It enables decision makers to cut through the noise and focus on what the data reveals about future trends, operational efficiencies and more.
In healthcare, data science is used heavily these days: it improves the efficiency of hospital operations and is revolutionizing the way we approach disease prevention, diagnostics and the treatment of patients. By analyzing large amounts of patient data, including health records and genetic information, data scientists are able to identify patterns and risk factors for diseases long before the symptoms even occur. For instance, predictive analytics models for heart disease can analyze a patient's lifestyle, family history and biomarkers to forecast their risk level. This proactive approach then allows for earlier intervention, such as lifestyle adjustments or even preventative medication, ultimately improving patient care and reducing healthcare costs.

In the retail industry, personalization and targeted marketing, identifying different customer behaviors and customer groups, definitely help improve the customer experience, but also improve sales and profits for the retailers.
Retailers collect data from various sources, including online browsing habits, purchase history and even social media activity. By applying data science algorithms they can then identify which products a customer is most likely to be interested in and recommend them. For example, if you have been searching for running shoes online, data science helps retailers show you ads for related products like sportswear or fitness equipment. This not only improves the shopping experience for customers but also increases sales and marketing efficiency for the retailers.

The finance sector relies heavily on data science for fraud detection and for investment strategies, for example deciding whether to go long or short on different assets. By examining patterns in the transaction data, or in the prices of assets and the behavior of customers, banks and data scientists can pinpoint anomalies that suggest fraudulent activity, and this helps protect both the customers and the institutions themselves. Sophisticated algorithms can also analyze market trends and financial indicators to surface investment opportunities, which can help investors maximize the returns on their investments.
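To make the fraud-detection idea concrete, here is a minimal, hedged sketch of the simplest version of that anomaly-spotting logic: flagging transactions whose amounts deviate strongly from typical behavior. The data is synthetic and the three-sigma rule is an illustrative assumption, not a production fraud model.

```python
import numpy as np

# Synthetic transaction amounts for one customer: mostly typical purchases,
# plus a few unusually large ones we would like to flag.
rng = np.random.default_rng(42)
amounts = np.concatenate([rng.normal(50, 15, 500),   # typical spending
                          [400.0, 650.0, 520.0]])    # injected outliers

# Z-score: how many standard deviations each transaction is from the mean.
z_scores = (amounts - amounts.mean()) / amounts.std()

# Flag anything beyond 3 standard deviations (a common, if simplistic, rule).
suspicious = amounts[np.abs(z_scores) > 3]
print(f"Flagged {len(suspicious)} of {len(amounts)} transactions: {suspicious.round(2)}")
```

Real fraud systems use far richer features and models, but the core idea, learning what normal looks like and flagging deviations, is the same.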
When it comes to the energy sector, data science can be used there to improve the overall sustainability of the industry; take sustainability in oil production, for instance. A compelling application of data science in making oil companies more sustainable and eco-friendly is the following. Consider an oil company that wants to reduce its carbon footprint through operational efficiency. The company's operations, from finding the oil to generating the final products, pass through different stages, and each stage releases a different amount of carbon dioxide into the environment. If we want to make this company more sustainable and eco-friendly, we want to identify the parts of the operations that produce the largest amounts of carbon dioxide, the parts most dangerous for the environment, and then see whether, without compromising the existing operations, we can remove or scale back certain parts, and by how much, to ensure that the overall production of carbon dioxide is reduced. In this way we improve the overall sustainability and eco-friendliness of the oil company while it continues its operations and produces the same amount of refined oil.

For example, by deploying sensors across its operations, from drilling sites to refineries, the company can collect a vast array of data on energy consumption, emissions and production output. Data scientists can then use this data to build models that identify exactly where the highest carbon emissions occur within the operations: perhaps it's a particular refinery process, or maybe certain drilling techniques that are less efficient. This helps the company implement targeted strategies to reduce emissions, such as optimizing operations, adopting cleaner technologies, or enhancing equipment efficiency. This will not only help the company become more sustainable, it will also align with the global effort to combat climate change; and by reducing waste and improving operational efficiency the company can lower its costs, demonstrating that sustainability and profitability can go hand in hand.

At the moment a striking 85% of companies are looking for data scientists but are unable to find them. This demand has shot data science into the spotlight, making it not just a cool job but a lucrative one too.
It's also your gateway into fields like AI and machine learning: without a solid grip on data science and all the different skills a data scientist needs, you cannot really get into AI and machine learning without a PhD. But beyond the hype, it's a field that offers the chance to solve real problems and earn a great salary; it's fun and exciting, and it helps businesses, and people's lives, in significant ways. So stick around as we break down what you need to get started in data science, if you are curious about how to dive into AI, make businesses smarter, change healthcare and make a lot of impact.

Let's now start with the must-have skills to get into data science. These skills form the bedrock of your data science journey and will equip you with the tools needed to explore advanced territories such as artificial intelligence and machine learning. First, you must have skills in mathematics, and here I'm not just talking about high school mathematics like basic arithmetic, but about somewhat more complex concepts like differential calculus, differential theory and linear algebra.
These mathematical concepts are very important for understanding and applying not just basic data science but also machine learning algorithms, and they can later help you get into ML, then artificial intelligence, and also deep learning concepts. To understand all of this you will need a solid understanding of linear algebra, a solid understanding of differential calculus and differential theory, a bit of geometry, and the intuition and interpretation behind these different concepts.

So let's start with high school mathematics. In order to get into data science you need to feel comfortable with calculations: solving quadratic equations, working with polynomials, understanding how to calculate the discriminant and find the solutions to equations, understanding mathematical notation, and basic trigonometry and geometry, such as the Pythagorean theorem and basic concepts like sine, cosine and tangent. Then we get into differential theory, that is, a good understanding of calculus; differential theory is usually part of Calculus 1, 2 and 3, and it will be very important. It includes knowing how to find slopes, inflection points, derivatives and gradients, and the concept of integration.
For data science, knowing multivariate calculus, especially up to two variables, is also necessary: you should be able to calculate partial derivatives and understand gradients and the Hessian, which play a significant role in machine learning and in deep learning. This differential theory will be very important when it comes to more advanced data science fields like machine learning and deep learning, for understanding the different optimization algorithms such as GD, SGD, SGD with momentum, or RMSProp; for all of these you will need to understand the concept of gradients, how to calculate them, and partial derivatives, and this will help you advance later in your data science career.

When it comes to linear algebra, it is usually part of a bachelor's in econometrics or more advanced sciences, and it is fundamental for data science. Key concepts include: understanding dot products; the concept of vectors and vector operations; matrix operations like transpose, inverse and determinant properties; properties of symmetric and diagonal matrices; being able to solve different equations and linear systems using Gaussian elimination and Gauss-Jordan reduction; understanding the different concepts of decomposition, eigenvalues and eigenvectors and how to calculate them; the concepts of linear independence, span, null space, column space and bases; understanding the Gram-Schmidt algorithm and the other algorithms you need to solve linear systems; and understanding different matrix factorization concepts, including the eigenvalue decomposition based on eigenvectors and the QR decomposition. You need to understand the different operations you can apply to matrices, how to multiply a matrix with a vector and what the properties are, the concept of rank and the concept of dimension.
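Since every one of these linear algebra concepts can be tried out directly in Python, here is a small, hedged sketch using NumPy to illustrate a few of them: transpose, determinant, inverse, matrix-vector products, solving a linear system, and an eigendecomposition. The matrix values are arbitrary examples.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # a symmetric 2x2 matrix
b = np.array([1.0, 2.0])

print(A.T)                            # transpose (equal to A here, since A is symmetric)
print(np.linalg.det(A))               # determinant: 2*3 - 1*1 = 5
print(np.linalg.inv(A))               # inverse
print(A @ b)                          # matrix-vector product

x = np.linalg.solve(A, b)             # solve the linear system A x = b
print(x)                              # (Gaussian elimination happens under the hood)

eigvals, eigvecs = np.linalg.eigh(A)  # eigendecomposition of a symmetric matrix
print(eigvals)                        # eigenvalues
print(eigvecs)                        # eigenvectors, one per column
```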
I do not want to scare you, that's definitely not my goal, but I want to make this different from all the other videos you may see, because those videos say at a high level that you need linear algebra without providing the details. These are concepts that might seem complex at first, but if you hear a clear explanation they will make much more sense. My goal here is not to scare you but to truly provide you with step-by-step actionable insights, so you can prepare now for what to expect and become a well-rounded professional, rather than repeatedly discovering that something is missing in your mathematical theory when you try to truly master the art of artificial intelligence and machine learning. Because if you want to go beyond using libraries and actually understand those different concepts, knowing linear algebra and these related concepts will be very helpful.

Another field you must know is statistics. Understanding the fundamentals of statistics is very important in order to get into data science; it's essentially the bread and butter of this field.
Without a solid grasp of statistics, navigating the complex concepts in data science can become significantly challenging. Here is the breakdown of exactly what you need to know in statistics in order to get into data science in 2024. You need to know the concept of variables and understand what random variables are; the concepts of population and sample, what the differences are, why we need a sample, when we can use the population and why we often can't; and the foundations of probability theory, knowing how to define and work with random variables and their outcomes, concepts that are crucial for making inferences about an entire population using smaller samples. You need to master the rules and fundamentals of probability, including conditional probability and Bayes' rule, which are really important. Then you also need to understand descriptive statistics: how to calculate the mean, the variance and the standard deviation, why we need them and how they describe the data, and also correlation and covariance, what the difference is, why we need them, and how we can use them to identify and describe the data (see the sketch below). When it comes to probability distributions, you need to be familiar with the distribution functions we must know, including the normal distribution, the Poisson distribution and the binomial distribution; understanding these will help you get ahead and understand more complex concepts such as predictions, prior probability and posterior probability.
Regression is now also important: what linear regression is and why we need it, the concept of statistical significance, independent variables and dependent variables, hypothesis testing and how it relates to statistical significance, why we use linear regression for causal analysis and what causal analysis is, and the different sorts of tests we can use to test statistical significance: what the t-test is, what the F-test is, what the non-parametric chi-square test is and the difference between non-parametric and parametric, the idea of statistical significance, the p-value, type I errors and type II errors, and the other kinds of errors we have in statistical testing. You also need to know inferential statistics: the law of large numbers and the central limit theorem, why we need them and how they help us make conclusions about the population by using a sample.
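To connect hypothesis testing, the t-test and the p-value to code, here is a hedged sketch using scipy.stats on two small synthetic samples; the 0.05 significance level is a conventional choice, not a rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=50)  # e.g., control measurements
group_b = rng.normal(loc=105, scale=10, size=50)  # e.g., treatment measurements

# Two-sample t-test: the null hypothesis says the two population means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05  # significance level: the type I error rate we are willing to accept
if p_value < alpha:
    print("Reject the null: the difference in means is statistically significant.")
else:
    print("Fail to reject the null: no statistically significant difference detected.")
```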
This is a huge list of concepts at first glance, but when you have good guidance and good resources and you know exactly what you need to learn, it becomes much more digestible, and knowing the mathematics I mentioned before will definitely help you learn the statistics more easily. A good explanation with corresponding examples will definitely help you better understand these different statistical concepts, and these concepts all come together; later on we will see, in different case studies, some of which are also on this channel, why those concepts are actually important. Whenever you have a data science case study, or just in general a study related to data, the mean, the variance and the standard deviation are usually the concepts you must use as your first step to learn about your data. A case study will usually start with identifying the problem, but the next step is to describe the data, and you describe the data with descriptive statistics like the mean, the standard deviation, variance, correlation and covariance; all of this helps you explain your data and tell a story about your data. This is just an example of how statistics is highly related to data science and how the different topics I just mentioned are a must in order to get into data science in 2024.

Next up is machine learning. You definitely need a traditional machine learning skill set to get into data science, and you need to understand the fundamentals; this means traditional machine learning, not the advanced kind.
You can explore the advanced topics later, but I would suggest you start with the basics. You must distinguish between supervised and unsupervised learning; know classification versus regression models and how to evaluate each type; understand the entire cycle of training, testing and validating different machine learning models, the idea of splitting the data, why we need the training data and why exactly we need the validation data and the test data; and understand the different linear regression and classification models. Definitely know the traditional and most popular algorithms, from linear regression to boosting techniques, including linear regression, logistic regression, KNN, LDA, and decision trees for classification and for regression; understand bagging and boosting, random forests and the different boosting algorithms like XGBoost, AdaBoost and LightGBM; and know the different resampling techniques, including cross-validation and its different variants like k-fold cross-validation and leave-one-out cross-validation, plus the concept of bootstrapping and, in general, how to use machine learning for predictive analytics. You also need to understand the metrics you can use to evaluate different machine learning algorithms, both for classification and for regression: for example, mean squared error or root mean squared error for regression problems, and cross-entropy or the F1 score, recall or precision for classification models. This entire process of learning the traditional machine learning skill set is really important for getting into data science and for staying at the top of it; a minimal sketch of the train/test/evaluate cycle follows below.
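Here is the promised minimal sketch of the train/test/evaluate cycle with scikit-learn: splitting the data, fitting a linear regression, and scoring it with MSE, RMSE and cross-validation. The bundled diabetes dataset is just a stand-in for whatever data you would actually use.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)

# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out data with regression metrics.
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"MSE:  {mse:.1f}")
print(f"RMSE: {np.sqrt(mse):.1f}")

# 5-fold cross-validation on the training data, as mentioned above.
cv_rmse = -cross_val_score(model, X_train, y_train, cv=5,
                           scoring="neg_root_mean_squared_error")
print(f"CV RMSE: {cv_rmse.mean():.1f}")
```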
And this matters not just for the data analytics part of data science but for the fancy and fun part, which is predictive analytics, and that's something you can do with traditional machine learning. When it comes to deep learning, which is more advanced machine learning, I would suggest you lay low for the moment, in the beginning of your career; once you master traditional machine learning, you are ready to get into a bit more advanced machine learning and then understand the deep learning concepts, including neural networks and so on.

Next up is AB testing. AB testing, or split testing, is very important for your data science skill set, especially in product data science, product development and feature testing. It involves comparing two versions, A and B, to determine which performs better on a given metric.
If you want to test out different features of your product in an intelligent way, for instance you want to compare one algorithm with another, or you have a new feature you are proposing to launch but you actually want to learn from the customers whether they like it or not, then you can compare the new feature with the existing one using AB testing. In a data-driven way, without guesses or speculation, you will ask the customers to tell you whether they like this new feature or not. Being able to intelligently conduct an AB test is really important and will set you apart as a good data science practitioner.

As part of the AB testing skill set I would include: designing an AB test; understanding power analysis and the three different parameters that form it; designing the complete test, from choosing the primary metric, like conversion rate or click-through rate, up to the point of estimating or calculating the minimum sample size; calculating the duration of the test; conducting the test and knowing how to monitor it; the idea of test integrity and avoiding the pitfalls that can corrupt the results, such as peeking; and then, once you have collected your data, conducting an AB test results analysis in Python: calculating the standard error, the pooled estimates and the pooled variance, then the corresponding test statistic, then the p-value, understanding how to compare it to the significance level, how to conclude whether you have statistical significance, and how to conclude whether you have practical significance, while making sure you are not running into one of the AB testing pitfalls. This is an art on the one hand, but it's actually a data science concept on the other. Knowing how to conduct this test end to end and understanding all the technical details behind it, the statistics and the mathematics behind it, is really important and can help you a lot in making data-driven decisions and helping businesses avoid mistakes when making changes to their products. A minimal sketch of such a results analysis follows below.
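Here is the minimal sketch referenced above of an AB test results analysis on conversion counts: a pooled two-proportion z-test that computes the pooled estimate, the standard error, the test statistic, the p-value and a confidence interval. The counts are made up, and a real analysis would also check practical significance and test integrity.

```python
import numpy as np
from scipy import stats

# Made-up results: conversions out of visitors in control (A) and treatment (B).
conv_a, n_a = 480, 10_000
conv_b, n_b = 550, 10_000

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled conversion rate under the null hypothesis of no difference.
p_pool = (conv_a + conv_b) / (n_a + n_b)

# Standard error of the difference in proportions, using the pooled estimate.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se                         # test statistic
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p-value
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p_value:.4f}")

# 95% confidence interval for the lift (the unpooled SE is standard for the CI).
se_ci = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lift = p_b - p_a
print(f"95% CI for the lift: [{lift - 1.96 * se_ci:.4f}, {lift + 1.96 * se_ci:.4f}]")
```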
Getting into data science in 2024 means you not only need to know those basics but also a few more important areas: NLP, at least at an introductory level, programming, especially Python, and business acumen. Here is a simple breakdown. Understanding NLP means knowing at least the basics: how to process text data, how to do tokenization, basic NLP techniques like bag of words, CountVectorizer and TF-IDF, different sorts of semantic analysis techniques, how to work with embeddings, what they represent and how they relate to the famous language models, and knowing at least at a high level what models like BERT, GPT-3, GPT-4, the GPT series and T5 represent. As an introductory data scientist you don't need to know these concepts in detail, things like transformers or multi-head attention; but you do need to understand at a higher level what they represent and how they can be utilized by different businesses, so you can interact with AI professionals and machine learning engineers, and later on, once you master data science, you can always get into AI. A small sketch of basic text vectorization follows below.
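To make tokenization, bag of words and TF-IDF tangible, here is a small, hedged sketch using scikit-learn's CountVectorizer and TfidfVectorizer on three toy sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great",
        "the movie was terrible",
        "a great great film"]

# Bag of words: raw token counts per document.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # the learned vocabulary (tokens)
print(counts.toarray())             # one row per document, one column per token

# TF-IDF: down-weights tokens that appear in many documents (like "the").
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```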
When it comes to programming, it is really important: independent of which area of data science you go into, Python is the go-to language for data science, because it's easier to learn and it has a lot of tools for analyzing data. If you are considering learning R instead, I would say R is good for people who want to stay on the data analytics side of data science and just do statistical analysis, because R is great for visualizations and for running different sorts of regression models, such as linear regression, GMM, FGLS, GLS and logistic regression, and even clustering. But when it comes to machine learning or AI, so training machine learning models, which are a bit more advanced, and later on training neural networks, Python is the go-to option. For people who want to get into data science and later have the option to move into machine learning and AI, which are the more advanced levels after data science is a field you have mastered, Python will be a much better decision than R, because R is only helpful up to the point where you decide you actually want to learn machine learning and AI. Python is able to do everything R can do, but it goes beyond data wrangling, data processing and basic statistical analysis: it has many great libraries, including PyTorch and TensorFlow, that will allow you to learn the applications of machine learning and deep learning, how to train different large language models, and how to train different generative AI models
like GANs and VAEs, that is, variational autoencoders, but also to do basic work like performing data analytics. Python has great libraries like pandas and NumPy for data wrangling, data analysis, data loading and data preprocessing, and you have to know these if you want to get into data science. Then, going beyond data preprocessing, Python has many other libraries, including scikit-learn, which is great for machine learning; statsmodels for statistical modeling; NLTK and spaCy, both great libraries for NLP; and many other libraries for visualization, including the famous pair, matplotlib and seaborn, both great visualization tools you can use to visualize your data from within Python. And here, when I say Python, I mean not only knowing all of this but also the basic things everyone needs when it comes to programming: basic data structures, understanding variables, arrays and matrices, how to work with the different indices, working with for loops, if-else statements, different conditions and while loops; and also understanding how you can go beyond these basics: how to perform data analytics in Python, how to do predictive analytics in Python using the different machine learning libraries, how to train, test and validate different machine learning algorithms, and, later on, how to train more complex machine learning algorithms, including deep learning algorithms, in Python. A short sketch of these basics follows below.
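And here is the short sketch promised above of those programming basics, lists, conditions and loops, plus a taste of pandas for data wrangling; the tiny inline dataset is invented for illustration.

```python
import pandas as pd

# Basic data structures and control flow.
ages = [23, 35, 41, 29, 52]
older_than_30 = [a for a in ages if a > 30]  # list comprehension with a condition
for a in older_than_30:
    print(f"age: {a}")

# pandas for loading, inspecting and transforming tabular data.
df = pd.DataFrame({"customer": ["a", "b", "c", "d", "e"],
                   "age": ages,
                   "spend": [120.0, 340.5, 89.9, 410.0, 230.0]})

print(df.describe())                    # quick descriptive statistics
df["high_spender"] = df["spend"] > 200  # vectorized condition, no loop needed
print(df.groupby("high_spender")["age"].mean())  # aggregate by group
```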
Another skill I would mention is communication combined with business acumen. It would not be fair for businesses to expect interns or entry-level candidates, these early professionals, to have enough communication and business skills to perform the way a senior data scientist would, because when you are entering the field of data science you have usually just graduated from your bootcamp, or from your undergraduate or master's degree in data science. Therefore, for entry-level data scientists I would not say that communication and business acumen are a must: having the theory, the practice and the hard skills is what will set you apart as an entry-level candidate. But having communication skills and business acumen will definitely set you apart from the other candidates who have the hard skills but lack them. Because at the end of the day, in data science we don't care so much about the underlying model we use, or what kind of statistical or machine learning technique we use; we really care about the story the data tells us and how it helps us improve the business. At the end of the day, all we care about is using that data to improve a business, improve operations in institutions, or help the customers, make them happier, and build products the customers will be really happy about.

So we want to use these different hard skills to benefit a stakeholder, and the stakeholder can be an institution, a business or a user; what we want to do is make use of data science, statistics, mathematics and programming to come up with solutions and products that will help the end user. That's exactly where we need our business acumen and our curiosity, to ask the question why and not just how. It will differentiate you from other data scientists when you care not only about the how, so what kind of model to use and how to use it, but also about the why: why we need that model, why we need that solution, and what kind of impact it will make. The same goes for communication: not just knowing how to train that machine learning model, but truly understanding why we need it, and being able to formulate the problem in a way the stakeholder will understand, because most of the time the stakeholders are people with no tech background. Being able to make those translations, to take the business problem, translate it into a data science problem, solve the technical problem, and then translate and communicate it back in non-technical words, is all part of this skill set that I would call optional for entry-level candidates but definitely a must when it comes to growing in your data science career. You need to understand how to communicate and tell your data story, which goes beyond the technical parts of your data project, and to have the business acumen to understand what efficiency means for a business, which parts are most likely struggling, and where using data can help. You should understand and communicate with the business partners such that you can come up with potential solutions to their problems; in many cases the businesses themselves don't even know how their data can help them, and it will be your job as a data scientist to identify the areas that can potentially improve the business. Having this business acumen, especially as you grow in your data science career, can be fundamental in helping you make your stakeholders happy, grow in your career and become a real professional in data science.
So let me give you an example. Say you are a medior data scientist: you have been working as a junior data scientist for two years and you are now at the medior level, and the business comes to you and says, well, here we have all this data, we have this business, but we don't know what we can do with it. Then, as a data scientist with business acumen, you will go and do your research, you will understand and identify the different areas in that business, in that specific industry and for that specific company, and you will propose different solutions; the company can then choose the one they think is the most important for them.

It can be that you are dealing with a marketing company that has a lot of data but doesn't know how to use it to improve its marketing efficiency. You can use their data to help them understand the behavior of their customers: you can cluster those customers into different groups, find out what the customers in those different groups are likely to do, and based on that advise them on what kind of marketing strategies they can use to improve their marketing efforts while decreasing their marketing costs (a small clustering sketch follows below). But to do this you need to understand what kind of business you are dealing with, what kind of data they have, the coverage of the data and its potential implications, the area that the company covers, what kind of customers you are dealing with, the nature of the data, which customer segmentation techniques exist and which of them you can use to segment their customers into groups, and how those customer groups need to be treated in order to improve the marketing efforts, which means potentially increasing revenue while decreasing the costs they spend on those marketing efforts. Knowing this is really important once you interact with product managers, stakeholders and business clients, because it will be much easier for you to communicate with people with no tech background while understanding the business really well, and to learn from them too, which will help improve the impact of your data science projects.
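As mentioned in the marketing example, here is a minimal, hedged sketch of customer segmentation with k-means on two invented features; choosing three clusters and these particular features is purely illustrative, and a real project would validate the number of clusters and the feature choices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Invented customer features: annual spend and number of orders,
# generated from three loose groups so the clusters are findable.
rng = np.random.default_rng(7)
spend = np.concatenate([rng.normal(200, 30, 50),
                        rng.normal(800, 80, 50),
                        rng.normal(1500, 120, 50)])
orders = np.concatenate([rng.normal(5, 1, 50),
                         rng.normal(20, 3, 50),
                         rng.normal(45, 5, 50)])
X = np.column_stack([spend, orders])

# Scale the features so neither dominates the distance metric, then cluster.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Each customer now has a segment label a marketer can target differently.
for label in np.unique(kmeans.labels_):
    segment = X[kmeans.labels_ == label]
    print(f"segment {label}: avg spend {segment[:, 0].mean():.0f}, "
          f"avg orders {segment[:, 1].mean():.1f}, size {len(segment)}")
```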
Let's now talk about the list of resources, free and paid, that you can use to become a job-ready data science professional. Many people choose the academic path, which means they graduate from a bachelor's or master's degree in data science. But many of the people you now see in data science, great professionals with a well-rounded technical skill set who are making a lot of impact, don't have a master's in data science: they have some other sort of technical degree, or they have graduated from a strong bootcamp. Now, I do know that a bootcamp can often cost 15 to 20K per person, which is not very accessible, and the academic path might not be the choice for you either, so here is a list of free and paid resources to help you become a job-ready data science professional.

Let's start with mathematics. For learning mathematics you can use free resources like Khan Academy, which has a ton of great material on the fundamental concepts, from linear algebra and differential calculus to pre-algebra, to help you master that skill set and become a job-ready professional. The thing is, the curriculum still needs to be selected, which means you need to know exactly what to learn and what not to learn, so you don't end up learning everything and spending a lot of time there that you could instead spend on other skills that are also very important for data science. What I would suggest you do is get in touch with us and join our shortlist, because our mathematics bootcamp is about to launch, and it is going to cover all the
must-know knowledge from the field of mathematics, including linear algebra, differential theory and pre-algebra.

Now, when it comes to the fundamentals of statistics, I have both free and paid resources for you. The resource you can use for free is the Fundamentals of Statistics handbook, which covers at a high level all the must-know concepts and the fundamental theory of statistics; it doesn't contain too many examples or applications, but it does contain the theory you need to get into data science in 2024. If you want a more detailed, laid-out approach, the Fundamentals of Statistics course will be a great choice for you; it is also part of our Ultimate Data Science Bootcamp. So you have a choice here: you can go with the course itself, which covers all the theory and a lot of practical material and quizzes to help you master the fundamentals of statistics, or you can go for the Ultimate Data Science Bootcamp, which covers not only the Fundamentals of Statistics course but also all the other concepts you need to become a job-ready professional; both of them will help you master the fundamentals of statistics.

The next skill set, and the resource corresponding to it, is the fundamentals of machine learning. For free, you can download the Fundamentals of Machine Learning handbook, which covers all the must-know algorithms I mentioned today at a higher level, with their corresponding Python code.
You can download it for free in the free resources section at lunartech.ai. Alternatively, if you want to master machine learning, learn the theory through our demos and lectures, and combine that with examples and quizzes, go for the Fundamentals of Machine Learning course at lunartech.ai; like the Fundamentals of Statistics course, it is also part of our Ultimate Data Science Bootcamp. When it comes to AB testing, we have a comprehensive AB testing course covering everything in one place at lunartech.ai: you can follow the Fundamentals of AB Testing course, which is also part of the Ultimate Data Science Bootcamp. For NLP you can follow our Introduction to NLP course, which is also part of the bootcamp. Python for Data Science is another course you can follow at lunartech.ai to master all the Python topics you need to get into data science. There are also many free courses out there, including an introductory-level crash course on our YouTube channel that you can try in order to learn the very basics of Python; not a comprehensive list, just a small portion, but still something to get you started.

If you want to combine all these resources in one place and become a job-ready professional in three to six months, I would highly recommend you try our Ultimate Data Science Bootcamp, which is the most accessible yet most comprehensive data science bootcamp out there. It comes with a corresponding certification, which will show that you have completed not just the theory but also the practical case study projects across different fields in data science,
as well as interview preparation and many assignments and quizzes. The Ultimate Data Science Bootcamp covers Fundamentals of Statistics, Fundamentals of Machine Learning, Introduction to NLP, complete AB testing, and Python programming specifically for data science, all combined with multiple case studies from different fields, including product data science, machine learning, statistical modeling, causal analysis and even recommender systems from AI and NLP; we are updating this list of case studies continuously, every month. The Ultimate Data Science Bootcamp not only contains the fundamental theory in a most comprehensive and detailed way, with many examples, quizzes and assignments, combined with the practical projects you can put on your resume to showcase your skill set; it also contains career guidance showing exactly what the roadmap for data science and machine learning looks like, what you can expect from a career in data science, what you can expect from a data science interview process and the different types of interviews you will need to prepare for, and how you can use the different courses in the bootcamp to prepare entirely, end to end, for your data science interviews and to ace them with confidence. Besides this, you will also get a ton of resources, all the handbooks I mentioned. You can complete the Ultimate Data Science Bootcamp, the most accessible yet comprehensive bootcamp providing an all-in-one solution to become a job-ready professional, in three to six months, or you can do it at your own pace, part-time, virtually, whether in six months or twelve months, depending on what your motivation is,
on what kind of free time you have, what your work-life balance looks like, and how much time you want to put in. You have a choice of subscription plans, and we have three different pricing options. The first one is the exploration package, at $150 on a monthly basis; this one also comes with a free trial, which means you can use it to explore and understand what is in it, with some of the items locked, so it is really just for trying things out. The bootcamp is entirely virtual, which you can take at your own pace, in 3, 6 or 12 months; it's really up to you and the kind of lifestyle you have, the work-life balance and the responsibilities you have outside of this goal of becoming a data scientist. So the monthly plan is for exploration: you can make use of it to understand what is out there, what kind of lectures you are dealing with, and what kind of curriculum you have as part of this Ultimate Data Science Bootcamp. Alternatively, if you are sure about the curriculum and you have seen the quality we provide, you can go directly for the half-yearly option, which contains the whole bootcamp: all the different courses, interview preparation material, quizzes, assignments, and the certification included too. On top of that you also get fast-tracked 24/7 support: your emails will be answered very quickly through a priority line, and you will get the help you need from our customer support. Our yearly plan additionally includes three one-on-one sessions with one of our trained professionals, to provide you with career coaching or personalized one-on-one help when it comes
to any parts of the bootcamp you have difficulties with and to answer any questions you have, so basically personalized help; it can also be a one-on-one consultation to help with your resume building and personal brand building.

Welcome now to the hands-on AB testing crash course, where we will do some refreshing when it comes to AB testing. If you're looking for that one course
where you can learn and quickly refresh your memory on AB testing and how to actually do an AB testing case study hands-on in Python, then you are in the right place. In this crash course we are going to refresh our memory on AB test design, including the power analysis and defining the different parameters, such as the minimum detectable effect, the statistical significance level, and the type II error probability, which determines the power of the test; and then we are going to do a hands-on case study project where we will conduct an AB testing results analysis in Python. At the end of this course you can expect to know everything about designing an AB test, what it means to design a proper AB test, and how to do an AB test results analysis in Python in a proper way.

I'm Tatev Aslanyan, co-founder at LunarTech, and I have been in data science for the last five years. I learned AB testing end to end after following numerous blogs, numerous research papers and courses, and I noticed that there is no single place, no one course, that covers all the fundamentals and the necessary material, both the theory and the implementation in Python, in one place. That's about to change, as this crash course will help you do exactly that: learn how to design an AB test in a proper way, like a good, solid data scientist, and showcase your skills by doing an AB testing results analysis in Python. Don't forget to subscribe, like and comment to help the algorithm make this content more accessible to everyone across the world; and if you want free resources, make sure to check the free resources section at lunartech.ai,
and if you want to become a job-ready data scientist and you are looking for an accessible bootcamp that will help you get there, consider enrolling in the Data Science Bootcamp. So whether you are a product scientist, a data analyst, a data scientist, or a product manager who wants to learn what AB testing is at a high level and how it can be done in Python, you are in the right place. In this crash course we are going to refresh our memory on what it means to properly design an AB test, which means doing the power analysis and also calculating the sample size by hand, following the statistical guidelines and ensuring that everything is done properly; a minimal sketch of such a sample-size calculation follows below.
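Here is the sketch referenced above: a minimal power-analysis calculation for a two-proportion AB test using statsmodels. The baseline rate, minimum detectable effect, significance level and power are all illustrative inputs you would choose for your own test.

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # assumed current conversion rate
mde = 0.02       # minimum detectable effect: we care about a 2-point lift
alpha = 0.05     # statistical significance level (type I error probability)
power = 0.80     # 1 minus the type II error probability

# Cohen's h effect size for the two proportions.
effect = proportion_effectsize(baseline + mde, baseline)

# Minimum sample size per group for a two-sided, two-sample test.
n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                           alpha=alpha, power=power,
                                           alternative="two-sided")
print(f"Need about {int(np.ceil(n_per_group))} users per group")
```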
Then, as the second part of this crash course, we are going to do a hands-on case study in Python, performing an AB testing results analysis. We will cover all the important concepts, such as p-values and sample size, and interpret the AB test results: using the standard error, calculating the estimates and the pooled variance, and then evaluating the AB test results, including the confidence interval, the generalizability of the results and the reproducibility of the results. So without further ado, let's get started.

AB testing is an important topic for data scientists to know because it's a powerful method for evaluating changes or improvements to products or services. It allows us to make data-driven decisions by comparing the performance of two different versions of a product or a service, usually referred to as treatment and control. First, AB testing allows data scientists to measure the effectiveness of changes to a product or a service, which is important as it enables data scientists to make data-driven decisions rather than relying on intuition or assumptions. Secondly, AB testing helps data scientists identify the most effective changes to a product or a service, which is really important because it allows us to optimize the performance of the product or service, which can then lead to increased customer satisfaction and sales. AB testing also helps us validate theories and hypotheses about which changes will improve a product or service; this is important because it helps us build a deeper understanding of the customers and the factors that influence customer behavior.
Finally, AB testing is a common practice in many industries, such as e-commerce, digital marketing, website optimization, and many others, so data scientists who have knowledge and experience in AB testing will be more valuable to these companies, no matter which industry you want to enter as a data scientist and what kind of job you will be interviewing for. Even if you believe more technical data science is your cup of tea, be prepared to have at least a higher-level understanding, and knowing the details behind this method will definitely help when you are speaking with product owners, stakeholders, product scientists, and other people involved in the business. Let's briefly discuss the perfect audience for this section of the course, and the prerequisites. There are no prerequisites for this section in terms of AB testing concepts that you should already know, but knowing the basics of statistics, which you can find in the fundamentals to statistics section, is highly recommended. This section will be great if you have no prior AB testing knowledge and you want to identify and learn the essential AB testing concepts from scratch, so this will help you
to prepare for your job interviews. It will also be a good refresher for anyone who does have AB testing knowledge but wants to refresh their memory or fill the gaps in their knowledge. In this lecture we will start off the topic of AB testing, where we will formally define what AB testing is, and we will look at a high-level overview of the AB testing process, step by step. By definition, AB testing, or split testing, originated from statistical randomized controlled trials and is one of the most popular ways for businesses to test new UX features, new versions of a product, or a new algorithm, in order to decide whether the business should launch that new UX feature, production-launch that new recommender system, or create that new product, that new button, or that new algorithm. The idea behind AB testing is that you show the variant, the new version of the product, to a sample of customers, often referred to as the experimental group, and the existing version of the product to another sample of customers, referred to as the control group. Then the difference in the product's performance in the experimental versus the control group is tracked, to identify the effect of this new version on the performance of the product. The goal is then to track the metric during the test period and find out whether there is a difference in the performance of the product, and what type of difference it is. The motivation behind this test is to find new product variants that will improve the performance of the existing product and make it more successful and optimal, showing a positive treatment effect. What makes this kind of testing great is that businesses get direct feedback from their actual users, by presenting them the existing versus the variant product versions; in this way they can quickly test new ideas. In case the AB test shows that the variant version is not effective, at least businesses can learn from this and decide whether they need to improve it or look for other ideas. Let us go through the steps included in the AB testing process, which will give you a high-level overview of the process. The first step in conducting AB testing is stating the hypothesis of the AB test. This is
the process that includes coming up with the business and statistical hypotheses that you would like to test, including how you measure success, which we will call the primary metric. The next step in AB testing is to perform what we call the power analysis and design the entire test, which includes making assumptions about the most important parameters of the test and calculating the minimum sample size required to claim statistical significance. The third step in AB testing is to run the actual AB test, which in a practical sense means, for the data scientist, making sure that the test runs smoothly and correctly, and collaborating with engineers and product managers to ensure that all the requirements are satisfied. This also includes collecting the data of the control and experimental groups, which will be used in the next step. The next step in AB testing is choosing the right statistical test, whether it is a Z-test, a t-test, a chi-squared test, etc., to test the hypothesis from step one, using the data collected in the previous step, and to determine whether there is a statistically significant difference between the control and experimental groups. The fifth and final step in AB testing is to continue analyzing the results and find out whether, besides statistical significance, there is also practical significance. In this step we use the second step's power analysis, so the assumptions that we made about the model parameters and the sample size, together with the fourth step's results, to determine whether there is practical significance besides the statistical significance. This summarizes the AB testing process at a high level. In the next couple of lectures we will go through the steps one at a time, so buckle up and let's learn about AB testing. In this lecture, lecture number two,
we will discuss the first step in the AB testing process, so let's bring our diagram back. As you can recall from the previous lecture, when we were discussing the entire AB testing process at a high level, we saw that the first step in conducting AB testing is stating the hypothesis of the AB test. This process includes coming up with the business and statistical hypotheses that you would like to test, including how you measure success, which we call the primary metric: what is the metric that we can use to say that the product we are testing performs well? First, we need to state the business hypothesis for our AB test from a business perspective. Formally, the business hypothesis describes what the two products being compared are and what the desired impact or difference for the business is: how to fix a potential issue in the product, where the solution to this problem will influence what we call the key performance indicator, or KPI, of interest. The business hypothesis is usually set as a result of brainstorming and collaboration between the relevant people on the product team and the data science team. The idea behind this hypothesis process is to decide how to fix a potential issue in the product, where the solution will improve the target KPI. One example of a business hypothesis is that changing the color of the learn more button, for instance to green, will increase the engagement of the web page. Next, we need to select what we call the primary metric for our AB test. There should be only one primary metric in your AB test, and choosing this metric is one of the most important parts of the test, since this
metric will be used to measure the performance of the product or feature for the experimental and control groups, and will then be used to identify whether there is a difference, or what we call a statistically significant difference, between these two groups. By definition, the primary metric is a way to measure the performance of the product being tested in the AB test for the experimental and control groups; it will be used to identify whether there is a statistically significant difference between these two groups. The choice of the success metric depends on the underlying hypothesis that is being tested with the AB test. This is, if not the most, then one of the most important parts of the AB test, because it determines how the test will be designed and also how the proposed ideas will be judged to perform. Choosing poor metrics might disqualify a large amount of work or might result in wrong conclusions. For instance, revenue is not always the end goal; therefore, in AB testing we need to tie the primary metric to the direct and higher-level goals of the product. The expectation may be that if the product makes more money, then this suggests the content is great, but in achieving that goal, instead of improving the overall content of the material and the writing, one can just optimize the conversion funnel. One way to test the suitability of the metric you have chosen as your primary metric is to go back to the exact problem you want to solve. You can ask yourself the following question, which I tend to call the metric validity question: if this chosen metric were to increase significantly while everything else stays constant, would we achieve our goal and would we address our business problem? Is it higher revenue, higher customer engagement, or higher views that we are chasing in the business? The choice of the metric will then answer this question. Though you need to have a single primary metric for your AB test, you still need to keep an eye on the remaining metrics, to make sure that all the metrics are showing a change and not only the target one. Having multiple primary metrics in your AB test can lead to false positives, since you will identify many significant differences while there is no effect, which is
something you want to avoid; so it's always a good idea to pick just a single primary metric but to keep an eye on and monitor all the remaining metrics. So, if the answer to the metric validity question is higher revenue, which means that you are saying that higher revenue is what you are chasing and better performance means higher revenue for your product, then you can use as your primary metric what we call the conversion rate. The conversion rate is a metric that is used to measure the effectiveness of a website, a product, or a marketing campaign. It is typically used to determine the percentage of visitors or customers who take a desired action, such as making a purchase, filling out a form, or signing up for a service. The formula for the conversion rate is: conversion rate = number of conversions / number of total visitors × 100%. For example, if a website has 1,000 visitors and 50 of them make a purchase, the conversion rate would be equal to 50 / 1,000 × 100%, which gives us 5%. This means that our conversion rate in this case is equal to 5%.
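As a quick illustration, here is a minimal Python sketch of this calculation (the function name is my own, purely illustrative):

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Percentage of visitors who take the desired action."""
    return conversions / visitors * 100

# the worked example from above: 50 purchases out of 1,000 visitors
print(conversion_rate(50, 1_000))  # -> 5.0 (%)
```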
The conversion rate is an important metric because it allows businesses to measure the effectiveness of their website, product, or marketing campaign. It can help businesses identify areas for improvement, such as increasing the number of conversions or improving the user experience. The conversion rate can be used for different purposes: for example, if a company wants to measure the effectiveness of an online store, the conversion rate would be the percentage of visitors who make a purchase; on the other hand, if a company wants to measure the effectiveness of a landing page, the conversion rate would be the percentage of visitors who fill out a form or sign up for a service. If instead the answer to the metric validity question is higher engagement, then you can use the click-through rate, or CTR, as your primary metric. This is, by the way, a common metric used in AB testing whenever we are dealing with an e-commerce product, a search engine, or a recommender system. The click-through rate, or CTR, is a metric that measures the effectiveness of a digital marketing campaign or the user engagement with some feature on your web page or website, and it's typically used to
determine the percentage of users who click on a specific link, button, or call to action (CTA) out of the total number of users who view it. The formula for the click-through rate can be represented as follows: CTR = number of clicks / number of impressions × 100%. This is not to be confused with the click-through probability, because there is a difference between the click-through rate and the click-through probability. For example, if an online advertisement receives 1,000 impressions, which means that we are showing it to the customers a thousand times, and there were 25 clicks, which means 25 out of all these impressions resulted in clicks, then the click-through rate for this specific example would be equal to 25 / 1,000 × 100%, which gives us 2.5%. This means that for this particular example our click-through rate is equal to 2.5%.
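And the same kind of sketch for the click-through rate (again, the function name is illustrative):

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Percentage of impressions that resulted in a click."""
    return clicks / impressions * 100

# the worked example from above: 25 clicks out of 1,000 impressions
print(click_through_rate(25, 1_000))  # -> 2.5 (%)
```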
The click-through rate is an important metric because it allows businesses to measure the effectiveness of their digital marketing campaigns and the user engagement with their website or web pages. A high click-through rate indicates that a campaign, web page, or feature is relevant and appealing to the target audience, because they are clicking on it, while a low click-through rate indicates that the campaign or the web page needs improvement. The click-through rate can be used to measure the performance of different digital marketing channels, such as paid search, display advertising, email marketing, and social media. It can also be used to measure the performance of different ad formats, such as text advertisements, banner advertisements, video advertisements, etc. Next, as the final task in this first step of the AB testing process, we need to state the statistical hypothesis, based on the business hypothesis we stated and the chosen primary metric. In the fundamentals to statistics section of this course, in lecture number seven, we went into detail about statistical hypothesis testing, including what the null hypothesis is and what the alternative hypothesis is, so do have a look to get all the insights about this topic. AB testing should always be based on a hypothesis that needs to be tested.
This hypothesis is usually set as a result of brainstorming and collaboration between the relevant people on the product team and the data science team. The idea behind the hypothesis is to decide how to fix a potential issue in a product, where the solution will influence the key performance indicator, or the KPI of interest. It's also highly important to prioritize among the range of product problems and ideas to test, picking the ones where fixing the problem would result in the biggest impact for the product. We put the hypothesis that is subject to rejection, the one we want to reject in the ideal world, under the null hypothesis, which we denote by H0, while we put the hypothesis subject to acceptance, the desired hypothesis that we would like to have as a result of the AB test, under the alternative hypothesis, denoted by H1. For example, if the KPI of the product is to increase customer engagement by changing the color of the learn more button from blue to green, then under the null hypothesis we can state that the click-through rate of the learn more button with the blue color is equal to the click-through rate of the green button, and under the alternative we can state that the click-through rate of the learn more button with the green color is larger than the click-through rate of the blue button. So ideally we want to reject this null hypothesis and accept the alternative hypothesis, which would mean that we can improve the click-through rate, so the engagement of our product, by simply changing the color of the button from blue to green.
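Written out formally, using CTR as shorthand for the click-through rate of each button, the hypotheses of this example would look roughly like this:

```latex
H_0 : \text{CTR}_{\text{green}} = \text{CTR}_{\text{blue}}
H_1 : \text{CTR}_{\text{green}} > \text{CTR}_{\text{blue}}
```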
Once we have set up the business hypothesis, selected the primary metric, and stated the statistical hypothesis, we are ready to proceed to the next stage of the AB testing process. In this lecture we will discuss the second step in the AB testing process, which is designing the AB test, including the power analysis and calculating the minimum sample sizes for the control and experimental groups. Stay tuned, as this is a very important part of the AB testing process that commonly comes up during data science interviews. Some argue that AB testing is an art, and others say that it's a business-adjusted common statistical test, but the bottom line is that to properly design this experiment you need to be disciplined and intentional, while keeping in mind that it's not really about testing, it's about learning. Following are the steps you need to take to have a solid design for your AB test, so let's bring the diagram back. In this step we need to perform the power analysis for our AB test and calculate the minimum sample size in order to design the test. AB test design includes three steps: the first step is the power analysis, which includes making assumptions about the model parameters, including the power of the test, the significance level,
etc. The second step is to use these parameters from the power analysis to calculate the minimum sample size for the control and experimental groups, and the final, third step is to decide on the test duration, which depends on several factors. So let's discuss each of these topics one by one. The power analysis for AB testing includes three specific steps. The first one is determining the power of the test; this is our first parameter. The power of a statistical test is the probability of correctly rejecting the null hypothesis: power is the probability of making the correct decision, so of rejecting the null hypothesis when the null hypothesis is false. If you're wondering what the power of the test is, what these different concepts we just talked about are, what the null hypothesis is, and what it means to reject the null hypothesis, then head to the fundamentals to statistics section of this course, as we discuss these topics in detail as part of that section. The power is defined as 1 minus beta, which is equal to the probability of not making a type II error, where the type II error is the probability of not rejecting the null hypothesis while the null is actually false. It's common practice to pick 80% as the power of the AB test, which means that we allow a 20% type II error rate: we are fine with failing to reject the null, so failing to detect a true treatment effect while there is one, 20% of the time. However, the choice of the value of this parameter depends on the nature of the test and the business constraints. Secondly, we
need to determine a significance level for our AB test. The significance level, which is also the probability of a type I error, is the likelihood of rejecting the null, hence detecting a treatment effect, while the null is actually true and there is no statistically significant impact. This value, often denoted by the Greek letter alpha, is the probability of making a false discovery, often referred to as the false positive rate. Generally, we use a significance level of 5%, which indicates that we accept a 5% risk of concluding that there exists a statistically significant difference between the experimental and control variants' performances when there is no actual difference: we are fine with five out of 100 cases detecting a treatment effect while there is no effect. It also means that if you find a significant difference between the control and the experimental groups, you have it with 95% confidence. As in the case of the power of the test, the choice of alpha depends on the nature of the test and the business constraints that you have. For instance, if running the AB test comes with high engineering costs, then the business might decide to pick a higher alpha, such that it would be easier to detect a treatment effect; on the other hand, if the implementation costs of the proposed version in production are high, you can pick a lower significance level, since this proposed feature should really have a big impact to justify the high implementation cost, so it should be harder to reject the null hypothesis.
Finally, the last parameter that we need to make assumptions about as part of the power analysis is what we call the minimum detectable effect, or delta, from the business point of view. So what is the substantive counterpart to the statistical significance, the minimum impact of the new version that the business wants to see to find this variant investment-worthy? The answer to this question is the amount of change we aim to observe in the new version's metric compared to the existing one, in order to recommend to the business that this feature should be launched in production, that it's investment-worthy. An estimate of this parameter is what is known as the minimum detectable effect, often denoted by the Greek letter delta, which is also related to the practical significance of the test. So this MDE, the minimum detectable effect, is a proxy for the smallest effect that would matter in practice for the business, and it's usually set by the stakeholders. As this parameter is highly dependent on the business, there is no common level for it; the minimum detectable effect is basically the translation from statistical significance to practical significance, and here we want to answer the question: what is the percentage increase in the performance of the product we are experimenting with that will tell the business that this is good enough to invest in this new feature or new product? This can be, for instance, 1% for one product and 5% for another; it really depends on the business and the underlying KPI. A popular way to refer to the parameters involved in the power analysis for AB testing is this: 1 minus beta for the power of the test, alpha for the significance level, and delta for the minimum detectable effect.
To make sure that our results are repeatable, robust, and can be generalized to the entire population, we need to avoid p-hacking, to ensure real statistical significance and to avoid biased results. So we want to make sure that we collect a large enough number of observations and that we run the test for a minimum, predetermined amount of time. Therefore, before running the test, we need to determine the sample sizes of the control and experimental groups, and, as we will see later in this lecture, how long we need to run the test. This is another important part of AB testing, which needs to be done using the defined power of the test (the 1 minus beta), the significance level, and the minimum detectable effect, so all the parameters that we decided upon when conducting the power analysis. The calculation of the sample size also depends on the underlying primary metric that you have chosen for tracking the performance of the control and experimental versions of the product, so we need to distinguish two cases here. When discussing the primary metric, we saw that there are different ways that we can measure the performance
of different types of products; if we are interested in engagement, then we are looking at a metric such as the click-through rate, which is in the form of averages. Case one is where the primary metric of the AB test is in the form of a binary variable, for instance conversion or no conversion, click or no click; and case two is where the primary metric of the test is in the form of proportions or averages, which means something like the mean order amount or the mean click-through rate. Today we will be covering only one of these cases, but you can find more details on the other case in my blog, which I have also posted in the resources section. This blog post contains all the details that you need to know about AB testing, including the statistical tests and their corresponding hypotheses, the descriptions of different primary metrics that go beyond what we have covered as part of this section, as well as many more details that you need to know about AB testing. So let's look at case two, where the primary metric of the test is in the form of proportions or averages.
Let's say we want to test whether the average click-through rate of the control group is equal to the average click-through rate of the experimental group. Under H0 we have that mu_control is equal to mu_experimental, and under H1 we have that mu_control is not equal to mu_experimental, where mu_control and mu_experimental are simply the averages of the primary metric for the control group and the experimental group, respectively. This is the formal statistical hypothesis we want to test with our AB test, and we can assume that mu_control is, for instance, the click-through rate of the control group and mu_experimental is the click-through rate of the experimental group. If you haven't done so, I would highly suggest you head to the fundamentals to statistics section of this course, where in lectures number seven and eight of the statistical part I go into detail about statistical hypothesis testing, means and averages, the significance level, etc. The same holds for the theorem that the sample size calculation is based upon, called the central limit theorem: check out the last lecture about inferential statistics, where I covered the central limit theorem, which we will also use in this section, and finally also check lecture number five in that section, where we cover the normal distribution, another thing that we will use here. The central limit theorem states that, given a sufficiently large sample size from an arbitrary distribution, the sample mean will be approximately normally distributed, regardless of the shape of the original population distribution. This means that the distribution of the sample means will be approximately normal if we take a large enough sample, even if the distribution of the original sample is not normal. So, when we are dealing with a primary performance-tracking metric that is in the form of an average, such as the click-through rate we are covering today, and we intend to compare the means of the control and experimental groups, we can use the central limit theorem to state that the sampling distributions of the means of both the control and experimental groups follow a normal distribution. Consequently, the sampling distribution of the difference of
the means of these two groups will also be normally distributed. This can be expressed as follows: the sample mean of the control group follows a normal distribution with mean mu_control and variance sigma_control² / n_control, and the sample mean of the experimental group follows a normal distribution with mean mu_experimental and variance sigma_experimental² / n_experimental, where n_control and n_experimental are the sample sizes of the control and experimental groups. Though the derivation of this result is out of the scope of this course, we can state that the difference between the means of the two groups, so X̄_control minus X̄_experimental, also follows a normal distribution, with mean mu_control minus mu_experimental and variance sigma_control² / n_control + sigma_experimental² / n_experimental. Hence, the sample size needed to compare the means of two normally distributed samples using a two-sided test, with prespecified significance level alpha, power level, and minimum detectable effect, can be calculated as follows. Here you can see the mathematical representation of the minimum sample size: n, which stands for the minimum sample size per group, is equal to (sigma_control² + sigma_experimental²) × (z_(1−alpha/2) + z_(1−beta))², all divided by delta². Here alpha, beta, and delta are the values we made assumptions about as part of the power analysis, and sigma_control² and sigma_experimental² are the estimates of the variances, which we can come up with using so-called A/A testing.
I would say you do not necessarily need to know this derivation, as there are many online calculators that will ask you for the alpha, beta, and delta values, as well as the sample estimates for sigma² of the control and experimental groups, and these calculators will then automatically calculate the minimum sample size for you. If you're wondering what this A/A testing is, and how we can come up with sigma_control² and sigma_experimental², as well as all the other values, then make sure to check out the blog that I mentioned before, as I explain all these values in detail there, and also check out the resources section, where I've included many resources on this. For now, just keep in mind that z_(1−alpha/2) and z_(1−beta) are just two constants that come from the standard normal distribution tables, and the calculators will compute the minimum sample size for the control and experimental groups for you.
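As a rough sketch of how you could compute this yourself in Python instead of using an online calculator — all parameter values below are hypothetical assumptions, and the variance estimates would normally come from an A/A test:

```python
from scipy.stats import norm

# hypothetical power-analysis assumptions
alpha = 0.05   # significance level (type I error probability)
beta = 0.20    # type II error probability, so power = 1 - beta = 80%
delta = 0.03   # minimum detectable effect

# hypothetical variance estimates, e.g. from an A/A test;
# for a binary click metric around 20%, p(1-p) = 0.2 * 0.8 = 0.16
sigma_sq_con = 0.16
sigma_sq_exp = 0.16

z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96, from the standard normal table
z_beta = norm.ppf(1 - beta)        # ~0.84

# minimum sample size per group for the two-sided test
n = (sigma_sq_con + sigma_sq_exp) * (z_alpha + z_beta) ** 2 / delta ** 2
print(round(n))  # ~2,791 observations per group under these assumptions
```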
One example of such a calculator is the AB Tasty online calculator, but if you Google it you will find many others that will ask you for the minimum detectable effect, the statistical significance, and the statistical power, and will then automatically calculate the minimum sample size that you need in order to claim statistical significance and to have a valid AB test. One thing to keep in mind: you will notice that the statistical significance level is set to 95% here, which is not what we saw when we were discussing the alpha significance level. Sometimes these online calculators confuse, or use interchangeably, the significance level and the confidence level, which are complements of each other: the significance level is usually at the level of 5% or 1%, while the confidence level is around 95%, which is basically 100% minus alpha. Therefore, whenever you see this 95%, know that it means your alpha should be 5%. It's really important to understand how to use such a calculator, so that you don't end up with the wrong minimum sample size, conduct an entire AB test,
and then at the end realize that you have used the wrong significance level. The final step is to calculate the test duration. This question needs to be answered before you run your experiment, not during the experiment: sometimes people stop the test the moment they detect statistical significance, which is what we call p-hacking, and that's absolutely not what you want to do. To determine a baseline for the duration, a common approach is to use this formula: duration is equal to n divided by the number of visitors per day, where n is the minimum sample size that we just calculated in the previous step and the number of visitors per day is the average number of visitors that you expect to see as part of your experiment. For instance, if this formula results in 14, this suggests that running the test for two weeks is a good idea. However, it's highly important to take many business-specific aspects into account when choosing when to run the test and for how long; simply using this formula alone is not enough.
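Continuing the sketch above, the baseline duration estimate would then be (the traffic figure is again a made-up assumption):

```python
# hypothetical average daily traffic for the experiment
visitors_per_day = 400

# baseline test duration: minimum sample size divided by daily visitors
duration_days = n / visitors_per_day
print(duration_days)  # ~7 days for n ≈ 2,791; business context may still override this
```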
For example, suppose you want to run an experiment at the end of the month of December, with the Christmas breaks, when a higher than expected, or in some cases lower than expected, number of people are usually checking your web page. Depending on the nature of your business or the product, this external and uncertain event can have an impact on the page usage: for some businesses this can mean a big increase in page usage, and for some a huge decrease. In this case, running the AB test without taking this external factor into account would result in inaccurate results, since the activity period would not be a true representation of common page usage, and we would no longer have the randomness which is a crucial part of AB testing.
Besides this, when selecting a specific test duration, there are a few other things to be aware of. Firstly, too short a test duration might result in what we call novelty effects: users tend to react quickly and positively to all types of changes, independent of their nature. This is referred to as the novelty effect; it wears off in time and is considered illusory, so it would be wrong to attribute this effect to the experimental version itself and to expect that it will continue to persist after the novelty effect wears off. Hence, when picking a test duration, we need to make sure that we do not run the test for too short a time period; otherwise we can end up with a novelty effect. The novelty effect can be a major threat to the external validity of an AB test, so it's important to avoid it as much as possible. Secondly, if the test duration is too long, then we can get what we call maturation effects. When planning an AB test, it's usually useful to consider a longer test duration to allow users to get used to the new feature or product; in this way one will be able to observe the real treatment effect, by giving returning users more time to cool down from an initial positive reaction or spike of interest due to the change that was introduced as part of the treatment. This should help avoid the novelty effect and gives better predictive value for the test outcome. However, the longer the test period, the larger is the likelihood of external effects impacting the reactions of the users and possibly contaminating the test results. This is what we call the maturation effect, and therefore running the AB test for too short or too long an amount of time is not recommended. As this is a very involved topic, we could talk for hours about this part of the AB test, and it's also a topic that is asked about a lot during data science and product scientist interviews. Therefore I highly suggest you check out this book about AB testing, which is a hands-on tutorial about everything you need to know about AB testing, and also check out the interview preparation guide in this section, which contains the 30 most popular AB testing related questions you can expect during your data science interviews. So stay tuned, and in the next couple of lectures we will cover the next stages of the AB testing process.
If you are looking for one place to learn everything about AB testing without unnecessary difficulties, but also with a good statistical and data science background, then make sure to check out the AB testing course at lunartech.ai. So if you want to learn all this background information, including what statistical significance is, what AB testing is, and how AB testing can be done, and you want an end-to-end AB testing course, then make sure to check the AB testing for data science course at lunartech.ai; it's the only course available on the internet at the moment that covers the most fundamental concepts of AB testing, including the theory and the implementation in Python, without the unneeded extra details, going straight to the point in order to help you kick-start your journey with AB testing. The resource that I would suggest you keep at hand is the blog called Complete Guide to A/B Testing Design, Implementation and Pitfalls, which is part of the hands-on tutorials of Towards Data Science. In there, specifically the part where we discuss the two-sample Z test, I would suggest you go through it, as we are going to conduct this two-sample Z test as part of our Python demo and we are going to learn how to implement it in Python. In this blog you can learn everything out there that you need to know about AB testing, including the different pitfalls of AB testing, the process behind it, how you can conduct an AB test end to end, how you can calculate the sample size, how you can choose the primary metric, the metric definitions, and the different statistical tests that you can use, including the chi-squared test, the two-sample Z test, and the two-sample t-test. Given that as part of the AB testing lectures, specifically lecture number five, we have already discussed the two-sample t-test and how to implement it, I thought it would be more useful for you to learn how to implement the two-sample Z test, such that you know both of them, you know the theory behind them, and also how to implement them.
Finally, if you're wondering how you can implement them in Python, head to my blog on Medium as well as my GitHub repository, which I will post in the resources section, where you can find all the different statistical tests you can use for analyzing your AB test results, including the two-sample t-test, the two-sample Z test, the chi-squared test, and much more. So without further ado, let's get started with our demo. As you can see here, I'm generating the data myself, assuming that the primary metric follows a binomial distribution, so the output is in the form of zeros and ones, because we are looking at the click event, and a click can be either zero or one. I'm using the binomial distribution to randomly sample from it; in the case of the experimental version I'm using a probability of success equal to 0.5, and in the case of the control version I'm using a probability of success equal to 0.2, because I want to have quite a large difference between the two groups. Later on we can also adjust this and change the difference, to see how our test behaves.
I'll assume that at the end of the data generation process we have data similar in form to what you would get from the engineers once they finish collecting all the data from your customers. I will also assume that the integrity of the AB test is held, which means that the observations who were in the control group only saw the control version of the product, and the observations who were in the experimental group only saw the experimental version of the product. Let's actually go ahead and see what the data looks like. As you can see here, we are generating our data, and the data is in this format: in total we have 20K observations, because we have two different groups, each with 10K observations. The first column describes the click event, so we will either have a click or no click; the primary metric is in the form of a click, and we are measuring the performance of the product, both control and experimental, with the same metric, which is whether there is a click event or not. The primary metric is in the form of a binary variable, so we have either zeros or ones: whenever there is a click, the corresponding value is one, and whenever there is no click, the corresponding value is zero. Then we have the corresponding group column, which helps us understand whether the observation belongs to the experimental group, "exp", or the control group, "con". So this is how the data looks.
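Since the notebook itself isn't reproduced in this text, here is a minimal sketch of what this data-generation step could look like; the seed and the variable names are my own, hypothetical choices:

```python
import numpy as np
import pandas as pd

np.random.seed(42)             # hypothetical seed, just for reproducibility
N_con, N_exp = 10_000, 10_000  # 10K observations per group, 20K in total

# Bernoulli click events: success probability 0.2 for control, 0.5 for experimental
clicks_con = np.random.binomial(1, 0.2, size=N_con)
clicks_exp = np.random.binomial(1, 0.5, size=N_exp)

df = pd.DataFrame({
    "click": np.concatenate([clicks_con, clicks_exp]),
    "group": ["con"] * N_con + ["exp"] * N_exp,
})
print(df.sample(5))  # a few random rows: one click column, one group column
```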
It's also what you can expect from the data engineers once the AB test is conducted: you have run your AB test and the engineers have collected the data, assuming that the data integrity has been kept and that there was no systematic error when collecting and measuring the performance of the control and experimental versions of the product. The first thing that we are going to do is estimate the p-hat of the control group and the p-hat of the experimental group, and for that we first need to count the number of clicks per group. We saw earlier that we have this data that we generated ourselves, consisting of 20K rows, where 10K belong to the control group and 10K belong to the experimental group, and each row consists of the click variable and the group. The click variable is an indicator that says whether the observation clicked on the page or did not click on it: whenever there was a click we have a one here, and whenever there was no click we have a zero. Then we have the corresponding group, which we can use to split this data into the control versus the experimental group, and that's exactly what we are going to do as the first step in our process. We are going to calculate the total number of clicks for the control group and for the experimental group. Here we are making use of the groupby function in order to group this data frame based on the group column, and then we want to get the click variable and sum it. Because the variable is of a binary nature, so we have ones and zeros, if we take the sum we are basically counting the number of times the observation's click is equal to one: by summing a binary variable we are simply getting the number of ones in that variable, and that's exactly what we are doing in this part. Then what remains is to get the number of clicks of the control group and the number of clicks of the experimental group, by using the function called loc.
We saw earlier, when we were discussing accessing observations in a pandas data frame, that there is a difference between iloc and loc, and the reason why we are using loc here is that the grouped data we are getting provides an output where the index is in the format of a string. Let's actually go ahead and print that part, because I think it's important to see how the data looks, and it will also make clear why I'm using the loc function to access the control group's number of clicks and the experimental group's number of clicks. This is the grouped data frame that we are getting: as you can see, the group labels form the index, and for the control index the number of clicks is equal to 1,924, while for the experimental group it's equal to 5,017. The next thing we need to do is actually access these values, and for that we need to specify that we want the value corresponding to the index equal to control, which can be done using the loc function; you cannot use iloc or other positional ways of accessing this, because the index is of string type, and therefore we are using loc. Let's also add some print statements to make our code more readable; this will print the number of clicks for the control group and for the experimental group. Here we go: as you can see, we are nicely accessing the correct values.
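In code, the counting and the loc access described here could look roughly like this, continuing the hypothetical data frame from the earlier sketch:

```python
# total clicks per group: summing the binary click column counts the ones
clicks_per_group = df.groupby("group")["click"].sum()

# the resulting index is string-typed ("con"/"exp"), hence .loc rather than .iloc
X_con = clicks_per_group.loc["con"]
X_exp = clicks_per_group.loc["exp"]
print(f"Number of clicks in control: {X_con}")
print(f"Number of clicks in experimental: {X_exp}")
```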
Then the next step is to calculate the p-hat of the control group and the p-hat of the experimental group, so basically the estimates of the click probabilities of the control group and the experimental group, respectively. For that we just need to take the number of clicks and divide it by the number of observations for that group. Let's go ahead and calculate those values: as you can see, I'm taking the number of clicks that we just obtained and dividing it by the number of observations that we defined at the very beginning. Here we go: for the control group the click probability is equal to 0.20, and in the case of the experimental group it is equal to 0.5. So we see that there is a large difference between the click probabilities of these two groups, which is a reflection of what we saw before, because we generated the data such that the success probability for the experimental group is equal to 0.5 and for the control group is equal to 0.2; we see these numbers reflected here. The reason for that is that we sampled a large enough amount of data, so the mean of our sample converges in probability to the mean that we used; this is the idea behind the law of large numbers, something that we also discussed as part of the fundamentals to statistics section of this course.
The next thing we need to do is compute the pooled p-hat, the estimate of the pooled success probability, and we saw when we were discussing the theory behind it that it's equal to the sum of the clicks of both the control and experimental groups divided by the total number of observations in both groups: so p_pooled_hat is equal to (X_control + X_experimental) / (N_control + N_experimental). Then we need to compute the pooled variance, and we just saw that the pooled variance can be calculated by taking the pooled estimate of the click probability, p_pooled_hat, multiplying it by 1 minus p_pooled_hat, and then multiplying by the sum of the inverses of the number of observations in each of the groups: so the pooled variance is equal to p_pooled_hat × (1 − p_pooled_hat) × (1/N_control + 1/N_experimental). Let's also add some print statements; here we go. The next step is to calculate the standard error: the standard error is the square root of the pooled variance, so quite straightforward, and here we are going to make use of the numpy library. The SE is equal to np.sqrt of the pooled variance, where sqrt stands for square root. Let's also add a print statement explaining the code; this really helps the code reviewer understand what you are doing. OK, so now we also have the standard error, and we are ready to calculate our test statistic. We saw that the test statistic is equal to the p-hat of the control group minus the p-hat of the experimental group, divided by the standard error, and that's exactly what we implement here.
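Putting the narrated calculations together, a sketch of this part of the analysis might look as follows (variable names continue from the earlier sketches):

```python
import numpy as np

# estimated click probabilities per group
p_con_hat = X_con / N_con
p_exp_hat = X_exp / N_exp

# pooled click probability: total clicks over total observations
p_pooled_hat = (X_con + X_exp) / (N_con + N_exp)

# pooled variance of the difference in proportions
pooled_variance = p_pooled_hat * (1 - p_pooled_hat) * (1 / N_con + 1 / N_exp)

# standard error is the square root of the pooled variance
SE = np.sqrt(pooled_variance)

# two-sample Z test statistic
test_stat = (p_con_hat - p_exp_hat) / SE
print(f"Test statistic: {test_stat:.3f}")
```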
Finally, what we need to do is compute the Z critical value, the p-value, and a confidence interval, but for doing that we need to assume a significance level. Usually this is decided before conducting the test; here I'm assuming that before conducting the test there was a power analysis, and as part of that we decided that the statistical significance level is equal to 5%. So let's add that here: alpha is equal to 0.05, and we are going to use this specific alpha of 5% in order to calculate our critical value coming from the normal table. To do this there are various options. One way is to hard-code the value, which I would not recommend, but it is definitely an easy way to go if you haven't used the Python libraries to automate this process. Here, however, I will provide you the code and show you how to use scipy's norm function in order to calculate the critical value. I think keeping the code as general as possible will help you in the long term too, because this time you may be calculating the critical value corresponding to alpha equal to 0.05, but maybe next time you want to calculate the critical value when your alpha is equal to 1%, so when you're interested in the case where your type I error probability is equal to 1%. For those cases you want to keep your code as general as possible, such that by changing a variable, let's say alpha, you don't need to go and look up the corresponding value from the standard normal table each time. For this we are going to use the norm function, which comes from the scipy.stats library; we need to import the norm function, which stands for the normal distribution, from scipy.stats. What we need to use here is the
function called ppf, the percent point function. The norm.ppf function stands for the percent point function, and it's usually known as the inverse cumulative distribution function, the inverse CDF, of the standard normal distribution. It takes as input a probability value and returns the corresponding value on the x-axis of the CDF: once you provide a probability p, and here we are providing p equal to 1 − alpha/2, the function calculates the x such that the probability of observing a value less than or equal to x in a standard normal distribution is equal to p. So we have this inverse CDF, with the probabilities on the y-axis and the x values on the x-axis; what we are basically doing is providing the probability that we have, which is equal to 1 − alpha/2, and asking for the corresponding x value, which is why it's also called the inverse cumulative distribution function. In this way we can calculate the critical value, which helps us identify where our rejection region needs to be. Here is the rejection region of this test: as you can see, we have a two-sided test, and therefore we also have two regions. Whenever the test statistic is larger than the critical value on the right-hand side, or smaller than the negative critical value on the left-hand side, we say that we can reject the null hypothesis; therefore these are called the rejection regions.
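A minimal sketch of the critical-value calculation described here:

```python
from scipy.stats import norm

alpha = 0.05  # significance level assumed from the power analysis

# two-sided test: inverse CDF (percent point function) of the standard normal
Z_crit = norm.ppf(1 - alpha / 2)  # ~1.96: the boundary of the rejection regions
print(f"Z critical value: {Z_crit:.3f}")
```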
Once we calculate this critical value, we are ready to go to the next step, but before that let's also add a print statement for readability. The next step is to calculate the p-value, and the p-value can be calculated by using the norm.sf function. The norm function comes once again from the scipy.stats library, and SF stands for survival function: the norm.sf function is the survival function, the complement of the CDF, the cumulative distribution function, of the standard normal distribution, and it calculates the probability of observing a value greater than a given threshold. In this case we want to calculate the probability of observing a value at least as extreme as our test statistic, and since, as we saw, the standard normal distribution is symmetric, we multiply that one-sided probability by two in order to obtain our final value. Once we run this, we finally get our p-value, and as you can see here, the p-value of the two-sample Z test that we got is equal to zero.
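The p-value step, sketched in the same style:

```python
# survival function (1 - CDF) gives P(Z > |test statistic|);
# doubling it accounts for both tails of the symmetric two-sided test
p_value = 2 * norm.sf(abs(test_stat))
print(f"p-value of the two-sample Z test: {round(p_value, 3)}")
```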
Now that we have the p-value and we also know our alpha, we are ready to test for the statistical significance of our results. Given that our p-value is equal to zero, and it's smaller than 0.05, our alpha, we can state that the null hypothesis can be rejected, and we can say that there is a statistically significant difference between the experimental version of the product and the control version of the product. This is how we test for the statistical significance of our AB test. However, if we were to have different samples, say we randomly sample from the binomial distribution using the same probability of success for the two groups, then, as you can see, the p-value becomes large, at least much larger than the alpha, which means that we can no longer reject the null hypothesis, and we can no longer state that there is statistical evidence, at the 5% significance level, that the control version is statistically significantly different from the experimental version. This verifies that everything we have done here is correct, so the AB test results analysis is accurate. Now the question is whether we also have practical significance, once we pass the statistical significance test. So let's move this back to what we had before, so this is 0.5 again, and once again the p-value is just zero, and let's go ahead and calculate our confidence interval, such that we can test for the practical significance and comment on the accuracy and the generalizability of our AB test.
We saw that the confidence interval can be calculated as follows: we take the difference between the p-hat of the experimental group and the p-hat of the control group, and for the lower bound we subtract from this the standard error multiplied by the Z critical value, while for the upper bound we do the same, only adding the standard error multiplied by the Z critical value. The difference you might notice here is the round function, and the reason I'm adding it is that I want nice numbers, rounded to just three digits after the decimal point, instead of long floating-point numbers. Once we go ahead and print this confidence interval, we can see the lower bound and the upper bound in numbers. Here we go: as you can see, we are getting a confidence interval which is quite narrow. This is a suggestion that our AB test results are most likely accurate and that the precision of our AB test is high, and this is a good sign, because then we can say that the AB test we have conducted here is most likely generalizable to the entire population.
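And the confidence interval, rounded to three decimals as described:

```python
# confidence interval for the difference in click probabilities
diff = p_exp_hat - p_con_hat
CI = [round(diff - SE * Z_crit, 3), round(diff + SE * Z_crit, 3)]
print(f"Confidence interval of the two-sample Z test: {CI}")
```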
Then the next question is: do we have practical significance or not? For that we need a final assumption regarding the minimum detectable effect. Let's say that during the power analysis, before conducting our AB test, we got an MDE, or let's actually call it delta, keeping the Greek letters, equal to 3%, so 0.03. In this case we can notice that the delta of 0.03, so 3%, is much lower than the lower bound of our confidence interval, which is roughly 30%, namely 29.7%. This means that in this case we would say that there is practical significance as well. But if the delta had been, for instance, 0.31, so a 31% delta, then the delta would no longer be smaller than the lower bound of our confidence interval, and in that case we could not say that our results are also practically significant. So, depending on the business and the assumption regarding the delta, the minimum detectable effect, we can compare it to the lower bound of the confidence interval and state whether there is practical significance or not. In case there is practical significance, then we are good to go: we can say that we have statistical significance, we have practical significance, and we also have a narrow confidence interval, which is a suggestion that our results are also generalizable and accurate.
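Finally, a sketch of the practical-significance check against the assumed MDE:

```python
delta = 0.03  # hypothetical minimum detectable effect from the power analysis

# practically significant if the whole interval clears the MDE
lower_bound = CI[0]
if delta < lower_bound:
    print("The difference is practically significant as well")
else:
    print("Statistically significant, but not practically significant")
```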
This completes our AB test results analysis, and this is all that you need to do in order to have a valid and good-quality AB test. Looking to elevate your data science or data analytics portfolio? Then you are in the right place: with this end-to-end AB testing case study you can showcase your AB testing and coding skills in one place. I'm Tatev Aslanyan, data scientist and AI professional, and I'm the co-founder of LunarTech, where we are making data science and AI accessible to everyone: individuals, businesses, and institutions. In this case study we are going to complete an end-to-end case study with AB testing, where we are going to test, in a data-driven way, whether it's worth changing one of the features of our UX design on the LunarTech landing page. This is a real-life data science case study that you can conduct and put on your resume, in order to showcase your experience in data-driven decision making, where you will show your statistical skills, your experimentation skills with AB testing, and your coding skills in Python, using libraries such as statsmodels, but also pandas, NumPy, matplotlib, and seaborn. We are going to start with the business objective of this case study; then we are going to translate the business objective into our data science problem; then we are going to start with the actual coding: we are going to load the libraries, look into the data, and visualize the click data. We are going to look into the motivation behind choosing this specific primary metric, which is the click-through rate, and then we are going to talk about the statistical hypothesis for our AB test. I will also teach you, step by step, all the calculations, starting from the calculation of the pooled estimate of the click-through rate, then the computation of the pooled variance and the standard error, but also the motivation behind choosing the specific statistical test that I will be using, the two-sample Z test, and then how you can calculate the test statistic, how you can calculate the p-value of the test statistic, and how to use that together with the significance level to test the statistical significance of your AB test. After this we will also compute the confidence interval, comment on the generalizability of the AB test, and at the end we will also test for the practical significance of the AB test. Then we will conclude, wrap up, and make a decision, based on our data-driven approach using the AB test, on whether it's worth it to change a feature of our UX design on the LunarTech landing page. So without further ado, let's get started. Let's now start our case study: here, on the left-hand side, I have this version of our landing page, which is our control version, so to say the existing version, where you can see that we have Start Free Trial and our button, Secure Free Trial. On
On the right-hand side we have the new experimental version that we would like to introduce, which has the Enroll Now button. As we saw in the introduction, what we are trying to understand is whether our customers click more on the new version, the experimental version, than on the existing version, the control version. As of the day of conducting this case study, our landing page has a Secure Free Trial button, and what we want to test with our data is whether Enroll Now is more engaging, such that we can move from the Secure Free Trial version to the Enroll Now version. For this specific case, but also in general, as we know from A/B testing: whenever we have an existing algorithm, feature, or button, the group exposed to this existing version of the product is referred to as the control group. So all the users to whom we show the existing version of our landing page we will call the control group participants. On the right-hand side we have our experimental version and our experimental users: the existing customers who are selected to take part in our experimental group. In the experiment they will be exposed to the new version of our landing page, which contains the Enroll Now button. Our end goal, in business terms, as we saw in the introduction, is to understand whether we should release the new button and whether it will end up being more engaging.
Being more engaging means a higher CTR: more clicks coming from our users, which automatically means better business, because we want highly engaged users. If they are clicking on this button, it interests them more than the control version, and if something on our landing page, in this case our call to action, is more interesting and highly engaging, it means we are doing something right. Our users might make use of our free products, purchase our products, or simply stay engaged with us and keep LunarTech in mind; and whenever someone is interested in data science or AI solutions or products, they can at least refer their friends. If they are just clicking to understand and learn more about our products, that is also a possibility. From a business perspective we are therefore using the click-through rate, the CTR of this specific button, as our primary metric: in our control version that button is Secure Free Trial, and in our experimental version it is Enroll Now. What we want to
understand is whether this new button will end up having a higher CTR or not, because a higher CTR from the technical perspective translates to higher engagement from the business perspective; here we are making the translation between business and technical terms. When it comes to A/B testing we can have different sorts of primary metrics: a click-through rate, a conversion rate, or any other metric. To pick the single measure on which we will compare our control and experimental groups and understand which version performs better, we first need to understand what the definition of "better" is and how it translates back to the business. If engagement is what we consider better business for some reason (and I will explain in a bit why we think engagement is what matters for us at LunarTech in this case), then the click-through rate can be used as the primary metric. It is a universal metric that has been used across different web applications, search engines, recommender systems, and many other digital products to understand whether the engagement with a specific algorithm, feature, or web design is better or not. In this specific case study we are also going to use the CTR, because we are interested in engagement. At LunarTech we really care about engagement with our users: we want our users to make use of our products, but ultimately to engage with us, because if they engage with us it means our products are being seen, our
landing page is being visited, and the user is actually interested enough to click on that button, the action point, and then to start either a free trial or to enroll and see what is going on; all of these are signs of interest coming from the user's side. In the control version our call to action is to secure a free trial, which leads the user directly to the free trial of our Ultimate Data Science Bootcamp. But given that we are expanding, meaning we now offer more courses, we offer free products, and we also have enterprise clients, businesses as clients who want data science and AI solutions and corporate training, we want to move from this niche version of the landing page, Secure Free Trial, to Enroll Now. We already have a lot of engagement on the free trial, so we want to make the call to action more general; that is the business perspective. On the other hand, besides changing this main call to action and making it more generalized, we also want to see whether this generalized version will end up leading to higher engagement, not only for the other products but also for the free trial itself, because we are always looking to educate people and to provide this free trial so that they can make use of our flagship product, the Ultimate Data Science Bootcamp. So now that we understand why we care about engagement here at LunarTech, and why we want to check whether this new button in our
UX design will end up increasing engagement or not, we can make the translation back to data science terms. From the business perspective, all we care about is whether this experimental version of the product performs better or not, and this means we need to conduct an A/B test: we need to find out whether our idea, the speculation that the more general Enroll Now button as a call to action will be better than the Secure Free Trial version, is actually true from the customer's perspective. If we want to call ourselves a data-driven company, we cannot base our conclusions and decisions for our products, or for our product road map in general, on intuition or logic alone. We want them to be data-driven, which means that the customers come first: we are customer-driven, and our customers need to tell us whether the new button is better or not. So we have conducted an A/B test. Here I won't be using the real data; I will be using proxy, or simulated, data that I generated myself, with a similar structure and the same idea as the data we got when we conducted our A/B test and collected the click data. And what is our business hypothesis? Our business hypothesis says that we see at least a 10% increase in our click-through rate, so 10% higher engagement, with Enroll Now versus the Secure Free Trial version of the product. This is our business hypothesis,
which means that the CTR of the Enroll Now button will be at least 10% higher than that of Secure Free Trial: there exists at least a 10% difference in engagement when we compare the new version of the button to the old one. When we translate this back into a statistical hypothesis, we say that under the null hypothesis there is no statistically significant difference between p-control and p-experimental, that is, between the click-through probability (the click-through rate) of the control group and that of the experimental group. So under H0, the null hypothesis, we state what we ideally want to reject: there is no difference between the experimental and control group CTRs. Under the alternative hypothesis, H1, we say that we do have a difference, meaning the control group's CTR is different from the experimental group's CTR. One key point here is that they are not just different, but statistically significantly different.
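In compact notation, writing p for the click-through probability of each group, the two hypotheses described above are:

$$H_0: p_{\text{con}} = p_{\text{exp}}, \qquad H_1: p_{\text{con}} \neq p_{\text{exp}}$$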
When it comes to starting the case study, first things first: load the libraries. In this case study we are going to use NumPy and pandas; as usual, for any sort of data analytics or data science case study you will almost always need those two. pandas will be needed for our data wrangling, to load the data, process it, and inspect it, while NumPy will be used to work with arrays and parts of the data. Then from scipy.stats we will import the norm function; later on we will use this to visualize the rejection region for our test and to understand whether we need to reject our null hypothesis or not. In this case study we also want to visualize our results and our data, for which we are going to need the Python visualization libraries seaborn and matplotlib.
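A minimal sketch of the import cell, assuming these are the only libraries the notebook needs:

```python
# Core data libraries
import numpy as np
import pandas as pd

# Statistical functions: norm gives us the standard normal distribution
from scipy.stats import norm

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
```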
Let's look into our data. We have four different columns, and of course this is filtered data that contains only the information we need; in general you could have a larger database with more sorts of metrics. For conducting a pure A/B test you actually only need the following information. First, a user ID, to identify which user you are dealing with: user one, user two, user ten. You may refer to your users in other ways, for instance with long strings, but given that ours is a simple case study we just have a user ID, an integer running from one to the end of our data. Here we have 20,000 users in total, so the user ID goes up to 20,000, and those 20,000 users include both the experimental and control users. Then we have our click variable, a binary variable that can be either one or zero, where one means the user clicked on the button and zero means the user didn't click; this is our primary metric for the A/B test. Then we have the group reference, a string variable that tells us whether the user comes from the experimental group or the control group. It can contain only two different values, two strings: the three letters "exp", referring to the experimental group, and "con", referring to the control group. If you look here you can see "exp" for the experimental rows, and if we scroll down, because the data holds first the experimental and then the control rows, you can see the control group. We also have a timestamp, which is not relevant here, so we will skip it for now. Given that this data is not our actual data but a synthetic one, similar in
terms of its structure and the nature of its variables, you can implement exactly the same steps when you have your own data coming from your A/B test. What we are going to use the most here are the click variable and the group variable, because we want to find out, per group, which users clicked on the button. To be more specific, we are looking at averages: we are not so much interested in whether one specific user from one specific group clicked on the product or not (that is something we can explore later); for now we are interested in the higher-level view: the percentages, the click probability or click-through rate per group. And here we have experimental and control groups, as in any sort of A/B test. Once we have conducted our A/B test, I will also give you more insight into what you can do with your data, especially with the user ID, to learn more about the reasoning behind different decisions, or whether your A/B test results differ per segment. But the idea is that with this A/B test, by following all the steps and ensuring the pitfalls are avoided, we are making a decision that represents the entire population: we are using a sample that is large enough for us to make a decision for our product and our business that will generalize and will be representative when we apply this decision to our population.
So let me close this part, because we no longer need it, and let's go ahead and load the data. Here I'm using the pandas library with the common abbreviation pd, and I'm calling pd.read_csv with the name of the file that contains my click data, ab_test_click_data.csv. I will be providing you this data: you will have the link to this Google Colab, and I'll provide the data as well, so that you can download it first from my source and then load it here using the upload button. That way you can go to the folder where you downloaded the data, you will have the corresponding CSV file available, and this code will run smoothly. So here I'm loading that data and storing it under the name df_ab_test, basically the data frame containing my A/B test click data. Next I want to show you how the data looks. Here you see the head; given that I haven't provided any argument, it shows the top five rows, so I get only the first five users, from the experimental group: I see that some of them clicked, some of them didn't, along with the corresponding user ID and the timestamp of the click action.
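A sketch of these loading and inspection steps (the file name follows the one mentioned above; the exact column names are assumptions based on this walkthrough):

```python
import pandas as pd

# Load the simulated click data into a data frame
df_ab_test = pd.read_csv("ab_test_click_data.csv")

# First five rows: user_id, click (0/1), group, timestamp
print(df_ab_test.head())

# High-level descriptive statistics of the numeric columns
print(df_ab_test.describe())
```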
Then, when we look at the describe function, we get a more general idea of what the data contains, not just what the top five rows look like, which is great for understanding what kind of data and what kind of variables you are dealing with. Now you can see the total, high-level picture: the descriptive statistics. Here we see that in total we have 20,000 users in this data, so 20,000 observations, 20K rows. The mean of the user ID is 10,000, which of course is not relevant. But here is an interesting number: the average click rate, when we look at the experimental and control groups together, is 0.4052, so 40.52%. However, this is not what we are primarily interested in, and it should not be confused with the click-through rate per group; what we are interested in is the click-through rate, the mean click, for the experimental group and the control group separately. Then we have our standard deviation; we see a high standard deviation, which is understandable given the large variation in our data: we have a control group and an experimental group, and this variation reflects the big difference in the click values. Then we have the min and the max, which don't tell us much, because the click variable is binary, containing zeros and ones, so naturally the minimum is zero and the maximum is one. As for the rest, the 25% (the first quartile), the 50% (which is the median), and the 75% (the third quartile) are not that relevant here. So the descriptive statistics for this kind of data, especially when it is filtered, are not super informative. But if you had a larger dataset with more metrics besides click, your primary metric, and you had also measured some other metrics, which is recommendable, then you would see more values worth looking at: not only the click rate but also, for instance, the mean or median of the conversion rate, or the average amount of time a user spent on your landing page, or how much time the user spent before making the
decision to click. Those can all be very interesting metrics to look into from the product and data science perspective, to understand the decision process and the funnel behind these clicks. But for now, for our case study, what we are purely interested in is our primary metric, the click event. What we can also see here is that in the control group we have 1,989 users out of all control users who ended up clicking, versus the experimental group where 6,116 users clicked. Do not confuse this with the total number of users per group: this number comes from grouping the data per group, using groupby on the group column, and summing the click variable. Given that click is binary, we know from the basics of Python that we are effectively counting the number of click events: if you have a binary variable of zeros and ones and you sum it, adding the zeros has no impact, so you end up summing the ones and you get the total number of cases where the click variable equals one, in other words the number of click events. Therefore we can see that 6,116 of the experimental users ended up clicking, while in the control group the number is much lower, 1,989.
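A short sketch of this aggregation, with the column names assumed above:

```python
# Sum the binary click column within each group: because click is 0/1,
# the sum equals the number of click events per group
clicks_per_group = df_ab_test.groupby("group")["click"].sum()
print(clicks_per_group)   # control: 1989, experimental: 6116
```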
So let's now go ahead and visualize this data. I want to show, in a bar chart, the distribution of the click event per group, with the experimental group and the control group next to each other. As you can see, we get our bar charts, where yellow corresponds to "no", meaning there was no click, and black corresponds to "yes", meaning there was a click. So whenever you see the yellow bar, that amount corresponds to no click, no engagement from the user's side, per group. This is what we refer to as the click distribution in our experimental and control groups. The way I generated this bar chart is by first creating a list containing the colors I want to assign to each value: zero corresponds to yellow and one corresponds to black. This means that if my click variable equals zero, so there is no click, it's a "no" and I visualize it in yellow; otherwise I have a black, for the case where we have a click. As you can see here, the "yes", meaning a click, is visualized
in black. Then I initialize the figure size, saying I want a figure of 10 by 6. You can skip this, but I think it's always good to set the size of a figure to make sure you get the size you want, for example so that you can download it or take a screenshot later. Then, as you can see, I'm using a combination of the matplotlib.pyplot library and seaborn, because seaborn has much nicer colors. I'm using seaborn to create a count plot, because we are going to count and display, per group, the number of clicks versus no clicks: for the group called experimental and for the group called control. I'm specifying that the hue should be based on the click variable and that the data is df_ab_test, which means we look into this data, select the variable called click, and use it to split the counts within each group. You can see that the grouping is done on the variable called group: the argument x is set to group, so we are grouping our df_ab_test data by group, and the count in our count plot is based on the click variable. Basically what I'm saying is: group our df_ab_test data by group, experimental versus control, and then count the click events per group; so per experimental and per control group, how many times do we see a "no" (a zero) and how many times do we see a "yes" (a one) as the value of the click variable. Then, as the palette, I use the custom palette I just created, which should be in the form of a list, as you can see here. If my target variable had a third or fourth value, I would of course need to extend this color palette, because I need as many colors as there are values; in this case click has only two possible values, 0 and 1, so I specify only two colors in my list. Then we have the title of the plot, always nice to add, and then our labels: as my x label I have "group", because the experimental and control groups sit on my x axis, and on my y axis I of course have the count, since I'm counting the number of no-click versus click events.
Note that the y axis is in terms of counts: here you see 8,000, 7,000, 6,000, 5,000, meaning we are talking about numbers, counts, rather than percentages. This matters because another thing I'm doing is going the extra mile and, besides the counts, adding the corresponding percentages on top of each bar. It's always great to enhance your data visualization with percentages, because percentages are easier for the person following your presentation to understand. For instance, if the experimental group shows 6,000 clicks and 4,000 no-clicks, the audience might not quickly work out that you have 10,000 users in total, of whom 6,000 clicked and 4,000 didn't. By adding these percentages we can see at a glance that 61.2% clicked in the experimental group and 38.8% did not. Of course this is simulated data; I specifically picked extreme values so that we can clearly see the difference in the click-through rates. In reality you might see a click-through rate of 10% to 14%, which is usually a good number; a click-through rate of 40% would be great, but it really depends on the underlying user base, on what kind of product you have and how large your user base is. If you have a very large user base then 10% can be a good click-through rate, whereas with a very small user base maybe 61% is considered good or average. So here we simply have simulated data, of course,
and I have added these percentages using the following code. I won't go too deep into the details here; feel free to check back, and if something doesn't make sense, go back to our Python for data science course, which covers a lot of the Python basics. Briefly, what I'm doing is calculating the percentages and annotating the bars. Per group, I take the total number of users, I count the number of click events (cases where the click variable equals one) and the number of no-click cases (where it equals zero), and then I use the group total to turn those counts into percentages. For instance, in this specific case I filter the data for the experimental group, look at the total number of users for this group, which is 10K, count how many of those 10,000 users ended up clicking on the button (the click equals one cases), divide that number by the total number of users in the experimental group, and multiply by 100 to get a percentage; that is the calculation you see here. One important detail is how I identify whether a bar belongs to the experimental or the control group: I do it by looking at p, which here is one of the patches of the plot. So in essence I'm saying: if I'm dealing with the experimental group, compute the total number of observations, take the number of clicks, divide the two numbers, and multiply by 100 to get the percentage. I then do this for each of those groups: I have two groups, and within each group I have clicks and no clicks, so I calculate these four different percentages and add them on top of the corresponding bars. I want my visualization to show not only raw counts but also the corresponding percentages at the top, purely for presentation purposes. I wanted to put this out there because it can strengthen your data visualization toolkit and will make the audience of your presentations more grateful when you tell the story of your data.
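Putting the whole figure together, here is a self-contained sketch of the count plot with percentage annotations; the column names, the list palette, and the equal group sizes are assumptions based on this walkthrough:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# First color for click == 0 (no click), second for click == 1 (click)
palette = ["yellow", "black"]

plt.figure(figsize=(10, 6))
ax = sns.countplot(x="group", hue="click", data=df_ab_test, palette=palette)
plt.title("Click distribution in experimental and control groups")
plt.xlabel("Group")
plt.ylabel("Count")

# Annotate each bar with its share of the group's total users;
# both groups hold 10,000 users each in this simulated dataset
group_size = len(df_ab_test) // 2
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(f"{100 * height / group_size:.1f}%",
                    (p.get_x() + p.get_width() / 2, height),
                    ha="center", va="bottom")
plt.show()
```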
So this is about the data we have: we see that 38.8% of our experimental group users did not click on the button, versus 61.2% who did, based on the simulated data; and in the control group we have quite the opposite situation, with the majority of users, 80.1%, not clicking, versus the remaining 19.9% who actually clicked. So we have a huge difference, a clear imbalance, between the experimental group and the control group. This already gives us an indication that something is going on here, a high-level intuition of what the remaining analysis will look like: most likely there will be a difference in the CTRs of the control versus the experimental group and their corresponding buttons. But let's continue: the entire goal of A/B testing is to ensure that our conclusions are based on the data rather than on our intuition.
So what are the parameters I'm using for conducting our A/B test? When I designed this A/B test, the first step was of course to do all the different translations that we learned as part of our A/B testing course and to conduct it properly, which means coming up with three parameters during the power analysis. Usually this should be done in collaboration with your colleagues, your product managers, product people, and domain experts, because they have a lot of information on what threshold needs to be passed in order to say, for instance, that the new version of your feature is considerably different from the existing one. To make these conclusions we need three parameters that help us conduct a proper A/B test, as we learned when we looked into designing one. First we have our significance level: the significance level, or alpha, the Greek letter we use to refer to it, which is also the probability of a Type I error. We have chosen this amount following the industry standard, which is 5%, given that we didn't have any prior information or a specific reason to choose a different, lower or higher, significance level. This means we will compare the p-value of our statistical test to this 5% and then say whether we have a statistically significant difference between the control and experimental groups at the 5% significance level. Let's refresh our memory on alpha: the significance level is the probability of a Type I error, the amount of error we are comfortable making when we reject the null hypothesis while the null hypothesis is actually true, that is, when we detect a difference between the experimental and control versions while there is no difference. Here we are saying that we are fine with making this mistake
at a maximum of 5%; higher than that is not allowed, we are not comfortable with an error rate above 5%. The next variable is beta, the probability of a Type II error, the opposite of the Type I error: the false negative rate, the proportion of the time we end up failing to reject the null hypothesis while the null hypothesis is false and should have been rejected. Then 1 minus beta is the power of the test: how often we correctly reject our null hypothesis and correctly state that there is indeed a statistically significant difference between our experimental group and our control group. For this we also chose the industry standard, which is 80%. But given that this part of the power analysis is not relevant for the results analysis in this case study (we use it when calculating the minimum sample size, not when conducting the results analysis), I'm not initializing it as part
of this code. Here I'm only providing my program the values for the significance level, which is 0.05, the same as 5%, and for delta, the third parameter. This delta is our minimum detectable effect: the Greek letter delta, the minimum detectable effect, helps us understand whether, besides having a statistically significant difference, this difference is large enough for us to be comfortable making the business decision to launch the new button. It can happen that when conducting an A/B test we find that the experimental group indeed has higher engagement than the control group: we get a p-value smaller than alpha, which means we can reject the null hypothesis and say that the CTR of the experimental group is statistically significantly different from that of the control group at the 5% significance level. But we know from the theory of A/B testing that this alone is not enough: statistical significance by itself is not enough for the business to make the important decision to launch an algorithm or a feature, in this case to change the landing page button from Secure Free Trial to Enroll Now. We want enough users, and a large enough difference in the click-through rates, telling us that people are happier with the new version of the landing page before we go and change our feature. And what is this definition of "enough"? What is the difference in the click-through rate that we need to detect, after we have detected statistical significance, in order to say that we also have practical significance, meaning that practically we are also comfortable making the business decision to launch this new feature and change our landing page button? That is exactly what we have under our delta, the minimum detectable effect. In this case we have chosen a delta of 10%: you can see here 0.1, which is 10%, meaning our delta, or MDE, the minimum detectable effect, is 10%.
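The two values actually initialized in the code, per the description above:

```python
# Parameters chosen during the power analysis
alpha = 0.05   # significance level: max Type I error rate we accept
delta = 0.10   # minimum detectable effect (MDE): at least 10% CTR lift
```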
This means we are saying that not only should there be a statistically significant difference between the experimental and control groups, but this difference must also be at least 10%: we need to have detected that the experimental version of the landing page results in at least a 10% higher click rate than the control version before we go ahead, launch the new version, and deploy this new UX feature. This is really important, because many people only check for statistical significance: they set alpha, check whether the p-value is below alpha, say "hey, we have a statistically significant difference," and consider themselves done. But that's not correct. After you have conducted your statistical significance analysis and detected that your experimental version has a statistically significantly different CTR from the control version at your alpha significance level, the next thing you need to do is ensure that you also have practical significance. You can check this by using your MDE, your delta, and comparing it to the confidence interval you have calculated, something we also learned as part of the theory of conducting a proper A/B test. Once we come to that point, after we check for statistical significance, I will explain exactly how to do this check, and at the same time we will refresh our theory on practical significance. So let's now go ahead and calculate the total number of clicks per group by summing up the clicks.
I also want to compute these amounts with a groupby, just to showcase how you can do it on your own. What I'm doing here is taking my A/B test data and grouping it by group, the variable that indicates whether we are dealing with the experimental group or the control group. As you know from our Python series and demos, the Python for data science course: whenever we want to group a pandas data frame, we first write the data frame name, then .groupby with, in parentheses, the variable we are using to do the grouping, in this case group; and then, within square brackets, the name of the variable we want to apply operations on. So I group my data on the group variable and I count the number of clicks in my control group and in my experimental group; these become my X control and X experimental variables. X control will contain the number of clicks in my control group, and X experimental the number of clicks in my experimental group. Given that I want to refer to each group by name after the grouping (I get an indexed data frame back), I then need to use the .loc function to properly retrieve each amount, to see which value corresponds to which index; and since my index consists of strings, I use .loc, something we also learned in the Python for data science course. Then there is the printing, just writing out the results nicely: we count that the number of clicks for my control group is 1,989 (and we can double-check that we get the same number as before, so we are dealing with the same data set), and the number of clicks for the experimental group is 6,116. Then, to prepare the pooled estimates of the clicks per group, let me also quickly add how I can calculate the total number of users in the experimental group and in the control group. Here I filter the data frame on group equal to experimental and count the rows, and then I copy this line and change it to control, which gives me the number of users in each of the two groups. I'll print the number of users per group, and at the same time the number of clicks per group. There we go. Now that we have done this, we are ready to calculate the estimated click probability per group.
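A sketch of these counts, assuming the group labels "exp" and "con" described earlier (the shortened variable names are illustrative):

```python
# Clicks per group via groupby + sum on the binary click column
clicks_per_group = df_ab_test.groupby("group")["click"].sum()
x_con = clicks_per_group.loc["con"]   # 1989 clicks in control
x_exp = clicks_per_group.loc["exp"]   # 6116 clicks in experimental

# Users per group via boolean filtering
n_con = (df_ab_test["group"] == "con").sum()   # 10000 control users
n_exp = (df_ab_test["group"] == "exp").sum()   # 10000 experimental users

print(f"Clicks: control={x_con}, experimental={x_exp}")
print(f"Users:  control={n_con}, experimental={n_exp}")
```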
For that, we take the number of clicks in the control group and divide it by the number of all control users, x_con divided by n_con, and we refer to this variable as p_con_hat, because in statistics and in A/B testing the estimate of this click probability is always written with a hat: it is an estimate, something we are estimating, therefore we say "hat". We do the same for the experimental group: the estimate of the experimental group's click probability equals x_exp divided by n_exp. Then, to calculate the pooled estimate, the pooled click probability, the value that describes the control group and the experimental group together, we follow this formula: we take x_con and add x_exp to it (this is the numerator), and we divide by the sum of the sizes of the two groups, n_con plus n_exp. This is the common formula for the pooled estimate in this type of experimentation, when you are dealing with a primary metric in the form of zeros and ones; if you want to refresh your memory on this type of formula, make sure to also check our A/B testing course, because in the lesson on A/B test results analysis we go through all of these formulas in detail, including how to calculate the pooled estimate of the click probability. What we get is a value of about 0.405, and this number should look familiar, because it is the mean we saw in the descriptive statistics table earlier. Basically we are now calculating it manually, because we need a variable to hold this value: it is simply the sum of all clicks for the control and experimental groups, the total number of clicks, divided by the total number of users, n_con plus n_exp.
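The corresponding calculations, following the formulas just described:

```python
# Per-group estimates of the click probability (the "hat" estimates)
p_con_hat = x_con / n_con   # control CTR estimate
p_exp_hat = x_exp / n_exp   # experimental CTR estimate

# Pooled click probability across both groups
p_pooled_hat = (x_con + x_exp) / (n_con + n_exp)
print(p_pooled_hat)   # ~0.405, matching the mean from describe()
```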
Now that we have this, we are ready to calculate what we refer to as the pooled variance, also something we learned as part of the theory of A/B testing. The pooled variance equals the pooled estimate of the clicks, p_pooled_hat, which we just calculated, multiplied by one minus p_pooled_hat, so the estimate of the click probability multiplied by the estimate of no click. We already know from the idea of the Bernoulli distribution that the variable describing this process of clicks and no clicks follows that distribution: we have a probability of a click and a probability of no click, which is one minus the click probability; that is the intuition behind this part of the formula. This is then multiplied by (1 / n_con + 1 / n_exp). Here I'm purely following the formula for the pooled variance; if you want more details and explanations, make sure to check the corresponding theory lecture, because there we go into the details of each of these formulas and why we calculate the pooled variance and pooled estimate in this specific way. By just following the formula, I get the pooled variance. So this is, in a nutshell, how I calculated my pooled click probability and the pooled variance of the click event, and we are going to need both in the next, very important step: calculating the standard error and the test statistic.
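In code, following the formula above:

```python
# Pooled variance used by the two-sample Z test: Bernoulli variance
# of the pooled estimate times the combined 1/n terms
pooled_variance = p_pooled_hat * (1 - p_pooled_hat) * (1 / n_con + 1 / n_exp)
```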
In this case our primary metric is in the form of zeros and ones, so let's quickly talk about the choice of statistical test before doing the actual calculation of the standard error and the test statistic. Here I went for the two-sample Z test, and let me explain the motivation. As we learned in the theory, whenever we have a primary metric in the form of an average, as we have now (we are using p_con_hat and p_exp_hat, and our primary metric, the click-through rate, is the average click per group: we calculated the average click for the experimental group and for the control group), then the form of the metric already dictates that we need to look at either a parametric test for averages or a non-parametric test for averages. In this case I went for the parametric option, because it has better properties when I have information about the distribution of my data. Why do I have this information, and how does it dictate the choice of my statistical test? Well, my sample size is over 100, and in fact well over 30, which is the threshold we tend to use in statistics and in A/B testing to say whether we have a large sample or not. If our sample is not large, so fewer than 30 users per group, which happens as well, then we need to go for statistical tests specific to such cases, because we can no longer rely on statistical theory like the central limit theorem, which lets us do the inference, that is, make use of inferential statistics and draw conclusions about the distribution of our population from just a sample. What do I mean by that? If my sample is larger than 30, and in this specific case I have 10,000
users per group, so definitely more than 30 users, then I can say, by making use of the central limit theorem, that my sampling distribution is normally distributed. This is simply applying the central limit theorem, something we also learned when looking into inferential statistics as part of the fundamental statistics course at LunarTech. It is a powerful theorem that we use in A/B testing to make our lives easier: when we have a sample larger than 30 for each of the groups, then even if we don't know the actual distribution, the name of the distribution that our sample follows when it comes to the click event (the random variable describing the number of clicks or the average click-through rate), the fact that the sample is large enough, more than 30 users, lets us say via the central limit theorem that the sampling distribution follows a normal distribution. And this helps, because in that case it doesn't matter whether we use the two-sample Z test or the two-sample t test; we can use either of them to conduct our analysis. We had a specific template to make this choice easier in our A/B testing course at LunarTech, where we walked through all these decisions: if the sample size is this, we need
to do this; if the sample size is that, we do that. In this specific case, following that exact structured and organized approach, I found that my sample size is large, larger than 30, so I can make use of the central limit theorem; I then know that the random variable describing this click-through rate follows, approximately, a normal distribution, and this means that whether I use a t test or a Z test doesn't really matter, I will end up with the same conclusions. Therefore I'll just go with the two-sample Z test, simply because it is easier for me. You can also go with the two-sample t test; you can even change this case study, tweak it, make it your own, and put it on your resume that way, making it more unique, and that will be totally fine, because you will end up with exactly the same conclusions as we do in this specific case study: with a large enough sample it doesn't matter whether your parametric test is the two-sample Z test or the two-sample t test. If you want to know why this matters, along with all the detailed statistical insights, make sure to check the actual course dedicated to A/B testing, because there we cover all of this and you will become a master in the field of A/B testing. Now that we know these decisions and the motivation behind choosing the two-sample Z test, let's
now go ahead and do the actual calculations. Here we have the standard error, which we calculate by taking the pooled variance and taking its square root; this again uses the formulas we learned as part of the A/B testing theory. Taking the square root of the pooled variance gives us the standard error, which, as you can see here, comes out at approximately 0.0069. Then we calculate the test statistic for our two-sample Z test: the test statistic equals p_con_hat minus p_exp_hat, divided by the standard error. Here you can now see the motivation for computing not only p_pooled_hat but also p_con_hat and p_exp_hat: I take p_con_hat, subtract p_exp_hat, and divide by the standard error to compute my test statistic. Once I do this, the test statistic for our two-sample Z test comes out at about minus 59.56, rounded.
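The same two steps in code:

```python
import numpy as np

# Standard error is the square root of the pooled variance
se = np.sqrt(pooled_variance)          # ~0.0069 on this data

# Two-sample Z test statistic for the difference in CTRs
test_statistic = (p_con_hat - p_exp_hat) / se
print(round(test_statistic, 2))        # a large negative value, ~ -59.6
```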
Then we can also compute the critical value of our Z test, using the norm function that we loaded earlier from SciPy. This helps us find the value from the standard normal distribution table that we need in order to create our rejection regions and to say whether we can reject our null hypothesis or not. To conduct our test we need a critical value to which we will compare our test statistic, and this critical value is based simply on the standard normal distribution. This is norm.ppf, the percent point function, which is the inverse of the cumulative distribution function, and the argument we pass to it is 1 minus alpha divided by 2. We divide alpha by two because we have a two-sided two-sample test; if you want to understand the difference between two-sample and two-sided tests, please check out the fundamentals of statistics course at LunarTech, because we cover this topic in detail there. It is an involved topic with many complexities from a statistical point of view, so I won't spend too much time on it here. I'm assuming you know this formula already, but if you don't, and you quickly need to do your own A/B testing case study, feel free to just copy this line, which computes the value, based on the chosen significance level, that we compare our test statistic against. So our test statistic is the value above, and the value we compare it to is the Z critical value, which here equals 1.96.
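That line, as described:

```python
from scipy.stats import norm

# Two-sided critical value from the standard normal distribution:
# ppf is the inverse CDF (percent point function)
z_critical = norm.ppf(1 - alpha / 2)
print(round(z_critical, 2))   # 1.96 for alpha = 0.05
```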
This is actually a very common value that we know even without looking at a standard normal table: when you use this test often enough, you know that the critical value for a two-sided test on the normal table is 1.96; it's just a value we know. And here, even without calculating the next step, the p-value, we can already say what decision we need to make in terms of statistical significance, because one way to test our statistical hypothesis is to compute the test statistic and check whether its absolute value is larger than the critical value. Our test statistic is about minus 59.56; its absolute value, 59.56, is much larger than our critical value of 1.96. This already tells us that we can reject our null hypothesis at the 5% significance level. But I want to go on to the next step anyway, because that is the more structured, more organized way of conducting experimentation: in the industry we tend to
make use of p-values instead of this econometric, statistical approach of comparing the test statistic to the critical value. So once we have calculated our test statistic, the next thing to do is calculate our p-value, compare that p-value to the significance level, and then decide: either we reject our null hypothesis and say we have statistical significance, or we cannot reject the null hypothesis, in which case we say we don't have statistical significance, that is, we don't have enough evidence to reject the null hypothesis. The idea here is that we make use of our normal function again, specifically norm.sf, from the same norm object in scipy.stats; this time we use the survival function, which is one minus the cumulative distribution function of the normal distribution (this again comes from statistics). Using the absolute value of our test statistic, and multiplying by two given that we have a two-sided test, I calculate my p-value. This is simply the formula we saw when studying A/B testing from the technical point of view: the p-value is the probability that Z is smaller than or equal to minus the absolute value of the test statistic, plus the probability that Z is larger than or equal to that absolute value. We want to know this probability because it represents the chance of getting such a large test statistic purely by random chance, and not because there is an actual statistical difference between the click-through rates of the experimental and control groups. That is the idea behind the p-value: what is the chance that we are being fooled by a random observation, getting a large test statistic and claiming statistical significance while there is no such thing, and we obtained this large test statistic purely by random chance? If the probability of getting such a large test statistic by random chance is small, so if the p-value is small, then we can say we have statistical significance. When we calculate this p-value we store it in a variable called p_value. The next thing I do is write a function called is_statistically_significant, which takes the p-value and alpha as arguments.
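The p-value computation, per the description above:

```python
from scipy.stats import norm

# Two-sided p-value: twice the upper-tail probability of |Z|
p_value = 2 * norm.sf(abs(test_statistic))
```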
So I just need the p-value I calculated for my test statistic and the significance level I want to use for my test, the value that comes from my power analysis, which, as I mentioned before, is 5%. I take the two and compare them: I assess whether I have statistical significance by comparing my p-value to my significance level alpha. What is this comparison? We know from the theory that if we have a low p-value, specifically if the p-value we get is smaller than or equal to 5%, or 0.05, the significance level, then this indicates that we have strong statistical evidence that the null hypothesis is false and we need to reject it; we have strong evidence against the null hypothesis. Otherwise, if the p-value is larger than 0.05, larger than the 5% we have chosen as the maximum threshold for that mistake, so the p-value exceeds your significance level, this indicates that you don't have enough evidence against the null hypothesis: your evidence is weak, which means you fail to reject the null hypothesis. That is what I'm doing here with this code. I print the p-value first, rounding it to three decimals with the round function, and then I check and determine whether I have statistical significance or not. The way I do that is: if my p-value is smaller than or equal to alpha, then we print that there is statistical significance, which indicates that the observed differences between the experimental and control groups are unlikely to have occurred by random chance. In other words, this is not random chance, we have strong evidence of statistical significance, and this suggests that the new feature, the new version of our landing page with the Enroll Now call to action, is better and results in a statistically significantly higher click-through rate than the existing control version; there is a real effect. Otherwise, if this is not the case, meaning my p-value is larger than my alpha, I print that there is no statistical significance and that the observed difference in the click-through rates is not due to a real difference in performance but truly just random chance.
chance so here we can see that once we run our we call the function in here which is simply the function name and the argument so P value in alpha alpha Comes from the initialized value that we had from our power analysis so from here we initialize this value it's 0.05 and then here we got the P value that we just calculated then what we are getting in here is that our P value is actually so small that it's um rounded to the zero so what this means is that that there is evidence that suggests
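In code, these two steps can look roughly like the sketch below; the exact variable names and printed messages are my assumptions based on the narration, and -9.44 is the test statistic reported later in this case study:

```python
from scipy.stats import norm

test_statistic = -9.44  # the two-sample Z-test statistic computed earlier
alpha = 0.05            # significance level chosen during the power analysis

# Two-sided p-value: probability of a test statistic at least as extreme
# as ours under the null hypothesis. sf(x) = 1 - cdf(x).
p_value = 2 * norm.sf(abs(test_statistic))

def is_statistically_significant(p_value, alpha):
    """Compare the p-value to the significance level alpha."""
    print(f"P-value of the Z-test: {round(p_value, 3)}")
    if p_value <= alpha:
        print("There is statistical significance: the observed difference "
              "is unlikely to be due to random chance.")
        return True
    print("No statistical significance: we fail to reject the null hypothesis.")
    return False

is_statistically_significant(p_value, alpha)
```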
What this means is that there is evidence suggesting, at the 5% significance level, that the click-through rate of the experimental group is different from the click-through rate of the control group. Note that I'm not saying higher or lower, because our statistical test was two-sided: under the null hypothesis we had P_control equal to P_experimental, and under the alternative P_control not equal to P_experimental. So we have now rejected the null hypothesis: since our p-value is (rounded to) zero and smaller than the 5% statistical significance level, we can reject the null hypothesis and say there is enough evidence that P_control is not equal to P_experimental. And given what we saw from the visualizations and our calculations — that the click-through rate of the experimental group is much higher than that of the control group — we can also say that, at the 5% significance level, we have found a statistically significant difference between the experimental and control groups' click-through rates, and that the experimental group's click-through rate is statistically significantly higher than the control version's. This is really important, because it suggests that the difference in click-through rates is not due to random chance alone: there is statistical evidence supporting a true difference between the performance of the experimental version of the product and the control version.
In our case, that is the landing page with the Enroll Now button versus the control version of the product, the existing landing page with the Start Free Trial button. Now, besides calculating the p-value, it's always great practice to also visualize your results. This is valuable for an audience that is technically sound and knows these concepts: you don't want to show only a single number — the p-value — and say "hey, I have statistical significance"; you also want to showcase
the actual picture of what you got — what your test statistic is, what significance level you used — to tell a story around your numbers. That's the art behind data science, I would say, so let's go ahead and do some art. What I'm doing here is making use of the standard normal distribution — the Gaussian distribution, as the standard normal is often called in statistics — saying that my mean, mu, is equal to zero and my sigma, the standard deviation, is equal to one. To plot the standard normal distribution, I generate my x values, the points I want along the x-axis — say the values between minus three and three — and then take the PDF, the probability density function of the normal distribution, using the scipy library: I provide my x values and get back the corresponding y values, which trace out the density of the Gaussian, the standard normal distribution. Then I add the corresponding rejection regions to the graph: with the fill part of the plot I shade the rejection regions, saying that for all values in this figure, whenever the value is lower than minus the critical value, z_critical = 1.96, or larger than 1.96, we are in the rejection region. If my test statistic falls into the rejection region — and in this case you can see we are at the far left, the test statistic is -9.44, much lower than the threshold; it's this left blue line here — then it falls in the rejection region. This entire shaded area is the rejection region: it starts here and extends all the way out, and anything in it means our test statistic falls into the rejection region, so we can reject the null hypothesis. If we were to get a test statistic that is very large and positive, we would be in the right-hand part of the figure — again a rejection region: anything beyond that line falls under the rejection-region category, and likewise anything out here on the left.
Being in the rejection region means we can reject the null hypothesis and say that we have statistically significant results.
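A sketch of what this plotting cell can look like; the x-range, colors, and figure size are my assumptions — the narration only specifies the standard normal PDF, the ±1.96 critical values, the shaded rejection regions, and the test statistic of -9.44:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 0, 1                     # standard normal (Gaussian) distribution
test_statistic = -9.44               # value obtained earlier in the case study
z_critical = norm.ppf(1 - 0.05 / 2)  # ~1.96 for a two-sided 5% test

x = np.linspace(-12, 4, 1000)        # wide enough to show the extreme statistic
y = norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 6))
plt.plot(x, y, color="black", label="Standard normal PDF")
# Shade the two-sided rejection region |Z| > z_critical
plt.fill_between(x, y, where=(x <= -z_critical), color="red", alpha=0.3)
plt.fill_between(x, y, where=(x >= z_critical), color="red", alpha=0.3)
plt.axvline(test_statistic, color="blue", linestyle="--", label="Test statistic")
plt.title("Two-sided rejection region at the 5% significance level")
plt.legend()
plt.show()
```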
So now that we have our statistical significance, it's always a great idea to go on to the next step — and here it's actually mandatory — because not only statistical significance matters but also practical significance, as I mentioned at the beginning of this case study. For that, we are first going to calculate the confidence interval of the test. This confidence interval will, first of all, let us comment on the quality of our test, its generalizability to the entire population, and the accuracy of our results; we will then use it to test for practical significance in our AB test. So let's go ahead and calculate it. As we learned in the lectures, the confidence interval is built from p̂_experimental and p̂_control — the two estimates of the experimental and control groups' click-through rates — together with the standard error of our two-sample Z-test and the critical value. Given that the significance level we are using is alpha = 0.05, and z_critical is based on it, we are calculating the 95% confidence interval. The lower bound is (p̂_experimental − p̂_control) − z_critical × SE, and the upper bound is the same expression with a plus sign: (p̂_experimental − p̂_control) + z_critical × SE; we round both to three decimals behind the zero. This is purely following the formula of the confidence interval that I showed you earlier. Let's print this interval: what we see is a confidence interval from roughly 0.04 to 0.043. The snippet below sketches the calculation.
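A sketch of the interval calculation — the variable names, and the placeholder values for the two estimated click-through rates and the standard error, are my assumptions; in the notebook these come from the earlier steps of the test:

```python
from scipy.stats import norm

z_critical = norm.ppf(1 - 0.05 / 2)  # ~1.96 for a 95% confidence interval

# Illustrative placeholders -- replace with the estimates computed earlier
# from the experiment data.
p_hat_exp = 0.32        # estimated CTR of the experimental group (illustrative)
p_hat_con = 0.28        # estimated CTR of the control group (illustrative)
standard_error = 0.001  # standard error of the two-sample Z-test (illustrative)

diff = p_hat_exp - p_hat_con
confidence_interval = (round(diff - z_critical * standard_error, 3),
                       round(diff + z_critical * standard_error, 3))
print(f"95% confidence interval for the CTR difference: {confidence_interval}")
```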
That is quite a narrow confidence interval, I would say, which is actually a good sign. The confidence interval provides the range of values within which the true difference between the control and experimental groups' proportions — their click-through rates — is likely to lie, at a certain level of confidence, in this case 95%. When the confidence interval is narrow, the accuracy of our results is higher, and the results we got from our smaller sample will most likely generalize well when we deploy these changes and put the new product in front of the entire population of users — after all, we ran this experiment on a small group, a sample. A narrow rather than wide confidence interval means the results we are getting are more or less accurate and most likely a true representation of the entire population. That is the idea behind the width of the confidence interval: the narrower it is, the higher the quality of your results, and the more generalizable they are. Let's now go on to the final stage of our case study: testing the practical significance of our results.
Now we know that the statistical significance is there — the experimental version of our feature is statistically significantly different from the control version in terms of click-through rate — and we have seen that the confidence interval is narrow, meaning our results are quite accurate. So we can now comment on the practical significance of our results. This means we want to see whether the significant difference we obtained is actually large enough, from the business perspective, to say it's worth putting our engineering resources and our money into this change — worth changing this button, putting it into production, and placing it in front of our users. And here we are not only talking about the engineering resources it will take to make the change, deploy it, and monitor it, but also about the quality of the product we are providing to our users, because whenever we make a change to our product it is a risk:
we are changing what our users are used to seeing, and that can be scary from a business point of view — we don't want to scare our customers away. Therefore we also need to check for practical significance. For that, I'm creating a Python function that takes two arguments: the minimum detectable effect, and the 95% confidence interval I just calculated. I'm calling this function is_practically_significant; it will check whether practical significance is there, return True or False, and also print the conclusion. From the theory, and from the AB testing concepts we covered, the convention in this case study is: whenever the MDE — the delta, the minimum detectable effect we assumed before even conducting the AB test — is at least as large as the lower bound of our confidence interval, we conclude that we have practical significance, meaning the difference we stand to obtain is large enough to give us the motivation to make this change in our product. So the first thing I do is take my 95% confidence interval and pick its first element, because a confidence interval is really a range — a pair of two numbers, the lower bound and the upper bound — and I need the lower bound, because all
I care about for this practical-significance check is comparing the lower bound of the 95% confidence interval to the minimum detectable effect, my delta. So I take the lower bound of the confidence interval, put it into a variable, lower_bound_ci, and compare it to my delta: if my delta is larger than or equal to lower_bound_ci — which is the same as saying the lower bound of the confidence interval is smaller than or equal to my delta — then we say that we have practical significance. For the MDE I want to use my initial delta, so I won't re-initialize it: you might recall we set a delta of 10% here, and I still want to make use of it. So I call the function with that specific delta, 10%, as my MDE, and since this delta is larger than the lower bound of the confidence interval we just obtained, we say that we have practical significance: with an MDE of 10%, the difference between the control and experimental groups is also practically significant.
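Sketched in code, following the convention the narration just described (delta compared against the lower bound of the interval); the function body and printed messages are my assumptions:

```python
def is_practically_significant(delta, confidence_interval):
    """Practical-significance check, per this case study's convention:
    compare the minimum detectable effect (delta) against the lower
    bound of the 95% confidence interval."""
    lower_bound_ci = confidence_interval[0]
    if delta >= lower_bound_ci:
        print(f"With MDE of {delta}, the difference between the groups "
              "is practically significant.")
        return True
    print("The difference is not practically significant.")
    return False

delta = 0.1  # the 10% minimum detectable effect chosen during the power analysis
confidence_interval = (0.04, 0.043)  # the interval obtained above
is_practically_significant(delta, confidence_interval)
```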
You can see that the lower bound of 0.04 is the value we obtained above, and that value is compared to the delta; here you can see we concluded that we also have practical significance. Amazing — we have come to the end of this case study. In this involved case study we conducted an entire AB test and its analysis end to end: going from loading the data and understanding the business objective of the AB test — testing whether the Enroll Now button, the new experimental version, should replace the existing Secure Free Trial button — all the way to the decision. What we found is that we have statistical significance at the 5% significance level, suggesting that we can reject the null hypothesis and say that there indeed exists a statistically significant difference between the click-through rate of the experimental group and that of the control group — specifically, that the Enroll Now experimental button results in a statistically significantly higher click-through rate than the Secure Free Trial button. Besides this, we also checked the
accuracy of our results by looking at the confidence interval: it was quite narrow, suggesting that the results we obtained were quite accurate and that what we got from the sample should generalize to our population of users. Finally, we also checked the practical significance of our results, using the 95% confidence interval and comparing its lower bound with our minimum detectable effect, delta: against the 10% MDE we set before the test, the experimental group's CTR uplift over the control group clears our practical-significance bar. So, beyond the statistical significance, we also have enough motivation and reason from the business perspective to put this new button into production, and we can conclude that, based on this data-driven approach and this AB test, there is clear motivation to deploy the new Enroll Now button and
replace the existing Secure Free Trial version; we then expect to see more users clicking on it and engaging with our product. For now, this is all for this case study. If you want to learn more about AB testing, make sure to check our AB testing course as well as the Ultimate Data Science Bootcamp — don't forget to try our free trial, this time using our Enroll Now button. And if you want to see more case studies like this, check out our case studies: many are included in the Ultimate Data Science Bootcamp, where we go into the details of these different steps and conduct all sorts of case studies to put data science theory into practice — from NLP, machine learning, recommender systems, and advanced analytics to AB testing, and soon AI as well. For now, thank you for staying with me through this case study — happy learning! Thank you for watching this video. Looking to step into machine learning or data science? It's about starting somewhere practical yet powerful, and that's the simple yet
most popular machine learning algorithm: linear regression. Linear regression isn't just jargon; it's a tool used both for finding the most important features in your data and for forecasting the future. That's your starting point on the journey into data science and hands-on machine learning work. Embark on a hands-on data science and machine learning project where we are going to find the drivers of California house prices. You will clean the data and visualize the key trends; you will learn how to process your data and how to use different Python libraries to understand the drivers of California house values; and you will learn how to implement linear regression in Python, along with all the fundamental steps you need to conduct a proper hands-on data science project. At the end of this project you will not only know the core Python libraries for data science and machine learning — pandas, scikit-learn, statsmodels, matplotlib, seaborn — but you will also be able to put this project on your personal website and on your resume. I'm the co-founder of LunarTech, where we are making data science and AI more accessible for individuals and businesses. If you're looking for machine learning, deep learning, data science, or AI resources, check out the free resources section on the LunarTech website or our YouTube channel, where you can find more content and dive into machine learning and AI with concise, step-by-step case studies that build your confidence and expertise in machine learning and data science. Start simple, start strong — let's get started. We are going to use a simple yet powerful regression technique
called linear regression to perform causal analysis and predictive analytics. By causal analysis I mean that we are going to look into correlation versus causation and try to figure out which features have an impact on the housing price, the house value: which features describing the house define and cause the variation in house prices. The goal of this case study is to practice the linear regression model and get a first feel for how you can use a machine learning model — a simple one — to perform model training and model evaluation, and also to use it for causal analysis, where you try to identify the features that have a statistically significant impact on your response variable, your dependent variable. Here is the step-by-step process we are going to follow to find out which features define Californian house values. First, we will understand the set of independent variables we have, and also the response variable we
have for our multiple linear regression model. We will go over the techniques we need and the Python libraries we have to load to conduct this case study: first we load all these libraries and understand why we need them. Then we conduct data loading and data preprocessing. This is a very important step, and I deliberately didn't want to skip it or hand you clean data, because in a real hands-on data science job you usually won't get clean data: you will get dirty data containing missing values and outliers, and those are things you need to handle before proceeding to the actual part — the modeling and the analysis. Therefore we will do missing data analysis and remove the missing data from our Californian house price data, and we will conduct outlier detection: we will identify outliers, learn different visualization techniques in Python for spotting them, and then remove them from the data. Then we will perform data visualization: we will explore the data and make different plots to learn more about it and about those outliers, combining statistical techniques with Python. Then we will do correlation analysis to identify potentially problematic features — something I'd suggest you do regardless of the nature of your case study, to understand what kind of variables you have, what the relationships between them are, and whether you are dealing with potentially problematic variables. Then we move to the fun part: performing multiple linear regression for the causal analysis, which means identifying the features of the Californian house blocks that define the value of the Californian houses. Finally, we will very quickly run another implementation of the same multiple linear regression, to give you not one but two different ways of conducting it — because linear regression can be used not only for causal analysis but also as a standalone, common machine learning regression model — so I will also show you how to use scikit-learn as a second way of training a model and then predicting Californian house values. Without further ado, let's get started.
Once you become a data scientist, machine learning researcher, or machine learning engineer, there will be hands-on data science projects where the business comes to you and says: here is our data, and we want to understand which features have the biggest influence on this specific factor. In our case study, let's assume we have a client interested in identifying the features that define the house price. Maybe it's someone who wants to invest in houses — someone interested in buying houses, perhaps renovating them, and then reselling them at a profit — or maybe it's a long-term real-estate investor who buys property, holds it for a long time, and sells it later, or someone with another purpose entirely. The end goal, in this specific case, is to identify the features of a house that make it priced at a certain level: the features of the house that cause its price and value. We are going to make use
of a very popular dataset that is available on Kaggle and originally comes from scikit-learn, called California Housing Prices. I'll make sure to put the link to this dataset in my GitHub account, under the repository dedicated to this case study, and I will also point out additional links you can use to learn more about it. This dataset is derived from the 1990 U.S. Census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data; a block group typically has a population of 600 to 3,000 people. A household is a group of people residing within a single home. Since the average numbers of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
Let's now look into the variables available in this dataset. What we have is MedInc, the median income in the block group, which captures the financial level of the block of households. Then we have HouseAge, the median house age in the block group; AveRooms, the average number of rooms per household; AveBedrms, the average number of bedrooms per household; Population, the block group population — basically, as we just saw, the number of people who live in that block; AveOccup, the average number of household members; and Latitude and Longitude, the coordinates of the block group we are looking at. As you can see, we are dealing with aggregated data: we don't have data per household; rather, the data is calculated, averaged, and aggregated per block.
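As a side note (not part of the narrated workflow): the same dataset ships with scikit-learn, so a sketch like the one below can load it directly if you'd rather not download the Kaggle CSV — note that the column names differ slightly between the two versions:

```python
from sklearn.datasets import fetch_california_housing

# Load the 1990 census data directly from scikit-learn.
housing = fetch_california_housing(as_frame=True)
print(housing.frame.head())  # MedInc, HouseAge, AveRooms, AveBedrms, ...
print(housing.DESCR[:500])   # the dataset documentation quoted above
```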
Aggregating per block like this is very common in data science when we want to reduce the dimensionality of the data, work with sensible numbers, and create cross-sectional data. Cross-sectional data means we have multiple observations, all observed over a single time period; in this case the aggregation unit is the block. We have also already learned, as part of the theory lectures, the idea of the median: there are different descriptive measures we can use to aggregate data — one of them is the mean, another is the median — and often, especially when dealing with a skewed distribution (one that is not symmetric but right-skewed or left-skewed), we should use the median, because the median is then a better representation of the typical scale of the data than the mean. We will soon see, when we present and visualize this data, that we are indeed dealing with skewed data. A quick illustration of the mean-versus-median point follows below.
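A tiny, self-contained illustration with made-up numbers (not from the housing data) of why the median represents a skewed sample better than the mean:

```python
import numpy as np

# One extreme value drags the mean upward, while the median stays close
# to the typical observation.
incomes = np.array([2.1, 2.5, 3.0, 3.2, 3.4, 15.0])  # right-skewed sample
print(np.mean(incomes))    # ~4.87 -- pulled up by the outlier
print(np.median(incomes))  # 3.1  -- robust to the outlier
```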
So this is a very simple, very basic dataset without too many features — a great way to get your hands on an actual machine learning use case. We will keep it simple, yet learn the basics and the fundamentals well, so that picking up more difficult and more advanced machine learning models later will be much easier for you. Let's now get into the actual coding part: here I will be using Google Colab.
I will share the link to this notebook, together with the data, in my Python for Data Science repository, so you can use it to follow this tutorial along with me. We always start by importing libraries. We could run a linear regression manually, without libraries, using matrix multiplication — you can do that for fun, to understand the linear algebra behind linear regression — but I would suggest against it. If you want to get hands-on and use linear regression the way you'd be expected to in a day-to-day job, then expect to use a library such as scikit-learn, or the statsmodels.api library. To help you understand the topic and get hands-on, I decided to showcase this example not in one library only — scikit-learn — but also in statsmodels, and the reason is that many people use linear regression just for predictive analytics, for which scikit-learn is the go-to option. But if you want to use linear regression for causal analysis — to identify and interpret the features, the independent variables, that have a statistically significant impact on your response variable — then you will need another library, a very handy one for linear regression, called statsmodels.api, from which you import sm; this will help you do exactly that. Later on we will see how nicely this library provides output exactly like
what you learn in a traditional econometrics or introduction-to-linear-regression class. I'm going to give you all this background information, and we are going to interpret and learn everything, so that you start your machine learning journey in a proper and high-quality way. The first things we import are the pandas library, as pd, and the numpy library, as np. We need pandas to create a pandas DataFrame, read the data, and perform data wrangling — identifying missing data and outliers, the common data wrangling and preprocessing steps — and we use numpy whenever we visualize data or deal with matrices and arrays; pandas and numpy are used hand in hand. Then we have matplotlib — specifically its pyplot module — a very important library for visualizing data, and seaborn, another handy data visualization library in Python. Whenever you want to visualize data in Python, matplotlib and seaborn are two visualization libraries you must know; if you like a cooler undertone of colors, seaborn will be your go-to option, because the visualizations it creates are more appealing than plain matplotlib, while the underlying way of working — plotting scatter plots, lines, or heatmaps — is the same. Then we have statsmodels.api, the library from which we import sm, giving us the classical linear regression model we will use for our causal analysis. I'm also importing, from scikit-learn's linear_model, the LinearRegression model, which is basically similar — you can use either — but scikit-learn reflects the common way of working with machine learning models:
whenever you are doing predictive analytics — using the data not to identify features that have a statistically significant impact on the response variable (features that influence and cause the dependent variable), but simply to train a model on the data and then test it on unseen data — you can use scikit-learn. And scikit-learn is something you will use far beyond linear regression, for other machine learning models too: think of k-NN, logistic regression, random forests, decision trees, boosting techniques such as LightGBM, and clustering techniques like k-means and DBSCAN — anything you can think of that fits into the category of traditional machine learning models, you will find it in scikit-learn. Therefore I didn't want to limit this tutorial to statsmodels, which we could have done if we wanted this case study to be about linear regression only; instead I wanted to also showcase the usage of scikit-learn, something you will see time and time again when using Python for machine learning — and given that this course is designed to introduce you to the world of machine learning, I thought I'd combine the two. Finally, I'm also importing train_test_split from scikit-learn's model_selection, so that we can split our data into train and test sets. The import cell can look like this.
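A sketch of the import cell described above, with the standard aliases assumed:

```python
import pandas as pd                                    # data loading and wrangling
import numpy as np                                     # arrays and numeric helpers
import matplotlib.pyplot as plt                        # base plotting library
import seaborn as sns                                  # statistical visualizations
import statsmodels.api as sm                           # OLS with a full statistical summary
from sklearn.linear_model import LinearRegression     # prediction-oriented regression
from sklearn.model_selection import train_test_split  # train/test splitting
```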
Now, before we move into the actual training and testing, we first need to load our data. What I did was place the housing.csv data in the sample data folder in Google Colab. That's the data you can download when you go to the dataset page — about 409 KB of housing data — which is exactly what I downloaded and then uploaded here to Colab. So, with housing.csv in this folder, I copy its path and create a variable that holds it: file_path is the string variable holding the path of the data. Then I take this file_path and pass it to pd.read_csv, the function we use to load data — pd stands for pandas, the short alias for the library, and read_csv is the pandas function — and within the parentheses we put the file_path. (If you want to learn more about the basics — variables, different data structures, some basic Python for data science — then, to keep this tutorial structured, I won't cover them here; feel free to check the Python for Data Science course first.) The loading cell looks roughly like this.
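A sketch of the two loading lines; the exact Colab path is my assumption — adjust it to wherever you uploaded the file:

```python
# Path of the uploaded file in Google Colab (assumed location)
file_path = "/content/sample_data/housing.csv"
data = pd.read_csv(file_path)  # read the CSV into a pandas DataFrame
```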
I will put the link um in the comments below such that you can uh learn that if you don't know yet and then you can come back to this tutorial to learn how you can use python in combination with linear regression so uh the first thing that I tend to do before moving on to the actual execution stage Is to um look into the data to perform data exploration so what I tend to do is to look at the data field so the name of the varibles that are available in the data and that you
can do by doing data. columns so you will then look into the columns in your data this will be the name of your uh uh data fields so let's go ahead and do command enter so we see that we have longitude latitude housing uncore median age we have toal rooms we have total bedrooms population so Basically the the um amount of people who are living in the in those households and in those houses then we have households then we have median income we have median housecore value and we have ocean proximity now you might notice
You might notice that the names of these variables are a bit different from the official documentation of the California housing dataset: the naming differs, but the underlying meaning is the same — the documentation simply presents nicer names. It is common in Python, when dealing with data, to see underscores used in abbreviated names, so we have housing_median_age where the documentation says house age: slightly different, but the meaning is the same — it is still the median house age in the block group. One more thing you can notice is that the official documentation doesn't include one extra variable that we have here: ocean_proximity, which describes the closeness of the house to the ocean — something that, for some people, can definitely mean an increase or decrease in the house price.
With all these variables in mind, the next thing I tend to do is look into the actual data, and one thing we can do is look at just the top 10 rows instead of printing the entire DataFrame. When we execute this part of the code, we see the top 10 rows of our data: the longitude, the latitude, the housing median age — values like 41 years, 21 years, 52 years, meaning the median age of the houses per block is 41, 21, 52 — and then the total rooms: we see that in one block the total number of rooms across its houses is 7,099, so we are already seeing data with large numbers, which is something to take into account when dealing with machine learning models, and especially with linear regression. Then we have total bedrooms, population, households, median income, median house value, and ocean proximity. One thing you can see right off the bat is that longitude and latitude have some unique characteristics — longitude is negative, latitude is positive — but that's fine for linear regression, because what it basically looks at is whether variation in certain independent variables, in this case longitude and latitude, causes a change in the dependent variable. A minimal first-look snippet is below.
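A sketch of this first look at the data:

```python
print(data.columns)   # longitude, latitude, housing_median_age, total_rooms, ...
print(data.head(10))  # the top 10 rows instead of the entire DataFrame
```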
Just to refresh our memory on what linear regression will do in this case: we are dealing with multiple linear regression, because we have more than one independent variable. Our independent variables are the different features that describe the house — everything except the house price, because median_house_value is the dependent variable. That's what we are trying to figure out: we want to see which features of the house cause, and therefore define, the house price; we want to identify the features that cause a change in our dependent variable, and specifically, what the change in median house value is if we apply a one-unit change to an independent feature. As we learned during the theory lectures, what linear regression does during causal analysis is hold all the other independent variables constant and then investigate, for a specific independent variable, what kind of change in the dependent variable a one-unit increase in that variable produces. For instance, if we change housing_median_age by one unit, what is the corresponding change in median house value, keeping everything else constant? In equation form, the model looks as follows.
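Written out — a standard formulation, not something shown on screen — the multiple linear regression and the ceteris paribus reading of its coefficients are:

```latex
\text{median\_house\_value}_i
  = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i,
\qquad
\frac{\partial\, \mathbb{E}[\text{median\_house\_value}_i]}{\partial\, x_{ij}} = \beta_j
```

Each coefficient β_j is the expected change in the median house value for a one-unit increase in feature x_ij, holding all other features constant.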
That's the idea behind multiple linear regression and how we use it for this specific case. Here we also want to find out the data types and learn a bit more about our data before proceeding to the next step, and for that I tend to use the info function in pandas: given that data is a pandas DataFrame, I just call data.info(), which shows the data type and the number of non-null values per variable. As we already noticed from the header, and as is confirmed here, ocean_proximity is a variable that is not numeric: you can see values like NEAR BAY which, unlike all the other values, are represented as strings. This is something we need to take into account, because later, when we do the data preprocessing and actually run this model, we will need to process this specific variable. For the rest we are dealing with numeric variables: longitude, latitude, and all the others, including our dependent variable, are numeric (float64). The only variable that needs to be taken care of is ocean_proximity, which, as we will also see later, is a categorical string variable, meaning it has a set of distinct categories. Let's actually check that very quickly and see all the unique values of this variable: if we take the variable's name — copying it from the overview here — and call unique() on it, we get the unique values of this categorical variable. There are five different unique values, so ocean_proximity can take five different values.
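A sketch of these two checks:

```python
data.info()  # dtypes and non-null counts; ocean_proximity is the only string column
print(data["ocean_proximity"].unique())
# ['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']  (order may differ)
```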
Those five values are NEAR BAY, <1H OCEAN (less than one hour from the ocean), INLAND, NEAR OCEAN, and ISLAND. So we are dealing with a feature that describes the distance of the block from the ocean, and the underlying idea is that maybe this specific feature has a statistically significant impact on the house value: it might be that for some people, in certain areas or countries, living near the ocean increases the value of a house. If there is strong demand for houses near the ocean — people prefer to live near the ocean — then we will most likely see a positive relationship; if there is a negative relationship, it means people in that area, in California for instance, do not prefer to live near the ocean, and houses further from the ocean will have higher values. This is something we want to figure out with this linear regression: we want to understand which features define the value of the house, so that we can say that if a house has certain characteristics, its price will most likely be higher or lower — and linear regression helps us not only to understand which features those are, but also how much
higher or lower the value of the house will be if it has certain characteristics, or if we increase a certain characteristic by one unit. Next we are going to look into the missing data. To have a proper machine learning model we need to do some data preprocessing, and for that we need to check for missing values and understand the number of NaN values per data field. This will help us decide whether we can simply remove the missing values or whether we need to do imputation: depending on the amount of missing data, we can pick the right solution. Here we can see that we don't have any NaN values for longitude, latitude, housing_median_age, or any of the other variables except one independent variable: total_bedrooms. Out of all the observations we have, the total_bedrooms variable has 207 cases without the corresponding information. When we express these numbers as percentages — something you should do as your next step — we see that, out of the entire dataset, only about 1% of the values for total_bedrooms are missing. This is really important, because simply looking at the raw count of missing observations per data field won't help you: you will not be able to tell, relatively, how much of the data is missing. If a certain variable is 50% or 80% missing, it means that for the majority of your house blocks you don't have that information; including it will not benefit your model, nor will it be accurate to include it, and it will result in a biased model — if the majority of observations carry no information and only certain observations do, you will automatically skew your results and
end up with biased results. Therefore, if a specific variable is missing for the majority of your dataset, I would suggest you simply drop that independent variable. In this case only about 1% of the house blocks are missing the information, which gives me the confidence to keep this independent variable and just drop the observations that have no total_bedrooms value. Another solution, instead of dropping the entire independent variable, is to use some sort of imputation technique, meaning we try to systematically find a replacement for the missing values: mean imputation, median imputation, or more advanced model-based statistical or econometric approaches. That is out of scope for this problem, but as a rule of thumb, look at the percentage of observations for which the independent variable is missing: if it is low — say, less than 10% —
and you have a large dataset, you should be comfortable dropping those observations; but if you have a small dataset — say only 100 observations, with 20% or 40% of them missing — then consider imputation: try to find values that can replace the missing ones. Once we have this information and have identified the missing values, the next step is to clean the data. What I'm doing here is taking our data and using the dropna function, which drops the observations where the value is missing: I drop all observations for which total_bedrooms has a null value, getting rid of my missing observations. After doing that, I check whether they are really gone: printing data.isnull().sum() — summing the number of missing (NaN) values per variable — now shows no missing observations left, so I have successfully deleted all of them. The snippet below sketches these cleaning steps.
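A sketch of the missing-data check and cleanup:

```python
# Missing values per column, as counts and as percentages of all rows
print(data.isnull().sum())
print(round(data.isnull().sum() / len(data) * 100, 2))  # total_bedrooms: ~1%

# Only ~1% of rows lack total_bedrooms, so we drop those rows
data = data.dropna(subset=["total_bedrooms"])
print(data.isnull().sum())  # verify that no missing values remain
```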
The next stage is to describe the data through descriptive statistics and data visualization. Before moving on to causal analysis or predictive analysis in any sort of traditional machine learning approach, first look into the data: try to understand it, see where you spot patterns, what the means of the different numeric data fields are, and whether certain categorical values create unbalanced data. These are things you can discover early on, before model training and testing, instead of blindly believing the numbers — data visualization and data exploration are a great way to understand the data you have before using it to train and test a machine learning model. Here I'm using the traditional describe function of pandas, data.describe(), which gives me the descriptive statistics of my data.
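The call itself is a one-liner:

```python
data.describe()  # count, mean, std, min, 25%/50%/75% percentiles, max per numeric column
```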
What we can see is that in total we have 20,433 observations, and per variable we see the same count, which means all variables have the same number of rows. Then we have the mean of each variable, the standard deviation (the square root of the variance), the minimum and the maximum, and also the 25th, 50th, and 75th percentiles. Percentiles and quantiles are statistical terms we use often: the 25th percentile is the first quartile, the 50th percentile is the second quartile — the median — and the 75th percentile is the third quartile. These percentiles help us understand the thresholds that split the observations: for instance, which values fall in the lowest 25% and which above it. As for the standard deviation, it helps us interpret the variation in the data on the
scale of the variable itself. Take the variable median_house_value: its mean is approximately 206,000, roughly 206K, and its standard deviation is 115K. What this means is that in the dataset we will find blocks whose median house value is around 206K plus 115K — about 321K — and blocks whose median house value is around 206K minus 115K — about 91K. That's the idea behind the standard deviation: the variation in your data. Next we can interpret the minimum and maximum of the data fields: the minimum tells you the smallest value per numeric data field and the maximum the largest, so together they give you the range of values you are dealing with.
In the case of median house value, this means the lowest median house value per block and the highest: looking at this aggregated data, which blocks have the cheapest houses by valuation, and which are the most expensive. We can see that in the cheapest block the median house value is about 15K — 14,999 — and the block with the highest valuation has a median house value of $500,001, meaning that across our blocks of houses, the median house value in the most expensive block tops out at roughly 500K. The next thing I tend to do is visualize the data, and I tend to start with the dependent variable — the variable of interest, the target or response variable, which in our case is
the median house value. This will serve as our dependent variable, and what I want to do is plot a histogram of it, to understand the distribution of median house values: when looking at the data, which median house values appear most frequently, and which blocks have rarer, less frequently appearing median house values. By making this type of plot you can see frequently appearing values, but also values lying outside the usual range, which helps you learn more about your data and identify outliers. Here I'm using the seaborn library — since I imported the libraries earlier, there is no need to import them again. First I set the style to whitegrid, meaning a white background with a grid behind the plot; then I initialize the figure size with plt (this comes from matplotlib's pyplot), setting it to 10 by 6. Then comes the main plot: I use the histplot function from seaborn, pick from the cleaned data — from which we removed the missing values — the variable of interest, median_house_value, and plot the histogram in forest green. Then I set the title of the figure, "Distribution of Median House Values", the x label (the name of the variable on the x-axis, median house value), and the y label (the variable on the y-axis, frequency), and finally call plt.show(), which displays the figure. That's basically how visualization works in Python: we first set the figure size, then call the plotting function and provide the data, then set the title and the x and y labels, and then say "show me the visualization". If you want to learn more about these visualization techniques, make sure to check the Python for Data Science course, which will walk you slowly and in detail through visualizing your data. The plotting cell looks roughly like this.
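A sketch of the histogram cell as narrated (style, size, color, and labels as described):

```python
sns.set_style("whitegrid")   # white background with a grid
plt.figure(figsize=(10, 6))  # initialize the figure size
sns.histplot(data["median_house_value"], color="forestgreen")
plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value")
plt.ylabel("Frequency")
plt.show()
```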
What we are visualizing here is the frequency of median house values across the entire dataset: the number of times each median house value appears. We want to understand whether certain median house values appear very often and whether others appear rarely — the rare ones can potentially be considered outliers, because we want to keep only the most relevant and representative data points: we want to derive conclusions that hold for the majority of our observations, not for outliers, and we will then use that representative data to run our linear regression and draw conclusions. Looking at this graph, we can see a cluster of median house values that appear quite often — the cases where the frequency is high. For instance, median house values of about 160-170K appear very frequently: you can see the frequency is above 1,000; those are the most frequently appearing median house values. There are also cases — you can see them at both ends — where a median house value does not appear very often, so its frequency is low; roughly speaking, those are unusual blocks and can be considered outliers, and the same holds for the blocks at the other end. In our population of California house blocks, you will most likely see blocks whose median value is between, let's say, 70K and 300-350K; anything below or above that is unusual — you don't often see house blocks with a median house value below 60-70K, or above 370-400K. Do
consider that we are dealing with data from 1990, not current prices — nowadays California houses are much more expensive — so take that into account when interpreting this type of data visualization. What we can then do is use the idea of the interquartile range to remove these outliers. This means we look at the 25th percentile, the first quartile (0.25), and at the 75th percentile, the third quartile, and use these two values to build data-driven thresholds below and above which observations count as outliers. We want to keep the so-called normal, representative blocks — the middle bulk of the data — and remove the blocks with extremely small and extremely large median house values. The statistical term for this is the interquartile range; you don't strictly need to know the name, but I think it
I select the 25th percentile using the quantile function from pandas: it finds the value that splits my block observations into the smallest 25% and the largest 75% with respect to the median house value. That gives us Q1, which we will use to remove the bottom tail, and in the same way Q3, the upper 25th percentile, which we will use to remove the very large median house values; to calculate the interquartile range we take Q3 and subtract Q1 from it. To understand this idea of Q1 and Q3, the quartiles, a bit better, let's actually print them. As you can see, Q1, the 25th percentile or first quartile, is equal to $119,500: the smallest 25% of the observations have a median house value below $119,500, and the remaining 75% have a median house value above it. Q3, the third quartile or 75th percentile, describes the threshold separating the lowest 75% of median house values from the most expensive top 25%, and that value is $264,700. So the top 25% of median house values, the ones above $264,700, are something we want to trim, together with the observations with the smallest median house values. It is common practice with the interquartile-range approach to multiply the IQR by 1.5 to obtain the lower and upper bounds, the thresholds we use to remove the blocks whose median house value is very small or very large: subtracting 1.5 times the IQR from Q1 gives the lower bound, and adding 1.5 times the IQR to Q3 gives the upper bound. After we clean these outliers from our data, we end up with a smaller data set: previously we had 20,433 observations and now we have 19,369, so we have removed a bit over 1,000 observations.
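To make this step concrete, here is a minimal sketch of the interquartile-range cleanup just described; the DataFrame name `data` and the exact variable names are assumptions mirroring the walkthrough, not a reproduction of its exact code:

```python
import pandas as pd

# First quartile (25th percentile) and third quartile (75th percentile)
q1 = data["median_house_value"].quantile(0.25)
q3 = data["median_house_value"].quantile(0.75)
iqr = q3 - q1  # interquartile range

# Common 1.5 * IQR rule for the outlier thresholds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(q1, q3, lower_bound, upper_bound)

# Keep only the "normal" blocks between the two bounds
data = data[(data["median_house_value"] >= lower_bound) &
            (data["median_house_value"] <= upper_bound)]
print(len(data))  # should show roughly a thousand rows removed
```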
Next, let's look into some other variables, for instance the median income. Another technique we can use to identify outliers in the data is the box plot; I wanted to showcase different approaches for visualizing the data and identifying outliers, so that you become familiar with several techniques. So let's go ahead and plot the box plot. A box plot is a statistical way to represent your data: the central box represents the interquartile range, the IQR, and its bottom and top edges indicate the 25th percentile (the first quartile) and the 75th percentile (the third quartile), respectively. The length of this dark box is therefore the middle 50% of your data for the median income, and the line inside the box, the one with the contrasting color, represents the median of the data set: the middle value when the data is sorted in ascending order. Then we have the whiskers in our box plot: these lines extend from the top and the bottom of the box and indicate the range of the rest of the data set excluding the outliers; they typically reach 1.5 times the IQR above Q3 and 1.5 times the IQR below Q1, exactly what we saw just before when removing the outliers from the median house value. To identify the outliers here, you can quickly see all these points lying more than 1.5 times the IQR above the third quartile, the 75th percentile. Those are blocks of houses with an unusually high median income, which is something we want to remove from our data, and we can use exactly the same approach we used for the median house value.
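A minimal sketch of how such a box plot might be drawn with seaborn, again assuming the DataFrame is called `data`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box spans Q1..Q3 (the IQR), the inner line is the median,
# and points beyond 1.5 * IQR from the box edges are plotted as outliers
sns.boxplot(y=data["median_income"])
plt.title("Box plot of median_income")
plt.show()
```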
We first identify the 25th percentile, the first quartile Q1, and then Q3, the third quartile or 75th percentile; we compute the IQR; and we obtain the lower bound and the upper bound using the same 1.5 scaling factor. We then use the lower and upper bounds as filters to keep, from the data, only the observations whose median income lies between the two bounds. We are performing double filtering, using two filters in the same row: as you can see, we use parentheses and the & operator to tell Python that, first, the observation must have a median income above the lower bound and, at the same time, the block must have a median income below the upper bound. If a block, an observation in the data set, satisfies both criteria, then we are dealing with a good, normal point, and we keep it as part of our new data. Let's go ahead and execute this code; in this case our outliers were on the high side, all lying in the upper part of the box plot, and we end up with the cleaned data. I'm taking this clean data and putting it back under data just for simplicity. This data is now much cleaner and a better representation of the population, which is exactly what we want: we want to find the features that describe and define the house value, not based on unique and rare houses that are too expensive or sit in blocks with very high-income people, but based on the true, most frequently appearing representation of the data. What are the features that define the house value for common houses, in common areas, for people with average or normal income? That's what we want to find.
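Here is the double-filter idea as a short code sketch, this time on median_income; the chained boolean mask with & and parentheses is the key pattern being described:

```python
# Same IQR recipe, now applied to median_income
q1 = data["median_income"].quantile(0.25)
q3 = data["median_income"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Two conditions combined with & in a single row: both must hold
data_clean = data[(data["median_income"] >= lower_bound) &
                  (data["median_income"] <= upper_bound)]
data = data_clean  # keep the cleaned data under the same name for simplicity
```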
The next thing I tend to do, especially for regression analyses and causal analyses, is to plot the correlation heat map. This gives us the correlation matrix, the pairwise correlation score for each pair of variables in our data. When it comes to linear regression, one of the assumptions we learned during the theory part is that we should not have perfect multicollinearity, which means there should not be a high correlation between pairs of independent variables: knowing one should not automatically tell us the value of another independent variable. If the correlation between two independent variables is very high, we might potentially be dealing with multicollinearity, which is something we do want to avoid; a heat map is a great way to identify whether we have this type of problematic independent variables and whether we need to drop one, or maybe several, of them to make sure we have a proper linear regression model whose assumptions are satisfied. Here we use seaborn to plot it. As you can see, the colors run from very light, almost white, to very dark green, where light means a strong negative correlation and very dark green means a very strong positive correlation. We know that the Pearson correlation can take values between -1 and 1, where -1 means a very strong negative correlation and 1 a very strong positive correlation. The correlation of a variable with itself, for instance longitude with longitude, is always equal to 1, which is why we see all ones on the diagonal: those are the pairwise correlations of the variables with themselves. Also, all the values below the diagonal are mirror images of the values above it, because the correlation between the same two variables is the same regardless of which one we put first: the correlation between longitude and latitude equals the correlation between latitude and longitude.
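A minimal sketch of how this heat map might be produced, under the same naming assumptions as before; the `numeric_only=True` flag is my own addition so that the string ocean_proximity column, still present at this point, does not break the correlation computation:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations; annot=True prints the number in each cell,
# and a "Greens" palette maps light -> negative, dark green -> positive
corr_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="Greens")
plt.show()
```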
Now that we have refreshed our memory on all this, let's look into the actual numbers in the heat map. We have this section where independent variables have a low positive correlation with the remaining independent variables: the light green values indicate a weak positive relationship between pairs of variables. One thing that is very interesting is the middle part of the heat map, where we have these dark cells; the numbers below the diagonal are what we interpret, and remember that below and above the diagonal are mirrors. Here we already see a problem, because we are dealing with variables that are going to be independent variables in our model and that have a high correlation with each other. Why is this a problem? One of the assumptions of linear regression, as we saw during the theory section, is that we should not have a multicollinearity problem: with perfect multicollinearity, we have independent variables so highly correlated that knowing the value of one automatically tells us the value of the other. When we have a correlation of 0.93, which is very high, or 0.98, those two independent variables have a very strong positive relationship, and this is a problem because it can cause our model to produce very large standard errors and an inaccurate, non-generalizable model; that's something we want to avoid, since we want the assumptions of our model to be satisfied. The variables in question are total_bedrooms and households, which means the number of total bedrooms per block and the number of households are highly positively correlated. Ideally, we want to drop one of these two independent variables, and the reason we can do that is that, given they are highly correlated, they already explain a similar type of information: they contain similar variation, so including both doesn't make much sense. On one hand it potentially violates the model assumptions, and on the other hand it doesn't even add much value, since one of them already shows the same variation; total_bedrooms basically contains a similar type of information as households, so we may as well just drop one of the two. The question is which one, and that is something we can decide by also looking at the other correlations here: total_bedrooms has a high correlation with households, but we can also see that total_rooms has a very high correlation with households, so there is yet another independent variable highly correlated with households, and total_rooms also has a high correlation with total_bedrooms. So we can check which variable most often has a high correlation with the rest of the independent variables, and the two largest numbers here both involve total_bedrooms: it has a 0.93 correlation with total_rooms and, at the same time, a very high 0.98 correlation with households. That means total_bedrooms has the highest correlations with the remaining independent variables, so we might as well drop it. Before you do that, though, I suggest one more quick visual check: look at the correlation of total_bedrooms with the dependent variable, to understand how strong a relationship it has with the response variable we are studying. We see that total_bedrooms has a correlation of only 0.05 with the response variable, the median house value, while total_rooms has a much higher one; so from here I already feel comfortable excluding and dropping total_bedrooms from our data, to make sure we are not dealing with multicollinearity. That's exactly what I'm doing here: I'm dropping total_bedrooms, and after doing that we no longer have total_bedrooms as a column.
Before moving on to the actual causal analysis, there is one more step I want to show you, which is super important for causal analysis and some introductory econometrics. When you have a string categorical variable, there are a few ways to deal with it. One easy way that you will see on the web is to encode the categories as numbers: transforming all these string values, NEAR BAY, <1H OCEAN, INLAND, NEAR OCEAN, and ISLAND, so that the ocean_proximity variable takes values such as 1, 2, 3, 4, 5. That is one way of doing it, but a better way, when using this type of variable in linear regression, is to transform the string categorical variable into what we call dummy variables. A dummy variable takes two possible values; it is a binary, Boolean variable that can be 0 or 1, where 1 means the condition is satisfied and 0 means it is not. Let me give you an example. In this specific case, ocean_proximity is a single variable with five different values. We will use the get_dummies function from pandas to go from this one variable to five different variables, one per category: a variable for whether the block is near the bay, one for whether it is less than one hour from the ocean, one for whether it is inland, one for whether it is near the ocean, and one for whether it is on the island. Each of these will be a separate binary dummy variable taking the values 0 and 1, which means we are going from one string categorical variable to five dummy variables; we create those five dummies, one for each category, combine them with the original data, and drop the ocean_proximity column from it. On one hand, we get rid of the string variable, which is problematic for linear regression with the stats models setup we are using here, since it cannot handle this type of data; on the other hand, we make our job easier when interpreting the results, because interpreting a linear regression for causal analysis is much easier with dummy variables than with one string categorical variable. To give you an example with these five dummy variables: if we look at one category, say ocean_proximity_INLAND, then for all the rows where the value equals 0 the criterion is not satisfied, meaning the house block we are dealing with is not inland; and for all the rows where ocean_proximity_INLAND equals 1, the criterion is satisfied and we are dealing with house blocks that are indeed inland. One thing to keep in mind when transforming a string categorical variable into a set of dummies is that you always need to drop one of the categories. The reason, as we learned during the theory, is that we should have no perfect multicollinearity: we cannot include five dummy variables that are perfectly correlated. If we include all of them, then whenever we know a block of houses is not near the bay, not less than one hour from the ocean, not inland, and not near the ocean, we automatically know it must be the remaining category, so ocean_proximity_ISLAND must equal 1 for all those blocks. That is exactly the definition of perfect multicollinearity, so to avoid violating one of the OLS assumptions we need to drop one of those categories.
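A minimal sketch combining both steps, the dummy expansion and the drop of one category; `dtype=int` is my own addition so the dummies come out as 0/1 integers rather than booleans, and the column names follow the pandas prefix convention:

```python
import pandas as pd

# Expand ocean_proximity into five 0/1 dummy columns
dummies = pd.get_dummies(data["ocean_proximity"],
                         prefix="ocean_proximity", dtype=int)
data = pd.concat([data.drop("ocean_proximity", axis=1), dummies], axis=1)

# Drop one category (here ISLAND) to avoid the dummy-variable trap,
# i.e. perfect multicollinearity among the dummies
data = data.drop("ocean_proximity_ISLAND", axis=1)
print(data.columns)
```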
That's exactly what I'm doing here. Let's first see the full set of dummy variables we got: less than one hour from the ocean, INLAND, ISLAND, NEAR BAY, and NEAR OCEAN. Let's drop one of them, say ISLAND, which we can do very simply: data equals data.drop, then the name of the variable within quotation marks, and then axis=1. In this way I'm dropping one of the dummy variables I created, so that the no-perfect-multicollinearity assumption is not violated. Once I print the columns, we should see that column disappearing; and here we go, we successfully deleted that variable. Let's also get the head: you can see that we no longer have a string in our data, but instead four additional binary variables built out of a string categorical variable with five categories. All right, now we are ready to do the actual work. When training a machine learning or statistical model, we learned during the theory that we always need to split the data into a train set and a test set; that is the minimum. In some cases we also do a train-validation-test split, so that we can train the model on the training data, optimize it on the validation data to find the optimal set of hyperparameters, and then apply the fitted and optimized model to the unseen test data. We are going to skip the validation set for simplicity, especially given that we are dealing with a very simple machine learning model, linear regression, and split our data into train and test. First, I create a list with the names of the variables we are going to use to train our machine learning model.
We have a set of independent variables and a dependent variable. In our multiple linear regression, the independent variables are longitude, latitude, housing_median_age, total_rooms, population, households, median_income, and the four dummy variables we built from the categorical variable. Then I specify that the target, the response or dependent variable, is the median house value: this is the variable we want to model, because we want to see which features, out of the set of all independent variables, have a statistically significant impact on the dependent variable. We want to find out which features describing the houses in a block cause a change, a variation, in the target variable, the median house value. So we set X equal to the data with all the feature columns, and the target, the median house value, is the column we select from the data as y; we are simply doing data filtering and selection here. I then use the train_test_split function from scikit-learn; you might recall that in the beginning we imported the model_selection module and, from it, the train_test_split function. This is a function you are going to need quite a lot in machine learning, because it is a very easy way to split your data. The first argument of this function is the matrix or data frame containing the independent variables, in our case X, so you fill in X; the second argument is the dependent variable, y; and then we have test_size, which is the proportion of observations you want to put in the test set rather than in the training set. If you pass 0.2, it means you want your test set to be 20% of your entire data and the remaining 80% to be your training data; so if you provide 0.2 to this argument, the function automatically understands that you want this 80/20 division, 80% training and 20% test. Finally, you can also add the random_state. The splitting is random, the data is going to be randomly sampled from the entire data set, so to ensure that your results are reproducible, that the next time you run this notebook you get the same results, and also that you and I get the same results, we use a random state; 111 is just a number that I liked and decided to use here. When we run this command, you can see that we have a training set of about 15K observations and a test set of about 3.8K; looking at these numbers gives you a verification that you are indeed dealing with the 80% versus 20% split.
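A minimal sketch of this selection and split, with variable names that are my own shorthand for what the walkthrough describes:

```python
from sklearn.model_selection import train_test_split

# Features = every column except the target
features = [c for c in data.columns if c != "median_house_value"]
X = data[features]                  # independent variables
y = data["median_house_value"]      # dependent (target) variable

# 80/20 split; random_state fixes the shuffle so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111)
print(X_train.shape, X_test.shape)  # roughly 80% vs 20% of the rows
```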
Then we move on to the training. One thing to keep in mind is that here we are using the sm library, the statsmodels.api module that we imported; this is one option we can use to conduct our causal analysis and train the linear regression model. This library does not automatically add an initial column of ones to your set of independent variables: it only looks at the features you have provided, and those are all the independent variables. But we learned from the theory that in linear regression we always add the intercept, the beta 0; if you go back to the theory lectures, you will see this beta 0 added to both the simple linear regression and the multiple linear regression. The intercept tells us the average value of the dependent variable, in this case the median house value, when all the other features are equal to zero. Given that statsmodels.api does not add this constant column for the intercept, we need to add it manually, so we call sm.add_constant on X_train, which means our X data frame now gets a column of ones added to the features. Let me actually show you what this looks like before doing the training, because I think it is something you should be aware of: I print X_train_const and also the same feature data frame before adding the constant, so you can see what I mean. The first one is just the set of all columns that form the independent variables, the features; after adding the constant, you can see that we now have this initial column of ones. This is what allows us to estimate beta 0, the intercept, and perform a valid multiple linear regression; otherwise you don't have an intercept, and that is just not what you are looking for. The scikit-learn library does this automatically, so when you use statsmodels.api you should add the constant yourself, while with scikit-learn you don't add it. And if you are wondering why we use this specific package, as we already discussed, just to refresh your memory: we use statsmodels.api because it has the nice property of printing a summary of your results, your p-values, your t-tests, your standard errors, which is exactly what you are looking for when performing a proper causal analysis and identifying the features that have a statistically significant impact on your dependent variable. If you are using a machine learning model, including linear regression, only for predictive analytics, then you can use scikit-learn without worrying about statsmodels.api.
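A short sketch of the constant step; `X_train_const` is my own shorthand name:

```python
import statsmodels.api as sm

# statsmodels does not add the intercept column automatically,
# so we prepend a column of ones; scikit-learn fits an intercept by default
X_train_const = sm.add_constant(X_train)
print(X_train_const.head())  # note the new "const" column of ones
```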
So much for adding a constant; now we are ready to actually fit, or train, our model. We use sm.OLS, where OLS is the ordinary least squares estimation technique that we also discussed as part of the theory, and we provide first the dependent variable, y_train, and then the feature set, X_train_const. Then we call .fit(), which means: take the OLS model, use y_train as my dependent variable and X_train_const as my independent variable set, and fit the OLS linear regression on this specific data. If you are wondering why y_train or X_train, and what the difference is between train and test, make sure to go back and revisit the theory lectures, because there I go in detail into this concept of training and testing and how we divide the data into train and test. The y and X, as we have already discussed during this tutorial, is simply the distinction between the independent variables, defined by X, and the dependent variable, defined by y: y_train and y_test are the dependent-variable data for the training and test sets, and X_train and X_test are simply the training and test features. We use X_train and y_train to fit our model, to learn from the data; then, when it comes to evaluating the model, we take the fitted model, which has learned from both the dependent variable and the independent variable set, apply it to the unseen data X_test, obtain the predictions, and compare them to the true y, y_test, to see how different y_test is from the predictions for this unseen data, and so to evaluate how well the model manages to identify and predict the median house values based on the fitted model and unseen data. That was just background information and a refresher; here we simply fit the model on the training dependent variable and the training independent variables with the added constant, and then we are ready to print the summary. Let's now interpret those results. The first thing we can see is that all the coefficients, all the independent variables, are statistically significant. How can I say this? Well, if we look at the column of p-values, which is the first thing you need to check when you get the results of a causal analysis with linear regression, we see that every p-value is very small. Just to refresh our memory: the p-value is the probability of obtaining a test statistic at least as extreme as the one you observed purely by random chance, under the assumption that the null hypothesis is true; if it is very small, the statistically significant result you are seeing is unlikely to be just random chance, and you reject the null hypothesis. Another thing you can do first is verify that you have used the correct dependent variable: you can see in the header that the dependent variable is the median house value, the model used to estimate the coefficients is OLS, and the method is least squares, which is simply the underlying technique of minimizing the sum of squared residuals.
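For reference, a minimal sketch of the fit-and-summarize step just described:

```python
import statsmodels.api as sm

# Ordinary least squares: y_train is the dependent variable,
# X_train_const the feature matrix with the added intercept column
model = sm.OLS(y_train, X_train_const)
model_fitted = model.fit()
print(model_fitted.summary())  # coefficients, std errors, t-stats, p-values
```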
The date we ran this analysis is the 26th of January 2024. We have the number of observations, which is the number of training observations, the 80% of our original data. We have the R-squared, the metric that shows the goodness of fit of your model: R-squared is commonly used in linear regression specifically to identify how well your model is able to fit your data with the linear regression line. The maximum of R-squared is one and the minimum is zero; 0.59 here means that all the independent variables you have included are able to explain 59% of the entire variation in your response variable, the median house value. What does this mean? On one hand, it means you have reasonably good information, since anything above 0.5 is quite good: you can explain more than half of the entire variation in the median house value. On the other hand, it also means that roughly 40% of the variation, of the information about the house values, is not in your data, so you might consider going and looking for extra information, additional independent variables to add on top of the existing ones, in order to increase the amount of variation you are able to explain with your model. The R-squared is probably the best single way to express the quality of your regression model. Next we have the adjusted R-squared; in this specific case, as you can see, the adjusted R-squared and the R-squared are the same, 0.59, which usually means you are fine in terms of the number of features you are using. Once you overwhelm your model with too many features, you will notice the adjusted R-squared diverging from the R-squared: the adjusted R-squared helps you understand whether your model performs well only because you keep adding variables or because they really contain useful information, since the R-squared will automatically increase just because you add more independent variables, even when those variables are not useful and only add to the complexity of the model, possibly overfitting it without providing any added information. Then we have the F-statistic, which corresponds to the F-test from statistics; you don't strictly need to know it, and I would say check out the Fundamentals of Statistics course if you do want to, but in short it tests whether all the independent variables jointly help explain your dependent variable, the median house value. If the F-statistic is very large, or the p-value of your F-statistic is very small, like 0.00 here, it means all your independent variables are jointly statistically significant: together they help explain your median house value and have a statistically significant impact on it, which means you have a good set of independent variables. Then we have the log-likelihood, not super relevant in this case, and the AIC and BIC, which stand for the Akaike information criterion and the Bayesian information criterion; those are also not necessary to know for now, but as you advance in your machine learning career it can be useful to understand them at a high level.
For now, think of them as values that help you judge the information gained when adding a set of independent variables to your model; this is optional, so ignore it for now if you don't know it. Okay, let's now get into the fun part. In the middle part of the summary table we have, first, the set of independent variables: the constant, which is the intercept, then longitude, latitude, housing_median_age, total_rooms, population, households, median_income, and the four dummy variables we created. Then we have the coefficients corresponding to those independent variables: these are the beta 0 hat, beta 1 hat, beta 2 hat, etc., the parameters of the linear regression model that our OLS method has estimated based on the data we provided. Before interpreting the independent variables, the first thing you need to do, as I mentioned in the beginning, is to look at the p-value column, which shows which of the independent variables are statistically significant; the table you get from statsmodels.api uses a 5% significance level, so alpha, the threshold of statistical significance, equals 5%, and any p-value smaller than 0.05 means you are dealing with a statistically significant independent variable. The next thing you can see, to the left, is the t-statistic: each p-value is based on a t-test. This t-test, as we learned during the theory (and you can also check the Fundamentals of Statistics course from LunarTech for a more detailed understanding of this test), states a hypothesis about whether each independent variable individually has a statistically significant impact on the dependent variable, and whenever the t-test gives a p-value smaller than 0.05, you are dealing with a statistically significant independent variable. In this case we are super lucky: all our independent variables are statistically significant. Then the question is whether the significant effect is positive or negative, which you can see from the signs of the coefficients: longitude has a negative coefficient, latitude a negative coefficient, housing_median_age a positive coefficient, and so on. A negative coefficient means that the independent variable causes a negative change in the dependent variable. More specifically, let's look at, say, total_rooms, whose coefficient is -2.67: when we increase the total number of rooms by one additional unit, one more room added to total_rooms, the median house value decreases by 2.67. Now you might be wondering how this is possible. Well, first of all, the coefficient is quite small, so on one hand the relationship between them is not super strong, given the small magnitude of this coefficient; but on the other hand you can explain it: at some point, adding more rooms just doesn't add any value, and in some cases it even decreases the value of the house. That might be the case; at least it is the case based on this data. So with a negative coefficient, a one-unit increase in that specific independent variable, all else constant, results in, in the case of total_rooms, a 2.67 decrease in the median house value, everything else constant. We also refer to this as the ceteris paribus assumption in econometrics, which means everything else held constant.
One more time, to refresh our memory and make sure we are clear on this: if we add one more room to the total number of rooms, then the median house value will decrease by $2.67, and this is when the longitude, latitude, housing median age, population, households, median income, and all the other characteristics stay the same. So with this negative value, we get a decrease in the median house value when there is a one-unit increase in our total number of rooms. Now let's look at the opposite case, where the coefficient is actually positive and large, which is the housing median age. This means that if we have two house blocks with exactly the same characteristics, the same longitude and latitude, the same total number of rooms, population, households, and median income, the same distance from the ocean, and one of them has one additional year added to the housing median age, so it is one year older, then the house value of that specific block is higher by $846. The block with one more year of median age has an $846 higher median house value compared to the one that has all the same characteristics except a housing median age that is one year less; one additional year in the median age results in an $846 increase in the median house value, everything else constant. So that covers the idea of negative versus positive coefficients and their magnitudes. Now let's look at one more variable and explain the idea behind it and how we can interpret it; it is a good way to understand how dummy variables can be interpreted in the context of linear regression.
One of the independent variables is ocean_proximity_INLAND, and its coefficient is equal to -2.108e+05, which simply means approximately -210K. What this means is the following: suppose we have two house blocks with exactly the same characteristics, the same longitude and latitude, the same housing median age, the same total number of rooms, population, households, and median income, with a single difference: one block is located inland, in terms of ocean proximity, and the other block is not. Recall that the reference category, the one we removed, was ISLAND. If the block of houses is inland, then its median house value is on average 210K lower compared to a block of houses with exactly the same characteristics that is not inland, for instance one that is on the island. So with dummy variables, there is always an underlying reference category, the one you deleted from your string categorical variable, and you need to interpret each dummy relative to that specific category. This might sound complex, but it actually is not; it is just a matter of practice and of understanding the dummy-variable approach: you either satisfy the criterion or you don't. In this specific case, it means that if you have two blocks of houses with exactly the same characteristics, and one block is inland while the other is not, for instance it is on the island, then the inland block will have, on average, a $210,000 lower median house value compared to the island block, in terms of ocean proximity. This kind of makes sense, because in California people might prefer living in island locations, and houses there might be in higher demand than in the inland locations.
To summarize: longitude has a statistically significant impact on the median house value; latitude and housing median age each cause a statistically significant difference in the median house value when they change; the total number of rooms has an impact; and population, households, median income, and the proximity to the ocean do as well. This is because all their p-values are zero, smaller than 0.05, which means they all have a statistically significant impact on the median house value in the California housing market. When it comes to interpretation, we have interpreted just a few of them, for the sake of simplicity and to keep this case study from taking too long, but what I suggest you do is interpret all of the coefficients here: we covered the housing median age and the total number of rooms, but you can also interpret the population and the median income, and we interpreted one of the dummy variables, so feel free to interpret the other ones too. By doing this, you could even build an entire case-study write-up, explaining in one or two pages the results you obtained, which will showcase that you understand how to interpret linear regression results. Another thing I suggest you do is add a comment on the standard errors, so let's now look into those.
Looking at the standard errors, we can see that we are making a huge standard error, and this is the direct result of the fourth assumption being violated. This case study is super important and useful precisely because it showcases what happens when some of your assumptions are satisfied and some are violated. In this specific case, the assumption that the errors have a constant variance is violated, so we have a heteroscedasticity issue, and that is something we see reflected in our results. It is a very good example of a case where, even without formally checking the assumptions, you can already see that the standard errors are very large, which gives a strong hint that heteroscedasticity is most likely present and the homoscedasticity assumption is violated. Keep this idea of large standard errors in mind, because we are going to see that it also becomes a problem for the performance of the model, and we will obtain a large prediction error because of it. One more comment, regarding total_rooms and housing_median_age: in some cases the linear regression results might not seem logical, but sometimes there is an underlying explanation that can be provided, or maybe your model is simply overfitting or biased; that is also possible, and it is something you can investigate by checking your OLS assumptions. Before we go to that stage, I want to briefly showcase the idea of predictions.
We have now fitted our model on the training data, and we are ready to perform the predictions. We take the fitted model and use the test data, X_test, to perform the predictions: to estimate new median house values for the blocks of houses for which we are not providing the corresponding median house price. On this unseen data, we apply the model we have already fitted, and we want to see the predicted median house values; we can then compare these predictions to the true median house values, which we have but are not yet exposing, to see how good a job our model does at estimating and finding these unknown median house values for the test data, all the blocks of houses whose characteristics we provide in X_test without providing y_test. As usual, just like in training, we add a constant with this library, and then we call model_fitted.predict, providing the test data; those are the test predictions. Once we do this and print them, you can see we get a list of house values: the predicted median house values for the blocks of houses included in the test data, the 20% of our entire data set.
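A minimal sketch of this prediction step, continuing the same naming assumptions:

```python
import statsmodels.api as sm

# Add the same constant column to the test features,
# then predict median house values for the unseen test blocks
X_test_const = sm.add_constant(X_test)
test_predictions = model_fitted.predict(X_test_const)
print(test_predictions.head())
```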
Like I mentioned just before, to make sure your model performs well you need to check the OLS assumptions. During the theory section we learned that there are several assumptions your model and your data should satisfy for OLS to provide unbiased and efficient estimates, meaning the estimates are accurate and their standard errors are low, something we also see as part of the summary results. The standard error is a measure that shows how efficient your estimates are: if the coefficients you see in this table could vary a lot, all the way from one place to another, so the range is very large, then you don't have accurate coefficients, your standard error will be very large, and this is a bad sign; if instead you are dealing with an accurate, more precise estimation, the standard error will be low. An unbiased estimate means your estimates are a true representation of the pattern between each independent variable and the response variable. If you want to learn more about this idea of bias, unbiasedness, and efficiency, make sure to check the Fundamentals of Statistics course at LunarTech, because it explains these concepts clearly and in detail; here I assume you know them, at least at a high level. Let's now quickly check the OLS assumptions. The first assumption is linearity, which means your model must be linear in parameters.
One way of checking that is to use your already fitted model and its predictions: y_test, the true median house values for the test data, and the test predictions, the predicted median house values for unseen data. You use the true values and the predicted values to plot them, together with the best-fit line you would get in an ideal situation where your model makes no error and returns the exact true values, and then you see how linear that relationship is. If the pattern of observed versus predicted values, where observed means the real test y's and predicted means the test predictions, is roughly linear and matches that perfect line, then assumption one is satisfied: you can say your data and your model are indeed linear in parameters.
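A short sketch of this linearity check; the 45-degree reference line is my own way of drawing the "perfect prediction" line described above:

```python
import matplotlib.pyplot as plt

# Observed vs predicted: points hugging the 45-degree line
# suggest the linear-in-parameters assumption is reasonable
plt.scatter(y_test, test_predictions, alpha=0.3)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color="red")  # perfect-prediction reference line
plt.xlabel("Observed median house value")
plt.ylabel("Predicted median house value")
plt.show()
```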
Then we have the second assumption, which states that your sample should be random; this basically translates into the expectation of your error terms being equal to zero. One way of checking this is simply to take the residuals from your fitted model, model_fitted.resid, and compute their average, which is a good estimate of the expectation of the errors; the residuals are the estimates of your true error terms. Here I just round to two decimals. We get the average residual, and since this number equals zero, the mean of the residuals in our model is zero, we can say that the expectation of the error terms, or at least its estimate via the residuals, is indeed equal to zero. Another way of checking this second assumption, that the model is based on a random sample and so the expectation of the error terms equals zero, is to plot the residuals versus the fitted values: we take the residuals from the fitted model, compare them against the fitted values that come from the model, and look at this scatter plot you can see here, checking whether the pattern is symmetric around the zero line. You can see that the zero line runs right through the middle of the pattern, which means that on average the residuals sit around zero, so the mean of the residuals is zero, exactly what we calculated before; therefore we can say we are indeed dealing with a random sample. This plot is also super useful when it comes to the fourth assumption, which we will get to a bit later.
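A sketch of both residual checks just described, using the statsmodels result object:

```python
import matplotlib.pyplot as plt

# Zero-mean-error check: the residuals' average should be ~0
residuals = model_fitted.resid
print(round(residuals.mean(), 2))

# Residuals vs fitted values: the cloud should be symmetric around zero
plt.scatter(model_fitted.fittedvalues, residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```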
For now, let's check the third assumption, which is exogeneity. Exogeneity means that each of our independent variables should be uncorrelated with the error terms: there is no omitted variable bias, and there is no reverse causality, meaning the independent variable has an impact on the dependent variable but not the other way around; the dependent variable should not cause the independent variable. There are a few ways we can check this. One straightforward way is to compute the correlation coefficient between each of the independent variables and the residuals you obtained from your fitted model; this is a simple, quick technique to understand the correlation between each independent variable and the residuals, which are the best estimates of your error terms, and in this way you can see whether there is any correlation between your independent variables and your error terms. Another way, more advanced and a bit more on the econometric side, is to use the Durbin-Wu-Hausman test: a more formal, more professional econometric test to find out whether the exogeneity assumption is satisfied, or whether you have endogeneity, which means one or more of your independent variables is potentially correlated with your error terms. I won't go into the details of this test; I'll put some explanation here, and feel free to check any introductory econometrics course to learn more about the Durbin-Wu-Hausman test for the exogeneity assumption.
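A minimal sketch of the simple correlation check described above (not the Durbin-Wu-Hausman test itself); the `astype(float)` is my own safeguard for the 0/1 dummy columns:

```python
# Exogeneity check: each feature's correlation with the residuals
# (the residuals proxy the true error terms) should be close to zero
for col in X_train.columns:
    corr = X_train[col].astype(float).corr(model_fitted.resid)
    print(f"{col}: {corr:.4f}")
```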
The fourth assumption we will talk about is homoscedasticity. The homoscedasticity assumption states that the error terms should have a constant variance: when we look at the variation of the errors the model makes across different observations, that variation should be roughly constant. What we see instead is that for some observations the residuals are a bit small and for others they are large; this figure shows what we call heteroscedasticity, which means the homoscedasticity assumption is violated: our error terms do not have a constant variance across all observations, and the variance differs from observation to observation. Since we have this heteroscedasticity issue, we should consider somewhat more flexible approaches, like GLS, FGLS, or GMM, which are more advanced econometric algorithms. The final part of this case study will show you how you can do all of this on the traditional machine learning side, using scikit-learn.
Here I use the StandardScaler function to scale my data, because we saw in the summary table we got from statsmodels.api that our data is on a very large scale: the median house values are large numbers, and the median age of the houses sits on its own scale. That is something you want to avoid when using linear regression as a predictive analytics model. When you use it for interpretation purposes, you should keep the original scales, because it is easier to interpret the values and understand the difference in the median house price when comparing different characteristics of the blocks of houses; but when you use it for predictive analytics, where you really care about the accuracy of your predictions, you need to scale your data and make sure it is standardized. One way of doing that is the StandardScaler function in sklearn.preprocessing. I initialize the scaler by calling StandardScaler(), which I just imported from the scikit-learn library, and then I call scaler.fit_transform on X_train, which takes the independent variables and scales and standardizes the data. Standardization simply means we are putting the features on a common scale so that some large values do not wrongly influence the predictive power of the model: the model is not confused by the large numbers into finding spurious variation, but instead focuses on the true variation in the data, on how much a change in one independent variable causes a change in the dependent variable. Given that we are dealing with a supervised learning algorithm, X_train_scaled will contain our standardized training features, the independent variables, and X_test_scaled will contain our standardized test features: the unseen data the model will not see during training, only during prediction. We will also use y_train, which is the dependent variable in our supervised model and corresponds to the training data.
We will also use y_train; y_train is the dependent variable in our supervised model, and it corresponds to the training data. We first initialize the linear regression model from scikit-learn; this is just the empty LinearRegression model. Then we take this initialized model and fit it on the training data: X_train_scaled, the training features, and y_train, the dependent variable from the training data. Do note that I'm not scaling the dependent variable; this is common practice, because you don't want to standardize your dependent variable, rather you want to ensure that your features are standardized. What you care about is the variation in your features, and ensuring that the model doesn't mess up when it's learning from those features, less so when it comes to looking at the impact of those features on your dependent variable. So I am fitting the model on this training data, the features and the dependent variable, and then I'm using this fitted model, lr, which has already learned from these features and the dependent variable during supervised training, together with X_test_scaled, the standardized test data, in order to perform the prediction: to predict the median house values for the test data, the unseen data. And you can notice that in no place am I using y_test; y_test, the true values of the dependent variable, I keep to myself, such that I can then compare them to the predicted values and see how well my model was actually able to predict. Now let's do one more step: I'm importing from scikit-learn the metrics, such as mean_squared_error, and I'm using the mean squared error to find out how well my model was able to predict those house prices. This tells us that on average we are making an error of about 59,000 dollars on the median house prices, which, depending on what we consider large or small, is something that we can look into. A sketch of this fit, predict, and evaluate step follows.
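A minimal sketch of the training and evaluation steps just described, assuming the scaled matrices from the previous sketch and y_train/y_test from the split. One caution: mean_squared_error returns squared units, so the "average error in dollars" reading corresponds to its square root, the RMSE.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

lr = LinearRegression()              # the empty, untrained model
lr.fit(X_train_scaled, y_train)      # learn from the training data

y_pred = lr.predict(X_test_scaled)   # predict on unseen test data

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                  # back in the target's units (dollars)
print(rmse)                          # roughly the ~59,000 error mentioned
```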
Like I mentioned in the beginning, the idea behind using linear regression in this specific course is not to use it for pure traditional machine learning, but rather to perform causal analysis and to see how we can interpret it. When it comes to the quality of the predictive power of the model, if you want to improve it, that can be considered a next step: you can check whether your model is overfitting, and then the next step could be to apply, for instance, Lasso regularization, so Lasso regression, which addresses the overfitting. You can also consider going back and removing more outliers from the data; maybe the outliers that we removed were not enough, so you can apply that as well. Another thing you can do is consider somewhat more advanced machine learning algorithms, because it can be that, although the regression assumptions are satisfied, using more flexible models like random forests, decision trees, or boosting techniques will be more appropriate and will give you higher predictive power. Consider also working more with the scaled or normalized version of
your data. As the next step in your machine learning journey, you can consider learning more advanced machine learning models. Now that you know in detail what linear regression is, how you can use it, and how you can train and test a machine learning model, a simple yet very popular one, and you also know what logistic regression is and all these basics, you are ready to go on to the next step, which is learning all the other popular traditional machine learning models. Think about learning decision trees for modeling nonlinear relationships; think about learning bagging, boosting, random forests, and different sorts of optimization algorithms like gradient descent, SGD, SGD with momentum, Adam, AdamW, RMSProp, what the difference is between them and how you can implement them; and also consider learning clustering approaches like K-means, DBSCAN, and hierarchical clustering. Doing this will help you get more hands-on and take the next step when it comes to machine learning. Once you have covered all these fundamentals, you are ready to go one step further, which is getting into deep learning. Thank you for
watching this video. If you like this content, make sure to check out all the other videos available on this channel, and don't forget to subscribe, like, and comment to help the algorithm make this content more accessible to everyone across the world. If you want free resources, make sure to check the free resources section at lunartech.ai, and if you want to become a job-ready data scientist and are looking for an accessible bootcamp that will help you make it as a job-ready data scientist, consider enrolling in the Data Science Bootcamp, the Ultimate Data Science Bootcamp, at lunartech.ai. You will learn all the theory and the fundamentals to become a job-ready data scientist, you will implement the learned theory in multiple real-world data science projects, and besides this, after learning the theory and practicing it with the real-world case studies, you will also prepare for your data science interviews. And if you want to stay up to date with the recent developments in tech, the headlines that you may have missed in the last week, the open positions currently on the market across the globe,
and the tech startups that are making waves, make sure to subscribe to the data science and AI newsletter from LunarTech. Interested in machine learning or data science? Then this course is for you. Today we will build a movie recommender system using feature selection, count vectorization, and cosine similarity, and at the end of the video we will create a web app using Streamlit. Building projects is one of the most effective ways to thoroughly learn a concept and develop essential skills. This guide will walk you through building a movie recommendation system that's tailored
to user preferences. We will leverage a 10,000-movie dataset as our foundation. While the approach is intentionally simple, it establishes the core building blocks common to the most sophisticated recommenders in the industry; think Netflix, Spotify, or others. We will harness the power and versatility of Python to manipulate and analyze our data: the pandas library will streamline data preparation, and scikit-learn will provide robust machine learning algorithms via the count vectorizer and cosine similarity. User experience is key, so we will design an intuitive web application for effortless movie selection and recommendation display. At the end
of the video you will have developed a data-driven mindset and an understanding of the essential steps in building a recommendation system; we will master core machine learning techniques, diving into data manipulation, feature engineering, and machine learning for user recommendations, and create a user-centric solution that delivers a seamless experience for personalized movie suggestions. So let's get started. All right, let's now go over the dataset we'll be using for the movie recommender system. We are using the TMDB movies dataset from Kaggle. This dataset is crucial for developing a system that recommends films tailored to your
preferences and introduces you to new titles. We selected this dataset for its comprehensive movie data: it includes the ID, which is essential for movie identification, the title, the genre, the original language, and many other features, but the key features that we will focus on are the ID and title, the genre, and the overview; and on top of this, we will combine the overview and genre together into tags. We have also selected this dataset because of its size: it contains around 10K top-rated TMDB movies and, as you can see, many features. So let's
move on to the next chapter, which is feature engineering. Okay, so now let's go over the features we'll be using. Features in a recommender system are essentially the data points you use to make decisions about what to recommend. These features help in identifying similarities between movies, which is crucial for generating personalized recommendations. For your system to be effective, it's vital to select the few features that offer meaningful insight into the content of the movies and the preferences of the users, so you have to be careful with which features you choose. For our movie
recommender system we will focus on several key features. ID: this serves as a unique identifier for each movie, crucial for indexing and retrieving movie information accurately. Title: the most basic yet essential feature, enabling users to identify movies by their names. Genre: this categorizes movies into different groups, facilitating recommendations based on content similarity and user preferences; the genre plays a pivotal role in personalization. And of course the overview: a brief summary of the movie's plot, giving access to a rich source for content-based filtering through NLP. So we will be using that overview combined with
the genre to create a very comprehensive descriptor for each movie, which lets you recommend movies more accurately. Combining the overview with the genre into a single tags feature gives us a fuller picture of each movie; this combination helps the system to better analyze and find movies that are similar in theme, story, or style. For example, let's consider a movie like Inception. Its overview might be something like "a thief who steals corporate secrets through dream-sharing technology is tasked with planting an idea into the mind of a CEO", and the genre could be listed
as action, science fiction, adventure. If you combine this with the text, you get a much fuller picture: "action science fiction adventure thief corporate secrets dream-sharing technology planting an idea mind of a CEO", for which the recommendation might be something like The Matrix. The main point is that when you combine the overview with the genre, you get a much more sophisticated feature that you can convert into a much more sophisticated numerical data point, which you can use to recommend movies better. And before using the
overview and genre data, for The Lord of the Rings for example, we would remove "the" and "of". We pre-process the selected features because many movies include stop words in their title, genre, or overview; those are words that don't contribute anything, so we remove them and clean our data. Those are, for example, words such as "the", "and", "in", "his", and other words as well. So we have done our feature selection, and now let's move on to content-based versus collaborative
filtering recommender systems. Let's explore the key recommender systems that are currently being used by Netflix, Amazon, and other big tech companies. There are two main recommender systems: content-based and collaborative filtering recommender systems. Let's start with the content-based recommender system. A content-based recommender system only uses the features and the overview of the movies that you have liked in order to recommend similar movies. So let's say you talk with a friend and you say: I liked Iron Man because of its features, such as
the director, the genre, the cast, or the overview. It will recommend similar movies based on those features; it won't use any other data, for example what other people have liked, what their ratings of other movies are, or what ratings you have given to other movies. It's based solely on the features of the movies that you have liked previously. So, for example, if you've enjoyed Inception, a content-based system might suggest Interstellar, because both movies share a similar director and a complex narrative structure, genre, and overview. Now let's go on
to the collaborative filtering recommender system. On Netflix you often see, for example, that if you have watched and enjoyed Stranger Things, Netflix might recommend The Witcher to you, because other users who liked Stranger Things also enjoyed The Witcher. What's important to note here is that it doesn't use the features, the overview, or the other informative data points; it only uses what other users have liked, what other users with similar preferences to yours have also liked. That's the difference: the recommendation is made based on the viewing habits and the preferences of a
larger group of viewers with similar taste to yours. This method doesn't rely on item features but on the wisdom of the crowd: it uses patterns of ratings or interactions from many users to predict what an individual might like. For example, if users who liked The Avengers also enjoyed Guardians of the Galaxy, you might receive a recommendation for Guardians of the Galaxy if you liked The Avengers. While both systems are effective on their own, combining them would enhance the accuracy of the recommender system; so, for example, you start
off with a content-based recommender system, but once you start getting more data, you can also use a collaborative filtering recommender system to provide more accurate recommendations. In this session we will focus on the crucial element of transforming text into numerical vectors. Our models are not able to be trained on raw text, but they are able to be trained on numerical vectors, which means that we have to convert our text into a vector. To do that we use CountVectorizer. This method simplifies text analysis by ignoring the order of words and instead
focusing on their frequency. By translating text into numerical data, we will be able to classify documents, a vital function that allows our systems to process and organize large amounts of text data efficiently. We aim to provide a straightforward, practical understanding of this essential technique, so that we can move on to our next chapter, which is cosine similarity. To give a targeted example, let's say we are considering three movie overviews: "a fun action-packed adventure", "fun adventure movies inspire me", and "we both love heart-racing adventures". If we list the words used across them as [fun, action, packed, adventure, movies, inspire, me, we, both, love, heart, racing], we can count how often each word appears in each sentence. For "a fun action-packed adventure" we use fun once, action once, packed once, and adventure once, and none of the rest, so our vector is 1 1 1 1 0 0 0 0 0 0 0 0. For the second example, "fun adventure movies inspire me", we use fun once, no action or packed, adventure once, movies once, inspire once, and me once, and the rest we don't use, which means we get the vector 1 0 0 1 1 1 1 0 0 0 0 0; this is our second vector. You can do the same for "we both love heart-racing adventures", but I think you get the idea. A quick scikit-learn version of this toy example follows below.
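Here is the same toy example run through scikit-learn's CountVectorizer, so you can check the hand-built vectors; the three overviews are the ones above, and note the output may order and split words slightly differently from the hand count.

```python
# Vectorize the three toy overviews and inspect the result.
from sklearn.feature_extraction.text import CountVectorizer

overviews = [
    "a fun action-packed adventure",
    "fun adventure movies inspire me",
    "we both love heart-racing adventures",
]
cv = CountVectorizer()
X = cv.fit_transform(overviews)

# The learned vocabulary is sorted alphabetically, one-letter tokens
# like "a" are dropped by the default tokenizer, hyphenated words are
# split, and "adventure"/"adventures" stay distinct without stemming.
print(cv.get_feature_names_out())
print(X.toarray())
```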
This step is key, as it transforms textual information into a numerical format, a vector. Each movie overview is converted into a vector in a high-dimensional space, where each unique word is a dimension and the word's frequency in the overview is the value in that dimension. This structure allows machine learning models to interpret the data and perform tasks such as genre classification. Movie titles are another example: let's say we have Iron Man 1, 2, and 3, and we try to create a vector out of each title. The words used are iron, man, 1, 2, and 3, so for Iron Man 1 you
can expect the vector 1 1 1 0 0; for Iron Man 2 you can expect the structure to be 1 1 0 1 0; and for Iron Man 3 you can expect 1 1 0 0 1. This process of count vectorization, of translating movie titles into the world of vectors, is a cornerstone in the realm of text analysis: it allows us to convert unstructured text, such as movie titles, into a structured format that we can later use in our machine learning algorithms, and we can extend this process to a larger and more complex body
of text. For instance, we could apply count vectorization to movie descriptions, reviews, or even entire scripts; irrespective of the text's size or complexity, the count vectorization method remains effective, and it allows us to handle a broad spectrum of text data. Now, to understand the concept of cosine similarity, let's take a simple example. Imagine we are comparing two movies based on their genres: "Sci-Fi Thriller" and just "Sci-Fi". Cosine similarity is the following equation: the dot product of the two vectors divided by the multiplication of the magnitudes of the two vectors. Over the vocabulary [sci-fi, thriller], "Sci-Fi Thriller" converts to the vector A = (1, 1) and "Sci-Fi" to the vector B = (1, 0). The dot product is 1 x 1 + 1 x 0 = 1. The magnitude of A is the square root of 1 squared plus 1 squared, which is the square root of 2, and the magnitude of B is the square root of 1 squared plus 0 squared, which is 1. So the cosine similarity is 1 divided by the square root of 2, which evaluates to about 0.71; the cosine similarity of the two movies is 0.71. Here is the same computation in code.
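A quick check of that arithmetic with numpy (the vectors are the genre vectors from the example above):

```python
# Cosine similarity = dot(a, b) / (|a| * |b|)
import numpy as np

a = np.array([1, 1])  # "Sci-Fi Thriller" over [sci-fi, thriller]
b = np.array([1, 0])  # "Sci-Fi"

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 2))  # 0.71
```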
Now let's take a broader example. We have Iron Man 1, and we compare it with The Avengers and Oppenheimer. Iron Man 1 is an action and sci-fi movie, The Avengers is action, sci-fi, and adventure, and Oppenheimer is drama and historical. We will calculate the similarity between these movies through their genres, and we will recommend movies based on their genre. If we make vectors out of the genres of each movie over the vocabulary [action, sci-fi, adventure, drama, historical], then Iron Man is 1 1 0 0 0, The Avengers is 1 1 1 0 0, and Oppenheimer is 0 0 0 1 1. Let's calculate the similarity between Iron Man and The Avengers: as I mentioned, cosine similarity is equal to the dot product of the two vectors divided by the magnitude of A times the magnitude of B. When we take the example of Iron Man and The Avengers, we get the vectors (1, 1, 0, 0, 0) and (1, 1, 1, 0, 0); the dot product is 1 x 1 + 1 x 1 + 0 x 1 + 0 x 0 + 0 x 0 = 2, and the magnitudes are the square root of 2 and the square root of 3, so the similarity is 2 divided by the square root of 6, which is about 0.82. And if we calculate the cosine similarity between Iron Man and Oppenheimer, we get zero, because the genre vectors share no words and their dot product is zero. So, as you can see, we calculate the cosine similarity using this formula. All right, let's get started: I
have written the code already, so I will just walk through it. Let's start with step one, which is importing pandas: since we will be using pandas, we must first import it. This is the data exploration and pre-processing part, but first let's install pandas. Okay, so let's start with the feature selection part, which means we first list all the columns in the dataframe to identify the relevant features; these are all the columns that we have inside our dataset. We are going to combine overview and genre into a column which will have the name tags, and that's perfect. Now there's a new column, tags, and this is the only column that we'll be using to run our model on, which means we don't need overview and genre anymore, so we can get rid of them. We do that by saying movies = movies.drop(...) with the columns overview and genre. Perfect: as you can see, we have only id, title, and tags, and this is great. A sketch of this step is below.
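A minimal sketch of this feature-selection step, assuming a DataFrame with id, title, overview, and genre columns (the file name here is hypothetical; point it at your copy of the dataset):

```python
import pandas as pd

movies = pd.read_csv("tmdb_movies.csv")   # hypothetical file name
print(movies.columns)                      # inspect the available columns

# Combine overview and genre into a single 'tags' column
movies["tags"] = movies["overview"].fillna("") + " " + movies["genre"].fillna("")

# 'tags' is all the model needs, so drop the source columns
movies = movies.drop(columns=["overview", "genre"])
print(movies.head())                       # id, title, tags
```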
Now for the text cleaning part: we will import NLTK and the necessary modules for our text pre-processing, and we run this first because it will download the necessary resources for pre-processing. We are going to start with cleaning the text. First we want to make sure the input is a string, which we do by saying if not isinstance(text, str); then we want the whole text column to be lowercase; we also want to remove punctuation and digits; we then want to tokenize the text, remove stop words, and join the words back together; and then we run our cleaning and check the result. Perfect. A sketch of this cleaning function follows.
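A sketch of the cleaning function as described, assuming NLTK is installed; the guard for non-string values and the cleaning steps follow the narration, while the column name text_cleaned is an assumption:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # stop-word lists

stop_words = set(stopwords.words("english"))

def clean_text(text):
    if not isinstance(text, str):   # guard against missing values
        return ""
    text = text.lower()             # lowercase everything
    # strip punctuation and digits
    text = "".join(ch for ch in text
                   if ch not in string.punctuation and not ch.isdigit())
    tokens = word_tokenize(text)    # tokenize
    tokens = [t for t in tokens if t not in stop_words]  # drop stop words
    return " ".join(tokens)         # join the words back together

movies["text_cleaned"] = movies["tags"].apply(clean_text)
```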
All right, now on to the CountVectorizer part. We import it, and install it if it's not already installed, though it seems it is already installed. We apply the clean_text function to the text column and create a new column with the cleaned text; so we clean the data here. Okay, so finally let's set up the CountVectorizer: max features is 10,000 and stop words is English. What we are saying here is that our maximum number of features is 10,000, and we must remove all the stop words contained in the English vocabulary. Okay, now back to our vectorizer: we fit the text data, convert it into an array, and vectorize it. Perfect: we have the data, the IDs, and the vectors. Now we can import cosine_similarity and initialize it; through this line we calculate the cosine similarity between the movies based on the vector representations that we already built through our count vectorizer. We do that by initializing it and saying similarity = cosine_similarity(vector) and running it. Let's see the similarity: great. Now we check movies.info(), and the dataframe is still the same, with 10,000 IDs, 10,000 titles, and still some missing overviews, or rather tags. Perfect, everything looks fine; a sketch of this vectorize-and-similarity step is below.
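A sketch of the vectorization and similarity computation as narrated, continuing from the columns assumed above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Keep at most 10,000 word features and drop English stop words
cv = CountVectorizer(max_features=10_000, stop_words="english")
vector = cv.fit_transform(movies["text_cleaned"]).toarray()

# similarity[i, j] is the cosine similarity between movie i and movie j
similarity = cosine_similarity(vector)
print(similarity.shape)   # (n_movies, n_movies)
```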
Now we want to see how our model has performed, so we are going to identify and print the titles of the top five most similar movies to a given movie based on cosine similarity, and here we will do it for, let's say, movie four. Which one was movie four? Movie four is called The Godfather: Part II. With this function we check whether our model works: what we want to do is find the movies that have the highest cosine similarity score with movie four, return the list in reverse order, meaning that the most similar movie comes on top, and go from 0 to 5, because we only want to recommend five movies; and of course we then print the titles in the list that contains the five movies most similar to movie four. Perfect: for movie four, which is The Godfather, the recommendations are The Godfather or movies related to The Godfather. Now let's create a function that recommends movies based only on the title of the movie, not the ID; a sketch of such a function is below.
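A sketch of the title-based lookup just described, assuming the movies DataFrame and the similarity matrix built above (and a default RangeIndex on the DataFrame):

```python
def recommend(title, n=5):
    # find the row index of the selected movie by its title
    index = movies[movies["title"] == title].index[0]
    # pair each movie with its similarity to the selected one and
    # sort from most to least similar
    distances = sorted(
        enumerate(similarity[index]), key=lambda x: x[1], reverse=True
    )
    # skip position 0, which is the movie itself
    for i, _ in distances[1 : n + 1]:
        print(movies.iloc[i]["title"])

recommend("The Godfather: Part II")
```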
And let's see if that works; it does. Okay, so now let's save the modified dataframe and the similarity matrix for later use; we are going to use these in our web app. So let's import pickle, dump the movie list, dump the similarity scores, and then load the saved similarity scores back to check them: perfect. All right, I will see you in the last part, which is building the actual app; the save-and-load step looks like this.
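A sketch of the persistence step (the .pkl file names are assumptions):

```python
import pickle

# Save the artifacts the web app will need
pickle.dump(movies, open("movie_list.pkl", "wb"))
pickle.dump(similarity, open("similarity.pkl", "wb"))

# Later, e.g. inside the app, load them back
movies = pickle.load(open("movie_list.pkl", "rb"))
similarity = pickle.load(open("similarity.pkl", "rb"))
```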
As you can see, the front end will be provided to you, and the only thing you have to do is code the content. We have import streamlit, import pickle, and import requests. First we write the function to fetch a movie poster using the movie ID; this function will be provided to you, and it works by connecting to the poster API. Then you have to load the movie data and the similarity matrix, which you do with movies = pickle.load(...); we dumped them earlier, and now we're loading them, and we do the same for the similarity matrix and for the movie list. Now let's create the header for the web app; it should say something like "Movie Recommender System". We also have to bring in the necessary components for Streamlit: we will be creating an image carousel, and for that we must first import the components, so we import the necessary components to create an image carousel. We'll be using the component for our front end, and as a base we want to fetch movie posters using movie IDs, so there is a base set of recommended movies and the page is not empty; and this works just fine. Of course, we then display the image carousel component and create a dropdown menu for the home page. Now let's create our recommend function. The way we are going to do it is by first finding the index of the selected movie in our dataframe based on its title; then we are going to calculate its similarity scores against all the movies, and we are going to rank them from the most similar, meaning the highest cosine similarity score, to the lowest, and we are going to recommend five movies; of course you could recommend twenty or fifty movies, that's completely up to you. So let's first create a variable, index, and then calculate the similarity scores and sort them by distance. As you can see, we first calculate the similarity scores and return them in reverse order, from the highest ranking to the
lowest. Now we initialize the lists, recommended movies and recommended posters, as empty lists, fill them in, and it works perfectly. All right, so let me walk you through the code to show you exactly how you can do it as well. We first import streamlit, pickle, and of course requests, to fetch the posters for the movie IDs, because we don't have the images stored. Then we load our data: we load our movies list and of course the similarity scores, and we grab the titles. Then we create the header of our web app, and to create a carousel we must first import the components for Streamlit. Here we initialize our carousel component; we fetch some movie poster images so there is a basic list of available images, or rather movies, that people can use to navigate, so the page isn't empty; it's just a basic image carousel. Then we display the image carousel component and create the dropdown menu. And here's the main function of our movie recommender system: we first find the index of the selected movie, then we calculate its distances, or rather the similarity scores, against all the movies, and we return the ones that rank the highest, which means the movies that are the most similar will be returned. This is our main function, which allows us to recommend movies based on the index: we find the index of the movie, we calculate the most similar movies with respect to our selected movie, we initialize the lists of movie names and posters, we fill in the lists, and once everything is filled we return them. Then there is the button that we click on to recommend movies: we have five columns, and each column will have a text element and an image, and it works just fine, as you can see when we click recommend for Iron Man. A condensed sketch of the whole app is below.
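A condensed sketch of the Streamlit app walked through above; this is not the course's exact code. The .pkl file names follow the earlier sketch, the poster endpoint and API key are assumptions you would replace with your own, and the image carousel component is omitted for brevity.

```python
import pickle
import requests
import streamlit as st

movies = pickle.load(open("movie_list.pkl", "rb"))
similarity = pickle.load(open("similarity.pkl", "rb"))

def fetch_poster(movie_id):
    # hypothetical TMDB endpoint; plug in your own API key
    url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key=YOUR_KEY"
    data = requests.get(url).json()
    return "https://image.tmdb.org/t/p/w500/" + data["poster_path"]

def recommend(title, n=5):
    index = movies[movies["title"] == title].index[0]
    distances = sorted(enumerate(similarity[index]),
                       key=lambda x: x[1], reverse=True)
    names, posters = [], []
    for i, _ in distances[1 : n + 1]:   # skip the movie itself
        names.append(movies.iloc[i]["title"])
        posters.append(fetch_poster(movies.iloc[i]["id"]))
    return names, posters

st.header("Movie Recommender System")
selected = st.selectbox("Type or select a movie", movies["title"].values)

if st.button("Show Recommendation"):
    names, posters = recommend(selected)
    for col, name, poster in zip(st.columns(5), names, posters):
        col.text(name)
        col.image(poster)
```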
Thank you for watching this video; I hope you enjoyed creating the movie recommendation system. If you'd like to watch more of this content, make sure to subscribe, and other than that, I will see you guys in the next video. This video was sponsored by LunarTech. At LunarTech we are all about making you ready for your dream job in tech and making data science and AI accessible to everyone, whether it is data science, artificial intelligence, or engineering. At LunarTech Academy we have courses and bootcamps to help you become a job-ready professional. We are also here to help businesses, schools, and universities with top-notch training and curriculum modernization in data science and AI, and corporate training including the latest topics like generative AI. With LunarTech, learning is easy, fun, and super practical. We care about providing an end-to-end learning experience that is both practical and grounded in fundamental knowledge, and our community is all about supporting each other and making sure you get where you want to go. Ready to start your tech journey? LunarTech is where you begin. Students or aspiring data science and AI professionals can visit the LunarTech Academy section to explore our courses, bootcamps, and programs in general; businesses in need of employee training, upskilling, or data science and AI solutions should head to the technology section on the LunarTech
page; and enterprises looking for corporate training, curriculum modernization, and customized AI tools to enhance education should visit the LunarTech Enterprises section at lunartech.ai for a free consultation and a customized estimate. Join LunarTech and start building your future, one data point at a time. "When you want to enter the business, you need to at least show that you can do the work you are going to be hired for. My starting point is showing your data science portfolio, showing that you can actually do the work, right? A strong data science portfolio, plus the data science communication and translation skills and business acumen, are all a plus; but for you as a hiring manager, during the hiring process you pay attention to the projects candidates have completed, unless they already have experience." Hi everybody, welcome, Cornelius; really excited to have you here: an experienced data scientist, a top voice in the field of data science, and with a wealth of knowledge to share. Cornelius, you are a data science manager at Allianz, so can you walk us through your journey in the data science field and how
you went through the corporate ladder? Oh, that's my story. I think I will go back to the beginning, before all the corporate stuff. I started out like every student, like every aspiring data scientist: I wanted to become a data scientist, but I was not majoring in any data science field. I actually majored in biology; my bachelor's, my master's, everything was all about biology, and I'm a researcher at heart. But there was a moment when I decided I wanted to become a data scientist. That moment came during my master's degree: I was listening to a webinar, trying to see what kind of job I could have in the future, because as a biologist, especially in my country, Indonesia, it's a little bit hard to find that kind of security; I wanted, let's say, financial security, a job that actually pays, because it's a little bit harder in Indonesia, even though I can work as a biologist here and it's my passion. So I tried to find something still related to my biology, still related to my passion for research, but that could actually make some money, and that's when I found what's called data science. While doing my master's I searched around, watched this webinar by another scientist, and then tried to learn about it: okay, they are using statistics, they are using machine learning, they use data to actually solve business problems, and it's actually used in business. That day I realized: okay, this could be my next career move. So at that time, during my master's, even before I graduated, I tried to learn as fast as possible about data science: I joined online courses, I joined communities on social media, and I read as much as possible. And then, when I came back from my master's to Indonesia, I focused on self-learning again, joined offline classes taught by data scientists, and tried to make as many connections as possible from all the previous connections I already had: could I join your company, how do you feel about it, basically. And with a little bit of hard work, from 2018 to 2019, so in about a year and a half, I managed to move from my old field into this one, and I became a data scientist.
Super cool; that's quite an impressive journey. In such a short amount of time you managed to go through all these different levels, because data science can be tough, especially in the beginning, when you want to get your first job as a junior data scientist with almost no experience, right? And you went from biology, not a so-called traditional data science study, all the way to the field of data science. So let's talk about that: can you talk about some of the challenges that you overcame when you were just starting out with no experience at all? Yeah, there were a lot of challenges when I first started as a data scientist. I didn't know much: I knew a little bit about programming, I knew a little bit about statistics, I knew a little bit here and there, but when you are already working in the corporate world, and my main job right now is in the corporate world, there are a lot of things to learn, especially on the business side. You can develop a model, I mean, you can run the code, but is it solving the business problem? Is it really valuable to our business? Is it convincing enough to be used by the business people and the business team, or even, from the customer side, do they actually want to use my model, my study, my learning? This is something that I learned during my first year, and to this day I am still relearning what the business itself is about, because as junior data scientists we usually work to execute, to perform whatever the task is going to be, right? But in a corporation it's a little bit different from a startup, where you might get to do everything: in a corporation everything is already structured and organized, the business is already moving, the business already knows what it wants, the business already knows who makes the money there, and the business already knows what data science should be. Data science could be there to automate things, or to invigorate all the new things in the business, but as data scientists, especially juniors, we need to understand why the business process matters. For example, in my two years of experience as a junior, one of the first models I built is what's called a propensity-to-buy model: a model that tries to predict which customers should be approached to buy a new insurance policy, basically, or a new product. It seems easy: you just take the customer data, see which ones bought and which ones didn't, and then you try to create a model from that. But the tricky part is that there are a lot of puzzle pieces in there. You have the model, but what kind of product are you actually selling, who is the target customer, who is going to execute this list, where is the communication going to happen, how long is the campaign going to run, is it going to be continuous or just one time? There are so many moving parts, and you need to work a lot with the business. So that's one of the things I was still learning as a junior, but it's actually one of the things that
boosted my career, because I learned so much about the business, and I also learned about communication within the business, because, like I said before, you want to convince the customer, and by customer I mean the user, the party that you want to use your model. I can do my job, but can you believe in my job? It takes a really good translation from our technical terms to the business terms. I cannot present my model as, for example: this is a random forest, this is how it works, this is the precision, this is the recall of the classification model. No, they don't understand that, right? But what they do understand is, for example: I created a propensity model, and when we simulate this kind of modeling, it could move your key KPI; for example, it could increase the monthly revenue by 20% compared to the normal work that you do, according to the simulation. So there is this kind of rethinking needed, from our technical language to how the business actually speaks; I always call it translation. And I think I couldn't say it better than what you just said; it's really important. Also from my experience, and from what I've seen in my colleagues, being a data scientist is not just, you know, crunching numbers: many people think it's just statistics, or maybe some mathematics, or purely data, but it's, like you just said, all these different skill sets that have to come together, like the business acumen you just mentioned and communication, translation, you know, between business and technical terms. Because usually the product managers will never come and tell you: please make a classification model for me to classify this thing. No, they just come and say, I need to do this, and then you need to realize that, oh, you need a classification model, oh, you need to select those approaches. And I think you described perfectly that this is a combination of different things, which from one view might seem very easy, but when you dive deeper: how to clean the data, where to collect the data, where to store it, how to process it, how to make an impact, how to measure it. So I totally understand how you could grow and go through the corporate ladder very quickly, because early on in your career you had an opportunity to learn and gain all these different skills. And I think, also for our audience, for our aspiring data scientists, this is definitely a note to make: if they want to grow quickly in their career, they need to be prepared to work on their communication skills, on their business skills
like you did, and on their translation skills: how to translate from business KPIs and OKRs to actual data science problems. On that note, since you mentioned a very interesting project: was there any particular project in your journey early on, when you were a junior or mid-level data scientist, that kind of made your career, that set you apart from the others and got you promoted? Yeah, precisely; it's actually cumulative, I wouldn't say it was just one project, but maybe I can go back a little bit to the beginning. One of the things that I tried to do during my junior time, and I feel this is critical, is to take initiative. I was not just moving along, like, okay, this is your project and you need to do this and that; even at my junior level I didn't just take the jobs as they came, but rather I communicated with my boss: I know this is a really interesting project, I want to take it; or, I know this is going to have a good business impact, so can we talk with the business user, this kind of project we could try to create together. So I tried to take initiative in my own career, so I would know where I could move during that time. But yes, it accumulated, and there is one project from that time that I remember: what I'd call the NLP project. It basically tries to predict whether a customer email is a hard complaint or not. You know, customers sometimes really complain, right? But it's not just a complaint; what we want to predict is whether this kind of complaint could be damaging to our reputation or not, so it's about the kind of complaint. This project that I undertook is something I could say I made my own, because it's not a kind of project that had been done before in our company. And I said to the user and to my boss that I wanted to take this kind of project, that I knew it could be really useful for our business, and that I would take responsibility for managing it. And it went well; even right now the business users still want to communicate with me about any kind of approach in the business, because I started with this kind of project and took the initiative on it, and they still approach me if they have a problem with another initiative that they want done. So it starts simple, it starts just from ourselves, but these kinds of things will be seen by your boss, your colleagues, any business partner, and others. So just take initiative; I think
it can really make your career, not break it; it can really make your career. So, kind of be proactive: if you are already a junior data scientist, you already have some basic skills, you have worked with some senior data scientists under supervision for a couple of months, now it's time to take initiative, to look around, you know, network, and see what kind of projects are on the table, and then go and try to make something of them. Even if something seems boring or uninteresting, you kind of need to dig deeper, like you did, right? You basically identified the projects and said to your boss: well, can I work on this, and is it impactful? Because at the end of the day, you will always go to the next level if you are doing a project that has a lot of impact, right? Yes, yes, that's true. And also, from my junior level, even before I became a data scientist, I already knew what I wanted to be, I knew what I wanted to do, and when I knew what I wanted to do, I didn't stop there: I tried to make what I call a master plan, basically. This is going to be my career, but I need to take it into my own hands. This is also something my boss always said that really inspired me: your career is in your head, your life is in your hands. So for every move you make, you need to be responsible, but it also needs to be something that makes you improve, as a person, in your career, or in anything. That's why I always try to take initiative there. And it worked out, right? Because now you are a data science manager, you are a top voice in data science; so it worked out well, which is amazing. And on that note, given your nontraditional data science background, because you have a background in biology, and there are many people out there among our listeners who also want to make a career change, or who come from a non-traditional data science background, meaning they don't
have a traditional statistics, mathematics, or data science master's degree: I wanted to ask you, so you can tell our audience, about the impact that your unique background had on your career. It's really interesting, because as a former biologist I'm not using much of my education in the specific sense; things like genetics or proteins, of course, I'm not using in my everyday work. It's more about the method of thinking from my time working as a biologist, as a researcher: how I structured my work during that time, how I used my statistics. I think I took that from my biological thinking and my education, and it really shaped how I break down a lot of the academic work; you need to break it down, for example, so the structure of your work goes from the theory, to the methodology, to how it's going to work, to the results and the conclusion. All of that previous experience actually helps me in working as a data scientist. But of course, I know every person has a different background; I can only speak for myself, and since I come from the science side, this kind of research methodology has been really useful. But I know there are people coming from literature, for example, coming from philosophy, coming from languages, from something really not related to programming or to data at all, and I think they can have a different, unique perspective as well. At the very least, the way you approached your work during your studies could be applicable to the way you work later, and at least if you already have the basics, I think you'll manage, even if you are not from a data science major. So you basically used your background to your advantage, even though it was a non-traditional data science background. Yes. So
what would be your advice to anyone who has this non-traditional background, in terms of specific steps? Yes: having a roadmap, for me, is the best one. I know it's a little bit cliche to say, but having a roadmap to follow is really helpful. There's already a lot online right now about how to become a data scientist step by step: first, learn statistics; second, learn Python; third, learn machine learning; fourth, do machine learning and data science projects; fifth, try to apply, try to communicate, try to learn presentation. These steps might seem boring, but following them is actually important, because it's a really structured way; to become a data scientist, I would say, you just start small, but those steps, the statistics, the programming, and the machine learning, are connected to each other, skill to skill, and they are going to help you become the data scientist that you want to be. So just follow a roadmap that is already out there; it's going to be more helpful than trying to learn by yourself, jumping here and there: you learn machine learning but then forget about the statistics, then you forget about the math, or you just learn the programming and then forget how to present it to the business user. Just follow the steps one by one, and then I think you will already be in good shape. So learn in an organized way, basically, following the roadmap. Yeah. Okay, and have a clear plan, I suppose. Yes, having a clear plan is really helpful, and this comes from my experience as well, because I created a plan for myself to become a data scientist and then tried to follow that plan, and it actually worked; maybe not perfectly, but this kind of plan really helped, definitely for me, because a plan like this makes you focus, and if you are not focusing on that one thing you could just go anywhere and lose your way to becoming the data scientist that you want to be, right? Right, no, I absolutely agree with you: have a clear plan and learn in an organized way, and don't just go from one course to another trying to learn everything. That would indeed be the best way, because otherwise everyone will spend a ton of time learning one skill, and there is a new skill to learn by the time they finish the other one. Amazing.
And when it comes to leading, because you are a data science manager, you have gone through these different steps already, and you are leading teams: what, in your opinion, has helped you balance the two, the technical side and leading people? How do you manage projects and people and, at the same time, drive your career? Yeah, that's really hard, I would say, but this is something that I also learned during my corporate time, because leading and being an individual contributor are, I can already say, two different things. On one side, the technical part, we already understand; but to become a leader, to lead people, we need to understand how to delegate: how to delegate which task, which project, and then trust the others that they can actually get it done, and in that way trust each other. I trust myself and I trust you as a colleague; I would say you are also my friend, a friend with whom I am working together here. We are not just boss and subordinate, like, okay, you do this, you do that, okay, bye. No, we try to communicate together: what are the problems here, how could we work together on this? And I try not to micromanage; I try not to jump in and do every bit of the work myself. Trusting the people and delegating to the people is really important, but it comes from learning, of course. As an individual contributor I still really love coding, I still love programming, I still do it right now, and I would happily get in there and do the modeling and such at any time. But trusting the others is a learning process: learning to manage the project, and learning that others can actually do this work much better, which I only really understood once I was already in this spot. So yeah, it's still a learning process for me; learning is a never-ending journey, I would say. But delegating to the people that I trust, having trust in the others, is important for balancing things. On that note, because some of our listeners are aspiring tech people who haven't worked in the field and don't know this concept of an individual contributor that you just mentioned: for anyone who doesn't know the term, we usually have two different career paths in data science, right? When you just join, you usually join as an individual contributor, so you work
on the technical stuff, and as you go through your career there are usually two ways to go: one is the individual path, as an individual contributor, and the other one is the managerial path, which I assume is the one that you are following, because you are managing people. Yeah. So, about that: you mentioned working as an individual contributor, and now you are a manager, which means that you are managing people and supervising them. And on the note of trust, because that's something you mentioned a lot when explaining your leadership skills, the trusting in others: how do you build trust as a data scientist when you have just joined your data science job? Yes: through my proof of work, I think, because that's what I did to build trust with my boss and with my colleagues. I proved that in my job I could actually do this, I presented what my results were going to be, and I was proactive in my work. Trust is not built just by you saying it; trust is built by having proof of your work. Trust comes from the things that you have already done, from the action itself. I know communicating is maybe not that easy either, but your work is your promise during that time. As a junior data scientist you can say, yeah, I can do it, I can do it, I can do it, and people will think: yeah, you say you can do it, but how's the work? It's the work that actually provides the proof for the trust. So as a junior, I think the best way is to just do the work, even if it's small; as you said previously, it might be boring stuff, it might seem like it's not leading anywhere, but that kind of work shows that you can actually do the work. So get the work done, basically. Yeah, get the work done, right, right, that makes sense. So then you can
build the trust and make sure that you help your supervisor, because at the end of the day that's the job of the junior data scientist: to help the senior data scientists and the other leaders in the team and to make their lives easier, whether it's, unfortunately, data collection, data analysis, or not. And on the note of leading teams and becoming a manager: in your opinion, how can one decide whether they want to take the path of the individual contributor, so become a principal data scientist and stay an individual contributor, or whether they should consider the path towards becoming a data science manager? What qualities do you need to have to go down one path or the other? I think it comes from yourself, because every single person has a different evaluation, right, of what they want to be. For example, for my side, I love both sides; I already said I love both sides, the individual contributor side and the leading side, I love both of them. So I wanted to actually have the experience of being a leader as well; because I already had experience as an individual contributor, that's why I tried to move to that side too, to try the leadership part. But of course, it comes back to yourself again: what do you want to do in your future? I cannot tell every single person that becoming a manager is the best path because it has the best money, or that staying an individual contributor is much better; it really comes down to yourself, to your comfort zone, to where you want to be in the future. So it needs to come from yourself; every single career path has its pros and cons, it's always like that, and whatever decision you make, you need to take responsibility for it. That makes sense. So what is it like to be a data science manager? Walk us through your day-to-day, your responsibilities, so the listeners can get an idea of what it means to be a data science manager. In my day-to-day, I would say, it's still leading things, of course:
having one-on-ones with the team, and managing the projects: what is our project right now, how is it going, how is it progressing? So it's managing the progress of the work, and then trying to balance it against the business users and their expectations, and translating those to our technical side so we can actually make that kind of progress. So it's balancing with everybody, from every side as well; I would say that's the kind of work I usually do day to day. With the people: what are you going to work on this week, or today; and then, when the business users come: how are we going to make this project successful today, basically. That makes sense: so managing people, a lot of communication, meetings. That's definitely also part of the managerial position, that's for sure. So if you don't like meetings and you don't like communicating with product people and business people, you should not choose that path, right? That's true, yeah; I would say that's the con, because if you look at my calendar, the business meetings, there could be eight, nine, ten, twelve of them, probably a lot, and then suddenly there's a meeting where you wonder: what do I need to join this meeting for? Yeah, yeah, for sure, I've heard that a lot. So that's something for our listeners to take into account when choosing which career path to go down. And on the note of
projects: you have been leading projects, and you have also been climbing the corporate ladder successfully, becoming a data science manager at such a young age. Can you remember one impactful project, a tough one that you and your team completed, and what were the main takeaways from it?

There have been so many projects, but I think the really impactful one, still ongoing right now, is a fraud detection project. Basically, the business needs a fraud detection model, and it is really impactful because fraud usually doesn't happen that often, but when it happens it can be damaging: financial damage, damage to the reputation, damage to everybody in the business. It's an ongoing project that needs a lot of consideration: what kind of business it is, who the customer is, where the dataset is coming from. So it involves a lot of stakeholders and a lot of technical people. This kind of project has seen a lot of people come and go, because it's a really hard project, but in the end it's making an impact: the business users are using it and learning to trust our results. And there is always a lot of room for improvement, because, as I said before, fraud can be really damaging if we don't handle it properly. We need to present the results in a way that is understandable to the business, so we try to make the model as good as possible, make the explainability as good as possible, and integrate it with the business processes as much as possible. It's the project I remember most; I have even tried to take what I built there and implement it in other projects, because this kind of complexity really needs to be taken into consideration.

Amazing. That sounds like a tough problem, fraud detection: if you have a high error rate, that can seriously impact your operations, and it's money we are talking about, after all.
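For readers who want to see roughly what this kind of model looks like in code, here is a minimal, purely illustrative sketch in Python with scikit-learn; it is not the actual system discussed in the interview. The synthetic data, the random-forest choice, and every parameter below are assumptions for demonstration only. The two ideas it tries to capture are that fraud is rare (heavy class imbalance) and that the results must be explainable to the business.

```python
# Hypothetical fraud-detection sketch: rare positive class plus a first
# look at explainability. Not the production system from the interview.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transaction features: roughly 1% fraud.
X, y = make_classification(
    n_samples=20_000, n_features=12, weights=[0.99, 0.01], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# class_weight="balanced" compensates for how rare the fraud cases are.
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Precision and recall on the fraud class matter far more than accuracy.
print(classification_report(y_test, model.predict(X_test), digits=3))

# A first step toward the explainability Cornelius emphasizes:
# which features drive the model's decisions?
ranked = sorted(enumerate(model.feature_importances_), key=lambda t: -t[1])
for idx, importance in ranked[:5]:
    print(f"feature_{idx}: importance={importance:.3f}")
```

In a real project like the one described, business-facing explainability would typically go further, for example per-prediction explanations, but global feature importances are a reasonable first look.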
And on the note of money: you have gone through these different steps. You usually start as a junior data scientist, or as an intern in data science if you have no background at all and haven't completed any projects; then you go through the steps, usually becoming a medior data scientist, then a senior data scientist, and then a data science manager, like you are today. What does it take to get a promotion? Is it only the technical side, building that trust like you just said, making an impact, or do you also need to actively promote yourself? I know it really differs from company to company, but there are some common qualities that differentiate data scientists who stay in the same role for many years, even with a very good skill set, from data scientists who, like yourself, move through the steps quickly and get promoted a lot.

Yes. As I said before, it really depends on the business. A promotion only comes, first of all, if there is a business need, a position that needs to be filled; that's the first stage. If you want a really high position, it might not go as fast as it did for me if your company simply doesn't need it. But a promotion that increases your money, your financial security: I believe that's actually the most important part for most people. I know some people don't need to be promoted; they just want a higher salary rather than a bigger title.
In my case, the biggest quality was, as I said before, taking initiative. I discussed with my manager what I wanted my career to be. I asked, at the time, whether I could get a promotion, and then they, what's it called, negotiated: yes, we can do it within this span of time, given your proof of work. Everything I had done, I had already shown. And I'm not saying I asked and was promoted within six months; I built the case during that time, and it actually took about a year from that conversation to get here. So it really takes initiative, because you don't even know whether it's possible if nobody asks. I know some company cultures are a little closed, or some people are just too shy; it depends on the culture and the company. My company is really open: whether you ask to be promoted or not, it's an open discussion. But at companies that are a little stricter in that regard, you need to be a bit more proactive: show that you are the best at your work, that compared to your colleagues you are number one. I would say it's a healthy competition, if you choose to frame it as a competition, when you want to get ahead.
OK, amazing. So basically: don't be shy to ask, because if you don't ask you might not get it. Read the room, meaning you need to know your boss and whether he is open to a conversation like that, and understand the company culture, whether it's fine to promote yourself. Build that negotiation skill set, to negotiate a higher salary or a better business title, because sometimes that also matters for your position in the company. And do it on time, meaning show your work first, and only go and promote yourself when you are holding the right cards, right?

Yes, yes. You basically need a strategy for that. I know it takes time and it takes skill, and not everyone is brave enough to do it, or even thinks of doing it; everyone is different. But of course you can borrow a little from other people's strategies. If you want to copy mine: after a year of working hard, having already taken initiative, ask your boss. That strategy worked for me, so you can try to copy it if you approach it with the right mindset.

OK, that's very good advice, I think: don't shy away from asking, and plan your strategy in advance.

Yes.

On the note of planning, and of personal branding, planning your personal brand: now, Cornelius, you have a huge following. You are a Top Voice on LinkedIn, and you have about 30,000 followers on LinkedIn, right?

Almost 30,000. Still almost, yes.

So if you are listening now, go and follow Cornelius; maybe we can get him to 30,000. So how important is it to have a personal brand as a data scientist?
Yes, it's actually very important. Let me talk about outside of my work, because inside work it of course helps too; outside of my work it has given me a lot of chances, a lot of opportunities. Right now I'm talking with you because I already have a personal brand, right? It opens a lot of doors: I get a lot of new friends, a lot of new networking, a lot of career opportunities outside of my job, a lot of freelance work, a lot of writing. I love writing, so I focus on writing. Some content creators make videos, do TikTok or Instagram photos for example, but I love article writing. I have a newsletter, and I build my personal brand there as a data scientist who writes. All of that has helped my career. I don't call it a side hustle anymore, because it has become my personal hustle, one that improves my income and my financial security. So personal branding can give you more career choices. I have read a lot about business entrepreneurs: they usually don't have just one source of income, they have multiple, building here and there. And I see the same thing working on the personal level: you can build multiple sources of income because you promote yourself. Like me: I try to promote myself as a data scientist who knows some of the stuff in the data science world, and then the opportunities keep coming, because I started promoting myself here and there and building the career choices I could make in the future. At first, it's the doors opening.
But also, you never know what could happen in a company: a crisis, a financial crisis, being laid off, or even a pandemic like before, when people who felt financially secure suddenly got cut off. So I see my personal brand as a security measure as well, a runway in case something like that happens. That's really important.

Yeah, for sure, because, like you just mentioned, in many parts of the world data scientists are currently being laid off simply because of the economic situation. And the message you are giving our audience is that your personal brand will not just be something for that scenario; in general it opens many doors for you, new opportunities, new networks, new contacts, and if something does happen, if you are terminated or laid off for some reason, it will be a great way to have a source of income. So don't put all your eggs in one basket: that's the message you are giving, right?

Basically, yes, precisely.

And on the note of your newsletter: you have a popular newsletter, and it is called Non-Brand Data. Can you tell us more about that?

Yes. At first I just called it Non-Brand Data because it was the early days of creating a newsletter and a brand, and I didn't want to box myself into one kind of branding, say data engineering, or data science in NLP, or anything that specific,
so this newsletter can talk about anything, as long as it's related to data. At the moment I talk mostly about three things. First, career: business and career as a data scientist, those kinds of tips; that's my own take. Second, the technical stuff: I still put my technical knowledge into my newsletter, talking about Python and MLOps and how to integrate them. So basically I try to combine understanding the business with the programming and technical side. It's still a data science newsletter that tries to help you, as a data scientist, improve your career.

Amazing. So if you are listening now, make sure to check out the Non-Brand Data newsletter from Cornelius. Hit the subscribe button and you will get a lot of help with your career, with how to start a data science career, and also with the technical stuff, like Cornelius just mentioned.

Yes.

Also, let's talk about the coaching that you do, because you have a Topmate link, another way you share your knowledge, coaching other aspiring data scientists or people who are already in the field. Can you tell us more about what kind of coaching you do?

Yes. It's mostly about how you can move as a data scientist, where you want to go in your career: as I said before, do you want to be a manager, or an individual contributor; basically, how you want to move in your data science career. But of course it is also open to people who are not in data science yet and want to break into the field; I'm open to coaching for that as well.

Amazing. So if you are looking for a coach, make sure to get in touch with Cornelius; he may be able to help you with your career, but also with the technical part. He is on Topmate, so we will put the link in the description.
So, on the note of personal branding: let's say a person comes and asks for your coaching services, or just for your advice in general. You have a huge personal brand at the moment, a huge following on LinkedIn. What would be your advice, in an actionable way, on how to build a personal brand from scratch?

I will speak from my personal experience. First, networking; but actually, before I posted much of anything, I came up with a plan for what I wanted to be. I'm a data scientist, I know what I know, so I decided that would be my brand: data science. Then I worked on the networking. The first time you post on social media, of course, not a lot of people read it or notice it. That's why I also approached people who already had big followings. I have a friend, his name is Carin; he is not that active on LinkedIn anymore, but he was the first one who helped me get into the data science network on LinkedIn as a data scientist. Because of his followers I gained followers, and then I started getting more and more. So: know your brand, then network with the people who already have those numbers; it's a numbers game, if you want to put it that way. And then, third, consistency. I have been posting consistently for, I think it's been five years now, and I
still post consistently. One of the things that makes people shy away after their first post is that not a lot of people like it or comment on it. But no, just keep posting. Every single post you make might be valuable for someone. You might think a post is too easy, that people already know this, but maybe someone is actually reading your content and thinking: OK, so this is how you do it. I have experienced that myself: not many, but there are people who say thank you for a post, and that keeps my motivation higher. At first the numbers might not show much, but keep posting consistently; it takes mental fortitude to keep posting when the response isn't great. And of course it still takes strategy to improve your social media presence and personal branding. So what I would say is: first, decide what you want your personal brand to be; then build the network; and then keep consistent. Those three are already good enough if you want to build a personal brand.

Amazing. So if you are someone who is listening and who doesn't yet have any personal brand, following these steps will help you build one, online but also offline. So, on that note, Cornelius:
do you think that having coaching services, having a newsletter, and having a LinkedIn following, and of course also networking, is enough for your personal brand? Or are there other media or channels that you use to build your personal brand as a technical person? I'm talking about, for example, GitHub, a place to store, create, and showcase your code, and also to showcase how you can tell a story about data.

Yes, that's true. GitHub really is the community for data people, and not even just data, for programmers in general, to have a portfolio on display. For myself, because I was a writer previously, I used Medium to host all my articles, and I'm still quite active there, though right now I'm a little more focused on my newsletter. But depending on your passion, your strategy needs to adapt a bit. In my case it's Medium and GitHub; for people who prefer video, maybe try being present on YouTube or TikTok. I know some of my friends are really active on TikTok compared to LinkedIn. You really need to understand your audience, who you want your audience to be. For example, I like LinkedIn because it's a really professional place, a professional social platform, and
a lot of professional communication happens there, so that's why I'm really active there, building my personal brand. But maybe your personal brand wants to target more casual people; then maybe X is a better place. I know a lot of the big data people are there as well, with a lot of followers. I post here and there on X too, but mostly in Bahasa Indonesia, so it may be a bit harder if you don't speak Indonesian. But if you are just starting, try to focus on one platform first, one social media platform; build from there, and then branch out. Things like GitHub, or video, are really just the medium, the places to show your portfolio, to show your work; I think they complement the personal brand you are already building on social media. So pick your starting place and focus there. I think that's one important part.

I think that's amazing advice: keep it simple, because sometimes it can be overwhelming if you have so many social media channels to keep up with; you will just run out of time, or you will burn out. So start simple, basically, that's what you are saying, and then they will all start to complement each other: whether it's X or TikTok, GitHub or Medium, or different other places, LinkedIn, Facebook. It really depends on your target profile, like you just mentioned.

Yes.
Amazing. Let's now dive deeper and become a bit more technical, because data science is such a buzzword. There are people who understand data science as data analytics; there are people who understand data science as machine learning engineering; and, with the revolution in AI, there are now many people who understand the data scientist as someone who is dealing with LLMOps, large language models, deep learning, machine learning: so many parts that many data scientists across the world are learning and implementing. In your opinion, as a data science manager with a wealth of knowledge and experience in the field, what is a data scientist for you, and what does it take to be one?

This is a question I'm honestly still asking myself. Previously I said a data scientist is someone working with data to bring value to the business, but the role is moving well beyond that. A data scientist is someone who brings value to the business and drives better decisions in the business. And I always say business, because a data scientist always needs a business to work with: even if you are a solo developer, your product will need to be promoted in some way and have some business value. So that's why I say that today data science is not just about delivering value, but about making better decisions in the business. As for what it takes to become a data scientist: a lot of mental fortitude. Consistency. Wanting to learn all the time, every day, every year, what's called lifelong learning; programming, business, I already said it before, so I don't need to repeat it. Just keep learning and keep up with the latest technology. I believe, I really believe, that this technology is going to change the world. So as a data scientist, try to keep following the technology, because if we do not follow it we will, as I said, be replaced. As data scientists we have an advantage, because we are usually the ones creating that AI, complementing that AI.
So on that note, I think you have already answered my next question, because I wanted to ask you whether AI is just a buzzword. Many people think this new era of generative AI, ChatGPT, Claude, LLMs, is a hype that will just go away, but you just answered that, in your opinion, it's not going away anytime soon and it is going to make a huge impact. So can you walk us through some of the recent developments that you are aware of and believe are going to make a huge impact, and in which industries specifically, in your opinion?

Yes. If we talk about the technology now: we already have a lot of generative models. OpenAI already has a lot of really strong models; Claude has also been really great lately; and with something like Sora you can generate video from little more than a prompt. But what I really see is the business side, the non-technical people. Previously the applications were simple things, but now business people are starting to see how useful this is: not just technical people but non-technical people are trying to build products based on it. From my side: over these past two weeks I have met a lot of stakeholders, big stakeholders here in Indonesia, who want to build products and use this AI to simplify their business processes. In Indonesia, at least, they already see the potential. But of course, it's our job as data scientists to prove that this kind of AI is actually useful for the business. It's not enough that the business wants to use it and makes the effort from their side; we as data scientists need to make the effort as well to show that, yes, this AI can be useful for your business. And when those two things combine, the business and the AI, it becomes a game changer. That's why I'm really confident that AI can actually change the world: it's not just me personally using it and benefiting from it; everyone around me is starting to use it too, implementing it in their businesses, and it is actually useful.

Well, that's amazing to hear, because when I hear talk about AI replacing data scientists, I feel that's highly arguable. So first I want to take your opinion on this: do you think AI will replace data scientists, and how can you future-proof yourself as a data scientist?
Yes. I don't think it's going to replace us wholly. I think it's more that some of the tasks we handle, for example certain detection tasks or code generation, could be delegated to AI. But structuring all of that code, deciding where the business is going to use it and how it's going to be managed, that still takes a data scientist. That's why the data scientist role keeps evolving: as those tasks get taken over by AI, we as data scientists need to evolve as well. It's not enough to code well; we need to understand how to manage the code better, where it's going to sit in the business, which parts will really use this AI. We need to understand all of this latest technology too. This is why I say the data scientist needs to become full-stack; that is unavoidable. It's my own path as well: I'm learning a lot of MLOps, because I don't think that will be replaced either; the infrastructure still needs people.

I couldn't have said it better, because being able to become this full-stack professional, someone who knows the data side, the machine learning side, deep learning, but also the recent developments in AI, at least at a high level, what an LLM is, what LLMOps is, how those cloud technologies can be used, and who also has the business acumen and the communication skills: there is always communication between business and data scientists, always a need for that translation. As long as you are able to do that and continuously develop yourself along with the technology, there is no way you will be replaced. By the way, the other day I was reading that there is currently an 85% gap between the demand for data science and AI professionals and the supply. So unless you are doing a manual job, I always tend to say, unless you do something repetitive that can be replaced by AI, you are good to go.

Yes, that's so.

So you should definitely be motivated to get into data science, if you like it.
On the note of getting into data science: you are now in the position of hiring data scientists. What is the skill set you pay attention to? Because with this recent buzz, LLMs, generative AI, many aspiring data scientists, instead of starting with the fundamentals, start with the difficult stuff, training neural networks, understanding RNNs, attention mechanisms, transformers, diffusion models, and then try to show with these projects that they are experienced professionals. What do you pay attention to when you are hiring and these different résumés come in?

OK. So the first thing is what I don't pay much attention to: diversity attributes, where someone is from, their age, their background. I don't do that kind of discrimination; I try to treat every candidate equally in the funnel. But of course, when you are in a business there are business needs, so the first thing is filling the business need when hiring. For a junior data scientist, I want to see whether you can actually do the work. Basically: have you already done some data science projects? And for those projects, how did your thought process go: what motivated you to do this kind of project, to write this kind of code, to present this kind of result? Those data science projects are what I really look at. Of course, having internship experience or previous job experience helps, and I look at that as well, but I know this is a little harder for fresh candidates, because those positions are really hard to fill, so in the end I look at their data science portfolio projects. That's the most important part that I check. Skills like communication, business acumen, translating for the business: I think those are the kinds of skills you learn once you are already inside a business. But to enter a business, you need to at least show that you can do the work for the task you are going to be hired for. Of course, understanding a little bit of the business side is still really helpful; it's a plus. But my standard point is: show your data science portfolio; show that you can actually do the work.
Right, so a strong data science portfolio, basically. You learn the data science communication, the translation skills, the business acumen on the job; those are all a plus, extra. But for you as a hiring manager, during the hiring process you pay attention to the projects they have completed, unless they already have experience.

Yes.

And so, on that note, what would you suggest, a couple of examples of such projects, for an aspiring data scientist who has zero experience, fresh out of college, maybe even without a formal data science education, to put on their résumé to impress you?

There's a lot; a lot of people impress me, some very smart people. There are a lot of datasets you can find on Kaggle or UCI, and those make for somewhat complex data science projects. But what would really impress me is if you can actually formulate a business problem from one of those datasets, and then explain why you developed this kind of model, and show that the model you developed actually solves it. Or even without using a model: can you formulate a business problem from the dataset, and then use a data science technique, maybe just a clustering technique, maybe just a dimensionality reduction technique, and actually show: this is how I solve the business problem I formulated with this dataset, and this is how I did it? That's the kind of thing that really impresses me, because what I want to see in your data science project portfolio is the thought process, and the thought process usually comes naturally once you have a business problem you want to solve.
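To make this advice concrete, here is a small, hypothetical sketch of the kind of portfolio project he describes: take a public dataset, frame a business-style question (here, "what natural segments exist in this data?"), and answer it with nothing more than dimensionality reduction and clustering. The dataset choice, the number of clusters, and the segmentation framing are all illustrative assumptions, not something from the interview.

```python
# Hypothetical portfolio sketch: frame a segmentation question and
# answer it with PCA plus k-means. No supervised model required.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine  # stand-in for a Kaggle/UCI dataset
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)  # put features on one scale

# Dimensionality reduction: compress 13 features to 2 for inspection.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Clustering: propose 3 segments, then profile each one.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

for k in np.unique(labels):
    size = int((labels == k).sum())
    centre = X_2d[labels == k].mean(axis=0).round(2)
    print(f"segment {k}: {size} members, PCA-space centre {centre}")
```

The point, as he says, is the thought process: the write-up around code like this, explaining why three segments and what each segment might mean for a business decision, is what a hiring manager actually reads.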
Of course, the datasets available publicly may be a little limited; but if, within that limited data, you can think creatively about a problem and then solve it, that will impress.

So from end to end, basically, and with a skill set that goes beyond just solving the problem in a technical way.

Yes, solving a problem in a technical way; you can say that.

Amazing. And one last question, because we have spoken about many important topics that I believe many aspiring data scientists would be interested in: what do you see as the future of data science in the upcoming five years?

Well, it's going to change a lot, I think, with all this AI development and the technology that is everywhere. I know that in five years data science will be invaluable to the business. Like I said before, AI was a buzzword previously, but it's a buzzword that is now used by the business, and if businesses really do bet on it over these five years, they will try to hire as many data scientists as they can, just to make sure everything goes smoothly. And I'm pretty sure, I'm pretty sure, that within these five years we will be using a lot of automation built from data science techniques.

Amazing. Now, Cornelius, thank you so much for joining us today. I think your insights and all your tips were invaluable for our listeners who are interested in tech and in data science. For our listeners: make sure to follow Cornelius on LinkedIn, but also to check out his newsletter, Non-Brand Data, and if you are looking for a coach, then definitely go for Cornelius; he will be able to help you. Thank you so much, Cornelius; it was a real pleasure to have you here.

Thank you so much, thank you.

So they're going to buy a business, and that'll be the base, and then on top of that base we're going to add other businesses to try to help it grow, you know, exponentially faster.
And I used 100% bank debt to buy those 23 companies, and the net result is a tremendous amount of shareholder value created. Tesla could have come crashing down if the lenders started saying, you know, enough is enough.

Joining us today is Adam Coffey, who brings over 21 years of experience in building businesses, having served as CEO of three major companies backed by nine different private equity sponsors. Adam has managed transactions worth over 2.5 billion dollars and advised top Fortune 500 companies. Over his career, Adam has organized 58 business deals and significantly increased company values, achieving fivefold returns for investors. He grew one company's value from 10 million to over a billion dollars, earning him recognition as one of the most influential leaders by the Orange County Business Journal. Adam is also a bestselling author, a popular speaker, and a mentor to aspiring leaders. His extensive background includes roles in healthcare, manufacturing, and beyond; his diverse skill set also includes being a licensed contractor, a pilot, an army veteran, and a former executive at GE for 10 years. Today we will dive into the proven strategies that aspiring tech entrepreneurs and fresh graduates need to thrive in today's competitive landscape. We will uncover invaluable insights on how to navigate the tech world, cut through the noise, build investor trust and secure funding, as well as forge lasting partnerships. Finally, we will learn how to plan and execute a lucrative exit that maximizes your hard-earned success. The podcast will be hosted by Vahe, an experienced software engineer and tech entrepreneur, the co-founder of LunarTech, which is on a mission to democratize data science. So, without further ado, let's get started.

Welcome, Adam; we are excited to have you join us today. Now, Adam, you are a big deal in private equity and in business deals worth over 2.5 billion. You have also advised top Fortune 500 companies and written bestselling books on business. Adam, could you please share your journey, and how you got to where you are today, with our audience?

Happy to, happy to. And hey, by the way, good to see you; good to be here with all your listeners out there.
You know, I think for all of us life is a journey, and it's the building of a set of experiences that make us who we are. As a young person I served in the US military; service in the military taught me something about discipline, teamwork, leadership. Engineering made me a meticulous planner. I'm a pilot, and pilots don't take off unless we know where we're going, so that taught me, as an entrepreneur, to plan an exit from the beginning, to always have an exit and a destination in mind. I spent 10 years working for Jack Welch in what I call the Camelot era of GE. GE was the world's number one largest company, Fortune number one on the Fortune 500 list, and the company was growing so fast it was doubling in size every three years; that really informed my thinking about growth, and GE taught me how to run a business. Then I spent 21 years as a CEO, building three different national companies for nine different private equity firms. I bought 58 companies; I'm a buy-and-build guy, a turnaround guy, and I've got two and a half billion dollars in CEO exits under my belt. That kind of led me to writing books about how to do this. I wanted to educate: I'm turning 60 here shortly, and I wanted to start thinking about legacy, about teaching the next generation of entrepreneurs and business owners how to excel at this game, this thing that has been so influenced by private equity. That led me to hanging up my CEO cleats a couple of years ago. I started a consulting business, and I've got clients all over the globe. I help them with scaling, with doing M&A, teaching them the tricks that the big institutional investors use to create shareholder wealth, and then I help people exit. I work with private equity firms, and I work with individuals and founders. I'm having a ball; I work more hours now than I ever did when I was a CEO, so so much for slowing down. I think I've actually sped it up.
Awesome. And can you share a few stories on how you acquired new businesses, grew them, and sold them, like the top businesses you worked on?

Well, usually in my world, and again, I've been doing this with large institutional shareholders, they always start with what's called a platform company. They're going to buy a business, and that'll be the base, and then on top of that base we're going to add other businesses to try to help it grow exponentially faster. If I take my last company as an example: the company that I was hired to run. A private equity firm buys a company, a platform company with 200 million plus in revenue; they buy it with a combination of debt and equity from their fund, and then they bring me in. The company had not done well, so it was time to bring in the guy to turn it around, to fix it, to get it scaling again, and then to start doing a buy and build. So I then bought 23 companies over a five-year period: eight in total in the first hold period, 15 in the second, bolting on these other businesses to go from regional to national, national to international, depending on which company I was building at the time. And in addition to growing through M&A, I would also focus on improving the business I started with: usually investing in technology, trying to do my best to increase the revenues and the profitability of the base business, and then also putting a lot of effort into organic growth, to get the business that was underwhelming and not doing really well to grow like it had never grown before, organically. And so I've learned how to build, I'll say, a very balanced, growth-oriented company, but there is no question that M&A is the largest component of shareholder value creation. On that example, of those 23 companies that we bought, on average I paid five times for each one; they were small, and they were plentiful. And I used 100% bank debt to buy those 23 companies, and I used the cash flow of the 23 businesses to service the debt while I was collecting them, buying them. And then, when we went to market, we sold, and we sold that first time at a multiple of around 14 times. So things I was buying at five times, I'm now selling at 14 times, and the net result is a tremendous amount of shareholder value created. Then you add in the organic growth, you add in the margin improvement, and that's kind of my recipe for the perfect exit.
In that case, my first exit, over the three-year period, it was a 4x multiple of invested capital. So shareholders were happy, investors were happy, the management team was thrilled; we made a ton of money. That's when things go well.
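To make the multiple-arbitrage math concrete, here is a rough back-of-the-envelope sketch in Python. Every figure is hypothetical: the transcript does not say what the multiples are applied to (EBITDA is assumed here) or what each add-on earned. The sketch only illustrates the mechanic Adam describes, buying small companies at around 5x with bank debt and selling the combined platform at around 14x.

```python
# Back-of-the-envelope buy-and-build multiple arbitrage.
# All numbers are hypothetical illustrations, not Adam's actual figures.
addon_ebitda = 2.0   # assumed $M EBITDA of one small add-on acquisition
buy_multiple = 5     # add-ons bought at ~5x
sell_multiple = 14   # combined platform sold at ~14x
n_addons = 23        # number of add-ons rolled up

total_ebitda = n_addons * addon_ebitda
purchase_cost = total_ebitda * buy_multiple   # funded with bank debt
exit_value = total_ebitda * sell_multiple

print(f"EBITDA acquired:     ${total_ebitda:.0f}M")
print(f"Purchase cost (5x):  ${purchase_cost:.0f}M")
print(f"Exit value (14x):    ${exit_value:.0f}M")
# The same earnings are re-rated from 5x to 14x, which is the core of
# the value creation, before organic growth or margin improvement.
print(f"Value uplift:        ${exit_value - purchase_cost:.0f}M")
```

On top of this re-rating come the debt paid down out of the add-ons' own cash flow, the organic growth, and the margin improvement he mentions.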
Now, as you know, in the tech industry the opportunity to create value and wealth is immense, but it's also very competitive. We have many fresh graduates coming straight out of university and trying to build a new business, a new startup, but they have no idea how to do this. So what strategy or what mindset would you recommend to our ambitious tech entrepreneurs who want to get their foot in the door?

Yeah, so tech is an entirely different world; you have to get into different concepts. Software as a service, or a tech-enabled platform, definitely brings a higher valuation, call it at exit. But from a tech startup perspective, I'd say that a lot of people out there are trying to create the next new best thing, and oftentimes I tell people: instead of trying to create something new that doesn't yet exist, potentially solve an old problem, or put a new spin on something that's already out there. I think too often we try too hard as entrepreneurs to create something new and differentiated that the world has never seen before, and sometimes boring old problems still need help and still need solving; they can be updated and solved in a new, modern fashion. So I think entrepreneurs sometimes overthink complexity. And when I am talking to people about what constitutes a great company, I tell them to think about basic human needs; think about needs versus wants. In a bad economy, a down economy, if my business is focused on needs, I'm not going to get hurt as badly; my revenue streams will still be fairly consistent. But if my product or my service is a want, then if I'm laid off, or I'm unemployed, or I'm feeling a pinch from high interest rates, I can slow down, I can avoid, or I can completely ignore that spend for an extended period of time until the economy comes back. And so we have to be concerned with the cyclicality of the broader economies of the world: we go through up cycles, we go through down cycles, and the world can throw us curveballs like Covid. So we have to be very thoughtful: if we're going to start something, I want it to be needs-based.
It's like this: if my roof is leaking and it's raining outside and I'm in my house with water pouring on my head, I have to fix that whether I'm broke or not. But if I wanted to put fancy new accessories on my big monster truck out there, and I'm unemployed and I don't have any money, then I just look at the magazine and dream about what I would do; I don't have the money, so I don't do it. It's a discretionary spend. So: needs versus wants. Then we want subscription-based versus, I'll call it, project-based; we want some type of product that customers are going to pay us a monthly fee for. It's shocking to me when I go through my credit card statements and look at all the monthly fees I'm paying, for Adobe Acrobat and for Google Cloud and for Apple this or that; I spend a fortune every month on just these recurring, contracted charges. And that's also the key to entrepreneurial success: once I find a customer, I want to create a recurring revenue stream. Even in games, people might have a free game, but there are in-app purchases to help augment it. So if I'm thinking from a tech perspective: needs versus wants; a contracted revenue stream versus one-time use or project-based; and then, in a perfect world, low capital expenditure, meaning not a lot of money needed to further develop or refine the product once it's created. Creating a profile like that leads to high profitability and high free cash flow, and with high free cash flow comes the ability to service a lot of debt, which means buyers who want to use debt as a primary funding source can service a lot of debt, because there's a lot of cash flow. So if you can build a business with high free cash flow, focused on needs not wants, with a recurring, contracted revenue stream, you're going to do much, much better.
Yeah, that's great advice. Oftentimes entrepreneurs start working on some kind of new project thinking they are solving a problem in a certain way, but actually they are not solving any new problem at all, and they end up hitting a wall, that moment of "but isn't someone else already doing that?", when they talk to an investor.

Yeah. Sometimes in boring industries we solve a problem, but those industries are a staple, a mainstay of an economy. We can get a lot better traction when we solve a common problem for common people, rather than create a new problem that someone doesn't know they have yet and then have to convince them they need our product to solve it.

Yeah, 100%. And many startups need human capital; they need capital to employ new people, to invest in marketing or other resources. Now, building trust with investors is very important, because they want to know they are investing in someone trustworthy, so that they can not only get their money back but also make multiple returns. So how would you advise new people entering the field on building trust with investors?
to prove you know proof of Concept and prove that you can can actually create these revenue streams so it's a very delicate balance and it makes startups a very difficult place to to to to be you know and and oftentimes I I I ask myself do I want to build or do I want to buy and I'll I'll look at the existing Market in place you know and I'll say look if I start from scratch I have a very high probability of failure um I I have a a lot of hurdles that I'm going to
have to to Cross and I might ask myself is there an existing company that has the existing technology or has the existing product that I can buy that's pre-existing and as a result I've got a company that has Revenue customers a history of profitability and then it's a different game you know so in the startup world we use things like Founders equity and and we we we try to attract people by by telling them you know how rich they're going to be one day down the road in the Future and that's a hard sell you
know and I I got to tell you I get contacted constantly with people who want to offer me Founders Equity to to help them and you know what I work for cash flow I don't work for Founders Equity uh and when I'm sitting on the boards of companies they give me stock anyway so I I get I get stock in an existing company with real Revenue real customers you know and I get cash flow and so I personally won't work in a tech type Startup world where there's Founders Equity involved so I I think we
So I think we have to be realistic, and we have to profile. Any time I need people, my goal and objective is to hire the best people I can find for the company that I want to be in five years, not the company I am today. Part of my tenets is that I have to pay a fair market wage; but if I can't, because I'm cash-constrained, then the only tool I've got is incentive equity to try to attract people, and then my profile might change. I may not be looking for an established executive who is used to making seven figures a year, because I have no money to pay them. So I'm looking for a different profile: it's a younger person, an up-and-coming person, a person with great skills, but they live in their mother's basement, or they live in an apartment, and their cost structure is low; they don't yet have kids, they're not yet married, and as a result I can attract them with the equity potential despite the lack of cash flow, because their needs are lower. Me, I'm a seven-figure, eight-figure-a-year guy, so I don't work for free; I don't work for equity that may or may not pay off in 10 years; I work for cash and equity. So we have to think about the talent that we need, where we are going to find it, and how we are going to attract it and retain it. We have to build a profile for the type of person we think would be uniquely qualified to go on this entrepreneurial journey with us, especially when we're cash-constrained in the beginning and just don't have the right level of capital. I need brilliance on a budget, so I'm going to look for the profile of a person with low cash-flow needs, where my small, paltry salary will at least cover their basic needs, because they have cheap basic needs but brilliant skills, and they're trying to become, call it, the next tech billionaire or multimillionaire. They'll believe in the journey, and they'll use, call it, sweat equity to get there.

Now, in your experience, how do successful companies balance innovation with sustainable growth? For example, we have a lot of businesses that are innovating, but they keep on innovating in an unsustainable way.

A small tech entrepreneur who is trying to create something that has, you know, legs, I'll call it, something with the long-lasting ability to build sustainable revenue in the future, at some point has to
shift entrepreneurial gears and say: it's good enough, it's good enough for now, and our focus needs to be scaling. Then the innovation, the investment, we don't necessarily want to stop, but we do need to throttle back. If I've gotten to a proof of concept and I'm out in the marketplace, there is a point where we have to be thinking: well, if I continue to spend money I don't have, innovating, innovating, innovating, and while this is important, I'm never going to build a sustainable business if I don't also keep my eye on the ball, on the fact that my investors need to see a return and I need to create revenue. So as I get out of the gates, as I get out into the market and start to see revenue coming in, we really have to drive revenue hard and show sustainable, high levels of revenue growth and the high margins that we were hoping for; we have to demonstrate this. So we put a lot of initial effort, call it on the technology side, into innovating and creating the product; once we get out and launch, that needs to scale back, and our efforts need to be replaced by focusing on marketing and sales and building the revenue stream. We have to remember that in order to build the best business in the world, it still has to be fed with cash, and investors will eventually run out of patience and pull the rug out from under us if we can't prove that we've got revenue. And so I think back to, like, Elon Musk in the early days of Tesla, or Jeff Bezos at Amazon: on any given day, Tesla could have come crashing down if the lenders had started saying, enough is enough, I'm not loaning you any more; it's time for you to either make money or shut down. He was able to navigate that, as was Jeff when he was building Amazon, but the typical small entrepreneur isn't going to get that kind of treatment. They are not going to be able to sustain innovation and investment on a hope and a prayer if they cannot prove that the money is there. So don't forget that while we may be interested in technologically changing the world, there's the commercial aspect: we've got to make money. And before I worry about making money big, I need to prove to people I can make money small. Once I've got my product to the stage where it's ready to generate revenue, I need to turn down innovation, turn up marketing, and really focus on driving revenue creation and customer adoption, so that I can then start generating cash, which will let me go back to innovating at a future time. So we have to be balanced: a lot of times entrepreneurs forget the commercial aspect, and the commercial aspect is that we've got to make money; we're so busy innovating, we forget that we have to make money, and before long investors get tired of us, because there are a thousand other things for them to invest in; they pull the rug out from under us, and we crash and burn. So the best technology on the planet does not guarantee you commercial success; we have to drive commercial success as soon as we're able, in order to prove the sustainability of our business.

No, 100%. And I feel like ego has some kind of role in that, where an entrepreneur is very convinced, for certain reasons, but also because of their ego, that this innovation can only bring growth, although in reality it only hinders their growth.
That's why dreamers dream and doers do. There's what I call the accidental arrogance of success. People get so into their own self-promotion: I'm God's gift, and what I've done is going to change the world. I see those pitches every day, from people out there saying: Adam, my idea plus your wallet equals the best thing the planet has ever seen. And I'm like: first of all, if you're talking to me about money, you don't understand my value, because my value is what's up here, not what's in my wallet. Money is a commodity; there are trillions of it out there looking for investments right now. All you have to do is know where it is, go get it, treat it well, and give it an outsized return, and you'll get funded. Money is a commodity; money is not the issue. People who are focused on money as the problem don't understand how money works. So in addition to being, call it, a tech genius, they need to have business acumen, and/or partner with someone who understands business; they can be the person who's locked in the dark room 20 hours a day innovating and creating something great, but they still need some business person out there to be the front end. And when we get arrogant, and keep in mind most of these people have not created anything yet, when they have the arrogance of success before they're actually earning revenue and building something special, then boy, that's an entrepreneur who's going to have a hard time finding capital, finding money. There's a fine line between arrogance and confidence: we need to be confident; we shouldn't be arrogant. And if we're arrogant with no money, and we're arrogant with an idea but no revenue, then investors just simply walk away; that's not an adventure that I'm going to back. So we have to be careful about letting the arrogance of our genius cloud our thinking, because ultimately investors see right through that. You know what, it's okay to be arrogant if you're the richest man on the planet; you have arrived. But when you have an idea and no revenue, and you're arrogant, and you're arrogant with investors, that's not a good recipe for success.

And we are almost hitting the time. Could you tell us about your services? For example, we have new startups, but they have no idea how to do business; they don't have the business acumen. So maybe they can talk to you?
Yeah, so I do consulting work. People can read my books; they're cheap, and I donate my royalties to charity. All three of my books have been number one bestsellers, so thank you to everybody out there who reads my books. I've been on hundreds of podcasts just like this one; I do those freely, so you can find them through ListenNotes. From there, I teach seminars globally, which are relatively low cost, and I'll do boot camps where we spend two to four days together and I really get in depth about all things around growth and raising capital and selling businesses. And then I work one-on-one with dozens of entrepreneurs; I have a peer group we call the Chairman Group, which I run with my business partner JT Foxx. You can reach out to me on LinkedIn, or you can go to my website, AdamECoffey.com. I'd say LinkedIn is where you'll find me the most; I'm most active there, and it's really the only social media platform I'm on. I post some things on Twitter once in a while; I'm not on Instagram or Facebook. There's a fake Adam Coffey out there, believe it or not; I guess you've arrived when there are people imitating you, and so on Facebook and Instagram you'll find a fake Adam Coffey trying to take money from you. I'm trying to help people, not bilk them for money. So I'm a consultant, and I do consulting work with
all kinds of different people: private equity firms, family offices, and so on. So thanks for having me on; I appreciate you, I appreciate your listeners out there. Good luck, take care of people, and revenue will happen.

Thank you, Adam.

This video was sponsored by LunarTech. At LunarTech we are all about making you ready for your dream job in tech, making data science and AI accessible to everyone, whether it's data science, artificial intelligence, or engineering. At LunarTech Academy we have courses and boot camps to help you become a job-ready professional. We are also here to help businesses, schools, and universities with top-notch training, curriculum modernization with data science and AI, and corporate training, including the latest topics like generative AI. With LunarTech, learning is easy, fun, and super practical. We care about providing an end-to-end learning experience that is both practical and grounded in fundamental knowledge, and our community is all about supporting each other and making sure you get where you want to go. Ready to start your tech journey? LunarTech is where you begin. Students and aspiring data science and AI professionals: visit the LunarTech Academy section to explore our courses and boot camps, and our programs in general. Businesses in need of employee training, upskilling, or data science and AI solutions should head to the technology section of the LunarTech page. Enterprises looking for corporate training, curriculum modernization, and customized AI tools to enhance education: please visit the LunarTech Enterprises section for a free consultation and a customized estimate. Join LunarTech and start building your future, one data point at a time.