I'm very clear that I'm a big fan of test everything which is any code change that you make any feature that you introduce has to be in some experiment because again I've observed this sort of surprising result that even small bug fixes even small changes can sometimes have surprising unexpected impact and so I don't think it's possible to experiment too much you have to allocate Sometimes to these high risk High reward ideas we're going to try something that's most likely to fail but if it does win it's going to be a home run and you
have to be ready to understand and agree that most will fail and it's amazing how many times I've seen people come up with new designs or a radical new idea and they believe in it and that's okay I'm just cautioning them all the time to say hey if you go for something big try it out but be ready to fail eighty percent of the time. Welcome to Lenny's Podcast, where I interview world-class product leaders and growth experts to learn from their hard-won experiences building and growing today's most successful products. Today my guest is Ronny Kohavi.
Ronny is seen by many as the world expert on A/B testing and experimentation. Most recently he was VP and technical fellow of relevance at Airbnb, where he led their search experience team. Prior to that he was corporate vice president at Microsoft, where he led Microsoft's experimentation platform team, and before that he was director of data mining and personalization at Amazon. He's currently a full-time advisor and instructor. He's also the author of the go-to book on experimentation, called Trustworthy Online Controlled Experiments, and in our show notes you'll find a code to get a discount on taking his live cohort-based course on Maven. In our conversation we get super tactical about A/B testing. Ronny shares his advice for when you should start considering running experiments at your company, how to change your company's culture to be more experiment driven, what are signs your experiments are potentially invalid, why trust is the most important element of a successful experiment culture and platform, and how to get started if you want to start running experiments at your company. He also explains what a p-value actually is, and something called Twyman's law, plus some hot takes about Airbnb and experiments in
general. This episode is for anyone who's interested in either creating an experiment-driven culture at their company or just fine-tuning one that already exists. Enjoy this episode with Ronny Kohavi, after a short word from our sponsors. This episode is brought to you by Mixpanel. Get deep insights into what your users are doing at every stage of the funnel, at a fair price that scales as you grow. Mixpanel gives you quick answers about your users, from awareness to acquisition through retention, and by capturing website activity, ad data, and multi-touch attribution right in Mixpanel, you can improve every aspect of the full user funnel, powered by first-party behavioral data instead of third-party cookies. Mixpanel is built to be more powerful and easier to use than Google Analytics. Explore plans for teams of every size and see what Mixpanel can do for you at mixpanel.com/friends/lenny, and while you're at it, they're also hiring, so check it out at mixpanel.com/friends/lenny. This episode is brought to you by Round. Round is the private network built by tech leaders for tech leaders. Round combines the best of coaching, learning, and authentic relationships to help
you identify where you want to go and accelerate your path to get there, which is why their waitlist tops thousands of tech execs. Round is on a mission to shape the future of technology and its impact on society. Leading in tech is uniquely challenging, and doing it well is easiest when surrounded by leaders who understand your day-to-day experiences. When we're meeting and building relationships with the right people, we're more likely to learn, find new opportunities, be dynamic in our thinking, and achieve our goals. Building and managing your network doesn't have to feel like networking. Join Round to surround yourself with leaders from tech's most innovative companies, build relationships, be inspired, and take action. Visit round.tech/apply and use promo code Lenny to skip the waitlist. That's round.tech/apply. Ronny, welcome to the podcast. Thank you for having me. So you're known by many as maybe the leading expert on A/B testing and experimentation, which I think is something every product company eventually ends up trying to do, often badly, and so I'm excited to dig quite deep into the world of experimentation and A/B testing to help people
run better experiments so thank you again for being here great goal thank you let me start with kind of a fun question what is maybe the most unexpected a B Test you've run or maybe the most surprising result from an A B test that you run yeah so I think the the opening example that I use in my book and in my class is the most surprising public example we can talk about and this is this was kind of an interesting experiment somebody proposed to change the way that ads were displayed on Bing the search
engine and he basically said let's take the second line and move it, promote it to the first line so that the title line becomes larger, and when you think about that, and there's, you know, if you're going to look in my book or in the class there's an actual diagram of what happened, the screenshots, but if you think about it, realistically it doesn't look like much of an idea, like why would this be such a reasonable, interesting thing to do, and indeed when we went back to the backlog it was on the backlog
for months and languished there, and many things were rated higher. But the point about this is it's trivial to implement, so if you think about return on investment, we could get the data by having some engineers spend a couple of hours implementing it. And that's exactly what happened: somebody at Bing who kept seeing this in the backlog said, my God, we're spending too much time discussing it, I could just implement it. And he did, he spent a couple of days implementing it, and as is the common thing at Bing, he launched the experiment. And a funny thing happened: we had an alarm, a big escalation, something is wrong with the revenue metric. Now this alarm had fired several times in the past when there were real mistakes, where somebody would log revenue twice, or there's some data problem, but in this case there was no bug. That simple idea increased revenue by about 12 percent, and this is something that just doesn't happen. We can talk later about Twyman's law, but that was the first reaction: this is too good to be true, let's find the bug
uh and we did and we looked for several times and we replicated the experiment several times and there was nothing wrong with it this thing was worth a hundred million dollars at the time when Bing was a lot smaller and the key thing is it didn't hurt the user metric so it's very easy to increase Revenue by doing theatrics that you know displaying more ads is a trivial way to raise revenue but it hurts the user experience and we've done the experiments to show that in this case this was just a home run that improved
Revenue, didn't significantly hurt the guardrail metrics, and so we were like in awe of, you know, what a trivial change, that was the biggest revenue impact at Bing in all its history. And that was basically switching two lines, right, switching two lines in the search results, right, and this was just moving the second line to the first line. Now you then go and run a lot of experiments to understand what happened here: is it the fact that the title line has a bigger font, sometimes a different color? So we ran a whole bunch of experiments, and this is what usually happens when we have a breakthrough, you start to understand more about what can we do, and there was suddenly a shift towards, okay, what are other things we could do that would allow us to improve revenue. We came up with a lot of follow-on ideas that helped a lot, but to me this was an example of a tiny change that was the best revenue-generating idea in Bing's history, and it wasn't rated properly, right, nobody gave it the priority that in hindsight it deserved. And that's something that
happens often I mean we are often humbled by how bad we are at predicting the outcome of experiments this reminds me of a classic experiment at Airbnb while I was there and we'll talk about Airbnb in a bit the search team just ran a small experiment of what if we were to open a new tab every time someone clicked on a search result instead of Just going straight to that listing and that was one of the biggest wins in search yeah and by the way I I don't know if you know the history of this
but I tell about this in class we did this experiment way back around 2008 I think and so this predates Airbnb and I remember it was heavily debated like why would you open something in a new tab the users didn't ask for it uh it was a lot of pushback from the Designers and we ran that experiment and again it was one of these highly surprising results that made it that we learned so much from it so we first did this it was done in the UK for opening Hotmail and then we moved it to
MSN so it would open search results in a new tab, and all the set of experiments were highly, highly beneficial. We published this, and I have to tell you, when I came to Airbnb I talked to our joint friend Ricardo about this, and it was sort of done, it was very beneficial, and yet it was semi-forgotten, which is one of the things you learn about institutional memory: when you have winners, make sure to document them and remember them. So it was done at Airbnb for a long time before I joined, that listings opened in a new tab, but other things that were designed later were not done, and I reintroduced this to the team and we saw big improvements. Shout out to Ricardo, our mutual friend, who helped make this conversation happen. There's this like holy grail of experiments that I think people are always looking for, of like one hour of work and it creates this massive result. I imagine this is very rare, and don't expect this to happen. I guess in your experience, how often do you find kind of one of these gold nuggets just lying around? Yeah, so
again this is a topic that's uh near and dear to my Heart everybody wants these amazing results and you know I show them in chapter one in my book multiple of these you know small efforts huge gain but as you said they're very rare I think most of the time the winnings are made sort of this inch by inch and there's a graph that I show in my book a real graph of how Bing ads has managed to improve their revenue per thousand searches over time and every month you can see a small Improvement and
a small improvement, sometimes a degradation because of legal reasons or other things, you know, there were some concerns that we were not marking the ads properly, so you have to suddenly do something that you know is going to hurt revenue. But yes, I think most results are inch by inch, you improve small amounts, lots of them. I think the best examples that I can give are a couple that I can speak about. One is at Bing: the relevance team, hundreds of people, all working to improve Bing relevance, they have a metric, we'll talk about the OEC, the overall evaluation criterion, but they have a metric that their goal is to improve by two percent every year. It's a small amount, and that two percent, you can see here's a 0.1 and here's a 0.15, here's a 0.2, and then they add up to around two percent every year, which is amazing. Another example that I am allowed to speak about, from Airbnb, is the fact that we ran some 250 experiments in my tenure there in search relevance, and again small improvements added up, so this
became overall a six percent Improvement to revenue you know so when you think about six percent it's a big number but it became it came out not of one idea but many many smaller ideas that each gave you a small gain and in fact I again there's another Number I'm allowed to say of these experiments 92 percent failed to improve the metric that we were trying to move so only eight percent of our ideas actually were successful at moving the key metrics there's so many threads I want to follow here but let me follow this
one right here. You just mentioned that 92 percent of experiments failed. Is that typical in your experience, running and seeing experiments run at a lot of companies? Like, what should people expect when they're running experiments, what percentage should they expect to fail? Well, first of all, I've published three different numbers from my career. So overall at Microsoft, about 66 percent, two-thirds, of ideas fail, right, and don't take the 66 as precise, like, you know, it's about two-thirds. At Bing, which is a much more optimized domain, after we've been optimizing it for a while, the failure rate was around 85 percent, so it's harder to improve something you've been optimizing for a while. And then at Airbnb, this 92 percent number is, you know, the highest failure rate that I've observed. Now I've quoted other sources, you know, it's not that I worked at groups that were particularly bad: Booking, Google Ads, other companies published numbers that are around 80 to 90 percent failure rates of ideas, or rather of experiments. Here it is important to realize that when you have a platform, it's easy to get this number, you look at how many
experiments were run and how many of them launched. Not every experiment maps to an idea, so it's possible that when you have an idea, your first implementation, you start an experiment, boom, it's egregiously bad because you have a bug. In fact, 10 percent of experiments tend to be aborted on the first day; those are usually not because the idea is bad but because there is an implementation issue or something we haven't thought about that forces an abort. You may iterate and pivot again, and ultimately if you do two or three or four pivots or bug fixes, you may get to a successful launch. But those numbers of 80 to 92 percent failure rates of experiments are very humbling. I know that every group that starts to run experiments always starts off by thinking that somehow they're different and their success rate is going to be much, much higher, and they're all humbled. You mentioned that you had this pattern of clicking a link and opening a new tab as a thing that just worked at a lot of different places. Are there other versions of this, do you collect kind of a list
of like, here's things that often work when we want to move a metric, are there some you could share, I don't know if you have a list in your head? I'll give you two resources. One of them is a paper that we wrote called Rules of Thumb, and what we tried to do at that time at Microsoft was to just look at thousands of experiments that were run and extract some patterns, so that's one paper that we can put in the notes. Perfect. But there's another, more accurate I would say, resource that's useful that I recommend to people, and it's a site called goodui.org, and goodui.org is exactly the site that tries to do what you're saying at scale. So the guy's name is Jakub Linowski, he asks people to send him results of experiments, and he puts them into patterns, there are probably like 140 patterns I think at this point, and then for each pattern he says, well, has it helped, how many times, and by how much, so you have an idea of, you know, this worked three out of five times and it was a huge win. In fact, you can find that open-a-new-window pattern in there. I feel like you feed that into ChatGPT and you have basically a product manager creating a roadmap. In general, by the way, a lot of this is about institutional memory, right, which is, can you document things well enough so that the organization remembers the successes and failures and learns from them. I think one of the mistakes that some companies make is they launch a lot of experiments and never go back and
summarize the learnings. So I've actually put a lot of effort into this idea of institutional learning, of doing the quarterly meeting of the most surprising experiments. By the way, surprising is another question; people often are not clear about what is a surprising experiment. To me, a surprising experiment is one where the estimated result beforehand and the actual result differ by a lot, so the absolute value of the difference is large. Now, you can expect something to be great and it's flat, well, you learn something, but if you expect something to be small
and it turns out to be great like that ad title uh promotion then you've learned a lot or conversely if you expected something will be small and it's very negative you can learn a lot by understanding why this was so negative and that's interesting so we focus not just on the winners but also Surprising losers things that people thought would be a no-brainer to run and then for some reason it was very negative and sometimes it's that negative that gives you and so I'll actually you know I'm just coming up with one example of that
that I should mention we were running this experiment at Microsoft to improve the windows indexer and the team was able to show on offline test that it does much better at Indexing and you know they showed some relevance is higher and all these good things and then they ran it as an experiment you know what happened surprising result indexing the relevance was actually high but it killed the battery life um so here's something that comes from Left Field that you didn't expect it was consuming a lot more CPU on laptops it Was killing the laptops
and therefore, okay, we learned something, let's document it, let's remember this, so that we now take this other factor into account as we design the next iteration. What advice do you have for people to actually remember these surprises? You said that a lot of it is institutional, what do you recommend people do so that they can actually remember this when people leave, say, three years later? Document it, right. We had a large deck internally of these successes and failures, and we encourage people to look at them. The other thing
that's very beneficial is just to have your whole history of experiments and do some ability to search by keywords right so I'm I have an idea type a few keywords and see if from the thousands of experiments that ran and by the way these are very reasonable numbers at Microsoft just to let you know when I left in 2019 we were on a rate of about 20 to 25 000 experiments every year so every working day we were starting something like a hundred new treatments big numbers so when you're running in a group like Bing
which is running thousands and thousands of experiments, you want to be able to ask, has anybody done an experiment on this or this or this, and so that searching capability is in the platform. But more than that, I think just doing the quarterly meeting of the most successful, most interesting, sorry, not just successful, most interesting experiments is very key, and that also helps the flywheel of experimentation. This is a good segue to something I wanted to touch on, which is there's often, I guess, a wariness of running too many experiments and being too data driven, and the sense that experimentation just leads you to these micro-optimizations and you don't really innovate and do big things. What's your perspective on that, and then, just, can you be too experiment driven in your experience? I'm very clear that I'm a big fan of test everything, which is any code change that you make, any feature that you introduce, has to be in some experiment, because again I've observed this sort of surprising result that even small bug fixes, even small changes, can sometimes have surprising, unexpected impact, and so I don't think it's possible
to experiment too much. I think it is possible to focus too much on incremental changes, because some people say, well, you know, we only tested 17 things around this, and you have to think about it, it's like in stocks, you need a portfolio. You need some experiments that are incremental, that move you in the direction where you know you're going to be successful over time if you just try enough, but you have to allocate some time to these high-risk, high-reward ideas: we're going to try something that's most likely to fail, but if it does win it's going to be a home run. And so you have to allocate some efforts to that, and you have to be ready to understand and agree that most will fail, most of these high-risk ideas. It's amazing how many times I've seen people come up with new designs or a radical new idea and they believe in it, and that's okay, I'm just cautioning them all the time to say, hey, if you go for something big, try it out, but be ready to fail eighty percent of the time. Right, and one true
example that, again, I'm able to talk about because we put it in my book: we were at Bing trying to change the landscape of search, and one of the ideas, the big ideas, was we're going to integrate with social, so we hooked into the Twitter firehose feed and we hooked it to Facebook, and we spent a hundred person-years on this idea and it failed. You don't see it anymore, it existed for about a year and a half, and all the experiments were just negative to flat, and you know, it was an attempt, it was fair to try it. I think it took us a little long to fail, to decide that this is a failure, but at least we had the data, we had hundreds of experiments that we tried, none of them were a breakthrough, and I remember sort of emailing Qi Lu with some statistics showing that, you know, it's time to abort, it's time to fail on this, and, you know, he decided to continue more, and it's a million dollar question: do you continue and then maybe the breakthrough will come next month, or
do you abort, and a few months later we aborted. That reminds me of, at Netflix they tried a social component that also failed, and at Airbnb early on there was a big social attempt, like, here's your friends who have stayed at these Airbnbs, and it had basically no impact. So maybe that's one of these learnings that we should talk about. Yeah, this is hard, this is hard, but that's again the value of experiments, which are this oracle that gives you the data. You may be excited about things, you might believe it's a good idea, but ultimately the arbiter, the oracle, is the controlled experiment: it tells you whether users are actually benefiting from it, whether the company and the users are benefiting. There's obviously a bit of overhead and downsides to running an experiment, setting it all up and, you know, analyzing the results. Is there anything that you ever don't think is worth A/B testing? First of all, there are some necessary ingredients to A/B testing, and I'll just say, not every domain is amenable to A/B testing, right, you can't A/B test mergers and acquisitions, right, it's something that happens once, you either acquire or you don't acquire. So you do have to have some necessary ingredients: you need to have enough units, mostly users, in order for the statistics to work out. So yeah, if you're too small, it may be too early to A/B test, but what I find is that in software it is so easy to run A/B testing and it is so easy to build a platform, I don't say it's easy to build a platform, but once you build the platform, the incremental
cost of running an experiment should approach zero and we got to that at Microsoft where After a while the cost of running experiments was so low that nobody was questioning the idea that everything should be experimented with now I don't think we were there at Airbnb for example the platform at Airbnb was much less mature and required a lot more analysts in order to interpret the results and to find issues with it so I do think there's this trade-off you're willing to invest in the platform it is possible to Get the marginal cost to be
close to zero, but when you're not there, it's still expensive, and there may be reasons why not to run A/B tests. You talked about how you may be too small to run A/B tests, and this is a constant question for startups: when should we start running A/B tests? Do you have kind of a heuristic or rule of thumb of just like, here's a time you should really start thinking about running A/B tests? A million dollar question that everybody asks. So I actually, we'll put this in the notes, but I gave a talk last year that I called practical defaults, and one of the things I show there is that unless you have at least tens of thousands of users, the math, the statistics just don't work out for most of the metrics that you're interested in. In fact, I gave an actual practical number of a retail site with some conversion rate trying to detect changes that are at least, you know, five percent beneficial, which is something that startups should focus on, they shouldn't focus on the one percent, they should focus on the five and ten percent. Then you need something like 200,000 users, right. So start experimenting when you're in the tens of thousands of users, you'll only be able to detect large effects, and then once you get to 200,000 users, the magic starts happening, then you can start testing a lot more, then you have the ability to test everything and make sure that you're not degrading things and you're getting value out of it. So you asked for a rule of thumb: 200,000 users is the magic number.
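(A rough sketch of the arithmetic behind that rule of thumb, assuming the common n ≈ 16 * variance / delta^2 approximation for roughly 80% power at alpha = 0.05; the 5% baseline conversion rate and 5% relative lift below are illustrative assumptions, not numbers from the episode.)

```python
# Rough power calculation behind the "200,000 users" rule of thumb.
# n per variant ~ 16 * variance / delta^2 (approx. 80% power, alpha = 0.05).

def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance of a conversion metric
    delta = baseline_rate * relative_lift           # absolute effect size to detect
    return round(16 * variance / delta ** 2)

n = users_per_variant(baseline_rate=0.05, relative_lift=0.05)
print(n, 2 * n)  # ~122,000 per variant, ~243,000 total for a 50/50 test
```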
Below that, start building the culture, start building the platform, start integrating, so that as you scale you start to see the value. Love it. Coming back to this kind of concern people have, of experimentation keeps you from innovating and taking big bets, I know you have this framework, the overall evaluation criterion, and I think that helps with this, can you talk a bit about that? The OEC, or the overall evaluation criterion, is something that I think many people that start to dabble in A/B testing miss, and the question is what are you optimizing for, and it's a much harder question than people think, because
it's very easy to say we're going to Optimize for money Revenue but that's the wrong question because you can do a lot of bad things that will improve Revenue so there has to be some countervailing Metric that tells you how do I improve Revenue without hurting the user experience okay so let's take a good example uh with search you can put more ads on the page and you Will make more money there's no doubt about it you will make more money in the short term the question is what happens to the user experience and how
is that going to impact you in the long term so we've run those experiments and we were able to map out you know this number of ads causes this much increase to churn this number of ads causes this much increase to the time that users take to find a successful result and we came up with an Oec that is based on of these metrics that allows you to say okay I'm willing to take this additional money if I'm not hurting the user experience by more than this much right so there's a trade-off there one of
the nice ways to phrase this as a constraint optimization problem I want you to increase Revenue but I'm going to give you a fixed amount of average real estate that you can use right so you can for one query you can have zero ads for another query you can Have three ads for a third query you can have wider bigger ads I'm just going to count the pixels that you take the vertical pixels and I will give you some budget and if you can under the same budget make more money you're good to go right
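(A toy sketch of that constrained-optimization framing, not Bing's actual system: per query, pick the ad layout that maximizes revenue while the average vertical-pixel use stays under a budget. The layouts, revenues, and the 200-pixel budget are made-up illustrative numbers.)

```python
# Toy constrained optimization: maximize revenue subject to an average pixel budget.
from itertools import product

layouts = {  # query -> list of (vertical_pixels, expected_revenue) options
    "q1": [(0, 0.00), (150, 0.10), (300, 0.15)],
    "q2": [(0, 0.00), (150, 0.30)],
    "q3": [(0, 0.00), (300, 0.40)],
}
PIXEL_BUDGET = 200  # average vertical pixels allowed per query

best = None
for choice in product(*layouts.values()):
    avg_pixels = sum(p for p, _ in choice) / len(choice)
    revenue = sum(r for _, r in choice)
    if avg_pixels <= PIXEL_BUDGET and (best is None or revenue > best[0]):
        best = (revenue, choice)

print(best)  # the richest per-query allocation whose average pixel use stays within budget
```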
so that to me turns the problem from a badly defined let's just make more money right any page can start Plastering more ads and make more money Short term but that's not the goal the goal is long-term growth and revenue then you need to insert these other criteria and what am I doing through the user experience one way around it is to put this constraint another one is just to have these other metrics again something that we did to look at the user experience how long does it take the user to reach a successful click
what percentage of sessions are successful these are key metrics that Were part of the overall evaluation Criterion that we've used I can give you another example by the way from you know the hotel industry or Airbnb that we both worked at um you can say I want to improve conversion rate but you can be smarter about it and say it's not just enough to convert a user to buy or to pay for a listing I want them to be happy with it several months down the road when they actually stay there right So that could
be part of your oec to say what is the rating that they will give to that listing when they actually stay there and that's a that causes an interesting problem because you don't have this data now you're gonna have it three months from now when they actually stay so you have to build the training set that allows you to make a prediction about whether this user whether Lenny is going to be happy at this cheap place or where They know I should offer him something more expensive because Lenny likes to stay at nicer places where
the water actually is hot and comes out of the faucet. That is true. Okay, so it sounds like the crux of this approach is basically have kind of a drag metric that makes sure you're not hurting something that's really important to the business, and then being very clear on what's the long-term metric we care most about. To me, the key here, the key word, is lifetime value, which is you have to define the OEC such that it is causally predictive of the lifetime value of the user, right, and that's what causes you to
think about things properly which is am I doing something that just helps me short term or am I doing something that will help me in the long term once you put that model of lifetime value people say okay what about retention rates you can measure that what about the time to Achieve a task we can measure that and those are these countervailing metrics that make it make the oec useful and to understand these longer term metrics what I'm hearing is use kind of models and forecasts and predictions or would you suggest sometimes use like a
long-term holdout or some other approach, like what do you find is the best way to see these? Well, so there's two ways that I like to think about it. One is you can run long-term experiments with the goal of learning something, so I mentioned that at Bing we did run these experiments where we increased the ads and decreased the ads so that we would understand what happens to key metrics. The other thing is you can just build models that use some of our background knowledge, or use some data science to look at historical data. I'll give you another good example of this: when I came to Amazon, one of the teams that reported to me was the email team. It was not the transactional emails, when you buy something you get an email, but the team that sent these recommendations, you know, here's a book by an author that you bought, here's a product that we recommend. And the question is how do we give credit to that team, and the initial version was, well, whenever a user comes from the email and purchases something on Amazon, we're gonna give
that email credit. Well, it turned out this had no countervailing metric: the more emails you send, the more money you're going to credit the team, and so that led to spam, literally. A really interesting problem: the team just ramped up the number of emails that they were sending out and claimed to make more money, their fitness function improved. And so then we backed up and we said, okay, we can either phrase this as a constraint satisfaction problem, you're allowed to send a user an email every X days, or, which is what we ended up doing, let's model the cost of spamming the users. Okay, what's that cost? Well, when they unsubscribe, we can't mail them, okay, so we did some data science study on the side and we said what is the value that we're losing from an unsubscribe, right, and we came up with a number, it's a few dollars. But the point was, now we have this countervailing metric: we say here's the money that we generate from the emails, here's the money that we're losing on long-term value, what's the trade-off?
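(A hypothetical back-of-the-envelope version of that trade-off; the dollar figures are made up for illustration, not Amazon's actual numbers.)

```python
# Net campaign value = revenue attributed to the campaign
#                      minus lifetime value lost from unsubscribes.

def campaign_net_value(attributed_revenue: float,
                       unsubscribes: int,
                       value_per_unsubscribe: float = 3.0) -> float:
    return attributed_revenue - unsubscribes * value_per_unsubscribe

# A campaign that "makes money" on clicks can still be net negative:
print(campaign_net_value(attributed_revenue=500.0, unsubscribes=400))  # -> -700.0
```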
And when we started to incorporate this formula, more than half the campaigns that were being sent were negative. Hmm. Okay, so it was a huge insight at Amazon about how to send the right campaigns, and this led, and this is what I like about these discoveries, this fact that we integrated the unsubscribe led us to a new feature, to say, well, let's not lose their future lifetime value through email when they unsubscribe, let's offer them by default to unsubscribe from this campaign. So when you get an email, you know, there's a new book by the author, the default unsubscribe would be, unsubscribe me from author emails. And so now the countervailing metric is much smaller, and so again this was a breakthrough in our ability to send more emails and understand, based on what users were unsubscribing from, which ones are really beneficial. I love the surprising results, we all love them. I mean, this is the humbling reality, and you know, people talk about the fact that A/B testing sometimes leads you to incremental changes; I actually think that many of these small insights lead to fundamental insights about, you
know which areas to go some strategies we should take some things we should develop helps a lot this makes me think about how every time I've done a full redesign of a product I don't think ever has it ever been a positive result and then the team always ends up having to claw back what they just hurt And try to figure out what they messed up is that your experience too absolutely yeah in fact I I've published uh some of these in in LinkedIn posts showing a large set of you know big launches that redesigns
that dramatically failed, and it happens very often. So the right way to do this is to say, yes, we want to do a redesign, but let's do it in steps and test on the way and adjust, so you don't need to take 17 new changes, many of which are going to fail; start to move incrementally in a direction that you believe is beneficial and adjust on the way. The worst part of those experiences, I find, is it took, like, I don't know, three to six months to build it, and by the time it's launched, it's just like, we're not gonna unlaunch this, everyone's been working in this direction, all the new features are assuming this is going to work, and you're basically stuck. Right, I mean, this is the sunk cost fallacy, right, we invested so many years in it, let's launch this even though it's bad for the user. No, that's terrible. Yeah, so this is the other advantage of recognizing this humbling reality that most ideas fail, right. If you believe in the statistics that I published, then doing 17 changes together is more likely to be negative; do them in smaller increments and learn from them. It's called OFAT, one factor at a time: do one factor, learn from it, and adjust. Of the 17, maybe you have four good ideas, those are the ones that will launch and be positive. I generally agree with that and always try to avoid a big redesign, but it's hard to avoid them completely. There's often team members that are really passionate, and like, we just need to rethink this whole experience, we're not going to incrementally get there. Have you found anything effective in helping
people either see this perspective, or just making a larger bet more successful? By the way, I'm not opposed to large redesigns. I try to give the team the data to say, look, here are lots of examples where big redesigns fail, try to decompose your redesign. If you can't decompose it to one factor, then do a small set of factors at a time and learn from these smaller changes what works and what doesn't. Now, it's also possible to do a complete redesign, I'm just, as you said yourself, be ready to fail, right. I mean, do you really want to work on something for six months or a year and then run the A/B test and realize that you've hurt revenue or other key metrics by several percentage points, and a data driven organization will not allow you to launch? What are you going to write in your annual review? Yeah, but nobody ever thinks it's going to fail, it's like, no, we got this, we've talked to so many people. But I think organizations that start to run experiments are humbled early on from the smaller changes. Yeah, right,
you're right, nobody. I'll tell you a funny story. When I came from Amazon to Microsoft, I joined a group, and for one reason or another that group disbanded a month after I joined, and so people came to me and said, look, you just joined the company, you're a partner level, you figure out how you can help Microsoft. And I said, I'm going to build an experimentation platform, because nobody in Microsoft is running experiments, and more than 50 percent of the ideas that we tried at Amazon failed. And the classical response was, we have better PMs here, right. There was this complete denial that it's possible that 50 percent of the ideas that Microsoft was implementing in a three-year development cycle could fail. By the way, this is how long it took Office to release, it was a classical every three years we release. And the data came about showing that, yeah, you know, Bing was the first to truly implement experimentation at scale, and we shared with the rest of the company the surprising results, and so, and this was, you know, credited to Qi Lu and Satya Nadella, they were the ones that said, Ronny, you try to get Office to run experiments, we'll give you the air support. And it was hard, but we did it, it took a while, but Office started to run experiments and they realized that many of their ideas were failing. You said that there's a site of failed redesigns, is that in your book, or is that a site that you can point people to, to kind of help build this case? I teach this in my class, but I think I've posted this on LinkedIn
and in answers to some questions, I'm happy to put that in the notes. Okay, cool, we'll put that in the show notes, because I think that's the kind of data that often helps convince a team, maybe we shouldn't rethink this entire onboarding flow from scratch, maybe we should kind of iterate towards it and learn as we go. This episode is brought to you by Eppo. Eppo is a next-generation A/B testing platform built by Airbnb alums for modern growth teams. Companies like DraftKings, Zapier, ClickUp, Twitch, and Cameo rely on Eppo to power their experiments. Wherever you work, running experiments is increasingly essential, but there are no commercial tools that integrate with a modern growth team's stack. This leads to wasted time building internal tools or trying to run your own experiments through a clunky marketing tool. When I was at Airbnb, one of the things that I loved most about working there was our experimentation platform, where I was able to slice and dice data by device type, country, user stage. Eppo does all that and more, delivering results quickly, avoiding annoying prolonged analytics cycles, and helping you easily get to the root cause of any issue you discover. Eppo lets you go beyond basic click-through metrics and instead use your North Star metrics like activation, retention, subscription, and payments. Eppo supports tests on the front end, on the back end, email marketing, even machine learning clients. Check out Eppo at geteppo.com, that's geteppo.com, and 10x your experiment velocity. Is it ever worth just going, let's just rethink this whole thing and just give it a shot to break out of a local minimum or local maximum, essentially? So I think what you said is fair, I mean, I do want to
allocate some percentage of resources to big bets. As you said, we've been optimizing this thing to hell, it could be completely redesigned, it's a very valid idea, you may be able to break out of a local minimum. What I'm telling you is 80 percent of the time you will fail, so be ready for that, right. What people usually expect is, my redesign is going to work. No, you're most likely going to fail, but if you do succeed, it's a breakthrough. I like this 80 percent rule of thumb, is that just like a simple way of thinking about it? Eighty percent, yeah, rule of thumb, and you know, I've heard people say it's 70 or 80, but it's in that area where I think, you know, when you talk about how much to invest in the known versus the high-risk, high-reward, that's usually the right percentage, most organizations end up doing this allocation. Right, you interviewed Shreyas, I think he mentioned that, you know, Google is like 70 percent, you know, the search and ads, and it's 20 percent for some of the apps and new stuff, and then it's the 10 percent for infrastructure. Yeah, I think the most important point there is, if you're not running an experiment, 70 percent of the stuff you're shipping is hurting your business. Well, it's not hurting, it may be flat to negative, some of them are flat. And by the way, flat to me, if something is not stat-sig, that's a no-ship, because you've just introduced more code, there is a maintenance overhead to shipping your stuff. I've heard people say, look, we already spent all this time, the team will be demotivated if we don't ship it,
and no that's wrong guys right you know let's make sure that we understand that shipping this project has no value is complicating the code Base maintenance costs will go up you don't ship on flat unless it's a sort of a legal Requirement right when legal comes along and says you have to do X or Y you have to ship on flat or even negative and that's understandable but again I think that's something that a lot of people make the mistake of saying legal told us we have to do this therefore we're going to take the
hits no legal gave you a framework that you have to work under try three different things and ship the one that hurts the least I love that reminds me when Airbnb Launched the Rebrand even that they ran as an experiment with the entire homepage redesigned the new logo and all that and I think there's a long-term holdout even and I think it was positive in the end from what I remember speaking of Airbnb I want to chat about Airbnb briefly I know there's and you're limited in what you can share but uh it's interesting that
Airbnb seems to be moving in this other direction, where it's becoming a lot more top-down, Brian vision oriented, and Brian's even talked about how he's less motivated by experiments, he doesn't want to run as many experiments as they used to, and things are going well, so, you know, it's hard to argue with the success, potentially. You worked there for many years, you ran the search team essentially, I guess just what was your experience like there, and then, roughly, what's your sense of how things are going, where it's going? So as you know,
I'm restricted from Talking about Airbnb I will say a few things that I am allowed to say one is in my team in search relevance everything was a b tested so while Brian can focus on some of the design aspects that people who are actually doing you know the neural networks and the search everything was a b tested to help so nothing was launching without an A B test we had targets around improving uh certain metrics and everything was done A B test now other teams some did some did not I will say that you
know, when you say things are going well, I think we don't know the counterfactual. I believe that had Airbnb kept people like Greg Greeley, who was pushing for a lot more data-driven decision making, and had Airbnb run more experiments, it would have been in a better state than today, but it's the counterfactual, we don't know. It's a really interesting perspective. Yeah, there may be such an interesting natural experiment of a way of doing things differently, there's like de-emphasizing experiments, and also they turned off paid ads during COVID, and I think, I don't know where
it is now but it feels like it's become a much smaller part of the growth strategy who knows if they've ramped it up to back to where it is today but I think it's going to be a really interesting case study looking back I don't know 5-10 years from now it's a one-off experiment where it's Hard to assign value to some of the things that Airbnb is doing I personally believe it could have been a lot bigger and a lot more successful if it had run more controlled experiments but I can't speak about some of
those that I ran, and that showed that some of the things that were initially untested were actually negative and could be better. All right, mysterious. One more question on Airbnb: you were there during COVID, which was quite a wild time for Airbnb, we had sunshine on the podcast talking about all the craziness that went on when travel basically stopped and there was a sense that Airbnb was done and travel was not going to happen for years and years. What's your take on experimentation in that world, where you have to really move fast, make crazy decisions, make big decisions, what was it like during that time? So I think actually in a state like that it's even more important to run A/B tests, right, because what you want to be able to see is, if we're making this change, is it actually helping in the current environment. You know, there's this idea of external generalizability: is it going to work out now during COVID, is it going to generalize later on? These are things that you can really answer with a controlled experiment, so sometimes it means that you might have to replicate them six months down the road, when COVID, say, is not as impactful as it is. As to saying that you have to make decisions quickly, to me, I'll point you to the success rate: if in peacetime you're wrong two-thirds to 80 percent of the time, why would you be suddenly right in wartime? So I don't believe in the idea that because bookings went down materially, the company should suddenly, you know, not be data driven and do things differently. I think if Airbnb stayed the course, did nothing, the revenue would have
gone up in the same way in fact if you look at one investment one big investment that was done at the time was online experiences and the initial data wasn't very promising and I Think today it's a footnote yeah whatever another case study for the history books Airbnb experiences I want to shift a little bit and talk about your book which you mentioned a couple of times it's called trustworthy online controlled experiments and I think it's basically the book on a b testing let me ask you just uh what surprised you most about writing this
book and putting it out, and the reaction to it? I was pleasantly surprised that it sold more than what we thought, more than what Cambridge predicted. So when we were first approached by Cambridge, after a tutorial that we did, to write a book, I was like, I don't know, this is too small of a niche area, yeah, and you know, they were saying, so you'll be able to sell a few thousand copies and help the world. And I found my co-authors, who were great, and we wrote a book that we thought is not statistically oriented, has fewer formulas than you normally see, and focuses on the practical aspects and on trust, which is the key. The book, as I said, was more successful, it sold over 20,000 copies in English, it was translated to Chinese, Korean, Japanese, and Russian, and so it's great to see that we helped the world become more data driven with experimentation, and I'm happy because of that, and I was pleasantly surprised. By the way, all proceeds from the book are donated to charities, so if I'm pitching the book here, there is no financial gain for me from having more copies sold. I think we made that decision, which was a good decision, that all proceeds go to charity. Amazing, I didn't know that, we'll link to the book in the show notes. You talked about how trust, like, trust is in the title, you just mentioned how important trust is to experimentation. A lot of people talk about how do I run experiments faster; you focus a lot on trust. Why is trust so important in running
experiments? So to me, the experimentation platform is the safety net and it's an oracle, so it serves really two purposes. The safety net means that if you launch something bad, you should be able to abort quickly, right, safe deployment, safe velocity, there are some names for this, but this is one key value that the platform can give you. The other one, which is the more standard one, is at the end of the two-week experiment we will tell you what happened to your key metric and to many of the other surrogate, debugging, and guardrail metrics. Trust builds up, and it's easy to lose, and so to me it is very important that when you present this and say, this is science, this is a controlled experiment, this is the result, you better believe that this is trustworthy, and so I focus on that a lot. I think it allowed us to gain the organizational trust that this is real, and the nice thing is, when we built all these checks to make sure that the experiment is correct, if there was something wrong with it, we would stop and say, hey, something is wrong with
the experiment. And I think that's something that some of the early implementations in other places did not do, and it was a big mistake. I mention this in my book, so I can mention this here: Optimizely in its early days was very statistically naive. They sort of said, hey, we're real time, we can compute your p-values in real time, and then you can stop an experiment when the p-value is statistically significant. That is a big mistake that materially inflates your what's called type one error, or false positive rate. So if you think you've got a five percent type one error, or you aim for that p-value less than 0.05, using that real-time p-value monitoring to stop early you would probably have a 30 percent error rate. So what this led to is that people that started using Optimizely thought that the platform was telling them they're very successful, but then they actually started to say, well, it told us this is positive revenue, but I don't see it over time, like by now we should have doubled our money, so questions started to come up around the trust in the platform.
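(A small simulation of the peeking problem he's describing, assuming NumPy and SciPy are available; the batch sizes, number of peeks, and simulation count are arbitrary illustrative choices.)

```python
# Run A/A tests (no real effect), peek at the p-value after every batch of
# users, and stop as soon as it dips below 0.05. The realized false positive
# rate ends up far above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_peeks, batch = 1000, 20, 500
false_positives = 0

for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_peeks):
        a = np.concatenate([a, rng.normal(size=batch)])
        b = np.concatenate([b, rng.normal(size=batch)])
        if stats.ttest_ind(a, b).pvalue < 0.05:  # peek and declare "significance"
            false_positives += 1
            break

print(false_positives / n_sims)  # well above 0.05, roughly 0.2-0.3 with 20 peeks
```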
There's a very famous post that somebody wrote about how Optimizely almost got me fired, by a person who basically said, look, I came to the org and I said we have all these successes, but then I said something is wrong, and he tells how he ran an A/A test, where there is no difference between the A and the B, and Optimizely told him that it was statistically significant too many times. Optimizely learned. Several people pointed this out, I pointed this out in my Amazon review of the book that the Optimizely authors wrote early on, I said, hey, you're not doing the statistics correctly; Ramesh Johari at Stanford pointed this out, became a consultant to the company, and then they fixed it. But to me that's a very good example of how to lose trust: they had built a lot of trust in the market and they lost all this trust because they built something that had a very inflated error rate. That is pretty scary to think about, you've been running all these experiments and they weren't actually telling you accurate results. What are
signs that what you're doing may not be valid if you're starting to run experiments, and then just, how do you avoid having that situation, what kind of tips can you share for people trying to run experiments? You know, there's a whole chapter on that in my book, but I'll say maybe the one that is the most common by far, which is a sample ratio mismatch. Now what is a sample ratio mismatch? If you design the experiment to send 50 percent of users to control and 50 percent of users to treatment, based on a random number or, you know, a hash function, and you get something off from 50 percent, it's a red flag. So let's take a real example: let's say you're running an experiment and it's large, it's got a million users, and you got 50.2. So people say, well, I don't know, it's not going to be exactly 50, is 50.2 reasonable or not? Well, there's a formula that you can plug in, I have a spreadsheet available for those that are interested, and you can tell, here's how many users are in control, here's how many users are in treatment, my design was 50/50, and it tells you the probability that this could have happened by chance. Now in a case like this, you plug in the numbers, it might tell you that this should happen one in half a million experiments. Well, unless you've run half a million experiments, it's very unlikely that you would get a 50.2 versus 49.8 split, and therefore something is wrong with the experiment. I remember when we first implemented this check, we were surprised to see how many experiments suffered from this.
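(A minimal version of that check, assuming SciPy is available; the 502,000 / 498,000 counts just mirror the 50.2 percent example, and the exact probability will differ from the one-in-half-a-million figure he quotes from memory.)

```python
# Sample ratio mismatch check: under a 50/50 design, how surprising is this split?
from scipy.stats import binomtest

control, treatment = 502_000, 498_000          # 50.2% vs 49.8% of a million users
result = binomtest(control, n=control + treatment, p=0.5)
print(result.pvalue)  # ~6e-5: far too unlikely under a true 50/50 split, so don't trust the experiment
```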
And there's a paper that was published around 2018 where we share that at Microsoft, even though we'd been running experiments for a while, around eight percent of experiments suffered from a sample ratio mismatch, and it's a big number. Think about this: you're running 20,000 experiments a year, so eight percent of them are invalid, and somebody has to go down and understand what happened here, we know that we can't trust the results, but why. So over time you begin to understand there's something wrong with the data pipeline, there's something
that happens with bots. Bots are a very common factor for causing sample ratio mismatch, and that paper that was published by my team talks about how to diagnose sample ratio mismatches. In the last about a year and a half, it was amazing to see all these third-party companies implement sample ratio mismatch tests, and all of them were reporting, oh my God, you know, six percent, eight percent, ten percent, so yeah, it's sometimes fun to go back and say how many of your results in the past were invalid before you had this sample ratio mismatch test. Yeah, that's frightening. Is the most common reason this happens that you're assigning users in kind of the wrong place in your code? So when you say most common, I think the most common is bots: somehow they hit the control or the treatment in different proportions, because you changed the website, the bot may fail to parse the page and try to hit it more often, that's a classical example. Another one is just the data pipeline, we've had cases where we were trying to remove bad traffic
under certain conditions, and it was skewed between control and treatment. I've seen people that start an experiment in the middle of the site, on some page, but they don't realize that some campaign is pushing people to one of the sides, so there's multiple reasons. It is surprising how often this happens, and I'll tell you a funny story, which is when we first added this test to the platform, we just put a banner saying, you have a sample ratio mismatch, do not trust these results, and we noticed that people ignored it: they were still presenting results that had this banner. And so we blanked out the scorecard, we put a big, you know, red, can't see this result, you have a sample ratio mismatch, click OK to expose the results. And why do we need that OK button? We need that OK button because you want to be able to debug the reasons, and sometimes the metrics help you understand why you have a sample ratio mismatch. So we blanked out the scorecard, we have this button, and then we started to see that people pressed the button and still presented the results of experiments with a sample ratio mismatch. So we ended up with an amazing compromise, which is every number in the scorecard was highlighted with a red line, so that if you took a screenshot, other people could tell that there was a sample ratio mismatch. Freaking product managers. This is intuition, people just say, wow, my SRM was small, therefore I can still present the results. People want to see success, I mean, this is a natural bias, and we have to be very conscientious and fight that bias and say, when something looks too good
to be true, investigate. Which is a great segue to something you mentioned briefly, something called Twyman's law, can you talk about that? Yeah, you know, the general statement is, any figure that looks interesting or different is usually wrong. It was first said by this person in the UK who worked in radio media, but I'm a big fan of it, and my main claim to people is, if the result looks too good to be true, if, you know, your normal movement of an experiment is under one percent and you suddenly have a 10 percent movement, hold the celebration there. Like, that's just your first reaction, right, let's take everybody to a fancy dinner because we just improved revenue by millions of dollars: hold that dinner, investigate, because there's a large probability that something is wrong with the result. And I will say that nine out of ten times when we call out Twyman's law, it is the case that we find some flaw in the experiment. Now there are obviously outliers, right, that first experiment that I shared where we promoted the second line and made long ad titles, that was successful, but that was replicated multiple times and double and triple checked and everything was good about it. Many other results that were that big turned out to be false, so I'm a big, big fan of Twyman's law. There's a deck, I could also give this in the notes, where I shared some real examples of Twyman's law. Amazing. I want to talk about rolling this out at companies and things you run into that fail, but before I get to that, I'd love for you to explain just
p-values. I know that people kind of misunderstand them, and this might be a good time to just help people understand: what is it actually telling you, that p-value of, say, 0.05? I don't know if this is the right forum for explaining p-values, because the definition of a p-value is simple but what it hides is very complicated. So I'll say one thing, which is that many people treat one minus the p-value as the probability that your treatment is better than control. So you run an experiment, you get a p-value of 0.02, and they think there's a 98 percent probability that the treatment is better than the control. That is wrong. So rather than defining p-values, I want to caution everybody that this most common interpretation is incorrect. The p-value is a conditional probability: it assumes that the null hypothesis is true, and we're computing the probability of the data we're seeing under that null hypothesis. In order to get the probability that most people want, we need to apply Bayes' rule and invert the probability, from the probability of the data given the hypothesis to the probability of the hypothesis given the data. For that we need an additional number, which is the prior probability that the hypothesis you're testing is successful or not. That's an unknown, but what we can do is take historical data and say, look, people fail two-thirds of the time, or eighty percent of the time, and we can apply that number and compute what's called the false positive risk. We've done that in a paper that I will give in the notes, so that you can assess the number that you really want. So I think that's something for people
to internalize: what you really want to look at is this false positive risk, which tends to be much, much higher than the five percent that people think. I think the classical example is Airbnb, where the failure rate was very, very high. Let me actually pull up the note so that I have the actual number. If you're at Airbnb search, where the success rate of experiments is only eight percent, and you get a statistically significant result with a p-value less than 0.05, there's a 26 percent chance that this is a false positive result. Right? It's not five percent, it's 26. So that's the number that you should have in your mind.
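To make the arithmetic behind that 26 percent concrete, here's a rough sketch of the false positive risk calculation via Bayes' rule. It assumes 80 percent power and counts only wins in the positive direction (hence alpha divided by two); those assumptions are mine for illustration and happen to reproduce the number quoted here.

```python
def false_positive_risk(prior_success_rate, alpha=0.05, power=0.80):
    """P(no real effect | statistically significant win), via Bayes' rule.

    Counts only the positive direction, so a two-sided alpha contributes alpha/2
    of false wins; `power` is the assumed chance of detecting a real improvement.
    """
    true_wins = power * prior_success_rate               # real improvements we detect
    false_wins = (alpha / 2) * (1 - prior_success_rate)  # flukes that look like wins
    return false_wins / (true_wins + false_wins)

print(f"{false_positive_risk(0.08):.0%}")   # ~26% when only 8% of ideas truly win
```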
And that's why, when I worked at Airbnb, one of the things we did is we said: okay, if your p-value is less than 0.05 but above 0.01, rerun and replicate. When you replicate, you can combine the two experiments and get a combined p-value using something called Fisher's method or Stouffer's method, and that gives you the joint probability, which is usually much, much lower. So if you get two results around 0.05 or something like that, the joint probability is much, much lower.
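For the replication step, combining the two p-values is a one-liner with SciPy; this sketch applies both Fisher's and Stouffer's methods to two hypothetical, independent one-sided p-values near 0.05.

```python
from scipy.stats import combine_pvalues

# Two independent runs of the same experiment, each only marginally significant.
p_values = [0.04, 0.05]

_, p_fisher = combine_pvalues(p_values, method="fisher")
_, p_stouffer = combine_pvalues(p_values, method="stouffer")
print(f"Fisher: {p_fisher:.3f}, Stouffer: {p_stouffer:.3f}")  # both well below either run alone
```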
Wow, I have never heard it described that way. It makes me think about how even the data scientists on our teams are always saying: this isn't perfect, we're not 100 percent sure this experiment is positive, but on balance, if we're launching positive experiments, we're probably doing good things, and it's okay if sometimes we're wrong. By the way, it's true that on balance you're probably better than 50-50, but people don't appreciate how high that 26 percent I mentioned is. And the reason I want people to be really sure is that it leads to this idea of learning, of institutional knowledge: what you want to be able to do is share successes with the organization, and so you want to be really sure that you're successful. So by lowering the p-value threshold, forcing teams to work with a p-value maybe below 0.01 and do replication on the higher ones, you can be much more successful and the false positive rate will be much, much lower. Fascinating, and it also shows the value of
keeping track of just what percent of your experiments are failing, historically, at the company or within that specific product. Say someone listening wants to start running experiments, and say they have tens of thousands of users at this point. What would be the first couple of steps you'd recommend? Well, if they have somebody in the org that has previously been involved in experimentation, that's a good way to consult internally. I think the key decision is whether you want to build or buy. There's a whole series of eight sessions that I posted on LinkedIn where I invited guest speakers to talk about those problems, or if people are interested they can look at what the vendors say and what the agencies say about the build-versus-buy question. And it's usually not zero or one, it's usually both: you build some and you buy some, and it's a question of whether you build 10 percent or 90 percent. I think for people starting out, the third-party products that are available today are pretty good. This wasn't the case when I started: when I started running experiments at Amazon, we were building the platform because nothing existed, and the same at Microsoft. I think today there are enough vendors that provide good, trustworthy experimentation platforms that I would say it's reasonable to consider using one of those. So say you're at a company where there's resistance to experimentation and A/B testing, whether it's a startup or a bigger company. What have you found works in helping shift that culture, and how long does that usually take, especially at a larger company? My general experience is with Microsoft, where we went with this beachhead of Bing. We were running a
few experiments, and then we were asked to focus on Bing, and we scaled experimentation and built a platform at scale at Bing. Once Bing was successful and we were able to share all these surprising results, I think many, many more people in the company were amenable. It also helped a lot that there's the usual cross-pollination: people from Bing moved out to other groups, and that helped those other groups say, hey, there's a better way to build software. So I think if you're starting out, find a place, find a team where experimentation is easy to run, and by that I mean they're launching often. Don't go with the team that launches every six months; you know, Office used to launch every three years. Go with the team that launches frequently: they're running on sprints, they launch every week or two. Bing used to launch multiple times a day. And then make sure that you understand the question of the OEC, the overall evaluation criterion: is it clear what they're optimizing for? There are some groups where you can come up with a good OEC; some groups are harder. I remember one funny example was the microsoft.com website, and this is not MSN, this is microsoft.com, which has multiple different constituencies: it's a support site, it's a way to sell software, it warns you about safety and updates. It has so many goals. I remember when the team said, we want to run experiments, I brought the group in with some of the managers and I said: do you know what you're optimizing for? It was very funny, because they surprised me. They said, hey Ronny, we read some of your papers, we know there's this term called OEC, and we decided that time on site is our OEC. And I said, wait a minute, one of your main goals is support: is people spending more time on the support site a good thing or a bad thing? And half the room thought that more time is better and half the room thought that more time is worse. So you know the OEC
is bad if, directionally, you can't agree on it. That's a great tip. Along these same lines, you're a big fan of platforms, of building a platform to run experiments versus just one-off experiments. Can you talk briefly about that, to give people a sense of where they should probably be going with their experimentation approach? Yeah, I think the motivation is to bring the marginal cost of experiments down to zero, so the more you can self-serve the better: go to a website, set up your experiment, define your targets, define the metrics that you want. People don't appreciate that the number of metrics starts to grow really fast if you're doing things right. At Bing you could define 10,000 metrics that you wanted in your scorecard. That's a big number, so big that it was computationally inefficient, so we broke them into templates: if you were launching a UI experiment you would get this set of 2,000, if you're doing a revenue experiment you would get that set of 2,000, and so on. So the point was: build a platform that allows you to quickly set up and run an experiment and then analyze it. One of the things that I will say about Airbnb is that the analysis was relatively weak, and so lots of data scientists were hired to compensate for the fact that the platform didn't do enough. This happens in other organizations too; there's this trade-off. If you're building a good platform, invest in it so that more and more automation allows people to look at the analysis without the need to involve a data scientist. We published a paper, again I'll give it in the notes, with a nice matrix of six axes and how you move from crawl to walk to run to fly, and what you need to build on those six axes. One of the things that I do sometimes when I consult is I go into the org and say: where do you think you are on these six axes? And that should be the guidance for what you need to do next. This is going to be the most epic
show notes episode we've had yet. Maybe a last question: we talked about how important trust is throughout experimentation, and how even though people talk about speed, trust ends up being most important. Still, I want to ask you about speed: is there anything you recommend for helping people run experiments faster and get results more quickly that they can implement? Yeah, I'll say a couple of things. One is, if your platform is good, then when the experiment finishes you should have a scorecard soon after; maybe it takes a day, but it shouldn't be that you have to wait a week for a data scientist. To me this is the number one way to speed things up. Now, in terms of using the data efficiently, there are mechanisms out there under the title of variance reduction that help you reduce the variance of metrics, so that you need fewer users and can get results faster. Some examples that you might think about are capping metrics: if your revenue metric is very skewed, maybe you say, well, if somebody purchased over a thousand dollars, let's just count that as a thousand dollars. At Airbnb, one of the key metrics, for example, is nights booked. Well, it turns out that some people book tens of nights, they're like an agency or something, even hundreds of nights. You may say, okay, let's just cap this; it's unlikely that people book more than 30 nights in a given month. That variance reduction technique will allow you to get statistically significant results faster.
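A minimal sketch of that capping idea; the thousand-dollar and 30-night caps echo the examples in the conversation, and the toy numbers are made up.

```python
import numpy as np

def cap_metric(values, cap):
    """Cap (winsorize) a skewed per-user metric to reduce its variance."""
    return np.minimum(np.asarray(values, dtype=float), cap)

# One outlier purchase dominates the revenue metric's variance.
revenue = np.array([12.0, 49.0, 0.0, 8_500.0, 23.0])
print(revenue.var(), cap_metric(revenue, cap=1_000.0).var())   # variance drops sharply

# Same idea for nights booked: treat anything above 30 nights as 30.
print(cap_metric([2, 3, 1, 240, 4], cap=30))
```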
A third technique is called CUPED, from an article that we published, again I can give it in the notes, which uses pre-experiment data to adjust the result. We can show that the adjusted result is unbiased but has lower variance, and hence it requires fewer users.
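And a minimal sketch of the CUPED adjustment itself, assuming the same metric is available for each user from a pre-experiment period; the variable names and simulated data are mine, not from the paper.

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: adjust the experiment metric y with the pre-experiment covariate x.

    The adjusted metric keeps the same mean (so the treatment effect is unbiased)
    but has lower variance whenever x and y are correlated.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # regression-style coefficient
    return y - theta * (x - x.mean())

# Toy data where pre-period spend strongly predicts in-experiment spend.
rng = np.random.default_rng(0)
pre = rng.gamma(2.0, 50.0, size=10_000)
during = 0.8 * pre + rng.normal(0.0, 20.0, size=10_000)
print(during.var(), cuped_adjust(during, pre).var())   # adjusted variance is far smaller
```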
Ronny, is there anything else you want to share before we get to our very exciting lightning round? No, I think we've asked a lot of good questions. I hope people enjoy this. I know they will. Lightning round, here we go, I'm just going to roll right into it. What are two or three books that you've recommended most to other people? There's a fun book called Calling Bullshit, which, despite the name, which is a little extreme for a title, actually has a lot of amazing insights that I love, and it sort of embodies, in my opinion, a lot of Twyman's law: showing that when things are too extreme, your meter should go up and say, hey, I don't believe that. So that's my number one recommendation. There's a slightly older book that I love called Hard Facts, Dangerous Half-Truths, and Total Nonsense, by Stanford professors from the Graduate School of Business; it's very interesting to see that many of the things we grew up with as well understood turn out to have no justification. And then a somewhat stranger book, which I love, sort of on the verge of psychology, called Mistakes Were Made (But Not by Me), about all the fallacies that we fall into, and the humbling results from that. The titles of these are hilarious, and there's a common theme across all these books. Next question: what is a favorite recent movie or TV show? I recently saw a short series called Chernobyl, about the disaster. I thought it was amazingly
well done. Yeah, I highly recommend it. It's based on true events; as usual there's some artistic freedom in the movie. It was kind of interesting that at the end they say this woman in the movie wasn't really one woman, it was a composite of about 30 scientists who in real life presented all the data to the leadership about what to do. Fun fact: I was born in Odessa, Ukraine, which was not so far from Chernobyl, and I remember my dad told me he had to go to work, they called him into work that day, to clean some stuff off the trees, I think ash from the explosion or something. It was far enough away that I don't think we were exposed, but yeah, we were in the vicinity. That's pretty scary. My wife thinks, yeah, every time something's wrong with me she's like, oh, that must be a Chernobyl thing. Okay, next question: a favorite interview question you like to ask people when you're interviewing them? So it depends on the interview, but I'll give you one. When I do a technical interview, which I do less of, one question that I love, and it is amazing how many people it trips up, is, for languages like C++: tell me what the static qualifier does. And there are multiple answers: you can apply it to a variable, you can apply it to a function. It is just amazing that, I would say, more than 50 percent of people interviewing for an engineering job cannot get this, and get it awfully wrong. Definitely the most technical answer to this question. Yeah,
I love it. Okay, what's a favorite recent product you've discovered that you love? Blink cameras. A Blink camera is this small camera: you stick in two double-A batteries and it lasts for about six months. They claim up to two years; my experience is usually about six months. But it was just amazing to me how you can throw these things around the yard and see things that you would never know otherwise, you know, animals that go by. We had a skunk that we couldn't figure out how he was entering, so I threw five cameras out and I saw where he came in. Where'd he come in? He came in under a hole in the fence that was about this high; I have a video of this thing just squishing underneath. We never would have assumed that it came from there, from the neighbor's side. These things have just changed that, and when you're away on a trip it's always nice to be able to say, I can see my house, everything's okay. At one point we had a false alarm and the cops came in, and we had this amazing video of them entering the house and pulling their guns out. You gotta share that on TikTok, that's good content. Wow, okay, Blink cameras, we'll set those up in my house ASAP. What is something relatively minor you've changed in the way your teams develop product that has had a big impact on their ability to execute? I think this is something that I learned at Amazon, which is the structured narrative. Amazon has some variants of this; sometimes they go by the six-page narrative or
something, but when I was at Amazon I still remember that email from Jeff, which was: no more PowerPoint, I'm going to force you to write a narrative. I took that to heart, and for many of the features that the team presented, instead of a PowerPoint you start off with a structured document that tells you the questions you need to answer for your idea, and then you review them as a team. At Amazon these were paper-based; now it's all based on Word or Google Docs where people comment, and I think the impact of that was amazing. The ability to give people honest feedback, have them appreciate it, and have it stay after the meeting in the notes on the document was just amazing. Final question: have you ever run an A/B test on your life, either your dating life, your family, or kids, and if so, what did you try? There aren't enough units, remember I said you need tens of thousands of units to run an A/B test. But I will say a couple of things. One is, I try to emphasize to my family and friends and everybody this idea called the hierarchy of evidence. When you read something, there's a hierarchy of trust levels: if something is anecdotal, don't trust it; if there was a study but it was observational, give it some bit of trust; and as you move up to natural experiments, controlled experiments, and multiple controlled experiments, your trust level should go up. I think that's a very important thing that a lot of people miss when they see something in the news: where does it come from? I have a talk that I've shared of all these observational studies that were published, and then somehow a controlled experiment was run later on and proved that they were directionally incorrect. So I think there's a lot to learn about this idea of the hierarchy of evidence and to share it with our family and kids and friends. And I think there's a book that's based on this, like how to read a book. Well Ronny, the experiment of us recording a podcast, I think, is a hundred percent positive, p-value 0.0. Thank you so much for being
here. Thank you so much for inviting me, and for great questions. Amazing, I appreciate that. Two final questions: where can folks find you online if they want to reach out, and is there anything that listeners can do for you? Finding me online is easy, it's LinkedIn. And what can people do for me? Understand the idea of controlled experiments: they're a mechanism to make the right data-driven decisions, to use science. Learn more by reading my book if you want; again, all proceeds go to charity. And if you want to learn more, there's a class that I teach every quarter on Maven. We'll put in the notes how to find it, and some discount for people who managed to stay all the way to the end of this podcast. Yeah, that's awesome, I'll include that at the top so people don't miss it, so there's going to be a code to get a discount on your course. Ronny, thank you again so much for being here, this was amazing. Thank you so much. Bye, everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.