Hello, and welcome to Designing Better ML Systems: Learnings from Netflix. My name is Rob Hilton; I'm a principal solutions architect with AWS, and I'm joined by Netflix's very own Savin Goyal, a software engineer on their machine learning infrastructure team. Netflix has been a long-time AWS customer and open source proponent, and for the last decade or so they've been focusing that lens on the wide world of data science. In today's talk we're going to discuss some of the infrastructure decisions that Netflix has made, as well as some of the open source projects that help ease the burden on their data scientists, and we'll also step through what it will take for you and your organizations to get started building similar systems. To kick it off, I'm going to hand it over to Savin.

Thanks, Rob. Netflix has made significant investments over the years in machine learning, and we are constantly looking for ways to improve our machine learning systems. Now, what does that mean for us? How do we go about getting meaningful improvement? Of course, one way is to invest in getting more and better data, as well as innovating on building bigger and better models. Yet another approach, and the one my team focuses on, is providing a modern data science stack to our data scientists, so that they can go from prototype to production as quickly as possible and be as productive as possible. In this talk I'm going to cover how we think about a data science stack, what that means internally at Netflix, and some of the open source work we have done in that area, and my hope is that some of these thoughts will be useful for you as well as you go about building similar infrastructure at your companies.

Now, what does the infrastructure stack look like? At the end of the day, it is composed of layers that any organization collecting and operating on data already runs. At the very bottom we have the data warehouse. Then you need compute resources, and some way to manage and orchestrate that compute. And of course, data scientists are very familiar with notebooks; that's their lingua franca when it comes to authoring machine learning code, so
we need notebook infrastructure for them as well. And all of this should allow for rapid prototyping and rapid model development, without imposing constraints on their day-to-day work.

Now, what does that look like at Netflix? At Netflix, Amazon S3 serves as our data lake and our data warehouse. All of the tabular data that Netflix ingests is available as Parquet files exposed as Iceberg tables, and S3 also serves as the repository for all the TV shows and movies that we shoot around the world and make available to our members. Using Amazon S3 as our data lake has paid significant dividends for us over the years. It has allowed us to decouple compute from data storage, which means we can invest in different query engines and different compute engines and move as the industry moves. So we have invested in Presto and Apache Spark, and all of these are exposed through our federated query engine called Genie, which is available as an open source project, so that our users have a unified interface through which they can launch these jobs without worrying about how to locate a Presto cluster or which Spark cluster has access to which data sets. All of the metadata around these data sets is managed and exposed through Metacat, so our users can very easily discover all the data available to them within the Netflix universe and then query it meaningfully.

Now, when it comes to machine learning workloads, there's one behavior we observed internally: most of the time, our data scientists are actually interested in accessing a single partition, or multiple partitions, of one single Iceberg table. That means going through a query engine like Apache Spark or Presto can at times be less efficient than, say, going directly to S3 and saturating your network bandwidth to pull the data as quickly as possible. So internally at Netflix we have invested a lot in tools that let our users short-circuit these query engines and feed data to their machine learning workloads as quickly as possible, and a lot of this innovation is available as part of our open source tooling around Metaflow.

Now, once you have machine learning workloads that can reliably and quickly ingest these data sets, the next question is how to execute them. Where do they actually run, on a user's laptop, or in the cloud? Of course, for quick prototyping we want our users to be able to run these workloads on their laptops, but then there comes a point where they need
access to GPUs, or to more memory and CPU cores than their laptops provide, and that necessitates a move to the cloud. At Netflix, our container orchestration system is called Titus, and a close parallel in the AWS world would be AWS Batch. This allows our users to very efficiently take the code running on their laptops and launch it in containers at scale.

Now, what's interesting about these machine learning workloads is that they can be rather bursty and resource intensive. Think about a use case where, as a data scientist, you want to train a model for every country and every language that Netflix operates in. We are already talking about roughly 2,000 to 4,000 different models, depending on the permutations. Now, if you want to do a hyperparameter sweep over each of those models, we are looking at 10^5 to 10^6 models in expectation, and that places a huge load on our container orchestration systems. It can meaningfully starve out other workloads that might have higher priority, and that's where careful resource management becomes really, really important for us. Both AWS Batch and Titus expose queuing mechanisms, where a machine learning workload can very efficiently enqueue tasks, and the queues then release those tasks, based on priority, onto the actual container orchestration system, say Elastic Container Service or Kubernetes. That allows us to meaningfully isolate these machine learning workloads and make sure they don't end up impacting any of our production workloads.

At the same time, for many of these machine learning workflows, cost efficiency is a key concern. Many of these training pipelines end up using GPUs or other exotic instances, which can often be very expensive, and it can be really hard for our end users to estimate the resource requirements of a job ahead of time. So we have invested in machine learning models and other systems that can accurately predict the resource requirements of these jobs, so that we can pack them onto physical instances much more efficiently, drive up utilization, and ultimately drive down cost.

Now, once you have all of these workloads running, and a machine learning workflow, as we just discussed, can be rather wide, you need to think about how to orchestrate each of the individual jobs. You can have a job that processes a terabyte of data and needs an instance with terabytes of RAM; then, once you have done feature engineering and whittled your data down, then
you might want to offload the compute to an instance with GPUs so you can train your model, and then you might want to move your compute to a general-purpose instance where you can characterize the model's quality. That is essentially the role of an orchestrator. The orchestrator we use internally at Netflix is Netflix Conductor, and a very close AWS offering is AWS Step Functions.

Now, when we look at machine learning workflows, some of them are meaningfully different from, say, data engineering workflows or other ETL batch pipelines. Most importantly, machine learning workflows can be rather long-running. Say you are training a computer vision model that needs to crunch through every frame Netflix has ever collected; that can take weeks to execute, so we need to make sure the orchestrator supports long-running workflows. At the same time, these machine learning workflows multiply rather quickly. Netflix is a big believer in experimentation, so for any change we push out, whether it's a change to a model or a user-facing change, we always run multiple A/B tests. That means running multiple instances of the same pipeline in parallel, with small variations, to verify that, yes, the newer model, or the newer set of models, is indeed better than what's running in production right now. And with many data scientists and many teams working on many different problems, the number of machine learning workflows running at any given point in time can multiply very quickly, so we need to make sure our workflow orchestrators also scale well for us. In practice, we have found AWS Step Functions and Conductor to scale out for us in a very meaningful manner.

Now, one thing that's really important when we think about machine learning workflows is that they are not isolated islands. They consume data from, say, an upstream data engineering ETL, and the model they produce needs to connect to some downstream process that consumes it, say hosting it as a microservice
or generating offline scores. And it's really important to have some capability to piece together these different processes and pipelines, which is where data signaling comes in really handy for us. In the AWS universe, an equivalent offering for configuring data signals would be Amazon EventBridge: say you have an upstream data engineering pipeline, and as soon as it's done writing to a table it can emit an event; then you have a bunch of other pipelines, your machine learning pipelines, waiting on those events, and they can trigger on top of that. That's a much better mechanism than, say, polling for a given timestamp to trigger your pipelines.

Once you provide all of these capabilities, the other thing data scientists care about is the authoring experience: how do I actually go about writing these workflows and working through my ideas? We have made huge investments in the notebook ecosystem internally at Netflix. If you are using Python or R, you can very easily spin up a Jupyter notebook and get access to any of the instance types that AWS provides to run your compute in the back end. We also have yet another open source project called Polynote, which gives users who are more interested in the Scala and JVM ecosystem the same benefits of the notebook ecosystem.

Now, when you are running machine learning workloads as a data scientist, monitoring and observability are key issues. You want to know how your model is behaving over time, how the ML workflow you just launched is behaving, and you want to inspect the model that has just been produced, which calls for some sort of ML monitoring solution. In our experience, it's actually rather tricky to build a monitoring solution that can handle the diverse set of problems that teams at Netflix are trying to solve; the model monitoring requirements for a natural language processing problem are going to be very different from those of, say, a computer vision problem.
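To make the monitoring idea concrete, here is a generic sketch, not anything Netflix-specific, of the kind of distribution check an ML monitoring dashboard might compute: the Population Stability Index (PSI), a common model-agnostic drift metric that compares a feature's serving-time distribution against its training distribution.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Values near 0 mean the distributions match; common rules of
    thumb warn around 0.1 and flag drift above roughly 0.25.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against all-identical values

    def frac(xs, b):
        # Fraction of xs falling in bin b; floored at 1e-6 to avoid log(0).
        in_bin = sum(
            1 for x in xs
            if lo + b * width <= x < lo + (b + 1) * width
            or (b == bins - 1 and x == hi)
        )
        return max(in_bin / len(xs), 1e-6)

    return sum(
        (frac(actual, b) - frac(expected, b))
        * math.log(frac(actual, b) / frac(expected, b))
        for b in range(bins)
    )

# An unchanged distribution scores ~0; a shifted one scores high.
train = [i / 100 for i in range(100)]
drifted = [x + 0.5 for x in train]
```

A real solution would compute something like this per feature and per model output, which is exactly why a one-size-fits-all monitoring product is hard: the features worth watching for an NLP model and a computer vision model barely overlap.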
What we found rather interesting is that notebooks by themselves can be an effective mechanism for monitoring your machine learning pipelines. Data scientists are already familiar with the notebook ecosystem, so they can very quickly and easily put together dashboards that investigate how their workflows are behaving. What we do internally at Netflix is let our users set up templatized notebooks that are executed by Papermill and hosted as dashboards in Commuter, and both Papermill and Commuter are open source projects. For example, in this screenshot we have a user who is tracking the overall execution time of the different steps of their workflow using Bokeh. There's nothing specific to our internal tooling here: end users can use any open source visualization library, use any widgets they are comfortable with, and customize the notebook for their use case.

So this is what a modern data science stack looks like, nominally, at a very high level.
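As a rough illustration of the templatized-notebook pattern just described, Papermill's Python API executes a template notebook with injected parameters, and the executed copy is what a viewer like Commuter renders. The file names and parameter names below are hypothetical.

```python
# Hypothetical sketch: one executed copy of a parameterized "template"
# notebook per workflow run becomes that run's dashboard.
params = {"flow_name": "TrainFlow", "run_id": "1234"}
output_path = f"dashboards/{params['flow_name']}-{params['run_id']}.ipynb"

if __name__ == "__main__":
    import papermill as pm  # pip install papermill

    pm.execute_notebook(
        "dashboard_template.ipynb",  # template with a tagged "parameters" cell
        output_path,
        parameters=params,
    )
```

The template notebook declares defaults in a cell tagged `parameters`; Papermill injects the overrides at execution time, so the same template serves every run of every flow.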
For the data warehousing concern, Amazon S3 is an excellent choice. AWS Batch gives you access to the latest and greatest instance types that Amazon has to offer for compute, and through Step Functions you can very easily piece together a workflow that executes on top of those Batch instances. Via SageMaker notebooks you can author your workflows in a very easy manner, and none of these layers prevents you from using the latest innovations happening in the open source community, whether that's MXNet, TensorFlow, PyTorch, or scikit-learn.

Now, all of this is well and good; these are the basic building blocks needed for any sort of data science activity. But there are still some things missing from this picture. As a data scientist, you still have to worry about how to actually write your workflow. Do I need to familiarize myself with the Amazon States Language to use Step Functions? Do I need to learn how to bake Docker containers so I can launch my jobs on AWS Batch? Do I need to be familiar with how the data is laid out in my data warehouse to interact with it? There are a bunch of machine learning concerns as well. Experimentation and versioning are big concerns in machine learning because of its iterative nature; you want to make sure that any experiments you run and any changes you make to your model are captured reliably, so you can go back in time and replay them. Reproducibility is yet another big concern, and you also want to be able to reliably train multiples of models and manage them effectively. Which means that, at the end of the day, you also need some solution that binds these layers together, so that as an end user you can focus more on the data science work and less on the engineering work. That is what Metaflow provides.

Metaflow is a Python library that we have written over the last few years. Building it has really helped us understand our data scientists' pain points, and we have taken all of our learnings
through those years and baked them into an opinionated package that lets our data scientists focus on the model development aspects, while all the orchestration, compute, and data movement are taken care of behind the scenes for them.

Now let's look at some of the major features it provides. It allows users to write idiomatic Python code to declare their workflow graphs, which means they are free to use any libraries and any training packages, TensorFlow, PyTorch, scikit-learn, any of those, and they can even write their own training routines. We bundle in a high-throughput data layer that lets them access S3 and essentially saturate the entire network bandwidth available to their instance, so they aren't paying heavy costs just to move data. We provide very simple primitives through which users can declare, "this is my workflow, and these specific jobs need an instance with high memory, or an instance with GPUs attached": they can just drop in a few decorators, and Metaflow takes care of orchestrating the compute, moving their data and code to those instances, performing the compute, and moving the results back to the next instance waiting for the input.

And once they are happy with the model that has been produced, deploying their workflow onto an orchestrator like AWS Step Functions is a one-click task: Metaflow compiles the workflow down into a language that Step Functions understands, so our end users don't have to worry about learning yet another DSL to integrate with any of the orchestrators. Just as importantly, we manage and maintain all the metadata generated by every single execution of these workflows, which lets our users set up notebooks that track how their workflows are behaving, what the characteristics of their models are, and what the distribution of the input data was, using any widgets and any libraries, Bokeh, Plotly, Dash, to build those dashboards. And of course, a model doesn't live in isolation; it needs to integrate with external services. So users can very easily follow a function-as-a-service paradigm to define functions that can be used either for offline batch scoring or hosted as microservices. All of this functionality is also available to users in R as well.
So these are some of the features of Metaflow, as well as some of the architectural and technological choices we have made over the past few years that have really paid strong dividends for us. And now Rob will walk you through how to get started with all of these infrastructure choices in your own AWS account. Over to Rob.

Thank you very much, Savin. Now that we have an understanding of the infrastructure required to build these types of ML systems, let's step through the process of building it together, shall we? Fortunately, it's relatively easy; there are only a few teeny tiny little prerequisites you need to get an infrastructure up and running that matches AWS best practices. First and foremost, we know we're going to need to store artifacts, so let's create an S3 bucket. Second, we need compute, so let's create an AWS Batch compute environment, and don't forget to enable EC2 provisioning mode. Third, we're going to create an AWS Fargate ECS cluster; don't forget to enable inbound access in the security groups for port 80, that's important. Next we're going to create a Postgres database, and your schema is going to need to be, um, hold on, um, okay. All right, so that got away from me a little bit there. Let's go through these one by one: we're going to need an NLB, we're going to need some DynamoDB, okay, um, did you create your task definition yet? There's a Lambda, okay... okay. Congratulations, I guess we're done; your infrastructure is ready.

If your eyes glazed over at the beginning of that last slide, you're not alone. In the world's least surprising plot twist, AWS infrastructure gets
complicated and difficult the more nuanced a solution becomes. But fortunately, AWS and Netflix have done much of the hard work for you. There's an open source CloudFormation template that you can find at github.com/Netflix/metaflow-tools, and it deploys most of the infrastructure you need, from A to Z, to stand up these ML systems and start seeing them work in your environments. An added benefit is that this CloudFormation template is inherently flexible: it covers multiple different configurations, meaning that if your environment needs to do prototyping with Jupyter notebooks, that's an option for you, whereas you can turn it off if you don't. Additionally, if you need things like GPU instances, those options are available to you as you deploy the infrastructure. Secondly, we've set it up so that it's secure: permissions are restricted to the environment deployed by CloudFormation, which means people won't be accessing infrastructure they don't need. And last but certainly not least, thanks to the relationship we have with Netflix, we ensure that we maintain feature parity in the infrastructure as Netflix discovers new ways to enable their data scientists.

Now let's dive a little deeper and explain what this CloudFormation template is going to do for you as you stand it up. First of all, I want you to imagine with me a big, empty, lonely AWS account; there's nothing in it. Then you start with the CloudFormation template we just talked about. The first thing we need to think about is whether or not our data scientists need to prototype. If they do, then we're going to stand up Amazon SageMaker notebooks, and you'll have a flexible Jupyter environment with permissions already set for the rest of the AWS infrastructure. Second, we know we're going to need asset storage for our flows, and we know we're going to need compute infrastructure, so the CloudFormation template will stand those up for you as well. But we're starting to get into a bit of nuance here: in order to maintain security across our environment, AWS Batch will need to be able to pull from S3 safely and securely, so we'll set up all the IAM roles and permissions for that. Speaking of IAM, we'll also create AWS IAM users that only have permission to access the resources configured within this CloudFormation template. Additionally, if we spin up different instances of this CloudFormation template, we'll also get standalone users that can't hop around into environments where they don't need to be.

So now that we have this environment stood up and functional, and we understand that we have our basic infrastructure resources, we have
to think about how to scale these things. So instead of you scheduling and orchestrating these flows from your laptop or your own machine, we'll configure AWS Step Functions for you, along with all the permissions, so that you can schedule pipelines, flows, and orchestration steps and have them triggered automatically. Now, it's a big world for organizations out there, and as Savin mentioned, it's very likely that your pipelines and machine learning systems don't exist in a vacuum; you have the rest of AWS. So the CloudFormation template will configure everything you need from Amazon EventBridge to integrate your pipelines and flows with all the rest of the cool stuff that you are doing. And as if that isn't enough, it's configured so you can easily stitch it together with AWS Service Catalog and expose these services to your end users for deployment as they see fit.

Now, this all assumes you are working in a production environment with your own AWS accounts. But what if you just want to get started and see how it works? Fortunately, Netflix has used a variation of this CloudFormation template to let you get started without any of your own infrastructure. You can go to metaflow.org/sandbox, sign up with your GitHub ID, and get a restricted sandbox environment running on AWS that lets you get started with Metaflow and see whether these types of systems work for your organization.

With that, that's the end of our talk. Feel free to go check out any of the resources, documentation, or open source projects you've heard about in this discussion, and on behalf of Savin and myself, thank you so very much for attending.