Well, good afternoon everyone, and welcome to re:Invent 2024. I hope you had a really awesome day one. It's great to see you, and thanks for coming to this Monday afternoon-evening slot; it's really appreciated. A quick intro to myself: I'm Mike Jukes, a senior manager for startup specialists in EMEA, and I'm part of something called the specialist organization at AWS. The reason I quickly mention this is that you, as AWS customers, actually have access to specialists across different domains and different products, be that databases, networking, containers, or serverless. If you ever want to access specialist support, you can do so just by talking to your account team, so it's a really handy tip to know. I'm honored to be joined here today by the very awesome Miles Bryant from Monzo, who's come all the way from London with me. We're going to hear about Monzo's journey through Kubernetes and their optimization: how they made use of EC2 Spot, Karpenter, and EKS together, and, more importantly, how this has enabled them to scale services seamlessly while optimizing their cloud spend. So, the agenda for today's
session: I'm going to split it into two parts. The first part is all about how we can run compute-efficient workloads on AWS, and I'm going to put a bit of a container spin on it. When we talk about Amazon EC2, EC2 Spot, and Graviton, they are some really great cost-optimization levers you can pull when making use of EKS, and Karpenter just enables that journey even more. The final piece of part one is a general optimization journey with EKS and some really useful levers to pull. Then I'll hand over to Miles, and he's going to talk about Monzo's journey. He'll give a quick introduction to Monzo if you're not familiar with who they are, then he'll talk about the journey they've taken on Kubernetes, going from a self-managed platform over to EKS with Karpenter, and then on to spot. Finally, we'll do a bit of takeaways and learnings from their experience. Cool, so let's get on with part one: EC2, EC2 Spot, and Graviton. AWS actually offers the broadest range of instance types to run virtually any workload; we now have over 800 different instance types. The reason I explain this is that it's super important when we start talking about EC2 Spot and Karpenter, because ultimately, at the end of the day, it's the compute that lies underneath. The reason we have so many different instance types is twofold. First of all, we've got a number of different categories of compute, be that general purpose, burstable, memory intensive, or compute intensive: a different category for different types of workloads. The second area is the capabilities.
So with each instance type you might want to choose a specific capability, be that a certain size, certain networking requirements, a certain memory size, or the CPU processor that goes with it: that could be Intel, it could be AMD, or it could be our AWS Graviton chips. And at the heart of EC2 we have some really cool custom silicon innovation. The first one on that list is Nitro, and if you haven't heard of Nitro, it's almost like the secret sauce to EC2. Effectively, it's a super lightweight hypervisor where we've externalized components such as security, storage, and networking, and it's actually the reason we're able to get more out of AWS hardware and offer such a huge number of instance types. Nitro is really cool; we're not going to talk about it more today, but there are loads of sessions on it at re:Invent this week, so definitely go and have a look if it's of interest. The second part of custom silicon is our AWS Graviton chips, and our chipsets offer the best price performance for cloud workloads; we'll talk a little bit more about Graviton very shortly. Finally, we have a number of purchase options. Everyone will probably know what On-Demand Instances are, but more importantly, when do you use On-Demand Instances for a specific type of workload? They're really useful when you have spiky, stateful workloads: you need to retain the state, and you don't know what the usage level is going to be like. We've then got Savings Plans, where you commit to a certain amount of usage over a period of time; you know exactly what that usage is going to be, and in return you get significant discounts. And finally we've got EC2 Spot Instances. Spot Instances are spare AWS capacity: effectively we are utilizing spare capacity from different places, and you're getting a significant discount on that as well. Spot Instances, however, must be used for spiky, stateless workloads, and we'll talk a bit more about that now. EC2 Spot and Graviton are a really winning combination when it comes down to containers, so let's start with spot. Spot, as I mentioned, is spare capacity, and it can offer up
to a 90% discount; typically, for customers using containers, we see around the 65% to 68% mark. The key thing with spot is that these are interruptible instances, so with that in mind, you have to be prepared to handle an interruption when you're using EC2 Spot. Workloads that use EC2 Spot should be stateless, fault tolerant, and flexible. When we're talking about the land of containers: containers by their nature are immutable. You create them, you destroy them, you replicate them, so they're really perfect for spot, providing the workload is stateless. AWS has loads of spot integration built in, so even though we're going to talk about Amazon EKS with Karpenter today, we also have integration across Amazon ECS and AWS Fargate as well. The final key thing to remember about spot is that you need to diversify across different instance pools. What I mean by that is, if you think of EC2 Spot Instances as spare capacity, the more places I'm grabbing spare capacity from, the better experience I'll have. So by diversification, make use of different instance types, sizes, and families, and even spread across AZs, and you'll utilize the best spare capacity. Then we've got Graviton. Graviton instances and our chipsets offer the best price performance for cloud workloads running on EC2, and you can actually see up to a 40% price-performance improvement. By price performance I mean two things: first, the instance price itself can be competitive and quite different; second, you can get more performance out of Graviton. Graviton runs on the arm64 instruction set, and there's huge ecosystem support
now, so you can get container images from different container registries that typically support arm64 out of the box. If you're making use of ISV software, that's also really great for Graviton, because there's so much support from ISVs now. We're actually on our fourth generation of Graviton, which brought even more performance improvements, and you'll see that we've launched a few new instances this week as well. Finally, my favorite point about Graviton is that you can get up to 60% less energy consumption using Graviton instances versus comparable EC2 instances. So Graviton is a really good mechanism for price performance and sustainable workloads; it's a really winning combination, and if you haven't tried it yet, I'd really advise you to go and have a look at it. Great. So why are we talking about this in a session on Karpenter and EKS? Well, Karpenter actually enables both EC2 Spot and Graviton very easily out of the box. For EC2 Spot, it's got a lot of the best practices built in, such as handling spot interruptions and diversifying across instance types, and for Graviton, it makes it super easy to introduce an arm64-based image and just have Graviton infrastructure provisioned for you. Additionally, you can run spot and Graviton together; they are a winning combination, so absolutely you can use them both. So that's a little bit about EC2, and I love to baseline with that when I talk about optimization journeys on EKS, because it really is something to think about. Let's move on to EKS and Karpenter. Just to baseline everyone's knowledge: when we're talking about scaling in Kubernetes, we talk about two different aspects. The first
one is application scaling: how do I scale my application across Kubernetes? The second is data-plane scaling: how do I scale the infrastructure to handle the requirements of a given application? For application scaling, we've got the Horizontal Pod Autoscaler (HPA), which scales pods horizontally; the Vertical Pod Autoscaler (VPA), which scales pods vertically; and we've even got KEDA, the Kubernetes Event-driven Autoscaler, which basically scales on more advanced metrics. From a data-plane scaling perspective, we've got two different options, and really, when we talk about Cluster Autoscaler and Karpenter, they are effectively doing the same job: scaling out the infrastructure based on the requirements of our application. So let's look at a comparison between the two. First of all, this is a typical Cluster Autoscaler setup: you can see here that I've got three node groups, backed by different autoscaling groups. Now, the challenge we found with Cluster Autoscaler is that you'd effectively have to have a node group per type of workload. For example, if I had a workload that runs well on c6i, I'd put it in that node group; if I had a different workload that runs well on m6g, I'd put it in that one; and for the p4s, I'd put it in this one too. What this actually caused was that we'd be pushing our workloads into certain node groups, or we'd have the additional overhead of creating more and more node groups. So we found that customers were creating huge numbers of node groups, and with them autoscaling groups, and it was becoming
very operationally heavy to manage. That's where Karpenter comes in. Karpenter is really right-sizing the data plane for all workload types: it removes that group element and group structure, and provisions right-sized infrastructure based on the application requirements. So if we talk about what Karpenter actually is, it's absolutely application-first infrastructure: we are provisioning nodes based on pod requirements. The second thing is that it's really good at diversification, across spot and on-demand as well. As I mentioned, it's got the ability to handle spot interruptions out of the box, which is very useful; it can diversify across loads of different instance types; and it can even utilize on-demand when it needs to. We've then got groupless autoscaling: as I mentioned, we simplify the data plane by removing the notion of node groups, which makes it much simpler, but in actuality, what this enables us to do is provision new workloads and have infrastructure provisioned based on the requirements of those workloads. For example, let's say I deploy a new arm64 container image into my cluster. If Graviton wasn't there before, it will be there now, because introducing a Graviton-friendly container image means Karpenter can provision the Graviton infrastructure; it makes it super easy. And then finally, there's a really neat optimization feature in Karpenter called consolidation: Karpenter is consistently optimizing our infrastructure and clusters to right-size and bin-pack. It's a really cool feature. Then, to finish off part one, let's talk through a specific EKS optimization journey.
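The Karpenter behaviors I've just described, that is, groupless provisioning, spot and on-demand diversification, multi-architecture support, and consolidation, are all configured declaratively. As a rough sketch only (this uses the Karpenter v1 NodePool API; the names and values here are illustrative, not a recommendation, and it assumes an EC2NodeClass called "default" already exists), a NodePool enabling those behaviors might look like:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose          # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed to exist
      requirements:
        # Diversify: allow both spot and on-demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Allow both architectures, so arm64 images can land on Graviton
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Keep the instance-type space broad for better spot diversification
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
  disruption:
    # Consolidation: continually bin-pack and replace underutilized nodes
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

With a pool like this, Karpenter can prefer spot when it's available and fall back to on-demand, and individual pods that need a specific architecture or capacity type can still constrain themselves with their own nodeSelector or affinity.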
I had a customer not too long ago say to me: what is the cheapest and quickest way to optimize EKS? Just tell me what it is. The first thing we did was look at the workload, and that is the most important thing to start with: identify your workload and its attributes. In this case, the customer had EKS and was using Cluster Autoscaler, and the workload itself was some applications written in Go and Python that were already fault tolerant and stateless, and those are the two key words. The first step, once you're on your optimization journey, is to baseline your costs first, so you can see the journey you go on and measure those savings. We've then got Karpenter: Karpenter not only enables some of these EC2 Spot and Graviton techniques, it actually handles the optimization of the cluster so well that by simply implementing Karpenter, customers were seeing savings just from doing that. The next step in this particular journey was to introduce spot. We knew the workloads were fault tolerant and stateless, so it was really easy to bring in spot, and that is the key thing: if you are going to implement spot, make sure you do it for the right workload. As I mentioned, it needs to be stateless, flexible, and fault tolerant. Finally, we looked into Graviton, which offered even more price performance on top of that. The only real consideration with Graviton is that the workload must be supported on the arm64 instruction set, which is now super common. The final thing, and this is something to think about throughout the process, is to right-size those pod requirements. Those pod requirements are being used by Karpenter to provision infrastructure, so the more you optimize and get them right, the more it helps across the whole optimization journey. Nice. Cool, so now we're going to move on to Monzo's journey to EKS optimization. Miles, over to you. [Applause] Thank you, Mike. That sets me up really nicely, and I'm excited to
talk about Monzo's journey to EKS optimization. First of all, to introduce myself: I'm Miles, a senior platform engineer at Monzo. I've been at Monzo for about six and a half years, working on our Kubernetes and service mesh infrastructure, and I lead our infrastructure platform team, who look after all things compute and networking. So, a quick poll: who here has actually heard of Monzo? Okay, a few people, nice. I'm going to tell you anyway. Our mission is to make banking easy and painless. We let you spend, save, and manage your money through a single interface. If you don't know us, we have these really brightly coloured hot coral cards, which are quite striking when you see them, and we have about 10 million customers in the UK and a small presence here in the US as well. We basically provide this unified place to manage your financial life. Now, building a digital bank from scratch requires a lot of different features and a lot of different systems. We have all of the usual customer-facing features you'd expect from a bank: checking accounts, interbank transfers, loans, budget management, etc. But this is really only the tip of the iceberg; there's so much stuff under the surface that happens in the background, from financial crime checks to business analytics, security, and staff data. We knew this was always going to be our ambition right from the beginning, so we started off with a microservices architecture and really built that in. We chose microservices because it allows our engineers, across multiple teams, to change and manage different parts of the system independently, and these microservices mean that even nine years later, we're still able to ship changes
safely and at high velocity. Now, microservices of course come with some operational overhead, but a very natural fit for paying down some of that overhead is Kubernetes, because Kubernetes allows us to separate the concerns of actually running the infrastructure and the underlying compute from the concerns of writing your code and shipping it to production. At Monzo, we've been using Kubernetes for about eight years. When we first started using it, there was no cloud provider offering a managed platform, and the only option was to run it ourselves. Now, I've had a lot of fun over the years operating Kubernetes, and I think it can be very interesting and intellectually stimulating, but it does take a lot of time and effort. So we decided to offload this to AWS by using EKS as a managed Kubernetes service, and we recently completed a full migration from our own self-hosted cluster to EKS. So let's talk about that journey, the journey to EKS and optimizing our cloud cost. We go back all the way to 2016, and at this point Monzo is really just a small startup with a handful of engineers; we weren't even called Monzo at that point. As I said, we spun up our own Kubernetes cluster, just running on AWS. To talk a little about our architecture early on: we had around 150 stateless Go microservices, and over the next few years, as we added more products and features, this would grow to over a thousand. Our overall architecture was pretty straightforward at this point: a single production cluster in a single region, with one autoscaling group for our Kubernetes worker compute, just a single instance type, and about 200 or 300 nodes. At this point, our system was very static. Our applications and services were just manually scaled whenever we needed to; we'd just update the CPU and memory. That meant our cluster was also very static, and the scaling process was pretty simple: whenever we needed more compute, someone would log into the AWS console, click a button, and update the autoscaling group size. And you know what, this was absolutely fine for us at our relatively small scale. We weren't wasting much money on spare CPU cycles, and actually we saved a
lot of time from not having to set up, debug, and get comfortable with an autoscaler, and we could spend that time on other things. So let's talk about the next stage of our journey. By 2018-2019, we needed to move on. Monzo at this time was a very exciting place: first of all, it's when I joined, but more importantly, we'd just completed a very challenging migration from our prepaid card product to checking accounts, getting a full UK banking licence, and we were having an incredible year of growth. We went from adding 60,000 new customers a month to over 200,000, just an incredible rate of growth, and we also had a crowdfunding round where we raised 20 million from our customers in just two days. During this whole time, we were constantly shipping new things: we launched overdrafts, loans, business banking, and even our first US product. I'm telling you this because more customers and more products meant more microservices being written to serve them, and more microservices and more incoming load on the platform meant we were having to scale more and more. And that scaling meant more time someone had to spend in the AWS console clicking a button to update the autoscaling group, and more money we were wasting on CPU just sitting around doing nothing. That meant it was time to move on, so we brought in autoscaling. We focused on the application layer to begin with, starting with vertical pod autoscaling to right-size CPU and memory; once we were happy with that, we added horizontal pod autoscaling, which would allow our services to scale out under load. And of course, by the time we'd added these, our services were now scaling up and down a lot, dynamically, so we added the Kubernetes Cluster Autoscaler to handle scaling our nodes up and down. By this point we also had a good understanding of our baseline usage, so we committed some compute spend and bought some Savings Plans to optimize cost. Let's revisit our architecture: at this point we had grown to over 1,000 microservices, and this growth showed no sign of slowing down anytime soon.
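As a rough illustration of that application-layer scaling (this is not Monzo's actual configuration; the service name and numbers are made up), a Horizontal Pod Autoscaler targeting CPU utilization looks something like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-example          # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-example
  minReplicas: 3                 # floor for the overnight baseline
  maxReplicas: 30                # ceiling for daytime peaks
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out when average CPU passes 70%
```

VPA then takes care of right-sizing each pod's CPU and memory requests, and once replica counts move up and down dynamically like this, a node autoscaler is needed to grow and shrink the fleet underneath.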
But our underlying AWS compute architecture wasn't too different, still pretty straightforward. We'd added a couple of new autoscaling groups to handle different workload requirements, such as Prometheus monitoring, which required a lot more memory. So by 2022, with autoscaling, we'd built some solid foundations for optimizing our cloud cost, and it was time to take advantage of this and go even further. It was at this point that we really felt our self-hosted cluster was creaking: doing upgrades was really painful, and we obviously had to fix any issues with the self-hosted cluster ourselves. We realized we probably had better uses for our time than managing Kubernetes ourselves, and the natural choice for us, already being on AWS, was EKS. We started migrating around 18 months ago, setting up a new, separate EKS cluster running in parallel, and over those 18 months we migrated our services across gradually, one by one, using traffic shifting in our service mesh to move services over with zero downtime. We've now fully completed this migration and have torn down our old cluster. Along with EKS, we also decided to introduce Karpenter as an alternative to Cluster Autoscaler. Karpenter enabled a number of things for us. We really wanted to move to a world where our compute scaling was driven by the needs of our applications, not constrained to operating just a handful of instance sizes, and without a lot of operational overhead. We also wanted to optimize cost further by introducing EC2 Spot, which I'll talk about more in a moment. Now, of course, you can do all of these things with Cluster Autoscaler, you can set up multiple node pools, but
it's a little bit more challenging there, a little bit more operational overhead. To introduce Karpenter, we started very simple: we just ported over the Cluster Autoscaler autoscaling groups to Karpenter NodePools, and from there we began to iterate and add complexity to meet our needs. Since we deployed Karpenter onto our new EKS cluster while Cluster Autoscaler was running on the old cluster, as we migrated services between the two clusters we'd also migrate them over to Karpenter, and we were able to gradually test out how it worked. So here's what our EKS architecture looks like with Karpenter: you can see that we now have a variety of different instance types and sizes, and we don't need any autoscaling groups to manage them; Karpenter can manage all of these directly by provisioning EC2 instances. With Karpenter in place, we now had the ability to optimize our cost even further by moving to EC2 Spot. Monzo's compute requirements are very dynamic and driven by user behavior: maybe someone is using our app, or just paying for something in a shop with their Monzo card. And as our customers are currently mainly based in the UK, our load actually follows a pretty predictable daily pattern. Savings Plans aren't a particularly good fit for this type of load, because either you under-commit, or you over-commit your Savings Plan and have that overnight baseline where you're just paying for compute you're not actually using. This is where spot really came in for us, because it meant we could have the flexibility to still handle the daily peaks while we cost optimize. And because we use stateless Go microservices that can shut down at a moment's notice, we were already particularly well set up for spot. But even though we were confident that spot would work for us, we still needed to roll it out gradually and get comfortable using it. Now, every service at Monzo is classified by how critical it is, all the way from tier zero, the most critical services that underpin the really critical functions of the bank, like payments or customer support, down to tier three, the least important services
that might be some small back-office service that no one will really notice if it starts failing, at least not for a while. We made use of this tiering by writing a small Kubernetes admission controller, which would modify pods and inject a pod topology spread based on the service tier and our configuration. What this pod topology spread did was basically give Karpenter a hint on where it should provision and schedule these pods, and this gave us the ability to set a percentage per service tier for how many of those pods we wanted to run on spot versus on-demand. That then gave us the ability to roll this out gradually, tier by tier, starting from the least important service tier up to the most critical. Let's have a look at our architecture again with spot introduced; it's really not much different. What I really want to call out here is that Karpenter manages every aspect of provisioning spot for us: it will automatically handle spot interruption messages and replace nodes as they get interrupted; it will even find cheaper spot options and automatically consolidate and migrate workloads; and if we can't provision spot for some reason, it will even make use of on-demand, so we're always covered. So I want to show you our journey over time and how it's been going for us. This graph shows the last part of our journey; it's a breakdown of our compute spend by category. All the way back in July 2023, on the left, you can see we were fully on Savings Plans, with about 25% on-demand usage.
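To make the tier-based split described a moment ago a bit more concrete: Karpenter labels its nodes with karpenter.sh/capacity-type, so a constraint over that label can steer how a service's pods spread across spot and on-demand. This is only a simplified sketch, since Monzo's admission controller and its percentage logic aren't public, and a plain topology spread gives an even-ish spread rather than an exact percentage:

```yaml
# Fragment of a pod spec as it might look after an admission controller
# has injected a spread constraint (service name is hypothetical)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: karpenter.sh/capacity-type   # values: "spot" / "on-demand"
    whenUnsatisfiable: ScheduleAnyway         # a hint, not a hard requirement
    labelSelector:
      matchLabels:
        app: service-example
```

Varying maxSkew per tier, or injecting a hard nodeSelector for the lowest tiers, is one way a controller could tune the spot/on-demand ratio for each tier.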
Over time, we gradually increased spot and rolled it out, until we get to January 2024, when one of our Savings Plans rolls off, and you can see that we almost entirely absorbed that with spot usage. Even more recently, you can see where we finished the EKS migration, and we're actually now 99% covered by Savings Plans and spot. So what have we learned from this journey? Karpenter has been great overall for us, but there have been some minor considerations we've needed to think about. We found sometimes it could be very slow to scale down, and to fix this we had to investigate and optimize our pod topology spreads, and also find PodDisruptionBudgets that were blocking services from being evicted from nodes. Karpenter is also under very active development, which is great because we get lots of new features quickly, but it means we've been through a couple of major Karpenter upgrades. We didn't have any issues with these; we found them very well documented and quite painless, but it still required time and work on our behalf to understand how the upgrade would work and execute on it safely.
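On the PodDisruptionBudget point above: a budget that allows no disruptions at all (maxUnavailable: 0, or minAvailable equal to the replica count) stops Karpenter from draining a node, which is exactly how scale-down gets blocked. A sketch of a budget that still permits gradual eviction (the service name is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: service-example
spec:
  maxUnavailable: 1            # allow one pod at a time to be evicted
  selector:
    matchLabels:
      app: service-example
```

Auditing for budgets that can never be satisfied is worth doing before turning consolidation on.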
We also found it challenging sometimes to work out why Karpenter would make certain decisions about pod scheduling, and we had to really learn how to debug Karpenter: for example, why a particular pod might not be able to be scheduled on a node, or a node not being able to be provisioned due to a provisioning error. We've also had some considerations with spot. We need to make sure we've configured enough instance types so that we are diversified and de-risked, and to help us with this, we also needed to invest in monitoring our spot placement score and our spot interruption rate, just to make sure we're minimizing the risk of large-scale spot interruptions, which might cause us some system disruption. But the most interesting challenge to me was actually the perception of our platform users and other engineers, thinking: what if my service is terminated by spot? We were introducing this new compute class where the node might get interrupted at any point, and this was actually more of an educational problem than a technical one. We solved this by pointing out the cost benefits of spot and how it helped with our cloud cost optimization, but also that it's actually really useful for us to be able to continually test our service resiliency by having service pods terminated often; we should always be able to handle that. But the key argument we made was that our service pods are getting terminated all of the time anyway. Our system is dynamic and constantly scaling up and down, and Karpenter is doing a lot of work to consolidate nodes, finding empty or half-empty nodes that it can migrate workloads from. We found from our data that services are actually 20 times more likely to be terminated due to a Karpenter consolidation than a spot interruption. So what's next on our journey? We want to investigate and further optimize our application right-sizing even more, and look at vertically scaling our services: where we might have services with a high number of really small replicas that we're horizontally scaling, we want to make those fewer, larger replicas. We also want to look at Graviton. Graviton would bring us better
price performance, and it would also give us the ability to diversify our spot instance types and workloads even more. It would also be more sustainable, as Mike mentioned earlier, which is better for everyone, and given our services are written in Go, it should be relatively straightforward for us to support. So what are my own takeaways from this journey? [Applause] I think the most important thing, firstly, is to observe your current costs and establish a baseline. From there, you can decide what you actually want to achieve from your cost optimization journey, and then you can continually measure against your targets. I think it's important to call out that cost optimization work is always a trade-off. It's very easy to ask the question "how much money should we save?" and for the answer to be "as much as possible", but really, there are often hidden costs to doing these optimizations. For example, you might have to spend a lot of engineering time implementing these solutions, with the opportunity cost of those engineers not being able to work on other projects, or you might introduce a lot of system complexity from adding all of these new components to optimize costs and automation. So it's really important to think about what you want to achieve and the trade-offs you're willing to take to get there. And finally, when you have decided what changes you want to make, it's always a good idea to build your confidence gradually, by doing gradual migrations and de-risking the changes. Finally, by adopting spot and Karpenter, we've saved an additional 15% on our compute costs, we're continuing to scale seamlessly and optimally, and we're looking forward to saving even more in the future. That's all from me; I'm going to hand back over to
Mike. [Applause] Great, stay there. Awesome. So thank you so much, Miles, for sharing that journey; it was really awesome. I really wanted to put a real customer story behind how Karpenter, spot, and, in the future, Graviton can add to optimization. So, just going back to my earlier slide, I wanted to flash this up again to make everyone aware and thinking about when to pull the optimization levers. It's actually a very similar journey to Monzo's, and I promise it was a completely different customer. Really think about your workload first; that is the number one thing to think about. Baseline your costs, so you can measure what the savings are going to bring. Look into Karpenter, because it optimizes in itself, but also enables you to bring in Graviton and spot. Use spot if it's suitable, and I can't stress that enough: the workload has to be flexible, fault tolerant, and stateless. Then even make use of Graviton to get those price performance and sustainability benefits too. And throughout the process, we should always be right-sizing our pod requirements. So, super exciting to see where you're going on your journey next. Just to finish off very quickly: we're actually running a hands-on Karpenter workshop on Thursday at the Wynn, so if you haven't already signed up for that, it's a way to get really hands-on: implementing Karpenter, implementing spot, and looking at multi-architecture with Graviton. It's a really good workshop to do. And in actual fact, we've got a really special live demo arcade machine sitting on this floor, at the very end next to the escalators. It takes the retro video game Space Invaders and calls it Spot Invaders, and it's a real live demo: when you shoot aliens, it's actually destroying pods in the cluster, showing that you can be fault tolerant with spot and Karpenter. It's a great demo, go and check it out. And that is all from us, so have a great rest of re:Invent and enjoy your evening. Thank you so much, thank you.