Welcome to Real-World Examples of Optimizing .NET Performance. I'm Stefán Jökull Sigurðarson, an Icelandic principal software engineer at Lucinity in Reykjavík, also a Microsoft MVP and a RabbitMQ contributor and maintainer, and there's a link to my blog if you want to check it out. I'm going to go over a few topics here today, and a lot of them have to do with memory allocations and the garbage collector in .NET: value versus reference types, that is, heap versus stack memory; techniques for memory pooling, stack allocations, boxing, serialization and endianness; channels, which I wish more people knew about; and then some benchmarking and profiling tools. Let's see if we can get over all of this. So, starting at the beginning with memory allocations: .NET is a managed runtime, as you know, and what that means is that when you start up a
.NET process, it allocates a contiguous heap of memory for each process and it manages it for you; it takes care of allocating and cleaning up the memory that you use, and this process is maintained by the garbage collector in .NET. There's a link there to the fundamentals of garbage collection. To optimize performance, the garbage collector splits the heap into three generations, called zero, one and two, and that's to help it manage the different lifetimes of the objects that you're allocating. It also has a separate heap for large objects, that is, objects bigger than 85,000 bytes, which is called the large object heap; I'll get to that in a bit. The garbage collector basically runs three phases for every generation that it collects: it marks the objects that are still alive in that generation, relocates them if it has to, and then it compacts the heap for that generation, and this is done to make space for other objects that you might be allocating. Now, the generations are, like I said, zero, one and two. Generation zero is where all your new objects get allocated when you create them, reference types that is, and if generation zero is full, the garbage collector has to run a collection on that generation, releasing objects so it can make space for the new objects that you're allocating. Objects that survive get promoted up to gen 1; that's sort of how it keeps track of short-lived objects versus long-lived objects, and it can compact that generation. Gen 1 is very similar: if there is not enough memory freed up in generation zero, it does the same thing in gen 1 that it did in generation zero.
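To make the generational behavior concrete, here's a small snippet you can run to watch an object get promoted between generations. The exact numbers printed can vary with the GC mode and runtime version, so treat the output as indicative rather than guaranteed:

```csharp
using System;

// A fresh reference-type allocation; small new objects start in generation 0.
var data = new byte[1_000];
Console.WriteLine(GC.GetGeneration(data));

// Force a collection; the object is still referenced, so it survives
// and gets promoted to a higher generation.
GC.Collect();
Console.WriteLine(GC.GetGeneration(data));

// Surviving another collection moves long-lived objects toward gen 2.
GC.Collect();
Console.WriteLine(GC.GetGeneration(data));

// How many gen-2 (full) collections have run in this process so far.
Console.WriteLine(GC.CollectionCount(2));
GC.KeepAlive(data);
```

`GC.GetGeneration` and `GC.CollectionCount` are handy for quick experiments like this, though for real measurements you'd use a profiler, as discussed later in the talk.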
Survivors get promoted to gen 2, which is the final one, and that's pretty much where all long-lived objects end up in the end. Similar to generation zero, if there still isn't enough space, it's going to run a gen 2 collection. A gen 2 collection is a bit different: objects there are usually just objects that live for a long time, that is, they've survived through two collections, generation zero and one. Collecting a generation, as you see, means collecting all the generations before it as well, so a generation 2 collection is called a full garbage collection, and by definition it's the most expensive one. It's also the only collection that actually collects things from the large object heap, but the large object heap, due to those mostly being big objects, is not compacted. So generation 2 is expensive to run. Quickly over reference versus value types: reference types are the types that you normally new up, or that are objects; examples of those are strings, objects, classes and arrays. Value types do not get allocated in heap memory, so they don't get cleaned up, because they don't have to; those are ints, bytes, characters, bools for example, and structs, which look a lot like small classes. Now, the difference between those, for example when you're calling methods, is that reference types get passed by reference: when you're sending a reference type as a parameter to a method, the value of it is not copied, you're just sending a pointer to that reference type. Value types, however, get copied. So if you create an object and send it to a method, and that method modifies the object, it's going to change the original object; however, if you send an integer into a method and you, for example, increment it, the original integer is not going to get modified, because it was copied before it was sent to the method. Knowing all this, we can at least get a good sense of how we can increase performance.
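The copy semantics just described can be shown in a few lines; the helper names here are mine, not from the talk:

```csharp
using System;
using System.Collections.Generic;

// A reference type (List<int>) is passed as a pointer to the object,
// so the method mutates the caller's object.
static void AddItem(List<int> list) => list.Add(42);

// A value type (int) is copied on the way in,
// so the method mutates only its own local copy.
static void Increment(int value) => value++;

var numbers = new List<int>();
AddItem(numbers);
Console.WriteLine(numbers.Count);   // 1 — the original list was modified

int counter = 0;
Increment(counter);
Console.WriteLine(counter);         // 0 — the original int was copied, not modified
```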
The best thing, of course, is to not allocate at all, because then the garbage collector doesn't have to do anything, but that's unrealistic; we're always going to have to allocate something. So a good aim for us is to try to keep the objects that we have short-lived, and to keep them small, to avoid them ending up on the large object heap, because, as I said, that only gets cleaned up as part of a gen 2 collection. To show you a little example of this: if we want to convert a string to UTF-8 bytes, this is something that a lot of us have probably done at some point in time, but this is where an allocation takes place, because we need to take that string, create a byte array for it, and convert it into that byte array. We can actually avoid this allocation, and that's where we get to memory pooling. There's a class called ArrayPool; it's good because it actually keeps buckets of arrays for different types, which is why it's generic, and you can rent arrays from it, work with them, and once you're done working with them, you can return them. Now, there's a gotcha here: you have to remember to return them, because if you use up all the arrays in the pool, it's going to have to start creating new ones, but it's not going to clean the old ones up, because they're still being referenced by the pool; so if you don't return them, you're going to end up with a memory leak. A good way to handle that is to use a try/finally; I'm actually not doing that in this example, just to keep it short, but this is how it works: you rent bytes from the ArrayPool, and you can use that array, for example, to turn a string into UTF-8 encoded bytes, and then that GetBytes call is not going to allocate any memory, because the memory has already been allocated as part of the pool, so you can reuse it again and again.
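Here's a sketch of that rent/use/return pattern, including the try/finally the talk mentions but leaves out of its slide:

```csharp
using System;
using System.Buffers;
using System.Text;

string text = "hello, world";

// Rent a buffer at least as big as the worst-case UTF-8 size of the string.
// The pool may hand back a bigger array than requested; that's fine.
byte[] buffer = ArrayPool<byte>.Shared.Rent(Encoding.UTF8.GetMaxByteCount(text.Length));
try
{
    // GetBytes writes into the rented array, so no new array is allocated here.
    int written = Encoding.UTF8.GetBytes(text, 0, text.Length, buffer, 0);
    Console.WriteLine(written);   // 12 — bytes used for this ASCII string
}
finally
{
    // Always return the array; un-returned arrays are never collected
    // (the pool references them) and force the pool to allocate fresh ones.
    ArrayPool<byte>.Shared.Return(buffer);
}
```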
There is also a helper you can use called MemoryPool. It's very similar to ArrayPool, and actually uses an ArrayPool underneath, but it helps take care of the renting and returning by implementing IDisposable, so you can just use a using declaration for the memory you rented, and it'll get returned when you're done using it. Another one of those things you can do is stack allocations. They're a bit different, because they are not allocating objects on the heap, so they're not creating objects that need to be cleaned up; by definition they are often a lot more performant, a lot faster. But there are caveats, there are limitations. You don't have an endless amount of stack space; I think we've all seen what happens if you have an infinite recursion: you end up with a StackOverflowException because you've run out of stack memory, and that's why you cannot use an endless amount of it. There are other limitations as well: you cannot use stack allocation in async methods, because the compiler will generate all sorts of state machines and method jumps in between, so there's no way for it to keep track of the stack-allocated memory. And you have a limited memory space: you should never assume that you have more than just a few kilobytes to use. Also, when you're stack-allocating memory, it is best to allocate a constant size, because that allows the compiler to do a lot of optimizations to make the allocations faster. There is one hint here which I didn't put on the slide: you should never use a stack allocation in a loop, for obvious reasons, because you will of course eventually end up with a StackOverflowException if you do. Now, there is a trick you can use to combine stack allocations and the ArrayPool, where you pretty much decide at runtime which one you're going to use. In this case we decided, okay, we're going to have a limit of one kilobyte; we create a nullable array variable, which is the pooled array, and it will only get populated if we need to go to the ArrayPool because we can't use the stack allocation.
Then we figure out how many bytes we're going to need: if it's under the stack-allocation limit, we use the stack allocation, otherwise we use the ArrayPool, and at the end we check whether we actually rented an array, and return it if we have to. Now, to put this into perspective and show you what happens with the UTF-8 example, if we run a benchmark on this for three different sizes — a string of 64 characters, 512 characters and 1024 characters — for short strings you will actually see that directly allocating, just using the plain method, is faster than using a MemoryPool. The reason is that the MemoryPool, like I said, uses IDisposable, so there's overhead involved: it has to have a try/finally around the IDisposable, and it has to call the Dispose method. So for short strings, MemoryPool is actually more expensive, but as the strings get larger, you can see that the standard method gets slower, and in all of these cases stack allocations are the fastest. If you look at the column that says Allocated, you can see how much memory is being allocated in each case: for example, when I have a string of length 64, I'm allocating 88 bytes of memory, so there are an extra 24 bytes being allocated every time, and I'll get to why that is later. You also see that those same 24 bytes are allocated in the MemoryPool case.
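The stackalloc-or-ArrayPool trick just described might look something like this; this is a sketch under my own names, not code from the talk:

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper: UTF-8 encode using the stack for small strings,
// falling back to the ArrayPool for big ones, and return the byte count.
static int Utf8Length(string text)
{
    const int StackLimit = 1024;   // never assume much stack space is available
    int maxBytes = Encoding.UTF8.GetMaxByteCount(text.Length);

    byte[]? pooled = null;         // only populated on the pool path
    Span<byte> destination = maxBytes <= StackLimit
        ? stackalloc byte[StackLimit]                              // constant size helps the compiler
        : (Span<byte>)(pooled = ArrayPool<byte>.Shared.Rent(maxBytes));
    try
    {
        return Encoding.UTF8.GetBytes(text, destination);
    }
    finally
    {
        if (pooled is not null)    // only return what we actually rented
            ArrayPool<byte>.Shared.Return(pooled);
    }
}

Console.WriteLine(Utf8Length("short string — stack path"));
Console.WriteLine(Utf8Length(new string('x', 100_000)));   // pool path
```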
And that brings us to boxing. When you have a value type that you need to store as something else — maybe it implements an interface — the runtime can't store it directly; it has to put it inside an object and store that somewhere. So even if you have a value type, if you treat it as an interface you're going to be allocating memory, because it needs to be put into that object. This happens, for example, when you wrap a value in an object: let's say you have a list of IComparable values. If you put an integer into the list, you're going to allocate memory, because it needs to treat that value type as an interface, and interfaces are actually objects, so memory needs to be allocated on the heap. Every time you need to box a value type, the memory allocated on 32-bit machines is at least 16 bytes, depending on the size of the value type being boxed, and it's at least 24 bytes on a 64-bit machine. That's the 24 bytes I was showing you earlier: because our MemoryPool was used as an IDisposable, it needed to be wrapped in an interface, and that's why that method allocated 24 bytes even though we were using pooled memory. And this is the reason: if we take an integer and box it, this is the object that actually gets created. In the memory layout you get an object header, whose size depends on whether you have a 32-bit or 64-bit machine; it needs a method table pointer; then it needs the actual value; and because it needs to align objects to specific architecture boundaries, it has eight bytes of padding as well. So, in the example here with boxing: let's say I create a serialization interface, and to minimize allocations I create a struct to implement that interface. All that interface does is take a span of bytes, and the implementation is supposed to write the object into that span. Another method, called Serialize, takes any object, or struct, or whatever it is that implements the interface, and serializes it to bytes. Again, to minimize allocations I created a struct, and I'm using stack-allocated bytes as the destination, so although there are no allocations there and I just call the Serialize method, because we're treating the struct as an interface, this is going to allocate memory regardless: on my machine, which is 64-bit, it's always going to allocate at least 24 bytes. But there is actually a sneaky way that we can get around this.
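Before the trick, the boxing cost itself is easy to see in code form:

```csharp
using System;

// Boxing: storing a value type as an interface (or as object) copies it
// into a new heap object.
int number = 42;

IComparable boxed = number;    // the int is copied into a heap allocation here
object alsoBoxed = number;     // same thing when treating it as object

// Unboxing copies the value back out of the heap object.
int unboxed = (int)alsoBoxed;
Console.WriteLine(unboxed);    // 42

// Each boxing operation creates an independent heap object.
Console.WriteLine(ReferenceEquals(boxed, alsoBoxed));   // False — two separate boxes
```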
We can use generics to trick the compiler: I say my Serialize method is going to be generic this time, but I'm going to constrain it to taking types that implement this interface. So when I now call this method, the compiler is basically going to create an implementation of this method using the struct directly instead of the interface, so no boxing will take place. But there are two other things that happen as well. In the previous version, where we were actually using the interface, it also needed to do what's called a virtual method call: because it's receiving an interface, it doesn't know what the object coming in is, so it can't do any optimizations. Well, it technically can if you have very few types implementing the interface, but usually you should assume that it doesn't. And that brings us to something called inlining: if we have a small method that doesn't do a lot, then the compiler will sometimes, instead of making your code call that method, simply take the contents of the method and put them in place of the method call. It saves a method call; the generated code becomes a little bit bigger, but it's faster, because there's no overhead in having to copy parameters and everything else. So this actually gives us three benefits: we get rid of the boxing, we get rid of the virtual call on the interface, and we allow the compiler to start inlining things — in this case, the implementation of my serializer was just taking an integer and converting it to bytes, because an integer is just four bytes in memory. The difference in performance here is actually quite dramatic: using the interface call, pretty much every call takes six nanoseconds, but using the generic version it goes down to 0.4.
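A sketch of the two shapes being compared — the interface names and the serializer are mine, modeled on the talk's description, not its actual code:

```csharp
using System;

Span<byte> buffer = stackalloc byte[4];
int written = Serializer.SerializeFast(new IntSerializer { Value = 7 }, buffer);
Console.WriteLine(written);   // 4

// Hypothetical serialization interface: write yourself into a span of bytes.
interface IByteSerializable
{
    int Serialize(Span<byte> destination);
}

// A struct implementation, to keep allocations down.
struct IntSerializer : IByteSerializable
{
    public int Value;
    public int Serialize(Span<byte> destination)
    {
        BitConverter.TryWriteBytes(destination, Value);   // an int is just four bytes
        return sizeof(int);
    }
}

static class Serializer
{
    // Taking the interface boxes the struct on every call (~24 bytes on 64-bit),
    // and forces a virtual call that the JIT can't easily inline.
    public static int SerializeBoxing(IByteSerializable s, Span<byte> dest)
        => s.Serialize(dest);

    // The generic constraint lets the JIT specialize this method for the struct:
    // no boxing, no virtual call, and the call can be inlined.
    public static int SerializeFast<T>(T s, Span<byte> dest) where T : IByteSerializable
        => s.Serialize(dest);
}
```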
That's actually 93% faster than the other one. Now, just to be clear, this is a micro-optimization: you might have a lot of code that you would change to do this and not notice anything in your application, but if your application is doing a lot of things with either interfaces or struct types, you should see a difference. And speaking of serialization, that brings us to endianness. Endianness is something that depends on the architecture of the machine you're running on: machines can be big-endian or little-endian, and when it comes to networking, a lot of libraries and protocols say that you should be sending things in network byte order, and network byte order is just big-endian. To show you what I mean by endianness: if I take an integer whose value converts to the hex AA BB CC DD, on a big-endian machine this is stored in that order in memory, AA BB CC DD, but on a little-endian machine, which is probably most of your machines, it's actually stored in memory as DD CC BB AA. So if you sent this integer over a network that expects it in network order, you would actually be reversing the integer without knowing it. This is something you need to be aware of, for example, if you're writing cross-platform serialization code, or sending stuff over the network according to different protocols. There are of course ways to get around this: we can detect the endianness and just reverse the bytes if we need to, and this is what the RabbitMQ .NET client used to do.
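The byte-order difference is easy to see with BinaryPrimitives, which also previews the fix discussed next:

```csharp
using System;
using System.Buffers.Binary;

// The same integer laid out in big-endian (network byte order) vs little-endian.
uint value = 0xAABBCCDD;
Span<byte> big = stackalloc byte[4];
Span<byte> little = stackalloc byte[4];

BinaryPrimitives.WriteUInt32BigEndian(big, value);
BinaryPrimitives.WriteUInt32LittleEndian(little, value);

Console.WriteLine(Convert.ToHexString(big));      // AABBCCDD — most significant byte first
Console.WriteLine(Convert.ToHexString(little));   // DDCCBBAA — reversed, as on most desktop CPUs

// Reading back with the matching endianness restores the value on any machine.
Console.WriteLine(BinaryPrimitives.ReadUInt32BigEndian(big) == value);   // True
```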
This is not very pretty code: if you've used bitwise operations, like ANDing, ORing, or shifting things, you might recognize what it's doing. It's basically taking the bytes in that long and reversing them, but it's trying to do it all without allocating anything, so there are no temporary variables being used that require allocation. Of course, for us to be able to do this, we need to read the value as an unsigned integer, so we're sure all the bytes end up in the correct place. However, there is an easier way to do this, because .NET has a class called BinaryPrimitives which does the same thing: to read eight bytes, we can simply say "read this as big-endian" with ReadUInt64BigEndian, and there's also WriteUInt64BigEndian, plus read and write operations for most other value types. To visualize this, if you look at the actual assembly code that gets generated for these two methods, and I'll highlight the important part: the old method is on the left, and those are all the CPU instructions that needed to run just to make sure that we were returning the bytes in the correct order. Using the BinaryPrimitives class, it boils down to one instruction, because it turns out our CPUs actually have a single instruction that does exactly this: take this memory, reverse it, store it somewhere else. And again, like with the other benchmark we saw, the difference is quite dramatic: the new method is actually 95% faster, and I think we can all agree on which one is more readable as well. So that brings me to something called channels; this is one of those things that I wish more people knew about. If you've ever written, for example, a concurrent application that uses multiple threads, and you need to pass things back and forth, you might have used a lot of different constructs, like queues, and you might have used locks around them to make sure that
only one thing is pulling stuff from the queue while another is putting stuff in there. This is what channels do for you: they're made for this producer/consumer pattern, they have optimizations for different scenarios — do I have multiple writers, a single writer, a single reader, multiple readers, whatever — and they have both async and synchronous interfaces, because as soon as you start doing something like this asynchronously, you have to do a lot of complex things. For example, you might have used a queue and a lock, or a ConcurrentQueue and semaphores, and then you start running into TaskCompletionSources if you want it to be async, and whatnot. Channels just make this easy, because they take care of all of this for you out of the box. Channel comes with two implementations. There's an unbounded one, which basically says "I can put as many items in the channel as I want"; that's good if you want to do things quickly, but you have to be really sure that you can process things faster than they come in, because otherwise you're going to run out of memory eventually. For those cases you have the bounded channel: a bounded channel can set a maximum number of items that can stay in the channel, so it's great for buffering things up. You can say, "I want to create a channel that takes 100 items", and if it's full, the producer is going to have to wait until the reader has taken stuff out before it can put new stuff in. If we look at a simplified example of this: let's say we have the process method at the top, which processes an integer; just to simulate work, we randomly sleep for up to 100 milliseconds, asynchronously of course, and then we write to the console whether the number is odd or even. Then we create another asynchronous method which processes the channel, and you can see that it takes a ChannelReader as a parameter; it can asynchronously foreach through it, so for every item that
comes in, asynchronously, it's going to call the process method. If there are no items to read, this method is not going to block and take up one of our threads; it's just going to yield the thread back so it can go do something else. Then we create a bounded channel of 10 items, and we start the consumer. At that point it's not going to do anything, because there's nothing in the channel. Then we just do a simple for loop to put a thousand items into the channel, asynchronously of course, because we may have to wait for the reader to pick them up, since we can only keep a maximum of 10 items in there at a time, and we write to the console when we've actually written something. Once we're done writing, we complete the channel, and we have to complete the channel to let the reader know that no more items are going to come through — otherwise the reader would just run forever — so this is to signal to it, "hey, you're done". Then we wait for the consumer to finish. If we run this quickly, this is what it does: it just runs two asynchronous tasks, one writing, the other reading and telling you whether each number is odd or even. But we of course have multi-core machines, so we might want to do this in parallel to take advantage of them, and when we're using channels, we can just focus on that part of the code; we don't have to worry about whether we have multiple readers or multiple writers, channels take care of that for us. So all we've got to do, really, is change the processing method, the one that actually reads from the channel, and this is what I actually like about this example, because doing it in parallel is actually simpler than not doing it, since we have asynchronous parallel helpers for this. To give you an idea of the difference: sequentially it looks like this — I think it's even still running — while doing it in parallel becomes something like this, and it's done, taking full advantage of the cores that we have on our machine.
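The bounded producer/consumer example described above can be sketched roughly like this (method names are mine; the talk's version also slept randomly to simulate work):

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// A bounded channel as a producer/consumer buffer:
// writers wait when 10 items are already queued.
var channel = Channel.CreateBounded<int>(10);

async Task ConsumeAsync(ChannelReader<int> reader)
{
    // Completes when the channel is completed and drained;
    // it does not block a thread while waiting for items.
    await foreach (int item in reader.ReadAllAsync())
        Console.WriteLine($"{item} is {(item % 2 == 0 ? "even" : "odd")}");
}

Task consumer = ConsumeAsync(channel.Reader);

for (int i = 0; i < 1000; i++)
    await channel.Writer.WriteAsync(i);   // waits asynchronously if the buffer is full

channel.Writer.Complete();                // signal: no more items are coming
await consumer;
```

For the parallel version, the reader side can be swapped for something like `Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), ...)` (available since .NET 6), and, as the talk notes, the producer side doesn't change at all.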
And that's without really having to focus on how we were producing or consuming things, just focusing on the actual thing reading from the channel. So, we've gone over a few things that we can use in these cases. Now, this talk is about how these things were applied in the real world, and I actually applied them to the RabbitMQ .NET client.
RabbitMQ, for those who don't know, is a message queue, and it uses the AMQP protocol. To give you a very quick overview of how it works: connections have channels — these are not the same channels I was just showing you, this is the RabbitMQ term "channel", used for multiplexing, so you can be using the connection for multiple things at a time. Channels issue commands: create a queue, declare a queue, delete it, publish a message, whatever. Commands, when they eventually get sent to the server, are serialized into frames. Frames are really just byte arrays; they consist of a header, a payload and an end marker. As a very simple explanation of the layout, a frame looks like this: one byte for the type, two bytes for the channel, four bytes for the payload length, then the payload, which differs depending on what is actually being sent — if I'm creating a queue, I need to send the queue name and other things; if I'm publishing a message, I might be sending headers, etc. — and then there's an end marker telling you that this is the end of the frame. The payload layout is what actually gets put into the frame payload: two bytes for the class ID (the class can be queue or exchange, for example), then the method ID, which is create, delete, publish, things like that, and then the arguments that go with it, like the queue name. So we can see that we have things we need to serialize and deserialize and work with. Now, what the RabbitMQ client did before we made these changes, when it was sending a command, was this. Step one: it created a MemoryStream. It wrapped that MemoryStream in a NetworkBinaryWriter, which was basically a helper class that did all the serialization, like big-endian serialization, among other things. It wrote the command to the NetworkBinaryWriter, which put it on the MemoryStream, and then it grabbed the resulting byte array out of the MemoryStream, because now the command had been serialized into bytes.
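That frame layout can be sketched as a small writer; the field values and helper name here are illustrative, not the real protocol constants or client code:

```csharp
using System;
using System.Buffers.Binary;

// AMQP-style frame layout as described above:
// 1 byte type | 2 bytes channel | 4 bytes payload length | payload | 1 byte end marker.
static int WriteFrame(Span<byte> dest, byte type, ushort channel, ReadOnlySpan<byte> payload)
{
    dest[0] = type;
    BinaryPrimitives.WriteUInt16BigEndian(dest.Slice(1), channel);   // network byte order
    BinaryPrimitives.WriteUInt32BigEndian(dest.Slice(3), (uint)payload.Length);
    payload.CopyTo(dest.Slice(7));
    dest[7 + payload.Length] = 0xCE;   // illustrative end-of-frame marker
    return 8 + payload.Length;         // total bytes written
}

byte[] buffer = new byte[64];
int written = WriteFrame(buffer, type: 1, channel: 5, payload: new byte[] { 1, 2, 3 });
Console.WriteLine(written);   // 11 — 8 bytes of framing plus 3 bytes of payload
```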
It sent those bytes to the actual network stream, and it needed to take a lock, of course, because we can't have multiple threads trying to write bytes to the network at the same time; it wrote the bytes and then had to release the lock. Now, there are multiple problems here, related to what we were just going over. Always creating a MemoryStream is expensive, because it's allocating a new byte array underneath every time. Now, you can tell the MemoryStream to use an existing byte array if you have one, and you can initialize it to a certain size if you want, but in the case of commands we have no idea how big or small they're going to be, so depending on how big the command is, the MemoryStream might have to resize the array, and that means allocating a new array and copying everything over to make more space — and it might have to do this multiple times if the command was really big, for example if you're publishing a large message. Another problem was that the helper class, the NetworkBinaryWriter, always assumed it was running on a little-endian architecture; usually a pretty safe assumption to make, but if it were run on other architectures, this would fail. There was a lot of custom serialization logic, for example the method I showed you doing all the bit shifting, ANDing, ORing and whatnot. Doing a lot of that is not good for the CPU, because you're basically generating a lot of assembly code, and it doesn't inline well because it gets complex, which is not good for the compiler either. And doing all the locking on the connection slows things down as well: if we were doing a lot of things in parallel, you would get a lot of lock contention — "nope, you have to wait until I release the lock before you can write this frame, because someone else is writing another frame, or a big frame". Every frame meant take a lock, write it, release it, so it added up quickly
in cases where you had high concurrency. Also, every frame implementation knew how to serialize itself and was a class, and as I showed you earlier, classes are reference types, so even if we were just creating a simple command that created a queue, we needed to start allocating bytes, or extra bytes at least. So we needed to figure out how to solve these problems, and we thought: okay, instead of writing to a conventional MemoryStream, let's just implement a frame interface that knows how to calculate its size, so we can know beforehand how many bytes we're going to need, and the interface also requires that the frame knows how to write itself to those bytes. We changed the serialization and deserialization to use BinaryPrimitives, as I showed you before, so it's now properly cross-platform code, and it was a lot easier to read as well. Then we decided to use channels as a buffer between the connection and the outgoing frames: instead of always taking a lock on the connection and starting to write, we just put the frame on a channel. So after the changes, the way we actually send commands to the RabbitMQ server is: we calculate the command size, because we have an interface to do that; we rent an array that is big enough for us to write into; again through the interface, we know how to write to that array; we send the array to the channel writer; and there's just a single reader picking everything up from the channel, writing it to the network stream, and once it's done, sending the array back to the ArrayPool. So in this case we're hoping to not allocate anything at all. To validate that these changes were actually doing what they were supposed to do, we needed to profile and benchmark them, so I created a simple RabbitMQ test application. What it did was open a connection to the server, create a channel and a subscriber, send 50,000 messages with a fairly big payload, wait until it receives them all back, and then quit.
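A rough sketch of that reworked send path, pulling together the pieces from earlier in the talk — the interface and type names here are illustrative, not the real client's API:

```csharp
using System;
using System.Buffers;
using System.Threading.Channels;
using System.Threading.Tasks;

var outgoing = Channel.CreateUnbounded<(byte[] Buffer, int Length)>();

// Producer side: size it, rent a buffer, write the frame, hand it to the channel.
// The generic constraint avoids boxing the struct frame, as discussed earlier.
static void Send<T>(ChannelWriter<(byte[] Buffer, int Length)> writer, T frame)
    where T : IOutgoingFrame
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(frame.GetRequiredBufferSize());
    int length = frame.WriteTo(buffer);
    writer.TryWrite((buffer, length));   // unbounded channel: TryWrite always succeeds
}

// Single consumer: write to the socket (simulated here) and return the buffer.
async Task PumpAsync(ChannelReader<(byte[] Buffer, int Length)> reader)
{
    await foreach (var (buffer, length) in reader.ReadAllAsync())
    {
        Console.WriteLine($"wrote {length} bytes");   // stand-in for the network stream
        ArrayPool<byte>.Shared.Return(buffer);        // back to the pool: no per-frame garbage
    }
}

Task pump = PumpAsync(outgoing.Reader);
Send(outgoing.Writer, new PingFrame());
outgoing.Writer.Complete();
await pump;   // prints "wrote 4 bytes"

// A frame knows its size and how to write itself into a buffer.
interface IOutgoingFrame
{
    int GetRequiredBufferSize();
    int WriteTo(Span<byte> destination);
}

struct PingFrame : IOutgoingFrame   // a struct, so creating one allocates nothing
{
    public int GetRequiredBufferSize() => 4;
    public int WriteTo(Span<byte> destination)
    {
        destination[0] = 0x01;
        destination[1] = 0x00;
        destination[2] = 0x00;
        destination[3] = 0xCE;   // illustrative end marker
        return 4;
    }
}
```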
Doing the math on this, we were sending less than two gigabytes of data over the network, give or take. Before we made all these changes, this is what the RabbitMQ client did: it allocated 7.14 gigabytes of memory to send those 50,000 messages. So to send less than two gigabytes of messages, we were allocating more than seven gigabytes. Those are not allocations that were all live at the same time — the client didn't hold seven gigabytes of memory, it was constantly allocating it and the garbage collector was cleaning it up — but you can see that it's allocating a bunch of byte arrays, it's allocating a bunch of MemoryStreams, and it's creating, and the garbage collector is having to collect, around four million objects. So this is not very optimal. After we made the changes I just showed you, it went down to this: 99 megabytes in total allocated, because we can simply reuse all our byte arrays again and again and again. We also reduced the number of objects allocated from over 4 million down to 1.8 million.
And this was just by making the changes I described; there was room for a lot more improvement, and it has actually been improved since — using this same benchmark, I think we're currently down to around 30 or 40 megabytes allocated. So this is a way you can apply things like this in the real world, with actual real-world benefits. But as with everything else, you can't just go all out and start optimizing everything without actually knowing what you're supposed to be optimizing, so we're going to need some tools; let's go over some of them. If you're using Visual Studio, it has a built-in profiling tool, which is really nice and has been improved a lot lately; it's become a lot faster than it used to be, and like I said, it's built into Visual Studio. If you're not using Visual Studio, JetBrains has memory profiling tools as well (dotMemory). You can see, for example, that it shows you visually how much memory is in each generation — generation zero, one, two, or the large object heap for that matter — and it shows you the difference in what happens when collections actually run. It also has a really nice snapshotting feature, so you can take snapshots while your application is running and compare them; you can see, for example, the difference in how many strings have been allocated between the snapshots, how many of them are getting cleaned up, or how many of them are still left, and you can use that, for example, to track down memory leaks. Another tool JetBrains has is dotTrace. This is a CPU profiler, so instead of profiling your memory usage, you can use it to profile your application and see where it's actually spending all its CPU time. If you have a multi-threaded application, you can also see all the
context switches going on, see which tasks are running and which threads are actually doing anything. It can also help you spot, for example, threads that are blocking or waiting for something when they should be yielding to other threads, so it's very good for looking at thread utilization as well. I forgot to mention that the Visual Studio tools actually have both a CPU and a memory profiler built in, so that's pretty handy. If you really want to get down to something nitty-gritty, you can use PerfView. It does not have the best user interface, and I would not recommend you start with this tool — use something simpler to begin with — but it is really, really powerful. It can show you a lot of things if you know what you're looking for: you can do CPU profiling, memory allocations, and you can even start looking at things like when you're jumping from managed code into native code. For example, if you have a managed application that's opening a lot of files, it can show you when it goes out to the operating system to open a file, and it can show you, say, Defender jumping in there trying to scan the file before it's opened, and everything else. So it's really, really good, but it takes some time to get used to. Now, benchmarks: the benchmark results I showed you were created with BenchmarkDotNet.
BenchmarkDotNet is really easy to use and really powerful. It's great for these micro-benchmarks; it's not very good at benchmarking an entire application that does a lot of things — it's best for benchmarking small pieces of code. It gives you results in multiple formats: you can get a text format, a markdown format, it can give you reports, as we saw, of the memory allocations, and it can even give you reports of the assembly code that gets generated, so you can start seeing the difference in code generation, and a lot of other very cool things. You can also, for example, say "I have a benchmark, I want you to run it on these different versions of .NET and compare the results", so you can see the performance improvements between .NET Framework versions, or .NET Core versions for that matter.
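A minimal BenchmarkDotNet sketch in the spirit of the endianness comparison earlier; this assumes the BenchmarkDotNet NuGet package is installed, and the benchmark bodies are my own stand-ins, not the talk's exact code:

```csharp
using System;
using System.Buffers.Binary;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]   // adds the Allocated column shown in the talk's result tables
public class EndianBenchmarks
{
    private readonly byte[] _data = { 1, 2, 3, 4, 5, 6, 7, 8 };

    [Benchmark(Baseline = true)]
    public ulong ManualReverse()
    {
        // Read in host (little-endian) order, then flip the bytes by hand.
        ulong v = BitConverter.ToUInt64(_data, 0);
        return BinaryPrimitives.ReverseEndianness(v);
    }

    [Benchmark]
    public ulong BinaryPrimitivesRead() =>
        // The single-instruction path the talk shows in the assembly listing.
        BinaryPrimitives.ReadUInt64BigEndian(_data);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<EndianBenchmarks>();
}
```

Running it prints a table with Mean, Ratio and Allocated columns, which is exactly the shape of the results shown earlier in the talk.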
net corporations for that matter another tool that I'm really fond of is shortlab.