So of course there's probably very little in this life that I can do conventionally so I'm going to take a moment and um provide a little bit of background and an intro forj and Patrick because in some sense as undergraduates they've been thrown into the lion's den um do you mind actually just flipping to that timeline side so the work that they are or they engaged in this last semester really dates back so actually about 2004 so almost two decades ago a set of ideas while I was at the University of Toronto where we started
conjecturing that statistical inference mechanisms could detect and find emergent patterns that arose in a system at a very low level that would permit or suggest the construction of a hybrid type of computer system that married biologically motivated statistical inference mechanisms which Have a a promise of scalability and low power implementation to add to a computer system the kind of things that we can't get from which is the ability to find inherent short cuts and patterns in how it's behaving and exploit them automatically so the Hope was that if we convert the problem of building a
computer into one where you preserve the deterministic model as the programming model but you offload its performance into figuring out how to make statistical inference scale and be power efficient okay so these are pretty heav lofty ideas but since that period of time we've had some very interesting fun Milestones including apparently the fact that um somebody saw fit to give me tenure which despite me uh and I got a career uh NSF career award around some of these ideas so a lot of the work that they're going to talk about today is US finally having
an opportunity to have the machine learning Technologies catch up to what we were hoping for so we could try some of some of our conjectures out okay so but along that way we built systems that tried to exploit some of these ideas to automatically scale uh computation without ever looking at the code okay that was some work uh published at Asos in 2014 in 2016 we actually built a neuronet accelerator Ator attached it to a computer and saw that our model actually gave rise to a real interface where a computer's execution at a very low
level could automatically be hooked up to a neural net and actually produce feedback that allowed some of these shortcuts now aggressive out there research I believe happens because we are a community of supportive people encouraging risk and going out there and saying crazy stupid things so these poor undergraduates are now under the risk of dealing with my crazy so with that in mind before even they get to an end they deserve some Applause for surviving this all thank you Professor ababu for that introduction um so first so we have this background of the ideas behind
our our work and so now we'll get into what exactly we did because these are pretty headyy ideas there there's a lot of spot behind them but it's important to be able to concretize and actually test out some of the theories and hypotheses that we have and that gets into what we did this semester um so our hypothesis it says on the screen we use low-level binary representation in the form of what we call State vectors through a programs exec ution we believe that it can be exploited to improve the performance of that computer using
neural networks so as in previous works but in a more General sense we will take all the memory of a computer all the line them up as a state vector and then try and detect patterns and changes so that if we see a recurring pattern or some other kind of thing that say a neural network or other machine learning techniques can exploit we can use those to improve execution and to learn more about the underlying processes and machine itself um specifically this semester we worked with the micr program running on the 6502 micro processor and
We hope to bypass deterministic computation by offloading ex execution to neural networks so ning the idea of this state Vector computer is deterministic it goes through Specific Instructions that carry it to differentes in a vector space and taking that and using statistical inference and Merl networks to create a model predicting not necessarily exactly where it will go but idea what and why is it a good application to start with it's a good application because we'll get into it in a second but actually on the next slide but starting with the 6502 and micr chest the
6502 is a processor from the 1970s that was used in the Apple 2 and the kim1 and several other Home computers so it's a it's a very small computer compared to today's modern standards it only has 6 4 kilobytes total of memory it's a processor um so in that sense it's it's pretty easy to or oh well I shouldn't say pretty easy it's a lot easier to deal with that than any megabytes of of Ram or total memory um and you're able to view the whole machine the whole deterministic process but it's still large enough
where it's a general process this computer can execute a lot of different things um so we picked the micr chest Program it's the first uh program that built for the 6502 as as a kind of game um it was built by someone named Peter Jennings who then sold it through mail order distribution it has kind of a fun history I'd recommend looking it up after um but we did this using uh simulator from Canada by the way that's true um so we we have a custom 6502 simulator that goes through and it's a c program
that runs on uh on an x86 architecture but it you can go through and execute each individual ual instruction of the process and it's able to create traces and create those State vectors that we were talking about where we're able to take a snapshot collect all of the memory all of the registers put them together and get that state Vector um so again the Y um it's much smaller than the modern computer but it's still a general purpose computer so in the image here you can see um a few images of what it actually looks
like to play micr chest which we'll do in a little bit but this is an example of what a trace would look like where each black and white pixel is either a zero or a one for a bit and you can see how it changes over time it's a stochastic process and the goal would be to go from Somewhere up here to somewhere down here to skip execution using our our statistical techniques um how is this research related to the previous works in the past with ask and with Dana it used something called online learning
where a neural network trains as a process is executed and it uses one example each time to improve and to go through like gradients assign and reset the weights but this time we're using batch processing we already have the program executed we already have all of the traces and as a result we can do more it gives us more freedom and flexibility in terms of the networks that we could try we can try different architectures of our neural network we can try different parameters we can see what works and what doesn't and instead of going
along with the program as it's running or with the the system as it's running we can take a step back and we can see what works what patterns are we observing and have more of a data driven approach as well where we collect everything that's associated with our process and make sure that we're doing the right thing and what we're saying we're doing and that that's matching up and also that we can test more options however we're going to do some Quick math and so we have a custom piece of software in the 6502 as
simulator and we also have quite a lot of data that we're going to be dealing with because for each state of the 6502 it's 64 kilobytes of memory Plus 8 bytes on average it executes about 14 million instructions per board State and in just our experiment that we ran this semester it was 358 board States but we have many tournaments that we're hoping to go through so when we produce a trace of this data it it leads to quite a lot um that's a lot of data and so the question becomes how are we able
to use this soft Ware but also how are we able to deal with this massive amount of data and that's where theoc and nerd come in so we have this problem we have a lot of data we have a lot of custom firmware and we have a lot of things to do with it so how do we bring this problem into the physical this is our basic framework that we set up we have our two main computers the ja CPU and the J GPU and as you can see we have a lot of storage with
them as well so that's what we keep for large amounts of data um along with it we have our VMS we're able to run our custom uh software that we have to be able to run run the 6502 and all of our experiments and uh being able to hop into them from anywhere obviously with SSH and being able to work not only with our group That we did our research on the semester but with sanay as well a he's able to work on the Jupiter environment which are able to um pop up on these computers
and be able to work there as well as being able to see statistics of our computers outside of the computer making sure we have the space and seeing all of what's going on in the inor but the actual process of collecting our data we start with the PGN notation these are tournament games of uh chess notation that you can get offline online um and there we have many many games of many tournaments and using these games we're able to extract multiple games from from these tournaments and put them into a state that's easier to work
with first we start moving it to a fend notation which is foresight Edward notation a common notation and using these boards we're able to uh put them into a state where we're able to use uh the micr state notation and run them through our 692 to be able to get both a start state of a board and an NCM board out to Play after we've collected these boards we're able to um maybe just to clarify so the board State we get from online that's what we call the start State and then we're going to let
the Entire Computer including the the program make a move so we're letting the program evaluate and go 14 million sets of instructions on average it executes for it to produce its response to that board state so that's the so this is not neur own networks at all right now this is just sort of gathering some evidence of these kind of movies about what this program really did okay so that's what they mean when they say start end States it's the start state was given the end state was produced by this computer right and it's a
lot of data because you're you're getting literally every full Fidelity step Trace every bit that yeah so we have uh while we run We Gather every every state Of the machine as it goes through however the particular study that we're going to focus on uses a subset of it but that's just because it's a starting point yeah so when it comes we took one tournament we took 358 start and state pairs so the start state is some situation present in a game um this is a tournament from Hamburg in 1868 so back several decades ago
um someone was in this position and was faced with what move should I do and then it turn then we don't evaluate what that person did or what the ideal chess move would be we evaluate what exactly would the micr chess uh CPU program as in as created by Peter Jennings how would it evaluate that and what move would that program make as a player M no right no I mean on an 8bit computer back in 1970 something it's incredibly competive compared to deep blue not a chance compared to the average I don't know I
don't I can't even play chess so what do I know say it's surprisingly good um but there's also a number of issues outside of that as far as the micr chest program itself when we demonstrate it later we can show that like you can move any piece anywhere on the board um it doesn't really have like oh you can't do this move this is not a rule like you you can do anything you want so that's Kind of a danger which is why we we chose preet up boards and let the computer play through it
so that we know that what's like intuition behind the logic on that program effectively like um what it does is it goes through and it evaluates each potential it's either like a bunch of candidate moves and it evaluates it in a chest process where you check for checks and like see should I move out of a check then you check for captures like what pieces should I take and then you look at strategic position and so it goes and then based on I'm not sure the exact like how it evaluates but it some kind of
rating and then BAS Bas on that it picks the move with the highest point rating and selects that so it does have some kind of strategic intuition is that possibility also like select some distribution over that or uh yeah because it's it's programmed to play certain openings and certain moves so for example if we were we're hoping that like when we study beginning moves that it will be able to see that like it always happens to play these three openings and you can start to detect like you can build a model for where it is
without actually even looking at any of the source code or the program um so when we run these experiments we have these this start State we have this end State we get it into a form where we Can input it into a neural network because 65,000 bits is a is a lot of input neurons so we convert it to only the bits that can possibly change we use that as the input layer and we use the same for the end State as the output layer of the neural network we use tools that we buil we
call it I tool where we can input a series of parameters that we want to sweep and we do a grid search where we go over different numbers of eatbox different learning rates different functions for like a threshold value and other things like that and we produce a bunch of neural networks with these configurations then we go to our test set and we test it on these neural networks we see how it performs where it makes the errors and hopefully in the future we'll be able to start interpreting why um here are some of the
results from our experiments these are what our files look like it produces a lot of files and a lot of data and it's all reproducible we we spent a lot of time building tools to make sure that we can reproduce this data we've done a lot of testing to make sure that like this is stuff that even though it it's quite a lot we can like if we hand this to someone else they'll be able to reproduce like what we did this semester um so back to our our platform how what are the actual tools
that we Use like we have this nice idea of oh we're going to take this these board States for micr chest and we're going to put them into a neural network somehow and then we're going to train the neural network and then test it but how do we actually do that we had we built a lot of software underneath it using our virtual machines using the data storage that we have um so the first step in our process is how to even SSH into these virtual machines and so we we just type SSH but there's
a bunch of things in a repository called ner tools that we use in order to be able to use open stack and get into these virtual computers um this is an example of one of the scripts that we use but they help in the sense that these are virtual machines and so similarly to if you just set up an instance of AWS or something like that if you leave the virtual machine running it will keep running and so you need to be you need to have a way to call the machine where if you're inactive
for a certain amount of time or if you've used it for like however many hours you want to be able to cut it off to save your research budget and so we have tools that if for some reason we leave our computer and forget to exit out of the show or if it keeps running through a process and it gets stuck in some kind of loop that we're not aware of it kills the virtual machine before completely destroying the budget Um so then with this we also have um like our software how do we actually
play micr chest on the 6502 so for this case um we have our CPU here and we have so can I jump in for a sec on that just for the S thing so we uh one of the the things is that at least right now I think we're still charging for gpus only when the machine is powered on unfortunately what we found out is that so so this is a great so this the SSH will if it's powered off will go and power back on that VM before automatically for you right which is um
I think a very useful tool for anything that's that's statically acquiring that that has his charging for things especially gpus which are ridiculously expensive to so let me just restake just so that everybody caught what you were saying so um this isn't a a built-in feature but we just hacked around so we we we figured out that some of the interfaces that were available meant that we could make it look like it was an ond demand SSH so SSH to this arbitrary name it's not an IP address but it's a name that will automatically get
resolved using the the open stack tools to an IP Address if that doesn't exist if it's not ready it will trigger open stack to bring our VM up and then the other stuff that we're running inside the VM will if we've got science experiments running uh it will keep the machine running if it detects however that we're all off of it and there's no experiments running it will considered idle and it will pull the machine and shut it back down so that that way Sanjay I CJ Patrick We're unaware of this our workflow doesn't change
but um we we also have a Max timeout limit you know in case we up because I don't have infinite budget uh the other thing that we really wanted so those of you from the Nerf plan we really wanted access to our budget so we would have actually stopped the machine if we were exceeding a budget we couldn't figure out any interfaces for that yeah so that's just very practical running data science like experiments in the real world we have to worry about money so we we think like oh we're going to be exploring these
neural networks going to be diving into the depths of micr and we Spent the first couple weeks like working with open stack and troubleshooting and because the price of that GPU machine is expensive on the other hand as we got to the end of the semester we really appreciated it that GPU we have access to has um what do I care a fuckload of memory um and it really allows us to explore very uh complex uh neural network configurations but it's expensive true so back to micr chess um so we have our 6502 our simulator
we have the micr program and so we're able to run it in a way where like you hit C to clear it when you hit P this is when the computer goes through its process and evaluates mooves so the computer is the the group on top it's the white pieces and so when I hit p in my keyboard it's going to go through a series of instructions where it figures out like oh these are all the possible moves I can make as my first move and this is the move that I should do and it
chooses the most common move opening move in chest as a result um as a human you can also play against it you select the piece like for example I select the pawn on Square 63 it uses octal notation instead of traditional Chest notation because it's easier for it to store move it to square 43 and so then you can move that pond and then you can play against the computer again the computer chooses like move the KN out here you can also switch sides which is helpful for us when like we don't know we don't
we don't have to know who's playing first whether it's black or white we can automatically put the computer on the top and they're the ones that um and then once you type Q it provides a bunch of statistics about everything that happened as far as what the values of the registers are but importantly you can also see how many instructions were executed um how many EX how many were executed per second and a bunch of other data you can all there's also a bunch of other flags that you can add to the 6502 that allow
you to find the trace dump it add it to a file and then work with it like be able to do so in your use case it's just send a board you do the work collect the trace put the output it's not like a full game it's like one yeah it's yeah we take each board State and then that's that's our game schol yeah so then we go through collecting data so we talked about the process of going from a PGN into a state Vector itself and so the PGN file is just a list of
moves it's not very helpful in telling you how the board Looks unless you're able to visualize a chest board in your head so what we do is we convert those to Fen files foresight Edward notation where we show on each of the eight lines of the board this is where the pieces are um this is how it's set up whether the the square is empty whether it contains this fond or anything like that and then we are able to take that board State and then put it into the initial State Vector for the micr program
and then run that and once we run it through the 6502 that is our input state that we give it and the output state that it comes out with as well as all of the traces in between so let me just point out something that these guys had to do right so got a chest board right but that thing on the far right there that UC board that's a binary Vector right so they went in they reverse engineered where in the program the bites are that represent the board they figured out what that meant and
then they converted made a tool that just generates a bit Vector that they can slam into a state of the program and then automatically now the program's sitting at that board state so that's already this idea that we're converting the world into just binary operations on States of register in Memory and then you only store the vectors that are possibly like change in the system right because there are some that will never change uh so no so the the initial State that's put into the computer is the complete State they get an entire end State
and actually every state in between but the study they do is that they and they'll talk about this in a second they're going to go and analyze of all the games they've ever played which bits in the locations did ever change and then for the for the sake of scalability for the moment we only work with those and just to another point about one them through this semester we spent a lot of time working um in a Bottoms Up kind of framework where we built a lot of low-level tools we built a tool that takes
a PGN file and turns it into a f file and we built a whole bunch of these tools to the point where now we can just run one script and it executes all these scripts below it which allows us in future semesters we can build on top of this work and it allow it like data collection can become a lot faster running actual experiments as we'll see in a bit can be a lot faster and so like a lot of the the going was slow but we're hoping that this will accelerate like future developments as
Well going off the question that was just asked this is how we ended up taking the the excitable bits or in this case we took the start State and an end State and using an exor we were able to just find the specific bits in that start and end State play that were changed throughout the process um and then going through all of the games that we had that's when we order all these State vectors that we had of only the bits that were changed into one large or um s Vector which is our mask
that we used to mask out on all the games the bids that we were actually sending through the neural network just out of curiosity did you ever check like in the test cases like did you ever find out in the test Cas that these were the only bits or like was there like a anomaly or something where some other bit also changed like in the test cases so this this was for both our training and our testing St for for like our our total data set basically um we did look into it because we have
a tool called we call it SV info what it does is it takes one of these State vectors and it tells you which bits are one and where they're located so then we can use that and figure out like wow there's a whole bunch of bits between locations X and Y that that's probably the board that changed or like oh the all these in the End this is determining whether black or white is the one to move and then you can evaluate it from that way too as far as like oh we we created this
process but like just a heris check is this actually working we can go and look and see like does this make sense that this changed I'm just GNA clarify something for pro because just to make sure that you're at it so the The Mask yeah that says these locations change was evaluated for all games not just the training set not just the test set okay and it I'm sure they'll tell you the number but it's about 690 bits locations in the 64k vector that are actually changing from a start to an end when you consider
all starts and ends this is totally lostness right there can't be a bit that ever change that you don't accumulate in M yeah and then we can run it on this is just specific to our data set right now this is not specific to all board positions of all time put on micr chest but the goal is that in the future when we start working with different data larger data sets that we able to figure out and keep adding you just say that of 64 kilobytes You had only 600 some bits that were changed 690
odd bits ever actually change from start to end so they might have changed as intermediate results like it's locations right these locations changed from start to end mhm if a bit you'd have to go through every state if you wanted to find out what bits ever change because right now got set back to its original we wouldn't be aware of so these are only the start and end states that we're choosing not all of the 14 million States in between where all the registers I was wrong yeahor but I'm gonna guess that they kind of
like are containing to clear regions right that start to be so those are the area write memory so there's so that's all the interesting stuff right what what how would we discover these what do they mean but for the moment we got to start something so then we also put these tools because at first when we started working we did not use um some more conventional neural network tools we used something called fan fast artificial neural networks which is running C and we use specifically float Fan and so we had to build a few Tools
in order to convert these bits to floats um so that we only have 690 float values that are going used as inputs and outputs to our Nal Network so that's what some of these tools are here um validating our tools with tests It's always important to test test test and make sure that your tools are doing what they say they do even when you're changing and in the middle of an experiment to rerun these um is always helpful um and then the process of actually running an experiment so this is the so the first part
is the collecting the data and this is goes to the second part where we have our data set we want to take out uh like we have our layers of our neural network we have what the input neurons will be we have what the output neurons will be which will be these bits that have that could possibly have changed and we go through Force phases where we set up whichever neural network configuration we're working with whether it's either fan or P torch we then go through and train on whatever percentage of the data we're choosing
or whichever data set we want at that time um we then test on a unique unque data set that is separate from our training data set and then go through a results process where we create a a Jupiter notebook that calculates some basic metrics as well as includes other numbers that we can then work with further in that notebook and each Experiment that we run produces its own unique configuration of the neural network and its own notebook that you can go and run so if we run this command once it could produce say like 30
neural networks with all unique configurations and then we can go in and figure out which ones work better which ones didn't and why so these are some screenshots from from the results notebook we tracked local errors which are in this file um in this specific board it moves from start to end State and it was supposed to predict that these 128 bits changed but it actually predicted however many bits and then we divide that by the total and so you can see like just get a relative idea of which ones were more accurate that's also
why it's above one is because there are 690 bits total that change but sometimes only like 110 bits change in in this between the start and the end state but our neural network predicted that 1 would change and so dividing 150 by 110 you get a number that's one things like that we also graphed the total errors which is like the global errors of all of our vectors and the training loss function uh for every 10 box fact to read without the eight the eight plots I can't see the title trial trial I see so
yeah so these Are for each individual trial um which is the configuration of a neural network okay and also that that one's on the left those local areas that are above one that was all when we were using fan which is a very old tool which is very brittle and we're not even sure we used it right but soon as we switched over to using more modern tools High torch and so on we actually eliminate B the neural networks actually get very good very fast yeah this is one of the results from one of the
specific experiments that we did so a specific neural network configuration in this case you can see in the top right it we started with the 690 changing bits starting in the end state but in between we had three hidden layers um double the size of the the starting and the end stage that gives the neural network a lot of flexibility to try to track uh what's going on in between the process of when it's predicting um but basically we were able to see out of 73 results for this experiment you were able to see that
16 of them only eight bits were actually predicted incorrectly so when it went from the start and the end State it's prediction was almost spot on maybe off by one bite um but we but out of one bite in 690 bits that are changing and 6,000 um total bytes in a state Vector That's not a lot of error that we see for this one I missed it how do you divide the the training and test data set so we we randomly select in in this specific experiment um we went through we had seven different games
each game had a number of board States and so we randomly assigned those board states to a training and a testing set what was the the percentage it was 8020 split so for training and 20 for testing yeah yeah that's a pretty typical yeah yeah so that's how we miss but like just taking a step back right you're trying to replace 14 million instructions with a couple major multiplies and sigmoid or something right like that's the idea and you have eight bits that are wrong it seems like you're honest that's like the big picture eight
of them were predicted totally accurately so you you could have just run this neural network and you would have gotten the exact state that you would have run if You' run 14 many instructions and it's sort of ridiculous that it should have done this well with such little data and it's actually also nonsense we can get that would it reduce your data set to use some kind of hyper parameter optimization sure actually matter of fact all kinds of things could be done at this point a lot of this work was to set us up to
start doing this kind of Study because there's actually the architecture of the network doing things that are much more interesting like actually using we all know as computer systems people the fundamental unit is a bite restricting ourselves to only look at the bits that changes the input that seems ridiculous right there are all kinds of interesting things now for us to start exploring Eric yeah yeah so I was going to say I ask what is the what is the comparison between the computational cost of this neural net to the 14 million instruction you skipped and
what do you expect that to be once you have your Hardware accelerators Etc okay so the second part yeah first part it's right I mean this is a huge GPU sucking up more power than that CPU for sure despite it being my shitty C code that's running the 6502 the idea here is not so much this particular technology right now we're just trying to study the baby steps sure is it even viable with no semantics to take a binary vector representation of a Computer and infer that there is opportunity for acceleration now the second part
of what you said speaks directly to my original motivation if you don't look at the crappy pardon all the machine learning people here the horrible things we as computer scientists have done to do machine learning you actually look at there's two parts of this community there are those who have been studying it from a biological perspective and those who have been simply making sure that Google can predict what we're about to type next using a lot of computers and a lot of energy those who have studied the neurobiology suggests there are both scalable and power
efficient implementations if we're willing to go radically different architecture I I've seen hugely efficient implementations once you have an architecture set right I I watched the PHD thesis presentation at MIT about taking quing models and turning them into having the the the actual architecture in Hardware you fed the model to it and it was now it kept up with real time and was as accurate as it was running on GPU for hundredth of the cost right just most people have per milon but you're right that's so what Eric is saying is while everybody today For
those of us who are sort of not in the field we hear about machine learning it has to do with the Matrix math that's being deployed on gpus because that seems like the way we could get this to work but the there are other branches of machine learning that study if you knew the architecture like biology you could grow or build a custom circuit that is radically more efficient as a matter of fact a lot of the most early interesting stuff on what is called neuromorphic Computing demonstrated Nano or Pico jewels to do an inference
not gws of GPU please not that I have an opinion yeah there has been one particular thesis Strat yeah so what this means as far as future deployment uh or future development so future directions first prod we talked about a lot of them already but um applying them on different programs on this program we can apply them to more games we can achieve larger scale we can start to look at the full trace and pick different spots instead of just the start and the end States and we can Start to learn the actual representation of
what what is this computer doing how is it able to calculate these moves um we can do them with more and less complex programs and see how our neural networks Fair um we can also use like different architectures for our neural networks and continue to improve on that front we can also start looking at new systems uh new systems in the sense that they're more modern such as a risk5 architecture s86 um goals and aspirations for the future is to find a model of execution somewhere as well as achieving architecture Independence because we our hypothesis
is that we can find some model of a computer outside of the specific instruction set and everything to it that we can abstract away um and the goal is to continue building a concrete foundation for future data DP just want highlight this last point because there have been plenty of my graduate students and folks who have C their heads against this one of the big contributions these guys did is bring this into a world where we can now produce Jupiter notebooks massive amounts of data we are not the experts here we don't claim to be
the experts on Machine learning but now we can actually bring people into the fold to work on this problem and they don't have to know that they're working on speeding up computers they can do what they do well and we can work with them in a really kind of seamless way now of course I don't know if sanj's on the call so I can say whatever I want maybe he would say no that I could never work with you guys but oh perfect so then it is the best collaborative experience that he has ever had
working with assist I have you tried the last closing the loop so taking the infert vector putting it back into the machine State and so um we so this piece of work right now a lot of what we focused on was setting things up to do studies so the past work has been all about could we get any form of online real acceleration and we exactly take an inference we feed it back back in we take that inference we feat it back in and we can actually play forwards and backwards but we're also very careful
in that past work this is ridiculous right we're saying the beginning of a move to the end of a move In past work where we what we made the prediction problem was much more engineered and precise we basically said hey look you stand back we're watching a reoccurring pattern let's make that thing the thing the prediction problem problem is because now that might be a a a front end of a loop it might be the start of an interrupt right whatever the case be in the LA when we were doing this they were computationally bound
work we purposely chose Loop heads and then we said ,000 Loop iterations forward that's your uh uh task here we just wanted to say hey look we take something extreme an i driven interactive program with a complex algorithm behind it and just say semantically the problem is can you replace the entire algorithm which is roughly ridiculous right but we can ask the question concretely we can study it and now we can start exploring all kinds of interesting things but the ask paper talks about how we recursed the networks and things like that no but just
here stupid uh oh just one wondering when you Feed back your inference does the program work or like one of those beats are actually physically I see what you're saying so yeah so we actually the we only actually to be honest got Real Results towards the end of the semester because we finally got everything working so um one of the two built that final tool that took our inference and put it back and order it into the original state so and now the nice thing is yeah we can drop it in place now since we
know that it's bit per the ones that are bit perfect of course it now what we're doing is looking at ranking and clustering where the errors are what they are because for instance if it is in a right now we're not doing anything no filtering if it's in a transient bit right that's right after to write before read then of course it won't make a difference we're only starting that work now yeah okay as well as if it's MSB versus LSB right make a lot yeah this might already be clear but in in some of
our other work uh we worked something analogous to Branch prediction just at these larger intervals so it had that property of sure you might pay for cost of incorrect predictions but it's impossible to Commit to incorrect computation so that was uh like when we try to make this a little bit like tighter those are the type of guarantees that you can provide yeah obviously you can do though right say again here is less obvious that you can do that yes so here so what Tommy's talking about is when we actually took these ideas and built
a system to actually achieve speedups this is the the underlying study of the networks so what Tommy's talking about is what we did to stop to to bypass this incorrect problem what we do is we take the output of the neural network right we then pretend it's like a speculation we take a core and speculate from there to get a new end State corresponding to that start State we then put that into a cat now if the computation hits and finds in the C then you jump to the end and that we were able to
demonstrate as crude as it was one it can't make a mistake two it can actually we were able to see that the system could achieve its theoretical speed up maximums which in our case it was a third speed up um using no par nothing was parallel in the programming But the implementation was completely par so I'm taking this away from sort of the end results this part of this talk I think is about you know creating an environment where you can go and log into a computer right and just have lots of data and lots
of and computational resources available to you to work on in a very interactive fashion right um now and batch what's that and batch you can also launch batch experiments if you want to sweeten new parameters generate new data and so on yeah now if I look at so first of all there's something that's turned out to be awfully broken in this which isn't your fault at all is that what we found out afterwards was that openstack reserves a GPU for you if you have a M that's even if it's powered off so right now I
don't know if we're charging you for it or not yet but but but the real charges someday will be it doesn't matter if you power it off which is insane to me but that's not my that's that that is something we haven't be a to work around containers on the other hand do have this attribute where if it's if it's off then you don't you you End up freeing all the resources of it except for the storage um I don't know that but but actually I'm curious of anyay from redhe hats online is like could
we you know this IDE of being able to just log in interactively and have the machine powered down where it's not being used that it seems like people by default we had to do something special for our courses to have all these amount these resources freed when you're not doing it does I don't know if open shift or open shift AI supports the releasing resources when it's not active um and this idea of just being able to use this as an interactive service SSH and add an extra little bit of shim has anybody done that
with the container environment is there any reason you wouldn't use containers for this because of all of our custom software I mean are are we're using a specific tool chain we're using our own C code a lot of this was custom and we needed an environment that we were familiar with if somebody was going to do the work and give us an environment of containers that do it all like for the course you created your own image with so I didn't want to do that okay so it's about the creating a custom container and stuff
I mean and I wanted these guys and sanj to basically just SSH yeah well it doesn't seem like a far stretch um but it seems like something we want to support like part of this is just a use case for for what we need to provide to researchers that just want to have that interactive experience from redhead is online first you have any opinions on this um no not really nothing that I could add and make it better from here this is something we should try to internalize and figure out how we can actually support
exactly this kind of use case because it seems like generically very interesting but very useful a question so when you say architecture Independence do you mean that you have a accelerator that's attached that your you're working bid to do it or complete architecture Independence um so what we mean is that a binary Vector representation of a computer is a binary representation of any computer right I mean you can you can create ones and zeros I don't really care what the instruction set was I Don't care what the code was we're operating at the level of
the underlying literal touring configuration states of the machine now we've got other work where what we did was so imagine the following um let me start not with the state vectors let me start with something easier every machine you can say you could enumerate all of its Ops right x86 it's in a un huge number like well over 4,000 plus uh uh unique op codes right now if I took the op codes of any computer is going to go from zero to n right and then what I did was I said okay I'm just going
to watch computation and the most frequent whatever the op code is on both platforms it's whatever its most frequent op code is I'm gonna make that the zero instruction op and then I might see a different one then becomes one and so on now what you can do is you can create a a trace from both computers right with their op code ranked index per time unit and you can plot it on the same graph right you can Actually watch and plot two different computer executions where X is time Y is op Cod right and
if I have a problem with scale then I can scale now think of this if I now take a machine's State Vector represent which is very different right 6502 is 64k uh we've done it with x86 machines uh with I think we've done two things we've done sparse address spaces and we've done condensed address spaces to four gig and now what you do is you do an sore so at every time step instead of gathering the full State Vector like these guys have done you say what is the exor between this and the last state
now you record the exor and you're going to find that theor actually repeat and now they become a dictionary a token or a new kind of instruction now you rank those by frequency and you plot again and we've done that and it is fascinating you can take micr chess on x86 you can take micr chest on on 6502 you can plot them and you will see that if you start expanding or constra the timeline all of a sudden you can Find where they fit which maybe is not that surprising the x86 has about a anywhere
from a one to nine instruction compression compared to 6502 because 6502 is a wrist Blake architect if you use that dilation Factor all of a sudden the graph start to look the same so if you learned on the underlying structure what's going on I'm going to argue that the neural net is actually learning something that is the actual property of the execution the algorithm rather than a fe actually more than the algorithm we demonstrated that it also includes the Distribution on the input data an algorithm is one thing but a shortcut has to do with
finding and exploiting things where the data and your algorithm so are interacting why didn't you like why did you obviously the whole state Vector here 64,000 times 4K four times 8 why didn't you do the xor because that would presumably be tremendously less data we do I mean our data set has all of it in there but we just wanted a very straightforward concrete problem that we could map to a prior thing we worked on And ask was the easiest one for us to focus on take one more swing at that question of architecture I
think it's really um so like you know I think this chess thing is really cool but you could take it or leave it right I think the real magic is in the transformation from uh a general purpose process right an instant in time to in alternative representation that can be consumed by other Hardware right so like that is the the first step of that Independence right so going from like processor format to uh to GPU format right but it's not just GPU format it's a bit Vector that's an input that's amenable to just linear algebra
right and like holy crap the the the processor there is math right so like I think it's all about looking for these Transformations like the The Gather that takes it out of the CPU and into that new form format the scatter that brings it back into the processor format you could start to wonder like hey can I start putting these Transformations together can I take it off an x86 processor put it on an arm processor right can I take it off that and put it back into math so like the bigger idea of architecture Independence
is kind of in generalizing this transation and there's evidence biologically that it's a it's a multiscale iterative process we have parts of us that are operating in this moment in time in this location right but our knowledge that makes us efficient with what we're doing right now is actually much slower and much more abstract right and somehow we take what's happening right now we propagate it up we get predictions on the way down that become localized to what we're doing now so we wrote a proposal where we start talking about this idea as a piece
of a much larger architecture where you imagine neural networks or whatever you want to call them State Vector Pairs and neural networks that are kind of being abstracted and abstracted and abstracted away that then are going to be giving you predictions downward now whether that goes to the end to an x86 to an arm To a 6502 it's the as you go downwards you start specializing the prediction to your particular consumer and the cashes I talked about in some sense is the final realization of that when I stick a state end and start end that's
to this computer in this location right now there's again this is obviously Decades of ideas and research so I'm happy to keep going but I want these guys to be able to sit down um so again I personally think it was an incredible uh Hall to jump into this and the Very fact that they didn't go running in the middle of it with all the little pieces that we were trying to get working and validate was tremendous so is that a poor sign of of intelligence really cool really cool stuff an hour anyway so of
course there's probably very little in this life that I can do conventionally so I'm going to take a Moment and um provide a little bit of background and an intro forj and Patrick because in some sense as undergraduates they've been thrown into the lion's den um do you mind actually just flipping to that timeline side so the work that they are or they engaged in this last semester really dates back so actually about 2004 so almost two decades ago a set of ideas while I was at the University of Toronto where we started conjecturing that
statistical inference mechanisms could detect and find emergent patterns that arose in a system at a very low level that would permit or suggest the construction of a hybrid type of computer system that married biologically motivated statistical inference mechanisms which have a a promise of scalability and low power implementation to add to a computer system the kind of things that we can't get from which is The ability to find inherent short cuts and patterns in how it's behaving and exploit them automatically so the Hope was that if we convert the problem of building a computer into
one where you preserve the deterministic model as the programming model but you offload its performance into figuring out how to make statistical inference scale and be power efficient okay so these are pretty heav lofty ideas but since that period of time we've had some very interesting fun Milestones including apparently the fact that um somebody saw fit to give me tenure which despite me uh and I got a career uh NSF career award around some of these ideas so a lot of the work that they're going to talk about today is US finally having an opportunity
to have the machine learning Technologies catch up to what we were hoping for so we could try some of some of our conjectures out okay so but along that way we built systems that tried to Exploit some of these ideas to automatically scale uh computation without ever looking at the code okay that was some work uh published at Asos in 2014 in 2016 we actually built a neuronet accelerator Ator attached it to a computer and saw that our model actually gave rise to a real interface where a computer's execution at a very low level could
automatically be hooked up to a neural net and actually produce feedback that allowed some of these shortcuts now aggressive out there research I believe happens because we are a community of supportive people encouraging risk and going out there and saying crazy stupid things so these poor undergraduates are now under the risk of dealing with my crazy so with that in mind before even they get to an end they deserve some Applause for surviving this all thank you Professor ababu for that introduction um so first so we have this background of the ideas behind our our
Work and so now we'll get into what exactly we did because these are pretty headyy ideas there there's a lot of spot behind them but it's important to be able to concretize and actually test out some of the theories and hypotheses that we have and that gets into what we did this semester um so our hypothesis it says on the screen we use low-level binary representation in the form of what we call State vectors through a programs exec ution we believe that it can be exploited to improve the performance of that computer using neural networks
so as in previous works but in a more General sense we will take all the memory of a computer all the line them up as a state vector and then try and detect patterns and changes so that if we see a recurring pattern or some other kind of thing that say a neural network or other machine learning techniques can exploit we can use those to improve execution and to learn more about the underlying processes and machine itself um specifically this semester we worked with the micr program running on the 6502 micro processor and we hope
to bypass deterministic computation by offloading ex execution to neural networks so ning the idea of this state Vector computer is deterministic it goes through Specific Instructions that carry it to differentes in a vector space and taking that and using statistical inference and Merl networks to create a model predicting not necessarily exactly where it will go but idea what and why is it a good application to start with it's a good application because we'll get into it in a second but actually on the next slide but starting with the 6502 and micr chest the 6502 is
a processor from the 1970s that was used in the Apple 2 and the kim1 and several other Home computers so it's a it's a very small computer compared to today's modern standards it only has 6 4 kilobytes total of memory it's a processor um so in that sense it's it's pretty easy to or oh well I shouldn't say pretty easy it's a lot easier to deal with that than any megabytes of of Ram or total memory um and you're able to view the whole machine the whole deterministic process but it's still large enough where it's
a general process this computer can execute a lot of different things um so we picked the micr chest program it's the first uh program that built for the 6502 as as a kind of game um it was built by someone named Peter Jennings who then sold it through mail order distribution it has kind of a fun History I'd recommend looking it up after um but we did this using uh simulator from Canada by the way that's true um so we we have a custom 6502 simulator that goes through and it's a c program that runs
on uh on an x86 architecture but it you can go through and execute each individual ual instruction of the process and it's able to create traces and create those State vectors that we were talking about where we're able to take a snapshot collect all of the memory all of the registers put them together and get that state Vector um so again the Y um it's much smaller than the modern computer but it's still a general purpose computer so in the image here you can see um a few images of what it actually looks like to
play micr chest which we'll do in a little bit but this is an example of what a trace would look like where each black and white pixel is either a zero or a one for a bit and you can see how it changes over time it's a stochastic process and the goal would be to go from somewhere up here to somewhere down here to skip execution using our our statistical techniques um how is this research Related to the previous works in the past with ask and with Dana it used something called online learning where a
neural network trains as a process is executed and it uses one example each time to improve and to go through like gradients assign and reset the weights but this time we're using batch processing we already have the program executed we already have all of the traces and as a result we can do more it gives us more freedom and flexibility in terms of the networks that we could try we can try different architectures of our neural network we can try different parameters we can see what works and what doesn't and instead of going along with
the program as it's running or with the the system as it's running we can take a step back and we can see what works what patterns are we observing and have more of a data driven approach as well where we collect everything that's associated with our process and make sure that we're doing the right thing and what we're saying we're doing and that that's matching up and also that we can test more options however we're going to do some quick math and so we have a custom piece of software in the 6502 as simulator and
we also have quite a lot of data that we're going to be dealing with because for each state of the 6502 it's 64 kilobytes of memory Plus 8 bytes on average it executes about 14 million instructions per board State and in just our experiment that we ran this semester it was 358 board States but we have many tournaments that we're hoping to go through so when we produce a trace of this data it it leads to quite a lot um that's a lot of data and so the question becomes how are we able to use
this soft Ware but also how are we able to deal with this massive amount of data and that's where theoc and nerd come in so we have this problem we have a lot of data we have a lot of custom firmware and we have a lot of things to do with it so how do we bring this problem into the physical this is our basic framework that we set up we have our two main computers the ja CPU and the J GPU and as you can see we have a lot of storage with them as
well so that's what we keep for large amounts of data um along with it we have our VMS we're able to run our custom uh software that we have to be able to run run the 6502 and all of our experiments and uh being able to hop into them from anywhere obviously with SSH and being able to work not only with our group that we did our research on the semester but with sanay as well a he's able to work on the Jupiter environment which are able to um pop up on these computers and be
able to work There as well as being able to see statistics of our computers outside of the computer making sure we have the space and seeing all of what's going on in the inor but the actual process of collecting our data we start with the PGN notation these are tournament games of uh chess notation that you can get offline online um and there we have many many games of many tournaments and using these games we're able to extract multiple games from from these tournaments and put them into a state that's easier to work with first
we start moving it to a fend notation which is foresight Edward notation a common notation and using these boards we're able to uh put them into a state where we're able to use uh the micr state notation and run them through our 692 to be able to get both a start state of a board and an NCM board out to play after we've collected these boards we're able to um maybe just to clarify so the board State we get from online That's what we call the start State and then we're going to let the Entire
Computer including the the program make a move so we're letting the program evaluate and go 14 million sets of instructions on average it executes for it to produce its response to that board state so that's the so this is not neur own networks at all right now this is just sort of gathering some evidence of these kind of movies about what this program really did okay so that's what they mean when they say start end States it's the start state was given the end state was produced by this computer right and it's a lot of
data because you're you're getting literally every full Fidelity step Trace every bit that yeah so we have uh while we run We Gather every every state of the machine as it goes through however the particular study that we're going to focus on uses a subset of it but that's just because it's a starting point yeah so when it comes we took one tournament we took 358 start and state Pairs so the start state is some situation present in a game um this is a tournament from Hamburg in 1868 so back several decades ago um someone
was in this position and was faced with what move should I do and then it turn then we don't evaluate what that person did or what the ideal chess move would be we evaluate what exactly would the micr chess uh CPU program as in as created by Peter Jennings how would it evaluate that and what move would that program make as a player M no right no I mean on an 8bit computer back in 1970 something it's incredibly competive compared to deep blue not a chance compared to the average I don't know I don't I
can't even play chess so what do I know say it's surprisingly good um but there's also a number of issues outside of that as far as the micr chest program itself when we demonstrate it later we can show that like you can move any piece anywhere on the board um it doesn't really have like oh you can't do this move this is not a rule like you you can do anything you want so that's kind of a danger which is why we we chose preet up boards and let the computer play through it so that
we know that what's like intuition behind the logic on that program effectively like Um what it does is it goes through and it evaluates each potential it's either like a bunch of candidate moves and it evaluates it in a chest process where you check for checks and like see should I move out of a check then you check for captures like what pieces should I take and then you look at strategic position and so it goes and then based on I'm not sure the exact like how it evaluates but it some kind of rating and
then BAS Bas on that it picks the move with the highest point rating and selects that so it does have some kind of strategic intuition is that possibility also like select some distribution over that or uh yeah because it's it's programmed to play certain openings and certain moves so for example if we were we're hoping that like when we study beginning moves that it will be able to see that like it always happens to play these three openings and you can start to detect like you can build a model for where it is without actually
even looking at any of the source code or the program um so when we run these experiments we have these this start State we have this end State we get it into a form where we can input it into a neural network because 65,000 bits is a is a lot of input neurons so we convert it to only the bits that can possibly change we use that as the input layer and we use the Same for the end State as the output layer of the neural network we use tools that we buil we call it
I tool where we can input a series of parameters that we want to sweep and we do a grid search where we go over different numbers of eatbox different learning rates different functions for like a threshold value and other things like that and we produce a bunch of neural networks with these configurations then we go to our test set and we test it on these neural networks we see how it performs where it makes the errors and hopefully in the future we'll be able to start interpreting why um here are some of the results from
our experiments these are what our files look like it produces a lot of files and a lot of data and it's all reproducible we we spent a lot of time building tools to make sure that we can reproduce this data we've done a lot of testing to make sure that like this is stuff that even though it it's quite a lot we can like if we hand this to someone else they'll be able to reproduce like what we did this semester um so back to our our platform how what are the actual tools that we
use like we have this nice idea of oh we're going to take this these board States for micr chest and we're going to put them into a neural network somehow and then we're going to train the neural network and then test it but how do we actually do that we had we built a lot Of software underneath it using our virtual machines using the data storage that we have um so the first step in our process is how to even SSH into these virtual machines and so we we just type SSH but there's a bunch
of things in a repository called ner tools that we use in order to be able to use open stack and get into these virtual computers um this is an example of one of the scripts that we use but they help in the sense that these are virtual machines and so similarly to if you just set up an instance of AWS or something like that if you leave the virtual machine running it will keep running and so you need to be you need to have a way to call the machine where if you're inactive for a
certain amount of time or if you've used it for like however many hours you want to be able to cut it off to save your research budget and so we have tools that if for some reason we leave our computer and forget to exit out of the show or if it keeps running through a process and it gets stuck in some kind of loop that we're not aware of it kills the virtual machine before completely destroying the budget um so then with this we also have um like our software how do we actually play micr
chest on the 6502 so for this case um we have our CPU here and we have So can I jump in for a sec on that just for the S thing so we uh one of the the things is that at least right now I think we're still charging for gpus only when the machine is powered on unfortunately what we found out is that so so this is a great so this the SSH will if it's powered off will go and power back on that VM before automatically for you right which is um I think
a very useful tool for anything that's that's statically acquiring that that has his charging for things especially gpus which are ridiculously expensive to so let me just restake just so that everybody caught what you were saying so um this isn't a a built-in feature but we just hacked around so we we we figured out that some of the interfaces that were available meant that we could make it look like it was an ond demand SSH so SSH to this arbitrary name it's not an IP address but it's a name that will automatically get resolved using
the the open stack tools to an IP address if that doesn't exist if it's not ready it will trigger open stack to bring our VM up and then the other stuff that we're running inside the VM will if We've got science experiments running uh it will keep the machine running if it detects however that we're all off of it and there's no experiments running it will considered idle and it will pull the machine and shut it back down so that that way Sanjay I CJ Patrick We're unaware of this our workflow doesn't change but um
we we also have a Max timeout limit you know in case we up because I don't have infinite budget uh the other thing that we really wanted so those of you from the Nerf plan we really wanted access to our budget so we would have actually stopped the machine if we were exceeding a budget we couldn't figure out any interfaces for that yeah so that's just very practical running data science like experiments in the real world we have to worry about money so we we think like oh we're going to be exploring these neural networks
going to be diving into the depths of micr and we spent the first couple weeks like working with open stack and troubleshooting and because the price of that GPU machine is expensive on the other hand as we got to the end of the semester we really appreciated it that GPU we have access to has Um what do I care a fuckload of memory um and it really allows us to explore very uh complex uh neural network configurations but it's expensive true so back to micr chess um so we have our 6502 our simulator we have
the micr program and so we're able to run it in a way where like you hit C to clear it when you hit P this is when the computer goes through its process and evaluates mooves so the computer is the the group on top it's the white pieces and so when I hit p in my keyboard it's going to go through a series of instructions where it figures out like oh these are all the possible moves I can make as my first move and this is the move that I should do and it chooses the
most common move opening move in chest as a result um as a human you can also play against it you select the piece like for example I select the pawn on Square 63 it uses octal notation instead of traditional chest notation because it's easier for it to store move it to square 43 and so then you can move that pond and then you can play against the computer again the computer chooses like move the KN out Here you can also switch sides which is helpful for us when like we don't know we don't we don't
have to know who's playing first whether it's black or white we can automatically put the computer on the top and they're the ones that um and then once you type Q it provides a bunch of statistics about everything that happened as far as what the values of the registers are but importantly you can also see how many instructions were executed um how many EX how many were executed per second and a bunch of other data you can all there's also a bunch of other flags that you can add to the 6502 that allow you to
find the trace dump it add it to a file and then work with it like be able to do so in your use case it's just send a board you do the work collect the trace put the output it's not like a full game it's like one yeah it's yeah we take each board State and then that's that's our game schol yeah so then we go through collecting data so we talked about the process of going from a PGN into a state Vector itself and so the PGN file is just a list of moves it's
not very helpful in telling you how the board looks unless you're able to visualize a chest board in your head so what we do is we convert those to Fen files foresight Edward notation where we show on each of the eight lines of the board this is where the pieces are um this is How it's set up whether the the square is empty whether it contains this fond or anything like that and then we are able to take that board State and then put it into the initial State Vector for the micr program and then
run that and once we run it through the 6502 that is our input state that we give it and the output state that it comes out with as well as all of the traces in between so let me just point out something that these guys had to do right so got a chest board right but that thing on the far right there that UC board that's a binary Vector right so they went in they reverse engineered where in the program the bites are that represent the board they figured out what that meant and then they
converted made a tool that just generates a bit Vector that they can slam into a state of the program and then automatically now the program's sitting at that board state so that's already this idea that we're converting the world into just binary operations on States of register in memory and then you only store the vectors that are possibly like change in the system right because there are some that will never change uh so no so the the initial State that's put into the Computer is the complete State they get an entire end State and actually
every state in between but the study they do is that they and they'll talk about this in a second they're going to go and analyze of all the games they've ever played which bits in the locations did ever change and then for the for the sake of scalability for the moment we only work with those and just to another point about one them through this semester we spent a lot of time working um in a Bottoms Up kind of framework where we built a lot of low-level tools we built a tool that takes a PGN
file and turns it into a f file and we built a whole bunch of these tools to the point where now we can just run one script and it executes all these scripts below it which allows us in future semesters we can build on top of this work and it allow it like data collection can become a lot faster running actual experiments as we'll see in a bit can be a lot faster and so like a lot of the the going was slow but we're hoping that this will accelerate like future developments as well going
off the question that was just asked this is how we ended up taking the the excitable bits or in this case we took the start State and an end State and using an exor we were able to just find the specific bits in that Start and end State play that were changed throughout the process um and then going through all of the games that we had that's when we order all these State vectors that we had of only the bits that were changed into one large or um s Vector which is our mask that we
used to mask out on all the games the bids that we were actually sending through the neural network just out of curiosity did you ever check like in the test cases like did you ever find out in the test Cas that these were the only bits or like was there like a anomaly or something where some other bit also changed like in the test cases so this this was for both our training and our testing St for for like our our total data set basically um we did look into it because we have a tool
called we call it SV info what it does is it takes one of these State vectors and it tells you which bits are one and where they're located so then we can use that and figure out like wow there's a whole bunch of bits between locations X and Y that that's probably the board that changed or like oh the all these in the end this is determining whether black or white is the one to move and then you can evaluate it from that way too as far as like oh we we created this process but
like just a heris check is this actually working we can go and look and See like does this make sense that this changed I'm just GNA clarify something for pro because just to make sure that you're at it so the The Mask yeah that says these locations change was evaluated for all games not just the training set not just the test set okay and it I'm sure they'll tell you the number but it's about 690 bits locations in the 64k vector that are actually changing from a start to an end when you consider all starts
and ends this is totally lostness right there can't be a bit that ever change that you don't accumulate in M yeah and then we can run it on this is just specific to our data set right now this is not specific to all board positions of all time put on micr chest but the goal is that in the future when we start working with different data larger data sets that we able to figure out and keep adding you just say that of 64 kilobytes you had only 600 some bits that were changed 690 odd bits
ever actually change from start to end so they might have changed as intermediate results like it's locations right these Locations changed from start to end mhm if a bit you'd have to go through every state if you wanted to find out what bits ever change because right now got set back to its original we wouldn't be aware of so these are only the start and end states that we're choosing not all of the 14 million States in between where all the registers I was wrong yeahor but I'm gonna guess that they kind of like are
containing to clear regions right that start to be so those are the area write memory so there's so that's all the interesting stuff right what what how would we discover these what do they mean but for the moment we got to start something so then we also put these tools because at first when we started working we did not use um some more conventional neural network tools we used something called fan fast artificial neural networks which is running C and we use specifically float fan and so we had to build a few Tools in order
to convert these bits to floats um so that we only have 690 float values that are going used as inputs and outputs to our Nal Network so that's what some of these tools are Here um validating our tools with tests It's always important to test test test and make sure that your tools are doing what they say they do even when you're changing and in the middle of an experiment to rerun these um is always helpful um and then the process of actually running an experiment so this is the so the first part is the
collecting the data and this is goes to the second part where we have our data set we want to take out uh like we have our layers of our neural network we have what the input neurons will be we have what the output neurons will be which will be these bits that have that could possibly have changed and we go through Force phases where we set up whichever neural network configuration we're working with whether it's either fan or P torch we then go through and train on whatever percentage of the data we're choosing or whichever
data set we want at that time um we then test on a unique unque data set that is separate from our training data set and then go through a results process where we create a a Jupiter notebook that calculates some basic metrics as well as includes other numbers that we can then work with further in that notebook and each experiment that we run produces its own unique configuration of the neural network and its own notebook that you can go and run so if we run this command once it could produce say like 30 neural networks
with all unique configurations And then we can go in and figure out which ones work better which ones didn't and why so these are some screenshots from from the results notebook we tracked local errors which are in this file um in this specific board it moves from start to end State and it was supposed to predict that these 128 bits changed but it actually predicted however many bits and then we divide that by the total and so you can see like just get a relative idea of which ones were more accurate that's also why it's
above one is because there are 690 bits total that change but sometimes only like 110 bits change in in this between the start and the end state but our neural network predicted that 1 would change and so dividing 150 by 110 you get a number that's one things like that we also graphed the total errors which is like the global errors of all of our vectors and the training loss function uh for every 10 box fact to read without the eight the eight plots I can't see the title trial trial I see so yeah so
these are for each individual trial um which is the configuration of a neural network okay and also that that one's on the left those local areas that are above one that was all when we were using fan which is a very old tool which is very Brittle and we're not even sure we used it right but soon as we switched over to using more modern tools High torch and so on we actually eliminate B the neural networks actually get very good very fast yeah this is one of the results from one of the specific experiments
that we did so a specific neural network configuration in this case you can see in the top right it we started with the 690 changing bits starting in the end state but in between we had three hidden layers um double the size of the the starting and the end stage that gives the neural network a lot of flexibility to try to track uh what's going on in between the process of when it's predicting um but basically we were able to see out of 73 results for this experiment you were able to see that 16 of
them only eight bits were actually predicted incorrectly so when it went from the start and the end State it's prediction was almost spot on maybe off by one bite um but we but out of one bite in 690 bits that are changing and 6,000 um total bytes in a state Vector that's not a lot of error that we see for this one I missed it how do you divide the the training and test data set so we we randomly select in in this specific experiment um we went through We had seven different games each game
had a number of board States and so we randomly assigned those board states to a training and a testing set what was the the percentage it was 8020 split so for training and 20 for testing yeah yeah that's a pretty typical yeah yeah so that's how we miss but like just taking a step back right you're trying to replace 14 million instructions with a couple major multiplies and sigmoid or something right like that's the idea and you have eight bits that are wrong it seems like you're honest that's like the big picture eight of them
were predicted totally accurately so you you could have just run this neural network and you would have gotten the exact state that you would have run if You' run 14 many instructions and it's sort of ridiculous that it should have done this well with such little data and it's actually also nonsense we can get that would it reduce your data set to use some kind of hyper parameter optimization sure actually matter of fact all kinds of things could be done at this point a lot of this work was to set us up to start doing
this kind of study because there's actually the architecture of the network doing things that are much more interesting like actually using we all know as computer systems people the fundamental unit is a Bite restricting ourselves to only look at the bits that changes the input that seems ridiculous right there are all kinds of interesting things now for us to start exploring Eric yeah yeah so I was going to say I ask what is the what is the comparison between the computational cost of this neural net to the 14 million instruction you skipped and what do
you expect that to be once you have your Hardware accelerators Etc okay so the second part yeah first part it's right I mean this is a huge GPU sucking up more power than that CPU for sure despite it being my shitty C code that's running the 6502 the idea here is not so much this particular technology right now we're just trying to study the baby steps sure is it even viable with no semantics to take a binary vector representation of a computer and infer that there is opportunity for acceleration now the second part of what
you said speaks directly to my original motivation if you don't look at the Crappy pardon all the machine learning people here the horrible things we as computer scientists have done to do machine learning you actually look at there's two parts of this community there are those who have been studying it from a biological perspective and those who have been simply making sure that Google can predict what we're about to type next using a lot of computers and a lot of energy those who have studied the neurobiology suggests there are both scalable and power efficient implementations
if we're willing to go radically different architecture I I've seen hugely efficient implementations once you have an architecture set right I I watched the PHD thesis presentation at MIT about taking quing models and turning them into having the the the actual architecture in Hardware you fed the model to it and it was now it kept up with real time and was as accurate as it was running on GPU for hundredth of the cost right just most people have per milon but you're right that's so what Eric is saying is while everybody today for those of
us who are sort of not in the field we hear about machine learning it has to do with the Matrix math that's being deployed on gpus because that Seems like the way we could get this to work but the there are other branches of machine learning that study if you knew the architecture like biology you could grow or build a custom circuit that is radically more efficient as a matter of fact a lot of the most early interesting stuff on what is called neuromorphic Computing demonstrated Nano or Pico jewels to do an inference not gws
of GPU please not that I have an opinion yeah there has been one particular thesis Strat yeah so what this means as far as future deployment uh or future development so future directions first prod we talked about a lot of them already but um applying them on different programs on this program we can apply them to more games we can achieve larger scale we can start to look at the full trace and pick different spots instead of just the start and the end States and we can start to learn the actual representation of what what
is this computer doing how is it able to calculate these moves um we can do them with more and less complex programs and see how our neural networks Fair um we can also use like Different architectures for our neural networks and continue to improve on that front we can also start looking at new systems uh new systems in the sense that they're more modern such as a risk5 architecture s86 um goals and aspirations for the future is to find a model of execution somewhere as well as achieving architecture Independence because we our hypothesis is that
we can find some model of a computer outside of the specific instruction set and everything to it that we can abstract away um and the goal is to continue building a concrete foundation for future data DP just want highlight this last point because there have been plenty of my graduate students and folks who have C their heads against this one of the big contributions these guys did is bring this into a world where we can now produce Jupiter notebooks massive amounts of data we are not the experts here we don't claim to be the experts
on machine learning but now we can actually bring people into the fold to work on this problem and they don't have to know that they're working on speeding up computers they can do what they do well And we can work with them in a really kind of seamless way now of course I don't know if sanj's on the call so I can say whatever I want maybe he would say no that I could never work with you guys but oh perfect so then it is the best collaborative experience that he has ever had working with
assist I have you tried the last closing the loop so taking the infert vector putting it back into the machine State and so um we so this piece of work right now a lot of what we focused on was setting things up to do studies so the past work has been all about could we get any form of online real acceleration and we exactly take an inference we feed it back back in we take that inference we feat it back in and we can actually play forwards and backwards but we're also very careful in that
past work this is ridiculous right we're saying the beginning of a move to the end of a move in past work where we what we made the prediction problem was much more engineered and precise we basically said hey look you stand back we're watching a Reoccurring pattern let's make that thing the thing the prediction problem problem is because now that might be a a a front end of a loop it might be the start of an interrupt right whatever the case be in the LA when we were doing this they were computationally bound work we
purposely chose Loop heads and then we said ,000 Loop iterations forward that's your uh uh task here we just wanted to say hey look we take something extreme an i driven interactive program with a complex algorithm behind it and just say semantically the problem is can you replace the entire algorithm which is roughly ridiculous right but we can ask the question concretely we can study it and now we can start exploring all kinds of interesting things but the ask paper talks about how we recursed the networks and things like that no but just here stupid
uh oh just one wondering when you feed back your inference does the program work or like one of those beats are actually physically I see what you're saying so yeah so we actually the we only actually To be honest got Real Results towards the end of the semester because we finally got everything working so um one of the two built that final tool that took our inference and put it back and order it into the original state so and now the nice thing is yeah we can drop it in place now since we know that
it's bit per the ones that are bit perfect of course it now what we're doing is looking at ranking and clustering where the errors are what they are because for instance if it is in a right now we're not doing anything no filtering if it's in a transient bit right that's right after to write before read then of course it won't make a difference we're only starting that work now yeah okay as well as if it's MSB versus LSB right make a lot yeah this might already be clear but in in some of our other
work uh we worked something analogous to Branch prediction just at these larger intervals so it had that property of sure you might pay for cost of incorrect predictions but it's impossible to commit to incorrect computation so that was uh like when we try to make this a little bit like tighter those are the type of guarantees that you can provide yeah obviously you can do though right say again here is less obvious that you Can do that yes so here so what Tommy's talking about is when we actually took these ideas and built a system
to actually achieve speedups this is the the underlying study of the networks so what Tommy's talking about is what we did to stop to to bypass this incorrect problem what we do is we take the output of the neural network right we then pretend it's like a speculation we take a core and speculate from there to get a new end State corresponding to that start State we then put that into a cat now if the computation hits and finds in the C then you jump to the end and that we were able to demonstrate as
crude as it was one it can't make a mistake two it can actually we were able to see that the system could achieve its theoretical speed up maximums which in our case it was a third speed up um using no par nothing was parallel in the programming but the implementation was completely par so I'm taking this away from sort of the end results this part of this talk I think is about you know creating an Environment where you can go and log into a computer right and just have lots of data and lots of and
computational resources available to you to work on in a very interactive fashion right um now and batch what's that and batch you can also launch batch experiments if you want to sweeten new parameters generate new data and so on yeah now if I look at so first of all there's something that's turned out to be awfully broken in this which isn't your fault at all is that what we found out afterwards was that openstack reserves a GPU for you if you have a M that's even if it's powered off so right now I don't know
if we're charging you for it or not yet but but but the real charges someday will be it doesn't matter if you power it off which is insane to me but that's not my that's that that is something we haven't be a to work around containers on the other hand do have this attribute where if it's if it's off then you don't you you end up freeing all the resources of it except for the storage um I don't know that but but actually I'm curious of anyay from redhe hats online is like could we you
know this IDE of being able to just log in interactively and have the machine powered down where it's not being used that it seems like people by default we had to do something special for our courses to have all these amount these resources freed when you're not doing it does I don't know if open shift or open shift AI supports the releasing resources when it's not active um and this idea of just being able to use this as an interactive service SSH and add an extra little bit of shim has anybody done that with the
container environment is there any reason you wouldn't use containers for this because of all of our custom software I mean are are we're using a specific tool chain we're using our own C code a lot of this was custom and we needed an environment that we were familiar with if somebody was going to do the work and give us an environment of containers that do it all like for the course you created your own image with so I didn't want to do that okay so it's about the creating a custom container and stuff I mean
and I wanted these guys and sanj to basically just SSH yeah well it doesn't seem like a far stretch um but it seems like something we want to support like part of this is just a use case for for what we need to provide to researchers that just want to Have that interactive experience from redhead is online first you have any opinions on this um no not really nothing that I could add and make it better from here this is something we should try to internalize and figure out how we can actually support exactly this
kind of use case because it seems like generically very interesting but very useful a question so when you say architecture Independence do you mean that you have a accelerator that's attached that your you're working bid to do it or complete architecture Independence um so what we mean is that a binary Vector representation of a computer is a binary representation of any computer right I mean you can you can create ones and zeros I don't really care what the instruction set was I don't care what the code was we're operating at the level of the underlying
literal touring configuration states of the machine now we've got other work Where what we did was so imagine the following um let me start not with the state vectors let me start with something easier every machine you can say you could enumerate all of its Ops right x86 it's in a un huge number like well over 4,000 plus uh uh unique op codes right now if I took the op codes of any computer is going to go from zero to n right and then what I did was I said okay I'm just going to watch
computation and the most frequent whatever the op code is on both platforms it's whatever its most frequent op code is I'm gonna make that the zero instruction op and then I might see a different one then becomes one and so on now what you can do is you can create a a trace from both computers right with their op code ranked index per time unit and you can plot it on the same graph right you can actually watch and plot two different computer executions where X is time Y is op Cod right and if I
have a problem with scale then I can scale now think of this if I now take a Machine's State Vector represent which is very different right 6502 is 64k uh we've done it with x86 machines uh with I think we've done two things we've done sparse address spaces and we've done condensed address spaces to four gig and now what you do is you do an sore so at every time step instead of gathering the full State Vector like these guys have done you say what is the exor between this and the last state now you
record the exor and you're going to find that theor actually repeat and now they become a dictionary a token or a new kind of instruction now you rank those by frequency and you plot again and we've done that and it is fascinating you can take micr chess on x86 you can take micr chest on on 6502 you can plot them and you will see that if you start expanding or constra the timeline all of a sudden you can find where they fit which maybe is not that surprising the x86 has about a anywhere from a
one to nine instruction compression compared to 6502 because 6502 is a wrist Blake architect if you use that dilation Factor all of a sudden the graph start to look the same so if you learned on the underlying structure what's going on I'm going to argue that the neural net is actually learning something that is the actual property of the execution the algorithm rather than a fe actually more than the algorithm we demonstrated that it also includes the Distribution on the input data an algorithm is one thing but a shortcut has to do with finding and
exploiting things where the data and your algorithm so are interacting why didn't you like why did you obviously the whole state Vector here 64,000 times 4K four times 8 why didn't you do the xor because that would presumably be tremendously less data we do I mean our data set has all of it in there but we just wanted a very straightforward concrete problem that we could map to a prior thing we worked on and ask was the easiest one for us to focus on take one more swing at that question of architecture I think it's
really um so like you know I think this Chess thing is really cool but you could take it or leave it right I think the real magic is in the transformation from uh a general purpose process right an instant in time to in alternative representation that can be consumed by other Hardware right so like that is the the first step of that Independence right so going from like processor format to uh to GPU format right but it's not just GPU format it's a bit Vector that's an input that's amenable to just linear algebra right and
like holy crap the the the processor there is math right so like I think it's all about looking for these Transformations like the The Gather that takes it out of the CPU and into that new form format the scatter that brings it back into the processor format you could start to wonder like hey can I start putting these Transformations together can I take it off an x86 processor put it on an arm processor right can I take it off that and put it back into math so like the bigger idea of architecture Independence is kind
of in generalizing this transation and there's evidence biologically that it's a it's a multiscale iterative Process we have parts of us that are operating in this moment in time in this location right but our knowledge that makes us efficient with what we're doing right now is actually much slower and much more abstract right and somehow we take what's happening right now we propagate it up we get predictions on the way down that become localized to what we're doing now so we wrote a proposal where we start talking about this idea as a piece of a
much larger architecture where you imagine neural networks or whatever you want to call them State Vector Pairs and neural networks that are kind of being abstracted and abstracted and abstracted away that then are going to be giving you predictions downward now whether that goes to the end to an x86 to an arm to a 6502 it's the as you go downwards you start specializing the prediction to your particular consumer and the cashes I talked about in some sense is the Final realization of that when I stick a state end and start end that's to this
computer in this location right now there's again this is obviously Decades of ideas and research so I'm happy to keep going but I want these guys to be able to sit down um so again I personally think it was an incredible uh Hall to jump into this and the Very fact that they didn't go running in the middle of it with all the little pieces that we were trying to get working and validate was tremendous so is that a poor sign of of intelligence really cool really cool stuff an hour anyway