Hello community. Today we look at LLMs that fall apart, and especially reasoning LLMs. You ask me: hey, when is an 8-billion-parameter Llama model enough, and under what circumstances do I have to upgrade to, let's say, the 405-billion-parameter model? For what tasks do I have to upgrade and pay more? Today I can give you specific data for this. Or you ask: hey, when do I have to switch from a classical Llama model to a test-time compute model like o1, R1, or Gemini Thinking? At what particular complexity do I have to switch to a completely different family of models, when is it necessary, and when can I still save a little bit by going with a smaller model? Well, as a subscriber of this channel you deserve answers. Today, for me, is February 5th, 2025, and if we look at the community benchmarks in the category of mathematical reasoning, we see at number one, number one, and number one: o1, R1, and Gemini. And if you look at the overall ranking we have Gemini, then R1, and then o1. So we get an idea of what the best models are. But there is another point: what is better? If we take a Llama 3.1 70B and run it again and again for 128 samples, and then use a dedicated reward model, or simply go with best-of-n, is this 70B model then better at reasoning than a 405B where I only have one pass? What is the relationship here? At what point is a 70B, independently of how often I run it, unable to come up with the complex solution that a 405B can find? Where is the break-even point for my models? Let's answer this question.

We open this video with some beautiful ideas about logical reasoning, because logical reasoning is a cornerstone of human intelligence, and it remains a vital and central challenge for all the AI systems we have. So we need a framework to work in, an ideal evaluation framework for the question I just formulated, and it has to satisfy four categories. Beautiful; you read it, you agree with me. Microsoft developed a very efficient solver for satisfiability modulo theories (SMT) problems, and it is called Z3. It uses a conflict-driven clause learning algorithm, a backtracking approach based on the Davis-Putnam-Logemann-Loveland (DPLL) algorithm, and if you want a deep dive, this is the link for you. Plus, I had to go to the library to see this: a source from 1993. I mean, you'd almost think computers weren't invented back then, but anyway, there we have Glasgow, Scotland, and the zebra problem that is described there; this is what we're going to work with. So yes, the latest insight in AI research builds on an idea from 1993. And you might say: ah, this zebra topic, how do you know about this? You're not going to believe it, but when I designed my extreme logic test for Strawberry half a year ago, exactly, I went to the library and looked at all the different possibilities for configuring a challenging logic test with backtracking, and this is one of the sources I used for my test half a year ago, and this is the video of it. So let's have this spark of genius and jump right in.
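To get a feeling for what a solver like Z3 is doing under the hood, here is a minimal pure-Python sketch of the same idea: assign values, and whenever a clue is violated, backtrack and count that dead end, the way the benchmark counts Z3 conflicts. The puzzle, the clues, and all names here are invented purely for illustration; real Z3 uses conflict-driven clause learning and is far smarter than this brute force.

```python
from itertools import permutations

# Toy 3-house zebra-style puzzle with two attribute categories.
# Everything below (names, drinks, clues) is made up for this sketch.
NAMES = ["Alice", "Bob", "Carol"]
DRINKS = ["tea", "coffee", "milk"]

conflicts = 0  # backtracking events, analogous to a Z3 conflict count


def solve():
    """Backtracking search: try a full assignment per attribute,
    count a conflict and move on whenever a clue is violated."""
    global conflicts
    for names in permutations(NAMES):        # names[i] lives in house i
        for drinks in permutations(DRINKS):
            # Invented clues:
            #  1. Alice lives in the first house.
            #  2. The tea drinker lives directly to the right of Bob.
            #  3. Carol does not drink milk.
            #  4. Bob drinks milk.
            ok = (names[0] == "Alice"
                  and any(names[i] == "Bob" and drinks[i + 1] == "tea"
                          for i in range(2))
                  and drinks[names.index("Carol")] != "milk"
                  and drinks[names.index("Bob")] == "milk")
            if ok:
                return names, drinks
            conflicts += 1                   # contradiction: undo, try next path
    return None


solution = solve()
print(solution, "conflicts:", conflicts)
```

The conflict counter is the key idea: the harder the puzzle's clues are to satisfy, the more of these backtracking events pile up before a consistent assignment is found.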
You know from my last video that Stanford University created the s1 reasoning LLM, analogous to the o1 and R1 test-time compute (TTC) models. They took a new dataset distilled from Gemini, then ran 26 minutes of supervised fine-tuning on 16 H100 GPUs, applied to an open-source Qwen 2.5 32B Instruct model, and out came the new s1 LLM. Beautiful. And you noticed, if you look at this closely, that this was all training-time compute. So now imagine, to make it easy, a Llama 3.1 405B, and some of you ask me: hey, do I need this test-time compute? Or, to be more precise: when exactly, given my particular task, do I need the performance of test-time compute? Because maybe it is more expensive, or the logic just isn't there otherwise, whatever. To answer this, we could look at the performance data from my last video, at s1, at Qwen 2.5, at o1 and at R1. And what did we notice? If we only optimize the data, so we only have the perfect dataset, our s1 without what we call test-time compute (there was a particular algorithm that Stanford developed for s1's test-time compute, called budget forcing), we get a score of 92.6 on a particular test, and if we activate the test-time compute we get 93.0. The difference is exactly 0.4 percentage points. Do we need test-time compute for this? And on the next test we have 56.6 versus 59.6. You might say: yeah, maybe not. But remember, I told you the Stanford setup was not the optimal configuration; they had a different aim, they wanted a simple test-time compute optimization.

So let's come to the core of today's video. I promised you data, so let's start. We have a new publication, ZebraLogic: a new evaluation framework for creating logic puzzles with a controllable and quantifiable complexity. This is beautiful. We have a dataset, a benchmark of 1,000 logic grid puzzles of increasing complexity. The complexity metric is twofold: first, the sheer size of the search space, and second (and this is the reason I told you about Z3) the Z3 conflict count. And yes, if you're an expert: this class of constraint-satisfaction problem is proven to be NP-complete. So let's have a look at the simplest case, just to get an idea, to get a feeling for this. We have three houses, numbered one, two, three, we have persons, and we have some clues. And this is what we want: for each house, a name and an assignment of attribute values. It is a really simple matrix, where you assign all the properties and all the values in the test in the correct way, given the clues. That's all there is: N houses, M different attributes for each house, and the number of clues is K. Couldn't be easier, no? Now, with a little bit of our theoretical-physics thinking about solution spaces, we define the solution space of a zebra logic puzzle as the total number of possible configurations that can satisfy the unique constraints of the
puzzle. So we have an N × M grid, with N houses and M different attributes, and the solution space is on the order of (N!)^M. Beautiful. Just to give you a feeling: a 3 × 4 grid has a solution space of about 1,300, while a 4 × 3 grid has a solution space of almost 14,000. So careful, this is where it gets interesting. And yes, as I told you, we need a complexity metric for our tests, and the second part of that metric is the Z3 conflict count. Have a deep dive in the literature I've linked if you want the details; the general idea is simple: whenever Z3 encounters a contradiction, something not working out in the matrix solution, the system backtracks a little bit, undoes some of the assignments already made, and tries a different path in the reasoning. Couldn't be easier. The conflict count is the total number of such backtracking events during Z3's solving process, and it serves as a measure of how hard the reasoning task is. So our two complexity indicators, the search-space size and the Z3 conflict count, are the defining elements we're going to look at when we investigate the complexity of the task you want to solve with a particular model.

Let's come to fact number one. This is a study by the University of Washington, the Allen Institute for AI (AI2), and Stanford University. Beautiful. And as you can see, this just happened yesterday, for me, February 3rd, 2025: on the scaling limits of LLMs for logical reasoning. This is a beautiful study, and they have the data for R1 and o1 included, so yes, we can work with this. And you might say: ah, this is the result. So let's have a look. First we have our TTC deep reasoning models: o1, DeepSeek-R1, o1-preview, and o1-mini, and as you can see, this is quite impressive. Then we have the non-reasoning models: Claude Sonnet, the Llamas, GPT-4 Omni, Qwen 2.5, whatever. And you see the jump from the first category of LLMs to the second is significant: we go from almost 60 down to 36. And you say: hey, this is the result. Not so fast, because this is not the result we are looking for; we want much more detailed results. So what do we do? We go to small grid sizes, from a grid of 2 × 2 (two objects with two characteristics) up to a maximum of 3 × 3. This is really simple. Now look at the performance data: DeepSeek-R1 is even better than o1, and beautifully, we are in the 90s; Sonnet is at almost 85%, and GPT-4 Omni is still at 80%. But careful: this is only for the most trivial cases, for really easy, simple 2 × 2 grid structures. If we just go to 3 × 4, so three houses and four characteristics, up to a maximum of 4 × 4, the performance breaks down, it falls apart; the models are not able to keep up with the logical reasoning. Sonnet falls from 84% to under 30%, and a 3 × 4 logical reasoning task is not really what we as humans would call a complex problem. You see, all the non-TTC models just fall apart; even GPT-4 Omni is below 20% accuracy, and we are only at medium size. But the others, beautiful, are still in the 90s. Now, if we go even a step higher, to a grid of 6 × 3, the non-TTC models, our second group, just vanish. It's gone; they can't reason anymore; this is hopeless; there's no way they will find a solution. So if you have a task with six objects, say six different cars in three different colors, and you have to reason over those six cars and three colors, the non-TTC models will be unable to solve the task. And now it gets interesting: o1 is different but really close to R1, and then, yeah, forget about o1-preview and o1-mini. At the higher complexity, those TTC models are the ones we need.
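The (N!)^M solution-space formula from above is easy to check yourself. A minimal sketch, plain Python, no assumptions beyond the formula itself:

```python
from math import factorial


def solution_space(n_houses: int, m_attributes: int) -> int:
    """Number of ways to fill an N x M zebra grid before any clues:
    each of the M attributes can be permuted over the N houses independently."""
    return factorial(n_houses) ** m_attributes


# The two examples from the text:
print(solution_space(3, 4))  # 3 x 4 grid -> 6**4 = 1296, i.e. ~1,300
print(solution_space(4, 3))  # 4 x 3 grid -> 24**3 = 13824, i.e. ~14,000
```

Note the asymmetry: adding an attribute (exponent) grows the space much faster than adding a house (base), which is why 3 × 4 and 4 × 3 land an order of magnitude apart.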
We need the TTC models because the non-TTC models are just gone. And they even did tests further up: this axis is the size of the solution space, and if it is really extra large, if you have a grid of 6 × 6 in our ZebraLogic, then even o1 comes down to below 50% and R1 is below 30%. So you see: you have to go by the specific complexity of your task. You can't just take the overall average, like I've seen in some publications already, and say "this is the result". No, you go by the complexity of your particular task, and then you know exactly, for which complexity, which model you can use and what result to expect. And you know, half a year ago I developed exactly this kind of test. I had 7 × 3 elements, so if you want, my 7 × 3 is not "large" anymore but already in the x-large category, and this is the complexity level at which the test I designed for o1 operates. And you see, yeah, this is necessary, because there we are between 40 and 50%, and this makes sense. So half a year ago I had the same idea, I showed you the first videos, and now, yesterday, we got this new research. So maybe somebody has seen my videos; I hope so. And in one of my last videos I showed you o3-mini-high on exactly this test, and in my humble opinion o3-mini-high is even better at causal reasoning than the o1 model, but there are so many other factors we should watch out for. You remember, at the beginning of the video I told you: hey, when should we pay up from 8B to 405B? Let's have a look at the complexity. And here we have them: the 3B, the 8B, the 70B and the 405B (the Instruct versions, never mind). The red curve is the biggest model, and the blue one is the smallest. On the y-axis we have the accuracy, beautiful, and on the x-axis we have one, and only one, of our complexity measures: the search-space size, log scale. But remember: only one of the two complexity measures.
Looking at this one dimension, we do see how the sheer size of the trainable parameter count changes the performance, but by a search-space size of about 10^6 almost everything has come down to around 5 to 7% accuracy. So these models, you could say, vanish from existence there. Interesting. And now to the second question: when should you switch, and what is the jump in performance, if you pay a little bit more for the TTC models, the reasoning models, the inference-time models? Have a look at this. This is GPT-4 Omni Mini at different numbers of passes, from one sample up to 128 samples; the top line is 128 samples, and the blue line is a single sample. So there is a difference: at a search-space size of 10^4, with 128 samples we are close to 60%, but with a single pass we are below 10%. So yes, we can gain performance here. But careful: we only have one of the complexity dimensions here; we have to take care of the other one too, more about this in a minute. I just wanted to show you that yes, there is a real effect if we run the same query, say, 128 times and then use a reward model, or a best-of-n scheme, or majority voting, or whatever you have. But now compare this with the gray curve from a TTC model, the smallest one, o1-mini, really the most insignificant TTC model: it outperforms everything else. So, to answer when you have to pay up for the TTC reasoning models: it is not that you can stay with a 4o-mini, run it 128 times, and take the best solution; that does not give you a better result than even a single pass with o1-mini. Scaling the chain of thought in the inference-time computation, in the test-time computation, is exactly what brings you this performance jump. So now you know the size of the effect, and this is just o1-mini; o3 will be somewhere else again.

Okay, now let's compare the TTC models among themselves. Careful, change of dimensions: I have now switched my x-axis to the second complexity measure, the number of Z3 conflicts for the puzzle, and again accuracy on the y-axis, no problem at all. In blue we have o1, which is, if you want, the best model; then we have R1; then, in green, o1-preview (forget about it); and in red o1-mini, which should also not really be of any importance in the weeks and months to come. And then these two lines, one in brown and whatever the other color is: this is GPT-4 Omni. Now you ask: should I use a 4 Omni model, or should I pay up for an o1 or an open-source R1? Here you see it exactly. Given the number of conflicts, which means we increase the complexity of the query, of the task we ask the LLM to perform, the higher the complexity, the more conflicts the solver has to work through to find the correct solution, and you can more or less read off the level of accuracy to expect. For o1, at close to 60 Z3 conflicts in this particular ZebraLogic test, it would be about 40%. Just to give you a feeling. So scaling the chain-of-thought tokens in our TTC models: from the non-TTC to the TTC models there is a significant jump that you cannot compensate with anything else; o1 has about ten times the hidden reasoning tokens of the other models. Beautiful.

Now, of course, the authors did not know what I showed you in one of my last videos just days ago: if you count the number of tokens generated in TTC, with green the correct responses and red the incorrect responses, the 10,000 tokens calculated by o1-like LLMs for an incorrect response are just a runaway effect, because the system doesn't know when to stop, doesn't know that it is on the wrong track and will never find a correct solution. This is where we have to be careful. If you say, hey, we'll just increase the number of reasoning tokens, which means the time we wait, this can also become a negative feedback loop, because if we allow it to run to 10,000 reasoning tokens per response, it might be that all of it is wrong and no result is generated. Just giving the system the power, or the time, or the energy to run wild with its problem-solving capacity is not the solution. The authors of this study could not have known about that other study, because the two appeared within two or three days of each other, but now you, my audience, know, and you understand that there is even more to the complexity picture than in this study.

Okay, let's come to the summary, let's bring everything together. What did we learn today? It is rather simple now. If we scale up our LLMs and go with the non-TTC models, with the Llamas, then scaling up the model size (8 billion, 13 billion, 70 billion, 400 billion) is only effective for smaller search spaces. So if your query, your task, has really low complexity, then scaling model size will bring you better results: from 70B to 405B, yes, you get an improvement in accuracy, but only in simpler search spaces, and I have shown you the graph, so you can look at it and estimate the complexity of your query yourself. Much more interesting is scaling the test-time compute. What does it really bring us if we change the test-time compute of an inference run: one minute, five minutes, an hour? We have repeated sampling, as I showed you: best-of-n, for example, majority voting, reward-model ranking, whatever you want.
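A quick back-of-the-envelope model (my own sketch, nothing from the paper) shows why repeated sampling helps and then plateaus. Even with an idealized verifier that always recognizes a correct sample, best-of-n success is 1 − (1 − p)^n, so everything hinges on the per-sample accuracy p: if the model essentially never produces a correct chain on a harder complexity bucket (p near zero, an assumed value below), no amount of resampling rescues it. The values of p here are assumptions for illustration.

```python
def best_of_n(p: float, n: int) -> float:
    """P(at least one of n independent samples is correct), assuming an
    ideal verifier/reward model that always picks a correct sample."""
    return 1.0 - (1.0 - p) ** n


# Assumed per-sample accuracies: a puzzle the model sometimes solves (p = 0.05)
# versus a puzzle from a harder complexity bucket it almost never solves.
for p in (0.05, 1e-6):
    print(p, [best_of_n(p, n) for n in (1, 16, 128)])
```

So resampling a weak model buys real gains inside a complexity class it can occasionally solve, but it cannot lift the model into a class where single-sample success is effectively zero, which matches the GPT-4o-mini-versus-o1-mini comparison above.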
It doesn't matter which: up to a certain amount, this best-of-n sampling can indeed boost performance, as I've shown you with GPT-4 Omni. But depending on your majority-voting system, on the particular reward model you use and utilize, you only get moderate improvements, and the gains from best-of-100 or best-of-1,000 plateau out for puzzles in a higher complexity group. You cannot jump to a higher complexity group through repeated sampling; it gives only a very limited, really moderate improvement, and otherwise it fails. And we just had a look at extended chain-of-thought tokens in the test-time compute. Yes, we have seen the gains, but I've also shown you, from that other study, that these models can just run away on incorrect answers. And these authors also found that if we do this, if we allow the system more internal reasoning tokens and we now wait ten minutes for an answer, even those models reach a maximum reasoning capacity; even those plateau out. When the number of conflicts exceeds a certain point, the model cannot proportionately increase its tokens, leading to suboptimal performance on even the hardest puzzles. So this too is a very narrow bandwidth: more internal reasoning tokens only help up to a certain plateau, a maximum reasoning capacity that depends on the model. Let me show you. Here we have o1-mini and the full o1, and you see the Z3 conflicts on one axis and the amount of hidden chain-of-thought tokens, the reasoning tokens, on the other. o1-mini plateaus at about 12,500, going up to let's say 13,000, and the full o1, I would say, begins to plateau maybe around the 17,000 mark; it's not really a plateau, it is still going strong, as you see, at close to 20,000 chain-of-thought reasoning tokens, and those are the hidden tokens you don't see. So you see, I think this curve really explains it to us. I would have loved to see this for the o3 model, but I suppose that is too expensive. And then, if you are a subscriber of my channel, you might say: hey, but we have self-verification, and I've shown you three videos where we do self-refinement of the AI system: self-verification, self-reflection, self-correction and whatever. Now, the authors also examined this, at a particular level, and they report that self-verification provides only a slight performance improvement, if you want absolute numbers, from 31.7% to 33% accuracy, and may even decrease performance with further iterations. So more time, more tokens, is not a solution. Because imagine you are a human and you don't know something: if I'm not familiar with the domain knowledge, if I don't know what it is about, I can self-reflect for an hour and I will not find a solution; I will maybe find out what knowledge I'm missing, so I can go out and learn the knowledge domain I need. But self-verification by itself does not bring much. So 31.