Speaking of how fast things have been going, let's talk about scaling laws. For people who don't know, maybe it's good to talk about this whole idea of scaling laws. What are they, where do things stand, and where do you think things are going?

I think it was interesting. The original scaling laws paper by OpenAI was slightly wrong, because of some issues they had with learning rate schedules, and then Chinchilla showed a more correct version. And from then, people have again kind of deviated from doing the compute-optimal thing, because people now optimize more for making the thing work really well given an inference budget. And I think there are a lot more dimensions to these curves than what we originally used, which was just compute, number of parameters, and data. Inference compute is the obvious one, and I think context length is another obvious one. Say you care about those two things, inference compute and context window; maybe the thing you want to train is some kind of SSM, because they're much, much cheaper and faster at super long context. And even if it has 10x worse scaling properties during training, meaning you have to spend 10x more compute to train the thing to get the same level of capability, it's worth it, because you care most about that inference budget for really long context windows. So it'll be interesting to see how people play with all these dimensions.
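For context, the Chinchilla-style formulation they're referring to fits the loss as a function of parameter count and training tokens, then minimizes it under a fixed compute budget. A rough sketch of the form (the exact fitted constants come from the paper and are omitted here):

```latex
% Chinchilla-style parametric loss fit: N = parameters, D = training tokens,
% E = irreducible loss, A, B, alpha, beta fitted constants.
\[
  L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad C \approx 6ND \ \text{(training FLOPs)}.
\]
% Minimizing L under a fixed compute budget C gives
\[
  N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b},
  \qquad a \approx b \approx 0.5,
\]
% i.e. scale parameters and data together, roughly 20 training tokens per parameter.
```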
So yeah, you speak to the multiple dimensions. Obviously the original conception was just looking at the variables of the size of the model, as measured by parameters, and the size of the data, as measured by the number of tokens, and looking at the ratio of the two. It's kind of a compelling notion that there is a number, or at least a minimum, and it seems like one was emerging. Do you still believe that there is a kind of bigger-is-better?

I mean, I think bigger is certainly better for just raw performance and raw intelligence. The path that people might take, and that I'm particularly bullish on, is distillation: how many knobs can you turn so that, if we spend a ton of money on training, we get the most capable cheap model? Really, really caring as much as you can about inference-time compute. The naive version of caring about inference-time compute is what people have already done with the Llama models, just overtraining 7B models on way, way more tokens than is Chinchilla-optimal.
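To make "way more tokens than is Chinchilla-optimal" concrete: at roughly 20 tokens per parameter, a compute-optimal run for a 7B model is on the order of 140B tokens, while Llama-class 7B models have been trained on trillions (Llama 2 used about 2T), an order of magnitude past the compute-optimal point, precisely because the goal is a model that is cheap at inference time:

```latex
\[
  D_{\mathrm{opt}} \approx 20N
  \;\Rightarrow\;
  D_{\mathrm{opt}}(7\,\mathrm{B\ params}) \approx 1.4\times10^{11}\ \text{tokens},
  \qquad
  \frac{2\times10^{12}}{1.4\times10^{11}} \approx 14\times\ \text{over-trained}.
\]
```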
But if you really care about it, maybe the thing to do is what Gemma did: let's not just train on tokens, let's literally train on minimizing the KL divergence with the distribution of Gemma 27B, so knowledge distillation there. You're spending the compute of literally training this 27-billion-parameter model on all these tokens just to get out this, I don't know, smaller model.
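As a minimal sketch of what that objective looks like in practice (this is a generic knowledge-distillation loss, not Gemma's actual training code; the temperature and the commented-out usage names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-label distillation: KL(teacher || student) over the vocabulary.

    Both logit tensors have shape [batch, seq_len, vocab_size]; the teacher
    is frozen and only the student receives gradients.
    """
    t = temperature
    vocab = student_logits.size(-1)
    teacher_probs = F.softmax(teacher_logits.reshape(-1, vocab) / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits.reshape(-1, vocab) / t, dim=-1)
    # F.kl_div expects log-probabilities for the input and probabilities for the
    # target; "batchmean" averages the per-token KL over all positions.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (t ** 2)  # conventional temperature scaling

# Hypothetical usage: teacher is the big model (e.g. a 27B), student the small one.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = distillation_loss(student(input_ids).logits, teacher_logits)
# loss.backward()
```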
And the distillation just gives you a faster model? Smaller means faster?

Yeah. Distillation, in theory, is getting out more signal from the data that you're training on. It's perhaps another way of getting over, not completely over, but partially helping with the data wall, where you only have so much data to train on: let's train this really, really big model on all these tokens and distill it into a smaller one, and maybe we can get more signal per token for that much smaller model than we would have originally if we'd trained it ourselves.

So if I gave you $10 trillion, how would you spend it? I mean, you can't buy an island or whatever. How would you allocate it in terms of improving the big model versus maybe paying for the HF in RLHF?

Yeah, I think there are a lot of secrets and details about training these large models that I just don't know, and they're only privy to the large labs. And the issue is I would waste a lot of that money if I even attempted this, because I wouldn't know those things.
Suspending a lot of disbelief and assuming you had the know-how to operate...

Or if you're saying you have to operate with the limited information you have now?

No, no, no, actually, I would say you swoop in and you get all the information, all the little characteristics, all the little parameters that define how the thing is trained. If we look at how to invest money for the next five years in terms of maximizing what you called raw intelligence, I mean, isn't the answer really simple? You just try to get as much compute as possible. At the end of the day, all you need to buy is the GPUs, and then the researchers can tune whether they want a big model or a small model.

Well, this gets into the question of whether you're really limited by compute and money, or whether you're limited by these other things. I'm more prone to Arvid's belief that we're sort of idea-limited, but there's always the counter that if you have a lot of compute, you can run a lot of experiments.

So you would run a lot of experiments, versus using that compute to train a gigantic model?

I would, but I do believe that we are limited in terms of ideas. Because even with all this compute and, you know, all the data you could collect in the world, I think you really are ultimately limited by, not even ideas, but just really good engineering. Even with all the capital in the world, would you really be able to assemble it? There aren't that many people in the world who really can make the difference here.
And there's so much work that goes into research that is just pure, really, really hard engineering work. As a very handwavy example, if you look at the original Transformer paper, how much of the work was joining together a lot of these really interesting concepts embedded in the literature, versus then going in and writing all the code, maybe the CUDA kernels, maybe whatever else (I don't know if it ran on GPUs or TPUs originally), such that it actually saturated the GPU performance? Getting Noam to go in and do all of that code, and Noam is probably one of the best engineers in the world. Or maybe going a step further, the next generation of models: getting model parallelism to work and scaling it on thousands of, or maybe tens of thousands of, V100s, which I think GPT-3 may have been. There's just so much engineering effort that has to go into all of these things to make them work. If you really brought that cost down, to maybe not zero, but just made it 10x easier, made it super easy for someone with really fantastic ideas to immediately get to the version of the new architecture they dreamed up that is getting 50, 40% utilization on the GPUs, I think that would just speed up research by a ton.
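As a rough illustration of what "utilization" means here, one common way to estimate it is model-FLOPs utilization (MFU): the FLOPs the training run actually performs divided by the hardware's peak. A minimal sketch, assuming the standard ~6 FLOPs per parameter per trained token approximation for a dense Transformer and hypothetical throughput and peak numbers:

```python
def model_flops_utilization(n_params, tokens_per_second, peak_flops_per_second):
    """Rough MFU estimate for dense Transformer training.

    Uses the common approximation of ~6 FLOPs per parameter per trained token
    (forward + backward); it ignores attention FLOPs, so treat it as a ballpark
    rather than an exact figure.
    """
    achieved_flops = 6 * n_params * tokens_per_second
    return achieved_flops / peak_flops_per_second

# Hypothetical example: a 7B-parameter model processing 10k tokens/s on
# hardware with 1 PFLOP/s of aggregate peak compute.
print(model_flops_utilization(7e9, 1e4, 1e15))  # ~0.42, i.e. ~42% MFU
```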
I mean, I think if you see a clear path to improvement, you should always sort of take the low-hanging fruit first, right? And I think OpenAI and all the other labs did the right thing to pick off the low-hanging fruit, where the low-hanging fruit is: you could scale up to GPT-4.25 scale, and you just keep scaling, and things keep getting better. There's no point in experimenting with new ideas when everything is working; you should sort of bang on it and try to get as much juice out as possible. And then maybe, when you really need new ideas... I think if you're spending $10 trillion, you probably do want to spend some of it actually reevaluating your ideas, because you're probably idea-limited at that point.

I think all of us believe new ideas are probably needed to get, you know, all the way there to AGI. And all of us also probably believe there exist ways of testing out those ideas at smaller scales and being fairly confident that they'll play out.
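One concrete version of "testing ideas at smaller scales" is to run a candidate change at a few small compute budgets, fit a power law to the resulting losses, and compare extrapolations against the baseline. A toy sketch, where the data points and the fixed irreducible-loss floor are made up purely to show the fit-and-extrapolate mechanics:

```python
import numpy as np

# Hypothetical validation losses measured at a few small compute budgets (FLOPs)
# for a baseline architecture and a proposed variant.
compute = np.array([1e18, 3e18, 1e19, 3e19])
loss_baseline = np.array([3.10, 2.95, 2.81, 2.70])
loss_variant = np.array([3.05, 2.88, 2.73, 2.61])

def fit_power_law(c, loss, irreducible=1.8):
    """Fit loss ~= irreducible + a * C^(-b) via linear regression in log-log space.

    `irreducible` is an assumed loss floor; in practice it would be fit jointly
    rather than fixed by hand.
    """
    slope, log_a = np.polyfit(np.log(c), np.log(loss - irreducible), 1)
    return np.exp(log_a), -slope  # (a, b)

def predict(a, b, c, irreducible=1.8):
    return irreducible + a * c ** (-b)

a0, b0 = fit_power_law(compute, loss_baseline)
a1, b1 = fit_power_law(compute, loss_variant)

# Extrapolate both fits to a much larger budget and compare predicted losses.
big = 1e21
print(predict(a0, b0, big), predict(a1, b1, big))
```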
It's just quite difficult for the labs, in their current position, to dedicate their very limited research and engineering talent to exploring all these other ideas when there's this core thing that will probably keep improving performance for some decent amount of time.

Yeah. But also, these big labs are winning, so they're just going wild. Okay.