Okay, so last week Ollama and Hugging Face announced that they've basically created a way for you to access any of the GGUF models on the Hugging Face Hub. That's currently about 45,000 different models you can pull down. These are quantized versions of models that people have uploaded, and often they're going to be more interesting than the stock-standard stuff on the Ollama model site. So how do you actually access these?
It's actually pretty simple. You can see here that to run one of these models you use the ollama run command, just like normal, then you put hf.co, then a slash, and then the model that you want to use. So for example, here I would just pick this one, copy it, and paste it in there.
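As a quick sketch, the command looks like this (the repo name below is just an example; substitute whatever GGUF repo you copied from the Hub):

```shell
# Pull and run a GGUF model straight from the Hugging Face Hub.
# "bartowski/Llama-3.2-1B-Instruct-GGUF" is an example repo name only.
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
```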
Now, by default it will take one of the 4-bit quantized versions and install that. But you'll see that a lot of the GGUF repos actually have lots of different quantized versions. In here we've got everything from a 2-bit quantized version up to an 8-bit quantized version.
So how do we select one? We can add a colon at the end, followed by whatever quantization we want. Or we can just come over to "Use this model" on the Hugging Face Hub, come down here, and select Ollama. Then we can pick which one we want to use. In this case I'm going to go for the tiniest one, the 2-bit quantization, and copy it over.
Coming to my terminal, I just run this, and sure enough it starts pulling down that GGUF version. We can see at the top that it's bringing down the Q2_K version. Now that it's fully downloaded, we can use it like normal, and sure enough, this is a 2-bit quantized model.
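Assuming the same example repo as before, pinning a specific quantization just means appending its tag after a colon:

```shell
# Append :TAG to pick a specific quantization instead of the default ~4-bit.
# Example repo name; Q2_K is the 2-bit variant shown here.
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q2_K
```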
It's quite quick, and uncensored in this case. So we can use this just like we would any other model. If we want to set the system prompt, we can just come in here and set it like that, and now we've got our drunk complaining assistant that perhaps doesn't want to help us.
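Inside the interactive session, setting the system prompt looks something like this (the persona text is just an illustration):

```shell
# Inside an `ollama run ...` session, REPL commands start with a slash:
>>> /set system "You are a drunk, complaining assistant who grudgingly helps."
>>> why is the sky blue?
```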
Now at any point we can do everything else that we can normally do with an Ollama model. You'll see that the model will actually show up in here, in its own repository under hf.co, but it will act just like any other Ollama model. And if we want to get rid of it, we can simply do ollama rm, and sure enough it will be gone, just like any other Ollama model.
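For reference, listing and removing the model uses the ordinary Ollama commands (same example repo and tag as before):

```shell
# The hf.co model shows up alongside your other local models.
ollama list

# Remove it like any other model (include the tag if you pulled one).
ollama rm hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q2_K
```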
So you can pick any of the quantizations you want to try this way; it makes it really simple and quick. All right, if you're not sure which quantization format to pick, let's have a quick look at some of these. The most common one is going to be Q4 quantization, which is 4-bit. The number after the Q tells you whether it's 4-bit, 5-bit, 8-bit, et cetera.
Now, if you're not sure which one to pick, the best bang for the buck is usually going to be the Q4_K models. After the K you'll often see an S for small, an M for medium, or an L for large, which changes the size of the file. Generally you're making a trade-off between the size of the model, the speed of the model, and the quality of the model. Like I said before, people often find that the Q4_K format tends to do really well for quality, and it's also not extremely slow. If you go to the Q8 models you're perhaps getting a bit better quality, but at the cost of a slower model.
Now, how does the quality of the model change? It really differs from model to model. In the past we used to assume that lower precision probably meant the model wouldn't be able to do certain kinds of tasks, like function calling or anything related to "reasoning", et cetera. Nowadays my attitude about this has changed: I really feel you need to try it out model by model, because it can vary quite a lot. If you want a model that's just super fast and good at, say, chatting, and you don't really care about any higher-level capabilities, you can often get away with a Q2 model, basically a 2-bit quantized model like the one I downloaded before.
Now, obviously that's going to be a much smaller model than the higher-bit-rate models. You can also do things like make your own Modelfile, just like normal, and basically put FROM hf.co and then the model name in there. And of course, in that Modelfile you could put a hard-coded system prompt if you want to do that.
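A minimal Modelfile along those lines might look like this (example repo name and persona, not from a specific project):

```shell
# Write a Modelfile that points at a Hub GGUF and bakes in a system prompt.
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q2_K
SYSTEM "You are a drunk, complaining assistant who grudgingly helps."
EOF

# Build a named local model from it, then run it as usual.
ollama create grumpy-assistant -f Modelfile
ollama run grumpy-assistant
```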
You can see we can also change the chat template if we want to. It needs to be in this Jinja or double-handlebars kind of format, and occasionally you will find that some of the GGUFs don't have this set properly.
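If you do hit one of those, a TEMPLATE line in a Modelfile can override it. This is only a sketch: Ollama templates use Go-template (double-handlebars) syntax, and the role tokens below are illustrative, since the correct ones depend on the model family:

```shell
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""
EOF
ollama create fixed-template -f Modelfile
```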
And in that case you need to come in and set it yourself, but for most of the files you're going to be fine out of the box. You can just search for models, find the GGUF version, and download it in here. And there are a lot of these models, going right back to the old ones from TheBloke,
through to a lot of the more exotic fine-tunes of the Llama models, the Mistral models, and the Gemma models. Even the Qwen 2.5 models have their own GGUF versions, and other people have done conversions of their models to GGUF as well. So this gives you a lot of models that you can start using with Ollama.
And don't forget, as always, you can use Ollama's OpenAI-compatible endpoint if you want to do something like use Swarm, or anything else that talks to the standard OpenAI endpoint. All right, I'm going to do another video about Ollama where we'll look at how to put this in the cloud and serve it with a GPU.
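As a sketch, hitting that endpoint from the command line looks like this (Ollama serves an OpenAI-compatible API under /v1 on its default port, 11434; the model name is the same example repo as before):

```shell
# Chat with the Hub model through the OpenAI-compatible endpoint.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q2_K",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```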
But until then, I just wanted to show you this really cool feature that has now come to Ollama. It gives you access to so many other models so quickly, and it simplifies things: before, you used to have to bring the model down yourself and do all the setup yourself, and now it's something you can just do out of the box, simply and quickly.
All right. As always, if you've got any questions, please put them in the comments below. If you liked this video and you want to see more videos like this, please click like and subscribe, and you'll see these videos as they come out. I will talk to you in the next video. Bye for now.