Fine-Tuning with Unsloth
Script:
After our MLX video, many of you asked, ‘What about Windows or Linux users?’ Well, today I’m excited to show you Unsloth - a blazing-fast fine-tuning solution that brings MLX-like speed to Windows and Linux. If you’re eager to get started with fine-tuning on Linux or Windows, this could be all you need.
And since you’re so eager, let’s get started with some fine-tuning. Visit the website at unsloth.ai. The more you learn about Unsloth, the less clear this website gets. It looks like they do fine-tuning, but as we’ll see, they don’t…sort of. But let’s get to the fine-tuning. Click the big green button to get started, and that takes us to the GitHub repository. Scroll down just a titch and you’ll get to the notebooks. So the main UI is a notebook? Down in the lower half is one for Mistral. Click on the Start for Free link. That takes us to the Mistral notebook. It’s hosted on Google Colab and connects us to a free instance with an NVIDIA T4 GPU. That’s a pretty amazing resource for folks on any platform: a free instance with a decent GPU. It’s got some weaknesses, and we’ll come to those later.
The actual UI is a modified version of a Jupyter notebook. It’s a bit different from a standard install since there are these handy play buttons that make it easier to get started. Click the first one. We get a warning that the notebook wasn’t authored by Google; click Run Anyway.
This sets up our free instance with Unsloth. We’re doing the standard install and then pulling the latest bits on top of that. Sometimes the latest nightly can break things, but most of the time it’s fine. Now come to the second block and click play. We’re setting some variables, including this fourbit_models variable, which is a bit pointless since it never gets used; it should really just be a text block listing the models. The magic starts at the bottom of the block when we initialize the model and tokenizer. We grab the model and load it up with some settings, most of which are defaults you don’t normally need to specify. Below the block is the output from this run. It says it’s patching the computer. It’s not. It’s patching the model we’re going to fine-tune. Then we see the progress as the model downloads.
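For reference, here’s a minimal sketch of what that model-loading block roughly looks like. The repo name and the max_seq_length value are assumptions on my part; use whatever the notebook you opened actually specifies.

```python
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # the notebook sets a value like this; adjust to your needs

# Load the model and tokenizer. The repo name below is an assumption;
# substitute the 4-bit Mistral repo the notebook actually points at.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,          # None lets Unsloth pick float16/bfloat16 for your GPU
    load_in_4bit = True,   # already the default for these quantized repos
)
```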
In the next block we add the LoRA adapters to configure what will actually get fine-tuned. Most of these settings are the Unsloth defaults; all we really need to specify is the model, as we’ll see later. Click play. It finishes in a few seconds. Nothing else in the notebook needs Unsloth; everything from here on is standard fine-tuning with Hugging Face…sort of…for the most part.
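Here’s a sketch of what that LoRA block does. The hyperparameters shown are the usual notebook defaults as I recall them, so treat them as a starting point rather than gospel.

```python
# Attach LoRA adapters so only a small set of weights gets trained.
# These values mirror the typical Unsloth notebook defaults.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                   # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",   # saves VRAM on longer sequences
    random_state = 3407,
)
```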
The next section is data prep. Data prep is the hardest part of fine-tuning, and unfortunately Unsloth doesn’t really help much here. There’s a lot more you have to do than with MLX, but again, that’s Mac-only.
There’s a lot of text here explaining what needs to be done. We’ll be using a JSONL dataset on Hugging Face and need to convert it to the format the model expects. Down in the code block we get the chat template and define some field mappings. Then we define a function that gets used in a map call to generate the prompt and completion with the template applied. Click play and it goes through the roughly 70,000 rows of the dataset, pulls out 15,000 of them to use for training, and generates the templated output.
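A sketch of that data prep block, assuming a ShareGPT-style conversations column. The dataset repo, the template name, and the 15,000-row selection are placeholders and assumptions; swap in whatever the notebook actually uses.

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# Wrap the tokenizer with the chat template and map the dataset's field
# names onto the roles the template expects (ShareGPT-style data assumed).
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "mistral",   # assumption: pick the template that matches your model
    mapping = {"role": "from", "content": "value",
               "user": "human", "assistant": "gpt"},
)

def formatting_prompts_func(examples):
    # Apply the chat template to each conversation and return a "text" column.
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(c, tokenize = False, add_generation_prompt = False)
             for c in convos]
    return {"text": texts}

dataset = load_dataset("your-username/your-dataset", split = "train")  # placeholder repo
dataset = dataset.select(range(15_000))   # one way to narrow to the ~15k training rows
dataset = dataset.map(formatting_prompts_func, batched = True)
```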
The next code block shows us the dataset: first the original content, then the text field generated by the map function. The code block after that is only there if you want to play around with a different template, so we can skip it.
Now we get to the actual fine-tuning. This uses the Supervised Fine-Tuning Trainer (SFTTrainer), which is part of the Transformer Reinforcement Learning (TRL) library from Hugging Face. These are good settings for fine-tuning a model, so click play and let it run. Then, in the next code block, we can start the actual training.
This is going to take a while on the T4 machine. But the next few blocks are interesting. First, we see how to use the model for inference. Then the rest of the blocks cover the different ways to save the model so you can use it later.
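When you get to the inference block, it’s roughly this pattern. The prompt and generation settings here are just illustrative, and the from/value message keys assume the mapping set up during data prep.

```python
from unsloth import FastLanguageModel

# Switch Unsloth into its faster inference path, then generate from a test prompt.
FastLanguageModel.for_inference(model)

messages = [{"from": "human", "value": "Explain LoRA in one sentence."}]  # example prompt
inputs = tokenizer.apply_chat_template(
    messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 128)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])
```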
The problem with everything we’ve covered so far is that notebooks are a terrible way to do this after the first try. Notebooks on Colab also have issues if you aren’t on a paid account. Some steps take a while, so you might be tempted to walk away. But if you do, you’ll probably come back to a dialog like this saying your runtime has been disconnected and that you should pay the ten or fifty bucks a month for Colab Pro if you want enough time to grab that coffee or get your kids ready for school. You’ll have to do everything again, so don’t walk away this time. For some folks who don’t really want to get deep into this, the notebook may be enough. But I think most of you want something more repeatable and efficient, so let’s look at what’s really needed and start building something better.
In the last video, we looked at MLX and its incredible fine-tuning speeds on Apple Silicon. We saw how it uses Apple’s optimized framework to fine-tune models quickly. Remember how we used LoRA to efficiently update just a tiny fraction of the model’s parameters? And how we prepared our data in that specific JSONL format with prompt and completion pairs? That whole process - from data prep to a working model - took less than 10 or 20 minutes on even a base M1 Mac.
The real magic was at the end, when we could take that fine-tuned model and run it directly in Ollama. All we had to do was create a simple Modelfile, figure out the right template to use, and run an ollama command to add it to the system. No complex conversions, no compatibility issues - it just worked. That workflow was a game-changer for Mac users.
But here’s the catch - MLX is Mac-only. It relies on Apple’s Metal framework and Silicon architecture. This left a lot of you asking - “What about Windows and Linux users?” After all, not everyone has access to Apple Silicon, but everyone wants those same benefits: faster training times, lower memory usage, and a straightforward workflow that runs on your own hardware.
That’s where Unsloth comes in - bringing similar optimizations to NVIDIA GPUs, whether you’re on Windows, Linux, or even up in the cloud.
Let’s clear up something important, something we saw just now - Unsloth isn’t actually a fine-tuning tool itself. Think of it more like a performance optimizer that preps your model for fine-tuning. It works alongside tools like PEFT and transformers to make the fine-tuning process faster and more efficient.
The magic is in how it optimizes the model’s operations before you even start training. Unsloth rewrites certain PyTorch operations to be more efficient, particularly the attention mechanisms that are so crucial in transformer models. It’s like taking your car to a master mechanic who fine-tunes the engine for maximum performance - the car is still your car, but it runs much better.
What makes this special is that these optimizations work with your existing fine-tuning workflow, assuming you have one. If you’re already using Hugging Face’s trainer or PEFT, you don’t need to learn a whole new system. Unsloth just makes what you’re already doing faster and more memory-efficient. And unlike some other optimization tools, it doesn’t sacrifice accuracy for speed - you get both.
Before we go much further, if you’re finding value in this series on practical fine-tuning across different platforms, hit that subscribe button. I’m building a library of real-world AI engineering content - from MLX on Mac to Unsloth on Linux and Windows.
In our Discord community, we’re sharing tips on optimizing fine-tuning workflows, troubleshooting common issues, and discussing what works in production, not just in demos. Join us there if you want to compare notes on memory usage, training times, and model performance across different setups.
Drop a comment with your current fine-tuning challenges - whether it’s memory constraints, training speed, or getting models into production. I read every comment and use them to plan future videos.
Before we dive back in, there’s something crucial you need to know: Unsloth only works with NVIDIA GPUs. If you’re using AMD, unfortunately this isn’t going to work for you. This isn’t just a limitation of Unsloth - key components like bitsandbytes and flash attention require NVIDIA’s CUDA architecture.
For this demo, I’m using Linux with an NVIDIA GPU; more specifically, this machine has eight H100s, thanks to Brev. But here’s what you’ll need:
An NVIDIA GPU with CUDA Compute Capability 7.0 or higher
Python 3.10-3.12 (not 3.13 - many ML libraries like PyTorch and their dependencies haven’t been updated for 3.13’s changes yet). Apparently 3.9 will work for some things, but 3.10 to 3.12 is better.
CUDA drivers and toolkit installed
At least 8GB of VRAM (though more is always better)
Quick note about Python versions: While Python 3.13 is out, it introduced some breaking changes that affect how C extensions work. Most ML frameworks and their dependencies haven’t caught up yet, so stick with 3.10-3.12 for now.
The NVIDIA requirement is non-negotiable because Unsloth relies on several NVIDIA-specific optimizations:
Triton compiler
xFormers optimizations
If you’ve got your NVIDIA GPU ready, let’s set up our environment. There are several popular options for managing Python environments:
You could use venv. It’s built into Python and it’s simple but basic. Run python -m venv unsloth-env and then source unsloth-env/bin/activate
Option 2 is Conda, which offers better dependency resolution and is especially popular in machine learning. Run conda create -n unsloth-env python=3.10 and then conda activate unsloth-env
Finally, there are all sorts of newer package managers like uv and others. They all have their own workflows, but typically, once you’ve set up the environment, you can use the standard tooling to install everything.
For Windows users watching this: you have two options - WSL (recommended) or native Windows (experimental). There are more details about installing on Windows in the Unsloth docs.
I’m using conda here because it handles CUDA dependencies better, but any of these will work. If you’re new to Python environments, stick with venv - it’s built into Python and gets the job done. And it’s always better to keep Python as simple as possible to avoid the inevitable problems that pop up with Python environments.
Let’s go back to that Mistral notebook from before, go through each of the steps, and convert it into a more usable Python script, pointing out what actually needs to be done along with some of the awkward choices in the original notebook.
The first main step is to install Unsloth. I actually had to install a few other things as well, so there’s a requirements text file in the repo linked in the description below. Run pip install -r requirements.txt to get those installed. This is, of course, after you’ve set up the environment.
Now we move on to the Unsloth block. We can go ahead and add the imports to our code: from unsloth import FastLanguageModel, and then import torch. The fourbit_models assignment does nothing, so skip down to initializing the model and tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(...), specifying the model repo on Hugging Face and the max sequence length. The dtype and load_in_4bit values are already the defaults, and it seems we don’t need the token for this one.
Then we can scroll down to the next block. Almost everything set here is the default, so all we need is model = FastLanguageModel.get_peft_model(model, ...), passing in the model we defined above.
Next we set up the dataset. With MLX we did this whole section by just renaming the fields using jq, and it handled the template automatically. It’s a bit annoying that we have to do this manually here. But we can get the chat template and specify a remapping, then load the dataset from Hugging Face and map over the data, generating a text field that applies the template to the source data.
So we’ll add the new imports, then get the chat template. Now, the function the notebook provides for the map seems a bit strange: it gets run over and over, pulling out the conversations again and again before outputting the text. I think it’s a bit cleaner to do it the way shown below, where we grab the conversations once and then apply the template. It only saves time and memory on a much bigger dataset, but it’s less annoying to look at the code this way.
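Here’s one way to read that change, as a sketch. I’m using add_column rather than a map call to attach the templated text, and the dataset repo, template name, and conversations column are placeholders and assumptions; adjust them to match your data.

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "mistral",   # assumption: match this to your model
    mapping = {"role": "from", "content": "value",
               "user": "human", "assistant": "gpt"},
)

dataset = load_dataset("your-username/your-dataset", split = "train")  # placeholder repo

# Pull the conversations column out once, template everything in one pass,
# then attach the result as a new "text" column.
convos = dataset["conversations"]
texts = [tokenizer.apply_chat_template(c, tokenize = False, add_generation_prompt = False)
         for c in convos]
dataset = dataset.add_column("text", texts)
```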
Next we can skip down to the part about actually training the model. Notice again that this really has nothing to do with Unsloth, other than a helper function that tells us whether bfloat16 is supported on the current hardware.
Unlike the other sections, the options shown here are good suggested values rather than defaults that are already set, so take this pretty much as is. I’ll add the imports and then the SFTTrainer.
The last step is to actually run the fine-tuning, so add trainer.train().
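Pulled together, the training setup in the script looks roughly like this. The hyperparameters mirror the notebook’s suggested values as I remember them, and depending on your trl version some of these keyword arguments may live on SFTConfig instead of SFTTrainer.

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

max_seq_length = 2048  # same value used when loading the model

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,                      # raise this (or use num_train_epochs) for a real run
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),  # the Unsloth helper mentioned above
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer.train()   # run the fine-tuning
```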
Now we can scroll all the way down to the section on saving models. The most interesting piece here is model.save_pretrained_gguf. We just specify the path to save the model to, the tokenizer, and the quantization to use, and it’s going to do something pretty magical. You’ll see in a sec.
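Here’s the shape of that call. The output directory name is my choice, and the quantization method is the one we end up using later; name them whatever you like.

```python
# Export the fine-tuned model as a quantized GGUF.
model.save_pretrained_gguf(
    "gguf_model",                      # output directory (my choice of name)
    tokenizer,
    quantization_method = "q4_k_m",    # quantization to produce
)
```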
OK, so I’ll run the script and it does its thing. The training takes just a few seconds on this beast of a machine, even though Unsloth only takes advantage of one of the eight H100s. After the training, it starts saving the model as GGUF, which took another two minutes or so.
Let’s look in our GGUF model folder. You see that? It’s a Modelfile. Unsloth has gone ahead and created the Modelfile specifically for Ollama. Unfortunately, it created it pointing at the 16-bit quantization, but we can easily change that to the q4_k_m model in the same directory. Now run ollama create org/model and the model has been added.
There you have it - Unsloth brings MLX-like speeds to Windows and Linux users. If any of this was confusing, check out this video, which should get you up to speed on fine-tuning in general. Whether you’re using a local RTX card or training in the cloud, you can now fine-tune models in a fraction of the time. Drop a comment with your fine-tuning results, and don’t forget to check out our Discord community for more tips and tricks. Thanks so much for watching, goodbye.