
Fine-Tuning Local LLMs Efficiently

I’m excited to share the third and final installment of the series I originally published on my LinkedIn profile. Since this blog also explores a variety of tech topics relevant to anyone – like me – who needs to deliver IT projects, I wanted to bring these insights to a broader audience beyond my LinkedIn network!


Greetings! In my previous post I demonstrated that RAG is an effective method for retrieving information from external documents when using a local LLM. For most personal use cases this approach is more than sufficient: it requires minimal effort and delivers fair results. That said, there are cases where fine-tuning your local LLM is the better option.

Fine-Tuning vs RAG

Imagine you’re preparing for an exam. If you study the material thoroughly, you’ll be able to answer questions with confidence because you’ve specifically prepared for them. This is what happens with AI when the model is fine-tuned: it’s trained on specialized data, which increases the likelihood that it will provide accurate and relevant answers. At the end of the day, AI solutions are just probabilistic models (this is the simple truth, once we strip away the hype and mumbo jumbo) and they can make mistakes when facing unfamiliar situations or ambiguous data. Training the models is aimed precisely at lowering this risk.

With RAG, instead of training the model further, we simply make documents available for the model to reference as needed. Back to the previous analogy: imagine taking an exam you haven’t studied for, but being allowed to use a cheat sheet. You might answer some questions correctly by looking up information, but your answers may be less precise and you would be less confident about their quality. This is why RAG is sufficient for many personal use cases but may fall short when high accuracy and confidence are critical.

The issue with LLMs is that they do not possess any intrinsic understanding of their subject matter. They simply identify statistical patterns and make educated guesses (probabilistic predictions) based on the data and context provided to them. The more uncertainty a model faces, the less reliable its answers become. These limitations apply just as much to agentic AI, so don’t be surprised when you read that Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027. Keep this in mind the next time you hear that AI can operate reliably without human supervision.

Speeding Up Fine-Tuning

To get more accurate answers, you might need to fine-tune your local LLM. The challenge with fine-tuning is that it’s not as straightforward as using RAG. It requires significant time, considerable computational resources, and specialized domain knowledge. While we might find ways to overcome the latter two (I’ll share a relatively easy approach in this post), the first constraint – time – remains, particularly the time needed for data preparation. But here’s the twist: what if we leverage a powerful LLM to accelerate this step, and then use this generated, high-quality training data to fine-tune a smaller LLM for a specific task or domain? Let me explain how I proceeded.

If you’ve read my previous post, you know that I keep my personal documentation in Obsidian vaults, using Markdown files. This structure is perfect for working with AI, since the content is already logically organized and well-formatted. However, to use this data for fine-tuning, I first had to split these articles into chunks and, for each chunk, generate input-output pairs in JSON format. As I also mentioned in my previous post, I already have Qwen3 with 32 billion parameters running locally on my machine. While it may not be the most powerful open-source model available, I’m satisfied with the results it produces. To accelerate data preparation, I used Qwen3 to generate, entirely on my own machine, JSON-formatted input-output pairs for all the chunks of documentation whose knowledge I wanted to transfer to a smaller LLM. Of course, I supervised the process to ensure that Qwen3’s outputs were accurate. Although it still required some time, it was significantly faster than creating the dataset manually.
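To make this concrete, here is a minimal sketch of how such a generation loop could look, assuming Qwen3 is served locally through Ollama’s HTTP API. The vault path, chunk size, model tag and prompt wording are all illustrative, and in practice every generated block deserves a manual review pass:

import json
import pathlib
import re
import requests

# Illustrative paths and model tag; adjust to your own setup.
VAULT = pathlib.Path("~/Documents/ObsidianVault").expanduser()
OLLAMA_URL = "http://localhost:11434/api/generate"

PROMPT = """Read the following documentation excerpt and produce three
question-answer pairs as a JSON array of objects with the keys
"instruction", "input" and "output". Return only valid JSON.

Excerpt:
{chunk}"""

def chunks(text, size=2000):
    # Naive fixed-size chunking; splitting on Markdown headings would
    # respect the document structure better.
    for i in range(0, len(text), size):
        yield text[i:i + size]

pairs = []
for note in VAULT.glob("**/*.md"):
    for chunk in chunks(note.read_text(encoding="utf-8")):
        resp = requests.post(OLLAMA_URL, json={
            "model": "qwen3:32b",
            "prompt": PROMPT.format(chunk=chunk),
            "stream": False,
        }, timeout=600)
        raw = resp.json()["response"]
        # Qwen3 may wrap its reasoning in <think> tags; drop them before parsing.
        raw = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
        try:
            pairs.extend(json.loads(raw))
        except json.JSONDecodeError:
            print(f"Malformed output for a chunk of {note.name}, skipping")

pathlib.Path("generated_pairs.json").write_text(
    json.dumps(pairs, indent=2, ensure_ascii=False), encoding="utf-8")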

Crafting The Dataset

The JSON template I provided to Qwen3 for generating input-output pairs is quite simple and follows a common standard for fine-tuning pre-trained LLMs. It includes three elements:

  • Instruction: Specifies the task the model should perform. This helps frame the context of the request and guides the model’s behaviour.
  • Input: While not all instructions require an input, it often provides additional context. For example, if the instruction is to summarize text, the input would contain the text to be summarized. Alternatively, it could be the question in a Q&A pair.
  • Output: The desired response the model should generate based on the instruction. This is what you would ideally want the model to return. In the previous examples, it could be the summarized version of the input text or the answer to a Q&A pair.

Here’s how the template would look in JSON format:

[
  {
    "instruction": "INSERT THE INSTRUCTIONS FOR THE TASK THE MODEL HAS TO EXECUTE",
    "input": "INSERT THE INPUT (OPTIONAL)",
    "output": "INSERT THE DESIRED OUTPUT"
  }
]
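For instance, a filled-in Q&A pair might look like the following (the content is invented purely for illustration):

[
  {
    "instruction": "Answer the question using the backup procedure described in my documentation.",
    "input": "How often is the Obsidian vault backed up, and where?",
    "output": "The vault is backed up daily to an encrypted external drive, and a weekly copy is pushed to a private remote repository."
  }
]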

When it comes to fine-tuning, quality is more important than quantity but, unfortunately, quantity still matters… You should be prepared to create at least a few hundred input-output pairs, which can make the process tedious and time-consuming. That’s why I recommend offloading much of the heavy lifting (such as drafting and formatting these pairs) to a powerful LLM. This allows you to focus primarily on reviewing the generated JSON blocks and assembling the final dataset.
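Because reviewing hundreds of generated JSON blocks is itself repetitive, a small script can catch structural problems (missing keys, duplicates, empty outputs) before you invest time in content review. Here is a minimal sketch, assuming the generated files sit next to it with illustrative names:

import glob
import json

REQUIRED = {"instruction", "input", "output"}
dataset, seen = [], set()

for path in glob.glob("generated*.json"):
    with open(path, encoding="utf-8") as f:
        for pair in json.load(f):
            # Keep only well-formed, non-duplicate pairs; reviewing the
            # actual content remains a manual job.
            if REQUIRED.issubset(pair) and pair["output"].strip():
                key = (pair["instruction"], pair["input"])
                if key not in seen:
                    seen.add(key)
                    dataset.append(pair)

with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2, ensure_ascii=False)
print(f"{len(dataset)} pairs kept")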

Choosing The Right Target Model

At this point, with the JSON training file ready, I had to decide which model to fine-tune. After careful comparison, I opted for Llama 3.1, the version with “only” 8 billion parameters. As counterintuitive as it may sound, smaller LLMs, when fine-tuned properly, can be more efficient than their larger counterparts. While more parameters generally enhance a model’s ability to generalize across a wide range of tasks, a smaller model that is well-trained on high-quality, task-specific data can outperform a larger, more generic model. Also, smaller LLMs are significantly faster, and if you’re fine-tuning on consumer hardware with a single GPU, you won’t have many alternatives anyway.

For those wondering why I didn’t simply use a smaller version of Qwen3: as a “thinking” LLM that supports chain of thought, Qwen3 requires training data in a different format, and I wanted to avoid that extra complexity. As for why I didn’t choose the more recent Llama 3.2, that’s because it doesn’t offer an 8-billion-parameter version. In short, Llama 3.1 gave me the right balance between capabilities, parameter count, and compatibility with my dataset. For your own AI experiments, you can choose what fits your use case (and hardware) best.

Preparing The Setup

At this point, we’ve successfully addressed one of the three major constraints – time – but what about the other two: computational resources and required knowledge? If you want to fine-tune a local LLM on consumer hardware without getting overwhelmed by the technicalities, you should definitely try LoRA training with Unsloth.

LoRA stands for Low-Rank Adaptation. It’s a technique that keeps the original model weights frozen and adds a small set of new, trainable low-rank matrices that adapt the model to your data. Because only these new parameters are trained, the computational and memory requirements drop dramatically, making fine-tuning possible on standard consumer machines. Unsloth is a user-friendly library that streamlines the fine-tuning process with LoRA, making it accessible for non-experts.
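To see why this helps, a back-of-the-envelope calculation is useful: instead of updating a full weight matrix W, LoRA trains two thin matrices A and B of rank r whose product is added to the frozen W. With illustrative dimensions:

# One 4096x4096 projection matrix, roughly the size found in 8B-class models:
d, r = 4096, 16           # hidden dimension and LoRA rank
full = d * d              # ~16.8M frozen weights in the original matrix
lora = (d * r) + (r * d)  # ~131K trainable weights in the two low-rank matrices
print(f"Trainable fraction for this matrix: {lora / full:.2%}")  # ~0.78%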

Before you start, make sure to:

  1. Have a recent version of Python installed on your machine
  2. Have Pip installed as well
  3. Create a dedicated Python virtual environment: while this is generally optional, it becomes mandatory if you’re using an operating system like Debian 12, which comes with Python pre-installed (you won’t be able to use the system Python environment)
  4. Install PyTorch with CUDA support (optional, but recommended if you have a compatible GPU)
  5. Install Unsloth
  6. Install and launch Jupyter Notebook (a quick sanity check for steps 4 and 5 follows this list)
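Once everything is installed, a short script like the one below (run inside the activated virtual environment) confirms that PyTorch sees your GPU and that Unsloth imports cleanly:

# Run inside the activated virtual environment.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# If this import fails, revisit the Unsloth installation step.
from unsloth import FastLanguageModel
print("Unsloth imported successfully")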

Fine-Tuning with LoRA and Unsloth

With all prerequisites in place, you’re now ready to begin fine-tuning using your Jupyter Notebook. I recommend referring specifically to this one, as it’s a great starting point. While you’ll need to make a few adjustments (I’ll walk you through each section), you can also tweak certain parameters if you wish.

  1. Installation: Skip this if you’ve already completed the steps in the previous paragraph.
  2. Unsloth: Feel free to change the parameters, taking into account your hardware’s capabilities; if in doubt, keep the default values. Just make sure to set “model_name” to the LLM you’ve chosen for your experiment (I used “unsloth/llama-3.1-8b”).
  3. LoRA: My setup for LoRA was slightly different, but the default settings should work fine. To play around, you might consider increasing the rank parameter above 16, but for most computers, keeping it at 16 or 32 is probably the best compromise.
  4. Data Preparation: The prompt format I described earlier is the Alpaca prompt referenced here. If your JSON dataset follows that format, you’re almost ready. Just make sure to update the line where you load your dataset and replace the reference to “yahma/alpaca-cleaned” with the path to your own .json file.
  5. Training the Model: You may need to specify the “max_seq_length” parameter again (this was the case for me), but otherwise you can most likely use the code block as provided in the notebook. The condensed sketch after this list shows how the adjusted sections fit together.
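Put together, the adjusted notebook sections boil down to something like the following condensed sketch, based on Unsloth’s Alpaca example notebook. The hyperparameters are illustrative defaults, and “dataset.json” stands in for your own file:

import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 2048

# Load the base model in 4-bit to fit consumer GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach the LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Render each instruction/input/output triple into the Alpaca prompt format.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def to_text(examples):
    texts = [alpaca_prompt.format(ins, inp, out) + tokenizer.eos_token
             for ins, inp, out in zip(examples["instruction"],
                                      examples["input"],
                                      examples["output"])]
    return {"text": texts}

dataset = load_dataset("json", data_files="dataset.json", split="train")
dataset = dataset.map(to_text, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,  # specified again, as noted above
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()

# Save the trained LoRA adapters for later use.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")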

Once fine-tuning is complete, take some time to test your model on your dataset. If the results aren’t as good as you expected, the issue may lie in the quality or size of your dataset. In that case, consider expanding and refining your dataset, then repeat the process. At least with this workflow you now have an effective approach you can build on.
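For testing, a quick probe like the following, continuing from the sketch above and again adapted from the same notebook (with an invented question), lets you compare the model’s answers against your dataset:

# Switch Unsloth into its faster inference mode before generating.
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [alpaca_prompt.format(
        "Answer the question using the backup procedure described in my documentation.",
        "How often is the Obsidian vault backed up, and where?",
        "",  # leave the response empty for generation
    )],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])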

Final Thoughts

The approach I described is one of the most efficient ways to work around the three constraints that would otherwise make fine-tuning much less accessible. AI can significantly accelerate data preparation, while Unsloth and LoRA training simplify the process even for those who aren’t AI experts. Even with these shortcuts, however, fine-tuning clearly remains more complex and time-consuming than RAG. Here, then, are my final recommendations:

  • Use fine-tuning when you have knowledge that doesn’t require frequent updates, and if you can justify the effort of preparing a large, high-quality dataset (target at least 1000 input-output pairs). Without a sufficiently large dataset, the fine-tuned LLM may not perform as desired, and you would have only wasted your time.
  • Use RAG for knowledge that changes over time, such as product documentation. RAG allows you to easily update the underlying documents without the need to re-fine-tune the LLM after every change. It’s also the better choice for simpler use cases where the extra effort of fine-tuning would be more difficult to justify.

That’s all for today, I hope you found this helpful! I’d love to hear your thoughts in the comments and if you have any questions, don’t hesitate to ask. Cheers!

Manfredi Pomar

Italian-German cloud computing professional with a strong background in project management & several years of international work experience in IT & business consulting. His expertise lies in bridging the gap between business stakeholders & developers, ensuring seamless project delivery.
