
Running A Local LLM With Your Own Documents

I’m excited to share the second installment of a three-part series that I originally published on my LinkedIn profile. Since this blog also explores a variety of tech topics relevant to anyone – like me – who needs to deliver IT projects, I wanted to bring these insights to a broader audience beyond my LinkedIn network. I hope you enjoy this second chapter, and if you do, stay tuned for the next one!


Greetings! My previous post didn’t generate the interest I was hoping for… To be honest, that was somewhat expected, which is why I’m trying a different approach this time and focusing on the topic that’s on everyone’s mind: AI. More precisely, I’ll explain how to run large language models (LLMs) locally and even integrate your own knowledge base into their responses. Before jumping into recommendations and step-by-step instructions, let’s start with some context.

Why Run an LLM Locally?

First of all, privacy. You shouldn’t share personal or confidential information with AI services like ChatGPT. In the first place because you have no visibility into what a private company and its employees might do with your information, but also because governments can access it as well (the recent court order on ChatGPT data retention proves my point), and you have just as little visibility into what a government would do with the information you freely give away.

By running your own AI platform locally, you keep your data private while still enjoying all the benefits of AI. Also, your LLM will always be available, even without an internet connection, and you will not have to deal with subscription costs (if you use an open-source LLM). Still not convinced? Then let me tell you about the most compelling advantage and the actual focus of this post: you can extend the LLM’s capabilities and turn your personal knowledge base into part of the AI’s reasoning process.

My Setup for Local LLMs

In my previous post, I highlighted that experimenting with AI locally is a fantastic use case for a home server – and it truly is. However, for my own experiments with AI, I prefer to use my desktop PC. My home server already handles other tasks, and my desktop is the most powerful machine in my network: it also features a robust NVIDIA GPU, making it the obvious choice for AI workloads. Here’s my first recommendation: if possible, use a computer with a strong CPU and a dedicated GPU, as this will dramatically improve performance.

A few additional notes on my setup:

  • Operating System: I run my local LLM on Debian 12. While the setup process may be similar on Windows, this guide is especially useful for Linux users.
  • LLM Platform: There are many options for running LLMs locally (PrivateGPT, GPT4All, LM Studio, etc.) but I personally chose Ollama.
  • LLM Model: There are also plenty of model options. Choose the one that best suits your needs and your computer’s capabilities. Personally, I opted for Qwen3 with 32 billion parameters. It actually “thinks” before responding and offers solid performance on my hardware.

Getting Started

Because I wanted to leverage my GPU, my first step was to install the official NVIDIA drivers – a process that can be tricky for many Linux users. While there’s no universal solution, this is what worked for me (you’ll find a sketch of the corresponding commands right after the list):

  1. Downloaded the official NVIDIA .run installer for my GPU and Linux version
  2. Switched to the console and stopped the display manager (in my case, lightdm)
  3. Blacklisted/disabled the Nouveau driver and rebooted
  4. Made the NVIDIA .run installer executable and ran it as root, allowing it to build the kernel module; completed the on-screen prompts and rebooted again
  5. Checked the drivers with nvidia-smi – luckily everything looked perfect
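As promised, here is roughly what those steps looked like on my machine. Treat it as an illustration rather than a copy-paste recipe: the display manager, the blacklist file and especially the installer file name (a placeholder below) will differ on your system.

# from a text console, stop the display manager (lightdm in my case)
sudo systemctl stop lightdm

# blacklist the Nouveau driver, rebuild the initramfs and reboot
printf "blacklist nouveau\noptions nouveau modeset=0\n" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo reboot

# make the installer executable and run it as root (file name is a placeholder)
chmod +x NVIDIA-Linux-x86_64-<version>.run
sudo ./NVIDIA-Linux-x86_64-<version>.run

# after the final reboot, verify the driver
nvidia-smi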

With the NVIDIA drivers installed, the next step was to set up an LLM platform. As already mentioned, I chose Ollama. To install it, Docker was an option (one I might have considered on my home server), but I had no reason to use it on my desktop PC, which has a drive entirely dedicated to AI experiments. And since I already had curl installed on my Debian system, all I actually needed to do was run the official Ollama installation script:

curl -fsSL https://ollama.com/install.sh | sh
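For completeness: had I gone the container route instead, the command from the Ollama documentation looks roughly like this (it assumes the NVIDIA Container Toolkit is already configured so the container can access the GPU):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama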

At this point, my setup was almost complete, but to take advantage of my GPU, I also had to install the CUDA toolkit. For a detailed list of commands to install Ollama and CUDA, you can refer to the documentation. Once both were set up, I downloaded Qwen3 by running:

ollama run qwen3:32b
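Before moving on, it’s worth doing a quick sanity check to confirm that the model is available and that Ollama is actually using the GPU. These are standard Ollama and NVIDIA commands; the exact output obviously depends on your hardware:

ollama list     # the downloaded model should appear among the local ones
ollama ps       # with a chat open, shows whether the model is loaded on the GPU or the CPU
nvidia-smi      # VRAM usage should jump once the model is loaded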

These steps are sufficient to get your LLM up and running, as you can interact with it from the command line, but I understand this isn’t ideal for everyone. This is why I also recommend installing Open WebUI, which will be helpful for future steps as well. You can run Open WebUI either in a container or directly on your system: again, I chose the latter and followed the commands in the documentation.
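For reference, the “directly on the system” route documented by Open WebUI boils down to installing a Python package. At the time of writing it expects Python 3.11, so double-check the documentation if the commands below have changed:

pip install open-webui
open-webui serve    # the interface is then reachable at http://localhost:8080 by default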

Adding Your Own Data: A Short Digression

Before I explain how to integrate your own data, let’s take a short detour. I have both good and bad news for you all.

Let’s start with the bad news: as an individual with consumer hardware, you can’t realistically train (or fully retrain) an LLM from scratch by yourself. Most of us lack the computational power, the massive datasets and the expertise required for such a task. And even if you could, without careful data curation you would risk degrading the original model’s capabilities. You might have better luck fine-tuning the original LLM, but even that requires familiarity with libraries like PyTorch or TensorFlow, significant GPU resources and a considerable investment of time – not only to wait for the fine-tuning to complete, but also to perform tedious tasks like formatting input-output pairs. While I’m no AI expert, I can safely say that for most of us the effort probably isn’t worth it.

Now, the good news. There’s a much better option for our use case: Retrieval-Augmented Generation (RAG). With RAG, you don’t need to change the underlying model. Instead, you provide external data for the LLM to reference at inference time. As a proud documentation rat who has built a personal, multidisciplinary knowledge base in various Obsidian vaults over the years, this approach feels like a godsend. Rather than manually searching through my notes, I can simply ask the LLM to consider my .md documents when generating responses. Again: all while keeping everything local and never having to share my personal documents with ChatGPT or similar services.

[Image: an example – the graph view of my personal AWS documentation, all in Markdown documents]
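To make the idea concrete: in its most stripped-down form, RAG is nothing more than “fetch some relevant text and put it into the prompt”. Here is a minimal sketch against Ollama’s local REST API – the /api/generate endpoint and its payload come from the Ollama API documentation, while the note path and the question are made-up examples:

# fetch one of my Markdown notes and inject it into the prompt (hypothetical path)
NOTE=$(cat ~/vaults/aws/ec2-pricing.md)

# build the JSON payload safely with jq and send it to the local Ollama server
jq -n --arg model "qwen3:32b" \
      --arg prompt "Context from my notes: $NOTE Question: how can I reduce my EC2 costs?" \
      '{model: $model, prompt: $prompt, stream: false}' \
  | curl -s http://localhost:11434/api/generate --data-binary @-

A real RAG pipeline – and Open WebUI’s Knowledge feature – handles the interesting part this sketch skips: splitting documents into chunks, embedding them and retrieving only the passages relevant to each question, instead of stuffing an entire note into the prompt.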

RAG Your Local LLM!

Let’s start by opening Open WebUI and asking Qwen3 a very simple question:

Do you know Manfredi Pomar? If yes, please give me a brief summary of his profile

The assumptions the model made during the thinking phase are all perfectly valid:

The name might refer to a private individual, a fictional character, or a misspelling of another name.

The result isn’t surprising – give me time to become Chancellor and things might change…! 😄

[Image: LLM without RAG]

One of the advantages of using Open WebUI is its built-in support for configuring RAG. Here’s how you can set it up:

  1. Navigate to the sidebar and select Workspace
  2. Go to the Knowledge tab and create a new Knowledge section
  3. Upload relevant documents to the newly created Knowledge section (make sure your documents are in a format that allows the text to be parsed)
  4. Navigate to the admin settings
  5. Edit your chosen LLM model to include the newly created Knowledge section

To demonstrate the validity of this approach, I uploaded my LinkedIn profile, CV, and GitHub profile page to a Knowledge section. When I open a new chat and ask the exact same question as before, I receive a completely different response. It contains a few minor inaccuracies, but overall it looks great.

[Image: LLM with RAG]

This works because the LLM now also takes into account the documents I uploaded locally, as demonstrated by the fact that these documents are also referenced in the response.

Final Words

RAG is a solid approach for incorporating personal documents into AI responses while running an LLM locally, and Open WebUI makes configuring it surprisingly easy. That said, I have the impression that using a front-end can alter or limit the model’s capabilities (after all, Open WebUI comes with its own settings). This is why I personally prefer the “raw” command-line interaction with LLMs.

Of course, you can set up RAG without Open WebUI (for example, by parsing documents with tools like docling), but I wanted to keep things simple for demonstration purposes. That said, I hope this post has inspired you to start using a local LLM and easily integrate your own documents. Let me know your thoughts in the comments!
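P.S. If you’re curious about the docling route, my understanding of its CLI is that converting a document into Markdown is roughly a two-liner – I haven’t tested this thoroughly, so verify the commands against the docling documentation before relying on them:

pip install docling
docling some-report.pdf    # hypothetical file; should produce a Markdown version you can upload to a Knowledge section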

Manfredi Pomar

Italian-German cloud computing professional with a strong background in project management & several years of international work experience in IT & business consulting. His expertise lies in bridging the gap between business stakeholders & developers, ensuring seamless project delivery.
