tl;dr: You can run small LLMs locally on your consumer PC, and with ollama that’s very easy to set up. It is fun to chat with an LLM locally, but it gets really interesting when you build RAG-systems or agents with your local LLM, because there is great synergy. I show you an example of a RAG-System built with ollama and llama-index.

Running small LLMs locally with quantization

Large language models are large, mind-bogglingly large. Even if we had the source code and the weights of ChatGPT’s GPT-4o model, with its estimated 1760b parameters - that is b for billion - it would be about 3 TB in size if every parameter is stored as a 16 bit float. Difficult to fit into your RAM!

<rant> We could use proper SI notation, ‘1800G’ or ‘1.8T’ instead of ‘1800b’, since ‘billion’ means different things in different languages, but here we are 😞. </rant>

But never mind, we don’t have the code and weights anyway. So what about open source models? While the flagships are still too large, there is a vibrant community on the HuggingFace platform that makes and improves models that have only 8b to 30b parameters, and those models are not useless. Meta has recently released a language model llama-3.2 with only 3b parameters. While you cannot expect the same detailed knowledge about the world and attention span as from the flagship models, these models still produce coherent text and you can have decent short conversations with them. I recommend using at least an 8b model, because the smaller models likely won’t follow your prompt very well.

An 8b model is 200 times smaller than GPT-4o, but still has a size of about 15 GB. It fits into your CPU RAM, but you want it to fit onto your GPU. If it does not fit completely onto the GPU, a part of the calculation has to be done with the CPU, and that will slow down the generation dramatically. Memory transfer speed is the bottleneck.

Fortunately, one can quantize the parameters quite strongly without losing much. It turns out one can go down to 4 or 5 bits per parameter at a cost of only about one percent in benchmarks compared to the original model (ref1, ref2, ref3). This finally brings these models down to a size that fits onto consumer GPUs. You need some extra memory for the code and context window as well.
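The size estimates above follow from simple arithmetic: number of parameters times bits per parameter. Here is a minimal sketch of that calculation. The helper name `model_size_gb` is made up for illustration, and it only counts the weights, not the extra memory for code and context window.

```python
def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Parameter counts as quoted in the text; the GPT-4o number is only an estimate.
for name, n_params, bits in [
    ("1760b model, 16 bit", 1760e9, 16),
    ("8b model, 16 bit", 8e9, 16),
    ("8b model, 5 bit", 8e9, 5),
    ("8b model, 4 bit", 8e9, 4),
]:
    print(f"{name}: {model_size_gb(n_params, bits):.0f} GB")
```

With decimal gigabytes, an 8b model at 16 bit comes out at 16 GB (about 15 GiB), and the same model quantized to 4 or 5 bits needs only 4 to 5 GB for its weights.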

If you are interested in this sort of thing and plan to buy a GPU soon, take one with at least 16 GB of VRAM. GPU speed does not really matter.

There are a couple of libraries which allow you to run these quantized models, but the best one is Ollama in my experience. Ollama is really easy to install and use. It successfully hides a lot of the complexity from you, and gives you an easy start into the world of running local LLMs.

I had a lot of fun trying out different models. There are leaderboards (Open LLM Leaderboard and Chatbot Arena) which help to select good candidates. I noticed large differences in perceived quality among models with the same size. Generally, I recommend finetuned versions of the llama-3.1:8b and gemma2:9b models by the community. If you want to skip over that, then try out mannix/gemma2-9b-simpo.

Great, I have a local LLM running, now what?

Having an LLM running locally is nice and all, but for programming and asking questions about the world, the free tiers of ChatGPT and Claude are better. The really interesting use case for local LLMs is to chat with your documents using retrieval augmented generation (RAG).

There is great synergy in running a RAG-System with a local LLM.

- You can keep your local documents private. Nothing will ever be transferred to the cloud.
- No additional costs. If you want to use the API of ChatGPT or Claude, you have to pay eventually. That’s especially annoying while you are still developing, when you will run the LLMs over and over to test your application.
- Local LLMs lack detailed world knowledge, but the RAG-System complements that lack of knowledge. Without RAG, local LLMs hallucinate a lot, but with RAG they will provide factual knowledge.

A general advantage of RAG is that you can look into the text pieces that the LLM used to formulate its answer, which turns the LLM from a black box into a (nearly) white box.

Building a simple RAG System with llama-index

For a RAG system, you need to convert your documents into plain text or Markdown, and you need an index to pull up relevant pieces from this corpus according to your query. There is currently a gold rush around developing converters for all kinds of documents into LLM-readable text, especially when it comes to PDFs. People try to make you pay for this service. For PDFs, a free alternative that runs locally is pymupdf4llm. If your documents contain images, you can also run a multimodal LLM like llama-3.2-vision to make text descriptions for these images automatically.

Once you have your documents in plain text, you can split them into bite-sized pieces (bite-sized for your LLM, so that multiple pieces fit into its small context window) and use an embedding model to compute semantic vectors for each piece. These vectors magically encode the semantic meaning of text, and can be used to find pieces that are relevant to a query using cosine similarity - that’s essentially a dot product of the normalized vectors. It is hard to imagine that this works, but it actually does (more or less). Search via embeddings can be superior to keyword search, but in my experience it is not a silver bullet. The best RAG-Systems combine searches via keywords with embeddings in some way. Using a good embedding model is key. For example, a model trained solely on English text won’t perform well on German text, and the same is true if your documents contain lots of technical language that the embedding model was not trained on.

Thankfully, Ollama also offers embedding models, so you can run these locally as well. I found that mxbai-embed-large works well for both English and German text.
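To make the search via cosine similarity described above more concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented toy values, since real embedding models return vectors with hundreds of dimensions, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (score, index) pairs of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return scored[:k]


# Toy "embeddings", made up for illustration only
chunks = ["chunk about statue materials", "chunk about gardens", "chunk about bronze casting"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]

for score, i in top_k(query_vec, chunk_vecs):
    print(f"{score:.2f}  {chunks[i]}")
```

The query vector points in nearly the same direction as the first and third chunk vectors, so those two are retrieved and the unrelated second chunk is not.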

Writing a RAG from scratch with Ollama is not too hard, but it usually pays off to use a well-designed library to do the grunt work, and then start to improve from there. I compared many libraries, and can confidently recommend llama-index as the best one by far. It is feature-rich and well designed: little boilerplate code for simple things, yet easy to extend. The workflow system especially is really well designed. Only their (good) documentation is annoyingly difficult to find, because they try to push you to their paid cloud services (did I mention, there is a gold rush…). I review some other libraries in the appendix to this post.

Below, I show you a RAG demo system. I pull in Wikipedia pages about the Seven Wonders of the Ancient World and then ask some questions about the Colossus of Rhodes and the Hanging Gardens. As I am German, I wanted to see how well this works with German queries on German documents. That is not trivial, because both the LLM and the embedding model then have to understand German. I compare the results with and without RAG. Without RAG, the model will hallucinate details. With RAG, it follows the facts in the source documents closely. It is really impressive.

To run this, you need to install a couple of Python packages:

ollama
llama-index
llama-index-llms-ollama
llama-index-embeddings-ollama
llama-index-readers-wikipedia
wikipedia
mistune
ipython

Mistune renders Markdown to HTML.

from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings, VectorStoreIndex
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.node_parser import SentenceSplitter
import textwrap
import mistune
from IPython.display import display_html


def wrap(s):
    return "\n".join(textwrap.wrap(s, replace_whitespace=False))


# logging.basicConfig(stream=sys.stdout, level=logging.INFO)

Settings.embed_model = OllamaEmbedding(model_name="mxbai-embed-large")

Settings.llm = Ollama(model="mannix/gemma2-9b-simpo", request_timeout=1000)

Settings.text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=128)

# Load data from the German Wikipedia
documents = WikipediaReader().load_data(
    pages=[
        "Zeus-Statue des Phidias",
        "Tempel der Artemis in Ephesos",
        "Pyramiden von Gizeh",
        "Pharos von Alexandria",
        "Mausoleum von Halikarnassos",
        "Koloss von Rhodos",
        "Hängende Gärten der Semiramis",
    ], lang_prefix="de"
)

# Lots of stuff is happening here. This splits the seven pages into chunks of text
# and computes an embedding vector for each chunk. In the end we have 76 chunks of text
# that the LLM can use. We don't need to pass the embedding model or the text splitter
# explicitly, they are pulled from the Settings object.
index = VectorStoreIndex.from_documents(documents, show_progress=True)

Parsing nodes: 100%|██████████| 7/7 [00:00<00:00, 143.06it/s]
Generating embeddings: 100%|██████████| 76/76 [00:07<00:00,  9.78it/s]
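The chunk_size and chunk_overlap settings passed to SentenceSplitter above control how the pages are cut into pieces. The following is a simplified sketch of splitting with overlap. It counts plain list items ("words") and ignores sentence boundaries, whereas SentenceSplitter counts tokens and tries to keep sentences intact, so this is not how llama-index implements it. The function name `split_with_overlap` is made up.

```python
def split_with_overlap(words, chunk_size, chunk_overlap):
    """Cut a list of words into chunks that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=2):
    print(" ".join(chunk))
```

The overlap makes sure that a statement which straddles a chunk boundary still appears complete in at least one chunk, at the price of storing some text twice.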

# Some questions for the LLM about facts regarding two of the seven wonders
question = (
    "Aus welchen Materialien wurde der Koloss von Rhodos konstruiert?",
    "Beschreibe die Pose des Koloss von Rhodos.",
    "War der Koloss von Rhodos als nackte oder bekleidete Figur dargestellt?",
    "In welcher Stadt befanden sich die Hängenden Gärten?"
)

# A lot of stuff is happening here behind the scenes: a query engine is constructed
# from the document index. The query engine computes an embedding for the query, and
# selects 10 text pieces that are most similar to the query. It then prompts the 
# LLM with our question and provides the text pieces as context.
engine = index.as_query_engine(similarity_top_k=10)

show_sources = False

# Now we ask our questions. Set show_sources=True to see which text pieces were used.
# For reference, we compare the RAG answer ("RAG") with a plain LLM query ("Ohne RAG").
# If you don't speak German, no problem, I discuss the results further below in English.
for q in question:
    q2 = q + " Antworte detailliert auf Deutsch."
    
    response = Settings.llm.complete(q2)
    rag = engine.query(q2)

    s = f"# {q}\n\n## Ohne RAG\n\n{wrap(response.text)}\n\n## RAG\n\n{wrap(rag.response)}"

    if show_sources:
        s += "\n\n## Sources\n\n"
        for node in rag.source_nodes:
            s += f"### Score {node.score}\n{wrap(node.text)}\n\n"

    s = mistune.html(s)
    display_html(s, raw=True)

Aus welchen Materialien wurde der Koloss von Rhodos konstruiert?

Ohne RAG

Der Koloss von Rhodos, eines der sieben Weltwunder der Antike, wurde hauptsächlich aus Stein und Bronze errichtet. Die genaue Zusammensetzung und Bauweise sind jedoch aufgrund fehlender erhaltener Baupläne und archäologischer Funde nicht vollständig rekonstruierbar. Historiker und Archäologen gehen jedoch basierend auf damaligen Bautechniken und schriftlichen Quellen wie folgt vor:

Fundament und Sockel: Das gewaltige Fundament des Kolosses wurde aus massivem regionalem Kalkstein gebaut, der für seine Widerstandsfähigkeit bekannt war. Dieser Stein lieferte die stabile Basis auf der die Bronzestatur stand.
Skulptur selbst: Der eigentliche Koloss bestand aus Bronze, einem Material, das in der damaligen Zeit für große Skulpturen bevorzugt wurde. Die Bronze wurde wahrscheinlich durch Gussverfahren in mehreren Teilen hergestellt, die anschließend zusammengefügt wurden. Vermutlich wurden komplexe Riegel und Gelenke aus Bronze verwendet, um die beweglichen Körperteile des Kolosses zu ermöglichen und Stabilität zu gewährleisten.
Innenausbau: Es wird vermutet, dass der innere Aufbau des Kolosses aus Holz und weiteren Baumaterialien bestand, um Gewicht zu sparen und gleichzeitig eine stabile Struktur zu schaffen. Möglicherweise enthielt er auch Verstärkungen aus Eisen an entscheidenden Stellen.
Dekoration: Details wie Gesichtszüge, Kleidung und Waffen des Kolosses wurden höchstwahrscheinlich aus poliert und vergoldeter Bronze geschaffen, um einen beeindruckenden Glanz zu erzeugen. Zusammenfassend lässt sich sagen: Der Koloss von Rhodos war eine Meisterleistung aus Stein und Bronze, die durch innovative Bautechniken und kunstvolle Verarbeitungstechniken zum Leben erweckt wurde. Obwohl uns genaue Details fehlen, zeichnet sich das Bild eines monumentalen Werkes aus, das die technischen und künstlerischen Fähigkeiten der Antike exemplarisch demonstriert.

RAG

Die kolossale Statue des Helios bestand aus Bronze.

Beschreibe die Pose des Koloss von Rhodos.

Ohne RAG

Der Koloss von Rhodos, eines der Sieben Weltwunder der Antike, war nicht in einer typischen, statischen Pose dargestellt, sondern in einer dynamischen und majestätischen Haltung, die seine Größe und Macht unterstrich.

Detaillierte Beschreibung:

Stehende Figur: Der Koloss stand aufrecht und blickte über die Hafenstadt Rhodos. Er war somit keine sitzende oder kniende Figur, sondern eine beeindruckende, erhebende Präsenz.
Gesten und Haltung: Während genaue Darstellungen der Hände fehlen, wird angenommen, dass er mit ausgestreckten Armen stand, die möglicherweise leicht nach vorne geneigt waren, um eine offene und gastfreundliche Geste zu vermitteln, passend zu Rhodos als bedeutender Handelshafen. Sein Körper zeigte eine natürliche, leicht nach vorne geneigte Haltung, welche Dynamik und Bewegung suggerierte, anstatt starres Stehen.
Gesichtsausdruck: Die Darstellung des Gesichts war wahrscheinlich seren und majestätisch, vielleicht mit einem Hauch von stolzem Überlegenheit, passend zu seiner Funktion als Symbol der Stadt und ihrer Stärke. Es fehlte vermutlich an einem aggressiven oder kämpferischen Ausdruck.
Gewandtheit: Obwohl aus Bronze gefertigt, sollte der Koloss trotz seiner Größe elegant und geschmeidig wirken, nicht schwerfällig oder statisch. Die Proportionen und die Modellierung des Körpers sollten Bewegung und Leichtigkeit suggerieren, ähnlich einem modernen Skulpturenkonzept der "dynamischen Haltung".
Blickrichtung: Der Blick des Kolosses war vermutlich über den Hafen gerichtet, symbolisch die Stadt und ihre Schifffahrt beschützend und gleichzeitig die Ankunft von Handelsgütern und Reisenden willkommen heißen.

Zusammenfassend lässt sich sagen, dass die Pose des Kolosses von Rhodos eine bewusste Kombination aus Kraft, Offenheit und majestätischer Würde darstellte, welche die Werte und den Status der Stadt Rhodos perfekt widerspiegelte.

RAG

Obwohl keine direkte, antike Darstellung existiert, wird vermutet, der Koloss stand in einer majestätischen, aufrechten Haltung als nackter junger Mann mit langem, lockigem Haar und einem Strahlenkranz. Details über eine besonders dynamische oder ausgeprägte Pose sind jedoch unklar.

War der Koloss von Rhodos als nackte oder bekleidete Figur dargestellt?

Ohne RAG

Der Koloss von Rhodos wurde traditionell als bekleidete Figur dargestellt. Obwohl genaue Darstellungen und Details aufgrund des Untergangs der Statue verloren gingen, liefern uns antike Quellen und Interpretationen Hinweise auf ihre Kleidung:

Heiligtumsfunktion: Der Koloss stand als Tribut an den Gott Helios im Heiligtum auf der Insel Rhodos. Götterdarstellungen in dieser Zeit waren oft mit Kleidung geschmückt, die ihre Macht, Stellung und den Kontext ihrer Verehrung symbolisierten. Es ist daher logisch anzunehmen, dass auch der Koloss entsprechend bekleidet war.
Antike Beschreibungen: Plinius der Ältere, ein römischer Geschichtsschreiber und Naturforscher, beschreibt den Koloss als eine "herrliche Gestalt" mit "kleidernder Pracht". Obwohl nicht spezifisch, deutet dies auf Kleidung hin.
Stilistische Parallelen: Vergleichbare gigantische Figuren aus der griechischen Kunst, wie z.B. der Zeus von Olympia, waren oft bekleidet, um ihre majestätische Erscheinung zu unterstreichen. Der Koloss würde in diesem Kontext wahrscheinlich einem ähnlichen Stil folgen.

Es gibt keine zuverlässigen Quellen, die einen nackten Koloss suggerieren. Die Annahme einer bekleideten Figur basierend auf historischem Kontext und künstlerischen Praktiken der Zeit erscheint daher am wahrscheinlichsten.

RAG

Basierend auf Annahmen und Vergleichsmöglichkeiten mit Kunstwerken der damaligen Zeit, wird vermutet, dass die Figur wahrscheinlich als nackter junger Mann dargestellt wurde.

In welcher Stadt befanden sich die Hängenden Gärten?

Ohne RAG

Die legendären Hängenden Gärten wurden im antiken Babylon vermutet, der Hauptstadt des babylonischen Reiches in Mesopotamien (heute im Irak).

Obwohl ihre Existenz historisch nicht zweifelsfrei belegt ist und ihre genaue Lage weiterhin diskutiert wird, stützt sich die traditionelle Zuschreibung auf:

Griechische Quellen: Geschichtsschreiber wie Diodor von Sizilien und Strabo aus dem 1. Jahrhundert v. Chr. beschrieben prächtige Gartenanlagen in Babylon, die von künstlichen Terrassen mit Bewässerungssystemen hoch in die Luft ragten. Diese Beschreibungen passen zum Bild der "Hängenden Gärten".
Archäologische Hinweise: Während definitive Beweise für die Gärten selbst noch fehlen, fanden Archäologen in Babylon Überreste komplexer Bewässerungssysteme und künstlicher Terrassen, die die technische Grundlage für solche monumentalen Anlagen unterstützten.
Babylonische Kultur: Das babylonische Reich war bekannt für seine beeindruckende Architektur, Ingenieurskunst und Liebe zur Gartenkultur. Es erscheint somit plausibel, dass eine solche prachtvolle Konstruktion in dieser Stadt entstanden sein könnte. Obwohl die genaue geographische Position innerhalb Bablons noch nicht vollständig geklärt ist, wird die Nähe zum Euphratfluss und die städtische Umgebung als wahrscheinlichster Standort angesehen, der den Bedürfnissen der komplexen Bewässerungssysteme und der damaligen Stadtstruktur entsprach.

RAG

Die genaue Stadt, in der sich die Hängenden Gärten befanden, ist in den historischen Aufzeichnungen nicht eindeutig identifiziert.

Discussion

Both the embedding model and the LLM handle German without issues. The answers without RAG are much nicer to read, but contain hallucinations, while the RAG answers are dull, brief, but factually correct. The behavior of the LLM without RAG is a consequence of human preference optimization. The LLM generates answers by default that look nice to humans.

The RAG answer is very short, because the internal prompt of llama-index asks the LLM to only use information provided by the RAG system and not use its internal knowledge. It is therefore not a bug but a feature: the LLM faithfully tries to only make statements that are covered by the text pieces. The LLM is not confused by irregularities in the text snippets that the reader did not filter out, like Wiki-Markup.
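The kind of internal prompt described above can be pictured roughly as follows. This is only a sketch: the wording of the template is invented for illustration and not copied from llama-index, and `build_rag_prompt` is a hypothetical helper.

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved text pieces and the question into a single prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the question using only the context above, "
        "not prior knowledge.\n"
        f"Question: {question}\n"
        "Answer: "
    )


prompt = build_rag_prompt(
    "Which materials was the Colossus of Rhodes made of?",
    ["The colossal statue of Helios consisted of bronze."],
)
print(prompt)
```

Because the instruction restricts the LLM to the supplied context, the answer can only be as complete as the retrieved text pieces.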

Question is about the materials used to construct the Rhodes statue.

The standard LLM claims that wood was used in the construction of the Rhodes statue, but there are no records on Wikipedia about that. The RAG answer is factually correct, it mentions bronze. The Wikipedia article also mentions other materials, but the LLM seemed to miss those here. In earlier tries I got the system to list all four materials mentioned in the Wikipedia article by varying the question, and perhaps simply because of the random seed used.

Question is about the pose of the Rhodes statue.

The standard LLM gives a lot of hallucinated detail. We don’t know much about the pose, and the short RAG answer summarises that.

Question is about whether the statue was clothed or naked.

The standard LLM says it was clothed, probably because a lot of the ancient statues were clothed, but the RAG answer is correct: the statue was probably naked according to Wikipedia. We can see here that RAG indeed overrides what the LLM would normally say.

Question is about the city in which the Hanging Gardens were supposed to be located.

The standard LLM gives the correct answer in this case, Babylon. The RAG answer speaks about the location relative to the palace, but does not mention the city. This is the only case where the RAG answer is worse, although not factually incorrect. The failure in this case is related to the index, which fails to retrieve the right text piece with the information.

Conclusions

RAG works very well even with small local LLMs. The caveats of small LLMs (lack of world knowledge) are compensated by RAG. The RAG answers are faithful to the sources in our example and contain no hallucinations. The use of local LLMs allows us to avoid additional costs and keeps our documents private.

The main challenge in setting up a RAG is the index. Finding all relevant pieces of information, without adding too many irrelevant pieces, is a hard problem. There are multiple ways to refine the basic RAG formula:

- Getting more relevant pieces by augmenting the source documents with metadata like tags or LLM-generated summaries for larger sections, and cross-references to other snippets.
- Smarter text segmentation based on semantic similarity or logical document structure.
- Postprocessing the retrieved documents by letting an LLM rerank them according to their relevance for the query.
- Asking the LLM to critique its answer, and then to refine it based on the critique.
- Generating multiple responses from the LLM and then letting the LLM summarize them.
- … and many other ways, it is an open and active field.

Have a look into the llama-index documentation for more advanced RAG workflows.
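As one concrete example of such a refinement, the combination of keyword search and embedding search mentioned earlier can be done by merging the two result lists. The sketch below uses reciprocal rank fusion, which is just one common way to do this and not something the demo above uses. The chunk ids are invented, and k=60 is a frequently used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; ids ranked high in many lists win."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; ties are broken by id to keep the result stable
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results of a keyword search and an embedding search
keyword_ranking = ["c3", "c1", "c7"]
embedding_ranking = ["c1", "c5", "c3"]

print(reciprocal_rank_fusion([keyword_ranking, embedding_ranking]))
```

Chunks that appear in both lists (here c1 and c3) end up ahead of chunks found by only one of the two searches.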

Appendix: RAG-libraries that I explored

There are numerous libraries available for RAG, but many have significant drawbacks from the point of view of my requirements:

- I don’t want to send documents to cloud services (data privacy).
- I don’t want to pay fees for cloud services.
- I want to use a well-designed library that is easy to use and extend.
- I want many ready-made components, so that I get good value for my time investment.

Candidates Reviewed

The GitHub stars indicate popularity.

- LangChain: 96k stars
- LlamaIndex: 37k stars
- Autogen: 35k stars
- Haystack: 18k stars
- Txtai: 9.6k stars
- AutoChain: 1.8k stars

Problems Identified

Push to use cloud services
- All these libraries are open source, but most of them try to push you towards using paid cloud services to get essential functionalities; and in that case, data privacy cannot be guaranteed
- Examples of paid cloud-based services include:
  - Document databases
  - PDF converters
  - Web search providers
Dependencies and Installation
- Too many dependencies
- Difficult installation (e.g., requires Docker, incompatible libraries, etc.)
Design Flaws
- Poor and/or bloated design
- Volatile APIs
- Bad cost/benefit ratio compared to custom-written software
Excluded Libraries
- Autogen: No focus on RAG functionality
- AutoChain: Project seems to be dead, codebase has not been maintained for a year

Candidate shortlist

Out of the initial contenders, only Haystack and LlamaIndex met my requirements. I installed both and tried out examples. Both are easy to install and have moderate dependencies. Both were designed for RAG, but also support agentic workflows. Both have good documentation.

Haystack

Pros
- API inspired by functional design principles, leading to clear information flow
- Claimed to be used by Netflix, Nvidia, Apple, Airbus, etc.
Cons
- Excessive boilerplate code
- Inconvenient to extend
- Limited functionality

LlamaIndex

Pros
- Strong community support (e.g., llamahub.ai) offering components
- Minimal boilerplate code
- Elegant design with good defaults, see, for example, the Workflow class
- Many subpackages with specific functionality, so you only install what you really need
Cons
- Documentation is not easy to find when you land on their webpage, because they try to push you to use their cloud services
- Information flow in the API is not always easy to follow, because configuration is done via a global Settings object and not passed call-by-call. That is the downside of a design with minimal boilerplate code.