Build a RAG App with a Local LLM in Python (2026)

Retrieval-Augmented Generation, or RAG, is how you get an AI model to answer questions using your own data instead of guessing. It is the technique behind most document chatbots and internal knowledge assistants. The best part: you can build a working RAG app for free, entirely on your own machine, with a local model and no API bills. This 2026 guide shows you how, step by step.

By the end you will have a Python script that takes a question, finds the most relevant pieces of your text, and asks a local model to answer using only that context.

What Is RAG?

A language model only knows what it was trained on. RAG fixes that by adding a retrieval step: before the model answers, you search your own documents for the most relevant passages and hand them to the model as context. The model then answers from that context, which keeps replies grounded in your data and cuts down on made-up answers.

Three pieces make it work: an embedding model that turns text into vectors, a vector database that finds similar vectors fast, and a language model that writes the final answer. We will use free, local tools for all three.
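To make the retrieval idea concrete, here is a toy sketch that uses hand-made three-number vectors in place of real embeddings. The vectors, the `cosine_similarity` helper, and the `top_match` function are illustrative assumptions, not output from any embedding model. A real embedding model produces much longer vectors, and a vector database does this comparison for you.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for embeddings (hypothetical values, for illustration only).
toy_vectors = {
    "weather fact": [0.9, 0.1, 0.0],
    "cat fact": [0.0, 0.9, 0.2],
    "country fact": [0.1, 0.2, 0.9],
}

def top_match(query_vector, vectors):
    """Return the key whose vector is most similar to the query vector."""
    return max(vectors, key=lambda name: cosine_similarity(query_vector, vectors[name]))

query = [0.8, 0.2, 0.1]  # pretend this is the embedded question
print(top_match(query, toy_vectors))
```

The query vector points in nearly the same direction as the "weather fact" vector, so that entry wins. Vector search applies the same idea to much longer vectors.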

What You Will Build

A small script that stores a handful of facts, finds the ones relevant to a question, and answers using a local model. The same pattern scales from three sentences to thousands of documents.

Prerequisites

Python 3.10 or newer.
Ollama, installed and running, to serve a local model and embeddings. It is free and open source.
Basic comfort running a Python script.

Step 1: Install Ollama and Pull Models

Install Ollama from its site, then pull a small chat model and an embedding model from your terminal:

ollama pull llama3.2
ollama pull nomic-embed-text

The first is a compact language model that runs on most laptops. The second turns text into the vectors RAG needs. Both run locally and cost nothing.

Step 2: Install the Python Libraries

pip install ollama chromadb

The ollama library talks to your local models, and chromadb is a free, embedded vector database that needs no separate server.

Step 3: Prepare Your Documents

Real apps load text from files or a database. To keep the focus on RAG, we will use a short list of facts. Each string is one "chunk" the model can retrieve:

documents = [
    "Open-Meteo is a free weather API that needs no API key.",
    "The Cat Facts API returns random facts about cats as JSON.",
    "DeepSeek offers a low-cost, OpenAI-compatible chat API.",
    "The REST Countries API returns country data like capitals and currencies.",
]

Step 4: Create Embeddings and Store Them

Turn each document into a vector and store it in Chroma:

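import ollama
import chromadb

# In-memory client: the data is lost when the script exits.
client = chromadb.Client()
# get_or_create_collection reuses "docs" if it already exists instead of raising an error.
collection = client.get_or_create_collection(name="docs")

# Embed each document and store the vector alongside the original text.
for i, doc in enumerate(documents):
    vector = ollama.embeddings(model="nomic-embed-text", prompt=doc)["embedding"]
    collection.add(ids=[str(i)], embeddings=[vector], documents=[doc])

print("Stored", collection.count(), "documents.")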

Each document now lives in the database next to its vector, ready to be searched by meaning rather than exact words.

Step 5: Retrieve the Relevant Chunks

When a question comes in, embed it the same way and ask Chroma for the closest documents:

question = "Which API gives weather data for free?"

q_vector = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
results = collection.query(query_embeddings=[q_vector], n_results=2)
context = "\n".join(results["documents"][0])

print("Retrieved context:\n", context)

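Notice that we never ran a keyword search. Chroma compared the question's vector with each stored vector and found the Open-Meteo fact because it is the closest in meaning, which is the whole point of vector search. Because n_results=2, the context also carries the second-closest fact, even when that fact has little to do with the question.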
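One hedged way to keep loosely related chunks out of the context is to drop results whose distance is above a cutoff. The sketch below assumes the result dictionary has the shape Chroma returns from `collection.query` for a single query: parallel `documents` and `distances` lists, each wrapped in an outer list. The `filter_by_distance` helper, the sample distances, and the `max_distance` value are hypothetical. A sensible cutoff depends on your embedding model and distance metric, so expect to tune it.

```python
def filter_by_distance(results, max_distance):
    """Keep only documents whose distance is at or below max_distance.

    `results` is expected to look like a Chroma query result for a single
    query: {"documents": [[...]], "distances": [[...]]}.
    """
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Hand-written sample in the same shape (the distances are made up).
sample_results = {
    "documents": [[
        "Open-Meteo is a free weather API that needs no API key.",
        "The Cat Facts API returns random facts about cats as JSON.",
    ]],
    "distances": [[0.35, 1.10]],
}

kept = filter_by_distance(sample_results, max_distance=0.8)
context = "\n".join(kept)
print(context)
```

In the real script you would pass the `results` object from `collection.query` instead of `sample_results`.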

Step 6: Generate the Answer with a Local Model

Hand the retrieved context to the local language model and tell it to answer from that context only:

prompt = f"""Answer the question using only the context below.
If the answer is not in the context, say you do not know.

Context:
{context}

Question: {question}"""

response = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
print(response["message"]["content"])

The model now answers using your data, grounded in the retrieved facts. Swap in your own documents and the same code becomes a private assistant for your notes, docs, or support content.
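If you want to reuse the prompt across many questions, it can help to wrap it in a function. The sketch below is one possible arrangement, not part of the original script. The `build_prompt` name and the numbered `[1]`, `[2]` labels are assumptions, added so the model can point to the chunk it relied on.

```python
def build_prompt(chunks, question):
    """Assemble a grounded prompt from retrieved chunks and a question."""
    # Number each chunk so an answer can refer to its source, e.g. "[1]".
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n"
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    ["Open-Meteo is a free weather API that needs no API key."],
    "Which API gives weather data for free?",
)
print(prompt)
```

The returned string can be used as the `content` of the user message in the `ollama.chat` call from this step.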

Where RAG Is Used in the Real World

This small pipeline has the same shape as production systems. Support teams use RAG to answer customer questions from their own help docs, so the bot is far less likely to invent a policy. Companies build internal assistants over wikis and handbooks, letting staff ask plain-language questions instead of digging through pages. Developers wire RAG into documentation sites so visitors get direct answers with citations. Researchers use it to query large collections of papers or notes.

In every case the value is the same: the model answers from your trusted content rather than its general training, which makes the answers easier to verify and usually more accurate. Once you have the basic loop working, scaling up is mostly a matter of better chunking, more documents, and a sturdier vector database.

Common Mistakes to Avoid

Chunks that are too big. Split long documents into paragraphs so retrieval stays precise. One giant chunk drowns the useful part.
Mismatched embedding models. Use the same embedding model for storing and for querying, or the vectors will not line up.
No fallback. Tell the model to say "I do not know" when the context lacks the answer, so it does not invent one.
Forgetting to persist. The in-memory client (chromadb.Client()) resets each run; switch to chromadb.PersistentClient with a path on disk once you move past testing.
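The persistence point can be sketched as follows. This is a minimal arrangement under stated assumptions, not the article's original script. `PersistentClient`, `get_or_create_collection`, and `upsert` are Chroma calls. The helper names `open_persistent_collection` and `index_documents`, the `./chroma_db` path, and the injected `embed` function are hypothetical. IDs are list positions here, so reordering your documents would change which ID each one gets.

```python
def open_persistent_collection(path="./chroma_db", name="docs"):
    """Open (or create) a collection stored on disk under `path`."""
    import chromadb  # imported here so the rest of the sketch runs without it

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name)

def index_documents(collection, documents, embed):
    """Embed each document and upsert it, so reruns do not duplicate entries.

    `embed` is any function that maps a string to a list of floats.
    """
    for i, doc in enumerate(documents):
        collection.upsert(ids=[str(i)], embeddings=[embed(doc)], documents=[doc])
    return collection.count()
```

To wire this into the tutorial, pass the collection from `open_persistent_collection()` and an `embed` function that wraps the same `ollama.embeddings` call used in Step 4. That keeps the embedding model identical for storing and querying.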

Frequently Asked Questions

Is this RAG app really free?

Yes. Ollama, the models used here, and Chroma are all free and run locally, so there are no API costs at all.

Can I use a hosted model instead?

Yes. Swap the generation step for a low-cost API like DeepSeek or a free tier such as Gemini if you prefer not to run a model locally. The retrieval steps stay the same.

How many documents can it handle?

Chroma handles thousands of chunks comfortably for local projects. If you outgrow it, move to a vector database built for larger scale.

How do I load my own files?

Read text from files or a database, split it into paragraph-sized chunks, and feed those chunks into the same embedding and storage loop shown above.
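As a rough illustration of the splitting step, the sketch below breaks text on blank lines and packs paragraphs into chunks up to a size limit. The `split_into_chunks` name, the `max_chars` parameter, and the sample text are assumptions made for this example. Real documents may call for smarter splitting, such as by heading or sentence.

```python
import re

def split_into_chunks(text, max_chars=500):
    """Split text on blank lines, then pack paragraphs into chunks of up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph  # a single oversized paragraph is kept whole
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First paragraph about weather.\n\n"
    "Second paragraph about cats.\n\n\n"
    "Third paragraph about countries."
)
documents = split_into_chunks(sample, max_chars=40)
print(documents)
```

The resulting `documents` list can go through the same embedding and storage loop as the hard-coded facts. To load a file, read its contents into a string first, for example with `pathlib.Path(...).read_text()`, and pass that string to the function.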

Wrapping Up

You built a complete RAG pipeline (embeddings, vector search, and grounded generation) running free on your own machine. That is the same architecture behind serious document assistants, just scaled down so you can learn it end to end.

Want data sources to feed your next RAG project? Browse the free API directory at Free API Hub and pull in live content to index.

Free API Hub Team

Editorial at FreeAPIHub

The FreeAPIHub editorial team tests every API endpoint, runs every code example, and verifies free tiers before publishing. Corrections and suggestions welcome via GitHub.
