If you are a developer in 2026, running LLMs locally has gone from experimental hobby to genuinely practical. Open-source models like Llama 3.1, Mistral, Gemma 3, and Qwen3 now match or exceed GPT-4 on many benchmarks, and they run on consumer hardware with zero per-token cost, full data privacy, and complete offline capability.
The tool that made this mainstream is Ollama — an open-source runtime that treats local LLMs like Docker treats containers. One command pulls a model, one command runs it, and a built-in REST API makes it trivially easy to plug into Python, JavaScript, or any HTTP-capable language.
This guide walks you through everything — installation, model management, the HTTP API, Python and JavaScript integration, and building custom models with Modelfiles. Every example is copy-paste ready and tested on the current Ollama release.
Quick Reference Table
Here is the fast view of Ollama’s core commands and concepts. Keep this as a cheat sheet while you work through the guide.
| Command / Concept | What It Does | Example |
|---|---|---|
| ollama pull | Download a model locally | ollama pull llama3.2 |
| ollama run | Start an interactive chat | ollama run mistral |
| ollama list | Show installed models | ollama list |
| ollama rm | Delete a model from disk | ollama rm gemma3 |
| ollama serve | Start the API server manually | ollama serve |
| ollama create | Build a custom model from a Modelfile | ollama create mybot -f Modelfile |
| REST API port | Default HTTP endpoint | http://localhost:11434 |
Why Run LLMs Locally?
Three reasons make local LLMs compelling for developers in 2026. First, cost: frontier cloud APIs bill by the token, typically on the order of dollars to tens of dollars per million tokens, while a local Llama 3.1 8B costs exactly zero per token once your hardware is paid for.
Second, privacy. Healthcare, legal, and financial teams cannot send customer data to third-party APIs without compliance headaches. Running models on your own machine keeps every byte on your hardware — no terms of service, no data retention, no vendor lock-in.
Third, offline capability. Once models are downloaded, they work on a plane, in a remote cabin, or during an internet outage. For developers prototyping prompts at rate-limit-free speed, this alone is worth the setup.
Installing Ollama
Ollama runs natively on macOS, Linux, and Windows (including native ARM64 as of 2026). Installation is a one-command process on every platform.
Install Commands
# macOS (Homebrew)
brew install ollama
# Linux (one-line installer)
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download the installer from ollama.com and run it
# Docker (any platform with NVIDIA GPU)
docker run -d --gpus all -v ollama:/root/.ollama \
-p 11434:11434 --name ollama ollama/ollama
After installing, verify everything works by checking the version and listing models (which will be empty on a fresh install):
ollama --version
ollama list
On macOS the daemon starts automatically. On Linux the install script typically registers a systemd service for you; if you installed another way, run ollama serve manually or set up a systemd service for auto-start. Windows users get a system-tray app that handles the daemon.
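If you do need to write the systemd unit yourself, a minimal sketch looks roughly like this. The binary path, service user, and file location are assumptions; adjust them to match your install:

```ini
# /etc/systemd/system/ollama.service (hypothetical path; adapt to your system)
[Unit]
Description=Ollama LLM server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now ollama, then confirm the API answers on port 11434.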
Running Your First Model
Ollama maintains a public model library at ollama.com/library with hundreds of open-source models. You pull by name, and Ollama handles quantization, GPU detection, and memory allocation automatically.
Popular Models to Start With
- llama3.2 (3B) — Meta’s latest small model, runs on 8GB RAM, great for general chat.
- llama3.1:8b — Balanced performance model, needs ~8GB RAM.
- mistral (7B) — Fast and high-quality for most tasks.
- gemma3 — Google’s open model, strong for reasoning.
- qwen3 — Alibaba’s model, excellent multilingual support.
- codellama — Code-optimized variant of Llama for programming tasks.
- deepseek-coder — Coding model that outperforms CodeLlama on many benchmarks.
Pull and Run
# Download a model
ollama pull llama3.2
# Start an interactive chat
ollama run llama3.2
>>> Hello! Explain quantum computing in 3 sentences.
# One-shot prompt (no interactive mode)
ollama run llama3.2 "Write a Python function to reverse a string"
# Exit the interactive chat
/bye # or press Ctrl+D
The first run downloads several gigabytes, so expect a wait. Subsequent runs start in a second or two because the model is cached on disk at ~/.ollama/models/.
Managing Multiple Models
You can keep dozens of models installed and switch between them instantly. Here are the management commands you will use most often as a developer.
# List installed models with size and modified date
ollama list
# Show detailed info for a model
ollama show llama3.2
# See which models are currently loaded in RAM
ollama ps
# Delete a model to free disk space
ollama rm gemma3
# Copy a model to experiment with variants
ollama cp llama3.2 my-llama-variant
# Update a model to its latest version (just pull it again)
ollama pull llama3.2
Ollama keeps recently used models in RAM for a few minutes by default, so switching back and forth between two or three models is nearly instant. For long-running setups, use ollama ps to check memory pressure.
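You can also control that in-RAM lifetime per request with the API's keep_alive field, which takes a duration string like "10m" or 0 to unload immediately after the reply. Here is a minimal Python sketch that builds such a request body; the helper name is my own invention:

```python
import json

def chat_body(model, prompt, keep_alive="10m"):
    """Build a JSON body for POST /api/chat with an explicit keep_alive,
    telling Ollama how long to keep the model resident after answering
    ("10m" = ten minutes, 0 = unload as soon as the reply is done)."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "keep_alive": keep_alive,
        "stream": False,
    })

body = chat_body("llama3.2", "ping")
print(json.loads(body)["keep_alive"])  # → 10m
```

POST that body to http://localhost:11434/api/chat and the model stays warm for the window you chose.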
Using the HTTP API
This is where Ollama becomes genuinely useful for developers. When the Ollama daemon is running, it exposes a REST API on http://localhost:11434 that any HTTP client can hit — curl, Postman, Python, JavaScript, Go, whatever.
If the server is not running, start it manually:
ollama serve
Quick Test with curl
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{ "role": "user", "content": "Why is the sky blue?" }
],
"stream": false
}'
You will get a single JSON response with the model's full answer. Streaming is actually the API's default behavior; drop the "stream": false line and the endpoint returns tokens as they are generated, ideal for chat UIs where you want text to appear word by word.
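When streaming, the endpoint emits one JSON object per line, each carrying a message.content fragment, with done set to true on the final chunk. A short sketch of reassembling that stream in Python (the sample chunks below are illustrative, not captured output):

```python
import json

def collect_stream(lines):
    """Join the content fragments from Ollama's newline-delimited
    JSON stream into one string, stopping at the chunk marked done."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Chunks shaped like the API's streaming output:
chunks = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": true}',
]
print(collect_stream(chunks))  # → Hello
```

The same loop works over an HTTP response read line by line, which is exactly what the official SDKs do for you.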
Python Integration
Ollama ships with an official Python SDK that wraps the HTTP API cleanly. Install it with pip:
pip install ollama
Basic Chat Example
from ollama import chat
response = chat(
model='llama3.2',
messages=[
{'role': 'user', 'content': 'Explain Docker in 3 sentences'}
]
)
print(response.message.content)
Streaming Responses
For real-time output in chatbots or CLIs, stream the response token by token:
from ollama import chat
stream = chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Write a haiku about Python'}],
stream=True
)
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
OpenAI-Compatible Mode
One of Ollama’s best 2026 features: it speaks the OpenAI API protocol. If you already have code built for OpenAI, you can point it at Ollama by changing two lines:
from openai import OpenAI
client = OpenAI(
api_key='ollama', # dummy key, not checked
base_url='http://localhost:11434/v1' # point at local Ollama
)
response = client.chat.completions.create(
model='llama3.2',
messages=[
{'role': 'user', 'content': 'Hello, how are you?'}
]
)
print(response.choices[0].message.content)
This drop-in compatibility means every LangChain, LlamaIndex, and OpenAI-based tool on the market works with local Ollama models without code changes.
JavaScript and Node.js Integration
Ollama also has an official JavaScript SDK that works in Node.js and any modern runtime. Install it with npm:
npm install ollama
Basic Example
import ollama from 'ollama';
const response = await ollama.chat({
model: 'llama3.2',
messages: [
{ role: 'user', content: 'Why is the sky blue?' }
]
});
console.log(response.message.content);
Fetch API (No Dependencies)
If you want zero dependencies, the built-in fetch API works fine:
const response = await fetch('http://localhost:11434/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'llama3.2',
messages: [
{ role: 'user', content: 'Summarize Kubernetes in one paragraph' }
],
stream: false
})
});
const data = await response.json();
console.log(data.message.content);
Customizing Models with Modelfiles
This is where Ollama gets really powerful. A Modelfile lets you define a custom model based on any existing one, with a baked-in system prompt, temperature, context window, and other parameters — no retraining required.
Think of it as giving a base model a specific personality or role, and then saving that configuration as a reusable custom model.
Example: A SQL Expert Assistant
Create a file named Modelfile in any directory:
FROM llama3.2
# Lower temperature for more deterministic answers
PARAMETER temperature 0.2
# Larger context window for reading schemas
PARAMETER num_ctx 4096
SYSTEM """
You are a senior database engineer specializing in SQL optimization.
When given a query or schema:
- Analyze performance implications
- Suggest appropriate indexes
- Provide optimized alternatives
- Explain your reasoning clearly
Always format SQL with proper indentation and line breaks.
"""
Now build it as a custom model:
ollama create sql-expert -f Modelfile
ollama run sql-expert
>>> I have a users table with 5 million rows. Queries
filtering by email are slow. What should I do?
The model now answers as a SQL expert every time, with low temperature for precise recommendations. You can also hit it through the API exactly like any other model — just use "model": "sql-expert" in your request.
Example: A Mario Character Bot
For a fun example, here is a Modelfile that gives the model a personality:
FROM llama3.2
PARAMETER temperature 0.8
SYSTEM """
You are Mario from Super Mario Bros. Respond in Mario's
enthusiastic, Italian-accented voice. Sprinkle in
catchphrases like "Mamma mia!", "Let's-a go!", and
"It's-a me, Mario!" when appropriate. Keep responses fun
and in character, no matter what is asked.
"""
ollama create mario -f Modelfile
ollama run mario "What is recursion?"
The same system works for customer support bots, code review personas, technical writing assistants, or any role-specific AI you need. Once built, the custom model is available everywhere — CLI, API, Python, and JavaScript.
Hardware Tips and Performance
Model size determines your hardware needs more than anything else. As a rough guide: 3B models need ~4GB RAM, 8B models need ~8GB RAM, 13B models need ~16GB RAM, and 70B models need ~48GB RAM or serious GPU horsepower.
For most developer laptops, stick with quantized models — variants ending in q4_K_M or q4_0 use about half the memory with minimal quality loss. The default Ollama pulls are already quantized sensibly.
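As a back-of-the-envelope check before pulling a model, weight memory is roughly parameter count times bits per weight, plus some headroom for the KV cache and runtime buffers. This little heuristic (the overhead constant is my rough assumption, not an Ollama figure) lands in the same ballpark as the guide above:

```python
def approx_ram_gb(params_billion, bits_per_weight=4, overhead_gb=1.5):
    """Rough RAM estimate for a quantized model: weights at the
    given bit width plus a flat allowance for KV cache and buffers."""
    return params_billion * bits_per_weight / 8 + overhead_gb

print(approx_ram_gb(8))   # 8B at 4-bit → 5.5 (GB, approximate)
print(approx_ram_gb(70))  # 70B at 4-bit → 36.5 (GB, approximate)
```

Real usage varies with context length and quantization scheme, so treat this as a sanity check, not a guarantee.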
If you have an NVIDIA GPU, Ollama uses it automatically. Apple Silicon gets Metal acceleration out of the box. AMD GPUs work on Linux with ROCm 6.x. With a decent GPU, expect 50 to 300 tokens per second on 7B to 13B models — faster than most cloud APIs.
Common Mistakes to Avoid
The biggest trap is downloading a model too big for your RAM. A 70B model on an 8GB laptop will either crash or swap to disk so hard your machine freezes. Start small — llama3.2 (3B) or mistral (7B) — and scale up only when you know your hardware handles it.
The second mistake is exposing the API to the public internet. By default Ollama binds to 127.0.0.1, which is safe. Setting OLLAMA_HOST=0.0.0.0 exposes it everywhere with no authentication — do not do this without a firewall, VPN, or reverse proxy with auth in front.
The third is forgetting that the daemon holds models in RAM for a few minutes after use. If you are swapping between big models in a script, either unload them explicitly or restart the service to free memory.
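One way to unload explicitly is an empty generate call with keep_alive set to 0. A standard-library-only sketch (the helper function is mine, and it assumes the default port):

```python
import json
import urllib.request

def unload_request(model, host="http://localhost:11434"):
    """Build a POST to /api/generate asking Ollama to evict `model`
    from RAM immediately: no prompt, keep_alive = 0."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = unload_request("llama3.2")
print(req.full_url)  # → http://localhost:11434/api/generate
```

Send it with urllib.request.urlopen(req) against a running daemon, then confirm with ollama ps that the model is gone from memory.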
Final Take
Ollama is the single easiest way in 2026 to run powerful, open-source LLMs on your own hardware. The combination of a clean CLI, a drop-in OpenAI-compatible API, official Python and JavaScript SDKs, and Modelfile customization makes it genuinely production-ready for internal tools, prototypes, and privacy-sensitive workloads.
Start with one small model today — ollama run llama3.2 gets you chatting in under five minutes. Once you feel how fast and private local LLMs are, you will find it hard to go back to paying per token for basic tasks.
The future of AI development is a hybrid stack — local models for iteration, privacy, and cost-sensitive work, and cloud APIs for the hardest reasoning tasks. Ollama is how you build the local half of that stack without fighting CUDA drivers or quantization math. Spin it up tonight and see what you can ship by the weekend.

