
Table of Contents

  1. Quick Reference Table
  2. Why Run LLMs Locally?
  3. Installing Ollama
  4. Install Commands
  5. Running Your First Model
  6. Popular Models to Start With
  7. Pull and Run
  8. Managing Multiple Models
  9. Using the HTTP API
  10. Quick Test with curl
  11. Python Integration
  12. Basic Chat Example
  13. Streaming Responses
  14. OpenAI-Compatible Mode
  15. JavaScript and Node.js Integration
  16. Basic Example
  17. Fetch API (No Dependencies)
  18. Customizing Models with Modelfiles
  19. Example: A SQL Expert Assistant
  20. Example: A Mario Character Bot
  21. Hardware Tips and Performance
  22. Common Mistakes to Avoid
  23. Final Take

Technology · Published February 24, 2026 · 7 min read

How to Run and Customize Open-Source LLMs Locally With Ollama

Run powerful open-source LLMs on your own machine for free with Ollama — complete privacy, zero API costs, full customization. This developer-focused guide covers install, model management, the REST API, Python and JS integration, and custom Modelfiles.

Illustration: running open-source large language models locally with Ollama

For developers in 2026, running LLMs locally has gone from experimental hobby to genuinely practical. Open-source models like Llama 3.1, Mistral, Gemma 3, and Qwen3 now match or exceed GPT-4 on many benchmarks, and they run on consumer hardware with zero per-token cost, full data privacy, and complete offline capability.

The tool that made this mainstream is Ollama — an open-source runtime that treats local LLMs like Docker treats containers. One command pulls a model, one command runs it, and a built-in REST API makes it trivially easy to plug into Python, JavaScript, or any HTTP-capable language.

This guide walks you through everything — installation, model management, the HTTP API, Python and JavaScript integration, and building custom models with Modelfiles. Every example is copy-paste ready and tested on the current Ollama release.

Quick Reference Table

Here is the fast view of Ollama’s core commands and concepts. Keep this as a cheat sheet while you work through the guide.

Command / Concept | What It Does                          | Example
ollama pull       | Download a model locally              | ollama pull llama3.2
ollama run        | Start an interactive chat             | ollama run mistral
ollama list       | Show installed models                 | ollama list
ollama rm         | Delete a model from disk              | ollama rm gemma3
ollama serve      | Start the API server manually         | ollama serve
ollama create     | Build a custom model from a Modelfile | ollama create mybot -f Modelfile
REST API port     | Default HTTP endpoint                 | http://localhost:11434

Why Run LLMs Locally?

Three reasons make local LLMs compelling for developers in 2026. First, cost — OpenAI’s GPT-5.4 API charges $15 per million input tokens, while a local Llama 3.1 8B costs exactly zero per token once the hardware is paid for.

Second, privacy. Healthcare, legal, and financial teams cannot send customer data to third-party APIs without compliance headaches. Running models on your own machine keeps every byte on your hardware — no terms of service, no data retention, no vendor lock-in.

Third, offline capability. Once models are downloaded, they work on a plane, in a remote cabin, or during an internet outage. For developers prototyping prompts at rate-limit-free speed, this alone is worth the setup.

Installing Ollama

Ollama runs natively on macOS, Linux, and Windows (including native ARM64 as of 2026). Installation is a one-command process on every platform.

Install Commands

# macOS (Homebrew)
brew install ollama

# Linux (one-line installer)
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download the installer from ollama.com and run it

# Docker (any platform with NVIDIA GPU)
docker run -d --gpus all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

After installing, verify everything works by checking the version and listing models (which will be empty on a fresh install):

ollama --version
ollama list

On macOS the daemon starts automatically. On Linux you may need to run ollama serve once, or set up a systemd service for auto-start. Windows users get a system-tray app that handles the daemon for you.
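
If you want the daemon to start on boot on Linux, a small systemd unit does the job. The unit below is a minimal sketch: the binary path and lack of a dedicated service user are assumptions, and the official install script may already have created an equivalent service, so check systemctl status ollama first.

# /etc/systemd/system/ollama.service: minimal sketch, binary path is an assumption
[Unit]
Description=Ollama local LLM server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Restart=always

[Install]
WantedBy=multi-user.target

After saving the file, run sudo systemctl daemon-reload and then sudo systemctl enable --now ollama to start the server immediately and on every boot.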

Running Your First Model

Ollama maintains a public model library at ollama.com/library with hundreds of open-source models. You pull by name, and Ollama handles quantization, GPU detection, and memory allocation automatically.

Popular Models to Start With

  • llama3.2 (3B) — Meta’s latest small model, runs on 8GB RAM, great for general chat.
  • llama3.1:8b — Balanced performance model, needs ~8GB RAM.
  • mistral (7B) — Fast and high-quality for most tasks.
  • gemma3 — Google’s open model, strong for reasoning.
  • qwen3 — Alibaba’s model, excellent multilingual support.
  • codellama — Code-optimized variant of Llama for programming tasks.
  • deepseek-coder — Coding model that outperforms CodeLlama on many benchmarks.

Pull and Run

# Download a model
ollama pull llama3.2

# Start an interactive chat
ollama run llama3.2
>>> Hello! Explain quantum computing in 3 sentences.

# One-shot prompt (no interactive mode)
ollama run llama3.2 "Write a Python function to reverse a string"

# Exit the interactive chat
/bye    # or press Ctrl+D

The first run downloads several gigabytes, so expect a wait. Subsequent runs start in a second or two because the model is cached on disk at ~/.ollama/models/.

Managing Multiple Models

You can keep dozens of models installed and switch between them instantly. Here are the management commands you will use most often as a developer.

# List installed models with size and modified date
ollama list

# Show detailed info for a model
ollama show llama3.2

# See which models are currently loaded in RAM
ollama ps

# Delete a model to free disk space
ollama rm gemma3

# Copy a model to experiment with variants
ollama cp llama3.2 my-llama-variant

# Update a model to the latest version (just re-run pull)
ollama pull llama3.2

Ollama keeps recently used models in RAM for a few minutes by default, so switching back and forth between two or three models is nearly instant. For long-running setups, use ollama ps to check memory pressure.

Using the HTTP API

This is where Ollama becomes genuinely useful for developers. When the Ollama daemon is running, it exposes a REST API on http://localhost:11434 that any HTTP client can hit — curl, Postman, Python, JavaScript, Go, whatever.

If the server is not running, start it manually:

ollama serve

Quick Test with curl

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "stream": false
}'

You will get a JSON response with the model’s full answer. Setting "stream": true streams tokens as they are generated — ideal for chat UIs where you want text to appear word by word.
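
Here is a rough sketch of what streaming looks like from the command line (the -N flag stops curl from buffering; exact response fields can vary slightly between Ollama releases). Each output line is a standalone JSON object carrying a fragment of the reply, and the final line has done set to true:

curl -N http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "stream": true
}'

# Output arrives as newline-delimited JSON, roughly:
# {"model":"llama3.2","message":{"role":"assistant","content":"The"},"done":false}
# {"model":"llama3.2","message":{"role":"assistant","content":" sky"},"done":false}
# ...
# {"model":"llama3.2","message":{"role":"assistant","content":""},"done":true}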

Python Integration

Ollama ships with an official Python SDK that wraps the HTTP API cleanly. Install it with pip:

pip install ollama

Basic Chat Example

from ollama import chat

response = chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain Docker in 3 sentences'}
    ]
)

print(response.message.content)

Streaming Responses

For real-time output in chatbots or CLIs, stream the response token by token:

from ollama import chat

stream = chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a haiku about Python'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

OpenAI-Compatible Mode

One of Ollama’s best 2026 features: it speaks the OpenAI API protocol. If you already have code built for OpenAI, you can point it at Ollama by changing two lines:

from openai import OpenAI

client = OpenAI(
    api_key='ollama',                       # dummy key, not checked
    base_url='http://localhost:11434/v1'    # point at local Ollama
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Hello, how are you?'}
    ]
)

print(response.choices[0].message.content)

This drop-in compatibility means every LangChain, LlamaIndex, and OpenAI-based tool on the market works with local Ollama models without code changes.
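
As a concrete sketch, pointing LangChain at a local model takes the same two-line change. This assumes the langchain-openai package is installed, and the prompt is purely illustrative:

from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the local Ollama server
llm = ChatOpenAI(
    model='llama3.2',
    api_key='ollama',                       # dummy key, not checked
    base_url='http://localhost:11434/v1',   # local Ollama endpoint
)

print(llm.invoke('Name three uses for a local LLM.').content)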

JavaScript and Node.js Integration

Ollama also has an official JavaScript SDK that works in Node.js and any modern runtime. Install it with npm:

npm install ollama

Basic Example

import ollama from 'ollama';

const response = await ollama.chat({
  model: 'llama3.2',
  messages: [
    { role: 'user', content: 'Why is the sky blue?' }
  ]
});

console.log(response.message.content);

Fetch API (No Dependencies)

If you want zero dependencies, the built-in fetch API works fine:

const response = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2',
    messages: [
      { role: 'user', content: 'Summarize Kubernetes in one paragraph' }
    ],
    stream: false
  })
});

const data = await response.json();
console.log(data.message.content);

Customizing Models with Modelfiles

This is where Ollama gets really powerful. A Modelfile lets you define a custom model based on any existing one, with a baked-in system prompt, temperature, context window, and other parameters — no retraining required.

Think of it as giving a base model a specific personality or role, and then saving that configuration as a reusable custom model.

Example: A SQL Expert Assistant

Create a file named Modelfile in any directory:

FROM llama3.2

# Lower temperature for more deterministic answers
PARAMETER temperature 0.2

# Larger context window for reading schemas
PARAMETER num_ctx 4096

SYSTEM """
You are a senior database engineer specializing in SQL optimization.
When given a query or schema:
- Analyze performance implications
- Suggest appropriate indexes
- Provide optimized alternatives
- Explain your reasoning clearly

Always format SQL with proper indentation and line breaks.
"""

Now build it as a custom model:

ollama create sql-expert -f Modelfile
ollama run sql-expert

>>> I have a users table with 5 million rows. Queries
    filtering by email are slow. What should I do?

The model now answers as a SQL expert every time, with low temperature for precise recommendations. You can also hit it through the API exactly like any other model — just use "model": "sql-expert" in your request.
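
For instance, the curl call from earlier works unchanged apart from the model name (the SQL in the prompt is just an illustration):

curl http://localhost:11434/api/chat -d '{
  "model": "sql-expert",
  "messages": [
    { "role": "user", "content": "Suggest an index for: SELECT * FROM users WHERE email = $1" }
  ],
  "stream": false
}'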

Example: A Mario Character Bot

For a fun example, here is a Modelfile that gives the model a personality:

FROM llama3.2

PARAMETER temperature 0.8

SYSTEM """
You are Mario from Super Mario Bros. Respond in Mario's
enthusiastic, Italian-accented voice. Sprinkle in
catchphrases like "Mamma mia!", "Let's-a go!", and
"It's-a me, Mario!" when appropriate. Keep responses fun
and in character, no matter what is asked.
"""
ollama create mario -f Modelfile
ollama run mario "What is recursion?"

The same system works for customer support bots, code review personas, technical writing assistants, or any role-specific AI you need. Once built, the custom model is available everywhere — CLI, API, Python, and JavaScript.

Hardware Tips and Performance

Model size determines your hardware needs more than anything else. As a rough guide: 3B models need ~4GB RAM, 8B models need ~8GB RAM, 13B models need ~16GB RAM, and 70B models need ~48GB RAM or serious GPU horsepower.

For most developer laptops, stick with quantized models — variants ending in q4_K_M or q4_0 use about half the memory with minimal quality loss. The default Ollama pulls are already quantized sensibly.
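
If you do want a specific quantization, most library models publish tagged variants you can pull directly. Tag names differ per model, so treat the one below as illustrative and verify it on the model's tags page at ollama.com/library:

# Pull an explicitly quantized variant (tag shown is an example; check the model's tags page)
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M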

If you have an NVIDIA GPU, Ollama uses it automatically. Apple Silicon gets Metal acceleration out of the box. AMD GPUs work on Linux with ROCm 6.x. With a decent GPU, expect 50 to 300 tokens per second on 7B to 13B models — faster than most cloud APIs.

Common Mistakes to Avoid

The biggest trap is downloading a model too big for your RAM. A 70B model on an 8GB laptop will either crash or swap to disk so hard your machine freezes. Start small — llama3.2 (3B) or mistral (7B) — and scale up only when you know your hardware handles it.

The second mistake is exposing the API to the public internet. By default Ollama binds to 127.0.0.1, which is safe. Setting OLLAMA_HOST=0.0.0.0 exposes it everywhere with no authentication — do not do this without a firewall, VPN, or reverse proxy with auth in front.

The third is forgetting that the daemon holds models in RAM for a few minutes after use. If you are swapping between big models in a script, either unload them explicitly or restart the service to free memory.
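
Two sketches for freeing memory without restarting anything; both depend on the Ollama release you are running, so treat them as version-dependent:

# Unload a model from RAM immediately (available in recent Ollama releases)
ollama stop llama3.2

# Or ask the API to unload it by setting keep_alive to 0
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "keep_alive": 0
}'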

Final Take

Ollama is the single easiest way in 2026 to run powerful, open-source LLMs on your own hardware. The combination of a clean CLI, a drop-in OpenAI-compatible API, official Python and JavaScript SDKs, and Modelfile customization makes it genuinely production-ready for internal tools, prototypes, and privacy-sensitive workloads.

Start with one small model today — ollama run llama3.2 gets you chatting in under five minutes. Once you feel how fast and private local LLMs are, you will find it hard to go back to paying per token for basic tasks.

The future of AI development is a hybrid stack — local models for iteration, privacy, and cost-sensitive work, and cloud APIs for the hardest reasoning tasks. Ollama is how you build the local half of that stack without fighting CUDA drivers or quantization math. Spin it up tonight and see what you can ship by the weekend.

Related Topics

open source llms, ollama, llama models, local ai models, llm customization, ollama http api, large language models, ai developer tools
