---
lang: vi
slug: ollama
title: 'Ollama: 137K+ Stars - Run LLMs Locally with One Command'
description: 'Ollama is the simplest way to run Llama, DeepSeek, Mistral, and other LLMs locally. Compatible with LangChain, OpenWebUI, Continue.dev, and Dify. Covers Docker setup, Modelfile customization, REST API, production hardening, and performance benchmarks.'
tags: ["guide", "local", "offline", "open-source", "privacy", "reference", "tutorial"]
date: 2026-05-19 00:00:00+08:00
lastmod: 2026-05-19 00:00:00+08:00
tech_stack: []
application_domain: Llm Frameworks
source_version: ''
licensing_model: Open Source
license_type: MIT
file_size: ''
file_md5: ''
download_url: ''
backup_url: ''
github_repo: 'https://github.com/ollama/ollama'
last_maintained: '2026-05-19'
draft: false
categories: ['llm-frameworks']
aliases:
- /posts/ollama/
- /resources/llm-frameworks/ollama-local-llm-guide/
- /posts/ollama-local-llm-guide/
faqs:
- q: 'How much VRAM do I need to run a 7B parameter model with Ollama?'
  a: 'A Q4_K_M quantized 7B model needs roughly 4.5-5 GB of VRAM, while Q8 quantization requires 7-8 GB. CPU-only inference works with 16 GB system RAM but runs about 3-5x slower.'
- q: 'Can I run Ollama without a GPU?'
  a: 'Yes. Ollama automatically falls back to CPU inference via llama.cpp. An Intel i7-13700K generates around 8-12 tok/s on a 7B Q4 model, and an Apple M3 Pro reaches about 25 tok/s using the CPU/Neural Engine.'
- q: 'Does Ollama have built-in API key authentication?'
  a: 'No. Ollama assumes a trusted local network and ships no built-in authentication. For internet-facing deployments you must add an auth layer such as a reverse proxy, API gateway, or VPN.'
- q: 'Can I use my own fine-tuned model with Ollama?'
  a: 'Yes. Convert the model to GGUF format, then write a Modelfile with FROM ./your-model.gguf and run ollama create my-model -f Modelfile. It then becomes available through the standard REST API.'
- q: 'How does Ollama compare to vLLM for serving many concurrent users?'
  a: 'Ollama processes requests sequentially with a FIFO queue, so at 50 concurrent users p99 latency hits about 25 seconds versus vLLM''s roughly 3 seconds. vLLM''s continuous batching delivers around 6x the aggregate throughput, so choose vLLM once you serve more than about 5 concurrent users.'

featureImage: /images/articles/ollama-137k-stars-run-llms-locally-with.png
---

{{< resource-info >}}

Running large language models used to mean wrestling with Python environments, CUDA drivers, and gigabytes of dependencies. In 2026, that friction is gone. Ollama lets you pull, configure, and serve production-grade LLMs with a single command: no PyTorch installation, no manual GPU tuning, and no mandatory Docker. With 173,950+ GitHub stars and a thriving ecosystem of integrations, Ollama has become the default runtime for developers who want local inference without operational headaches.

This guide walks through the complete Ollama setup: installation, Docker deployment, Modelfile customization, API integration with popular frameworks, production hardening, and honest benchmarks against alternatives. Whether you are building a coding assistant, a RAG pipeline, or a self-hosted ChatGPT alternative, this tutorial gives you the commands and configs to go from zero to running models in under five minutes.

## What Is Ollama?

Ollama is an open-source runtime for running large language models locally. It wraps the inference engines (llama.cpp for CPU/GPU, MLX on Apple Silicon, ROCm on AMD) behind a simple CLI and REST API, so developers can focus on building applications instead of managing model weights, quantization formats, and hardware acceleration. Think of it as Docker for LLMs: pull a model, run it, done.

Created by Jeffrey Morgan and the Ollama team in 2023, the project reached 173,950+ stars on GitHub by mid-2026. It supports hundreds of models including Llama 3, DeepSeek R1, Mistral, Qwen, Gemma, and CodeLlama, all available through the Ollama model library.

## How Ollama Works

Ollama's architecture follows a client-server model. A background daemon (`ollama serve`) manages model downloads, memory allocation, and inference. The CLI and REST API are thin clients that communicate with this daemon over HTTP on port 11434.

### Core Architecture

    ┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
    │   Client    │────▶│ ollama serve │────▶│  llama.cpp/MLX  │
    │  (CLI/API)  │     │   (port      │     │  (inference     │
    │             │◄────│   11434)     │◄────│   backend)      │
    └─────────────┘     └──────────────┘     └─────────────────┘
                               │
                        ┌──────┴──────┐
                        │  ~/.ollama/ │
                        │  (models,   │
                        │   blobs)    │
                        └─────────────┘
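To make the "thin client" idea concrete, here is a minimal sketch of a client that talks to the daemon over HTTP using only the Python standard library. It assumes the daemon listens on the default `http://localhost:11434` and that the model name you pass is already pulled; the helper names `build_generate_request` and `generate` are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Assumption: default daemon address described in the text above.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request to a running daemon and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Requires `ollama serve` to be running and the model to be available locally.
    print(generate("llama3.2:8b", "Why is the sky blue?"))
```

Separating request construction from sending keeps the HTTP details easy to inspect without a running daemon.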

Key components:

- Model Hub: Curated GGUF models pulled from ollama.com. Each model is identified by a name:tag pair (e.g., `llama3.2:8b`).
- Modelfile: A declarative config (like a Dockerfile) specifying base model, system prompt, parameters, and chat templates.
- Inference Backends: Automatic selection of llama.cpp (CUDA/ROCm/CPU), MLX (Apple Silicon), or Metal based on available hardware.
- REST API: Native endpoints at `/api/generate`, `/api/chat`, and `/api/embed`, plus an OpenAI-compatible endpoint at `/v1/chat/completions`.

### Model Storage

Models are stored in `~/.ollama/models/` as content-addressable blobs (SHA-256 digests). A manifest file tracks which blobs belong to which model tag. This deduplication means two models sharing the same base weights only store one copy on disk.

## Installation & Setup

### macOS

```bash
# Using Homebrew (recommended)
brew install ollama

# Or download the native app from ollama.com/download
```

### Linux (One-Line Installer)

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

This installs the binary, registers a systemd service, and auto-detects GPU capabilities (NVIDIA CUDA, AMD ROCm, or CPU-only).

### Windows

Download the installer from [ollama.com/download](https://ollama.com/download). Windows 11 with WSL2 is recommended for full compatibility.

### Verify Installation

```bash
ollama --version
# ollama version 0.6.7

# Start the daemon (if not already running)
ollama serve

# Pull and run your first model
ollama run llama3.2:8b
```

The first time you run a model, Ollama downloads it. A quantized 8B parameter model takes a few gigabytes of disk space and runs comfortably on 8 GB VRAM.

### Quick Model Selection by Hardware

| Hardware | Recommended Model | Command |
|----------|------------------|---------|
| 6–8 GB VRAM | Qwen3 8B | `ollama run qwen3:8b` |
| 10–12 GB VRAM | Llama 3.1 8B Q4 | `ollama run llama3.1:8b` |
| 16+ GB VRAM | DeepSeek-R1 14B | `ollama run deepseek-r1:14b` |
| CPU only, 16 GB RAM | … | … |
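
The hardware table above can be turned into a small lookup, for example in a setup script. This is only a sketch: the tier boundaries are taken from the three complete GPU rows, the function name `recommend_model` is hypothetical, and treating the gaps between tiers (for example 9 GB) as belonging to the lower tier is an assumption rather than something the table states.

```python
from typing import Optional

# (minimum VRAM in GB, model tag) pairs copied from the GPU rows of the table,
# ordered from the largest tier to the smallest.
VRAM_TIERS = [
    (16, "deepseek-r1:14b"),
    (10, "llama3.1:8b"),
    (6, "qwen3:8b"),
]


def recommend_model(vram_gb: float) -> Optional[str]:
    """Return the tag for the largest tier that fits, or None below 6 GB."""
    for min_vram, tag in VRAM_TIERS:
        if vram_gb >= min_vram:
            return tag
    return None


if __name__ == "__main__":
    for vram in (4, 8, 12, 24):
        print(vram, "GB ->", recommend_model(vram))
```

Returning `None` below 6 GB leaves the CPU-only decision to the caller instead of guessing a model.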

### Open WebUI (ChatGPT-Style Interface)

[Open WebUI](https://github.com/open-webui/open-webui) is the most popular frontend for Ollama, providing a ChatGPT-like web interface with RAG, voice input, and multi-user support.

```bash
# Run Open WebUI with Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

Access it at `http://localhost:3000`. Open WebUI auto-discovers your Ollama instance at `http://host.docker.internal:11434`.

### LangChain (Python)

```python
# Install first: pip install langchain-ollama

# Chat model
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2:8b",
    temperature=0.7,
    base_url="http://localhost:11434"
)

response = llm.invoke("Explain quantum computing in one paragraph.")
print(response.content)

# Embeddings
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("Hello world")
# Returns a 768-dimensional float vector
```

### Continue.dev

```json
{
  "models": [
    {
      "title": "Llama 3.2",
      "provider": "ollama",
      "model": "llama3.2:8b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeQwen",
    "provider": "ollama",
    "model": "codeqwen:7b-code"
  }
}
```

### Dify (Self-Hosted AI Workflow Platform)

In Dify's **Settings > Model Provider > Ollama**, configure:

```
Model Name: llama3.2:8b
Base URL: http://host.docker.internal:11434
Context Window: 8192
```

### REST API

```bash
# Generate text
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2:8b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7
}'

# Generate embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["The sky is blue", "Grass is green"]
}'
```
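
The curl calls to `/api/generate` set `"stream": false`. With streaming left on, that endpoint returns one JSON object per line, each carrying a `response` fragment and a `done` flag. The sketch below shows one way to reassemble such a stream; the function name `collect_stream` is illustrative, and the sample lines are made-up data in that shape rather than captured output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the `response` fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of generation
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical sample lines shaped like a streaming /api/generate reply.
    sample = [
        b'{"response": "Hel", "done": false}\n',
        b'{"response": "lo", "done": false}\n',
        b'{"response": "", "done": true}\n',
    ]
    print(collect_stream(sample))
```

Because the function accepts any iterable of byte lines, it can be fed an open `urllib.request.urlopen(...)` response directly, since that object yields lines when iterated.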

## Docker Setup for Production

![Ollama Docker deployment](https://raw.githubusercontent.com/ollama/ollama/main/docs/images/logo.png)

### Basic Docker Compose

```yaml
# docker-compose.yml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:0.6.7
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
    restart: unless-stopped
    # NVIDIA GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - openwebui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  openwebui_data:
```

Start with `docker compose up -d`.

### NVIDIA GPU Setup

```bash
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### AMD GPU Setup (ROCm)

```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
```

### Multi-Model Concurrent Serving

```yaml
services:
  ollama:
    image: ollama/ollama:0.6.7
    environment:
      - OLLAMA_NUM_PARALLEL=4      # 4 concurrent requests
      - OLLAMA_MAX_LOADED_MODELS=2  # Keep 2 models in VRAM
      - OLLAMA_KEEP_ALIVE=30m      # Unload after 30min idle
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Modelfile: Customizing Models

A Modelfile is Ollama's declarative configuration format. It defines how a model behaves: system prompt, sampling parameters, context window, and chat template.

### Basic Modelfile

```dockerfile
# Modelfile
FROM llama3.2:8b

# System prompt defines personality
SYSTEM """You are a senior software engineer. Be concise, 
practical, and always include working code examples."""

# Parameter tuning
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"

# Custom template (optional; inherits from base if omitted)
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
```

```bash
# Create the custom model
ollama create senior-dev -f Modelfile

# Run it
ollama run senior-dev

# View the effective Modelfile
ollama show senior-dev --modelfile
```

### Advanced: Code Review Assistant

```dockerfile
# Modelfile.code-review
FROM codellama:7b-code

SYSTEM """You are a code review assistant. Analyze the provided code for:
1. Bugs and logic errors
2. Security vulnerabilities (SQL injection, XSS, buffer overflow)
3. Performance issues (N+1 queries, unnecessary allocations)
4. Style and readability

Format your response as:
- [CRITICAL] for bugs/security
- [WARN] for performance
- [INFO] for style suggestions

Always suggest a fix for [CRITICAL] and [WARN] items."""

PARAMETER temperature 0.1
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
```

```bash
ollama create code-reviewer -f Modelfile.code-review
```

### Creating from a Local GGUF File

```dockerfile
# Modelfile.local
FROM ./my-fine-tuned-model-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a helpful assistant specialized in medical terminology."
```

```bash
ollama create med-assistant -f Modelfile.local
```

### Inspecting Models

```bash
# Show model details and Modelfile
ollama show llama3.2:8b --modelfile

# Show parameters only
ollama show llama3.2:8b --parameters

# Show system prompt
ollama show llama3.2:8b --system

# List all local models
ollama list

# Show running models
ollama ps
```


## Performance Benchmarks

| Tool | Quantization | Generation Speed | Setup Time |
|------|--------------|------------------|------------|
| llama.cpp (CLI) | Q4_K_M | ~65 tok/s | ~5 min |
| LocalAI | Q4_K_M | ~38 tok/s | ~15 min |

*Source: SitePoint benchmark, March 2026. Single-stream generation, 256-token output.*

### Concurrent Load (50 Users, RTX 4090)

| Tool | Aggregate tok/s | p99 Latency | Architecture |
|------|----------------|-------------|--------------|
| Ollama | ~155 tok/s | ~24.7s | FIFO queue |
| vLLM | ~920 tok/s | ~2.8s | Continuous batching |
| llama.cpp server | ~140 tok/s | ~26s | FIFO queue |
| LocalAI | ~130 tok/s | ~28s | FIFO queue |

*Ollama processes requests sequentially; vLLM's continuous batching provides roughly 6x the throughput under concurrent load.*

### Resource Usage

| Tool | Idle Memory | Memory with Model Loaded | Load Time |
|------|-------------|--------------------------|-----------|
| vLLM | 400 MB | 5.5 GB | 5s |
| LocalAI | 400 MB | 5.5 GB | 8s |
| LM Studio | 800 MB | 5.8 GB | 5s |

### Real-World Deployment Patterns

1. **Individual Developer**: Ollama + Continue.dev for AI-assisted coding.

## Production Hardening

### Environment Variables

```bash
# Core settings
OLLAMA_HOST=0.0.0.0:11434          # Bind to all interfaces
OLLAMA_KEEP_ALIVE=24h               # Keep models loaded for 24 hours
OLLAMA_NUM_PARALLEL=4               # Max concurrent requests
OLLAMA_MAX_LOADED_MODELS=2          # Max models in VRAM simultaneously
OLLAMA_FLASH_ATTENTION=1            # Enable Flash Attention (faster inference)

# Performance tuning
OLLAMA_GPU_OVERHEAD=200MB           # Reserve VRAM headroom
OLLAMA_DEBUG=1                      # Verbose logging
```

### Reverse Proxy with Nginx

```nginx
server {
    listen 443 ssl http2;
    server_name ollama.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ollama.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:11434;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for streaming
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Timeouts for long-running inference
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
    }
}
```

### API Key Authentication (No Native Support)

Ollama does not include built-in API key authentication. Add it via a reverse proxy:

```python
# ollama-auth-proxy.py (Flask example)
from flask import Flask, request, Response
import requests

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434"
VALID_KEYS = {"sk-your-api-key-here"}  # Placeholder; replace with your own keys

@app.route('/', defaults={'path': ''}, methods=['GET', 'POST', 'PUT', 'DELETE'])
@app.route('/<path:path>', methods=['GET', 'POST', 'PUT', 'DELETE'])
def proxy(path):
    # Expect "Authorization: Bearer <key>"
    api_key = request.headers.get('Authorization', '').replace('Bearer ', '')
    if api_key not in VALID_KEYS:
        return {"error": "Invalid API key"}, 401

    # Forward the request to Ollama and stream the response back
    resp = requests.request(
        method=request.method,
        url=f"{OLLAMA_URL}/{path}",
        headers={k: v for k, v in request.headers if k != 'Host'},
        data=request.get_data(),
        stream=True
    )
    return Response(resp.iter_content(chunk_size=1024), status=resp.status_code,
                   content_type=resp.headers.get('Content-Type'))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=11435)
```
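
The key check in a proxy like the one above can be tightened a little. The sketch below, using only the standard library, parses the `Authorization` header strictly and compares keys with `hmac.compare_digest` so the comparison time does not depend on how many leading characters match. The helper names `extract_bearer_token` and `is_authorized` are hypothetical, and this is one possible hardening step rather than a complete authentication design.

```python
import hmac
from typing import Iterable, Optional


def extract_bearer_token(header_value: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, else None."""
    if not header_value:
        return None
    scheme, _, token = header_value.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def is_authorized(header_value: Optional[str], valid_keys: Iterable[str]) -> bool:
    """Check the bearer token against every valid key in constant time per key."""
    token = extract_bearer_token(header_value)
    if token is None:
        return False
    token_bytes = token.encode("utf-8")
    matched = False
    for key in valid_keys:
        # Keep looping after a match so timing does not reveal which key matched.
        if hmac.compare_digest(token_bytes, key.encode("utf-8")):
            matched = True
    return matched


if __name__ == "__main__":
    keys = {"sk-your-api-key-here"}  # placeholder key, as in the Flask example
    print(is_authorized("Bearer sk-your-api-key-here", keys))
    print(is_authorized("Bearer wrong-key", keys))
```

In the Flask proxy, a helper like this could replace the `replace('Bearer ', '')` lookup with a single `is_authorized(request.headers.get('Authorization'), VALID_KEYS)` call.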

### Monitoring

```bash
# List running models with memory usage
curl http://localhost:11434/api/ps

# Expected output:
# {
#   "models": [
#     {
#       "name": "llama3.2:8b",
#       "model": "llama3.2:8b",
#       "size": 5137025024,
#       "size_vram": 5137025024,
#       "expires_at": "2026-05-20T10:00:00Z"
#     }
#   ]
# }
```

For production monitoring, wrap the `/api/ps` endpoint with a Prometheus exporter or use the [ollamaMQ](https://github.com/Chleba/ollamaMQ) proxy with built-in metrics.

### Systemd Service (Linux)

```ini
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=24h"

[Install]
WantedBy=default.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```
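
As a rough illustration of wrapping the `/api/ps` endpoint for Prometheus, the sketch below converts an `/api/ps` JSON payload into Prometheus text exposition lines. The metric names `ollama_model_size_bytes` and `ollama_model_vram_bytes` and the function `ps_to_prometheus` are invented for this example; only the `name`, `size`, and `size_vram` fields come from the sample response in this guide. Fetching the JSON and serving the text over HTTP are left out.

```python
import json


def _escape_label(value: str) -> str:
    """Escape backslashes, quotes, and newlines for a Prometheus label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def ps_to_prometheus(ps_json: str) -> str:
    """Render an /api/ps JSON document as Prometheus text-format gauges."""
    models = json.loads(ps_json).get("models", [])
    lines = []
    # Emit each metric as one group: its TYPE line followed by all samples.
    for metric, field in (
        ("ollama_model_size_bytes", "size"),
        ("ollama_model_vram_bytes", "size_vram"),
    ):
        lines.append(f"# TYPE {metric} gauge")
        for model in models:
            label = _escape_label(model["name"])
            lines.append(f'{metric}{{model="{label}"}} {model.get(field, 0)}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = json.dumps({
        "models": [
            {"name": "llama3.2:8b", "size": 5137025024, "size_vram": 5137025024}
        ]
    })
    print(ps_to_prometheus(sample), end="")
```

A small HTTP handler that polls `/api/ps` and returns this text on `/metrics` would complete the exporter.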

## Comparison with Alternatives

| Feature | Ollama | llama.cpp | vLLM | LocalAI |
|---------|--------|-----------|------|---------|
| **Single-Stream tok/s** | … | ~65 (Q4) | ~71 (FP16) | ~38 (Q4) |
| **Multi-User Batching** | FIFO queue | FIFO queue | Continuous | FIFO queue |
| **50-User Aggregate** | ~155 tok/s | ~140 tok/s | ~920 tok/s | ~130 tok/s |
| **Modelfile/Dockerfile** | Yes | No | No | No |
| **OpenAI API Compatible** | … | … | … | … |

*Llama 3.1 8B on RTX 4090, March 2026. Sources: SitePoint, TowardsAI, LocalAI Master.*

### When to Choose What

- **Ollama**: Start here. Best developer experience, fastest setup, excellent single-user performance. Use for local development, small-team deployments, and edge devices.
- **llama.cpp**: Choose if you need maximum control over inference parameters, custom kernels, or low-level optimizations. Good for embedded systems where you compile from source.
- **vLLM**: Choose when serving 5+ concurrent users with SLA requirements. Continuous batching and PagedAttention deliver production-grade throughput that Ollama cannot match at scale.
- **LocalAI**: Choose if you need a drop-in OpenAI replacement supporting image generation (Stable Diffusion), speech-to-text (Whisper), and full API parity in a container.

## Limitations / Honest Assessment

**No built-in authentication.** Ollama assumes a trusted local network. For internet-facing deployments, you must add an authentication layer (reverse proxy, API gateway, or VPN). This is the most common production oversight.

**No continuous batching.** Under concurrent load, Ollama processes requests sequentially. At 50 concurrent users, p99 latency hits ~25 seconds compared to vLLM's ~3 seconds. Do not use Ollama as a multi-user production server without load testing.

**GGUF-only format.** Ollama only supports GGUF-quantized models. If you need FP16 inference, AWQ, or GPTQ formats, use vLLM or Transformers directly.

**No built-in model quantization.** You cannot quantize a model within Ollama. Convert models to GGUF externally (using `llama.cpp/convert_hf_to_gguf.py` or similar), then import via `ollama create`.

**Memory management is static.** `OLLAMA_MAX_LOADED_MODELS` controls how many models stay resident, but there is no dynamic VRAM balancing. On a 12 GB GPU, loading a 70B model (even Q4) will OOM; Ollama does not automatically offload layers to CPU.

**Limited tool calling support.** While tool calling is available for compatible models (Llama 3.1+, Mistral), the implementation […]
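
The 12 GB versus 70B point in the memory-management limitation can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight size as parameters times bits per weight; the figure of about 4.5 bits per weight for Q4-style quantization and the 1 GB runtime overhead are assumptions made for illustration, and real usage also depends on context length and KV cache size.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (decimal): params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """Rough check: weights plus an assumed fixed overhead must fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Assumed ~4.5 bits per weight for a Q4-style quantization.
    print(f"70B: {weights_gb(70, 4.5):.1f} GB, fits in 12 GB: {fits_in_vram(70, 4.5, 12)}")
    print(f"7B:  {weights_gb(7, 4.5):.1f} GB, fits in 8 GB:  {fits_in_vram(7, 4.5, 8)}")
```

Under these assumptions a 70B model needs roughly 39 GB for weights alone, more than three times a 12 GB card, while a 7B model lands near 4 GB, which is in line with the 4.5-5 GB figure quoted in the FAQ once overhead is added.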

## FAQ

**Q: Can I run Ollama without a GPU?**
A: Yes. Ollama falls back to CPU inference automatically. Performance depends on your CPU: an Intel i7-13700K generates ~8–12 tok/s with a 7B Q4 model. Apple Silicon M3 Pro achieves ~25 tok/s on CPU/Neural Engine.

**Q: How do I update Ollama to the latest version?**
A: On macOS, run `brew upgrade ollama`. On Linux, re-run the install script: `curl -fsSL https://ollama.com/install.sh | sh`. The script preserves your downloaded models in `~/.ollama/models/`.

**Q: Is Ollama suitable for production use?**
A: For single-purpose deployments (one model, one user, predictable […]

**Q: Does Ollama support vision models for image analysis?**
A: Yes. Vision models like LLaVA 1.7, Qwen2-VL, and InternVL2.5 are supported. Pass image data as base64 in the chat API request. Note that vision models require significantly more VRAM (add ~2–4 GB overhead).

### Self-Hosting Note

Running this on your own VPS? Try DigitalOcean with $200 free credit, enough for 2 months of moderate self-hosting to test the setup risk-free. Best for low-medium traffic; scale to dedicated when you outgrow it.

## Conclusion

Ollama removes the friction from local LLM deployment. One command installs it, one command pulls a model, and one command runs it. The Modelfile system gives you reproducible model customization. The OpenAI-compatible API means your existing LangChain, Open WebUI, and Continue.dev integrations work with a single URL change.

For solo developers and small teams, Ollama is the pragmatic starting point.

*If you purchase services through these links, dibi8 may earn a commission at no additional cost to you.*







## Recommended Hosting & Infrastructure

Before you deploy any of the tools above into production, you'll need solid infrastructure. Two options dibi8 actually uses and recommends:

- **DigitalOcean**: $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
- **HTStack**: Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com, battle-tested in production.

*Affiliate links: they don't cost you extra and they help keep dibi8.com running.*

## Sources & Further Reading

- Ollama Official Documentation: https://docs.ollama.com
- Ollama GitHub Repository: https://github.com/ollama/ollama
- Ollama Model Library: https://ollama.com/search
- Ollama REST API Reference: https://docs.ollama.com/api
- Modelfile Reference: `https://docs.…`
- Ollama vs LocalAI: https://zenvanriel.com/ai-engineer-blog/ollama-vs-localai-comparison-local-model-deployment/
- Open WebUI GitHub: https://github.com/open-webui/open-webui
- LangChain Ollama Integration: https://python.langchain.com/docs/integrations/chat/ollama
- Continue.dev Documentation: https://docs.continue.dev

<!--auto-references-->
## References & Sources

- [Ollama](https://github.com/ollama/ollama)
- [Open WebUI](https://github.com/open-webui/open-webui)
- llama.cpp: `https://github.com/ggml-org…`