Ollama: 137K+ Stars — Run LLMs Locally with One Command, Complete Setup Guide 2026 | dibi8

Running large language models used to mean wrestling with Python environments, CUDA drivers, and gigabytes of dependencies. In 2026, that friction is gone. Ollama lets you pull, configure, and serve production-grade LLMs with a single command — no PyTorch installation, no manual GPU tuning, no Docker mandatory. With 137,000+ GitHub stars and a thriving ecosystem of integrations, Ollama has become the default runtime for developers who want local inference without operational headaches.

This guide walks through the complete Ollama setup: installation, Docker deployment, Modelfile customization, API integration with popular frameworks, production hardening, and honest benchmarks against alternatives. Whether you are building a coding assistant, a RAG pipeline, or a self-hosted ChatGPT alternative, this tutorial gives you the commands and configs to go from zero to running models in under five minutes.

This Ollama tutorial covers the complete setup from installation to production deployment in a single guide.

What Is Ollama? #

Ollama is an open-source runtime for running large language models locally. It wraps the inference engines (llama.cpp for CPU/GPU, MLX on Apple Silicon, ROCm on AMD) behind a simple CLI and REST API, so developers can focus on building applications instead of managing model weights, quantization formats, and hardware acceleration. Think of it as Docker for LLMs: pull a model, run it, done.

Created by Jeffrey Morgan and the Ollama team in 2023, the project reached 137,000+ stars on GitHub by mid-2026. It supports hundreds of models including Llama 3, DeepSeek R1, Mistral, Qwen, Gemma, and CodeLlama — all available through the Ollama model library.

How Ollama Works #

Ollama’s architecture follows a client-server model. A background daemon (ollama serve) manages model downloads, memory allocation, and inference. The CLI and REST API are thin clients that communicate with this daemon over HTTP on port 11434.

Core Architecture #

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Client    │────▶│ ollama serve │────▶│  llama.cpp/MLX  │
│  (CLI/API)  │     │   (port     │     │  (inference     │
│             │◄────│   11434)    │◄────│   backend)      │
└─────────────┘     └──────────────┘     └─────────────────┘
                           │
                    ┌──────┴──────┐
                    │  ~/.ollama/ │
                    │  (models,   │
                    │   blobs)    │
                    └─────────────┘

Key components:

Model Hub: Curated GGUF models pulled from ollama.com. Each model is identified by a name:tag pair (e.g., llama3.2:8b).
Modelfile: A declarative config (like Dockerfile) specifying base model, system prompt, parameters, and chat templates.
Inference Backends: Automatic selection of llama.cpp (CUDA/ROCm/CPU), MLX (Apple Silicon), or Metal based on available hardware.
REST API: OpenAI-compatible endpoints at /api/generate, /api/chat, /api/embed, and /v1/chat/completions.

Model Storage #

Models are stored in ~/.ollama/models/ as content-addressable blobs (SHA-256 digests). A manifest file tracks which blobs belong to which model tag. This deduplication means two models sharing the same base weights only store one copy on disk.

Installation & Setup #

macOS #

# Using Homebrew (recommended)
brew install ollama

# Or download the native app from ollama.com/download

Linux (One-Line Installer) #

curl -fsSL https://ollama.com/install.sh | sh

This installs the binary, registers a systemd service, and auto-detects GPU capabilities (NVIDIA CUDA, AMD ROCm, or CPU-only).

Windows #

Download the installer from ollama.com/download. Windows 11/12 with WSL2 is recommended for full compatibility.

Verify Installation #

ollama --version
# ollama version 0.6.7

# Start the daemon (if not already running)
ollama serve

# Pull and run your first model
ollama run llama3.2:8b

The first time you run a model, Ollama downloads it. A quantized 8B parameter model like llama3.2:8b requires approximately 4.9 GB of disk space and runs comfortably on 8 GB VRAM.

Quick Model Selection by Hardware #

Hardware	Recommended Model	Command
6–8 GB VRAM	Qwen3 8B	`ollama run qwen3:8b`
10–12 GB VRAM	Llama 3.1 8B Q4	`ollama run llama3.1:8b`
16+ GB VRAM	DeepSeek-R1 14B	`ollama run deepseek-r1:14b`
CPU only, 16 GB RAM	Phi-4 Mini 3.8B	`ollama run phi4-mini`
Apple M3/M4 36 GB	Llama 3.1 70B Q4	`ollama run llama3.1:70b`

Integration with Popular Tools #

Open WebUI (ChatGPT-Style Interface) #

Open WebUI is the most popular frontend for Ollama, providing a ChatGPT-like web interface with RAG, voice input, and multi-user support.

# Run Open WebUI with Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access at http://localhost:3000. Open WebUI auto-discovers your Ollama instance at http://host.docker.internal:11434.

LangChain (Python) #

# Install
pip install langchain-ollama

# Chat model
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2:8b",
    temperature=0.7,
    base_url="http://localhost:11434"
)

response = llm.invoke("Explain quantum computing in one paragraph.")
print(response.content)

# Embeddings
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("Hello world")
# Returns a 768-dimensional float vector

Continue.dev (VS Code/Cursor AI Coding Assistant) #

Add to ~/.continue/config.json:

{
  "models": [
    {
      "title": "Llama 3.2",
      "provider": "ollama",
      "model": "llama3.2:8b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeQwen",
    "provider": "ollama",
    "model": "codeqwen:7b-code"
  }
}

Dify (Self-Hosted AI Workflow Platform) #

In Dify’s Settings > Model Provider > Ollama, configure:

Model Name: llama3.2:8b
Base URL: http://host.docker.internal:11434
Context Window: 8192

cURL / REST API Direct Usage #

# Generate text
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2:8b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7
}'

# Generate embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["The sky is blue", "Grass is green"]
}'

Docker Setup for Production #

Basic Docker Compose #

# docker-compose.yml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:0.6.7
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
    restart: unless-stopped
    # NVIDIA GPU support
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - openwebui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  openwebui_data:

Start with docker compose up -d.

NVIDIA GPU Setup #

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

AMD ROCm GPU Setup #

Use the ROCm-specific image tag:

services:
  ollama:
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0

Multi-Model Concurrent Serving #

services:
  ollama:
    image: ollama/ollama:0.6.7
    environment:
      - OLLAMA_NUM_PARALLEL=4      # 4 concurrent requests
      - OLLAMA_MAX_LOADED_MODELS=2  # Keep 2 models in VRAM
      - OLLAMA_KEEP_ALIVE=30m      # Unload after 30min idle
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Modelfile: Customizing Models #

A Modelfile is Ollama’s declarative configuration format. It defines how a model behaves: system prompt, sampling parameters, context window, and chat template.

Basic Modelfile Example #

# Modelfile
FROM llama3.2:8b

# System prompt defines personality
SYSTEM """You are a senior software engineer. Be concise, 
practical, and always include working code examples."""

# Parameter tuning
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"

# Custom template (optional — inherits from base if omitted)
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

Build and run:

# Create the custom model
ollama create senior-dev -f Modelfile

# Run it
ollama run senior-dev

# View the effective Modelfile
ollama show senior-dev --modelfile

Advanced: Code Review Assistant #

# Modelfile.code-review
FROM codellama:7b-code

SYSTEM """You are a code review assistant. Analyze the provided code for:
1. Bugs and logic errors
2. Security vulnerabilities (SQL injection, XSS, buffer overflow)
3. Performance issues (N+1 queries, unnecessary allocations)
4. Style and readability

Format your response as:
- [CRITICAL] for bugs/security
- [WARN] for performance
- [INFO] for style suggestions

Always suggest a fix for [CRITICAL] and [WARN] items."""

PARAMETER temperature 0.1
PARAMETER num_ctx 8192
PARAMETER num_predict 2048

ollama create code-reviewer -f Modelfile.code-review

Creating from a Local GGUF File #

# Modelfile.local
FROM ./my-fine-tuned-model-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a helpful assistant specialized in medical terminology."

ollama create med-assistant -f Modelfile.local

Inspecting Existing Models #

# Show model details and Modelfile
ollama show llama3.2:8b --modelfile

# Show parameters only
ollama show llama3.2:8b --parameters

# Show system prompt
ollama show llama3.2:8b --system

# List all local models
ollama list

# Show running models
ollama ps

Benchmarks / Real-World Use Cases #

Single-User Throughput (RTX 4090, Llama 3.1 8B) #

Tool	Format	Tokens/sec	Setup Time
Ollama	Q4_K_M	~62 tok/s	< 2 min
vLLM	FP16	~71 tok/s	~10 min
llama.cpp (CLI)	Q4_K_M	~65 tok/s	~5 min
LocalAI	Q4_K_M	~38 tok/s	~15 min

Source: SitePoint benchmark, March 2026. Single-stream generation, 256-token output.

Concurrent Load (50 Users, RTX 4090) #

Tool	Aggregate tok/s	p99 Latency	Architecture
Ollama	~155 tok/s	~24.7s	FIFO queue
vLLM	~920 tok/s	~2.8s	Continuous batching
llama.cpp server	~140 tok/s	~26s	FIFO queue
LocalAI	~130 tok/s	~28s	FIFO queue

Ollama processes requests sequentially; vLLM’s continuous batching provides 6x throughput under concurrent load. For single-user development, the gap is only ~13%.

Memory Footprint (7B Parameter Model) #

Tool	Idle RAM	Loaded RAM	Cold Start
Ollama	150 MB	5.2 GB	2s
vLLM	400 MB	5.5 GB	5s
LocalAI	400 MB	5.5 GB	8s
LM Studio	800 MB	5.8 GB	5s

Real-World Deployment Patterns #

Individual Developer: Ollama + Continue.dev for AI-assisted coding. Latency < 50ms for autocomplete suggestions.
Small Team (5–10 users): Ollama on a shared GPU workstation + Open WebUI. Handles ~50 requests/hour comfortably.
Edge/Raspberry Pi 5: Ollama CPU-only with Phi-4 Mini (3.8B). ~8 tok/s, runs entirely offline.
CI/CD Pipeline: Ollama in Docker for automated code review. Pulls code-reviewer model, processes PR diffs via API.

Advanced Usage / Production Hardening #

Environment Variables #

# Core settings
OLLAMA_HOST=0.0.0.0:11434          # Bind to all interfaces
OLLAMA_KEEP_ALIVE=24h               # Keep models loaded for 24 hours
OLLAMA_NUM_PARALLEL=4               # Max concurrent requests
OLLAMA_MAX_LOADED_MODELS=2          # Max models in VRAM simultaneously
OLLAMA_FLASH_ATTENTION=1            # Enable Flash Attention (faster inference)

# Performance tuning
OLLAMA_GPU_OVERHEAD=200MB           # Reserve VRAM headroom
OLLAMA_DEBUG=1                      # Verbose logging

Reverse Proxy with Nginx #

server {
    listen 443 ssl http2;
    server_name ollama.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ollama.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:11434;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # WebSocket support for streaming
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        
        # Timeouts for long-running inference
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
    }
}

API Key Authentication (No Native Support) #

Ollama does not include built-in API key authentication. Add it via a reverse proxy:

# ollama-auth-proxy.py (Flask example)
from flask import Flask, request, Response
import requests

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434"
VALID_KEYS = {"sk-your-api-key-here"}

@app.route('/', defaults={'path': ''}, methods=['GET', 'POST', 'PUT', 'DELETE'])
@app.route('/<path:path>', methods=['GET', 'POST', 'PUT', 'DELETE'])
def proxy(path):
    api_key = request.headers.get('Authorization', '').replace('Bearer ', '')
    if api_key not in VALID_KEYS:
        return {"error": "Invalid API key"}, 401
    
    resp = requests.request(
        method=request.method,
        url=f"{OLLAMA_URL}/{path}",
        headers={k: v for k, v in request.headers if k != 'Host'},
        data=request.get_data(),
        stream=True
    )
    return Response(resp.iter_content(chunk_size=1024), status=resp.status_code,
                   content_type=resp.headers.get('Content-Type'))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=11435)

Monitoring with Prometheus #

Ollama exposes basic metrics via the API:

# List running models with memory usage
curl http://localhost:11434/api/ps

# Expected output:
# {
#   "models": [
#     {
#       "name": "llama3.2:8b",
#       "model": "llama3.2:8b",
#       "size": 5137025024,
#       "size_vram": 5137025024,
#       "expires_at": "2026-05-20T10:00:00Z"
#     }
#   ]
# }

For production monitoring, wrap the api/ps endpoint with a Prometheus exporter or use the ollamaMQ proxy with built-in metrics.

Systemd Service (Linux) #

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=24h"

[Install]
WantedBy=default.target

sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama

Comparison with Alternatives #

Feature	Ollama	llama.cpp	vLLM	LocalAI
GitHub Stars	137K+	75K+	45K+	35K+
Setup Time	< 2 min	~5 min	~10 min	~15 min
Single-User tok/s	~62 (Q4)	~65 (Q4)	~71 (FP16)	~38 (Q4)
Multi-User Batching	FIFO queue	FIFO queue	Continuous	FIFO queue
50-User Aggregate	~155 tok/s	~140 tok/s	~920 tok/s	~130 tok/s
Apple Silicon	Native (MLX)	Native	No	Via Docker
Modelfile/Dockerfile	Yes	No	No	No
OpenAI API Compatible	Yes (partial)	No	Yes (full)	Yes (full)
Embeddings API	Yes	No	Yes	Yes
Image Generation	No	No	No	Yes (Stable Diffusion)
Best For	Dev/Prototyping	Power users	Production serving	Multi-modal APIs

Benchmarks: Llama 3.1 8B on RTX 4090, March 2026. Sources: SitePoint, TowardsAI, LocalAI Master.

When to Choose What #

Ollama: Start here. Best developer experience, fastest setup, excellent single-user performance. Use for local development, small-team deployments, and edge devices.
llama.cpp: Choose if you need maximum control over inference parameters, custom kernels, or low-level optimizations. Good for embedded systems where you compile from source.
vLLM: Choose when serving 5+ concurrent users with SLA requirements. Continuous batching and PagedAttention deliver production-grade throughput that Ollama cannot match at scale.
LocalAI: Choose if you need a drop-in OpenAI replacement supporting image generation (Stable Diffusion), speech-to-text (Whisper), and full API parity in a container.

Limitations / Honest Assessment #

No built-in authentication. Ollama assumes a trusted local network. For internet-facing deployments, you must add an authentication layer (reverse proxy, API gateway, or VPN). This is the most common production oversight.

No continuous batching. Under concurrent load, Ollama processes requests sequentially. At 50 concurrent users, p99 latency hits ~25 seconds compared to vLLM’s ~3 seconds. Do not use Ollama as a multi-user production server without load testing.

GGUF-only format. Ollama only supports GGUF-quantized models. If you need FP16 inference, AWQ, or GPTQ formats, use vLLM or Transformers directly.

No built-in model quantization. You cannot quantize a model within Ollama. Convert models to GGUF externally (using llama.cpp/convert_hf_to_gguf.py or similar), then import via ollama create.

Memory management is static. OLLAMA_MAX_LOADED_MODELS controls how many models stay resident, but there is no dynamic VRAM balancing. On a 12 GB GPU, loading a 70B model (even Q4) will OOM — Ollama does not automatically offload layers to CPU.

Limited tool calling support. While tool calling is available for compatible models (Llama 3.1+, Mistral), the implementation is less robust than OpenAI’s function calling. Complex multi-step tool workflows may require fallback handling.

Frequently Asked Questions #

Q: How much VRAM do I need to run a 7B parameter model? A: A Q4_K_M quantized 7B model requires approximately 4.5–5 GB of VRAM. For Q8 quantization, plan for 7–8 GB. CPU-only inference works with 16 GB system RAM but at roughly 3–5x slower token generation.

Q: Can I run Ollama without a GPU? A: Yes. Ollama falls back to CPU inference via llama.cpp automatically. Performance depends on your CPU: an Intel i7-13700K generates ~8–12 tok/s with a 7B Q4 model. Apple Silicon M3 Pro achieves ~25 tok/s on CPU/Neural Engine.

Q: How do I update Ollama to the latest version? A: On macOS, run brew upgrade ollama. On Linux, re-run the install script: curl -fsSL https://ollama.com/install.sh | sh. The script preserves your downloaded models in ~/.ollama/models/.

Q: Is Ollama suitable for production use? A: For single-purpose deployments (one model, one user, predictable load), yes. For multi-user production serving, consider adding a queuing proxy or switching to vLLM. Always add authentication and monitoring before exposing to a network.

Q: Can I use my own fine-tuned models with Ollama? A: Yes. Convert your model to GGUF format, then create a Modelfile pointing to it with FROM ./your-model.gguf. Run ollama create my-model -f Modelfile and it becomes available through the standard API.

Q: How does Ollama compare to OpenAI’s API in terms of output quality? A: For equivalent base models (Llama 3.1 vs GPT-3.5), output quality is competitive on coding and reasoning tasks. The gap is larger on creative writing and multi-step reasoning where GPT-4 and Claude 3.5 Sonnet still lead. Local inference eliminates latency from network round-trips.

Q: Does Ollama support vision models for image analysis? A: Yes. Vision models like LLaVA 1.7, Qwen2-VL, and InternVL2.5 are supported. Pass image data as base64 in the chat API request. Note that vision models require significantly more VRAM (add ~2–4 GB overhead).

Conclusion #

Ollama removes the friction from local LLM deployment. One command installs it, one command pulls a model, and one command runs it. The Modelfile system gives you reproducible model customization. The OpenAI-compatible API means your existing LangChain, Open WebUI, and Continue.dev integrations work with a single URL change.

For solo developers and small teams, Ollama is the pragmatic starting point. When concurrent load exceeds ~5 users, evaluate vLLM. When you need multi-modal support beyond text, evaluate LocalAI. But start with Ollama — the 137,000+ stars reflect a tool that genuinely delivers on its promise.

Next steps:

Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
Run your first model: ollama run llama3.2:8b
Deploy Open WebUI for a team chat interface
Join the dibi8 developer community on Telegram for local LLM deployment tips and troubleshooting

For cloud GPU hosting when local hardware is insufficient, DigitalOcean GPU Droplets provide NVIDIA A100/H100 instances that pair well with Ollama Docker deployments. For dedicated bare-metal GPU servers, HTStack offers competitive pricing on RTX 4090 and A100 nodes.

Affiliate Disclosure: This article contains affiliate links to DigitalOcean and HTStack. If you purchase services through these links, dibi8 may earn a commission at no additional cost to you.

Recommended Hosting & Infrastructure #

Before you deploy any of the tools above into production, you’ll need solid infrastructure. Two options dibi8 actually uses and recommends:

DigitalOcean — $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
HTStack — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.

Affiliate links — they don’t cost you extra and they help keep dibi8.com running.

Sources & Further Reading #

Ollama Official Documentation: https://docs.ollama.com
Ollama GitHub Repository: https://github.com/ollama/ollama
Ollama Model Library: https://ollama.com/search
Ollama REST API Reference: https://docs.ollama.com/api
Modelfile Reference: https://docs.ollama.com/modelfile
SitePoint — Ollama vs vLLM Benchmark 2026: https://www.sitepoint.com/ollama-vs-vllm-performance-benchmark-2026/
TowardsAI — Ollama vs vLLM vs llama.cpp: https://pub.towardsai.net/i-tested-ollama-vs-vllm-vs-llama-cpp-the-easiest-one-collapses-at-5-concurrent-users-d4f8e0e84886
LocalAI Master — Ollama vs LocalAI: https://zenvanriel.com/ai-engineer-blog/ollama-vs-localai-comparison-local-model-deployment/
Open WebUI GitHub: https://github.com/open-webui/open-webui
LangChain Ollama Integration: https://python.langchain.com/docs/integrations/chat/ollama
Continue.dev Documentation: https://docs.continue.dev