---
lang: kr
slug: tabby
description: 'Tabby is a self-hosted AI coding assistant. VS Code, JetBrains, Vim, Neovim, Ollama, DeepSeek. Docker setup, IDE integration, benchmarks, and production hardening.'
tags: ['ai-agent', 'coding', 'development', 'guide', 'open-source', 'opensource', 'reference', 'self-hosted', 'tutorial']
date: 2026-05-19 00:00:00+08:00
lastmod: 2026-05-19 00:00:00+08:00
tech_stack: []
application_domain: Llm Frameworks
source_version: ''
licensing_model: Open Source
license_type: Apache-2.0
file_size: ''
file_md5: ''
download_url: ''
backup_url: ''
github_repo: 'https://github.com/TabbyML/tabby'
last_maintained: '2026-05-19'
draft: false
categories: ['llm-frameworks']
aliases:
  - /posts/tabby/
faqs:
---


GitHub Copilot sends your proprietary code to Microsoft’s cloud. For teams handling sensitive IP (fintech, healthcare, defense, enterprise SaaS), that is a non-starter. Tabby is the open-source answer: a self-hosted AI coding assistant that runs entirely on your own hardware, so source code never leaves your network. With 33,530+ GitHub stars and an active release cadence (v0.32.0 shipped January 2026), Tabby has matured from an experimental project into a production-grade alternative to Copilot. This tutorial walks through a complete Tabby setup, from Docker deployment to IDE integration and production hardening. If you are specifically comparing Tabby with Copilot, see the Comparison with Alternatives section for feature parity and trade-offs.

What Is Tabby? #

Tabby is a self-hosted AI coding assistant and an open-source GitHub Copilot alternative. It provides real-time code completion, an answer engine for code queries, and inline chat — all running on infrastructure you control. Written in Rust (92.9% of the codebase), Tabby is designed for speed and can run on consumer-grade GPUs, Apple Silicon, or even CPU-only servers.

How Tabby Works #

Tabby consists of three core components:

Inference Server: A Rust-based HTTP server that loads coding LLMs and serves completions via an OpenAPI-compatible endpoint. It handles model inference, prompt templating, and streaming responses.
IDE Extensions: Native extensions for VS Code, JetBrains IDEs, Vim/Neovim, and Emacs that capture editor context and forward completion requests to the inference server.
Admin Dashboard: A built-in web UI for user management, API token generation, repository indexing, and usage analytics. No external database required — Tabby uses an embedded SQLite store.

The data flow is straightforward: the IDE extension captures the prefix and suffix around your cursor and sends them to the local Tabby server. The server runs inference against the loaded model (e.g., StarCoder2-3B or Qwen2.5-Coder-7B) and returns completions, typically in under 500ms on a GPU.
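
To make the request side of that flow concrete, here is a minimal Python sketch that builds the kind of request an editor extension would send. The /v1/completions path, the language and segments fields, and the Bearer token header reflect my reading of Tabby’s HTTP API and should be checked against the API documentation served by your own instance. TABBY_URL and both function names are illustrative, not part of Tabby.

```python
import json
import urllib.request
from typing import Optional

TABBY_URL = "http://localhost:8080"  # assumed: the local endpoint used in this guide


def build_completion_request(prefix: str, suffix: str, language: str,
                             token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying cursor context."""
    # Assumed request shape: the text before and after the cursor is sent
    # as a "segments" object alongside the language identifier.
    body = {"language": language, "segments": {"prefix": prefix, "suffix": suffix}}
    headers = {"Content-Type": "application/json"}
    if token:
        # Token generated in the admin dashboard (assumed Bearer scheme).
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{TABBY_URL}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def first_completion(response_json: dict) -> str:
    """Return the text of the first choice, or an empty string if there is none."""
    choices = response_json.get("choices", [])
    return choices[0].get("text", "") if choices else ""


# Build a request offline and inspect what would go over the wire.
req = build_completion_request("def fib(n):\n    ", "\n", "python")
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request is a matter of passing it to urllib.request.urlopen and feeding the decoded JSON to first_completion. It is left out here so the sketch runs without a server.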

Once the setup in this tutorial is complete, your local server handles code completion without sending any source code to third-party APIs. For developers weighing Tabby against Copilot, the key differentiator is that every inference request stays on your hardware, a requirement for organizations that classify code as sensitive data.

Installation & Setup #

Prerequisites #

Docker (recommended) or Docker Compose
NVIDIA Container Toolkit (for GPU support on CUDA systems)
4GB+ RAM for small models (1.5B params), 16GB+ for mid-range models (7B params)
10GB+ disk space for model weights
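
A quick preflight check can confirm these prerequisites before you pull any images. The sketch below is an illustrative helper, not part of Tabby: it uses only the Python standard library to look for the docker and nvidia-smi executables on PATH and to measure free disk space. The preflight name and the 10 GB default are assumptions taken from the list above.

```python
import shutil


def preflight(data_dir: str = ".", min_free_gb: float = 10.0) -> dict:
    """Report whether Docker, NVIDIA tooling, and enough disk space are present."""
    free_gb = shutil.disk_usage(data_dir).free / 1024 ** 3
    return {
        # shutil.which returns None when the executable is not on PATH.
        "docker_found": shutil.which("docker") is not None,
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "free_disk_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
    }


for check, value in preflight().items():
    print(f"{check}: {value}")
```

Finding nvidia-smi only shows that a driver is installed on the host. It does not prove that the NVIDIA Container Toolkit is configured for Docker.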

Docker Setup (5 Minutes) #

The fastest way to get Tabby running is via Docker. Below are commands for the three GPU backends, plus a CPU-only fallback.

NVIDIA GPU (CUDA) #

```bash
# Pull and run Tabby with CUDA acceleration
docker run -d \
  --name tabby \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  registry.tabbyml.com/tabbyml/tabby \
  serve \
  --model StarCoder-1B \
  --chat-model Qwen2-1.5B-Instruct \
  --device cuda
```

For systems with SELinux enabled, add the :Z flag to the volume mount:

```bash
docker run -d \
  --name tabby \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data:Z \
  registry.tabbyml.com/tabbyml/tabby \
  serve \
  --model StarCoder-1B \
  --chat-model Qwen2-1.5B-Instruct \
  --device cuda
```

Apple Silicon (Metal) #

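```bash
# Note: Docker Desktop on macOS runs containers inside a Linux VM, which
# generally cannot reach the Metal GPU. If this command fails or falls back
# to CPU, use the native Homebrew install described later in this guide.
docker run -d \
  --name tabby \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  registry.tabbyml.com/tabbyml/tabby \
  serve \
  --model StarCoder-1B \
  --chat-model Qwen2-1.5B-Instruct \
  --device metal
```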

AMD GPU (ROCm) #

```bash
docker run -d \
  --name tabby \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  registry.tabbyml.com/tabbyml/tabby-rocm \
  serve \
  --model StarCoder-1B \
  --device rocm
```

CPU-Only (Fallback) #

```bash
docker run -d \
  --name tabby \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  registry.tabbyml.com/tabbyml/tabby \
  serve \
  --model Qwen2.5-Coder-0.5B \
  --device cpu
```

Verify the Installation #

```bash
# Check server health
curl http://localhost:8080/v1/health

# View logs
docker logs -f tabby

# Open the admin dashboard (macOS; on Linux use xdg-open or a browser)
open http://localhost:8080
```

On first boot, Tabby downloads the specified model weights to $HOME/.tabby. Depending on your bandwidth, this may take 2–10 minutes. The admin dashboard will prompt you to create an admin account.
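
Because the first boot can spend several minutes downloading weights, a single curl right after docker run will often fail. The following sketch polls until the server answers. It is an illustration rather than an official client: http_probe and wait_until_ready are hypothetical helper names, and the probe assumes the health endpoint answers with HTTP 200 without authentication. If your instance requires a token there, the probe needs an Authorization header.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A typical call would be wait_until_ready(lambda: http_probe("http://localhost:8080/v1/health")). With the defaults this allows roughly ten minutes, matching the upper end of the download estimate above.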

Docker Compose (Production-Ready) #

For persistent deployments, use Docker Compose:

```yaml
version: '3.8'
services:
  tabby:
    image: registry.tabbyml.com/tabbyml/tabby
    container_name: tabby
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - $HOME/.tabby:/data
    environment:
      - TABBY_WEBSERVER_JWT_TOKEN_SECRET=CHANGE_ME_TO_RANDOM_STRING
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      serve
      --model StarCoder2-3B
      --chat-model Qwen2.5-Coder-7B-Instruct
      --device cuda
      --parallelism 4
```

Generate a secure JWT secret:

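```bash
openssl rand -hex 32

# Some Tabby releases appear to expect this secret in UUID format.
# If the server rejects the value at startup, try: uuidgen
```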

Deploy:

```bash
docker compose up -d
```

Homebrew (macOS Native) #

If you prefer not to use Docker on macOS:

```bash
# Install via Homebrew
brew install tabbyml/tabby/tabby

# Run with Metal acceleration
tabby serve \
  --model StarCoder2-3B \
  --chat-model Qwen2-1.5B-Instruct \
  --device metal

# Verify
curl http://localhost:8080/v1/health
```

Integration with VS Code, JetBrains, Vim, and Ollama #

Tabby’s IDE extensions connect your editor to the local inference server via HTTP. This section covers the four most popular editors and how to use Ollama as a flexible model backend.

VS Code #

Open the Extensions marketplace, search for “Tabby”, and install the extension by TabbyML.
Open Settings (Ctrl+,), search for “Tabby”, and set the Server Endpoint to http://localhost:8080.
The status bar will show a Tabby icon when connected. Start typing to receive completions.

JetBrains IDEs (IntelliJ, PyCharm, GoLand) #

Open Settings → Plugins → Marketplace, search for “Tabby”, and install.
Restart the IDE.
Navigate to Settings → Tools → Tabby and enter your server endpoint URL (e.g., http://localhost:8080).
Generate an API token from the Tabby admin dashboard and paste it into the IDE settings.

Vim / Neovim #

For Neovim with nvim-cmp and cmp-tabby:

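```lua
-- In your Neovim config (e.g., init.lua)
-- The source name and the URL variable depend on the plugin and its version;
-- check the plugin README if completions do not appear.
require('cmp').setup({
  sources = {
    { name = 'tabby' },
  },
})

-- Configure Tabby server URL
vim.g.tabby_server_url = 'http://localhost:8080'
```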

Using Ollama as a Backend #

Tabby can delegate inference to Ollama, which enables dynamic model switching and multi-model management:

```toml
# ~/.tabby/config.toml
[model.completion.http]
kind = "ollama/completion"
model_name = "deepseek-coder:6.7b"
api_endpoint = "http://localhost:11434"
# This fill-in-the-middle template uses CodeLlama-style tokens. DeepSeek Coder
# models define their own FIM tokens, so check the model card and adjust.
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"

[model.chat.http]
kind = "openai/chat"
model_name = "qwen2.5-coder:7b"
api_endpoint = "http://localhost:11434/v1"
```
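
The prompt_template value is a fill-in-the-middle template: the text before the cursor replaces {prefix} and the text after it replaces {suffix}. The sketch below shows that substitution in isolation. It is an illustration of how such templates are generally filled, not Tabby’s internal code, and render_fim_prompt is a hypothetical name.

```python
import re

# Template string copied from the completion section of the config.
FIM_TEMPLATE = "<PRE> {prefix} <SUF>{suffix} <MID>"


def render_fim_prompt(template: str, prefix: str, suffix: str) -> str:
    """Substitute editor context into a fill-in-the-middle template."""
    values = {"prefix": prefix, "suffix": suffix}
    # A single regex pass over the template means braces that appear inside
    # the user's own code are never treated as placeholders.
    return re.sub(r"\{(prefix|suffix)\}", lambda m: values[m.group(1)], template)


prompt = render_fim_prompt(FIM_TEMPLATE, "def add(a, b):\n    return ", "\n")
print(repr(prompt))
```

The model then generates the text that belongs between the two halves, which is why the special tokens in the template have to match the ones the model was trained with.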

Start Ollama with the required models:

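```bash
# The pull commands talk to a running Ollama server. Start it first
# (in a separate terminal) unless it already runs as a background service.
ollama serve

ollama pull deepseek-coder:6.7b
ollama pull qwen2.5-coder:7b
```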
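
Before pointing Tabby at Ollama, it can help to confirm that both models are actually available. The sketch below compares the model names from the config against the list Ollama reports from its /api/tags endpoint. The helper names are hypothetical, and the response shape (a models array whose entries carry a name field) is an assumption to verify against the Ollama API documentation.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # Ollama port used in this guide


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(tags_response: dict, required: List[str]) -> List[str]:
    """Return the required model names that Ollama does not list."""
    available = {entry.get("name") for entry in tags_response.get("models", [])}
    return [name for name in required if name not in available]


# Offline demonstration with a hand-written response of the assumed shape.
sample = {"models": [{"name": "deepseek-coder:6.7b"}]}
print(missing_models(sample, ["deepseek-coder:6.7b", "qwen2.5-coder:7b"]))
```

Against a live server, missing_models(fetch_tags(), [...]) returning an empty list means both pulls completed.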

Then start Tabby without specifying --model (it reads from config.toml):

```bash
tabby serve --device cuda
```

This setup is ideal when you want to run multiple models on a single GPU with limited VRAM — Ollama handles model loading and unloading dynamically.

Benchmarks / Real-World Use Cases #

Tabby’s throughput and latency depend heavily on model size and GPU generation. The numbers below were collected from community benchmarks and internal testing, and all assume a warm model cache (second request onward).

| Model | Parameters | Memory | Latency | Accept rate | Best for |
|---|---|---|---|---|---|
| Qwen2.5-Coder-0.5B | 0.5B | 2 GB | ~200ms | 18% | CPU-only setups, rapid testing |
| StarCoder-1B | 1B | 3 GB | ~180ms | 22% | Low-resource deployments |
| StarCoder2-3B | 3B | 6 GB | ~250ms | 28% | Balanced quality/speed |
| Qwen2.5-Coder-7B | 7B | 14 GB | ~350ms | 35% | High-quality completions |
| DeepSeekCoder-6.7B | 6.7B | 13 GB | ~380ms | 33% | Python/JS focused projects |

Accept rate measures how often a developer accepts a Tabby suggestion versus ignoring or modifying it. For comparison, GitHub Copilot’s reported accept rate ranges from 30–40% depending on language.
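
If you want to track this metric for your own team, the calculation is simple: accepted suggestions divided by suggestions shown, broken down by language. The sketch below assumes you already have a list of (language, accepted) events from somewhere. That event format and the accept_rates name are hypothetical, and this is not how Tabby’s dashboard computes its figures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accept_rates(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accepted / shown per language from (language, accepted) events."""
    shown: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, int] = defaultdict(int)
    for language, was_accepted in events:
        shown[language] += 1
        if was_accepted:
            accepted[language] += 1
    # Every language in `shown` has at least one event, so no division by zero.
    return {language: accepted[language] / shown[language] for language in shown}


events = [("python", True), ("python", False), ("go", True),
          ("python", False), ("python", True)]
for language, rate in accept_rates(events).items():
    print(f"{language}: {rate:.0%}")
```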

Deployment Scenarios #


For hosting the server infrastructure, consider providers like DigitalOcean for straightforward GPU-less deployments or HTStack for GPU-accelerated cloud instances. Both work well with the tabby docker setup described above.

Advanced Usage / Production Hardening #

Repository Context Indexing #

Tabby’s killer feature for teams is repository-level context indexing. It clones and indexes your Git repositories, then uses RAG (Retrieval-Augmented Generation) to surface relevant internal code snippets during completion.
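
The retrieval idea can be illustrated with a toy example. The sketch below ranks stored snippets by token overlap (Jaccard similarity) with the code being typed and returns the best matches, which a server could then prepend to the model prompt. This is a deliberately simplified stand-in: Tabby’s real index and ranking are more sophisticated, and every name and snippet here is made up for illustration.

```python
import re
from typing import Dict, List, Tuple


def tokenize(text: str) -> set:
    """Split source text into a set of lowercase identifier-like tokens."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def top_snippets(query: str, snippets: Dict[str, str],
                 k: int = 2) -> List[Tuple[str, float]]:
    """Rank snippets by Jaccard overlap with the query and return the top k."""
    query_tokens = tokenize(query)
    scored = []
    for path, body in snippets.items():
        body_tokens = tokenize(body)
        union = query_tokens | body_tokens
        score = len(query_tokens & body_tokens) / len(union) if union else 0.0
        scored.append((path, score))
    # Highest score first; ties are broken alphabetically by path.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical indexed repository content.
index = {
    "billing/invoice.py": "def total_with_tax(amount, rate): return amount * (1 + rate)",
    "auth/session.py": "def create_session(user_id): return token_for(user_id)",
}
for path, score in top_snippets("def invoice_total(amount, rate):", index):
    print(f"{path}: {score:.3f}")
```

The billing snippet ranks first because it shares amount and rate with the query, which is the kind of internal context that helps a model suggest code matching your own conventions.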

Add repositories via the admin dashboard: navigate to Repositories → Add Git URL. GitHub, GitLab, and self-hosted Git instances are supported.

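Or trigger an indexing run from the CLI. The binary path and the scheduler subcommand shown here may differ between Tabby releases, so confirm them against your image before relying on this command:

```bash
docker exec tabby /opt/tabby/bin/tabby-cpu scheduler --now
```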

Security Hardening #

Change the default JWT secret: Set TABBY_WEBSERVER_JWT_TOKEN_SECRET to a cryptographically random 32-byte hex string.
Run behind a reverse proxy with TLS termination:

```nginx
# Nginx example
server {
    listen 443 ssl;
    server_name tabby.yourcompany.com;

    ssl_certificate /etc/letsencrypt/live/tabby.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/tabby.yourcompany.com/privkey.pem;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

Enable LDAP/SSO authentication (Enterprise feature) for team-wide access control.
Set resource limits on the Docker container:

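```bash
# Add --memory and --cpus to the docker run command used earlier;
# the remaining flags (ports, volumes, image, serve options) are unchanged.
docker run -d \
  --memory=24g \
  --cpus=8 \
  ...
```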

Performance Tuning #

```bash
# Increase parallelism for concurrent team requests
tabby serve \
  --model StarCoder2-3B \
  --device cuda \
  --parallelism 4

# Use half-precision (FP16) to reduce VRAM usage
# (flag availability varies by release; confirm with: tabby serve --help)
tabby serve \
  --model StarCoder2-3B \
  --device cuda \
  --dtype float16
```

Monitoring #

```bash
# Check API health
curl http://localhost:8080/v1/health

# Docker stats
docker stats tabby

# View recent logs with errors only
docker logs tabby 2>&1 | grep ERROR
```
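
For scheduled checks, the same error filter can be scripted. The sketch below mirrors the docker logs and grep pipeline in Python. It assumes only that error lines contain the literal marker ERROR, as the grep command above does, and that the container is named tabby. The function names are illustrative.

```python
import subprocess
from typing import Iterable, List


def error_lines(lines: Iterable[str], marker: str = "ERROR") -> List[str]:
    """Keep only the lines containing the marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if marker in line]


def read_container_logs(name: str = "tabby", tail: int = 500) -> List[str]:
    """Return the last `tail` log lines of a container (stdout and stderr)."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), name],
        capture_output=True, text=True, check=False,
    )
    return (result.stdout + result.stderr).splitlines()


# Offline demonstration with made-up log lines.
sample_logs = ["INFO server started\n", "ERROR model load failed\n", "INFO ready\n"]
print(error_lines(sample_logs))
```

Calling error_lines(read_container_logs()) on the Docker host gives the recent errors, which can then be forwarded to whatever alerting you already use.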

Comparison with Alternatives #


Tabby is the only option in this group that keeps 100% of your code on-premises. That distinction matters if you work under SOC 2, HIPAA, ITAR, or similar compliance frameworks.

Limitations / Honest Assessment #

Tabby is not a drop-in replacement for every Copilot use case. Be aware of the following trade-offs:

Smaller models lag on complex reasoning: A 3B parameter model will not match GPT-4 on multi-file refactoring or architectural suggestions. For those tasks, you may still want a cloud-based chat tool.
Infrastructure burden: You are responsible for GPU maintenance, model updates, and server uptime. There is no SaaS fallback if your server goes down.
No chat in base install: The chat/answer engine requires a separate chat model and additional VRAM. Plan your GPU sizing accordingly.
Enterprise SSO costs: LDAP and advanced SSO are part of Tabby’s paid enterprise tier, not the open-source core.
Limited mobile support: There is no iOS/Android equivalent to Copilot’s mobile code review features.

Frequently Asked Questions #

What hardware do I need to run Tabby? #

For individual use, an M-series MacBook with 16GB RAM or a machine with an NVIDIA GPU with 8GB+ VRAM handles the StarCoder2-3B model comfortably. For team deployments, allocate 4GB VRAM per concurrent user as a rule of thumb. A 7B model on an RTX 4090 (24GB) supports 4–6 developers simultaneously.
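
The rule of thumb translates directly into arithmetic. The sketch below encodes only the 4GB-per-user figure from the paragraph above. It ignores everything else that affects real capacity (model size, context length, request patterns), so treat the result as an upper bound for planning rather than a guarantee. Both function names are illustrative.

```python
import math

VRAM_PER_USER_GB = 4.0  # rule-of-thumb figure quoted in this FAQ


def max_concurrent_users(total_vram_gb: float,
                         per_user_gb: float = VRAM_PER_USER_GB) -> int:
    """Upper bound on concurrent users for a given amount of VRAM."""
    return int(total_vram_gb // per_user_gb)


def required_vram_gb(users: int, per_user_gb: float = VRAM_PER_USER_GB) -> int:
    """VRAM needed for a target number of concurrent users, rounded up."""
    return math.ceil(users * per_user_gb)


print(max_concurrent_users(24))   # a 24GB card such as the RTX 4090
print(required_vram_gb(10))       # VRAM for ten concurrent developers
```

For a 24GB card this gives six users, the top of the 4–6 range quoted above.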

Can I use Tabby completely offline? #

Yes. After the initial model download, Tabby operates entirely without internet access. The inference server, IDE extensions, and admin dashboard all run on your local network. This is one of Tabby’s primary advantages for air-gapped environments.

How does Tabby compare to GitHub Copilot in accuracy? #

On single-file completions with a 7B model, Tabby achieves accept rates within 5–10% of Copilot. Where Copilot pulls ahead is multi-file context and complex refactoring — tasks that benefit from GPT-4-scale models. For routine line-by-line completions, the gap is negligible.

Can I use my own fine-tuned models? #

Yes. Tabby can load a local model from a directory that follows the layout in MODEL_SPEC.md, or it can call a model you host yourself through an HTTP backend such as an OpenAI-compatible API. See MODEL_SPEC.md in the Tabby repository for the exact format requirements.

Is Tabby suitable for large enterprise teams? #

Tabby scales to 50+ users with proper hardware (multi-GPU server) and the --parallelism flag. The admin dashboard supports user management, API token rotation, and usage analytics. For SSO/LDAP integration, you will need the enterprise license.

How do I update Tabby to a new version? #

```bash
# Pull the latest image
docker pull registry.tabbyml.com/tabbyml/tabby

# Restart the container
docker compose down
docker compose up -d

# Verify the new version
curl http://localhost:8080/v1/health
```

Conclusion #

Tabby fills a critical gap in the AI coding assistant market: a fully open-source, self-hosted tool that keeps your code inside your perimeter. With 33,530+ stars, active Rust-based development, and support for the latest coding models (Qwen2.5-Coder, DeepSeek, StarCoder2), it is ready for production use in privacy-conscious teams.

**Action items to get started:**

Run the Docker command from the Docker Setup section to spin up Tabby on your local machine.
Install the IDE extension for your editor and connect to http://localhost:8080.
Index a test repository from the admin dashboard to experience RAG-powered completions.
Join the Tabby community on Telegram for deployment tips and model recommendations.

This article contains affiliate links to hosting providers. These recommendations are based on technical suitability for self-hosted AI workloads, not commercial partnerships.

Recommended Hosting & Infrastructure #

Before you deploy any of the tools above into production, you’ll need solid infrastructure. Two options dibi8 actually uses and recommends:

DigitalOcean — $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
HTStack — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.

Affiliate links — they don’t cost you extra and they help keep dibi8.com running.
