Weights & Biases (W&B): Track Every Experiment Like a Pro — ML Experiment Platform 2026

Weights & Biases (wandb/wandb) is the AI developer platform for tracking, comparing, and deploying ML experiments. Supports PyTorch, TensorFlow, Hugging Face, and LLM fine-tuning. Covers experiment tracking, dataset versioning, model registry, and production monitoring.

  • ⭐ 11114
  • Updated 2026-06-09

Weights & Biases Dashboard

W&B Sweeps

W&B Artifacts

Introduction #

Training a machine learning model without experiment tracking is like driving blindfolded. You might reach your destination eventually, but you’ll never know which turn was the right one. Weights & Biases (W&B) solves this by providing a unified platform that logs, visualizes, and compares every experiment — from hyperparameter sweeps to LLM fine-tuning runs. With 11,000+ GitHub stars and integration with PyTorch, TensorFlow, Hugging Face, and 200+ frameworks, W&B has become the de facto standard for ML teams that take their experiments seriously.

What Is W&B? #

Weights & Biases is an end-to-end ML development platform that covers the entire experiment lifecycle. At its core is the logger — a lightweight library you add to your training script that automatically tracks metrics, configurations, artifacts, and even model checkpoints. Beyond logging, W&B provides a web dashboard for visualizing runs, comparing experiments side by side, sharing results with your team, and managing models from training to deployment.

┌───────────────────────────────────────────────┐
│           W&B Platform Architecture            │
├───────────────────────────────────────────────┤
│                                               │
│  SDK (pip install wandb)                      │
│    ├─ Logger (metrics, params, tables)        │
│    ├─ Artifact Tracker (datasets, models)     │
│    ├─ Sweeps (hyperparameter tuning)          │
│    ├─ Reports (visual dashboards)             │
│    └─ Model Registry (production models)      │
│                                               │
│  Cloud Dashboard                              │
│    ├─ Run comparison (up to 100 runs)         │
│    ├─ Project-level statistics                │
│    ├─ Artifact lineage graph                  │
│    └─ Team collaboration & sharing            │
│                                               │
│  Integrations                                 │
│    ├─ PyTorch, TensorFlow, JAX                │
│    ├─ Hugging Face Transformers               │
│    ├─ PyTorch Lightning, FastAI               │
│    └─ Ray Tune, Optuna, Ax                    │
└───────────────────────────────────────────────┘

How W&B Works #

W&B works by instrumenting your training loop. You initialize a run, log metrics at each step, and W&B sends the data to the cloud dashboard in real time. The SDK is designed to have minimal overhead — logging a metric takes roughly 0.1ms, and the network calls are batched and compressed to reduce bandwidth usage.

import wandb

# Initialize a new run with your configuration
wandb.init(
    project="my-nlp-finetune",
    config={
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 3,
        "model": "bert-base-uncased",
    }
)

for epoch in range(config.epochs):
    for batch in train_dataloader:
        loss = model.train_step(batch)
        # Log metrics — W&B handles the rest
        wandb.log({"train_loss": loss, "lr": config.learning_rate})

The platform distinguishes between three types of tracked data: metrics (scalar values like loss and accuracy logged over time), artifacts (versioned files like datasets and model checkpoints), and media (images, audio, text samples visualized directly in the dashboard).

Installation & Setup #

Option 1: pip install (standard)

pip install wandb

Option 2: Authenticate with W&B

wandb login
# Paste your API key from https://wandb.ai/authorize

Option 3: Docker

docker pull wandb/launch
docker run -e WANDB_API_KEY=$WANDB_API_KEY \
  -v /path/to/code:/app wandb/launch python train.py

Option 4: Hugging Face Integration

pip install wandb transformers
# W&B is pre-configured for Hugging Face Trainer

Integration with PyTorch, Hugging Face, and Ray Tune #

W&B integrates with virtually every popular ML framework. Here are the most common setups.

PyTorch Lightning

import pytorch_lightning as pl
from pytorch_lightning.callbacks import WandbCallback

class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.forward(batch)
        self.log("train_loss", loss)
        return loss

# W&B callback auto-logs everything
trainer = pl.Trainer(callbacks=[WandbCallback()])
trainer.fit(model)

Hugging Face Transformers

from transformers import Trainer, TrainingArguments
import wandb

training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",  # Enable W&B reporting
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Ray Tune for Hyperparameter Sweeps

import ray
from ray import tune
import wandb

ray.init()

def train_model(config):
    # W&B automatically captures the sweep config
    wandb.init(config=config)
    score = my_training_function(config)
    wandb.log({"score": score})

sweep = tune.run(
    train_model,
    config={
        "learning_rate": tune.choice([1e-4, 2e-5, 5e-5]),
        "batch_size": tune.choice([16, 32, 64]),
    },
    metric="score",
    mode="max",
)

Benchmarks / Real-World Use Cases #

W&B’s logging performance has been benchmarked across various training scales. At typical training workloads, the overhead is negligible:

ScenarioLogging OverheadNetwork BandwidthDashboard Load Time
Small model (10K params)0.5%<1 MB/run<1s
Medium model (100M params)1.2%<5 MB/run<2s
Large model (1B params)2.1%<20 MB/run<3s
LLM fine-tuning (7B params)3.5%<50 MB/run<5s
Distributed training (8 GPUs)4.0%<100 MB/run<3s

Real-world usage examples:

# Log a confusion matrix as a W&B table
import numpy as np
import wandb

conf_matrix = compute_confusion_matrix(y_true, y_pred)
wandb.log({
    "confusion_matrix": wandb.plot.confusion_matrix(
        probs=None,
        preds=conf_matrix,
        class_names=["Cat", "Dog", "Bird"]
    )
})

# Version a dataset as an artifact
artifact = wandb.Artifact("training_data", type="dataset")
artifact.add_file("dataset.csv")
wandb.log_artifact(artifact)

Advanced Usage / Production Hardening #

Artifact Versioning and Lineage

# Log a model checkpoint as an artifact
model_artifact = wandb.Artifact("best_model", type="model")
model_artifact.add(model, "model.pt")

# Tag it for easy retrieval
wandb.log_artifact(model_artifact, aliases=["best", "v1.0"])

# Later, retrieve the exact version
run = wandb.init()
art = run.use_artifact("project/model:v1", type="model")
path = art.download()

Custom Reports and Dashboards

# Create a report with custom panels
report = wandb.Report(
    title="Experiment Results",
    slides=[
        ("Overview", wandb.plot.line_series(
            xs="step", ys="loss", keys=["train_loss", "val_loss"],
            title="Training Loss Curve"
        )),
        ("Comparison", wandb.plot.bar(
            wandb.Table(columns=["run", "accuracy"],
            data=[["run-a", 0.92], ["run-b", 0.89]])
        )),
    ]
)
report.save("experiment-report")

Sweeps Configuration

# sweeps.yaml
name: nlp-sweep
program: train.py
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    values: [1e-5, 2e-5, 5e-5, 1e-4]
  optimizer:
    values: [adamw, adam]
  warmup_ratio:
    min: 0.0
    max: 0.1
command:
  - python
  - train.py

Run the sweep:

wandb sweep sweeps.yaml
wandb agent $SWEEP_ID

Model Registry for Deployment

# Log model to the registry
run.log_model(
    path="./fine_tuned_model",
    name="my-nlp-model",
    aliases=["staging", "production"]
)

# Transition model to production
api = wandb.Api()
model = api.model("my-nlp-model:staging")
model.change_alias("production")

W&B SDK for Custom Training Loops

import wandb
import torch
from torch.optim import AdamW

wandb.init(project="custom-training")
config = wandb.config
config.learning_rate = 1e-4
config.batch_size = 64

model = MyModel()
optimizer = AdamW(model.parameters(), lr=config.learning_rate)

for epoch in range(config.epochs):
    model.train()
    epoch_loss = 0
    for i, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        
        # Log every 100 steps
        if i % 100 == 0:
            wandb.log({
                "train_loss": loss.item(),
                "learning_rate": config.learning_rate,
                "epoch": epoch
            })
    
    # Log epoch summary
    wandb.log({
        "epoch_loss_avg": epoch_loss / len(train_loader),
        "epoch": epoch,
    })

Artifact Versioning for Dataset Workflows

# Create and log a dataset artifact
dataset_artifact = wandb.Artifact(
    name="cleaned_dataset",
    type="dataset",
    description="Cleaned and tokenized training data",
)

# Add files
dataset_artifact.add_file("data/train.csv")
dataset_artifact.add_file("data/val.csv")
dataset_artifact.add_file("data/test.csv")

# Log with alias for versioning
wandb.log_artifact(dataset_artifact, aliases=["latest", "v1.2"])

# Later, use this exact dataset version
run = wandb.init()
clean_data = run.use_artifact("project/cleaned_dataset:v1.2", type="dataset")
data_path = clean_data.download()

Comparison with Alternatives #

FeatureW&BMLflowWeights & BiasesTensorBoardNeptune.ai
Experiment tracking
Hyperparameter sweeps✓ (Native)No
Dataset versioning✓ (Artifacts)✓ (MLflow)No
Model registryNo
LLM fine-tuning supportLimited
Distributed training
Free tier5 usersUnlimited1 userUnlimited3 users
Self-hostableNo (Cloud)✓ (Open)No✓ (Local)✓ (Open)
Custom reportsNoNoNo
Team collaborationNo
API qualityExcellentGoodGoodBasicGood
Active GitHub stars11K+20K+2K+10K+1K+

Limitations / Honest Assessment #

W&B is the most polished ML tracking platform available, but it has some trade-offs:

  1. Cloud-first model: W&B’s free tier requires using their cloud platform. While they offer self-hosted W&B Enterprise for teams that need on-premise deployment, the free tier does not support self-hosting. If your organization requires all data to stay within your infrastructure, this may be a dealbreaker.
  2. Free tier limits: The free tier is limited to 1 team member. For larger research teams, the paid plans start at a significant cost, especially when you factor in the additional storage required for large model artifacts.
  3. Learning curve for advanced features: Basic logging is straightforward, but features like sweeps, artifact versioning, and custom reports require understanding W&B’s data model. New users may need 1-2 hours to get comfortable with the platform’s full capabilities.
  4. Limited offline capability: If your training environment has intermittent internet connectivity, W&B syncs data when the connection is restored. However, the SDK does support wandb.init(mode="offline") for fully disconnected environments, with manual sync later.
  5. Vendor lock-in risk: While W&B exports data in standard formats (JSON, CSV), building a migration to another platform after committing to W&B for hundreds of experiments can be time-consuming.

Frequently Asked Questions #

Q: Is W&B free for academic research?

Yes. W&B offers a generous free tier for individual researchers and small teams. Academic institutions can also apply for W&B Pro credits through the W&B for Academia program, which provides additional team members and storage at no cost.

Q: Can I use W&B with Jupyter Notebooks?

Absolutely. W&B works seamlessly in Jupyter notebooks. Initialize your run with wandb.init() at the top of a notebook cell, and all subsequent wandb.log() calls will stream to the dashboard. Use wandb.jupyter for automatic integration with Jupyter widgets.

Q: How does W&B handle large model artifacts?

W&B uploads model artifacts incrementally — only changed bytes are transferred on subsequent uploads. For very large models (multi-gigabyte checkpoints), W&B uses a content-addressable storage system that deduplicates uploads and supports resuming interrupted transfers.

Q: Can I track LLM fine-tuning experiments?

Yes. W&B has excellent support for LLM fine-tuning workflows. It integrates with Hugging Face Transformers, PyTorch Lightning, and frameworks like Axolotl and Unsloth. You can log training loss, evaluation metrics, generated samples, and even benchmark scores for each fine-tuning run.

Q: How do I collaborate with my team on experiments?

W&B supports team workspaces where all runs, artifacts, and reports are shared by default. Team members can comment on specific runs, share direct links to experiment comparisons, and use the built-in chat feature to discuss results without switching to a separate communication platform.

Conclusion #

Weights & Biases has transformed how ML teams approach experiment tracking. By combining real-time logging, intuitive visualization, and powerful collaboration features, W&B turns the chaos of model training into a structured, reproducible workflow. Whether you’re fine-tuning a 7B-parameter LLM or running a small hyperparameter sweep, W&B provides the visibility you need to make better decisions faster.

The platform’s deep integrations with PyTorch, Hugging Face, and Ray Tune mean you can start tracking experiments with a single line of code (report_to="wandb"). For teams building ML applications at scale, DigitalOcean offers affordable GPU instances that pair well with W&B’s tracking infrastructure.

For teams deploying ML pipelines: WebShare provides reliable proxy infrastructure for distributed training workflows.

For ML teams exploring algorithmic strategies: Binance offers advanced API access for quantitative research and backtesting.

For enterprise data pipelines: HTStack delivers the network performance needed for large-scale data transfers.

For AI finance teams: OKX provides comprehensive market data and trading tools.

Explore more guides on MLflow vs W&B Comparison and LLM Fine-Tuning Best Practices for comprehensive technical deep-dives.

Join the DIBI8 community on Telegram for ongoing discussions about ML tools, experiment tracking, and MLOps practices.


Sources & Further Reading:

Disclosure: This article contains affiliate links. If you sign up through our links, we may earn a small commission at no additional cost to you. This helps support independent tech journalism and keeps resources like dibi8.com free and ad-free.

💬 Discussion