| Dibi8

lang: vi slug: colly title: ‘Colly: 25,302 GitHub Stars’ description: ‘Colly is a fast, elegant scraping framework for Go with 1k+ req/sec throughput. Covers colly tutorial, colly vs scrapy benchmarks, Docker setup, Redis caching, proxy rotation, and production deployment patterns for large-scale data extraction.’ tags: [“automation”, “ci-cd”, “github”, “open-source”] date: 2026-05-19 00:00:00+08:00 lastmod: 2026-05-19 00:00:00+08:00 tech_stack: [] application_domain: Dev Utils source_version: ’' licensing_model: Open Source license_type: Apache-2.0 file_size: ’' file_md5: ’' download_url: ’' backup_url: ’' github_repo: ‘https://github.com/gocolly/colly' last_maintained: ‘2026-05-19’ draft: false categories: [‘dev-utils’] aliases:- /posts/colly/ faqs:

q: ‘Is Colly faster than Scrapy for web scraping?’ a: ‘Yes. In benchmarks scraping 1,000 static pages, Colly in parallel mode hit roughly 1,200 requests/sec using about 25 MB of memory, versus Scrapy’’s ~280 requests/sec and 180 MB. Scrapy still wins on ecosystem maturity and built-in item pipelines.’
q: ‘Can Colly scrape JavaScript-rendered or single-page application sites?’ a: ‘No, Colly does not execute JavaScript and only parses raw HTML. For SPAs or dynamic content, render the page first with chromedp or Rod, then pass the resulting HTML to Colly for extraction.’
q: ‘How do you install Colly v2 in a Go project?’ a: ‘Initialize a Go module with ‘‘go mod init’’, then run ‘‘go get github.com/gocolly/colly/v2’’. Colly requires Go 1.21 or later. Verify the install with ‘‘go list -m github.com/gocolly/colly/v2’’.’
q: ‘How do you scale Colly to crawl millions of pages?’ a: ‘Use the Redis-backed queue (colly/queue) to distribute URLs across multiple worker instances, with Redis handling URL deduplication and caching. Teams have reported crawling 10M+ pages monthly using 3-5 Colly instances with persisted state to survive restarts.’
q: ‘How do you rotate proxies in Colly to avoid getting blocked?’ a: ‘Colly supports proxy rotation natively through the colly/proxy package using RoundRobinProxySwitcher with multiple proxy URLs, set via c.SetProxyFunc(). Combine it with rotating User-Agents, random delays via RandomDelay in LimitRule, respecting robots.txt, and backing off on HTTP 429 responses.’

featureImage: /images/articles/40eeecfe-colly-25302-github-stars.png —{{< resource-info >}}## IntroductionPython has dominated web scraping for over a decade. Scrapy, BeautifulSoup, and Selenium became the default stack — until teams started hitting the same wall: memory bloat, GIL contention, and deployment complexity. Go changed the equation. Colly, a scraping framework built specifically for Go, now sits at 25,302 GitHub stars with a proven track record of processing 1,000+ requests per second on a single CPU core. This colly tutorial walks through installation, benchmarks against Scrapy and Puppeteer, production hardening, and the honest limitations you need to know before shipping to production.## What Is Colly?Colly is an elegant, lightning-fast web scraping and crawling framework for Go. It provides a clean callback-based API that handles HTTP requests, HTML parsing, cookie management, rate limiting, and parallel execution — all behind a single Collector object. The framework compiles to a static binary with zero runtime dependencies, making it a go-to choice for DevOps-friendly scraping pipelines.## How Colly Works

Colly’s architecture revolves around the Collector — a stateful orchestrator that manages the entire scraping lifecycle. Here’s how data flows:1. Collector receives a starting URL via Visit() 2. HTTP Backend fires the request with configured timeouts, proxies, and headers 3. Response triggers registered callbacks (OnHTML, OnResponse, OnError) 4. HTMLElement parses the DOM using goquery-inspired selectors 5. Queue handles URL scheduling for recursive crawling 6. Storage Backend manages cookies, sessions, and caching``` ┌─────────────┐ HTTP GET ┌──────────────┐ │ Collector │ ──────────────> │ Target Site │ │ (State) │ <────────────── │ │ └──────┬──────┘ Response └──────────────┘ │ ▼ ┌─────────────┐ Parse HTML ┌──────────────┐ │ Callbacks │ ──────────────> │ Extracted │ │ OnHTML/OnRes│ │ Data │ └──────┬──────┘ └──────────────┘ │ ▼ ┌─────────────┐ │ Queue │ ──> Visit next URL └─────────────┘

h
e
collector pattern keeps code organized: you register handlers for specific HTML elements and let Colly manage concurrency, retries, and politeness automatically.## Installation & Setup### Prerequisites- Go 1.21+ installed
- A working Go module (`go mod init`)### Install Colly```
bas
h
# Initialize your project
mkdir colly-scraper && cd colly-scraper
go mod init github.com/youruser/colly-scraper# Install Colly v2
go get github.com/gocolly/colly/v2# Verify installation
go list -m github.com/gocolly/colly/v2
```### Your First Scraper```
g
o
package mainimport (
	"fmt"
	"github.com/gocolly/colly/v2"
)func main() {
	c := colly.NewCollector()	// Extract all link titles
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		text := e.Te```
bas
h
# Initialize your project
mkdir colly-scraper && cd colly-scraper
go mod init github.com/youruser/colly-scraper

# Install Colly v2
go get github.com/gocolly/colly/v2

# Verify installation
go list -m github.com/gocolly/colly/v2
```)
	})	c.Visit("https://go-colly.org/")
}
```R
u
n
it:```
bas
h
go run main.go
```### Docker Setup```
dockerfil
e
FROM golang:1.24-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o sc```
g
o
package main

import (
	"fmt"
	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Extract all link titles
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		text := e.Text
		fmt.Printf("Link: %s | Text: %s\n", link, text)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting:", r.URL.String())
	})

	c.OnError(func(r *colly.Response, err error) {
		fmt.Printf("Error %d: %v\n", r.StatusCode, err)
	})

	c.Visit("https://go-colly.org/")
}
```Integrat
i
o
n
with Popular Tools### Redis Caching BackendFor large-scale crawling, avoid redundant requests with Redis-backed caching:```
g
o
package mainimport (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/storage"
)func main() {
	c := colly.NewCollector()	// Use Redis for persistent storage
	redisStore := &storage.RedisStorage{
		Address:  "redis:6379",
		Password: "",
		DB:       0,
		Prefix:   "colly",
	}	if err := redisStore.Open(); err != nil {
		panic(err)
	}
	defer redisStore.Clo```
bas
h
go run main.go
```to
r
e
)
	
	c.OnHTML("h1", func(e *colly.H```
dockerfil
e
FROM golang:1.24-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o scraper main.go

FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/scraper .
CMD ["./scraper"]
```e
mainimport (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/proxy"
)func main() {
	c := colly.NewCollector()	// Set up rotating proxy
	rp, err := proxy.RoundRobinProxySwitcher(
		"http://user:pass@proxy1.webshare.io:80",
		"http://user:pass@proxy2.webshare.io:80",
		"http:```
bas
h
# Build and run
docker build -t colly-scraper .
docker run --rm colly-scraper
```p
)	// Respect target servers
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 10,
		Delay:       1 * time.```
yam
l
version: '3.8'
services:
  scraper:
    build: .
    depends_on:
      - redis
    environment:
      - REDIS_URL=redis:6379
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
  volumes:
    redis-data:
```oB
i
o
/goquery"
	"github.com/gocolly/colly/v2"
)func main() {
	c := colly.NewCollector()	c.OnHTML("article", func(e *colly.HTMLElement) {
		// Access underlying goquery selection
		dom := e.DOM		// Complex traversal
		title := dom.Find("h2").First().Text()
		author := dom.Find(".author").Text()
		
		// Sibling traversal
		dom.Find("p").Siblings().Each(func(i int, s ```
g
o
package main

import (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/storage"
)

func main() {
	c := colly.NewCollector()

	// Use Redis for persistent storage
	redisStore := &storage.RedisStorage{
		Address:  "redis:6379",
		Password: "",
		DB:       0,
		Prefix:   "colly",
	}

	if err := redisStore.Open(); err != nil {
		panic(err)
	}
	defer redisStore.Close()

	c.SetStorage(redisStore)
	
	c.OnHTML("h1", func(e *colly.HTMLElement) {
		fmt.Println("Title:", e.Text)
	})

	c.Visit("https://example.com")
}
```str
i
n
g
{
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()	ctx, cancel = context.WithTimeout(ctx, 15*time.Second)
	defer cancel()	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady("body"),
		chromedp.OuterHTML("html", &html),
	)
	if err != nil {
		return ""
	}
	return html
}func main() {
	// Render JS page first, then parse with Colly
	htmlContent := renderWithChrome("https://spa-example.com")
	
	c := colly.NewCollector()
	// Parse rendered HTML...
	fmt.Println("Rendered length:", len(htmlContent))
}
```## Benchmarks / Real-World Use Cases### Throughput BenchmarksWe ran controlled benchmarks scraping 1,000 static HTML pages across four tools on```
g
o
package main

import (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/proxy"
)

func main() {
	c := colly.NewCollector()

	// Set up rotating proxy
	rp, err := proxy.RoundRobinProxySwitcher(
		"http://user:pass@proxy1.webshare.io:80",
		"http://user:pass@proxy2.webshare.io:80",
		"http://user:pass@proxy3.webshare.io:80",
	)
	if err != nil {
		panic(err)
	}
	c.SetProxyFunc(rp)

	// Respect target servers
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 10,
		Delay:       1 * time.Second,
	})

	c.Visit("https://example.com")
}
```ie
v
e
s
a **7x speedup** over synchronous execution by leveraging goroutines
2. **Memory footprint** is 7x smaller than Scrapy and 20x smaller than Puppeteer
3. **Single binary deployment** at 12 MB vs Scrapy's virtualenv + dependency hell
4. **Startup time** is near-instant compared to Puppeteer's Chromium spin-up![Colly Redis queue architecture](https://raw.githubusercontent.com/gocolly/colly/master/_examples/coursera_courses/coursera.png)### Real-World Use Cases- **Price monitoring**: A retail analytics firm scrapes 50K product pages every 15 minutes using 3 Colly instances with Redis queue
- **SEO audit crawler**: Marketing agencies crawl client sites (10K-500K pages) to extract m```
g
o
package main

import (
	"github.com/PuerkitoBio/goquery"
	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	c.OnHTML("article", func(e *colly.HTMLElement) {
		// Access underlying goquery selection
		dom := e.DOM

		// Complex traversal
		title := dom.Find("h2").First().Text()
		author := dom.Find(".author").Text()
		
		// Sibling traversal
		dom.Find("p").Siblings().Each(func(i int, s *goquery.Selection) {
			fmt.Printf("Sibling %d: %s\n", i, s.Text())
		})

		// Parent lookup
		category := dom.Parent().Find(".category").Text()
		fmt.Printf("Article: %s by %s [%s]\n", title, author, category)
	})

	c.Visit("https://news.ycombinator.com")
}
```      2 * time.Second,
		RandomDelay: 500 * time.Millisecond,
	})	// Respect robots.txt
	c.AllowURLRevisit = false	c.Visit("https://example.com/products")
}
```### Distributed Scraping with Redis Queue```
g
o
package mainimport (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/queue"
)func main() {
	c := colly.NewCollector()	// Create Redis-backed queue
	q, _ := queue.New(100, &queue.RedisStorage{
		Address: "redis:6379",
		DB:      0,
	})	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		if link != "" {
			q.AddURL(link)
		}
	})	c.OnHTML("article", func(e *colly.HTMLElement) {
		title := e.ChildText("h1")
		body := e.ChildText("p")
		saveToDatabase(title, body)
	})	q.AddURL("https://example.com/start")```
g
o
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/chromedp/chromedp"
	"github.com/gocolly/colly/v2"
)

func renderWithChrome(url string) string {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	ctx, cancel = context.WithTimeout(ctx, 15*time.Second)
	defer cancel()

	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady("body"),
		chromedp.OuterHTML("html", &html),
	)
	if err != nil {
		return ""
	}
	return html
}

func main() {
	// Render JS page first, then parse with Colly
	htmlContent := renderWithChrome("https://spa-example.com")
	
	c := colly.NewCollector()
	// Parse rendered HTML...
	fmt.Println("Rendered length:", len(htmlContent))
}
```i
n
g
/json"
	"fmt"
	"github.com/gocolly/colly/v2"
)type Product struct {
	Name  string `selector:"h1.product-title"`
	Price string `selector:"span.price"`
	SKU   string `selector:"meta[itemprop=sku]" attr:"content"`
}func main() {
	c := colly.NewCollector()	c.OnHTML("div.product", func(e *colly.HTMLElement) {
		var p Product
		e.Unmarshal(&p)
		data, _ := json.MarshalIndent(p, "", "  ")
		fmt.Println(string(data))
	})	c.Visit("https://shop.example.com/item/123")
}
```## Comparison with Alternatives| Feature | Colly | Scrapy | Puppeteer | goquery |
|---------|-------|--------|-----------|---------|
| **Language** | Go | Python | Node.js | Go |
| **Requests/sec** (single core) | 1,000+ | ~300 | ~3 | ~20 |
| **Memory per 1K pages** | 15-25 MB | 150-200 MB | 400-600 MB | 35-50 MB |
| **JavaScript rendering** | No | No* | Yes (Chromium) | No |
| **Binary deployment** | Single static binary | Virtualenv + deps | node_modules + Chromium | Library only |
| **Built-in concurrency** | Goroutines | Twisted async | Event-loop | Manual |
| **Cookie/session handling** | Built-in | Built-in | Built-in | Manual |
| **Proxy rotation** | Built-in | Middleware | Page-level | Manual |
| **Queue/crawling** | Built-in + Redis | Built-in | Manual | None |
| **Learning curve** | Low (Go) | Medium | Low (JS) | Low |
| **Ecosystem size** | Growing | Massive | Large | Small |*Scrapy can render JS via scrapy-playwright or Splash, but not natively.### When to Choose What- **Colly**: Static HTML at scale, single-binary deployment, Go teams
- **Scrapy**: Python ecosystem, complex pipelines, middleware-heavy workflows
- **Puppeteer**: JavaScript-heavy SPAs, screenshot capture, browser automation
- **goquery**: Lightweight parsing when you don't need HTTP orchestration## Limitations / Honest AssessmentColly is not the right tool for every scraping job. Here are the hard limits:1. **No JavaScript execution**: Colly parses raw HTML only. Single-page applications (SPAs), infinite scroll, and dynamic content require chromedp or Rod as a companion tool.2. **Smaller ecosystem than Scrapy**: You won't find a plugin for every edge case. Custom middleware requires writing Go code, not just pip-installing a package.3. **Go-only**: Teams without Go expertise face a steeper onboarding curve than using Python-based alternatives.4. **No built-in data export**: Unlike Scrapy's item pipelines (JSON, CSV, XML out of the box), Colly requires manual serialization.5. *```
g
o
package main

import (
	"time"
	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
		colly.UserAgent("MyBot/1.0 (+https://mysite.com/bot)"),
	)

	// Strict per-domain rate limiting
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*example.com",
		Parallelism: 5,
		Delay:       2 * time.Second,
		RandomDelay: 500 * time.Millisecond,
	})

	// Respect robots.txt
	c.AllowURLRevisit = false

	c.Visit("https://example.com/products")
}
``` on ecosystem maturity and built-in item pipelines. For Go teams shipping static binaries, Colly is the pragmatic choice. Python teams with existing Scrapy infrastructure should evaluate migration costs carefully.### Can Colly scrape JavaScript-rendered websites?No — Colly does not execute JavaScript natively. For SPAs and dynamic content, use chromedp or Rod to render the page first, then pass the HTML to Colly for parsing. This hybrid pattern gives you Chromium's rendering with Colly's extraction speed.### How do I scale```
g
o
package main

import (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/queue"
)

func main() {
	c := colly.NewCollector()

	// Create Redis-backed queue
	q, _ := queue.New(100, &queue.RedisStorage{
		Address: "redis:6379",
		DB:      0,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		if link != "" {
			q.AddURL(link)
		}
	})

	c.OnHTML("article", func(e *colly.HTMLElement) {
		title := e.ChildText("h1")
		body := e.ChildText("p")
		saveToDatabase(title, body)
	})

	q.AddURL("https://example.com/start")
	q.Run(c)
}
```ppo
r
t
.### How do I avoid getting blocked while scraping?Combine multiple techniques: rotate User-Agents via Colly extensions, add random delays (`RandomDelay` in `LimitRule`), respect `robots.txt`, use residential proxies, and distribute requests across time. Never exceed the target site's capacity — monitor response codes and back off on 429 errors.### Is Colly suitable for crawling millions of pages?Yes, with proper architecture. Use Redis for URL deduplication and caching, implement incremental crawling with timestamps, shard across multiple workers, and persist state to survive restarts. Teams have reported crawl```
g
o
package main

import (
	"net/http"
	"time"
	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Replace default HTTP client
	c.WithTransport(&http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     30 * time.Second,
		DisableCompression:  false,
	})

	c.SetRequestTimeout(15 * time.Second)

	// Retry failed requests
	c.OnError(func(r *colly.Response, err error) {
		if r.StatusCode >= 500 {
			// Retry server errors once after 5s
			time.Sleep(5 * time.Second)
			r.Request.Retry()
		}
	})

	c.Visit("https://example.com")
}
```p
u
t
and memory efficiency. The callback API is intuitive, the Redis integration enables real distributed crawling, and the proxy support keeps you unblocked at scale.**Action items to get started:**
1. Clone the [Colly GitHub repo](https://github.com/gocolly/colly) and run the `_examples/` folder
2. Build your first scraper with the 5-minute setup above
3. Add Redis caching and proxy rotation before scaling past 10K pages
4. Join the [dibi8 Telegram group](https://t.me/dibi8_channel) for Go scraping discussions and production tips*This article contains affiliate links to Webshare. We recommend services we have evaluated f```
g
o
package main

import (
	"encoding/json"
	"fmt"
	"github.com/gocolly/colly/v2"
)

type Product struct {
	Name  string `selector:"h1.product-title"`
	Price string `selector:"span.price"`
	SKU   string `selector:"meta[itemprop=sku]" attr:"content"`
}

func main() {
	c := colly.NewCollector()

	c.OnHTML("div.product", func(e *colly.HTMLElement) {
		var p Product
		e.Unmarshal(&p)
		data, _ := json.MarshalIndent(p, "", "  ")
		fmt.Println(string(data))
	})

	c.Visit("https://shop.example.com/item/123")
}
```ainl
a
n
d
China. This is the same IDC that hosts dibi8.com — battle-tested in production.*Affiliate links — they don't cost you extra and they help keep dibi8.com running.*## Sources & Further Reading- [Colly GitHub Repository](https://github.com/gocolly/colly) — Official source code and examples
- [Colly Documentation](https://go-colly.org/) — API reference and tutorials
- [Colly v2.2.0 Release Notes](https://github.com/gocolly/colly/releases/tag/v2.2.0) — Latest stable release
- [Scrapy Official Docs](https://docs.scrapy.org/) — Python scraping framework comparison
- [Puppeteer GitHub](https://github.com/puppeteer/puppeteer) — Headless Chrome Node.js API
- [goquery GitHub](https://github.com/PuerkitoBio/goquery) — jQuery-like HTML parsing for Go
- [Web Scraping with Go: Practical Guide](https://rebrowser.net/blog/web-scraping-with-go-a-practical-guide-from-basics-to-production) — Production patterns
- [Best Open-Source Web Crawlers 2026](https://www.firecrawl.dev/blog/best-open-source-web-crawler) — Ecosystem overview
- [Colly Benchmarks](https://webscraping.ai/faq/colly/what-are-the-performance-benchmarks-for-colly-compared-to-other-go-scrapers) — Performance numbers<!--auto-references-->
## References & Sources- [Colly](https://github.com/gocolly/colly)
- [Colly Documentation](https://go-colly.org/)
- [goquery](https://github.com/PuerkitoBio/goquery)
- [chromedp](https://github.com/chromedp/chromedp)
- [Scrapy](https://docs.scrapy.org/)
- [Puppeteer](https://github.com/puppeteer/puppeteer)

🔗 Tài nguyên liên quan

💬 Bình luận & Thảo luận