Building a simple scraper in Go: concurrency, HTTP and parsing
Tutorial for building a web scraper in Go with net/http, goquery, rate limiting and controlled concurrency. Practical scraping.

BeautifulSoup + requests in Python is faster to write. You can have a working scraper in fifteen lines, with HTML parsing, session handling and CSV export, without breaking a sweat. For one-off scraping, I still use Python. But when I needed to scrape 50,000 pages concurrently, with fine-grained control over connections and retries, and without dragging a virtualenv into production, Go was the option that fit.
Go is not the most comfortable language for quick scraping. That’s a fact. It doesn’t have the Scrapy ecosystem, nor the community of extraction tools that Python has. But it has goroutines, compilation to a static binary, a solid standard HTTP library and a concurrency model that doesn’t need asyncio or event loops. For scrapers that will run as services, in containers, processing large volumes, that matters.
What we’re going to build here is a small but real scraper. It makes HTTP requests, parses HTML, extracts structured data, handles errors, respects rate limits and runs with controlled concurrency. If you’re coming from Python and exploring Go, this will give you a concrete example of how a scraping workflow translates to this language. If you already know Go, you might find a useful pattern for your own scrapers. For a broader comparison between both languages, I have a dedicated article on Go vs Python.
The HTTP client in Go: net/http
Go has an HTTP client in the standard library that doesn’t need anything else. No external dependencies, no wrappers. net/http is what most HTTP tools in Go use under the hood, including frameworks like Gin or libraries like Resty.
The most basic way to make a GET request:
package main
import (
"fmt"
"io"
"net/http"
)
func main() {
resp, err := http.Get("https://example.com")
if err != nil {
fmt.Println("Error:", err)
return
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
fmt.Println("Error reading body:", err)
return
}
fmt.Println(string(body))
}
It works, but it has a fundamental problem for scraping: it uses the default HTTP client (http.DefaultClient), which has no timeout. If a server takes ten minutes to respond, your program will wait ten minutes. In a concurrent scraper, that’s a disaster.
The first thing you need is to create your own client with explicit configuration:
client := &http.Client{
Timeout: 10 * time.Second,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 30 * time.Second,
},
}
Timeout is the total request timeout (including body reading). Transport controls the connection pool. MaxIdleConnsPerHost is important for scraping: if you’re making many requests to the same domain, you want to reuse TCP connections instead of opening a new one each time.
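To see the timeout do its job without depending on a real slow site, here is a minimal sketch that points a client at a deliberately slow local server from net/http/httptest. The helper names (newClient, isTimeout, slowServerTimesOut) and the delays are made up for this illustration. The pool values simply mirror the ones above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newClient builds a client with an explicit overall timeout and a small
// connection pool. The pool sizes are illustrative, not tuned values.
func newClient(timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     30 * time.Second,
		},
	}
}

// isTimeout reports whether err (or an error it wraps) is a network timeout.
func isTimeout(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

// slowServerTimesOut starts a local test server that answers after
// serverDelay and reports whether a client with clientTimeout gave up first.
func slowServerTimesOut(serverDelay, clientTimeout time.Duration) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(serverDelay) // simulate a slow backend
		fmt.Fprint(w, "finally")
	}))
	defer srv.Close()

	resp, err := newClient(clientTimeout).Get(srv.URL)
	if err != nil {
		return isTimeout(err)
	}
	resp.Body.Close()
	return false
}

func main() {
	fmt.Println("timed out:", slowServerTimesOut(300*time.Millisecond, 50*time.Millisecond))
}
```

Checking for a timeout with errors.As is handy later on, when you decide which failures are worth retrying.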
For more configurable requests, use http.NewRequest instead of http.Get:
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return fmt.Errorf("creating request for %s: %w", url, err)
}
req.Header.Set("User-Agent", "MyScraper/1.0 (+https://example.com/bot)")
req.Header.Set("Accept", "text/html")
req.Header.Set("Accept-Language", "en-US,en;q=0.9")
resp, err := client.Do(req)
if err != nil {
return fmt.Errorf("doing GET %s: %w", url, err)
}
defer resp.Body.Close()
Notice the User-Agent. It’s not optional. It’s the minimum you should do as a responsible scraper: identify yourself. Many servers block requests without a User-Agent or with generic User-Agents.
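If you are curious what a server sees when you skip the header, this small sketch echoes the User-Agent back from a local httptest server. userAgentSeen is a hypothetical helper, not part of the scraper. With the header unset, Go identifies itself as Go-http-client/1.1 on HTTP/1.1 connections, which is exactly the kind of generic value that tends to get filtered.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// userAgentSeen sends one GET to a local test server and returns the
// User-Agent the server received. An empty ua leaves the header unset.
func userAgentSeen(ua string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.UserAgent()) // echo the header back as the body
	}))
	defer srv.Close()

	req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", fmt.Errorf("creating request: %w", err)
	}
	if ua != "" {
		req.Header.Set("User-Agent", ua)
	}

	resp, err := srv.Client().Do(req)
	if err != nil {
		return "", fmt.Errorf("doing GET: %w", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("reading body: %w", err)
	}
	return string(body), nil
}

func main() {
	for _, ua := range []string{"", "MyScraper/1.0 (+https://example.com/bot)"} {
		seen, err := userAgentSeen(ua)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("set %q -> server saw %q\n", ua, seen)
	}
}
```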
HTML parsing with goquery
Go doesn’t have a BeautifulSoup equivalent in the standard library. It has golang.org/x/net/html for parsing HTML, but its API is low-level and working with it directly is tedious. The library everyone uses for scraping in Go is goquery. It’s the jQuery equivalent for Go: CSS selectors, DOM traversal, text and attribute extraction.
Install it with:
go get github.com/PuerkitoBio/goquery
Basic usage:
package main
import (
"fmt"
"log"
"net/http"
"github.com/PuerkitoBio/goquery"
)
func main() {
resp, err := http.Get("https://example.com")
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
log.Fatal(err)
}
// Extract the page title
title := doc.Find("title").Text()
fmt.Println("Title:", title)
// Extract all links
doc.Find("a").Each(func(i int, s *goquery.Selection) {
href, exists := s.Attr("href")
if exists {
fmt.Printf("Link %d: %s -> %s\n", i, s.Text(), href)
}
})
}
goquery CSS selectors cover practically everything you need:
// By class
doc.Find(".article-title")
// By ID
doc.Find("#main-content")
// Compound selectors
doc.Find("div.product > h2.name")
// Attributes
doc.Find("a[href^='https']")
// Pseudo-selectors
doc.Find("tr:nth-child(even)")
To extract data, the most common methods are:
// Element text
text := s.Text()
// Attribute
href, exists := s.Attr("href")
// Inner HTML
html, err := s.Html()
// First matching element
first := doc.Find(".item").First()
// Iterate all elements
doc.Find(".item").Each(func(i int, s *goquery.Selection) {
// ...
})
If you’re coming from BeautifulSoup, the mental translation is direct. soup.select(".class") is doc.Find(".class"). tag.get_text() is s.Text(). tag["href"] is s.Attr("href"), which also returns a boolean telling you whether the attribute exists.
Building the scraper: extracting data from a page
Let’s build something concrete. Imagine we want to scrape a fictional news site and extract articles from the main page: title, link, summary and date.
First, we define the data structure:
type Article struct {
Title string `json:"title"`
URL string `json:"url"`
Summary string `json:"summary"`
Date string `json:"date"`
}
Now, the function that parses a page and extracts articles:
func parseArticles(doc *goquery.Document, baseURL string) []Article {
var articles []Article
doc.Find("article.post").Each(func(i int, s *goquery.Selection) {
title := strings.TrimSpace(s.Find("h2.post-title").Text())
if title == "" {
return // Skip elements without a title
}
href, exists := s.Find("h2.post-title a").Attr("href")
if !exists {
return
}
// Resolve relative URLs
fullURL := resolveURL(baseURL, href)
summary := strings.TrimSpace(s.Find("p.post-summary").Text())
date := strings.TrimSpace(s.Find("time").AttrOr("datetime", ""))
articles = append(articles, Article{
Title: title,
URL: fullURL,
Summary: summary,
Date: date,
})
})
return articles
}
The resolveURL function converts relative URLs to absolute ones:
func resolveURL(base, ref string) string {
baseURL, err := url.Parse(base)
if err != nil {
return ref
}
refURL, err := url.Parse(ref)
if err != nil {
return ref
}
return baseURL.ResolveReference(refURL).String()
}
And the function that makes the HTTP request and connects everything:
func fetchArticles(client *http.Client, pageURL string) ([]Article, error) {
req, err := http.NewRequest("GET", pageURL, nil)
if err != nil {
return nil, fmt.Errorf("creating request: %w", err)
}
req.Header.Set("User-Agent", "GoScraper/1.0")
resp, err := client.Do(req)
if err != nil {
return nil, fmt.Errorf("fetch %s: %w", pageURL, err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("status %d for %s", resp.StatusCode, pageURL)
}
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
return nil, fmt.Errorf("parsing HTML from %s: %w", pageURL, err)
}
return parseArticles(doc, pageURL), nil
}
Notice the pattern: each error is wrapped with context using %w. This lets you know exactly what failed and where when debugging. If Go’s error handling seems excessive to you, I recommend reading my article on errors in Go where I explain why this verbosity is a real advantage.
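Wrapping with %w pays off beyond the log message: the original error stays reachable through the chain. As a minimal sketch, assume a hypothetical StatusError type (the scraper above returns a plain formatted error instead) and two layers of wrapping. errors.As can still recover the status code at the top.

```go
package main

import (
	"errors"
	"fmt"
)

// StatusError is a hypothetical error type that carries the HTTP status code.
type StatusError struct {
	Code int
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("unexpected status %d", e.Code)
}

// fetchPage simulates a request that ended with the given status code.
func fetchPage(pageURL string, status int) error {
	if status != 200 {
		return fmt.Errorf("fetch %s: %w", pageURL, &StatusError{Code: status})
	}
	return nil
}

// scrapeSite adds a second layer of context on top of fetchPage.
func scrapeSite(pageURL string, status int) error {
	if err := fetchPage(pageURL, status); err != nil {
		return fmt.Errorf("scraping site: %w", err)
	}
	return nil
}

// statusOf digs the status code out of a wrapped error chain.
func statusOf(err error) (int, bool) {
	var se *StatusError
	if errors.As(err, &se) {
		return se.Code, true
	}
	return 0, false
}

func main() {
	err := scrapeSite("https://example.com/page/1", 503)
	fmt.Println(err)
	if code, ok := statusOf(err); ok {
		fmt.Println("status code from the chain:", code)
	}
}
```

The printed message reads from the outermost context to the root cause, and the caller can branch on the status code without parsing strings.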
Adding concurrency: goroutines and worker pool
Up to here we have a sequential scraper. It works, but if you have 1,000 pages to scrape, it will take forever. This is where Go shines.
The naive approach (don’t do this)
// DON'T do this
for _, url := range urls {
go func(u string) {
articles, err := fetchArticles(client, u)
// ...
}(url)
}
Launching a goroutine per URL without control will cause you to fire 1,000 simultaneous requests. The server will block you, you’ll exhaust file descriptors and your scraper will blow up. It’s the equivalent of opening a thousand browser tabs at once.
Worker pool: controlled concurrency
The correct pattern is a worker pool. A fixed number of goroutines (workers) process URLs from a shared channel. This gives you real but controlled concurrency. If you want to dig deeper into this pattern, I have a dedicated article on worker pools in Go.
func scrapeWithWorkers(client *http.Client, urls []string, numWorkers int) []Article {
var (
mu sync.Mutex
results []Article
wg sync.WaitGroup
)
jobs := make(chan string, len(urls))
// Launch workers
for i := 0; i < numWorkers; i++ {
wg.Add(1)
go func(workerID int) {
defer wg.Done()
for url := range jobs {
articles, err := fetchArticles(client, url)
if err != nil {
log.Printf("[Worker %d] Error scraping %s: %v", workerID, url, err)
continue
}
mu.Lock()
results = append(results, articles...)
mu.Unlock()
log.Printf("[Worker %d] OK: %s (%d articles)", workerID, url, len(articles))
}
}(i)
}
// Send URLs to the channel
for _, u := range urls {
jobs <- u
}
close(jobs)
// Wait for all workers to finish
wg.Wait()
return results
}
Let’s break down what happens:
- jobs channel: acts as a work queue. Workers read from this channel.
- sync.WaitGroup: lets us wait for all workers to finish.
- sync.Mutex: protects the results slice from concurrent writes. Without this, you’d have a race condition.
- range jobs: each worker reads URLs from the channel until it closes. This is idiomatic in Go.
With numWorkers = 10, you have ten goroutines processing URLs in parallel. If a request takes 2 seconds, instead of taking 2,000 seconds for 1,000 URLs, it takes around 200 seconds. Real concurrency without asyncio, without callbacks, without promises.
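The arithmetic behind that estimate fits in a few lines. This is a rough sketch with a hypothetical estimateDuration helper: it assumes every request takes the same time and ignores rate limits and retries, so treat the result as a best case.

```go
package main

import (
	"fmt"
	"time"
)

// estimateDuration gives a rough wall-clock estimate for a worker pool,
// assuming every request takes perRequest and workers never sit idle.
func estimateDuration(pages, workers int, perRequest time.Duration) time.Duration {
	if pages <= 0 || workers <= 0 {
		return 0
	}
	rounds := (pages + workers - 1) / workers // ceiling division
	return time.Duration(rounds) * perRequest
}

func main() {
	for _, workers := range []int{1, 10, 50} {
		fmt.Printf("%2d workers: %v\n", workers, estimateDuration(1000, workers, 2*time.Second))
	}
}
```

For 1,000 pages at 2 seconds each, one worker needs 33m20s, ten workers 3m20s and fifty workers 40s. Past a certain point the server, not your worker count, becomes the limit.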
For finer control, you can add context in Go to cancel scraping if something goes wrong:
func scrapeWithContext(ctx context.Context, client *http.Client, urls []string, numWorkers int) ([]Article, error) {
var (
mu sync.Mutex
results []Article
wg sync.WaitGroup
)
jobs := make(chan string, len(urls))
for i := 0; i < numWorkers; i++ {
wg.Add(1)
go func(workerID int) {
defer wg.Done()
for url := range jobs {
select {
case <-ctx.Done():
return
default:
}
articles, err := fetchArticles(client, url)
if err != nil {
log.Printf("[Worker %d] Error: %v", workerID, err)
continue
}
mu.Lock()
results = append(results, articles...)
mu.Unlock()
}
}(i)
}
for _, u := range urls {
select {
case jobs <- u:
case <-ctx.Done():
close(jobs)
wg.Wait()
return results, ctx.Err()
}
}
close(jobs)
wg.Wait()
return results, nil
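}
The select with ctx.Done() lets each worker check whether the context has been cancelled before processing the next URL. If you call cancel() from outside, every worker stops before picking up another URL. Requests already in flight still run to completion, because this version of fetchArticles does not receive the context.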
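Here is a stripped-down sketch of that check with a single worker and no HTTP, so the behaviour is easy to follow. processUntil is a hypothetical helper, and the cancel() inside the loop stands in for an external cancellation such as a timeout or Ctrl+C.

```go
package main

import (
	"context"
	"fmt"
)

// processUntil feeds items to a single worker and cancels the context as soon
// as stopAt has been handled. It returns the items that were processed.
func processUntil(items []string, stopAt string) []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Queue everything up front, as the worker pool does.
	jobs := make(chan string, len(items))
	for _, item := range items {
		jobs <- item
	}
	close(jobs)

	var processed []string
	done := make(chan struct{})
	go func() {
		defer close(done)
		for item := range jobs {
			// Same check as in the worker: stop before taking on new work.
			select {
			case <-ctx.Done():
				return
			default:
			}
			processed = append(processed, item)
			if item == stopAt {
				cancel() // stand-in for a cancellation coming from outside
			}
		}
	}()

	<-done // wait for the worker before reading processed
	return processed
}

func main() {
	urls := []string{"/page/1", "/page/2", "/page/3", "/page/4"}
	fmt.Println(processUntil(urls, "/page/2"))
}
```

The remaining URLs stay in the channel untouched: the worker finishes the item it was handling and then leaves.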
Rate limiting: time.Ticker and semaphore
Having controlled concurrency with a worker pool is not enough. You need rate limiting. Even with only 5 workers, if responses are fast, you can make hundreds of requests per second. That will draw attention from the server and you’ll likely get blocked.
Rate limiting with time.Ticker
time.Ticker emits a value on a channel at regular intervals. You can use it as a rate limiter:
func scrapeWithRateLimit(client *http.Client, urls []string, numWorkers int, requestsPerSecond int) []Article {
var (
mu sync.Mutex
results []Article
wg sync.WaitGroup
)
jobs := make(chan string, len(urls))
ticker := time.NewTicker(time.Second / time.Duration(requestsPerSecond))
defer ticker.Stop()
for i := 0; i < numWorkers; i++ {
wg.Add(1)
go func(workerID int) {
defer wg.Done()
for url := range jobs {
<-ticker.C // Wait for the next tick
articles, err := fetchArticles(client, url)
if err != nil {
log.Printf("[Worker %d] Error: %v", workerID, err)
continue
}
mu.Lock()
results = append(results, articles...)
mu.Unlock()
}
}(i)
}
for _, u := range urls {
jobs <- u
}
close(jobs)
wg.Wait()
return results
}
With requestsPerSecond = 5, the ticker emits a value every 200ms. Each worker must wait for a tick to be available before making its request. This gives you a maximum of 5 requests per second, regardless of how many workers you have.
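You can check the "regardless of how many workers" part with a small experiment. In this sketch the requests are no-ops, pacedRun is a hypothetical helper, and the only thing measured is how long the shared ticker makes the workers wait. It also guards against a rate of zero, which would otherwise cause a division by zero when computing the interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pacedRun pushes `jobs` no-op requests through `workers` goroutines that all
// wait on one shared ticker, and returns how long the whole run took.
func pacedRun(jobs, workers, requestsPerSecond int) time.Duration {
	if jobs <= 0 || workers <= 0 || requestsPerSecond <= 0 {
		return 0
	}
	interval := time.Second / time.Duration(requestsPerSecond)
	if interval <= 0 {
		return 0 // time.NewTicker panics on a non-positive interval
	}

	queue := make(chan int, jobs)
	for i := 0; i < jobs; i++ {
		queue <- i
	}
	close(queue)

	start := time.Now()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue {
				<-ticker.C // every request, from any worker, consumes one tick
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	for _, workers := range []int{1, 5, 20} {
		fmt.Printf("%2d workers: %v\n", workers, pacedRun(10, workers, 50))
	}
}
```

Ten jobs at 50 requests per second cannot finish in less than about 200ms, whether you run one worker or twenty. The ticker is the bottleneck by design.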
Semaphore with buffered channel
Another option is to use a buffered channel as a semaphore to limit active concurrent requests:
type Scraper struct {
client *http.Client
semaphore chan struct{}
delay time.Duration
}
func NewScraper(maxConcurrent int, delay time.Duration) *Scraper {
return &Scraper{
client: &http.Client{
Timeout: 10 * time.Second,
},
semaphore: make(chan struct{}, maxConcurrent),
delay: delay,
}
}
func (s *Scraper) Fetch(url string) ([]Article, error) {
s.semaphore <- struct{}{} // Acquire slot
defer func() {
time.Sleep(s.delay) // Delay between requests
<-s.semaphore // Release slot
}()
return fetchArticles(s.client, url)
}
The semaphore channel has a buffer of size maxConcurrent. When it’s full, the next s.semaphore <- struct{}{} blocks until a worker releases its slot. Combined with time.Sleep(s.delay) after each request, you have control over both concurrency and speed.
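To convince yourself that the buffered channel really caps concurrency, you can count how many tasks are active at the same time. This is a sketch with a hypothetical peakConcurrency helper, and a short sleep stands in for the HTTP request. Note that it launches one goroutine per task, which is fine here because the semaphore decides how many of them do work at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// peakConcurrency runs `tasks` goroutines through a semaphore of size
// maxConcurrent and returns the highest number that were active at once.
func peakConcurrency(tasks, maxConcurrent int) int {
	if maxConcurrent < 1 {
		maxConcurrent = 1 // an unbuffered channel would block every sender
	}
	sem := make(chan struct{}, maxConcurrent)

	var (
		mu     sync.Mutex
		active int
		peak   int
		wg     sync.WaitGroup
	)
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks while full)
			defer func() { <-sem }() // release it on the way out

			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()

			time.Sleep(5 * time.Millisecond) // stand-in for the HTTP request

			mu.Lock()
			active--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak with 20 tasks and 3 slots:", peakConcurrency(20, 3))
}
```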
Error handling and retries
In scraping, errors are the norm, not the exception. Timeouts, 429 (Too Many Requests), 503 (Service Unavailable), reset connections, malformed HTML. Your scraper has to handle all this without crashing.
Retries with exponential backoff
func fetchWithRetry(client *http.Client, url string, maxRetries int) (*http.Response, error) {
var lastErr error
for attempt := 0; attempt <= maxRetries; attempt++ {
if attempt > 0 {
backoff := time.Duration(1<<uint(attempt-1)) * time.Second // 1s, 2s, 4s, 8s...
jitter := time.Duration(rand.Int63n(int64(500 * time.Millisecond)))
time.Sleep(backoff + jitter)
log.Printf("Retry %d/%d for %s", attempt, maxRetries, url)
}
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return nil, fmt.Errorf("creating request: %w", err)
}
req.Header.Set("User-Agent", "GoScraper/1.0")
resp, err := client.Do(req)
if err != nil {
lastErr = fmt.Errorf("attempt %d: %w", attempt, err)
continue
}
// Retry on certain status codes
if resp.StatusCode == http.StatusTooManyRequests ||
resp.StatusCode == http.StatusServiceUnavailable ||
resp.StatusCode >= 500 {
resp.Body.Close()
lastErr = fmt.Errorf("attempt %d: status %d", attempt, resp.StatusCode)
// If there's a Retry-After header, respect it
if retryAfter := resp.Header.Get("Retry-After"); retryAfter != "" {
if seconds, err := strconv.Atoi(retryAfter); err == nil {
time.Sleep(time.Duration(seconds) * time.Second)
}
}
continue
}
return resp, nil
}
return nil, fmt.Errorf("exhausted %d retries for %s: %w", maxRetries, url, lastErr)
}
Key points:
- Exponential backoff: 1s, 2s, 4s, 8s… Each retry waits twice as long as the previous.
- Jitter: a random component to prevent all workers from retrying at the same time (thundering herd).
- Retry-After: if the server tells you how long to wait, listen to it.
- Only retry recoverable errors: a 404 makes no sense to retry. A 429 or 503, yes.
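Two details are worth isolating into helpers. First, an uncapped exponential backoff grows quickly, so in practice you usually put a ceiling on it. Second, Retry-After may carry an HTTP date instead of a number of seconds, which strconv.Atoi alone will not parse. The sketch below covers both. backoffDelay, retryAfterDelay and the 10-second cap are assumptions for illustration, not part of the function above.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// backoffDelay returns the base wait before the given retry attempt
// (1-based): 1s, 2s, 4s, ... capped at maxDelay. Jitter is added separately.
func backoffDelay(attempt int, maxDelay time.Duration) time.Duration {
	if attempt < 1 {
		return 0
	}
	d := time.Second
	for i := 1; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// retryAfterDelay interprets a Retry-After header value. The header may hold
// either a number of seconds or an HTTP date; ok is false if it is neither.
func retryAfterDelay(value string) (delay time.Duration, ok bool) {
	if value == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(value); err == nil {
		if secs < 0 {
			return 0, false
		}
		return time.Duration(secs) * time.Second, true
	}
	if when, err := http.ParseTime(value); err == nil {
		if d := time.Until(when); d > 0 {
			return d, true
		}
		return 0, true // the date has already passed: retry right away
	}
	return 0, false
}

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Printf("retry %d: wait %v\n", attempt, backoffDelay(attempt, 10*time.Second))
	}
	d, ok := retryAfterDelay("120")
	fmt.Println("Retry-After 120:", d, ok)
}
```

A reasonable policy is to wait for whichever is longer, the backoff or the server's Retry-After, rather than sleeping for both in sequence.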
Classifying errors
Not all errors deserve the same treatment:
func isRetryable(statusCode int) bool {
switch statusCode {
case http.StatusTooManyRequests, // 429
http.StatusServiceUnavailable, // 503
http.StatusBadGateway, // 502
http.StatusGatewayTimeout: // 504
return true
default:
return statusCode >= 500
}
}
func isSkippable(statusCode int) bool {
switch statusCode {
case http.StatusNotFound, // 404
http.StatusForbidden, // 403
http.StatusGone: // 410
return true
default:
return false
}
}
In the worker, you use this to decide what to do:
if isSkippable(resp.StatusCode) {
log.Printf("Skipping %s: status %d", url, resp.StatusCode)
continue
}
if isRetryable(resp.StatusCode) {
// Retry with backoff
}
Saving results: JSON output
For a simple scraper, JSON is the most practical format. Easy to generate, easy to consume, easy to inspect.
Writing results to a file
func saveResults(articles []Article, filename string) error {
file, err := os.Create(filename)
if err != nil {
return fmt.Errorf("creating file %s: %w", filename, err)
}
defer file.Close()
encoder := json.NewEncoder(file)
encoder.SetIndent("", " ")
if err := encoder.Encode(articles); err != nil {
return fmt.Errorf("writing JSON: %w", err)
}
return nil
}
Incremental writing with JSON Lines
If the scraper is going to run for hours, you don’t want to accumulate everything in memory and write at the end. Use JSON Lines (one JSON object per line):
func newResultWriter(filename string) (*ResultWriter, error) {
file, err := os.OpenFile(filename, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
if err != nil {
return nil, err
}
return &ResultWriter{
file: file,
encoder: json.NewEncoder(file),
mu: sync.Mutex{},
}, nil
}
type ResultWriter struct {
file *os.File
encoder *json.Encoder
mu sync.Mutex
}
func (w *ResultWriter) Write(article Article) error {
w.mu.Lock()
defer w.mu.Unlock()
return w.encoder.Encode(article)
}
func (w *ResultWriter) Close() error {
return w.file.Close()
}
With sync.Mutex, multiple workers can write to the file safely. Each Encode writes a complete line, so if the scraper crashes midway, you don’t lose already written data.
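Reading the file back is just as simple, and it shows why the format tolerates crashes. In this sketch, readJSONLines and parseJSONLines are hypothetical helpers. A json.Decoder consumes one object after another, and if the last line was cut off mid-write you still get every complete record before the error.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

type Article struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Summary string `json:"summary"`
	Date    string `json:"date"`
}

// readJSONLines decodes one Article per JSON value until EOF. On a decoding
// error it returns the articles read so far together with the error.
func readJSONLines(r io.Reader) ([]Article, error) {
	dec := json.NewDecoder(r)
	var articles []Article
	for {
		var a Article
		err := dec.Decode(&a)
		if errors.Is(err, io.EOF) {
			return articles, nil // clean end of file
		}
		if err != nil {
			return articles, fmt.Errorf("decoding record %d: %w", len(articles)+1, err)
		}
		articles = append(articles, a)
	}
}

// parseJSONLines is a convenience wrapper for in-memory data.
func parseJSONLines(data string) ([]Article, error) {
	return readJSONLines(strings.NewReader(data))
}

func main() {
	// The third record is truncated, as if the process died mid-write.
	data := `{"title":"A","url":"https://example.com/a"}
{"title":"B","url":"https://example.com/b"}
{"title":"C","ur`
	articles, err := parseJSONLines(data)
	fmt.Println("recovered:", len(articles), "error:", err)
}
```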
Respecting robots.txt and being a good citizen
Just because you can scrape a site doesn’t mean you should do it without consideration. There are basic rules every scraper should follow.
Checking robots.txt
import "github.com/temoto/robotstxt"
func checkRobotsTxt(client *http.Client, siteURL, userAgent string) (*robotstxt.Group, error) {
robotsURL := siteURL + "/robots.txt"
resp, err := client.Get(robotsURL)
if err != nil {
return nil, fmt.Errorf("fetching robots.txt: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
// No robots.txt, assume everything is allowed
return nil, nil
}
robots, err := robotstxt.FromResponse(resp)
if err != nil {
return nil, fmt.Errorf("parsing robots.txt: %w", err)
}
return robots.FindGroup(userAgent), nil
}
// Before scraping a URL
func canFetch(group *robotstxt.Group, path string) bool {
if group == nil {
return true
}
return group.Test(path)
}
Use it before each request:
parsedURL, err := url.Parse(targetURL)
if err != nil {
log.Printf("Invalid URL %s: %v", targetURL, err)
continue
}
if !canFetch(robotsGroup, parsedURL.Path) {
log.Printf("Blocked by robots.txt: %s", targetURL)
continue
}
General best practices
Beyond robots.txt, there are principles you should follow:
- Identify yourself: Use a descriptive User-Agent. Include a contact URL.
- Always rate limit: Maximum 1-2 requests per second to the same domain, unless you know the server can handle more.
- Respect Retry-After: If the server tells you to wait, wait.
- Don’t scrape protected content: If there’s login, CAPTCHA or terms of use that prohibit it, don’t do it.
- Cache: If you already have a page downloaded, don’t request it again.
- Timing: If you can choose, scrape during off-peak hours.
This isn’t just ethics. It’s pragmatism. A scraper that behaves well lasts longer without being blocked.
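The cheapest form of the "Cache" rule is to never request the same URL twice within a run. A minimal sketch, assuming an in-memory set shared by all workers (Visited and FirstVisit are hypothetical names). It does not store page bodies or survive restarts, so for that you would need something persistent.

```go
package main

import (
	"fmt"
	"sync"
)

// Visited is a concurrency-safe set of URLs that were already requested.
type Visited struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewVisited returns an empty set ready for use by multiple goroutines.
func NewVisited() *Visited {
	return &Visited{seen: make(map[string]struct{})}
}

// FirstVisit records url and reports whether it had not been seen before.
// Checking and recording happen under one lock, so two workers can never
// both be told that the same URL is new.
func (v *Visited) FirstVisit(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.seen[url]; ok {
		return false
	}
	v.seen[url] = struct{}{}
	return true
}

func main() {
	visited := NewVisited()
	urls := []string{
		"https://example.com/page/1",
		"https://example.com/page/2",
		"https://example.com/page/1", // duplicate
	}
	for _, u := range urls {
		if !visited.FirstVisit(u) {
			fmt.Println("skipping, already fetched:", u)
			continue
		}
		fmt.Println("fetching:", u)
	}
}
```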
Complete working example
Here’s the complete scraper, putting together everything we’ve seen. This code is functional: you can copy it, adjust the CSS selectors and run it.
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"math/rand"
"net/http"
"net/url"
"os"
"strconv"
"strings"
"sync"
"time"
"github.com/PuerkitoBio/goquery"
)
// --- Types ---
type Article struct {
Title string `json:"title"`
URL string `json:"url"`
Summary string `json:"summary"`
Date string `json:"date"`
}
type ScraperConfig struct {
MaxWorkers int
RequestsPerSecond int
MaxRetries int
Timeout time.Duration
UserAgent string
}
// --- HTTP client ---
func newHTTPClient(cfg ScraperConfig) *http.Client {
return &http.Client{
Timeout: cfg.Timeout,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 30 * time.Second,
},
}
}
// --- Request with retries ---
func fetchWithRetry(ctx context.Context, client *http.Client, url string, userAgent string, maxRetries int) (*http.Response, error) {
var lastErr error
for attempt := 0; attempt <= maxRetries; attempt++ {
if attempt > 0 {
backoff := time.Duration(1<<uint(attempt-1)) * time.Second
jitter := time.Duration(rand.Int63n(int64(500 * time.Millisecond)))
select {
case <-ctx.Done():
return nil, ctx.Err()
case <-time.After(backoff + jitter):
}
log.Printf("Retry %d/%d for %s", attempt, maxRetries, url)
}
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
if err != nil {
return nil, fmt.Errorf("creating request: %w", err)
}
req.Header.Set("User-Agent", userAgent)
req.Header.Set("Accept", "text/html")
resp, err := client.Do(req)
if err != nil {
lastErr = fmt.Errorf("attempt %d: %w", attempt, err)
continue
}
if resp.StatusCode == http.StatusTooManyRequests ||
resp.StatusCode >= 500 {
resp.Body.Close()
lastErr = fmt.Errorf("attempt %d: status %d", attempt, resp.StatusCode)
if retryAfter := resp.Header.Get("Retry-After"); retryAfter != "" {
if seconds, err := strconv.Atoi(retryAfter); err == nil {
time.Sleep(time.Duration(seconds) * time.Second)
}
}
continue
}
return resp, nil
}
return nil, fmt.Errorf("exhausted %d retries for %s: %w", maxRetries, url, lastErr)
}
// --- Parsing ---
func resolveURL(base, ref string) string {
baseURL, err := url.Parse(base)
if err != nil {
return ref
}
refURL, err := url.Parse(ref)
if err != nil {
return ref
}
return baseURL.ResolveReference(refURL).String()
}
func parseArticles(doc *goquery.Document, baseURL string) []Article {
var articles []Article
doc.Find("article.post").Each(func(i int, s *goquery.Selection) {
title := strings.TrimSpace(s.Find("h2.post-title").Text())
if title == "" {
return
}
href, exists := s.Find("h2.post-title a").Attr("href")
if !exists {
return
}
articles = append(articles, Article{
Title: title,
URL: resolveURL(baseURL, href),
Summary: strings.TrimSpace(s.Find("p.post-summary").Text()),
Date: strings.TrimSpace(s.Find("time").AttrOr("datetime", "")),
})
})
return articles
}
func fetchArticles(ctx context.Context, client *http.Client, pageURL string, cfg ScraperConfig) ([]Article, error) {
resp, err := fetchWithRetry(ctx, client, pageURL, cfg.UserAgent, cfg.MaxRetries)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("status %d for %s", resp.StatusCode, pageURL)
}
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
return nil, fmt.Errorf("parsing HTML from %s: %w", pageURL, err)
}
return parseArticles(doc, pageURL), nil
}
// --- Worker pool with rate limiting ---
func scrape(ctx context.Context, cfg ScraperConfig, urls []string) ([]Article, error) {
client := newHTTPClient(cfg)
var (
mu sync.Mutex
results []Article
wg sync.WaitGroup
)
jobs := make(chan string, len(urls))
ticker := time.NewTicker(time.Second / time.Duration(cfg.RequestsPerSecond))
defer ticker.Stop()
// Launch workers
for i := 0; i < cfg.MaxWorkers; i++ {
wg.Add(1)
go func(workerID int) {
defer wg.Done()
for pageURL := range jobs {
// Check cancellation
select {
case <-ctx.Done():
return
default:
}
// Rate limiting
<-ticker.C
articles, err := fetchArticles(ctx, client, pageURL, cfg)
if err != nil {
log.Printf("[Worker %d] Error scraping %s: %v", workerID, pageURL, err)
continue
}
mu.Lock()
results = append(results, articles...)
mu.Unlock()
log.Printf("[Worker %d] OK: %s (%d articles)", workerID, pageURL, len(articles))
}
}(i)
}
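// Send URLs; stop early if the context is cancelled
sendLoop:
for _, u := range urls {
select {
case jobs <- u:
case <-ctx.Done():
break sendLoop // a plain break would only exit the select
}
}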
close(jobs)
wg.Wait()
return results, nil
}
// --- Save results ---
func saveResults(articles []Article, filename string) error {
file, err := os.Create(filename)
if err != nil {
return fmt.Errorf("creating file: %w", err)
}
defer file.Close()
encoder := json.NewEncoder(file)
encoder.SetIndent("", " ")
return encoder.Encode(articles)
}
// --- Main ---
func main() {
cfg := ScraperConfig{
MaxWorkers: 5,
RequestsPerSecond: 2,
MaxRetries: 3,
Timeout: 10 * time.Second,
UserAgent: "GoScraper/1.0 (+https://example.com/bot)",
}
// URLs to scrape (adjust to your case)
urls := []string{
"https://example-news.com/page/1",
"https://example-news.com/page/2",
"https://example-news.com/page/3",
"https://example-news.com/page/4",
"https://example-news.com/page/5",
}
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
log.Printf("Starting scraping of %d pages with %d workers", len(urls), cfg.MaxWorkers)
articles, err := scrape(ctx, cfg, urls)
if err != nil {
log.Fatalf("Error in scraping: %v", err)
}
log.Printf("Total articles extracted: %d", len(articles))
if err := saveResults(articles, "results.json"); err != nil {
log.Fatalf("Error saving results: %v", err)
}
log.Println("Results saved to results.json")
}
To run it:
go mod init scraper
go mod tidy
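go run main.go
go mod tidy will download goquery and its dependencies automatically. The binary compiled with go build is a single executable you can copy to any server with the same OS and architecture without installing anything else. On Linux, building with CGO_ENABLED=0 makes sure it is fully statically linked.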
When Python is still better for scraping
It would be dishonest to finish without this. Go has clear advantages for scraping at scale, but Python is still the best option in many scenarios:
Python wins when:
- You prototype quickly: You want to see if a scraper is viable. BeautifulSoup + requests + a Jupyter notebook. In ten minutes you have data. In Go you spend half an hour setting up the project, defining structs and handling errors.
- You need Scrapy: Scrapy is a complete scraping framework with middlewares, pipelines, cookie handling, automatic throttling, export to multiple formats and a huge community. Go has nothing comparable.
- JavaScript rendering: If the site loads content with JavaScript, you need a headless browser. Python has Playwright and Selenium with mature bindings. Go has chromedp, which works but is less ergonomic.
- One-shot scripts: A scraper you’re going to run once to extract data doesn’t need to be compiled. Python with a virtualenv is fine.
- Data/ML teams: If the team that will maintain the scraper works in Python and the data goes to a pandas/sklearn pipeline, adding Go to the equation doesn’t add enough.
Go wins when:
- High volume: Thousands or tens of thousands of pages. Go’s native concurrency and low memory usage make a difference.
- Scraper as a service: If the scraper is going to run continuously in a container, a 10MB static binary is better than a Python container with dependencies.
- Backend teams: If the team already works in Go, it doesn’t make sense to introduce Python just for a scraper.
- Performance matters: HTML parsing in Go (goquery uses the golang.org/x/net/html parser) is significantly faster than BeautifulSoup.
- Clean deployment: One binary. No runtime, no virtualenv, no pip version conflicts.
The question isn’t “which language is better for scraping”. It’s “what do I need in this specific case”. For a broader comparison, check the Go vs Python article.
From quick script to production tool
We’ve built a Go scraper from scratch that covers the fundamental aspects: configured HTTP client, HTML parsing with goquery, structured data extraction, concurrency with worker pool, rate limiting with time.Ticker, retries with exponential backoff, JSON output and robots.txt compliance.
The patterns we’ve used are the same ones you’ll find in production tools. The worker pool with channels is a standard Go concurrency pattern. Error handling with wrapping is idiomatic. Rate limiting with a Ticker is a simple, common way to control speed.
Go is not the fastest option for putting together a quick scraper. But when you need a scraper that runs in production, that handles concurrency without pain, that deploys as a static binary and that scales without dragging dependencies, it makes sense. Especially if you’re already working in Go for the rest of your backend.
The complete code in this article is a starting point. Adapt it to your case: change the CSS selectors, adjust the number of workers, add database persistence instead of JSON, integrate metrics with Prometheus. The base structure is the same.


