Benchmarks in Go: how to measure before optimizing

Go benchmarks with testing.B: how to measure performance, avoid measurement errors and make practical decisions with data.

Roger Bosch Mar 16, 2026

Once I wasted two days optimizing a function that wasn’t the bottleneck. I rewrote the parsing of a large JSON document with unsafe pointers, eliminated allocations, and flattened nested structs. The result was unreadable code that improved that function’s performance by 40%. The problem was that the function accounted for 2% of the total request time. Two days of work for an imperceptible improvement in production.

Go’s benchmarking tools would have saved me those two days. Five minutes with testing.B and pprof would have made it clear that the real bottleneck was the database connection, not the parsing. But I preferred to trust my intuition, and my intuition was wrong.

Optimizing without measuring is an expensive way to confirm biases. This article is about how to measure before touching anything.

The testing.B interface: how benchmarks work in Go

If you’ve already written tests in Go, benchmarks will feel familiar. They live in the same _test.go files, use the same testing package, and run with go test. The difference is that instead of receiving *testing.T, they receive *testing.B.

The signature of a benchmark is always the same:

func BenchmarkDescriptiveName(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // code to measure
    }
}

The name starts with Benchmark (mandatory) followed by a description in CamelCase. The b *testing.B parameter gives you access to the benchmarking framework.

The most important thing here is b.N. You don’t define it; the framework does. Go calls your benchmark function several times, starting with a small b.N and increasing it on each call, until a run lasts long enough to be timed reliably (one second by default, configurable with -benchtime). It may end up running your code 100 times, 10,000 times or 100,000,000 times. You don’t control that number and you shouldn’t try to.

This means that what you put inside the for i := 0; i < b.N; i++ loop has to be a complete unit of work. Not half work. Not work with side effects that accumulate between iterations. A clean, repeatable, isolated operation.
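A small sketch makes the growth of b.N visible. It relies on testing.Benchmark, which runs a benchmark function from an ordinary program instead of through go test; the observedN helper and the strings.Repeat workload are illustrative choices, not part of the examples in this article.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the loop body is not optimized away.
var sink string

// observedN (hypothetical helper) runs a tiny benchmark and returns every
// value of b.N that the framework passed to the benchmark function.
func observedN() []int {
	var seen []int
	testing.Benchmark(func(b *testing.B) {
		seen = append(seen, b.N)
		for i := 0; i < b.N; i++ {
			sink = strings.Repeat("x", 64)
		}
	})
	return seen
}

func main() {
	// The function is invoked several times, each time with a larger b.N,
	// until one run lasts long enough (about one second by default).
	fmt.Println(observedN())
}
```

The exact values depend on the machine, but the list always grows from one call to the next. This is why the loop body has to be a repeatable unit of work: the same function is executed with many different iteration counts.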

Your first benchmark

Suppose you have a function that concatenates strings. Something simple:

// concat.go
package concat

import "strings"

func ConcatPlus(parts []string) string {
    result := ""
    for _, p := range parts {
        result += p
    }
    return result
}

func ConcatBuilder(parts []string) string {
    var sb strings.Builder
    for _, p := range parts {
        sb.WriteString(p)
    }
    return sb.String()
}

Two implementations of the same problem. The first uses the + operator, which in Go creates a new string on each iteration (strings are immutable). The second uses strings.Builder, which accumulates bytes in an internal buffer and generates the final string just once.

Intuitively, strings.Builder should be faster. But “should” is not a data point. Let’s measure:

// concat_test.go
package concat

import "testing"

var parts = []string{
    "benchmark", "in", "go", "is", "a", "fundamental",
    "tool", "for", "measuring", "performance",
}

func BenchmarkConcatPlus(b *testing.B) {
    for i := 0; i < b.N; i++ {
        ConcatPlus(parts)
    }
}

func BenchmarkConcatBuilder(b *testing.B) {
    for i := 0; i < b.N; i++ {
        ConcatBuilder(parts)
    }
}

Notice that parts is defined outside the benchmark as a package variable. This is deliberate: we don’t want to measure the time to create the input slice, only the concatenation. If we built it inside the loop, the cost of creating the slice would be added to every measured iteration.

Running benchmarks: go test -bench

Benchmarks don’t run by default when you do go test. You need the -bench flag:

go test -bench=. ./...

The . after -bench is a regular expression that filters which benchmarks to run. A dot means “all”. If you only want to run one:

go test -bench=BenchmarkConcatBuilder ./...

Or a partial pattern:

go test -bench=Concat ./...

Flags you’ll use constantly:

-bench=. runs all benchmarks.
-benchmem includes memory statistics (allocations and bytes).
-count=N repeats the benchmark N times for better statistical significance.
-benchtime=5s changes the minimum duration of each benchmark (default 1 second).
-cpu=1,2,4 runs the benchmarks with different values of GOMAXPROCS.

A complete and useful command:

go test -bench=. -benchmem -count=10 ./...

This runs all benchmarks, reports memory, and repeats each one ten times. The repeated runs matter when you want to compare later with benchstat, which needs several samples per benchmark to estimate the variation between runs.

Reading the output: ns/op, B/op, allocs/op

When you run the concatenation benchmark with -benchmem, you get something like:

BenchmarkConcatPlus-8       1924562       617.3 ns/op     352 B/op      9 allocs/op
BenchmarkConcatBuilder-8    5765418       207.1 ns/op     120 B/op      2 allocs/op

Each column means something specific:

BenchmarkConcatPlus-8: the benchmark name. The -8 indicates it ran with GOMAXPROCS=8 (8 logical cores).
1924562: the value of b.N, i.e., how many times Go ran your code to get a stable measurement.
617.3 ns/op: nanoseconds per operation. This is the main performance metric.
352 B/op: heap bytes allocated per operation.
9 allocs/op: number of heap allocations per operation.

The data confirms what we expected: strings.Builder is three times faster and uses much less memory. But now it’s not an intuition, it’s a data point with enough iterations to be reliable.
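The same columns are available programmatically. The sketch below runs both concatenation functions through testing.Benchmark (an entry point for running benchmarks outside go test) and reads the fields of the returned testing.BenchmarkResult. The measure helper and the sink variable are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

var parts = []string{
	"benchmark", "in", "go", "is", "a", "fundamental",
	"tool", "for", "measuring", "performance",
}

// sink receives every result so the calls cannot be optimized away.
var sink string

func ConcatPlus(parts []string) string {
	result := ""
	for _, p := range parts {
		result += p
	}
	return result
}

func ConcatBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// measure (hypothetical helper) benchmarks fn and returns the raw result.
func measure(fn func([]string) string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = fn(parts)
		}
	})
}

func main() {
	for _, c := range []struct {
		name string
		fn   func([]string) string
	}{
		{"ConcatPlus", ConcatPlus},
		{"ConcatBuilder", ConcatBuilder},
	} {
		r := measure(c.fn)
		// N is the iteration count; the methods map to the ns/op, B/op
		// and allocs/op columns of the go test output.
		fmt.Printf("%-14s N=%d %d ns/op %d B/op %d allocs/op\n",
			c.name, r.N, r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
	}
}
```

The timings will differ from the numbers above on every machine. The allocation counts are the stable part of the comparison.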

What matters and what doesn’t

ns/op matters when performance is a requirement. B/op and allocs/op deserve attention on any hot path, because allocations put pressure on the garbage collector, and GC work is a common source of latency in Go applications under heavy load.

The value of b.N doesn’t matter directly. It’s an artifact of the measurement mechanism. If one benchmark runs 100 million iterations and another runs 1 million, it doesn’t mean one is better. It only means that more iterations of the faster one fit into the benchmark time.

Comparing benchmarks: benchstat

Running a benchmark once gives you a number. Running it ten times and comparing with benchstat gives you a summary with confidence intervals and a significance test.

Install benchstat if you don’t have it:

go install golang.org/x/perf/cmd/benchstat@latest

The workflow is to save the output of your benchmarks to files and compare them:

go test -bench=. -benchmem -count=10 ./... > old.txt

# Make your changes to the code

go test -bench=. -benchmem -count=10 ./... > new.txt

benchstat old.txt new.txt

The output of benchstat gives you something like:

goos: linux
goarch: amd64
pkg: example.com/concat
                  │  old.txt   │             new.txt              │
                  │   sec/op   │   sec/op    vs base              │
ConcatPlus-8       617.0n ± 2%   210.5n ± 1%  -65.88% (p=0.000)
ConcatBuilder-8    207.0n ± 1%   198.2n ± 1%   -4.25% (p=0.001)

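The ± 2% is the spread of the measurements across runs, reported as a confidence interval around the summary value. The p=0.000 is the p-value of the comparison, rounded to three decimals; values below 0.05 mean the difference is statistically significant. If benchstat shows ~ (p=0.342) instead of a percentage, the difference is not significant and you shouldn’t treat it as a real improvement.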

This is fundamental. A single benchmark can give different results depending on system load, CPU temperature, or what your browser is doing in the background. benchstat with -count=10 protects you against that.
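The arithmetic behind the vs base column can be sketched in a few lines. This is not benchstat’s implementation, only an illustration that assumes each set of runs is summarized by its median; the sample values are hypothetical and chosen so the medians match the ConcatPlus row above.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle value of the samples (the mean of the two
// middle values when the count is even). It does not modify its argument.
func median(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// percentChange is the relative difference of newValue against oldValue.
func percentChange(oldValue, newValue float64) float64 {
	return (newValue - oldValue) / oldValue * 100
}

func main() {
	// Hypothetical ns/op samples from five runs before and after a change.
	oldRuns := []float64{612.0, 617.0, 630.1, 609.4, 621.8}
	newRuns := []float64{210.5, 208.9, 213.0, 211.7, 209.2}

	oldMed, newMed := median(oldRuns), median(newRuns)
	fmt.Printf("%.1fn -> %.1fn (%+.2f%%)\n", oldMed, newMed, percentChange(oldMed, newMed))
}
```

What this sketch leaves out is the part that makes benchstat worth installing: the confidence interval and the significance test that decide whether the change is real or noise.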

Common measurement errors

The compiler optimizes your code

The Go compiler removes code it can prove has no effect. If your function is small enough to be inlined and its result isn’t used anywhere, the compiler may eliminate the entire call. Your benchmark would then measure an empty loop.

Bad:

func BenchmarkBad(b *testing.B) {
    for i := 0; i < b.N; i++ {
        ConcatBuilder(parts) // result is discarded
    }
}

In many cases this still works, because the compiler does not eliminate calls to larger functions that it cannot inline. But with small, pure functions, the call can disappear. The standard solution is to assign the result to a package variable:

var result string

func BenchmarkGood(b *testing.B) {
    var r string
    for i := 0; i < b.N; i++ {
        r = ConcatBuilder(parts)
    }
    result = r
}

The result variable is a package variable, so the compiler can’t assume nobody uses it. We assign to r inside the loop to avoid a write to a global variable on each iteration (which would be a measurable side effect) and copy at the end.

b.ResetTimer: setup you don’t want to measure

If your benchmark needs expensive setup before measuring, use b.ResetTimer() so that setup doesn’t count:

func BenchmarkWithSetup(b *testing.B) {
    // Expensive setup: create data, open connections, etc.
    data := generateTestData(10000)

    b.ResetTimer() // The timer resets here

    for i := 0; i < b.N; i++ {
        processData(data)
    }
}

Without b.ResetTimer(), the time for generateTestData would contaminate your measurement. This is especially problematic if the setup takes longer than the function you want to measure.

b.StopTimer and b.StartTimer: pausing between iterations

Sometimes you need to do work between iterations that you don’t want to measure. For example, resetting state:

func BenchmarkWithPause(b *testing.B) {
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        data := generateFreshData() // not measured
        b.StartTimer()

        processData(data) // this is measured
    }
}

Use this carefully. Calling b.StopTimer() and b.StartTimer() inside the loop has its own overhead. If your function is very fast (nanoseconds), that overhead may be larger than what you’re measuring. In those cases, it’s better to prepare all data outside the loop.
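One way to avoid the timer calls, sketched here with a hypothetical sorting benchmark, is to allocate the working buffer once and restore it with copy at the start of each iteration. The copy is still measured, but it is cheap and constant next to the sort, and nothing is allocated inside the loop. The names unsortedInts and sortFresh are assumptions made for this example.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"testing"
)

// unsortedInts returns n pseudo-random ints from a fixed seed,
// so every run of the benchmark sees the same input.
func unsortedInts(n int) []int {
	r := rand.New(rand.NewSource(1))
	out := make([]int, n)
	for i := range out {
		out[i] = r.Intn(1000000)
	}
	return out
}

// sortFresh restores buf from the untouched src and then sorts buf.
// src is never modified, so each call starts from the same unsorted state.
func sortFresh(buf, src []int) {
	copy(buf, src)
	sort.Ints(buf)
}

func BenchmarkSortFresh(b *testing.B) {
	src := unsortedInts(1000)     // setup, excluded by ResetTimer
	buf := make([]int, len(src)) // reused by every iteration
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		sortFresh(buf, src)
	}
}

func main() {
	// testing.Benchmark runs the function outside of go test.
	fmt.Println(testing.Benchmark(BenchmarkSortFresh))
}
```

If the copy is not negligible compared to the work being measured, benchmark the copy on its own and subtract it, or go back to StopTimer and StartTimer and accept their overhead.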

Benchmarks that depend on previous state

Each iteration should be independent. If your function modifies shared state between iterations, the measurements become contaminated:

// BAD: the map grows on each iteration
func BenchmarkMapAppend(b *testing.B) {
    m := make(map[string]int)
    for i := 0; i < b.N; i++ {
        m[fmt.Sprintf("key-%d", i)] = i // each iteration slower
    }
}

Here, the last iterations are slower than the first because the map keeps growing as b.N increases. On top of that, fmt.Sprintf is measured on every iteration. The benchmark gives you an average that doesn’t represent any real case.
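A possible fix, sketched under the assumption that the real question is “how long does it take to fill a map of 1,000 entries”: build the keys once before the timer starts, and make each iteration fill a fresh map of the same fixed size. The helpers makeKeys and fillMap are hypothetical.

```go
package main

import (
	"fmt"
	"testing"
)

// mapSink keeps the last map alive so the work is not optimized away.
var mapSink map[string]int

// makeKeys builds the keys up front, so fmt.Sprintf is not measured.
func makeKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("key-%d", i)
	}
	return keys
}

// fillMap is the unit of work: a fresh map filled with a fixed set of keys.
func fillMap(keys []string) map[string]int {
	m := make(map[string]int, len(keys))
	for i, k := range keys {
		m[k] = i
	}
	return m
}

func BenchmarkMapFill(b *testing.B) {
	keys := makeKeys(1000) // setup, excluded by ResetTimer
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		// Every iteration does the same work, whatever the value of b.N.
		mapSink = fillMap(keys)
	}
}

func main() {
	// testing.Benchmark runs the function outside of go test.
	fmt.Println(testing.Benchmark(BenchmarkMapFill))
}
```

The reported ns/op now describes one well-defined operation, and it stays comparable between runs because it no longer depends on how large b.N happened to be.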

Memory benchmarks: b.ReportAllocs

You’ve already seen that -benchmem gives you memory statistics. You can also activate allocation reporting from within the benchmark itself with b.ReportAllocs():

func BenchmarkWithReportAllocs(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        ConcatPlus(parts)
    }
}

This is useful when you want allocations to always be reported for a specific benchmark, without depending on whoever runs it remembering to add -benchmem.

Why allocations matter

In Go, each heap allocation is future work for the garbage collector. Fewer allocations means less pressure on GC, which translates to lower latency and fewer pauses. In high-load applications, the difference between 0 allocations and 3 allocations per operation on a hot path can show up directly in tail latency: for example, a p99 of 5ms versus a p99 of 50ms.

A good practice is to start by measuring allocations before looking at nanoseconds. If a function makes 10 allocations per call and you call it 100,000 times per second, that’s a million allocations per second the GC has to manage.

Pre-allocating to reduce allocations

A common optimization that benchmarks reveal is pre-allocation of slices and maps:

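var sliceSink []int

func BenchmarkSliceWithoutPrealloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s := []int{}
        for j := 0; j < 1000; j++ {
            s = append(s, j)
        }
        sliceSink = s // the slice escapes, as it would in real code
    }
}

func BenchmarkSlicePreallocated(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s := make([]int, 0, 1000)
        for j := 0; j < 1000; j++ {
            s = append(s, j)
        }
        sliceSink = s // the slice escapes, as it would in real code
    }
}

Run this with -benchmem and you’ll see that the version without pre-allocation makes between 10 and 20 allocations (every time the slice outgrows its backing array, Go allocates a larger one and copies the elements). The pre-allocated version makes just one allocation. The assignment to sliceSink matters: without it the slice never leaves the function, the compiler can place the fixed-size backing array on the stack, and the pre-allocated version may report 0 allocs/op.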
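Allocation counts can also be checked without reading benchmark output. The sketch below uses testing.AllocsPerRun from the standard library; the helper names (buildGrowing, buildPreallocated, allocsFor) are assumptions for this example, and the result is stored in a package variable so that the slice has to live on the heap.

```go
package main

import (
	"fmt"
	"testing"
)

// sliceSink forces the built slice to escape to the heap.
var sliceSink []int

// buildGrowing appends to a nil slice and lets append grow it as needed.
func buildGrowing(n int) []int {
	var s []int
	for j := 0; j < n; j++ {
		s = append(s, j)
	}
	return s
}

// buildPreallocated reserves the full capacity before appending.
func buildPreallocated(n int) []int {
	s := make([]int, 0, n)
	for j := 0; j < n; j++ {
		s = append(s, j)
	}
	return s
}

// allocsFor reports the average number of heap allocations per call.
func allocsFor(build func(int) []int, n int) float64 {
	return testing.AllocsPerRun(100, func() {
		sliceSink = build(n)
	})
}

func main() {
	fmt.Println("growing:     ", allocsFor(buildGrowing, 1000))
	fmt.Println("preallocated:", allocsFor(buildPreallocated, 1000))
}
```

The same call works inside a regular test, which makes it possible to fail the build when a hot function starts allocating more than expected.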

Profiling with pprof: the basics

Benchmarks tell you how long something takes. pprof tells you where time is spent. They are complementary tools.

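To generate a CPU profile from your benchmarks, run go test against a single package (the profiling flags are rejected when the pattern matches several packages, so use . or a package path instead of ./...):

go test -bench=BenchmarkConcatPlus -cpuprofile=cpu.out .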

This generates a cpu.out file you can analyze with go tool pprof:

go tool pprof cpu.out

Inside pprof, the most useful commands:

(pprof) top10            # the 10 functions that consume the most CPU
(pprof) list ConcatPlus  # shows source code annotated with timings
(pprof) web              # opens the call graph as an SVG in the browser (requires Graphviz)

For memory profiles (again against a single package):

go test -bench=BenchmarkConcatPlus -memprofile=mem.out .
go tool pprof mem.out

Inside the memory profile, top10 shows which functions allocate the most memory, and list points you to the exact lines.

The complete workflow

The pragmatic workflow for optimizing performance in Go is:

Benchmark: identify how long each operation takes.
pprof CPU: identify where time is spent.
pprof memory: identify what is being allocated.
Optimize: change only what the data tells you matters.
Benchmark again: verify your change actually improved something.
benchstat: confirm the improvement is statistically significant.

If you skip step 1 and go straight to optimizing, you’re where I was a few years ago: wasting two days on something that doesn’t matter.

When to benchmark and when not to

Yes

When you’re choosing between two implementations and performance is a factor.
Before optimizing anything. Measure first.
When a change touches code on the hot path of your application.
To validate that an optimization actually improved something.
In CI, to detect performance regressions between commits.

No

For code that runs once at application startup.
For endpoints that do I/O (HTTP, database): the benchmark doesn’t capture network latency.
As a substitute for a load test. Benchmarks measure isolated functions. A load test measures the complete system.
When the difference between both options is nanoseconds and your application handles 10 requests per second. Readability matters more.

The general rule: if you can’t articulate why the performance of that specific function is critical, you probably don’t need a benchmark. And if you do need it, the data will tell you quickly.

Practical example: comparing two implementations

Let’s look at a more realistic case. Suppose you have a service that filters users by a set of roles. Two approaches:

// users.go
package users

type User struct {
    Name  string
    Roles []string
}

// FilterWithSlice iterates through each user's roles using a slice.
func FilterWithSlice(users []User, allowedRoles []string) []User {
    var result []User
    for _, u := range users {
        for _, role := range u.Roles {
            if sliceContains(allowedRoles, role) {
                result = append(result, u)
                break
            }
        }
    }
    return result
}

func sliceContains(s []string, val string) bool {
    for _, v := range s {
        if v == val {
            return true
        }
    }
    return false
}

// FilterWithMap converts allowed roles to a map for O(1) lookup.
func FilterWithMap(users []User, allowedRoles []string) []User {
    allowed := make(map[string]struct{}, len(allowedRoles))
    for _, r := range allowedRoles {
        allowed[r] = struct{}{}
    }

    var result []User
    for _, u := range users {
        for _, role := range u.Roles {
            if _, ok := allowed[role]; ok {
                result = append(result, u)
                break
            }
        }
    }
    return result
}

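The first implementation is O(n*m*k) where n=users, m=roles per user, k=allowed roles. The second is O(n*m) because map lookup is O(1) on average. With m=2 and k=3, for example, the slice version does at most 6 string comparisons per user, while the map version does at most 2 lookups. In theory, the map should win. But the map has creation and hashing overhead. With few allowed roles, the slice could be faster.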

The benchmark:

// users_test.go
package users

import "testing"

func generateUsers(n int) []User {
    users := make([]User, n)
    roles := []string{"admin", "editor", "viewer", "moderator", "guest"}
    for i := range users {
        users[i] = User{
            Name:  "user",
            Roles: []string{roles[i%len(roles)], roles[(i+1)%len(roles)]},
        }
    }
    return users
}

var allowedRoles = []string{"admin", "editor", "moderator"}
var resultSink []User

func BenchmarkFilterWithSlice_100(b *testing.B) {
    users := generateUsers(100)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        resultSink = FilterWithSlice(users, allowedRoles)
    }
}

func BenchmarkFilterWithMap_100(b *testing.B) {
    users := generateUsers(100)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        resultSink = FilterWithMap(users, allowedRoles)
    }
}

func BenchmarkFilterWithSlice_10000(b *testing.B) {
    users := generateUsers(10000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        resultSink = FilterWithSlice(users, allowedRoles)
    }
}

func BenchmarkFilterWithMap_10000(b *testing.B) {
    users := generateUsers(10000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        resultSink = FilterWithMap(users, allowedRoles)
    }
}

Notice several details:

We generate data outside the loop and use b.ResetTimer() to not measure generation.
We assign to resultSink (package variable) to prevent the compiler from eliminating the call.
We test with two sizes (100 and 10,000) because behavior may change with data volume.

With 3 allowed roles, it’s possible that the slice version wins for 100 users (the map overhead doesn’t pay off) and that the map version wins with 10,000 users. Or maybe not. That’s exactly the point: the benchmark gives you the answer instead of forcing you to guess.

If the results show the difference is marginal, choose the more readable implementation. If one is 10x faster with your real data volume, choose the faster one. But let the numbers decide.
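A speed comparison only means something if both implementations return the same users. A minimal check, using the two filters from above with a small hand-written data set; the sameNames helper and the sample users are assumptions for this sketch.

```go
package main

import "fmt"

type User struct {
	Name  string
	Roles []string
}

// FilterWithSlice scans allowedRoles linearly for each role of each user.
func FilterWithSlice(users []User, allowedRoles []string) []User {
	var result []User
	for _, u := range users {
	roles:
		for _, role := range u.Roles {
			for _, allowed := range allowedRoles {
				if role == allowed {
					result = append(result, u)
					break roles
				}
			}
		}
	}
	return result
}

// FilterWithMap converts allowed roles to a set for constant-time lookup.
func FilterWithMap(users []User, allowedRoles []string) []User {
	allowed := make(map[string]struct{}, len(allowedRoles))
	for _, r := range allowedRoles {
		allowed[r] = struct{}{}
	}
	var result []User
	for _, u := range users {
		for _, role := range u.Roles {
			if _, ok := allowed[role]; ok {
				result = append(result, u)
				break
			}
		}
	}
	return result
}

// sameNames (hypothetical helper) compares two results by user name and order.
func sameNames(a, b []User) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].Name != b[i].Name {
			return false
		}
	}
	return true
}

func main() {
	users := []User{
		{Name: "ana", Roles: []string{"admin", "viewer"}},
		{Name: "bo", Roles: []string{"guest"}},
		{Name: "cy", Roles: []string{"viewer", "moderator"}},
	}
	allowed := []string{"admin", "editor", "moderator"}

	a := FilterWithSlice(users, allowed)
	b := FilterWithMap(users, allowed)
	fmt.Println(len(a), len(b), sameNames(a, b)) // prints: 2 2 true
}
```

In a real project this check belongs in a normal test next to the benchmarks, so a faster but wrong implementation never reaches the comparison.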

Sub-benchmarks for parameterization

Go allows sub-benchmarks with b.Run, which are ideal for testing multiple sizes without duplicating functions:

func BenchmarkFilterWithSlice(b *testing.B) {
    for _, size := range []int{10, 100, 1000, 10000} {
        users := generateUsers(size)
        b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                resultSink = FilterWithSlice(users, allowedRoles)
            }
        })
    }
}

The output will be:

BenchmarkFilterWithSlice/size=10-8      ...
BenchmarkFilterWithSlice/size=100-8     ...
BenchmarkFilterWithSlice/size=1000-8    ...
BenchmarkFilterWithSlice/size=10000-8   ...

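You can filter specific sub-benchmarks. The pattern is an unanchored regular expression, so end it with $; otherwise size=1000 also matches size=10000:

go test -bench='FilterWithSlice/size=1000$' ./...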

This makes benchmarks much more maintainable. Instead of writing one function per case, you parameterize and let the framework do the work.

Parallel benchmarks: b.RunParallel

If your code runs in a concurrent context (and in Go, almost everything does), you can measure performance under concurrency with b.RunParallel:

func BenchmarkFilterConcurrent(b *testing.B) {
    users := generateUsers(1000)
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            FilterWithMap(users, allowedRoles)
        }
    })
}

b.RunParallel starts several goroutines (GOMAXPROCS of them by default) that execute your code in parallel. Each goroutine iterates by calling pb.Next() instead of using b.N directly, and the framework distributes the b.N iterations among them.

This is useful for detecting contention: locks, atomic operations, cache false sharing. A function that is fast in a sequential benchmark can be slow when ten goroutines execute it simultaneously.
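To make contention concrete, here is a sketch that runs two hypothetical counters, one protected by a mutex and one using an atomic add, under b.RunParallel. It goes through testing.Benchmark so it can run as a normal program; the counter types and the runParallel helper are assumptions for this example, and no particular timing result is claimed.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// mutexCounter guards its value with a lock.
type mutexCounter struct {
	mu sync.Mutex
	n  int64
}

func (c *mutexCounter) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// atomicCounter updates its value with an atomic add.
type atomicCounter struct {
	n int64
}

func (c *atomicCounter) Inc() { atomic.AddInt64(&c.n, 1) }

// runParallel benchmarks inc under b.RunParallel. It also returns the total
// number of iterations the framework requested across all of its runs.
func runParallel(inc func()) (testing.BenchmarkResult, int64) {
	var requested int64
	res := testing.Benchmark(func(b *testing.B) {
		requested += int64(b.N)
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				inc()
			}
		})
	})
	return res, requested
}

func main() {
	m := &mutexCounter{}
	a := &atomicCounter{}

	resM, _ := runParallel(m.Inc)
	resA, _ := runParallel(a.Inc)

	fmt.Println("mutex: ", resM)
	fmt.Println("atomic:", resA)
}
```

Running it with different GOMAXPROCS values (or, in a go test benchmark, with -cpu=1,2,4) shows how each variant scales as more goroutines compete for the same counter.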

Measure first, optimize later

Everything we’ve seen boils down to one idea: don’t optimize based on what you think is slow. Measure, confirm, and only then act.

Go gives you the core tools in the standard library and the go command. You don’t need external frameworks, special configuration, or licenses; benchstat is the only separate install used here. testing.B has been there from day one, ready to use in any _test.go file.

The summarized workflow:

Write a benchmark for the function you suspect is slow.
Run with -benchmem -count=10 to have timing and memory data.
Use benchstat to compare before and after with statistical significance.
Generate profiles with -cpuprofile and -memprofile if you need to know exactly where time is spent.
Optimize only what the data points to as the real bottleneck.
Measure again to confirm your change actually improved something.

If the benchmark tells you the difference between two implementations is 20 nanoseconds and your endpoint takes 200 milliseconds, close the benchmark and spend your time on something that matters. If it tells you one implementation allocates 50 times less memory on a hot path that processes 100,000 requests per second, you’ve found something worth optimizing.

Data has no opinions. Your intuition does. Trust the data.

Tags: #go #benchmark #performance #testingb #optimization #profiling
