Benchmarks in Go: how to measure before optimizing
Go benchmarks with testing.B: how to measure performance, avoid measurement errors and make practical decisions with data.

Once I wasted two days optimizing a function that wasn’t the bottleneck. I rewrote the parsing of a large JSON payload with unsafe pointers, eliminated allocations, and flattened nested structs. The result was unreadable code that improved that function’s performance by 40%. The problem was that the function accounted for only 2% of the total request time. Two days of work for an imperceptible improvement in production.
Go’s benchmarking tools would have saved me those two days. Five minutes with testing.B and pprof would have made it clear that the real bottleneck was the database connection, not the parsing. But I preferred to trust my intuition, and my intuition was wrong.
Optimizing without measuring is an expensive way to confirm biases. This article is about how to measure before touching anything.
The testing.B interface: how benchmarks work in Go
If you’ve already written tests in Go, benchmarks will feel familiar. They live in the same _test.go files, use the same testing package, and run with go test. The difference is that instead of receiving *testing.T, they receive *testing.B.
The signature of a benchmark is always the same:
func BenchmarkDescriptiveName(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// code to measure
	}
}
The name starts with Benchmark (mandatory) followed by a description in CamelCase. The b *testing.B parameter gives you access to the benchmarking framework.
The most important thing here is b.N. You don’t set it; the framework does. Go calls your benchmark function several times, starting with a small b.N and increasing it until a single run lasts at least the target benchmark time (one second by default, adjustable with -benchtime). That may mean running your code 100 times, 10,000 times or 100,000,000 times. You normally don’t choose that number, and your benchmark shouldn’t depend on it.
This means that what you put inside the for i := 0; i < b.N; i++ loop has to be a complete unit of work. Not half work. Not work with side effects that accumulate between iterations. A clean, repeatable, isolated operation.
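One way to watch this ramp-up is to record every b.N the framework passes in. The sketch below is only an illustration: seenN, repeatSink and BenchmarkRepeat are names made up for this example, and the exact sequence of values depends on the machine and the Go version.

```go
package main

import (
	"strings"
	"testing"
)

// seenN collects each b.N the framework passes in (illustration only).
var seenN []int

// repeatSink keeps the result reachable so the call is not optimized away.
var repeatSink string

// BenchmarkRepeat performs one complete, repeatable unit of work per iteration.
func BenchmarkRepeat(b *testing.B) {
	seenN = append(seenN, b.N)
	var r string
	for i := 0; i < b.N; i++ {
		r = strings.Repeat("x", 64)
	}
	repeatSink = r
}
```

Calling testing.Benchmark(BenchmarkRepeat) from a small program and then printing seenN should show the function being invoked several times with a growing b.N, the last value being the one that ends up in the reported result.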
Your first benchmark
Suppose you have a function that concatenates strings. Something simple:
// concat.go
package concat
import "strings"
func ConcatPlus(parts []string) string {
	result := ""
	for _, p := range parts {
		result += p
	}
	return result
}
func ConcatBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}
Two implementations of the same problem. The first uses the + operator, which in Go creates a new string on each iteration (strings are immutable). The second uses strings.Builder, which accumulates bytes in an internal buffer and generates the final string just once.
Intuitively, strings.Builder should be faster. But “should” is not a data point. Let’s measure:
// concat_test.go
package concat
import "testing"
var parts = []string{
	"benchmark", "in", "go", "is", "a", "fundamental",
	"tool", "for", "measuring", "performance",
}
func BenchmarkConcatPlus(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ConcatPlus(parts)
	}
}
func BenchmarkConcatBuilder(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ConcatBuilder(parts)
	}
}
Notice that parts is defined outside the benchmark as a package variable. This is deliberate: we don’t want to measure the time to create the input slice, only the concatenation. If we defined it inside the loop, we’d be measuring noise.
Running benchmarks: go test -bench
Benchmarks don’t run by default when you do go test. You need the -bench flag:
go test -bench=. ./...
The . after -bench is a regular expression that filters which benchmarks to run. A dot means “all”. If you only want to run one:
go test -bench=BenchmarkConcatBuilder ./...
Or a partial pattern:
go test -bench=Concat ./...
Flags you’ll use constantly:
- -bench=. runs all benchmarks.
- -benchmem includes memory statistics (allocations and bytes).
- -count=N repeats the benchmark N times for better statistical significance.
- -benchtime=5s changes the minimum duration of each benchmark (default 1 second).
- -cpu=1,2,4 runs the benchmarks with different values of GOMAXPROCS.
A complete and useful command:
go test -bench=. -benchmem -count=5 ./...
This runs all benchmarks, reports memory, and repeats each one five times. The five runs are important when you want to compare later with benchstat.
Reading the output: ns/op, B/op, allocs/op
When you run the concatenation benchmark with -benchmem, you get something like:
BenchmarkConcatPlus-8 1924562 617.3 ns/op 352 B/op 9 allocs/op
BenchmarkConcatBuilder-8 5765418 207.1 ns/op 120 B/op 2 allocs/op
Each column means something specific:
- BenchmarkConcatPlus-8: the benchmark name. The -8 indicates it ran with GOMAXPROCS=8 (8 logical cores).
- 1924562: the value of b.N in the final run, i.e., how many iterations the reported averages are based on.
- 617.3 ns/op: nanoseconds per operation. This is the main performance metric.
- 352 B/op: heap bytes allocated per operation.
- 9 allocs/op: number of heap allocations per operation.
The data confirms what we expected: strings.Builder is three times faster and uses much less memory. But now it’s not an intuition, it’s a data point with enough iterations to be reliable.
What matters and what doesn’t
ns/op matters when performance is a requirement. B/op and allocs/op always matter, because allocations pressure the garbage collector, and GC is a constant source of latency in Go applications under heavy load.
The value of b.N doesn’t matter directly. It’s an artifact of the measurement mechanism. If one benchmark runs 100 million iterations and another runs 1 million, that alone doesn’t tell you which is better. It only means the faster one needed more iterations to fill the benchmark time. Compare ns/op, not iteration counts.
Comparing benchmarks: benchstat
Running a benchmark once gives you a number. Running it five times and comparing with benchstat gives you a statistical data point with confidence intervals.
Install benchstat if you don’t have it:
go install golang.org/x/perf/cmd/benchstat@latest
The workflow is to save the output of your benchmarks to files and compare them:
go test -bench=. -benchmem -count=10 ./... > old.txt
# Make your changes to the code
go test -bench=. -benchmem -count=10 ./... > new.txt
benchstat old.txt new.txt
The output of benchstat gives you something like:
goos: linux
goarch: amd64
pkg: example.com/concat
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
ConcatPlus-8 617.0n ± 2% 210.5n ± 1% -65.88% (p=0.000)
ConcatBuilder-8 207.0n ± 1% 198.2n ± 1% -4.25% (p=0.001)
The ± 2% is the variation between runs. The p=0.000 indicates that the difference is statistically significant (p < 0.05). If benchstat tells you ~ (p=0.342), the difference is not significant and you shouldn’t treat it as a real improvement.
This is fundamental. A single benchmark can give different results depending on system load, CPU temperature, or what your browser is doing in the background. benchstat with -count=10 protects you against that.
Common measurement errors
The compiler optimizes your code
Go’s compiler inlines small functions and removes code whose result is never used. If your function is inlined, has no side effects, and its result is discarded, the compiler may eliminate the entire call. Your benchmark would literally measure nothing.
Bad:
func BenchmarkBad(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ConcatBuilder(parts) // result is discarded
	}
}
In many cases this still works, because larger functions are not inlined and the call survives. But with small or pure functions, the compiler can eliminate it. The standard solution is to assign the result to a package variable:
var result string
func BenchmarkGood(b *testing.B) {
	var r string
	for i := 0; i < b.N; i++ {
		r = ConcatBuilder(parts)
	}
	result = r
}
The result variable is a package variable, so the compiler can’t assume nobody uses it. We assign to r inside the loop to avoid a write to a global variable on each iteration (which would be a measurable side effect) and copy at the end.
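To make the risk concrete, here is a minimal sketch with a function small enough to be inlined. The names isEven and evenSink are invented for this example, and whether the discarded version really collapses to an empty loop depends on the compiler version, so treat any difference you observe as indicative rather than guaranteed.

```go
package main

import "testing"

// isEven is a hypothetical tiny, side-effect-free function that the
// compiler is free to inline.
func isEven(n int) bool { return n%2 == 0 }

// evenSink is a package-level sink so the result stays observable.
var evenSink bool

// BenchmarkIsEvenDiscarded ignores the result. After inlining, the loop
// body may be optimized down to (almost) nothing.
func BenchmarkIsEvenDiscarded(b *testing.B) {
	for i := 0; i < b.N; i++ {
		isEven(i)
	}
}

// BenchmarkIsEvenSink keeps the result in a local variable and publishes
// it once at the end, following the pattern described above.
func BenchmarkIsEvenSink(b *testing.B) {
	var r bool
	for i := 0; i < b.N; i++ {
		r = isEven(i)
	}
	evenSink = r
}
```

If both benchmarks report roughly the same ns/op, the compiler kept the call in both. If the discarded version is suspiciously close to zero, you were measuring an empty loop.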
b.ResetTimer: setup you don’t want to measure
If your benchmark needs expensive setup before measuring, use b.ResetTimer() so that setup doesn’t count:
func BenchmarkWithSetup(b *testing.B) {
	// Expensive setup: create data, open connections, etc.
	data := generateTestData(10000)
	b.ResetTimer() // The timer resets here
	for i := 0; i < b.N; i++ {
		processData(data)
	}
}
Without b.ResetTimer(), the time for generateTestData would contaminate your measurement. This is especially problematic if that setup takes longer than the function you want to measure.
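The snippet above leaves generateTestData and processData undefined. As a self-contained variant you can paste into a _test.go file, here is a sketch where makeInput plays the role of the expensive setup and sum is the function being measured. Both are hypothetical stand-ins, not functions from any real codebase.

```go
package main

import "testing"

// makeInput is a hypothetical stand-in for expensive setup: it fills a
// slice with deterministic, scrambled values.
func makeInput(n int) []int {
	data := make([]int, n)
	for i := range data {
		data[i] = (i * 7919) % n
	}
	return data
}

// sum is the (cheap) function under test.
func sum(data []int) int {
	total := 0
	for _, v := range data {
		total += v
	}
	return total
}

// sumSink keeps the result observable.
var sumSink int

func BenchmarkSumWithSetup(b *testing.B) {
	data := makeInput(100_000) // setup: not what we want to measure
	b.ResetTimer()             // discard time and allocations recorded so far
	var s int
	for i := 0; i < b.N; i++ {
		s = sum(data)
	}
	sumSink = s
}
```

Commenting out the b.ResetTimer() line and running with -benchmem is a quick way to see how much the setup leaks into the reported numbers.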
b.StopTimer and b.StartTimer: pausing between iterations
Sometimes you need to do work between iterations that you don’t want to measure. For example, resetting state:
func BenchmarkWithPause(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		data := generateFreshData() // not measured
		b.StartTimer()
		processData(data) // this is measured
	}
}
Use this carefully. Calling b.StopTimer() and b.StartTimer() inside the loop has its own overhead. If your function is very fast (nanoseconds), that overhead may be larger than what you’re measuring. In those cases, it’s better to prepare all data outside the loop.
Benchmarks that depend on previous state
Each iteration should be independent. If your function modifies shared state between iterations, the measurements become contaminated:
// BAD: the map grows on each iteration
func BenchmarkMapAppend(b *testing.B) {
	m := make(map[string]int)
	for i := 0; i < b.N; i++ {
		m[fmt.Sprintf("key-%d", i)] = i // each iteration slower
	}
}
Here, the last iterations are slower than the first because the map is larger. The benchmark gives you an average that doesn’t represent any real case.
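One possible fix is to make each iteration a fixed-size unit of work: build the keys once, and fill a fresh map of known size inside the loop. This is a sketch, not the only option; keyCount and mapSink are names chosen for the example, and "one operation" now means "insert 1024 keys into an empty map".

```go
package main

import (
	"strconv"
	"testing"
)

// keyCount fixes the amount of work per iteration (arbitrary for this sketch).
const keyCount = 1024

// mapSink keeps the result observable.
var mapSink int

func BenchmarkMapInsertFixedSize(b *testing.B) {
	// Build the keys once so formatting them is not part of the measurement.
	keys := make([]string, keyCount)
	for i := range keys {
		keys[i] = "key-" + strconv.Itoa(i)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// A fresh map per iteration: the work done does not depend on
		// how many iterations came before.
		m := make(map[string]int, keyCount)
		for j, k := range keys {
			m[k] = j
		}
		mapSink = len(m)
	}
}
```

Dividing the reported ns/op by keyCount gives an approximate cost per insert for a map of that size.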
Memory benchmarks: b.ReportAllocs
You’ve already seen that -benchmem gives you memory statistics. You can also activate allocation reporting from within the benchmark itself with b.ReportAllocs():
func BenchmarkWithReportAllocs(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		ConcatPlus(parts)
	}
}
This is useful when you want allocations to always be reported for a specific benchmark, without depending on whoever runs it remembering to add -benchmem.
Why allocations matter
In Go, each heap allocation is future work for the garbage collector. Fewer allocations means less pressure on GC, which translates to lower latency and fewer pauses. In high-load applications, like those you’d build if you’re using Go for heavy tasks, the difference between 0 allocations and 3 allocations per operation can be the difference between a p99 of 5ms and a p99 of 50ms.
A good practice is to start by measuring allocations before looking at nanoseconds. If a function makes 10 allocations per call and you call it 100,000 times per second, that’s a million allocations per second the GC has to manage.
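If you want a regular test to fail when a function starts allocating more than expected, the standard library offers testing.AllocsPerRun. The sketch below shows one possible way to use it; the helper allocsWithinBudget, the function joinBuilder, and the budget of 2 allocations are assumptions made for the example, not values taken from a real project.

```go
package main

import (
	"strings"
	"testing"
)

// joinBuilder is a small function under test (hypothetical).
func joinBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// joinSink keeps the result observable.
var joinSink string

// allocsWithinBudget returns the average number of allocations per call
// of f and whether that average stays within the given budget.
func allocsWithinBudget(budget float64, f func()) (float64, bool) {
	got := testing.AllocsPerRun(100, f)
	return got, got <= budget
}

func TestJoinBuilderAllocs(t *testing.T) {
	parts := []string{"a", "b", "c", "d"}
	got, ok := allocsWithinBudget(2, func() { joinSink = joinBuilder(parts) })
	if !ok {
		t.Fatalf("joinBuilder allocates %.0f times per call, budget is 2", got)
	}
}
```

A test like this runs with plain go test, so it catches allocation regressions even when nobody runs the benchmarks.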
Pre-allocating to reduce allocations
A common optimization that benchmarks reveal is pre-allocation of slices and maps:
func BenchmarkSliceWithoutPrealloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := []int{}
		for j := 0; j < 1000; j++ {
			s = append(s, j)
		}
	}
}
func BenchmarkSlicePreallocated(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := make([]int, 0, 1000)
		for j := 0; j < 1000; j++ {
			s = append(s, j)
		}
	}
}
Run this with -benchmem and you’ll see that the version without pre-allocation typically makes somewhere between 10 and 20 allocations (every time the slice outgrows its capacity, Go allocates a larger backing array and copies the elements over). The pre-allocated version needs at most one allocation, and may even report zero: because s never leaves the function, escape analysis can keep a fixed-size backing array on the stack.
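The same idea applies to maps, which the text mentions but does not show. Here is a minimal sketch comparing a map created empty with one created with a size hint. The constant entries and the sink lenSink are arbitrary choices for the example, and the exact allocation counts depend on the map implementation of your Go version.

```go
package main

import "testing"

// entries is the number of keys inserted per iteration (arbitrary).
const entries = 1000

// lenSink keeps the result observable.
var lenSink int

// BenchmarkMapWithoutPrealloc lets the map grow as keys are inserted.
func BenchmarkMapWithoutPrealloc(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		m := map[int]int{}
		for j := 0; j < entries; j++ {
			m[j] = j
		}
		lenSink = len(m)
	}
}

// BenchmarkMapPreallocated passes a size hint so the map can reserve
// space for all entries up front.
func BenchmarkMapPreallocated(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		m := make(map[int]int, entries)
		for j := 0; j < entries; j++ {
			m[j] = j
		}
		lenSink = len(m)
	}
}
```

The hinted version should report fewer allocs/op and usually a lower ns/op, but as always, run it on your machine before drawing conclusions.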
Profiling with pprof: the basics
Benchmarks tell you how long something takes. pprof tells you where time is spent. They are complementary tools.
To generate a CPU profile from your benchmarks:
go test -bench=BenchmarkConcatPlus -cpuprofile=cpu.out .
This generates a cpu.out file you can analyze with go tool pprof. The profiling flags only work when go test runs a single package, so point the command at one package (here, the current directory) rather than at ./... when that pattern matches several.
go tool pprof cpu.out
Inside pprof, the most useful commands:
(pprof) top10 # the 10 functions that consume the most CPU
(pprof) list ConcatPlus # shows source code annotated with timings
(pprof) web # opens a call graph in the browser (requires Graphviz)
For memory profiles:
go test -bench=BenchmarkConcatPlus -memprofile=mem.out .
go tool pprof mem.out
Inside the memory profile, top10 shows which functions allocate the most memory, and list points you to the exact lines.
The complete workflow
The pragmatic workflow for optimizing performance in Go is:
- Benchmark: identify how long each operation takes.
- pprof CPU: identify where time is spent.
- pprof memory: identify what is being allocated.
- Optimize: change only what the data tells you matters.
- Benchmark again: verify your change actually improved something.
- benchstat: confirm the improvement is statistically significant.
If you skip the first step and go straight to optimizing, you’re where I was a few years ago: wasting two days on something that doesn’t matter.
When to benchmark and when not to
Yes
- When you’re choosing between two implementations and performance is a factor.
- Before optimizing anything. Measure first.
- When a change touches code on the hot path of your application.
- To validate that an optimization actually improved something.
- In CI, to detect performance regressions between commits.
No
- For code that runs once at application startup.
- For endpoints that do I/O (HTTP, database): the benchmark doesn’t capture network latency.
- As a substitute for a load test. Benchmarks measure isolated functions. A load test measures the complete system.
- When the difference between both options is nanoseconds and your application handles 10 requests per second. Readability matters more.
The general rule: if you can’t articulate why the performance of that specific function is critical, you probably don’t need a benchmark. And if you do need it, the data will tell you quickly.
Practical example: comparing two implementations
Let’s look at a more realistic case. Suppose you have a service that filters users by a set of roles. Two approaches:
// users.go
package users
type User struct {
	Name  string
	Roles []string
}
// FilterWithSlice iterates through each user's roles using a slice.
func FilterWithSlice(users []User, allowedRoles []string) []User {
	var result []User
	for _, u := range users {
		for _, role := range u.Roles {
			if sliceContains(allowedRoles, role) {
				result = append(result, u)
				break
			}
		}
	}
	return result
}
func sliceContains(s []string, val string) bool {
	for _, v := range s {
		if v == val {
			return true
		}
	}
	return false
}
// FilterWithMap converts allowed roles to a map for O(1) lookup.
func FilterWithMap(users []User, allowedRoles []string) []User {
	allowed := make(map[string]struct{}, len(allowedRoles))
	for _, r := range allowedRoles {
		allowed[r] = struct{}{}
	}
	var result []User
	for _, u := range users {
		for _, role := range u.Roles {
			if _, ok := allowed[role]; ok {
				result = append(result, u)
				break
			}
		}
	}
	return result
}
The first implementation is O(n*m*k), where n = users, m = roles per user, and k = allowed roles. The second is O(n*m) because map lookup is O(1) on average. In theory, the map should win. But the map has creation and hashing overhead. With few allowed roles, the slice could be faster.
The benchmark:
// users_test.go
package users
import "testing"
func generateUsers(n int) []User {
	users := make([]User, n)
	roles := []string{"admin", "editor", "viewer", "moderator", "guest"}
	for i := range users {
		users[i] = User{
			Name:  "user",
			Roles: []string{roles[i%len(roles)], roles[(i+1)%len(roles)]},
		}
	}
	return users
}
var allowedRoles = []string{"admin", "editor", "moderator"}
var resultSink []User
func BenchmarkFilterWithSlice_100(b *testing.B) {
	users := generateUsers(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resultSink = FilterWithSlice(users, allowedRoles)
	}
}
func BenchmarkFilterWithMap_100(b *testing.B) {
	users := generateUsers(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resultSink = FilterWithMap(users, allowedRoles)
	}
}
func BenchmarkFilterWithSlice_10000(b *testing.B) {
	users := generateUsers(10000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resultSink = FilterWithSlice(users, allowedRoles)
	}
}
func BenchmarkFilterWithMap_10000(b *testing.B) {
	users := generateUsers(10000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resultSink = FilterWithMap(users, allowedRoles)
	}
}
Notice several details:
- We generate data outside the loop and use b.ResetTimer() to not measure generation.
- We assign to resultSink (package variable) to prevent the compiler from eliminating the call.
- We test with two sizes (100 and 10,000) because behavior may change with data volume.
With 3 allowed roles, it’s possible the slice version wins for 100 users (the map overhead doesn’t pay off). And that the map version wins with 10,000 users. Or maybe not. That’s exactly the point: the benchmark gives you the answer instead of forcing you to guess.
If the results show the difference is marginal, choose the more readable implementation. If one is 10x faster with your real data volume, choose the faster one. But let the numbers decide.
Sub-benchmarks for parameterization
Go allows sub-benchmarks with b.Run, which are ideal for testing multiple sizes without duplicating functions:
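func BenchmarkFilterWithSlice(b *testing.B) {
	for _, size := range []int{10, 100, 1000, 10000} {
		users := generateUsers(size)
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				resultSink = FilterWithSlice(users, allowedRoles)
			}
		})
	}
}
Because this version calls fmt.Sprintf, users_test.go also needs to import "fmt". The data is generated before b.Run, so it isn’t part of the measurement. The output will be: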
BenchmarkFilterWithSlice/size=10-8 ...
BenchmarkFilterWithSlice/size=100-8 ...
BenchmarkFilterWithSlice/size=1000-8 ...
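BenchmarkFilterWithSlice/size=10000-8 ...
You can filter specific sub-benchmarks. Each slash-separated part of the pattern is an unanchored regular expression, so anchor the size with $ to keep size=1000 from also matching size=10000:
go test -bench='FilterWithSlice/size=1000$' ./...
This makes benchmarks much more maintainable. Instead of writing one function per case, you parameterize and let the framework do the work.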
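Sub-benchmarks also make it easy to sweep the other variable from the complexity analysis: the number of allowed roles. The sketch below isolates just the lookup (slice scan versus map lookup) for different values of k. It is a simplified illustration, not the article’s filter functions: makeRoles, inSlice and foundSink are names invented here, and the map is built once outside the measurement, so its creation cost is deliberately excluded.

```go
package main

import (
	"fmt"
	"testing"
)

// makeRoles builds k distinct role names (hypothetical helper).
func makeRoles(k int) []string {
	roles := make([]string, k)
	for i := range roles {
		roles[i] = fmt.Sprintf("role-%d", i)
	}
	return roles
}

// inSlice is a linear scan over the allowed roles.
func inSlice(s []string, val string) bool {
	for _, v := range s {
		if v == val {
			return true
		}
	}
	return false
}

// foundSink keeps the result observable.
var foundSink bool

func BenchmarkRoleLookup(b *testing.B) {
	for _, k := range []int{3, 30, 300} {
		roles := makeRoles(k)
		set := make(map[string]struct{}, k)
		for _, r := range roles {
			set[r] = struct{}{}
		}
		target := roles[k-1] // last element: worst case for the linear scan
		b.Run(fmt.Sprintf("k=%d", k), func(b *testing.B) {
			b.Run("slice", func(b *testing.B) {
				var found bool
				for i := 0; i < b.N; i++ {
					found = inSlice(roles, target)
				}
				foundSink = found
			})
			b.Run("map", func(b *testing.B) {
				var found bool
				for i := 0; i < b.N; i++ {
					_, found = set[target]
				}
				foundSink = found
			})
		})
	}
}
```

The results appear as BenchmarkRoleLookup/k=3/slice, BenchmarkRoleLookup/k=3/map and so on, which makes it straightforward to see at which k (if any) the map starts to pay off on your hardware.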
Parallel benchmarks: b.RunParallel
If your code runs in a concurrent context (and in Go, almost everything does), you can measure performance under concurrency with b.RunParallel:
func BenchmarkFilterConcurrent(b *testing.B) {
	users := generateUsers(1000)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			FilterWithMap(users, allowedRoles)
		}
	})
}
b.RunParallel launches multiple goroutines (by default, as many as GOMAXPROCS) that execute your code in parallel. Each goroutine iterates by calling pb.Next() instead of using b.N directly. The framework takes care of distributing the b.N iterations among them.
This is useful for detecting contention: locks, atomic operations, cache false sharing. A function that is fast in a sequential benchmark can be slow when ten goroutines execute it simultaneously.
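To see contention show up in numbers, a shared counter is a convenient toy example. The sketch below compares a mutex-protected counter with an atomic one under b.RunParallel. The type mutexCounter and both benchmark names are made up for illustration, and the relative results will vary with core count and hardware.

```go
package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

// mutexCounter is a counter guarded by a mutex (illustrative type).
type mutexCounter struct {
	mu sync.Mutex
	n  int64
}

// Inc increments the counter while holding the lock.
func (c *mutexCounter) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// BenchmarkMutexCounterParallel has all goroutines fight over one lock.
func BenchmarkMutexCounterParallel(b *testing.B) {
	var c mutexCounter
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Inc()
		}
	})
}

// BenchmarkAtomicCounterParallel uses a lock-free atomic increment on
// the same kind of shared state.
func BenchmarkAtomicCounterParallel(b *testing.B) {
	var n int64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomic.AddInt64(&n, 1)
		}
	})
}
```

Running it with something like go test -bench=CounterParallel -cpu=1,2,4,8 shows how ns/op changes as more goroutines compete for the same memory.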
Measure first, optimize later
Everything we’ve seen boils down to one idea: don’t optimize based on what you think is slow. Measure, confirm, and only then act.
Go gives you the tools built into the standard library. You don’t need external frameworks, special configuration, or licenses. testing.B has been there from day one, ready to use in any _test.go file.
The summarized workflow:
- Write a benchmark for the function you suspect is slow.
- Run with -benchmem -count=5 to have CPU and memory data.
- Use benchstat to compare before and after with statistical significance.
- Generate profiles with -cpuprofile and -memprofile if you need to know exactly where time is spent.
- Optimize only what the data points to as the real bottleneck.
- Measure again to confirm your change actually improved something.
If the benchmark tells you the difference between two implementations is 20 nanoseconds and your endpoint takes 200 milliseconds, close the benchmark and spend your time on something that matters. If it tells you one implementation allocates 50 times less memory on a hot path that processes 100,000 requests per second, you’ve found something worth optimizing.
Data has no opinions. Your intuition does. Trust the data.


