Recently, I was bored and decided to test the performance of a 4x4 matrix multiply. I wanted to compare how GCC and Clang optimize the same code with different flags. So I manually wrote a benchmark harness using a neat trick: a C file that’s also a valid shell script.
The idea is simple. You embed shell commands in C comments at the top of the file. When you run sh benchmark.c, the shell sees the commands and executes them to compile and run the benchmark. When the C compiler processes it, the commands are just comments.
#if 0
gcc -O3 -c benchmark.c -o bench_gcc.o ...
clang -O3 -c benchmark.c -o bench_clang.o ...
gcc benchmark.c bench_gcc.o bench_clang.o -o benchmark
./benchmark
exit
#endif
#include <stdio.h>
// ... rest of the C code
I first remember seeing this trick in libmemory by skeeto, though he probably wasn’t the first to do it. It’s a neat way to make self-contained, self-building C files. You can simply run them with sh benchmark.c.
After I manually wrote a testcase with a harness using this trick, I immediately started wondering if it could be automated. And that is how I came up with benchpress.
How It Works
benchpress is a tool that generates these self-building benchmark harnesses automatically. You write a template C file with three types of annotations:
BENCHFUNC- The function(s) you want to compile with different configurationsWARMUP- Code to warm up caches before each benchmark runBENCHMARK- The actual benchmark iterations
You can have multiple BENCHMARK functions to test different scenarios. The warmup function pairs with a benchmark by naming convention: append _warmup to the benchmark name. So run_benchmark_warmup pairs with run_benchmark. Each benchmark must have a corresponding warmup function.
Here’s the example that started it all:
// matmul_template.c
#include <stdint.h>
typedef struct { float data[16]; } Matrix4x4;
void init_matrix(Matrix4x4 *m) {
for (int i = 0; i < 16; i++) {
m->data[i] = (float)(i + 1);
}
}
BENCHFUNC void matmul(Matrix4x4 *a, Matrix4x4 *b, Matrix4x4 *result) {
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
float sum = 0.0f;
for (int k = 0; k < 4; k++) {
sum += a->data[i*4+k] * b->data[k*4+j];
}
result->data[i*4+j] = sum;
}
}
}
WARMUP void run_benchmark_warmup(void) {
Matrix4x4 a, b, result;
init_matrix(&a);
init_matrix(&b);
for (int64_t i = 0; i < 1000000; i++) {
matmul(&a, &b, &result);
}
}
BENCHMARK void run_benchmark(void) {
Matrix4x4 a, b, result;
init_matrix(&a);
init_matrix(&b);
for (int64_t i = 0; i < 1000000000; i++) {
matmul(&a, &b, &result);
}
}
Then, run benchpress to generate a benchmark harness:
python3 benchpress.py matmul_template.c \
--compilers gcc:clang \
--flags="-O2:-O3" \
-o benchmark.c
Run it:
sh benchmark.c
Output:
GCC version: gcc (GCC) 15.2.0
Clang version: clang version 20.1.6
gcc -O2: 6.374 seconds
gcc -O3: 3.681 seconds
clang -O2: 6.163 seconds
clang -O3: 3.801 seconds
gcc -O2 vs clang -O2: clang -O2 was 1.03x faster
clang -O3 vs gcc -O3: gcc -O3 was 1.03x faster
The generated benchmark.c file is completely self-contained. It uses the same polyglot trick described earlier. You can share it as a single file, and anyone with gcc and clang installed can run it with sh benchmark.c.
Key Features
Automatic Combinations
benchpress can generate all combinations of compilers and flags:
benchpress.py template.c \
--compilers gcc:clang \
--flags="-O2:-O3:-O3 -march=native" \
-o benchmark.c
This creates 6 configurations (2 compilers × 3 flag sets) and automatically compares results.
Custom Configurations
For more control, specify individual configurations with --config:
benchpress.py template.c \
--config gcc:-O2 \
--config gcc:-O3 \
--compare "gcc -O2,gcc -O3" \
-o benchmark.c
This generates a benchmark that compares gcc -O3 to gcc -O2.
Use Cases
Here is a short list of usecases for benchpress:
- Quick experiments: Does that optimization flag actually make a difference? Generate a benchmark and find out in seconds
- Compiler comparisons: Which produces faster code for this algorithm - GCC or Clang? Test it directly
- Bug reports: Attach a single self-contained file demonstrating a performance regression
- Architecture flags: Measure the actual impact of
-march=nativeor other hardware-specific optimizations - Sharing benchmarks: Send someone a file they can run with
sh benchmark.c- no build instructions needed
Try It Out
The tool is open source and available on GitHub. Feel free to open issues, send me an email, or leave comments on this blog post. Suggestions, contributions and feedback are welcome!