# The Bandwidth Benchmark This is a collection of simple streaming kernels for teaching purposes. It is heavily inspired by John McCalpin's https://www.cs.virginia.edu/stream/ benchmark. It contains the following streaming kernels with corresponding data access pattern (Notation: S - store, L - load, WA - write allocate). All variables are vectors, s is a scalar: * init (S1, WA): Initilize an array: `a = s`. Store only. * sum (L1): Vector reduction: `s += a`. Load only. * copy (L1, S1, WA): Classic memcopy: `a = b`. * update (L1, S1): Update vector: `a = a * scalar`. Also load + store but without write allocate. * triad (L2, S1, WA): Stream triad - `a = b + b * scalar`. * daxpy (L2, S1): Daxpy - `a = a + b * scalar`. * striad (L3, S1, WA): Schoenauer triad - `a = b + c * d`. * sdaxpy (L3, S1): Schoenauer triad without write allocate - `a = a + b * c`. As added benefit the code is a blueprint for a minimal benchmarking application with a generic makefile and modules for aligned array allocation, accurate timing and affinity settings. Those components can be used standalone in your own project. ## Build 1. Configure the toolchain and additional options in `config.mk`: ``` # Supported: GCC, CLANG, ICC TAG ?= GCC ENABLE_OPENMP ?= false OPTIONS = -DSIZE=40000000ull OPTIONS += -DNTIMES=10 OPTIONS += -DARRAY_ALIGNMENT=64 #OPTIONS += -DVERBOSE_AFFINITY #OPTIONS += -DVERBOSE_DATASIZE #OPTIONS += -DVERBOSE_TIMER ``` The verbosity options enable detailed output about affinity settings, allocation sizes and timer resolution. 2. Build with: ``` make ``` You can build multiple toolchains in the same directory, but notice that the Makefile is only acting on the one currently set. Intermediate build results are located in the `` directory. To output the executed commands use: ``` make Q= ``` 3. Clean up with: ``` make clean ``` to clean intermediate build results. ``` make distclean ``` to clean intermediate build results and binary. 4. (Optional) Generate assembler: ``` make asm ``` The assembler files will also be located in the `` directory. ## Usage To run the benchmark call: ``` ./bwBench- ``` The benchmark will output the results similar to the stream benchmark. Results are validated. For threaded execution it is recommended to control thread affinity. We recommend to use likwid-pin for benchmarking: ``` likwid-pin -c 0-3 ./bwbench-GCC ``` Example output for threaded execution: ``` ------------------------------------------------------------- [pthread wrapper] [pthread wrapper] MAIN -> 0 [pthread wrapper] PIN_MASK: 0->1 1->2 2->3 [pthread wrapper] SKIP MASK: 0x0 threadid 140271463495424 -> core 1 - OK threadid 140271455102720 -> core 2 - OK threadid 140271446710016 -> core 3 - OK OpenMP enabled, running with 4 threads ------------------------------------------------------------- Function Rate (MB/s) Avg time Min time Max time Init: 14681.5000 0.0110 0.0109 0.0111 Sum: 20634.9290 0.0079 0.0078 0.0082 Copy: 18822.2827 0.0172 0.0170 0.0176 Update: 28135.9717 0.0115 0.0114 0.0117 Triad: 19263.0634 0.0253 0.0249 0.0268 Daxpy: 26718.1377 0.0182 0.0180 0.0187 STriad: 21229.4470 0.0305 0.0301 0.0313 SDaxpy: 26714.3897 0.0243 0.0240 0.0253 ------------------------------------------------------------- Solution Validates ```