diff --git a/Makefile b/Makefile
index 54527fd..dd4f0ef 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,28 @@
+#=======================================================================================
+#
+# Author: Jan Eitzinger (je), jan.treibig@gmail.com
+# Copyright (c) 2019 RRZE, University Erlangen-Nuremberg
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#
+#=======================================================================================
+
 #CONFIGURE BUILD SYSTEM
 TARGET = bwbench-$(TAG)
 BUILD_DIR = ./$(TAG)
diff --git a/README.md b/README.md
index 6119293..3fb7111 100644
--- a/README.md
+++ b/README.md
@@ -2,12 +2,7 @@
 
 This is a collection of simple streaming kernels for teaching purposes.
 
-It consists of two banchmark applications:
-
-* [MainMemory](https://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/MainMemory)
-* [MemoryHierarchy](https://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/MemoryHierarchy)
-
-Apart from the microbenchmarking functionality this is also a blueprint for other microbenchmarking applications.
+Apart from the micro-benchmark functionality, this is also a blueprint for other micro-benchmark applications.
 It contains C modules for:
 
 * Aligned data allocation
@@ -15,3 +10,124 @@ It contains C modules for:
 * Accurate timing
 
 Moreover the benchmark showcases a simple generic Makefile that can be used in other projects.
+
+## Overview
+
+The benchmark is heavily inspired by John McCalpin's STREAM benchmark (https://www.cs.virginia.edu/stream/).
+
+It contains the following streaming kernels with their corresponding data access patterns (Notation: S - store, L - load, WA - write allocate). All variables are vectors, s is a scalar:
+
+* init (S1, WA): Initialize an array: `a = s`. Store only.
+* sum (L1): Vector reduction: `s += a`. Load only.
+* copy (L1, S1, WA): Classic memcopy: `a = b`.
+* update (L1, S1): Update vector: `a = a * scalar`. Also load + store but without write allocate.
+* triad (L2, S1, WA): Stream triad: `a = b + c * scalar`.
+* daxpy (L2, S1): Daxpy: `a = a + b * scalar`.
+* striad (L3, S1, WA): Schoenauer triad: `a = b + c * d`.
+* sdaxpy (L3, S1): Schoenauer triad without write allocate: `a = a + b * c`.
+
+As an added benefit, the code is a blueprint for a minimal benchmarking application with a generic makefile and modules for aligned array allocation, accurate timing and affinity settings. Those components can be used standalone in your own project.
+
+## Build
+
+1. Configure the toolchain and additional options in `config.mk`:
+```
+# Supported: GCC, CLANG, ICC
+TAG ?= GCC
+ENABLE_OPENMP ?= false
+ENABLE_LIKWID ?= false
+
+OPTIONS = -DSIZE=40000000ull
+OPTIONS += -DNTIMES=10
+OPTIONS += -DARRAY_ALIGNMENT=64
+#OPTIONS += -DVERBOSE_AFFINITY
+#OPTIONS += -DVERBOSE_DATASIZE
+#OPTIONS += -DVERBOSE_TIMER
+```
+
+The verbosity options enable detailed output about affinity settings, allocation sizes and timer resolution.
+
+2. Build with:
+```
+make
+```
+
+You can build multiple toolchains in the same directory, but note that the Makefile only acts on the toolchain currently set. Intermediate build results are located in the `` directory.
+
+To output the executed commands use:
+```
+make Q=
+```
+
+3. Clean up with:
+```
+make clean
+```
+to clean intermediate build results.
+
+```
+make distclean
+```
+to clean intermediate build results and binary.
+
+4. (Optional) Generate assembler:
+```
+make asm
+```
+The assembler files will also be located in the `` directory.
+
+## Usage
+
+To run the benchmark call:
+```
+./bwbench-
+```
+
+The benchmark prints its results in a format similar to the STREAM benchmark. Results are validated.
+For threaded execution it is recommended to control thread affinity.
+
+We recommend using likwid-pin for benchmarking:
+```
+likwid-pin -c 0-3 ./bwbench-GCC
+```
+
+Example output for threaded execution:
+```
+-------------------------------------------------------------
+[pthread wrapper]
+[pthread wrapper] MAIN -> 0
+[pthread wrapper] PIN_MASK: 0->1 1->2 2->3
+[pthread wrapper] SKIP MASK: 0x0
+ threadid 140271463495424 -> core 1 - OK
+ threadid 140271455102720 -> core 2 - OK
+ threadid 140271446710016 -> core 3 - OK
+OpenMP enabled, running with 4 threads
+----------------------------------------------------------------------------
+Function      Rate(MB/s)  Rate(MFlop/s)  Avg time     Min time     Max time
+Init:         22111.53    -              0.0148       0.0145       0.0165
+Sum:          46808.59    46808.59       0.0077       0.0068       0.0140
+Copy:         30983.06    -              0.0207       0.0207       0.0208
+Update:       43778.69    21889.34       0.0147       0.0146       0.0148
+Triad:        34476.64    22984.43       0.0282       0.0278       0.0305
+Daxpy:        45908.82    30605.88       0.0214       0.0209       0.0242
+STriad:       37502.37    18751.18       0.0349       0.0341       0.0388
+SDaxpy:       46822.63    23411.32       0.0281       0.0273       0.0325
+----------------------------------------------------------------------------
+Solution Validates
+```
+
+## Benchmarking script
+
+A Perl wrapper script (bench.pl) is also provided to scan ranges of thread counts and determine the absolute highest sustained main memory bandwidth. To use it, `likwid-pin` has to be in your PATH. The script has three required and one optional command line arguments:
+```
+$./bench.pl []
+```
+Example usage:
+```
+$./bench.pl ./bwbench-GCC 2-8 6
+```
+The script always uses physical cores only and assumes two SMT threads per core by default. For different SMT thread counts use the 4th command line argument. Example for a processor without SMT:
+```
+$./bench.pl ./bwbench-GCC 14-24 10 1
+```
+
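As an illustration of the access-pattern notation introduced in the Overview hunk (this is not part of the patch), the per-iteration data volume of each kernel could be tabulated roughly as follows. It is a minimal sketch under stated assumptions: the names `kernel_info`, `kernels`, `bytes_per_iteration` and `triad` are hypothetical and do not come from the repository, elements are assumed to be `double`, and the README does not say whether the reported rates include write-allocate traffic, so that choice is left as a parameter.

```c
#include <stddef.h>

/* Hypothetical descriptor for one kernel from the Overview list. */
typedef struct {
    const char *name;
    int loads;          /* L: number of load streams           */
    int stores;         /* S: number of store streams          */
    int write_allocate; /* WA: 1 if the store target is not    */
                        /* loaded by the kernel itself         */
} kernel_info;

/* Values transcribed from the (Lx, Sx, WA) notation in the README. */
static const kernel_info kernels[] = {
    {"init",   0, 1, 1},
    {"sum",    1, 0, 0},
    {"copy",   1, 1, 1},
    {"update", 1, 1, 0},
    {"triad",  2, 1, 1},
    {"daxpy",  2, 1, 0},
    {"striad", 3, 1, 1},
    {"sdaxpy", 3, 1, 0},
};

/* Bytes moved per loop iteration. With count_wa != 0, a kernel   */
/* marked WA is charged one extra stream for the cache line that  */
/* is read before it is overwritten (an assumption of this sketch). */
static size_t bytes_per_iteration(const kernel_info *k, int count_wa)
{
    int streams = k->loads + k->stores;
    if (count_wa && k->write_allocate) {
        streams += 1;
    }
    return (size_t)streams * sizeof(double);
}

/* Stream triad as described in the README: a = b + c * scalar. */
static void triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i] * scalar;
    }
}
```

For example, `bytes_per_iteration(&kernels[4], 1)` charges triad four `double` streams when the write-allocate load is counted, versus three without it, while daxpy stays at three streams either way.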