Move docs from git to README

2019-09-15 06:52:29 +02:00
parent 7dd895fefc
commit 3c9ff383c4
2 changed files with 147 additions and 6 deletions
--- a/25
+++ b/25
@@ -1,3 +1,28 @@
 #=======================================================================================
 #
 #     Author:   Jan Eitzinger (je), jan.treibig@gmail.com
 #     Copyright (c) 2019 RRZE, University Erlangen-Nuremberg
 #
 #     Permission is hereby granted, free of charge, to any person obtaining a copy
 #     of this software and associated documentation files (the "Software"), to deal
 #     in the Software without restriction, including without limitation the rights
 #     to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 #     copies of the Software, and to permit persons to whom the Software is
 #     furnished to do so, subject to the following conditions:
 #
 #     The above copyright notice and this permission notice shall be included in all
 #     copies or substantial portions of the Software.
 #
 #     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 #     IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 #     FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 #     AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 #     LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 #     OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 #     SOFTWARE.
 #
 #=======================================================================================
 #CONFIGURE BUILD SYSTEM
 TARGET	   = bwbench-$(TAG)
 BUILD_DIR  = ./$(TAG)
--- a/README.md
+++ b/README.md
@@ -2,12 +2,7 @@
 This is a collection of simple streaming kernels for teaching purposes.
-It consists of two banchmark applications:
+Apart from the micro-benchmark functionality this is also a blueprint for other micro-benchmark applications.
 * [MainMemory](https://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/MainMemory)
 * [MemoryHierarchy](https://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/MemoryHierarchy)
 Apart from the microbenchmarking functionality this is also a blueprint for other microbenchmarking applications.
 It contains C modules for:
 * Aligned data allocation
@@ -15,3 +10,124 @@ It contains C modules for:
 * Accurate timing
 Moreover the benchmark showcases a simple generic Makefile that can be used in other projects.
 ## Overview
 The benchmark is heavily inspired by John McCalpin's [STREAM benchmark](https://www.cs.virginia.edu/stream/).
 It contains the following streaming kernels, listed with their data access patterns (notation: S - store, L - load, WA - write allocate). All variables are vectors except `s` and `scalar`, which are scalars:
 * init (S1, WA): Initialize an array: `a = s`. Store only.
 * sum (L1): Vector reduction: `s += a`. Load only.
 * copy  (L1, S1, WA): Classic memcopy: `a = b`.
 * update (L1, S1): Update vector: `a = a * scalar`. Also load + store but without write allocate.
 * triad (L2, S1, WA): Stream triad: `a = b + c * scalar`.
 * daxpy (L2, S1): Daxpy: `a = a + b * scalar`.
 * striad (L3, S1, WA): Schoenauer triad: `a = b + c * d`.
 * sdaxpy (L3, S1): Schoenauer triad without write allocate: `a = a + b * c`.
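To make the notation above concrete, here is a minimal sketch of three of the listed kernels in plain C. The function names and signatures are hypothetical and chosen for illustration only; the benchmark's own sources may organize the loops differently (for example with OpenMP pragmas).

```c
#include <assert.h>
#include <stddef.h>

/* sum (L1): s += a. One load stream, no store. */
double sum(const double *a, size_t n)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* triad (L2, S1, WA): a = b + c * scalar.
 * a is only written, so a write-allocate transfer is usually needed. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double scalar, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i] * scalar;
    }
}

/* daxpy (L2, S1): a = a + b * scalar.
 * a is read before it is written, so no separate write allocate occurs. */
void daxpy(double *restrict a, const double *restrict b,
           double scalar, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i] * scalar;
    }
}
```

The only difference between `triad` and `daxpy` in terms of memory traffic is whether the target array is also a source, which is what the WA marker in the list expresses.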
 As an added benefit, the code is a blueprint for a minimal benchmarking application with a generic Makefile and modules for aligned array allocation, accurate timing and affinity settings. These components can be used standalone in your own project.
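As a rough illustration of what aligned allocation and timing helpers can look like, the following sketch uses the POSIX calls `posix_memalign` and `clock_gettime`. The helper names `allocate_aligned` and `get_time_stamp` are assumptions made for this example and are not taken from the repository.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: return a buffer of `bytes` bytes whose address is a
 * multiple of `alignment`, or NULL on failure. Release it with free(). */
void *allocate_aligned(size_t alignment, size_t bytes)
{
    void *ptr = NULL;
    /* posix_memalign requires a power of two that is a multiple of sizeof(void *) */
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
    if (posix_memalign(&ptr, alignment, bytes) != 0) {
        return NULL;
    }
    return ptr;
}

/* Hypothetical helper: wall-clock time in seconds from a monotonic clock,
 * so the value is not affected by system time adjustments. */
double get_time_stamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}
```

A kernel would then be timed by taking one time stamp before and one after the loop and subtracting the two. An alignment of 64 bytes matches the `ARRAY_ALIGNMENT=64` default shown in `config.mk`.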
 ## Build
 1. Configure the toolchain and additional options in `config.mk`:
 ```
 # Supported: GCC, CLANG, ICC
 TAG ?= GCC
 ENABLE_OPENMP ?= false
 ENABLE_LIKWID ?= false
 OPTIONS  =  -DSIZE=40000000ull
 OPTIONS +=  -DNTIMES=10
 OPTIONS +=  -DARRAY_ALIGNMENT=64
 #OPTIONS +=  -DVERBOSE_AFFINITY
 #OPTIONS +=  -DVERBOSE_DATASIZE
 #OPTIONS +=  -DVERBOSE_TIMER
 ```
 The verbosity options enable detailed output about affinity settings, allocation sizes and timer resolution.
 2. Build with:
 ```
 make
 ```
 You can build with multiple toolchains in the same directory, but note that the Makefile only acts on the toolchain currently selected via `TAG`. Intermediate build results are located in the `<TOOLCHAIN>` directory.
 To output the executed commands use:
 ```
 make Q=
 ```
 3. Clean up with:
 ```
 make clean
 ```
 to clean intermediate build results.
 ```
 make distclean
 ```
 to clean intermediate build results and binary.
 4. (Optional) Generate assembler:
 ```
 make asm
 ```
 The assembler files will also be located in the `<TOOLCHAIN>` directory.
 ## Usage
 To run the benchmark call:
 ```
 ./bwbench-<TOOLCHAIN>
 ```
 The benchmark prints its results in a format similar to the STREAM benchmark. Results are validated.
 For threaded execution it is recommended to control thread affinity.
 We recommend using `likwid-pin` for benchmarking:
 ```
 likwid-pin -c 0-3 ./bwbench-GCC
 ```
 Example output for threaded execution:
 ```
 -------------------------------------------------------------
 [pthread wrapper]
 [pthread wrapper] MAIN -> 0
 [pthread wrapper] PIN_MASK: 0->1  1->2  2->3
 [pthread wrapper] SKIP MASK: 0x0
        threadid 140271463495424 -> core 1 - OK
        threadid 140271455102720 -> core 2 - OK
        threadid 140271446710016 -> core 3 - OK
 OpenMP enabled, running with 4 threads
 ----------------------------------------------------------------------------
 Function      Rate(MB/s)  Rate(MFlop/s)  Avg time     Min time     Max time
 Init:          22111.53    -             0.0148       0.0145       0.0165
 Sum:           46808.59    46808.59      0.0077       0.0068       0.0140
 Copy:          30983.06    -             0.0207       0.0207       0.0208
 Update:        43778.69    21889.34      0.0147       0.0146       0.0148
 Triad:         34476.64    22984.43      0.0282       0.0278       0.0305
 Daxpy:         45908.82    30605.88      0.0214       0.0209       0.0242
 STriad:        37502.37    18751.18      0.0349       0.0341       0.0388
 SDaxpy:        46822.63    23411.32      0.0281       0.0273       0.0325
 ----------------------------------------------------------------------------
 Solution Validates
 ```
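The bandwidth column in the example output appears consistent with STREAM-style accounting: bytes explicitly loaded and stored per iteration (write-allocate traffic not counted), divided by the minimum time. The sketch below shows that calculation. It rests on two assumptions: that the example run used the default `SIZE=40000000`, and that the rate is derived this way. The function name `bandwidth_mbs` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bandwidth in MB/s for a kernel that touches
 * `words_per_iter` double-precision words per loop iteration over
 * `n` iterations, given the minimum measured runtime in seconds. */
double bandwidth_mbs(size_t n, int words_per_iter, double min_time_s)
{
    assert(min_time_s > 0.0);
    double bytes = (double)n * (double)words_per_iter * (double)sizeof(double);
    return 1.0e-6 * bytes / min_time_s;
}
```

For Copy (one load and one store, so two words per iteration) with the rounded minimum time of 0.0207 s, `bandwidth_mbs(40000000, 2, 0.0207)` gives roughly 30900 MB/s, close to the 30983.06 MB/s in the table. The small difference is expected because the printed time is rounded.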
 ## Benchmarking script
 A Perl wrapper script (`bench.pl`) is also provided to scan ranges of thread counts and determine the absolute highest sustained main memory bandwidth. To use it, `likwid-pin` has to be in your path. The script takes three required and one optional command line arguments:
 ```
 $ ./bench.pl <executable> <thread count range> <repetitions> [<SMT setting>]
 ```
 Example usage:
 ```
 $./bench.pl ./bwbench-GCC 2-8 6
 ```
 The script always uses physical cores only and assumes two SMT threads per core by default. For a different SMT thread count, use the fourth command line argument. Example for a processor without SMT:
 ```
 $./bench.pl ./bwbench-GCC 14-24  10  1
 ```