
pi-parallelization-examples

  • Martin Kröning authored d5115eab

    Pi Parallelization Examples

    Benchmarking several strategies for computing pi in parallel.

    This project is largely inspired by

    Introduction

    Pi can be computed using the following integral:

    \large \int_{0}^{1} \frac{4}{1+x^{2}} \; \mathrm{d}x = [ 4 \tan^{-1}(x) ]_{0}^{1} = \pi \approx 3.1416

    The integral can be approximated using the midpoint rule:

    \large \int_{0}^{1} \frac{4}{1+x^{2}} \; \mathrm{d}x = \lim_{|\Delta x|\rightarrow 0} \sum_{i=1}^{n} \frac{4}{1+{x_i^*}^{2}} \,\Delta x_i \approx \sum_{i=0}^{{10}^6-1} \frac{4}{1+((i+0.5){10}^{-6})^{2}} \,{10}^{-6} =: S

    The deviation from pi is small:

    \large |\pi - S| < 8.\overline{3} \cdot {10}^{-14}

    The summands can be computed independently. We will explore and benchmark some strategies to parallelize this.
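As a minimal sketch of the midpoint sum S above, the following C++ shows a serial loop and a variant annotated with an OpenMP reduction. The function names `pi_midpoint_serial` and `pi_midpoint_openmp` are hypothetical and are not taken from this project's sources; the OpenMP pragma is only an assumption suggested by the benchmark names listed further down. The factor Δx is pulled out of the sum, which is mathematically equivalent to multiplying every summand by it.

```cpp
#include <cassert>
#include <cstdint>

// Serial midpoint rule: S = dx * sum_{i=0}^{n-1} 4 / (1 + x_i^2),
// with x_i = (i + 0.5) * dx and dx = 1 / n.
double pi_midpoint_serial(std::int64_t n) {
    assert(n > 0);
    const double dx = 1.0 / static_cast<double>(n);
    double sum = 0.0;
    for (std::int64_t i = 0; i < n; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * dx;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * dx;
}

// Same sum, but the loop is marked as a parallel SIMD reduction.
// When the compiler is invoked without OpenMP support (e.g. no -fopenmp),
// the pragma is ignored and the loop simply runs serially.
double pi_midpoint_openmp(std::int64_t n) {
    assert(n > 0);
    const double dx = 1.0 / static_cast<double>(n);
    double sum = 0.0;
#pragma omp parallel for simd reduction(+ : sum)
    for (std::int64_t i = 0; i < n; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * dx;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * dx;
}
```

Calling either function with `n = 1000000` corresponds to the sum S above. With OpenMP enabled, the partial sums are combined in a different order, so the two results may differ in the last few bits.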

    Requirements

    Compiling

    To set up a build directory called buildclangrelease for building with Clang, with full native optimization and without debug info, run:

    $ CXX=clang++ CXXFLAGS="-march=native" meson setup --buildtype=release buildclangrelease

    The project can be built and tested with:

    $ ninja -C buildclangrelease test

    Run the benchmarks with:

    $ ./buildclangrelease/pi-calculator-portable/pi-calculator-portable-benchmark
    $ ./buildclangrelease/pi-calculator-avx/pi-calculator-avx-benchmark

    Comparison

    These results are from @mkroening's Intel Core i7-7500U.

    The project was built with Clang 8.0.0 in release mode, using different CXXFLAGS.

    Only one benchmark run of pi-calculator-avx is listed, since more aggressive compiler optimization does not improve the speed of code that is already written explicitly with AVX.

    No additional CXXFLAGS

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla               1470575 ns    1468924 ns        473
    BM_PiCalculatorOpenMPSIMD             838745 ns     838071 ns        822
    BM_PiCalculatorOpenMPParallel         769489 ns     764301 ns        769
    BM_PiCalculatorOpenMPParallelSIMD     413227 ns     411954 ns       1696

    Benchmark                         Time           CPU Iterations
    ----------------------------------------------------------------
    BM_PiCalculatorAVXASM        573359 ns     573062 ns       1214
    BM_PiCalculatorAVXIntrin     573359 ns     573079 ns       1209

    CXXFLAGS="-mavx"

    Explicitly enable AVX instructions. Requires a CPU which supports AVX to run.

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla               1582090 ns    1580321 ns        438
    BM_PiCalculatorOpenMPSIMD             670314 ns     669561 ns       1030
    BM_PiCalculatorOpenMPParallel         824931 ns     822780 ns        835
    BM_PiCalculatorOpenMPParallelSIMD     296564 ns     295824 ns       2365

    CXXFLAGS="-march=native"

    Generate code specific to the system's CPU. For more info see the Gentoo Wiki.

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla               1584891 ns    1584221 ns        442
    BM_PiCalculatorOpenMPSIMD             573493 ns     573227 ns       1204
    BM_PiCalculatorOpenMPParallel         826333 ns     825273 ns        794
    BM_PiCalculatorOpenMPParallelSIMD     296294 ns     288988 ns       2424

    CXXFLAGS="-march=native -Ofast"

    -Ofast is not recommended for use in production. It disregards strict standards compliance in favour of the most aggressive speed optimizations. See Stack Overflow for what -ffast-math, which is one of the options enabled by -Ofast, does to floating point operations. See the Gentoo Wiki for more information on optimization levels.
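One effect of -ffast-math is that the compiler may regroup floating point additions as if they were associative. The sketch below illustrates why that is not generally safe. The helper names are hypothetical, and the example is meant to be compiled without -ffast-math. Regrouping is also what allows a plain summation loop to be vectorized, which is plausibly why the Vanilla benchmark below gets close to the SIMD variant, though that reading of the numbers is an assumption.

```cpp
#include <cassert>

// Adds three doubles grouped as (a + b) + c.
double add_left_first(double a, double b, double c) {
    return (a + b) + c;
}

// Adds the same three doubles grouped as a + (b + c).
double add_right_first(double a, double b, double c) {
    return a + (b + c);
}

// Under IEEE 754 double arithmetic the two groupings disagree here:
// 1.0 is far too small to change -1e100, so it is lost in b + c.
void demonstrate_non_associativity() {
    const double a = 1e100, b = -1e100, c = 1.0;
    assert(add_left_first(a, b, c) == 1.0);   // (a + b) is exactly 0
    assert(add_right_first(a, b, c) == 0.0);  // (b + c) rounds to -1e100
}
```

For the pi sum the summands all have similar magnitude, so reordering only perturbs the last few bits, but the result is no longer guaranteed to be bit-identical to the serial loop.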

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla                573767 ns     573524 ns       1212
    BM_PiCalculatorOpenMPSIMD             573311 ns     573069 ns       1183
    BM_PiCalculatorOpenMPParallel         288170 ns     287855 ns       2425
    BM_PiCalculatorOpenMPParallelSIMD     290118 ns     288521 ns       2428

    License

    GPLv3