pi-parallelization-examples

  • Pi Parallelization Examples

    Benchmarking several strategies for computing pi in parallel.

    This project is largely inspired by

    Introduction

    Pi can be computed using the following integral:

    \large \int_{0}^{1} \frac{4}{1+x^{2}} \; \mathrm{d}x = [ 4 \tan^{-1}(x) ]_{0}^{1} = \pi \approx 3.1416

    The integral can be approximated using the midpoint rule:

    \large \int_{0}^{1} \frac{4}{1+x^{2}} \; \mathrm{d}x = \lim_{|\Delta x|\rightarrow0} \sum_{i=1}^{n} \frac{4}{1+{x_i^*}^{2}} \,\Delta x_i \approx \sum_{i=0}^{{10}^6-1} \frac{4}{1+((i+0.5){10}^{-6})^{2}} \,{10}^{-6} =: S

    The deviation from pi is small:

    \large |\pi - S| < 8.\overline{3} \cdot {10}^{-14}
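
    One way to arrive at a figure of this size is the leading error term of the composite midpoint rule. This is an assumption about where the number comes from, since the derivation is not given here, and it ignores floating-point rounding. With f(x) = 4/(1+x^2) and step width 10^-6:

    ```latex
    f(x) = \frac{4}{1+x^{2}}, \qquad f'(x) = -\frac{8x}{(1+x^{2})^{2}}, \qquad f'(0) = 0, \quad f'(1) = -2

    \pi - S \approx \frac{(\Delta x)^{2}}{24} \, \bigl( f'(1) - f'(0) \bigr)
            = \frac{{10}^{-12}}{24} \cdot (-2)
            = -8.\overline{3} \cdot {10}^{-14}
    ```

    Under that assumption S lies slightly above pi, and the remaining terms of the expansion are of order (\Delta x)^4.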

    The summands can be computed independently. We will explore and benchmark some strategies to parallelize this.
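
    As a rough illustration of the sum S and of one way to parallelize it, here is a minimal sketch. It is not the project's actual code: the function names `pi_midpoint_serial` and `pi_midpoint_openmp` are hypothetical, and the OpenMP variant assumes a compiler invoked with `-fopenmp` (without it the pragma is ignored and the loop runs serially).

    ```cpp
    // Midpoint-rule approximation of the integral of 4 / (1 + x^2) over [0, 1].
    // n is the number of subintervals; the sum S above corresponds to n = 1000000.
    // Hypothetical helper, not taken from the project.
    double pi_midpoint_serial(long n) {
        const double dx = 1.0 / static_cast<double>(n);
        double sum = 0.0;
        for (long i = 0; i < n; ++i) {
            const double x = (static_cast<double>(i) + 0.5) * dx;  // midpoint of interval i
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * dx;
    }

    // Same sum, with iterations distributed over threads and SIMD lanes.
    // The reduction clause gives every thread a private partial sum and
    // adds the partial sums together after the loop.
    // Hypothetical helper, not taken from the project.
    double pi_midpoint_openmp(long n) {
        const double dx = 1.0 / static_cast<double>(n);
        double sum = 0.0;
    #pragma omp parallel for simd reduction(+ : sum)
        for (long i = 0; i < n; ++i) {
            const double x = (static_cast<double>(i) + 0.5) * dx;
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * dx;
    }
    ```

    Calling either function with `1000000` should return a value close to 3.141592653589793. The two results can differ in the last few digits, because the parallel version adds the summands in a different order.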

    Requirements

    Compiling

    To set up a build directory called buildclangrelease that builds with Clang, full native optimization and no debug info, run:

    $ CXX=clang++ CXXFLAGS="-march=native" meson setup --buildtype=release buildclangrelease

    The project can be built and tested with:

    $ ninja -C buildclangrelease test

    Run the benchmarks with:

    $ ./buildclangrelease/pi-calculator-portable/pi-calculator-portable-benchmark
    $ ./buildclangrelease/pi-calculator-avx/pi-calculator-avx-benchmark

    Comparison

    These results are from @mkroening's Intel Core i7-7500U.

    The project was built in release mode with Clang 8.0.0, using different CXXFLAGS for each run.

    Only one benchmark run of pi-calculator-avx is listed, since more aggressive compiler optimization does not make such hand-tuned code any faster.

    No additional CXXFLAGS

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla               1470575 ns    1468924 ns        473
    BM_PiCalculatorOpenMPSIMD             838745 ns     838071 ns        822
    BM_PiCalculatorOpenMPParallel         769489 ns     764301 ns        769
    BM_PiCalculatorOpenMPParallelSIMD     413227 ns     411954 ns       1696

    Benchmark                         Time           CPU Iterations
    ----------------------------------------------------------------
    BM_PiCalculatorAVXASM        573359 ns     573062 ns       1214
    BM_PiCalculatorAVXIntrin     573359 ns     573079 ns       1209

    CXXFLAGS="-mavx"

    Explicitly enables AVX instructions. The resulting binary requires a CPU that supports AVX.

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla               1582090 ns    1580321 ns        438
    BM_PiCalculatorOpenMPSIMD             670314 ns     669561 ns       1030
    BM_PiCalculatorOpenMPParallel         824931 ns     822780 ns        835
    BM_PiCalculatorOpenMPParallelSIMD     296564 ns     295824 ns       2365

    CXXFLAGS="-march=native"

    Generates code specific to the build system's CPU. For more info see the Gentoo Wiki.

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla               1584891 ns    1584221 ns        442
    BM_PiCalculatorOpenMPSIMD             573493 ns     573227 ns       1204
    BM_PiCalculatorOpenMPParallel         826333 ns     825273 ns        794
    BM_PiCalculatorOpenMPParallelSIMD     296294 ns     288988 ns       2424

    CXXFLAGS="-march=native -Ofast"

    -Ofast is not recommended for production use. It gives up strict standard compliance in favour of the most aggressive speed optimizations. See Stack Overflow for what -ffast-math, one of the flags enabled by -Ofast, does to floating-point operations. See the Gentoo Wiki for more information on optimization levels.
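
    One effect of -ffast-math is that the compiler may regroup floating-point additions, which is not value-preserving. The sketch below shows the underlying issue with two hypothetical helper functions (not part of the project); it assumes IEEE 754 double precision and a build without -ffast-math.

    ```cpp
    // Adds three doubles in the order written in the source.
    // Hypothetical helper for illustration only.
    double sum_left_to_right(double a, double b, double c) {
        return (a + b) + c;
    }

    // Adds the same three doubles with a different grouping, as a
    // reassociating optimizer would be allowed to do under -ffast-math.
    // Hypothetical helper for illustration only.
    double sum_regrouped(double a, double b, double c) {
        return a + (b + c);
    }
    ```

    With a = 1e17, b = -1e17 and c = 1.0, the first function returns 1.0 and the second returns 0.0, because 1.0 is lost when it is added to -1e17. Permission to regroup is also what lets a compiler vectorize a plain summation loop, which is a plausible reason why BM_PiCalculatorVanilla comes close to BM_PiCalculatorOpenMPSIMD in the table below.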

    Benchmark                                  Time           CPU Iterations
    -------------------------------------------------------------------------
    BM_PiCalculatorVanilla                573767 ns     573524 ns       1212
    BM_PiCalculatorOpenMPSIMD             573311 ns     573069 ns       1183
    BM_PiCalculatorOpenMPParallel         288170 ns     287855 ns       2425
    BM_PiCalculatorOpenMPParallelSIMD     290118 ns     288521 ns       2428

    License

    GPLv3