Pi Parallelization Examples
Benchmarking several strategies for computing pi in parallel.
This project is largely inspired by:
- the course "Foundations of Computer Science 4" at RWTH Aachen University
  - Much of the assembly code is based on os/gi4/chapter2/pi.
- the "Parallelism in C++" video series (repository) by Bisqwit
Introduction
Pi can be computed using the following integral:
The integral can be approximated using the midpoint rule:
The deviation from pi is small:
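The formulas themselves do not appear in this text. A plausible reconstruction is sketched here, assuming the integrand is the arctangent-based $f(x) = 4 / (1 + x^2)$ that is commonly used for this exercise (an assumption, not confirmed by the source). The symbols $n$ (number of subintervals) and $M_n$ (midpoint sum) are introduced for illustration.

$$\pi = \int_0^1 \frac{4}{1 + x^2}\,dx$$

$$\pi \approx M_n = \frac{1}{n} \sum_{i=0}^{n-1} \frac{4}{1 + \left(\frac{i + 0.5}{n}\right)^2}$$

$$|M_n - \pi| \le \frac{(1 - 0)^3}{24\,n^2} \max_{x \in [0,1]} |f''(x)| = \frac{8}{24\,n^2} = \frac{1}{3\,n^2}$$

The last line is the standard midpoint-rule error bound. Under the assumed integrand, $f''(x) = 8(3x^2 - 1)/(1 + x^2)^3$, whose absolute value on $[0, 1]$ is largest at $x = 0$, where it equals 8.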
The summands can be computed independently. We will explore and benchmark some strategies to parallelize this.
Requirements
Compiling
To set up a build directory called buildclangrelease that builds with Clang, full native optimization, and no debug info, run:
$ CXX=clang++ CXXFLAGS="-march=native" meson setup --buildtype=release buildclangrelease
The project can be built and tested with:
$ ninja -C buildclangrelease test
Run the benchmarks with:
$ ./buildclangrelease/pi-calculator-portable/pi-calculator-portable-benchmark
$ ./buildclangrelease/pi-calculator-avx/pi-calculator-avx-benchmark
Comparison
These results are from @mkroening's Intel Core i7-7500U.
The project has been built with Clang 8.0.0 as release with different CXXFLAGS.
Only one benchmark of pi-calculator-avx is listed, since more aggressive compiler optimization does not improve the speed of such explicitly vectorized code.
No additional CXXFLAGS
Benchmark                                  Time          CPU   Iterations
-------------------------------------------------------------------------
BM_PiCalculatorVanilla               1470575 ns   1468924 ns          473
BM_PiCalculatorOpenMPSIMD             838745 ns    838071 ns          822
BM_PiCalculatorOpenMPParallel         769489 ns    764301 ns          769
BM_PiCalculatorOpenMPParallelSIMD     413227 ns    411954 ns         1696
Benchmark                         Time          CPU   Iterations
----------------------------------------------------------------
BM_PiCalculatorAVXASM        573359 ns    573062 ns         1214
BM_PiCalculatorAVXIntrin     573359 ns    573079 ns         1209
CXXFLAGS="-mavx"
Explicitly enables AVX instructions. Running the resulting binary requires a CPU that supports AVX.
Benchmark                                  Time          CPU   Iterations
-------------------------------------------------------------------------
BM_PiCalculatorVanilla               1582090 ns   1580321 ns          438
BM_PiCalculatorOpenMPSIMD             670314 ns    669561 ns         1030
BM_PiCalculatorOpenMPParallel         824931 ns    822780 ns          835
BM_PiCalculatorOpenMPParallelSIMD     296564 ns    295824 ns         2365
CXXFLAGS="-march=native"
Generate code specific to the system's CPU. For more info see the Gentoo Wiki.
Benchmark                                  Time          CPU   Iterations
-------------------------------------------------------------------------
BM_PiCalculatorVanilla               1584891 ns   1584221 ns          442
BM_PiCalculatorOpenMPSIMD             573493 ns    573227 ns         1204
BM_PiCalculatorOpenMPParallel         826333 ns    825273 ns          794
BM_PiCalculatorOpenMPParallelSIMD     296294 ns    288988 ns         2424
CXXFLAGS="-march=native -Ofast"
-Ofast is not recommended for use in production. It disregards strict standards compliance in favour of the most aggressive speed optimizations.
See Stack Overflow for what -ffast-math, which -Ofast enables among other flags, does to floating-point operations. See the Gentoo Wiki for more information on optimization levels.
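One effect of -ffast-math that matters for a sum like this one is that the compiler may reorder floating-point additions. The sketch below (hypothetical helper names, not project code) shows why that is not allowed by default: adding the same values in "SIMD order" can give a different result than adding them left to right. This is likely why the plain loop is only vectorized automatically once -Ofast is set.

```cpp
#include <cstddef>

// Strict left-to-right summation, as the language rules require by default.
double sum_sequential(const double* v, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += v[i];
    }
    return sum;
}

// Summation the way a 2-lane SIMD loop would do it: even indices go to one
// partial sum, odd indices to another, and the lanes are combined at the end.
double sum_two_lanes(const double* v, std::size_t n) {
    double lane0 = 0.0;
    double lane1 = 0.0;
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    if (i < n) {
        lane0 += v[i];  // leftover element when n is odd
    }
    return lane0 + lane1;
}
```

For the input {1e16, 1.0, -1e16, 1.0} with IEEE 754 doubles, sum_sequential returns 1.0 because the first 1.0 is rounded away when added to 1e16, while sum_two_lanes returns 2.0, the exact sum. Neither order is more accurate in general. The point is only that the results can differ, so the compiler must not switch between them unless told it may.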
Benchmark                                  Time          CPU   Iterations
-------------------------------------------------------------------------
BM_PiCalculatorVanilla                573767 ns    573524 ns         1212
BM_PiCalculatorOpenMPSIMD             573311 ns    573069 ns         1183
BM_PiCalculatorOpenMPParallel         288170 ns    287855 ns         2425
BM_PiCalculatorOpenMPParallelSIMD     290118 ns    288521 ns         2428