Multithreading Performance of the Cell Sorting Example (Amdahl's Law)
Theory
One measure of multithreaded performance is the theoretical speedup predicted by Amdahl’s law [1]. It assumes that the workload can be split into a parallelizable and a non-parallelizable part, where the parallelizable fraction is quantified by $0\leq p \leq1$. A higher value of $p$ means that the non-parallelizable part contributes less.
$$\begin{equation} T(n) = T_0\frac{1}{(1-p) + \frac{p}{n}} \end{equation}$$
Here, $n$ is the number of parallel threads, $p$ is the proportion of execution time which benefits from parallelization, and $T_0=T(1)$ is the single-threaded value, so that $T(n)/T_0$ is the speedup obtained with $n$ threads.
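To get a feeling for equation $(1)$, the following minimal sketch evaluates the predicted speedup $T(n)/T_0$ for a few values of $p$ and $n$. The function name `amdahl_speedup` is chosen here for illustration and is not part of cellular_raza.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup T(n)/T_0 predicted by equation (1).

    p: parallelizable fraction of the workload, 0 <= p <= 1
    n: number of threads, n >= 1
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    # Print a small table: one row per parallel fraction p.
    for p in (0.5, 0.9, 0.99):
        row = ", ".join(
            f"n={n}: {amdahl_speedup(p, n):6.2f}" for n in (1, 2, 8, 64)
        )
        print(f"p={p}: {row}")
```

Even for $p=0.9$ the speedup can never exceed $10$, no matter how many threads are added, because the serial part $1-p$ dominates for large $n$.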
Simulation Setup
Measuring the performance of any simulation will be highly dependent on the specific cellular properties and complexity. For this comparison, we chose the cell-sorting example which contains minimal complexity compared to the remaining showcases. Any computational overhead which is intrinsic to cellular_raza and not related to the chosen example would thus be more likely to manifest itself in performance results. The total runtime of the simulation is of no relevance since we are only concerned with relative speedup upon using additional resources.
The cellular_raza-benchmarks crate is a command-line utility which can be used to run benchmarks with various configurations.
# cd cellular_raza-benchmarks
# cargo run -- -h
cellular_raza benchmarks

Usage: cell_sorting [OPTIONS] <NAME> [COMMAND]

Commands:
  threads   Thread scaling benchmark
  sim-size  Simulation Size scaling benchmark
  help      Print this message or the help of the given subcommand(s)

Arguments:
  <NAME>  Name of the current runs such as name of the device to be benchmarked

Options:
  -o, --output-directory <OUTPUT_DIRECTORY>
          Output directory of benchmark results [default: benchmark_results]
  -s, --sample-size <SAMPLE_SIZE>
          Number of samples to be generated for each measurement [default: 5]
      --no-save
          Do not save results. This takes priority against the overwrite settings
      --overwrite
          Overwrite existing results
      --no-output
          Disables output
  -h, --help
          Print help
  -V, --version
          Print version
Results generated in this way are stored inside the benchmark_results folder. In addition, we provide a python script plotting/cell_sorting.py to quickly visualize obtained results.
Hardware
This benchmark was run on three distinct hardware configurations. To produce reliable benchmarks, we would in principle have to control a wide range of variables, or at least test their configurations. However, we expect the largest effects to come from the power limits and clock frequency of the respective hardware. Both can be circumvented by fixing the frequency of each processor at an artificially low value such that the total power limit of the CPU is never reached. While it is well known that other aspects such as cache size and memory latency can have an impact on absolute performance, they should not introduce significant deviations in relative performance scaling.
| CPU | Fixed Clock Speed | Memory Frequency | TDP |
|---|---|---|---|
| AMD Ryzen 3700X [2] | 2200 MHz | 3200 MT/s | 65 W |
| AMD Ryzen Threadripper 3960X [3] | 2000 MHz | 3200 MT/s | 280 W |
| Intel Core i7-12700H [4] | 2000 MHz | 4800 MT/s | 45 W |
Table 1: Hardware configurations used for the benchmark.
Results
Figure 1: Performance results for increasing number of utilized threads.
We fit equation $(1)$ to the measurements and obtain the parameter $p$, from which the theoretical maximal speedup $S$ (the limit $n\to\infty$) can be calculated via
$$\begin{equation} S = \frac{1}{1-p} \end{equation}$$
and thus, from Figure 1, we obtain the values $S_\text{3700X}=13.64\pm1.73$, $S_\text{3960X}=44.99\pm2.80$ and $S_\text{12700H}=34.74\pm5.05$. The uncertainty $\sigma(S)$ is calculated via standard Gaussian error propagation
$$\begin{equation} \sigma(S) = \frac{\sigma(p)}{(1-p)^2} \end{equation}$$
where we obtained $\sigma(p)$ from the fit above.
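The fit and the error propagation can be sketched in a few lines of plain Python. This is not the procedure used to produce the numbers above (the source does not state which fitting routine was used); as a stand-in, the sketch uses the linearised form $1 - T_0/T(n) = p\,(1 - 1/n)$ of equation $(1)$ and an ordinary least-squares line through the origin. The function names and the synthetic data are assumptions made for illustration.

```python
import math


def fit_parallel_fraction(threads, speedups):
    """Estimate p and sigma(p) from measured speedups T(n)/T_0.

    Assumption: linearised least squares through the origin,
    y = p * x with x = 1 - 1/n and y = 1 - T_0/T(n).
    Only thread counts n > 1 carry information about p.
    """
    pairs = [(n, s) for n, s in zip(threads, speedups) if n > 1]
    if len(pairs) < 2:
        raise ValueError("need at least two measurements with n > 1")
    xs = [1.0 - 1.0 / n for n, _ in pairs]
    ys = [1.0 - 1.0 / s for _, s in pairs]
    sxx = sum(x * x for x in xs)
    p = sum(x * y for x, y in zip(xs, ys)) / sxx
    # Residual variance with one fitted parameter.
    rss = sum((y - p * x) ** 2 for x, y in zip(xs, ys))
    sigma_p = math.sqrt(rss / (len(xs) - 1) / sxx)
    return p, sigma_p


def max_speedup(p, sigma_p):
    """Equations (2) and (3): S = 1/(1-p), sigma(S) = sigma(p)/(1-p)^2."""
    s = 1.0 / (1.0 - p)
    return s, sigma_p / (1.0 - p) ** 2


if __name__ == "__main__":
    # Synthetic, noise-free data generated from equation (1) with p = 0.95.
    threads = [2, 4, 8, 16]
    speedups = [1.0 / (0.05 + 0.95 / n) for n in threads]
    p, sigma_p = fit_parallel_fraction(threads, speedups)
    print(p, sigma_p, max_speedup(p, sigma_p))
```

Inverting equation $(2)$ gives $p = 1 - 1/S$, so the reported maximal speedups correspond to $p\approx0.927$ (3700X), $p\approx0.978$ (3960X) and $p\approx0.971$ (12700H). Because $\sigma(S)$ grows with $1/(1-p)^2$, small uncertainties in $p$ are strongly amplified when $p$ is close to $1$.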
Discussion
The perfect score of a fully parallelizable system with $p=1$ is almost unattainable in a real-world scenario, where effects such as the workload of the underlying operating system and physical constraints make this value hard to reach. In practice, the value measured here also depends on the respective hardware.
In addition to hardware-related influences, we also expect a fraction $1-p$ of our simulation code to be fundamentally not parallelizable. This fraction includes the initial setup of the simulation, which necessarily has to start single-threaded and can only extend to multiple threads once all respective subdomains have been generated. Furthermore, stopping the simulation frees resources after all threads have been joined again. Even more importantly, all threads currently use a Barrier to sync with each other. This also creates a dependency and introduces overhead.
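The dependency introduced by a barrier can be illustrated with a small, self-contained sketch. It uses Python's `threading.Barrier` as a stand-in and is not taken from the cellular_raza source; the function `run_with_barrier` and its logging are hypothetical. Each worker records the step it is working on and then waits until every other worker has reached the same point.

```python
import threading


def run_with_barrier(n_threads, n_steps):
    """Run n_threads workers that synchronise at a barrier after each step.

    Returns the list of (step, thread_id) entries in the order they were
    recorded.
    """
    barrier = threading.Barrier(n_threads)
    lock = threading.Lock()
    log = []

    def worker(thread_id):
        for step in range(n_steps):
            # Stand-in for the actual work on one subdomain.
            with lock:
                log.append((step, thread_id))
            # No thread continues until all threads have arrived here.
            barrier.wait()

    workers = [
        threading.Thread(target=worker, args=(i,)) for i in range(n_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return log


if __name__ == "__main__":
    for step, thread_id in run_with_barrier(n_threads=4, n_steps=3):
        print(f"step {step} done by thread {thread_id}")
```

No entry for step $k+1$ can appear before all entries for step $k$, so every step takes as long as its slowest thread. Time that faster threads spend waiting at the barrier cannot be reduced by adding more threads.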
The total speedup $S$ is still very good for all configurations, which can be directly attributed to the core assumption of cellular_raza that all interactions are strictly local and that subdomains only interact along their borders, without the need to construct complex long-range synchronization algorithms.
References
[1] D. P. Rodgers, “Improvements in multiprocessor system design,” ACM SIGARCH Computer Architecture News, vol. 13, no. 3. Association for Computing Machinery (ACM), pp. 225–231, Jun. 1985. doi: 10.1145/327070.327215.
[2] AMD Ryzen™ 7 3700X. [Online]. Available: https://www.amd.com/en/product/8446
[3] AMD Ryzen™ Threadripper™ 3960X Drivers & Support. [Online]. Available: https://www.amd.com/en/support/downloads/drivers.html/processors/ryzen-threadripper/ryzen-threadripper-3000-series/amd-ryzen-threadripper-3960x.html
[4] Intel® Core™ i7-12700H Processor. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/132228/intel-core-i7-12700h-processor-24m-cache-up-to-4-70-ghz.html