MercuryDPM can run simulations in parallel on both symmetric multiprocessors (SMPs) and massively parallel processors (MPPs). The distinction between the two architectures is that SMPs (single multi-core machines) use shared memory, while MPPs (clusters) use distributed memory. We parallelise on shared memory with an OpenMP implementation and on distributed memory with an MPI implementation. The following tutorial demonstrates how to use OpenMP parallelisation in your code.
To use the MercuryDPM-OpenMP framework, you need to add a few code fragments to the main function and enable some extra options when configuring and compiling. To illustrate this, consider a "Demo" application that we want to run in parallel with OpenMP enabled.
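The driver code is structured roughly as follows. This is a minimal sketch only: the use of the Mercury3D base class and all numerical settings shown are illustrative, not prescribed by this tutorial.

```cpp
#include "Mercury3D.h"

// Minimal sketch of a driver; the class name "Demo" and the settings below
// are illustrative placeholders.
class Demo : public Mercury3D
{
public:
    void setupInitialConditions() override
    {
        // Define species, particles, walls and boundaries here.
    }
};

int main(int argc, char* argv[])
{
    Demo problem;
    problem.setName("Demo");
    problem.setTimeStep(5e-5);
    problem.setTimeMax(0.1);
    problem.setSaveCount(100);
    // The OpenMP-specific call discussed below is added here.
    problem.solve();
    return 0;
}
```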
Firstly, the OpenMP parallel environment has to be activated using cmake, i.e., by turning the flag MercuryDPM_USE_OpenMP to ON, as follows:
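For example, when (re)configuring from the build directory (the ".." path to the source tree is illustrative and depends on your setup):

```bash
cmake -DMercuryDPM_USE_OpenMP=ON ..
```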
Alternatively, you can use ccmake or cmake-gui to change the CMake configuration. Instructions for ccmake:
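A typical ccmake workflow (this is standard ccmake usage rather than anything MercuryDPM-specific): run ccmake from the build directory, scroll to the MercuryDPM_USE_OpenMP entry and press Enter to toggle it to ON, press c to configure, then press g to generate the build files and exit.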
Next, we set the number of threads to run in parallel for the program by calling the setNumberOfOMPThreads(n) function in the main function of the application, where n denotes the number of threads.
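For instance, inside main() (the problem object follows the sketch above, and the thread count of 4 is an illustrative value):

```cpp
int main(int argc, char* argv[])
{
    Demo problem;
    // Run the simulation on 4 OpenMP threads (the value 4 is illustrative).
    problem.setNumberOfOMPThreads(4);
    problem.solve();
    return 0;
}
```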
After adding the above modification, the simulation will execute in parallel, with only one additional line added to the main() function.
Alternatively, you can set the number of OMP threads using a command-line argument that is read via the helpers::readFromCommandLine(...) function:
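A sketch of this variant is shown below; the flag name "-omp" and the default value of 1 are illustrative assumptions, as is the exact argument order of the helper (assumed here to take argc, argv, the flag name, and a default value):

```cpp
int main(int argc, char* argv[])
{
    Demo problem;
    // Read the thread count from the command line; "-omp" and the default
    // value 1 are illustrative assumptions.
    problem.setNumberOfOMPThreads(helpers::readFromCommandLine(argc, argv, "-omp", 1));
    problem.solve();
    return 0;
}
```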
In that case, the number of threads is set at execution-time via the command line, as follows:
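For example (the executable name and the "-omp" flag match the sketch above and are assumptions):

```bash
./Demo -omp 4
```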
Or you can use helpers::getMaximumNumberOfOMPThreads() to get the number of available OpenMP threads:
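Reusing the problem object from the sketches above:

```cpp
// Use as many threads as OpenMP reports available on this machine.
problem.setNumberOfOMPThreads(helpers::getMaximumNumberOfOMPThreads());
```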
In this case, the number of threads is set at execution-time automatically:
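The application can then be started without any thread-related argument:

```bash
./Demo
```

If the helper wraps the standard omp_get_max_threads() call (an assumption about its implementation), the chosen thread count can still be influenced by setting the OMP_NUM_THREADS environment variable before launching.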
For an example of an OpenMP-ready code, see /Drivers/MercurySimpleDemos/FreeCooling2DinWallsDemo.cpp.
The table below shows performance results of the MercuryDPM-OpenMP framework:
The performance was tested using a 200,000-particle simulation of a cooling granular gas on a quad-core processor, with a time step of 5 × 10⁻⁵, a save count of 100, and a maximum simulated time of 0.1. The fully functional parallel program produced output identical to that of the serial application. The execution time decreases with the number of threads. The theoretical maximum speedup is based on Amdahl's law for 4 cores, assuming that 68% of the code can be parallelised.
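For reference, with the stated parallel fraction p = 0.68 on N = 4 cores, Amdahl's law gives the theoretical maximum speedup:

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
     = \frac{1}{0.32 + \frac{0.68}{4}}
     \approx 2.04
\qquad (p = 0.68,\ N = 4)
```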
On the following pages, you'll find more parallelisation demonstrations:
Parallel processing using MPI
Parallel processing for Input-Output, I/O files
Alternatively, go back to Overview of advanced tutorials