Description
This may end up being resolved through documentation, but we spent enough time on it to warrant capturing it in an issue.
While running parameter sweeps to compare strong scaling performance between "all-rank" and "all-thread" simulations, we observed some anomalous behavior on our Linux systems for the threaded simulations. The fixes are:
- In a Slurm script (using `sbatch`): `export OMPI_MCA_hwloc_base_binding_policy=socket`
- On the `mpirun` command line: `--bind-to socket`

With these changes, the strong scaling tracked extremely well for both ranks and threads.
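The two fixes above can be sketched as follows (node count, job layout, and the executable name are illustrative, not from the original report):

```shell
#!/bin/bash
# Option 1: Slurm batch script submitted with sbatch.
# Setting the MCA parameter in the environment makes Open MPI
# bind each process to a socket, so its threads stay on one socket.
#SBATCH --nodes=2
export OMPI_MCA_hwloc_base_binding_policy=socket
srun ./sst_simulation

# Option 2: equivalently, pass the binding policy directly on the
# mpirun command line (Linux only; not supported on macOS):
#   mpirun --bind-to socket -n 2 ./sst_simulation
```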
Use of the command-line option needs to be qualified, as it is not supported on macOS. When invoked on a Mac it produces a helpful message shedding more light on the issue:
```
PRRTE uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.

On OS X, processor and memory binding is not available at all (i.e.,
the OS does not expose this functionality).

On Linux, lack of the functionality can mean that you are on a
platform where processor and memory affinity is not supported in Linux
itself, or that hwloc was built without NUMA and/or processor affinity
support. When building hwloc (which, depending on your PRRTE
installation, may be embedded in PRRTE itself), it is important to
have the libnuma header and library files available. Different linux
distributions package these files under different names; look for
packages with the word "numa" in them. You may also need a developer
version of the package (e.g., with "dev" or "devel" in the name) to
obtain the relevant header files.

If you are getting this message on a non-OS X, non-Linux platform,
then hwloc does not support processor / memory affinity on this
platform. If the OS/platform does actually support processor / memory
affinity, then you should contact the hwloc maintainers:
https://github.com/open-mpi/hwloc
```
It may be that some of the anecdotal concerns about threading performance stem from common pitfalls like this. A section in the SST documentation discussing this could be very helpful.
Another suggestion is to consider adding support for MPI_THREAD_MULTIPLE, which requires replacing MPI_Init with `MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);`
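A minimal sketch of what that replacement would look like (this is illustrative, not SST's actual initialization code). Note that `MPI_Init_thread` may grant a lower thread level than requested, so the `provided` value should be checked:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int provided = 0;

    /* Request full thread support instead of calling MPI_Init().
     * With MPI_THREAD_MULTIPLE, multiple threads may make MPI calls
     * concurrently without external serialization. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* The implementation is allowed to return less than requested
     * (e.g., MPI_THREAD_SERIALIZED), so verify before relying on it. */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr,
                "MPI_THREAD_MULTIPLE not available (provided level %d)\n",
                provided);
        MPI_Finalize();
        return EXIT_FAILURE;
    }

    /* ... threaded simulation work goes here ... */

    MPI_Finalize();
    return EXIT_SUCCESS;
}
```

Compiling and running this sketch requires an MPI installation (e.g., `mpicc` and `mpirun`), so it is shown here without a standalone test harness.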