mpirun thread binding behavior differs across platforms #1565

@kpgriesser

Description

This may end up being resolved through documentation, but we spent enough time on it to warrant capturing it in an issue.

While running parameter sweeps to compare strong-scaling performance between "all-rank" and "all-thread" simulations, we observed anomalous behavior on our Linux systems for the threaded simulations. Two fixes resolved it:

  1. In the Slurm batch script (submitted via sbatch): export OMPI_MCA_hwloc_base_binding_policy=socket
  2. On the mpirun command line: --bind-to socket

With these changes, the strong scaling tracked extremely well for both ranks and threads.
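As a sketch, the Slurm-side fix might look like the following batch script. The job name, node counts, task/thread counts, and executable name are placeholders for illustration, not values from this report; only the binding settings themselves come from the fixes above.

```shell
#!/bin/bash
#SBATCH --job-name=thread-sweep    # placeholder job name
#SBATCH --nodes=2                  # placeholder node count
#SBATCH --ntasks-per-node=2        # e.g., one MPI rank per socket
#SBATCH --cpus-per-task=16         # placeholder threads per rank

# Fix 1: tell Open MPI's hwloc layer to bind each rank to a socket,
# so the rank's threads stay on that socket's cores.
export OMPI_MCA_hwloc_base_binding_policy=socket

# Fix 2 is the equivalent command-line form: --bind-to socket
mpirun --bind-to socket ./my_simulation   # placeholder executable
```

Either form should suffice on its own; the environment variable is convenient when you cannot easily edit the mpirun invocation.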

Use of the command-line option needs to be qualified, as it is not supported on macOS. When invoked on a Mac it produces an extremely helpful message shedding more light on the issue:

PRRTE uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.

On OS X, processor and memory binding is not available at all (i.e.,
the OS does not expose this functionality).

On Linux, lack of the functionality can mean that you are on a
platform where processor and memory affinity is not supported in Linux
itself, or that hwloc was built without NUMA and/or processor affinity
support. When building hwloc (which, depending on your PRRTE
installation, may be embedded in PRRTE itself), it is important to
have the libnuma header and library files available. Different linux
distributions package these files under different names; look for
packages with the word "numa" in them. You may also need a developer
version of the package (e.g., with "dev" or "devel" in the name) to
obtain the relevant header files.

If you are getting this message on a non-OS X, non-Linux platform,
then hwloc does not support processor / memory affinity on this
platform. If the OS/platform does actually support processor / memory
affinity, then you should contact the hwloc maintainers:
https://github.com/open-mpi/hwloc

It may be that some of the anecdotal concerns about threading performance stem from common pitfalls like this one. A section in the SST documentation discussing this could be very helpful.

Another suggestion is to consider adding support for MPI_THREAD_MULTIPLE, which requires replacing MPI_Init with MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided) and checking the returned provided level.
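A minimal sketch of that replacement is below. The error handling and abort-on-failure policy are illustrative choices, not something prescribed in this issue; MPI only guarantees that the provided level is reported, so the caller must decide how to react when it falls short of MPI_THREAD_MULTIPLE.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int provided;

    /* Request full multi-threaded support instead of calling MPI_Init(). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* The library may grant less than was requested; check 'provided'. */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr,
                "MPI_THREAD_MULTIPLE not available (provided level %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... threaded simulation work goes here ... */

    MPI_Finalize();
    return 0;
}
```

Note that requesting MPI_THREAD_MULTIPLE can carry a performance cost in some MPI implementations, so it is worth requesting only the level the code actually needs.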
