gen_cpumapping - determine optimal CPU mapping on this machine for MPI task placement
gen_cpumapping {--help|-?|--man}
gen_cpumapping [ --scenario=scatter|bunch|linear ] [ --(no)zero ] [ --(no)physical ] [ --delimiter=<char> ] [ --verbose ]
On modern multi-core architectures with NUMA characteristics or shared caches, the correct placement of tasks or threads of parallel programs can have a significant impact on performance.
Modern MPI libraries allow the user to specify task-to-CPU mapping via environment variables, and are able to bind tasks to the specified CPUs.
However, setting these environment variables requires knowledge of the architecture details (cache levels, CPU and NUMA node numbering, whether simultaneous multithreading is supported, ...).
The Portable Hardware Locality (hwloc) software package helps to retrieve this information. See http://www.open-mpi.org/projects/hwloc for additional details.
The gen_cpumapping script uses hwloc to figure out the architecture topology of the machine where it is started. Based on that, it sorts the IDs of the available processor elements according to different approaches, and prints the result to STDOUT.
The output can be fed into environment variables that determine the placement scheme of MPI tasks or OpenMP threads before a parallel executable is started.
There is support for mapping out additional logical SMT CPUs, and for CPU numberings that are relative within a cpuset.
The sorting algorithms used by gen_cpumapping were also implemented in recent versions of MVAPICH2, see http://mvapich.cse.ohio-state.edu/overview/mvapich2/.
--scenario=scatter|bunch|linear

Selects the sorting algorithm. The default is scatter.
scatter

Sorts CPU IDs by a maximum-distance approach. Additional logical SMT CPUs are sorted after the first logical CPUs.
This scenario ensures, to the maximum possible extent, that tasks do not share caches or memory bandwidth to local memory. It is most useful for typical MPI applications, when fewer tasks are started than there are physical CPUs on the machine.
bunch

Sorts CPU IDs by a minimum-distance approach. Additional logical SMT CPUs are sorted after the first logical CPUs.
This scenario ensures that tasks or threads are mapped to CPUs as close to each other as possible. It is most useful for threaded applications that share data in a global address space. This scenario is often the default in common MPI libraries that support CPU affinity.
linear

Prints CPU IDs in topology ordering. This is the ordering also seen in, e.g., lstopo output.
--(no)zero

With --nozero the machine-global CPU IDs are printed, as determined by the BIOS. This is the default.
With --zero the printed CPU IDs always start at 0. This is helpful when the script is started under control of a cpuset that restricts it to a subset of CPUs of a large SMP machine, and when, in addition, the application that uses its output relies on relative CPU numbers. This is, e.g., the case for the Message Passing Toolkit of Silicon Graphics, Inc.
--(no)physical

With --nophysical all (logical) CPUs that are available to the script are contained in the output. This is the default.
With --physical additional logical CPUs are mapped out. When the current machine does not support multiple hardware threads per core, there is no difference between --physical and --nophysical.
--delimiter=<char>

Specifies a different list delimiter. The default is ' ' (space).
--verbose

When specified, verbose information is printed to STDERR.
--man

Shows the man page that you are currently reading.
--help

Shows a usage message and exits.
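The effect of the --zero option can be illustrated with a small sketch. This is a Python model for illustration only (gen_cpumapping itself is a Perl script that derives the numbering via hwloc): each machine-global CPU ID is replaced by its rank within the sorted cpuset.

```python
# Sketch: --zero renumbers machine-global CPU IDs relative to the
# current cpuset. Illustration only, not gen_cpumapping's code.

def zero_based(order):
    """Map each global CPU ID in `order` to its rank within the
    sorted set of available CPUs."""
    rank = {cpu: i for i, cpu in enumerate(sorted(order))}
    return [rank[cpu] for cpu in order]

# A cpuset restricted to global CPUs 4-7, printed in the order 4 6 5 7:
print(zero_based([4, 6, 5, 7]))  # [0, 2, 1, 3]
```

An application that expects relative CPU numbers inside the cpuset can then consume the zero-based list directly.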
The command line
gen_cpumapping -v -s linear -p
shows the CPU IDs in topology ordering without additional SMT CPUs.
Assuming that an MPI executable a.out is linked with an MPI library that supports CPU affinity handling and implements the environment variable MPI_CPU_MAPPING to specify the task-to-CPU mapping as a comma-separated list, one may set the content of this variable as follows (Bourne shell and related):
export MPI_CPU_MAPPING=$(gen_cpumapping -s scatter -d ,)
echo $MPI_CPU_MAPPING
mpiexec a.out
Assume a system that contains two Intel Xeon Harpertown sockets with four cores each. The system CPU numbering and topology shortcut is
(((0,2),(4,6)),((1,3),(5,7)))
That means: even CPU IDs on first socket, odd CPU IDs on second socket ("legacy numbering").
This processor is characterized by shared L2 caches: core pairs 0+2, 4+6, 1+3, and 5+7 each share a common L2 cache.
The command gen_cpumapping with default option settings will print:
0 1 4 5 2 3 6 7
An MPI application that is subject to this CPU mapping will run as follows:
1st task on 1st socket 1st core
2nd task on 2nd socket 1st core
3rd task on 1st socket 3rd core
4th task on 2nd socket 3rd core
5th task on 1st socket 2nd core
6th task on 2nd socket 2nd core
7th task on 1st socket 4th core
8th task on 2nd socket 4th core
Thus, if the application starts 4 tasks or fewer, it is ensured that no task shares an L2 cache with another task. If the application starts 8 tasks, there is no difference from other pinning schemes.
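The scatter ordering in this example can be modeled as a round-robin interleave over the topology tree: descend the tree, and at each level take one element from each child in turn. This is a Python sketch for illustration only; gen_cpumapping derives the actual ordering from hwloc topology data, and its implementation may differ.

```python
# Sketch: derive a scatter (maximum-distance) ordering by round-robin
# interleaving of a topology tree given as nested tuples.
# Illustration only -- not gen_cpumapping's actual code.

def interleave(lists):
    """Round-robin merge: take element i of every list in turn."""
    out = []
    for i in range(max(len(lst) for lst in lists)):
        for lst in lists:
            if i < len(lst):
                out.append(lst[i])
    return out

def scatter(node):
    """`node` is either a CPU ID (int) or a tuple of child nodes."""
    if isinstance(node, int):
        return [node]
    return interleave([scatter(child) for child in node])

# Harpertown example: two sockets, two shared-L2 core pairs each.
harpertown = (((0, 2), (4, 6)), ((1, 3), (5, 7)))
print(scatter(harpertown))  # [0, 1, 4, 5, 2, 3, 6, 7]
```

The printed list matches the documented default output above; a depth-first flatten of the same tree would instead give the bunch-style minimum-distance ordering.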
Assume a system that contains two Intel Xeon Nehalem-EP sockets with four cores each. SMT is switched on via BIOS settings, so there are two hardware threads per physical core. The system CPU numbering and topology shortcut is
(((0,8),(1,9),(2,10),(3,11)),((4,12),(5,13),(6,14),(7,15)))
That means: CPUs are numbered in sequence, CPUs 0-3 on 1st socket, CPUs 4-7 on 2nd socket ("common numbering"). SMT CPUs have numbers 8-15. The logical CPUs 0+8, 1+9, ... run on the same physical core.
This architecture is characterized by connecting the two sockets via the Intel QPI, which results in one NUMA node per socket. All cores of a socket share a common L3 cache, but have an L2 cache of their own.
The command gen_cpumapping with default option settings will print:
0 4 1 5 2 6 3 7 8 12 9 13 10 14 11 15
An MPI application that is subject to this CPU mapping will run as follows:
1st task on 1st socket 1st core 1st hardware thread
2nd task on 2nd socket 1st core 1st hardware thread
3rd task on 1st socket 2nd core 1st hardware thread
4th task on 2nd socket 2nd core 1st hardware thread
5th task on 1st socket 3rd core 1st hardware thread
6th task on 2nd socket 3rd core 1st hardware thread
7th task on 1st socket 4th core 1st hardware thread
8th task on 2nd socket 4th core 1st hardware thread
9th task on 1st socket 1st core 2nd hardware thread
...
Thus, if the application starts only 2 tasks, it is ensured that they do not share a common NUMA node, thus getting the maximum memory bandwidth. The effect is retained when starting up to 6 tasks. If the application starts up to 8 tasks, it is ensured that each task runs on its own physical core.
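On this machine, the effect of --physical can be sketched as filtering the mapping so that only the first hardware thread of each physical core survives. The following Python model is for illustration only; gen_cpumapping determines core membership via hwloc rather than by arithmetic on CPU IDs.

```python
# Sketch: map out additional SMT CPUs (the effect of --physical) by
# keeping only the first CPU seen per physical core.
# Illustration only -- not gen_cpumapping's actual code.

def map_out_smt(order, core_of):
    """Keep only the first CPU in `order` for each physical core,
    as given by the CPU-to-core map `core_of`."""
    seen, out = set(), []
    for cpu in order:
        if core_of[cpu] not in seen:
            seen.add(core_of[cpu])
            out.append(cpu)
    return out

# Nehalem-EP example: logical CPUs n and n+8 share physical core n.
scatter_order = [0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15]
core_of = {cpu: cpu % 8 for cpu in range(16)}
print(map_out_smt(scatter_order, core_of))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

The result keeps exactly one hardware thread per physical core, in the same scatter order as before.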
Bernd Kallies, <kallies@zib.de>
Copyright (C) 2011 Zuse Institute Berlin
This library is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.