NAME

gen_cpumapping - determine optimal CPU mapping on this machine for MPI task placement

SYNOPSIS

gen_cpumapping {--help|-?|--man}

gen_cpumapping [ --scenario=scatter|bunch|linear ] [ --(no)zero ] [ --(no)physical ] [ --delimiter=<char> ] [ --verbose ]

DESCRIPTION

On modern multi-core architectures with NUMA characteristics or shared caches, the placement of tasks or threads of a parallel program can have a significant impact on performance.

Modern MPI libraries allow the user to specify the task-to-CPU mapping via environment variables and are able to bind tasks to the specified CPUs.

However, setting these environment variables requires detailed knowledge of the architecture in use (cache levels, CPU and NUMA node numbering, whether simultaneous multithreading is supported, ...).

The Portable Hardware Locality (hwloc) software package helps to retrieve this information. See http://www.open-mpi.org/projects/hwloc for additional details.

The gen_cpumapping script uses hwloc to figure out the architecture topology of the machine where it is started. Based on that, it sorts the IDs of the available processor elements according to different approaches, and prints the result to STDOUT.

The output can be fed into environment variables that determine the placement scheme of MPI tasks or OpenMP threads before a parallel executable is started.

There is support for mapping out logical SMT CPUs, and for CPU numberings that are relative within a cpuset.

The sorting algorithms used by gen_cpumapping are also implemented in recent versions of MVAPICH2, see http://mvapich.cse.ohio-state.edu/overview/mvapich2/.

OPTIONS

--scenario=scatter|bunch|linear

Selects the sorting algorithm. The default is scatter.

scatter

Sorts CPU IDs by a maximum distance approach. Additional logical SMT CPUs are sorted after the first logical CPUs.

This scenario ensures, to the maximum possible extent, that tasks neither share caches nor compete for bandwidth to local memory. It is most useful for typical MPI applications, when fewer tasks are started than there are physical CPUs on the machine.

bunch

Sorts CPU IDs by a minimum distance approach. Additional logical SMT CPUs are sorted after the first logical CPUs.

This scenario ensures that tasks or threads are mapped to CPUs as closely as possible. It is most useful for threaded applications that share data in a global address space. This scenario is often the default in MPI libraries that support CPU affinity.

linear

Prints CPU IDs in topology ordering. This is the ordering also seen in, e.g., lstopo output.
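The three orderings can be pictured with a short Python sketch. This is only an illustration of the sorting idea, not the script's actual hwloc-based implementation: it models the machine as a tree of nested tuples and derives each ordering from the tree paths of the leaves. The sample topology is the Harpertown system described under EXAMPLE SCATTER OUTPUTS.

```python
# Illustration only: derive the scatter/bunch/linear orderings from a
# topology tree given as nested tuples whose leaves are CPU IDs.
# (The real script queries hwloc; this sketch also ignores the special
# handling of additional SMT CPUs described above.)

def leaf_paths(node, path=()):
    """Yield (tree_path, cpu_id) for every leaf of the topology tree."""
    if isinstance(node, tuple):
        for i, child in enumerate(node):
            yield from leaf_paths(child, path + (i,))
    else:
        yield path, node

def linear(topology):
    # Topology (depth-first) ordering, as seen in lstopo output.
    return [cpu for _, cpu in leaf_paths(topology)]

def bunch(topology):
    # Minimum distance: fill neighboring CPUs first, i.e. sort by path.
    return [cpu for _, cpu in sorted(leaf_paths(topology))]

def scatter(topology):
    # Maximum distance: vary the outermost level (sockets) fastest,
    # i.e. sort the leaves by their reversed path.
    return [cpu for _, cpu in
            sorted(leaf_paths(topology), key=lambda pc: pc[0][::-1])]

# Two Harpertown sockets, each with two core pairs sharing an L2 cache:
harpertown = (((0, 2), (4, 6)), ((1, 3), (5, 7)))
print(scatter(harpertown))  # [0, 1, 4, 5, 2, 3, 6, 7]
print(bunch(harpertown))    # [0, 2, 4, 6, 1, 3, 5, 7]
```

On this topology bunch and linear coincide, since the nested tuples are already written in minimum-distance order.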

--(no)zero

With --nozero the machine-global CPU IDs are printed, as determined by the BIOS. This is the default.

With --zero the printed CPU IDs always start at 0. This is helpful when the script is started under control of a cpuset that restricts it to a subset of CPUs of a large SMP machine, and the application that uses its output relies on relative CPU numbers. This is the case, e.g., for the Message Passing Toolkit of Silicon Graphics, Inc.
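The relative renumbering can be sketched as follows. This is an assumption about the remapping, not the script's code: each global CPU ID is replaced by its rank within the sorted set of CPUs visible inside the cpuset. The CPU IDs in the example are hypothetical.

```python
def to_relative(cpu_ids):
    # Assumed --zero remapping: a global CPU ID becomes its rank
    # within the sorted set of CPUs visible inside the cpuset.
    rank = {cpu: i for i, cpu in enumerate(sorted(cpu_ids))}
    return [rank[cpu] for cpu in cpu_ids]

# A cpuset restricted to CPUs 8, 9, 12, 13: a hypothetical scatter
# output "8 12 9 13" becomes "0 2 1 3" with --zero.
print(to_relative([8, 12, 9, 13]))  # [0, 2, 1, 3]
```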

--(no)physical

With --nophysical all (logical) CPUs that are available to the script are contained in the output. This is the default.

With --physical additional logical SMT CPUs are mapped out. On machines that do not support multiple hardware threads per core there is no difference between --physical and --nophysical.

--delimiter=<char>

Specifies a different list delimiter. The default is ' ' (space).

--verbose

When specified, verbose information is printed to STDERR.

--man

Shows the man-page that you are currently reading.

--help|-?

Shows a usage message and exits.

EXAMPLES

1. Print CPU IDs

The command line

  gen_cpumapping -v -s linear -p

shows the CPU IDs in system ordering without additional SMT CPUs.

2. Combine it with MPI

Assuming that an MPI executable a.out is linked against an MPI library that supports CPU affinity handling and implements the environment variable MPI_CPU_MAPPING to specify the task-to-CPU mapping as a comma-separated list, one may set this variable as follows (Bourne shell and related):

  export MPI_CPU_MAPPING=$(gen_cpumapping -s scatter -d ,)
  echo $MPI_CPU_MAPPING
  mpiexec a.out

EXAMPLE SCATTER OUTPUTS

a) Intel Xeon Harpertown

Assume a system that contains two Intel Xeon Harpertown sockets with four cores each. The system CPU numbering and topology shortcut is

  (((0,2),(4,6)),((1,3),(5,7)))

That means: even CPU IDs on first socket, odd CPU IDs on second socket ("legacy numbering").

This processor is characterized by shared L2 caches. Core pairs 0+2, 4+6, 1+3, 5+7 share a common L2 cache.

The command gen_cpumapping with default option settings will print:

  0 1 4 5 2 3 6 7

An MPI application subject to this CPU mapping will place its tasks as follows:

  1st task on 1st socket 1st core
  2nd task on 2nd socket 1st core
  3rd task on 1st socket 3rd core
  4th task on 2nd socket 3rd core
  5th task on 1st socket 2nd core
  6th task on 2nd socket 2nd core
  7th task on 1st socket 4th core
  8th task on 2nd socket 4th core

Thus, if the application starts 4 tasks or fewer, it is ensured that no task shares an L2 cache with another task. If the application starts 8 tasks, there is no difference from other pinning schemes.

b) Intel Xeon Nehalem-EP

Assume a system that contains two Intel Xeon Nehalem-EP sockets with four cores each. SMT is enabled in the BIOS settings, so there are two hardware threads per physical core. The system CPU numbering and topology shortcut is

  (((0,8),(1,9),(2,10),(3,11)),((4,12),(5,13),(6,14),(7,15)))

That means: CPUs are numbered in sequence, CPUs 0-3 on 1st socket, CPUs 4-7 on 2nd socket ("common numbering"). SMT CPUs have numbers 8-15. The logical CPUs 0+8, 1+9, ... run on the same physical core.

In this architecture the two sockets are connected via Intel QPI, which results in one NUMA node per socket. All cores of a socket share a common L3 cache, but each core has an L2 cache of its own.

The command gen_cpumapping with default option settings will print:

  0 4 1 5 2 6 3 7 8 12 9 13 10 14 11 15

An MPI application subject to this CPU mapping will place its tasks as follows:

  1st task on 1st socket 1st core 1st hardware thread
  2nd task on 2nd socket 1st core 1st hardware thread
  3rd task on 1st socket 2nd core 1st hardware thread
  4th task on 2nd socket 2nd core 1st hardware thread
  5th task on 1st socket 3rd core 1st hardware thread
  6th task on 2nd socket 3rd core 1st hardware thread
  7th task on 1st socket 4th core 1st hardware thread
  8th task on 2nd socket 4th core 1st hardware thread
  9th task on 1st socket 1st core 2nd hardware thread
  ...

Thus, if the application starts only 2 tasks, it is ensured that they do not share a NUMA node and therefore get the maximum memory bandwidth. This effect is retained when starting up to 6 tasks. If the application starts up to 8 tasks, it is ensured that each task runs on its own physical core.
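The ordering above, and the effect of --physical on it, can be reproduced with a short Python sketch of the maximum-distance idea. This is an illustration, not the script's hwloc-based code: the leaves of the topology tree are sorted by their reversed (socket, core, thread) path, and the --physical filter is assumed to keep only the first hardware thread of each core.

```python
def leaf_paths(node, path=()):
    """Yield (tree_path, cpu_id) for every leaf of a nested-tuple topology."""
    if isinstance(node, tuple):
        for i, child in enumerate(node):
            yield from leaf_paths(child, path + (i,))
    else:
        yield path, node

def scatter(topology, physical=False):
    # Maximum distance: sort leaves by reversed (socket, core, thread)
    # path, so the thread index varies slowest and the socket fastest.
    leaves = sorted(leaf_paths(topology), key=lambda pc: pc[0][::-1])
    if physical:
        # Assumed --physical behavior: keep only the first hardware
        # thread (thread index 0) of each physical core.
        leaves = [(p, cpu) for p, cpu in leaves if p[-1] == 0]
    return [cpu for _, cpu in leaves]

# The Nehalem-EP system from above: two sockets, four cores each,
# two hardware threads per core ("common numbering"):
nehalem = (((0, 8), (1, 9), (2, 10), (3, 11)),
           ((4, 12), (5, 13), (6, 14), (7, 15)))
print(scatter(nehalem))
# [0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15]
print(scatter(nehalem, physical=True))
# [0, 4, 1, 5, 2, 6, 3, 7]
```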

AUTHOR

Bernd Kallies, <kallies@zib.de>

COPYRIGHT AND LICENSE

Copyright (C) 2011 Zuse Institute Berlin

This library is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.