wgrib2: all about OpenMP
Introduction
Wgrib2 needs to run fast, and the only way to make wgrib2 run fast
is to use multiple cores. Wgrib2 uses "OpenMP" to multithread its
calculations. Some parts of wgrib2 are scalar (jpeg2000,
I/O) and see no speedup from using multiple cores, while other parts
(-new_grid, -ens_processing, complex unpacking) see large speedups.
Much effort has been made to parallelize time-consuming parts
of the code. In order to use OpenMP, the compilers must support
at least OpenMP v3.1. Some additional speedups can be obtained
by compiling with AVX512 or AVX2 enabled and using a version of
OpenMP that supports SIMD.
To check that your wgrib2 executable is OpenMP enabled, examine
the output of wgrib2 -config:
ebis@landing2:~$ wgrib2 -config
wgrib2 v3.1.4beta1 11/2023 Wesley Ebisuzaki, Reinoud Bokhorst, John Howard, Jaakko Hyvätti, Dusan Jovic, Daniel Lee, Kristian Nilssen, Karl Pfeiffer, Pablo Romero, Manfred Schwarb, Gregor Schee, Arlindo da Silva, Niklas Sondell, Sam Trahan, George Trojan, Sergey Varlamov
..
OpenMP: control number of threads with environment variable OMP_NUM_THREADS
..
If you don't have the line "OpenMP: control ..", then your copy of wgrib2 is not OpenMP enabled.
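For scripts, the same check can be automated. The following is a minimal sketch, assuming the marker line reads as in the sample output above; the helper name has_openmp is hypothetical and is not part of wgrib2.

```shell
# Hypothetical helper (not part of wgrib2): reads the text of
# `wgrib2 -config` on standard input and succeeds (exit status 0)
# if the OpenMP marker line is present.
has_openmp() {
    grep -q 'OpenMP: control'
}

# Usage: only runs when wgrib2 is on the PATH.
# 2>&1 is a precaution in case the configuration is written to stderr.
if command -v wgrib2 >/dev/null 2>&1; then
    if wgrib2 -config 2>&1 | has_openmp; then
        echo "wgrib2 is OpenMP enabled"
    else
        echo "wgrib2 is not OpenMP enabled"
    fi
fi
```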
Using OpenMP
You control the number of cores that OpenMP-enabled programs will use with the
environment variable OMP_NUM_THREADS.
bash, sh
$ export OMP_NUM_THREADS=4
csh
$ setenv OMP_NUM_THREADS 4
There is no hard-and-fast rule for the optimum number of
threads to allocate to wgrib2. For example, if I am
running 4 copies of wgrib2 on a 4 core cpu, I would
set OMP_NUM_THREADS to 1. If I am running wgrib2 on
a 128 core CPU with 30 other users, I may set OMP_NUM_THREADS
to 4. Generally, the wgrib2 speedup is minimal beyond
5 cores unless you are using compute-heavy options
that are well parallelized, such as -new_grid and -ens_processing.
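The reasoning in this paragraph can be turned into a rough starting point. The sketch below is one possible heuristic, not a wgrib2 feature: it divides the cores by the number of jobs sharing the machine and caps the result at 5. The function name pick_threads, its arguments, and the default cap are all assumptions made for illustration.

```shell
# Hypothetical heuristic (not part of wgrib2):
#   threads = cores / concurrent_jobs, at least 1, at most cap (default 5)
pick_threads() {
    cores=$1
    njobs=$2
    cap=${3:-5}
    if [ "$njobs" -lt 1 ]; then
        njobs=1
    fi
    n=$((cores / njobs))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    if [ "$n" -gt "$cap" ]; then
        n=$cap
    fi
    echo "$n"
}

# 4 copies of wgrib2 on a 4-core CPU: one thread each
OMP_NUM_THREADS=$(pick_threads 4 4)
export OMP_NUM_THREADS
```

For compute-heavy options such as -new_grid, a larger cap can be passed as the third argument, for example pick_threads 16 2 8.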
Wgrib2 uses 1 core in serial sections of the code and up
to OMP_NUM_THREADS in the parallel sections of the code.
Under normal situations, you want the unused cores
(up to OMP_NUM_THREADS-1) to be made available for other
jobs. You do this by setting OMP_WAIT_POLICY to PASSIVE:
bash, sh
$ export OMP_WAIT_POLICY=PASSIVE
csh
$ setenv OMP_WAIT_POLICY PASSIVE
On an HPC node where your jobs have sole use of the
(physical) CPU, you may want to set the OMP_WAIT_POLICY to ACTIVE.
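If you prefer not to change the environment of your login shell, both variables can be set for a single command. This is a minimal sketch using the standard env utility; the wrapper name run_passive is hypothetical.

```shell
# Hypothetical wrapper (not part of wgrib2): run one command with a given
# thread count and a passive wait policy, leaving the calling shell's
# environment unchanged.
run_passive() {
    threads=$1
    shift
    env OMP_NUM_THREADS="$threads" OMP_WAIT_POLICY=PASSIVE "$@"
}

# Show what the child process sees
run_passive 4 sh -c 'echo "$OMP_NUM_THREADS $OMP_WAIT_POLICY"'
```

In practice the sh -c command would be replaced by your wgrib2 command line.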
-ncpu
The environment variable $OMP_NUM_THREADS can be overridden by the
option -ncpu N.
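As an illustration, the sketch below wraps a simple inventory (-s) so that the thread count is given on the command line. It assumes that -ncpu may be placed before the input file like other wgrib2 options; the helper name inventory_with_threads and the file name in.grb2 are hypothetical.

```shell
# Hypothetical helper: print a simple inventory (-s) of a GRIB2 file,
# requesting N threads with -ncpu instead of OMP_NUM_THREADS.
inventory_with_threads() {
    nthreads=$1
    file=$2
    wgrib2 -ncpu "$nthreads" "$file" -s
}

# Usage (requires wgrib2 and a GRIB2 file):
#   inventory_with_threads 2 in.grb2
```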
SIMD
OpenMP v4.0+ supports SIMD. The current wgrib2 (5/2024) prefers
multi-threading over SIMD because multi-threading can be used
in more cases and SIMD has a limited vector length. There are
a limited number of cases where the outer loop is parallelized
by multi-threading and the inner loop is parallelized by SIMD.
It is hard to generalize about the speed of multi-threading vs
SIMD. Multi-threading could have speed advantages when more
than two memory controllers are present.
To best take advantage of SIMD, wgrib2 should be compiled with
AVX-512 or AVX2 enabled (x86 cpus).
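Before choosing a build target, it can help to see which of these instruction sets the CPU reports. The sketch below is Linux-specific and assumes the usual /proc/cpuinfo flag names avx2 and avx512f (AVX-512 Foundation); the helper name simd_flags is hypothetical.

```shell
# Hypothetical helper: reads /proc/cpuinfo-style text on standard input
# and prints the AVX2 / AVX-512 Foundation flags it finds, one per line.
simd_flags() {
    grep -o -w -E 'avx2|avx512f' | sort -u
}

# Linux only: /proc/cpuinfo does not exist on other systems
if [ -r /proc/cpuinfo ]; then
    simd_flags < /proc/cpuinfo
fi
```

No output means neither flag was found, in which case building with AVX2 or AVX-512 enabled is unlikely to be useful on that machine.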
See also: -ncpu, speed