ADACS GPU Acceleration of LALSuite

As part of the 2018A semester of the ADACS Software Support Program, ADACS developer Dr. Greg Poole developed a GPU accelerated version of the LALSuite XLALSimIMRPhenomPFrequencySequence routine, which generates gravitational wave models of binary mergers of compact objects. Details and results of this work are presented here. Subsequent sections will describe how to download, install and use the ADACS GPU-accelerated version of LALSuite as well as present a Python package developed to aid testing and to demonstrate the use of the LALSuite SWIG wrappers which permit users to interface these routines from Python.

LALSuite Development Details

Users need to be aware of the following minor changes to LALSuite:

  1. A very slight change to the compilation procedure.
  2. An additional parameter has been added to SimIMRPhenomPFrequencySequence() to accommodate a new one-time-allocation buffer.
  3. Two function calls have been added for allocating and deallocating this buffer.

Changes to compilation

To compile the code with GPU acceleration, an NVIDIA GPU must be available with all relevant software installed, and the --enable-cuda switch must be added at the configuration step of the compilation. See the Installation section for more details.
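As a sketch, the configure step looks like the following. This is illustrative only: any additional flags, prefixes, or dependency paths will depend on your local setup, and the Installation section remains the authoritative reference.

```shell
# Illustrative only: enable CUDA support at configure time.
# Other flags (install prefix, dependency paths) depend on your system.
./configure --enable-cuda
make
make install
```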

Changes to the LALSimulation API

An additional parameter has been added at the end of the parameter list of lalsimulation.SimIMRPhenomPFrequencySequence. It permits the passing of a one-time-allocated memory buffer (returned from a call to lalsimulation.PhenomPCore_buffer()), greatly increasing the speed of repeated calls (during the generation of MCMC chains, for example). Pass a NULL pointer (None in Python) to ignore this functionality.

The LALSuite calls for allocating and deallocating this buffer are as follows:

  • buf=lalsimulation.PhenomPCore_buffer(n_freq_max, n_streams) and
  • lalsimulation.free_PhenomPCore_buffer(buf), respectively.
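The allocate/use/free lifecycle above can be sketched as follows. This is a minimal, self-contained illustration rather than the exact LALSuite call: the alloc, free, and simulate callables stand in for lalsimulation.PhenomPCore_buffer, lalsimulation.free_PhenomPCore_buffer, and lalsimulation.SimIMRPhenomPFrequencySequence respectively, and the parameter sets are placeholders for real waveform parameters.

```python
# Sketch of the buffer lifecycle: allocate once, reuse across many calls,
# free when done. alloc/free/simulate are placeholders for
# lalsimulation.PhenomPCore_buffer, lalsimulation.free_PhenomPCore_buffer
# and lalsimulation.SimIMRPhenomPFrequencySequence respectively.
def run_with_buffer(alloc, free, simulate, params_list, n_freq_max, n_streams=0):
    """Allocate a reusable buffer, run `simulate` once per parameter set,
    then free the buffer (even if a call raises)."""
    buf = alloc(n_freq_max, n_streams)  # one-time allocation
    try:
        return [simulate(params, buf) for params in params_list]
    finally:
        free(buf)  # always release the buffer when no longer needed


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs without LALSuite installed.
    freed = []
    results = run_with_buffer(
        alloc=lambda n, s: {"n_freq_max": n, "n_streams": s},
        free=lambda b: freed.append("freed"),
        simulate=lambda p, b: p * 2,
        params_list=[1, 2, 3],
        n_freq_max=1024,
        n_streams=0,  # asynchronous streams disabled, as recommended below
    )
    print(results, freed)  # -> [2, 4, 6] ['freed']
```

With the real library, the same pattern amounts to calling buf = lalsimulation.PhenomPCore_buffer(n_freq_max, n_streams) once, passing buf as the final argument of each lalsimulation.SimIMRPhenomPFrequencySequence call, and calling lalsimulation.free_PhenomPCore_buffer(buf) at the end.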

Some notes on the LALSimulation buffer

The memory buffer takes two parameters as input: n_freq_max and n_streams. The first should be set to the maximum number of frequencies that will be generated by any call to lalsimulation.SimIMRPhenomPFrequencySequence using the buffer; the second is the number of asynchronous streams to be used.

The ADACS implementation of GPU acceleration for SimIMRPhenomPFrequencySequence() supports asynchronous execution: multiple independent streams can run concurrently, overlapping uploads of input arrays to the card with downloads of results from it. However, because time was not available to alter the LALSuite memory allocators to permit the allocation of pinned host memory, the asynchronous path is presently suboptimal; it is recommended that asynchronous functionality be disabled by setting n_streams to a value less than or equal to 0. See Figures 1 and 2 below for a comparison.

Note

When the buffer returned by buf=lalsimulation.PhenomPCore_buffer() is no longer needed, be sure to call lalsimulation.free_PhenomPCore_buffer(buf) to free the memory allocated for it.

Note

See the lal_cuda.SimIMRPhenomP module and the lal_cuda executables for concrete practical examples of how to use the ADACS branch of LALSuite.

The lal_cuda Python package

A Python package (called lal_cuda) has been developed for running regression tests on the LALSuite routines discussed above and for illustrating their use from Python. See the lal_cuda Python API and lal_cuda Executables sections for an account of its API and of the executable scripts it provides.

Performance gains

The performance gains obtained from this work are presented in the two figures below. In short: speed-up factors as high as approximately 8.5 have been obtained, although this result depends strongly on the number of frequencies being simulated.

[Image: _images/timings.deltas.png]

Figure 1: Time-per-call (in milliseconds) of lalsimulation.SimIMRPhenomPFrequencySequence, as a function of the number of frequencies simulated. Baseline is the original LALSuite version from which the project's development was branched (SHA: 8cbd1b7187); No Cuda is the version developed for this project with CUDA disabled. The other cases are for various numbers of asynchronous streams (N_s; N_s=0 indicates no asynchronous use); "with buffer" cases indicate that a reusable memory buffer was used.

[Image: _images/timings.speedup.png]

Figure 2: Speed-up factors of calls to lalsimulation.SimIMRPhenomPFrequencySequence relative to the baseline case, as a function of the number of frequencies simulated. Baseline is the original LALSuite version from which the project's development was branched (SHA: 8cbd1b7187); No Cuda is the version developed for this project with CUDA disabled. The other cases are for various numbers of asynchronous streams (N_s; N_s=0 indicates no asynchronous use); "with buffer" cases indicate that a reusable memory buffer was used.