SCA: A Library to Accelerate Stencil Codes on Vector Engine
Nov 1, 2020
Arihiro YOSHIDA (Numerical Library Engineer), Ryusei OGATA (Numerical Library Manager)
AI Platform Division, NEC Corporation
Stencil codes are a computationally expensive pattern that appears in a wide variety of domains.
To accelerate execution of stencil codes, we provide a library named Stencil Code Accelerator, SCA.
SCA realizes this acceleration by utilizing the computation power of Vector Engine.
In this article, first we explain "stencil code", and then we introduce SCA with its performance.
1. Stencil Code Overview
Before we introduce the SCA library, we need to explain what a "stencil code" is.
It is a computing pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, and so on. It updates each element of a multidimensional array by referring to its neighboring elements. Because every update reads several neighbors, it demands high performance in both computation and memory access.
This is an example of a stencil code. On the left is Fortran source code, in which the lines of the stencil code are highlighted. It implements the equation on the right (colored red), which is derived by discretizing the Laplace equation with the finite-difference method. The value of b(i,j) is computed from the values of the four neighboring grid points a(i,j-1), a(i-1,j), a(i+1,j), and a(i,j+1).
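As a rough illustration of the pattern described above, here is a minimal C sketch of one sweep of such a four-neighbor update. It is not the article's Fortran listing. It assumes uniform grid spacing, so that the discretized Laplace equation reduces to averaging the four neighbors (the 0.25 factor), and it assumes a row-major layout with boundary points copied unchanged. The names `laplace_sweep` and `idx` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index into an nx-by-ny grid (assumed layout, not from the article). */
static size_t idx(size_t i, size_t j, size_t ny) { return i * ny + j; }

/* One sweep of a four-neighbor stencil: every interior b(i,j) becomes the
 * average of a(i,j-1), a(i-1,j), a(i+1,j) and a(i,j+1).
 * Boundary points are copied from a to b unchanged. */
void laplace_sweep(const double *a, double *b, size_t nx, size_t ny)
{
    assert(nx >= 3 && ny >= 3);
    for (size_t i = 0; i < nx; ++i) {
        for (size_t j = 0; j < ny; ++j) {
            if (i == 0 || j == 0 || i == nx - 1 || j == ny - 1) {
                b[idx(i, j, ny)] = a[idx(i, j, ny)];
            } else {
                b[idx(i, j, ny)] = 0.25 * (a[idx(i, j - 1, ny)]
                                         + a[idx(i - 1, j, ny)]
                                         + a[idx(i + 1, j, ny)]
                                         + a[idx(i, j + 1, ny)]);
            }
        }
    }
}
```

Each output point costs four loads, three additions and one multiplication, which is why this kind of loop stresses memory access as much as arithmetic.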
This shows the domains where stencil codes appear. As you can see in this list, stencil codes appear in a wide variety of domains.
2. Stencil Code Accelerator Library
To reduce the execution time of programs that contain stencil codes, we provide a library named Stencil Code Accelerator, SCA. It substantially accelerates the execution of stencil codes by utilizing the computation power of the Vector Engine. It supports stencils of any shape in up to four dimensions. You can use it in C, C++, and Fortran programs.
To use SCA in your program, you need to follow three steps: initialization, computation, and finalization. In the initialization step, a "stencil code" needs to be created using a "stencil description". Here, a "stencil description" is the information about the stencil element attributes and an output array, and a "stencil code" is an executable binary generated in memory.
In the computation step, the "stencil code" is executed. Usually, it is done repeatedly until the time step reaches the end or convergence conditions are met.
In the finalization step, the "stencil code" needs to be destroyed.
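The three steps above can be sketched in plain C as follows. This is only a toy model of the lifecycle and does not use the SCA API: every name here (`toy_stencil_desc`, `toy_code`, `toy_code_create`, `toy_code_execute`, `toy_code_destroy`) is hypothetical, and where SCA generates an optimized executable binary in memory, this sketch merely stores a copy of the description and interprets it in a loop.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical "stencil description": the stencil elements (offsets and
 * coefficients) plus the input and output arrays. Not SCA's data types. */
typedef struct {
    size_t nx, ny;           /* grid size, row-major */
    int n_elems;             /* number of stencil elements */
    const int *di, *dj;      /* offset of each element from the center */
    const double *coef;      /* coefficient of each element */
    const double *in;        /* array read by the stencil */
    double *out;             /* array written by the stencil */
} toy_stencil_desc;

/* Hypothetical "stencil code": a prepared object built from a description. */
typedef struct {
    toy_stencil_desc d;
    int halo;                /* widest offset, i.e. untouched border width */
} toy_code;

/* Initialization step: build the "code" from the description.
 * The arrays the description points to must outlive the returned object. */
toy_code *toy_code_create(const toy_stencil_desc *d)
{
    toy_code *c = malloc(sizeof *c);
    if (c == NULL) return NULL;
    c->d = *d;
    c->halo = 0;
    for (int e = 0; e < d->n_elems; ++e) {
        int r = abs(d->di[e]) > abs(d->dj[e]) ? abs(d->di[e]) : abs(d->dj[e]);
        if (r > c->halo) c->halo = r;
    }
    return c;
}

/* Computation step: apply the stencil once to every interior point. */
void toy_code_execute(const toy_code *c)
{
    const toy_stencil_desc *d = &c->d;
    size_t h = (size_t)c->halo;
    for (size_t i = h; i + h < d->nx; ++i) {
        for (size_t j = h; j + h < d->ny; ++j) {
            double s = 0.0;
            for (int e = 0; e < d->n_elems; ++e) {
                size_t ii = (size_t)((long)i + d->di[e]);
                size_t jj = (size_t)((long)j + d->dj[e]);
                s += d->coef[e] * d->in[ii * d->ny + jj];
            }
            d->out[i * d->ny + j] = s;
        }
    }
}

/* Finalization step: release the "code". */
void toy_code_destroy(toy_code *c) { free(c); }
```

A caller would fill in a `toy_stencil_desc`, call `toy_code_create` once, call `toy_code_execute` inside the time-step or convergence loop, and call `toy_code_destroy` at the end. Separating creation from execution is what lets the expensive preparation be paid once and reused on every iteration.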
This is an example of a stencil code using SCA. On the left is Fortran source code, in which the SCA routine calls are highlighted. The first highlighted part corresponds to the initialization step: a highly optimized executable binary of the stencil code is generated in memory. The second highlighted part corresponds to the computation step, where the binary of the stencil code is executed. The last highlighted part corresponds to the finalization step.
3. Performance Benchmarking
We investigated stencil code optimizing software for other platforms and found that most of it consists of frameworks with domain-specific languages, not libraries. For benchmarking, we selected two frameworks. The first is YASK, a C++ framework that uses a C++-like domain-specific language. It is developed by Intel and targets x86 processors, including Xeon Phi. The second is Physis, a C/C++/CUDA framework that uses a C-like domain-specific language. It mainly targets NVIDIA GPUs.
We ran the benchmarks under these conditions. We chose the stencil shapes that are most commonly used in scientific simulations; in seismic imaging in particular, large stencils are often used. The computing domain size is 1024x1024x512.
This chart shows the benchmarking results. The pink and red bars are a naïve implementation and SCA, respectively, executed on Vector Engine Type 20B. The light green and green bars are a naïve implementation and Physis, respectively, executed on Tesla V100. The light blue and blue bars are a naïve implementation and YASK, respectively, executed on Xeon Skylake with 40 cores. As the red bars show, SCA delivers the highest performance. In particular, it reaches 2.8 TFLOPS for the stencil shape "6x6y6za".
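To relate a TFLOPS figure like the one above to the domain size, the following C sketch shows one common way such a rate is computed: grid points per sweep, times floating-point operations per point, divided by the time per sweep. The article does not state the operation count per point for each stencil shape, the timings, or the counting convention actually used, so `flops_per_point` and `seconds` are placeholders and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of grid points updated in one sweep over an nx*ny*nz domain. */
uint64_t grid_points(uint64_t nx, uint64_t ny, uint64_t nz)
{
    return nx * ny * nz;
}

/* Sustained rate in TFLOPS for one sweep.
 * flops_per_point depends on the stencil shape and is an assumed input;
 * seconds is the measured wall-clock time of the sweep. */
double stencil_tflops(uint64_t points, double flops_per_point, double seconds)
{
    assert(seconds > 0.0);
    return (double)points * flops_per_point / seconds / 1e12;
}
```

For the 1024x1024x512 domain used here, `grid_points` returns 536,870,912 points per sweep.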
Stencil codes appear in a wide variety of domains and incur large computing costs. To reduce the execution time of programs that contain stencil codes, we provide SCA, a library that substantially accelerates the execution of stencil codes on the Vector Engine. SCA achieves up to 2.8 TFLOPS for stencil shapes commonly used in scientific simulations. In our benchmarks, this performance is superior to that of stencil code optimizing software for scalar processors and GPGPUs.
The online manuals of SCA are available at the following URLs:
- Intel and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
- NVIDIA and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries.
- Linux is a trademark or a registered trademark of Linus Torvalds in the U.S. and other countries.
- Proper nouns such as product names are registered trademarks or trademarks of individual manufacturers.