Quantum Molecular Dynamics with VASP on NEC SX-Aurora Vector Engine
Nov. 1, 2022
Patrick Lipka, NEC Deutschland HPCE
Building on the successful evolution of the SX-series of vector-supercomputers, NEC have combined the flexibility of an HPC cluster with the raw power of the NEC Vector Engine Processor. Whereas the SX-series has previously been only available to large corporations or compute centers, the Vector Engine (VE) is now also available to engineers and HPC enthusiasts all over the world.
NEC have developed the Vector Engine (VE) for accelerated computing through vectorization. The concept is that the full application runs on the high-performance Vector Engine, while operating system tasks are handled by the Vector Host (VH), which is a standard x86 server. In this way the NEC SX-series vector processor is integrated transparently into the Linux software environment, and the Vector Engine can concentrate on providing the best application performance.
With its three design concepts (vectorization, maximum memory bandwidth, and few but strong cores), the new NEC vector architecture provides a strong foundation for high sustained performance.
The Vector Engine Processor V2 integrates eight vector cores and 48 GB of high bandwidth memory (HBM2), providing a peak performance of up to 3.07 TeraFLOPS. Its computational efficiency comes from the outstanding memory bandwidth of up to 1.53 TB/s per processor and from the latency-hiding effect of the vector architecture.
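As a rough illustration of what these two figures imply together, the ratio of memory bandwidth to peak performance (often called machine balance) can be worked out directly. The sketch below uses only the 3.07 TeraFLOPS and 1.53 TB/s quoted above. The roofline-style bound, the function names and the two example intensities are illustrative assumptions, not NEC tooling or measured results.

```python
# Peak figures quoted in the text for one Vector Engine processor.
PEAK_FLOPS = 3.07e12      # floating-point operations per second
MEM_BANDWIDTH = 1.53e12   # bytes per second


def machine_balance(bandwidth, peak):
    """Bytes of memory traffic available per floating-point operation at peak."""
    return bandwidth / peak


def attainable_flops(intensity, bandwidth=MEM_BANDWIDTH, peak=PEAK_FLOPS):
    """Roofline-style upper bound for a kernel doing `intensity` FLOP per byte moved."""
    return min(peak, intensity * bandwidth)


balance = machine_balance(MEM_BANDWIDTH, PEAK_FLOPS)
print(f"machine balance: {balance:.2f} bytes per FLOP")

# Hypothetical kernels with low and high arithmetic intensity (FLOP per byte).
for intensity in (0.25, 4.0):
    bound = attainable_flops(intensity) / 1e12
    print(f"intensity {intensity:>4}: at most {bound:.2f} TFLOPS")
```

Under this simple model, a kernel needs about two floating-point operations per byte moved before it becomes compute-bound; below that, memory bandwidth sets the ceiling.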
The new NEC SX-Aurora TSUBASA platform provides the full spectrum of hardware, from a workstation for the engineering office up to a state-of-the-art computing platform for renowned research institutes. This gives software developers a strong platform that scales up to large systems, and also enables research institutes to provide smaller hardware installations where research groups can test and evaluate their own code without having to use time on a large production installation.
What is VASP?
The Vienna Ab initio Simulation Package (VASP) is a computer program for modelling materials at the atomic level, e.g. for calculations of electronic structure and quantum mechanical molecular dynamics, starting from first principles.
VASP computes an approximate solution for the many-body Schrödinger equation, either in the framework of density functional theory (DFT), solving the Kohn-Sham equations, or in the framework of the Hartree-Fock (HF) approximation, solving the Roothaan equations. Hybrid functionals that mix the Hartree-Fock approach with density functional theory are also implemented. In addition, Green's function methods (GW quasiparticles and ACFDT-RPA) and many-body perturbation theory (Møller-Plesset 2nd order) are available in VASP.
In VASP, central quantities such as the one-electron orbitals, the electronic charge density and the local potential are expressed in plane wave basis sets. The interactions between the electrons and ions are described by norm-conserving or ultra-soft pseudopotentials or the projector-augmented-wave method.
To determine the electronic ground state, VASP uses efficient iterative matrix diagonalisation techniques such as the residual minimisation method with direct inversion of the iterative subspace (RMM-DIIS) or blocked Davidson algorithms. These are coupled with highly efficient Broyden and Pulay density mixing techniques to accelerate the self-consistency cycle.
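To give a feel for how Pulay mixing accelerates a self-consistency cycle, here is a minimal sketch on a toy fixed-point problem. It is not VASP's implementation: the function names, the mixing parameter `alpha`, the history length and the test problem x = cos(x) are all assumptions chosen for illustration. At each step, the coefficients of the previous iterates are chosen so that the combined residual has minimal norm, subject to the coefficients summing to one.

```python
import math


def solve(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ZeroDivisionError("singular mixing matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pulay_fixed_point(f, x0, alpha=0.5, history=2, tol=1e-10, max_iter=50):
    """Iterate x = f(x) with Pulay (DIIS) mixing; returns (x, number of evaluations of f)."""
    xs, rs = [], []
    x = list(x0)
    for it in range(1, max_iter + 1):
        residual = [fi - xi for fi, xi in zip(f(x), x)]
        if math.sqrt(dot(residual, residual)) < tol:
            return x, it
        # Keep only the most recent iterates; a short history keeps the system small.
        xs = (xs + [x])[-history:]
        rs = (rs + [residual])[-history:]
        m = len(rs)
        # Minimise |sum_i c_i r_i| subject to sum_i c_i = 1 (Lagrange multiplier in last row).
        a = [[dot(rs[i], rs[j]) for j in range(m)] + [1.0] for i in range(m)]
        a.append([1.0] * m + [0.0])
        c = solve(a, [0.0] * m + [1.0])[:m]
        # New input: optimal combination of old inputs plus a damped residual step.
        x = [sum(c[i] * (xs[i][k] + alpha * rs[i][k]) for i in range(m))
             for k in range(len(x))]
    raise RuntimeError("no convergence within max_iter")


# Toy self-consistency problem (an assumption for illustration): find x with x = cos(x).
x, iterations = pulay_fixed_point(lambda v: [math.cos(v[0])], [1.0])
print(x[0], iterations)
```

With a single stored iterate the update reduces to plain linear mixing, x + alpha * r. The benefit comes from reusing the residual history, which is why such schemes usually need far fewer cycles than simple damping.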
VASP 6.3 on NEC SX-Aurora Vector Engine
The VASP port and optimization for Aurora was done according to the principle "as few changes as possible, but as many as necessary".
After a few necessary adaptations, such as exchanging header files for numerical libraries, the first port was successful. As a basis for performance analysis, NEC Aurora Ftrace output from different test cases was studied. A major part of the VASP code already showed quite good performance in the ported but not yet optimized version.
Many performance optimizations were achieved by introducing NEC compiler directives and by activating or deactivating VASP pre-processor flags.
One major performance bottleneck was the non-vectorized random number generator; the existing code was vectorized as far as possible. In addition, a few adaptations were implemented to allow cross-file inlining in order to avoid call overhead.
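The article does not describe how the generator was vectorized, and the following sketch is not VASP's generator. It only illustrates one generic way to remove the serial dependence from a simple linear congruential generator (LCG): several lanes each jump ahead by the number of lanes, so all lanes can be updated independently of each other. The constants, function names and lane count are assumptions chosen for illustration.

```python
# Illustrative LCG constants (a common textbook choice), not those used by VASP.
A, C, M = 1664525, 1013904223, 2**32


def lcg_sequential(seed, n):
    """Reference version: each value depends on the previous one."""
    out, x = [], seed
    for _ in range(n):
        x = (A * x + C) % M
        out.append(x)
    return out


def lcg_leapfrog(seed, n, lanes=8):
    """Same sequence, but `lanes` values are advanced independently per step."""
    # Coefficients of the LCG map applied `lanes` times: x -> a_k * x + c_k (mod M).
    a_k, c_k = 1, 0
    for _ in range(lanes):
        a_k, c_k = (A * a_k) % M, (A * c_k + C) % M
    # The first `lanes` values are produced serially to start the lanes.
    state = lcg_sequential(seed, lanes)
    out = list(state)
    while len(out) < n:
        # No lane depends on another, so this update maps onto vector instructions.
        state = [(a_k * x + c_k) % M for x in state]
        out.extend(state)
    return out[:n]


print(lcg_leapfrog(42, 5) == lcg_sequential(42, 5))
```

Python itself does not vectorize the list comprehension; the point is only that the inner update no longer carries a dependence from one element to the next.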
Finally, in a few selected routines, performance was improved by exchanging the loop order and by splitting loops into vectorized and unvectorized parts.
The port and optimization steps were applied to VASP version 6.2.1. The necessary adaptations were provided to the VASP team, who incorporated them into the official VASP 6.3.0 release.
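The idea of splitting a loop into vectorized and unvectorized parts can be sketched as follows. This is a generic illustration with hypothetical function names and data, not code from VASP, and Python is used only to show the shape of the transformation; the speed-up itself comes from a vectorizing compiler applied to the equivalent Fortran or C loops.

```python
def fused(b, c, decay):
    """Original form: the recurrence on `acc` prevents vectorizing the whole loop."""
    n = len(b)
    a = [0.0] * n
    running = [0.0] * n
    acc = 0.0
    for i in range(n):
        a[i] = b[i] * c[i]            # independent for every i
        acc = decay * acc + a[i]      # depends on the previous iteration
        running[i] = acc
    return a, running


def split(b, c, decay):
    """Split form: the independent work is isolated in its own loop."""
    n = len(b)
    # Loop 1: no dependence between iterations, so it can be vectorized.
    a = [b[i] * c[i] for i in range(n)]
    # Loop 2: only the short recurrence remains scalar.
    running = [0.0] * n
    acc = 0.0
    for i in range(n):
        acc = decay * acc + a[i]
        running[i] = acc
    return a, running


b = [1.0, 2.0, 3.0, 4.0]
c = [0.5, 0.25, 2.0, 1.0]
print(fused(b, c, 0.5) == split(b, c, 0.5))
```

Both versions perform the same arithmetic in the same order, so the results are identical. The split version simply exposes the part of the work that has no loop-carried dependence.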
To measure the performance of VASP 6.3 on the NEC SX-Aurora Vector Engine, we compared runtimes on an Aurora-based Vector Host B300-8 with those on an AMD-based pure x86 system. The Vector Host is based on an Intel 6148 host processor and is equipped with eight direct-liquid-cooled Vector Engines of type 20B, of which only two were in operation.
As a result of NEC's porting and optimization efforts, both VASP performance and performance per watt improved significantly on five test cases that ship with the official VASP package.
As can be seen in Figure 1, two VE cards of type 20B are roughly at the same performance level as one dual-socket node with AMD Milan 7713. To measure the power consumption of VASP on Aurora and AMD Milan, four instances of VASP were run at the same time. On the Aurora system, four jobs using two Vector Engines each were run in order to fill all eight Vector Engines of the Vector Host. Correspondingly, on the Milan system one instance of VASP was run on each node of a four-node server. Power was measured for one full Vector Host and one full Milan server.
As can be seen in Figure 3, the NEC SX-Aurora Vector Engine has a 40% advantage in power/energy consumption over a current-generation AMD Milan system.
Porting VASP 6.x to NEC's SX Vector Engine architecture proved not to be a difficult task. In particular, we would like to stress that the collaboration and technical discussions with the VASP developers at the University of Vienna were very fruitful, and that our Application and Benchmarking team received a great deal of support from the VASP team.
Our benchmarking results clearly show the significant advantage of NEC's SX Vector Engine architecture over competing off-the-shelf x86-based systems with regard to performance per watt. This matters especially in light of recent developments and trends: absolute performance is not an advantage per se, since it can trivially be reached by applying more hardware to the same problem, whereas energy efficiency as a key guiding principle for the development of new technology is more important than ever before.