
- Performance and Energy Efficiency -

Technical Articles

Nov. 1, 2022
Masashi Ikuta, Advanced Platform Division, NEC Corporation


Quantum ESPRESSO (QE) is a simulation package for first-principles calculations. It includes not only programs that solve fundamental physical equations, but also many packages and plugins that help users tackle a variety of physics problems. It is widely used in many domains, including materials science, chemistry, and molecular dynamics. QE is open-source software developed by the community and distributed under the GNU General Public License (GPL). Further details can be found on the official Quantum ESPRESSO web page [1].

According to a LinkedIn post by the QE developers, there were more than 15,000 downloads from all around the world in the first 4 months of 2022, and this number is growing rapidly.


We had two motivations for supporting QE on SX-Aurora TSUBASA. The first, as discussed above, is that QE has many users across various domains. The second is that QE appeared to be well suited to the SX-Aurora TSUBASA vector architecture. QE is software for first-principles calculations whose algorithms are dominated by matrix-matrix operations, matrix-vector operations, and FFTs. Vector architectures have an advantage in these calculations, so we can expect good QE performance on SX-Aurora TSUBASA.
QE for SX-Aurora TSUBASA can be downloaded from GitHub [2]. QE version 6.4.1 is the currently supported version, and work to support version 7.1 is ongoing.

Performance comparison

To compare performance, we start with PWscf (Plane-Wave Self-Consistent Field), the most popular package in QE. We ran the AUSURF112 dataset on SX-Aurora TSUBASA and on Intel Xeon Gold 6326 (Ice Lake generation). AUSURF112 [3] is one of the most popular datasets for QE benchmarking and, as the name implies, models 112 atoms. The Intel Xeon Gold 6326 has 16 physical cores, or 32 threads with Hyper-Threading enabled. In our measurement we used 2 sockets, giving 32 cores and 64 threads in total. For SX-Aurora TSUBASA, we used 1 Vector Engine node (VE Type 20B), which has 8 cores. MPI was used for parallel processing in both runs. The Intel Xeon system took 443 seconds to finish the calculation, while SX-Aurora TSUBASA finished the same calculation in 163 seconds, a speedup of 2.7 times.

Figure 1: PWscf performance comparison (dataset = AUSURF112)
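
The speedup quoted above follows directly from the two elapsed times. The minimal Python sketch below reproduces that arithmetic; the function and variable names are illustrative only and are not part of QE or of the benchmark setup.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate run is than the baseline run."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return baseline_seconds / candidate_seconds


# Elapsed times reported in this article for the AUSURF112 dataset.
XEON_SECONDS = 443.0  # 2 sockets of Intel Xeon Gold 6326
VE_SECONDS = 163.0    # 1 Vector Engine (Type 20B), 8 cores

pwscf_speedup = speedup(XEON_SECONDS, VE_SECONDS)
print(f"PWscf speedup: {pwscf_speedup:.1f}x")  # prints "PWscf speedup: 2.7x"
```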

Another widely used package is Wannier90, which calculates maximally localized Wannier functions in QE. The results below are from Wannier90 runs carried out by IIT Delhi (Indian Institute of Technology Delhi). These runs used the previous-generation SX-Aurora TSUBASA (VE Type 10B, 1 socket), which has lower performance than the VE used above; see our website [4] for more detailed specifications. The comparison is against 2 sockets of Intel Xeon Gold 6230 (Cascade Lake generation), using IIT Delhi's original dataset.

Figure 2: Wannier90 performance comparison (dataset = IITD original)

These results show that SX-Aurora TSUBASA also performs well with the Wannier90 package in QE. We would also like to highlight that Wannier90 has so far only been ported to SX-Aurora TSUBASA and has not been tuned for the vector architecture. Further speedup should therefore be possible, and this is part of our future work.

Power Efficiency

Whilst absolute performance is a crucial measure, power efficiency (performance per watt) is also a very important factor in modern HPC. An interesting study of power and performance was carried out by CRIANN in partnership with the Nanoclean Energy Industrial Chair at ENSICAEN, funded by ANR (the French National Research Agency) and TotalEnergies. The study compared SX-Aurora TSUBASA (Type 10A) with AMD EPYC 7642 (Rome generation). With power consumption limited to 1300 watts, 4 sockets of AMD EPYC 7642 completed a 408-atom calculation in 2318 seconds. The same calculation on 4 sockets of SX-Aurora Vector Engine, under the same 1300-watt limit, completed in 648 seconds. Under the same power budget, SX-Aurora is therefore more than 3.5 times faster than AMD EPYC.

Figure 3: Performance comparison under 1300 Watts power limitation (408 atoms dataset)
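
Because both runs share the same power cap, the time ratio also bounds the ratio of energy used. The Python sketch below works through the numbers reported above. It assumes, purely for illustration, that each system draws no more than the full 1300-watt cap for the whole run, so the energy figures are upper bounds rather than measured values; all names in the code are hypothetical.

```python
JOULES_PER_KWH = 3.6e6

# Values reported in this article for the 408-atom calculation.
POWER_CAP_WATTS = 1300.0
EPYC_SECONDS = 2318.0  # 4 sockets of AMD EPYC 7642
VE_SECONDS = 648.0     # 4 sockets of SX-Aurora Vector Engine


def energy_upper_bound_kwh(power_cap_watts: float, elapsed_seconds: float) -> float:
    """Upper bound on energy to solution, ASSUMING the draw never exceeds the cap."""
    return power_cap_watts * elapsed_seconds / JOULES_PER_KWH


time_ratio = EPYC_SECONDS / VE_SECONDS
epyc_kwh = energy_upper_bound_kwh(POWER_CAP_WATTS, EPYC_SECONDS)
ve_kwh = energy_upper_bound_kwh(POWER_CAP_WATTS, VE_SECONDS)

print(f"Time ratio (EPYC / VE): {time_ratio:.2f}")
print(f"Energy upper bound, EPYC: {epyc_kwh:.3f} kWh")
print(f"Energy upper bound, VE:   {ve_kwh:.3f} kWh")
```

With an identical cap on both sides, the ratio of the two energy bounds equals the time ratio, which is why a fixed power budget makes elapsed time a direct proxy for efficiency.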

We can also validate this result the other way around. This time, we compare how much power is needed to finish 4 steps of the 408-atom calculation in under 800 seconds. AMD EPYC 7642 required 12 sockets drawing 3750 watts, while SX-Aurora TSUBASA required only 3 sockets drawing 1100 watts. For the same QE calculation, SX-Aurora TSUBASA is therefore about 3.4 times more power-efficient.

Figure 4: Energy efficiency comparison (408 atoms dataset, 4 steps under 800 sec)
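
The 3.4 times figure is the ratio of the two power totals at a fixed time target. The Python sketch below reproduces it from the numbers above and also derives the socket ratio and the average power per socket implied by those totals. The `Configuration` class is a hypothetical helper, and the per-socket values are simple averages of the reported totals, which may include components other than the processors themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """A system configuration that met the 800-second target (illustrative helper)."""
    name: str
    sockets: int
    total_watts: float

    @property
    def watts_per_socket(self) -> float:
        # Average of the reported total; not a measured per-chip figure.
        return self.total_watts / self.sockets


# Values reported in this article (408 atoms, 4 steps, under 800 seconds).
epyc = Configuration("AMD EPYC 7642", sockets=12, total_watts=3750.0)
ve = Configuration("SX-Aurora TSUBASA", sockets=3, total_watts=1100.0)

power_ratio = epyc.total_watts / ve.total_watts
socket_ratio = epyc.sockets / ve.sockets

print(f"Power ratio (EPYC / VE):  {power_ratio:.1f}")
print(f"Socket ratio (EPYC / VE): {socket_ratio:.0f}")
for cfg in (epyc, ve):
    print(f"{cfg.name}: {cfg.watts_per_socket:.1f} W per socket on average")
```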

Summary and Future work

Our own measurements and the work done by IIT Delhi confirm that SX-Aurora TSUBASA shows good performance running Quantum ESPRESSO. Thanks to the work carried out by CRIANN, we also observed that SX-Aurora TSUBASA has an advantage over x86_64 processors in terms of power consumption.
As for future work, we plan to port and optimize QE version 7.1 for SX-Aurora TSUBASA, and then to carry out further benchmarking. QE version 7.1 supports a new algorithm called RISM, which is suited to larger calculation sizes. Preliminary studies by NEC suggest that SX-Aurora TSUBASA performs well with the RISM algorithm, and we believe it will be beneficial to study this in more detail.