Accelerating Social Value Creation: NEC's AI Supercomputer
July 24, 2023
1. Overview of AI research supercomputer
NEC has started building a supercomputer for AI research in order to maintain and strengthen its dominance in the field of AI development amid increasingly intense global competition. With full-scale operations planned to begin in March 2023, the AI supercomputer will become the largest in the industry in Japan (*1), with a performance of more than 580 PFLOPS. Part of the system (100 PFLOPS) is already being used by hundreds of AI researchers at NEC. Adding the 480 PFLOPS system that is currently under construction will give NEC Japan's leading research and development environment dedicated to AI, which will help to accelerate the development of more advanced AI. In the future, NEC aims to establish a center of excellence for AI research that creates advanced social value through co-creation with customers and partners.
2. Hardware configuration of AI research supercomputer
In the field of deep learning, computational resources are a source of competitiveness. NEC aims to provide Japan's largest corporate AI research supercomputer to hundreds of AI researchers at NEC, which will significantly strengthen AI research and development.
When full-scale operations begin in March 2023, the AI research supercomputer will provide Japan's largest corporate AI research and development environment, with computing performed by 928 enterprise GPUs, a network built on ultrahigh-speed 200 GbE switches, and a large-scale distributed storage system with a total capacity of more than 16 PB.
Table 1 Overview of AI research supercomputer system configuration
|Computing||116 GPU servers, each equipped with 8 NVIDIA A100 80 GB Tensor Core GPUs||928 GPUs|
|Network||NVIDIA Spectrum SN3700 high-speed Ethernet switches (200 GbE)||Several thousand optical cable connections|
|Storage||ES400NVX high-speed storage appliances manufactured by DDN||16 PB|
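The headline figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text and Table 1, and it assumes that the full 580 PFLOPS is delivered by the 928 GPUs, which the source does not state explicitly. Under that assumption, the implied peak per GPU is about 625 TFLOPS, close to the 624 TFLOPS that NVIDIA quotes for the A100's FP16 Tensor Core throughput with structured sparsity. This suggests, but does not confirm, how the system total was derived.

```python
# Figures taken from the text and Table 1.
GPU_SERVERS = 116
GPUS_PER_SERVER = 8
GPU_MEMORY_GB = 80        # per NVIDIA A100 80 GB GPU
EXISTING_PFLOPS = 100     # part of the system already in use
NEW_PFLOPS = 480          # part of the system under construction

total_gpus = GPU_SERVERS * GPUS_PER_SERVER
total_gpu_memory_tb = total_gpus * GPU_MEMORY_GB / 1000   # decimal terabytes
total_pflops = EXISTING_PFLOPS + NEW_PFLOPS

# Assumption: all of the stated performance comes from the 928 GPUs.
implied_tflops_per_gpu = total_pflops * 1000 / total_gpus  # 1 PFLOPS = 1000 TFLOPS

print(f"GPUs: {total_gpus}")
print(f"Aggregate GPU memory: {total_gpu_memory_tb:.2f} TB")
print(f"Total performance: {total_pflops} PFLOPS")
print(f"Implied peak per GPU: {implied_tflops_per_gpu:.0f} TFLOPS")
```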
2.1 Computing specifications
The AI supercomputer uses NVIDIA A100 80 GB Tensor Core GPUs, which are based on NVIDIA's Ampere architecture. These GPUs support a matrix computation mode called TensorFloat-32 (TF32), which in many cases enables faster learning than FP32 with little loss of accuracy. In addition, each GPU supports 12 NVLink connections with a total bandwidth of 600 GB/s, which makes high-speed multi-GPU learning possible. At NEC, AI learning is conducted for a wide variety of applications, such as biometric authentication, image recognition, speech recognition, data analysis, and robot control technology. NEC adopted NVIDIA's GPUs because they are well suited to creating a variety of advanced AI quickly: their high-speed performance dramatically accelerates learning, and their large GPU memory capacity enables learning with large neural networks.
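To give a feel for what TF32 trades away, the following sketch emulates its precision in plain Python. TF32 keeps FP32's 8-bit exponent but only 10 of its 23 mantissa bits. The helper names `to_tf32` and `dot_tf32` are hypothetical, and the sketch simplifies in two ways: it truncates instead of rounding the inputs, and it accumulates in Python's double precision where the hardware accumulates in FP32. It illustrates the number format only and says nothing about speed.

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 input precision by truncating an FP32 value to a 10-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # FP32 bit pattern
    bits &= 0xFFFFE000                                    # clear the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_tf32(a, b):
    """Dot product with TF32-truncated inputs (accumulation precision is simplified)."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))

a = [0.1234567, 1.7654321, -2.5000001]
b = [3.1415926, 0.0001234, 1.0000001]
exact = sum(x * y for x, y in zip(a, b))
approx = dot_tf32(a, b)
print(f"reference: {exact!r}")
print(f"TF32-like: {approx!r}")
print(f"relative error: {abs(approx - exact) / abs(exact):.2e}")
```

The relative error of each truncated input is below 2^-10 (roughly three decimal digits), which is often small compared with the noise already present in stochastic gradient training.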
Each GPU server is equipped with eight NVIDIA A100 80 GB Tensor Core GPUs. The servers are Supermicro SYS-420GP-TNAR systems, optimized for large-scale distributed learning applications by customizing the configuration of the CPU, storage, and memory, as well as the BIOS settings.
Table 2 GPU server specifications
|CPU||Intel Xeon Platinum 8358 (32 core, 2.6 GHz) x 2|
|GPU||NVIDIA A100 80 GB Tensor Core GPU x 8|
|Local Storage||1.9 TB NVMe SSD + 7.6 TB NVMe SSD x 4|
|Interconnect (Servers)||NVIDIA ConnectX-6 (200 Gb/s Ethernet) Single-Port x 5|
|Interconnect (Storage)||NVIDIA ConnectX-6 (200 Gb/s Ethernet) Dual-Port x 1|
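The per-server figures in Table 2 can be combined into a few derived quantities. The sketch below assumes that the local storage entry means one 1.9 TB drive plus four 7.6 TB drives, and it treats each 200 Gb/s port as running at line rate. Both are assumptions, and achievable throughput will be lower once host bus limits and protocol overhead are taken into account.

```python
# Figures from Table 2 and Section 2.1.
CPU_CORES = 2 * 32               # two 32-core Intel Xeon Platinum 8358 CPUs
LOCAL_SSD_TB = 1.9 + 4 * 7.6     # assumed reading: one 1.9 TB drive plus four 7.6 TB drives
PORT_GBIT_PER_S = 200            # each ConnectX-6 Ethernet port
SERVER_PORTS = 5 * 1             # five single-port adapters for server-to-server traffic
STORAGE_PORTS = 1 * 2            # one dual-port adapter for storage traffic
NVLINK_GBYTE_PER_S = 600         # total NVLink bandwidth per GPU

def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8

server_net_gbyte = gbit_to_gbyte(SERVER_PORTS * PORT_GBIT_PER_S)
storage_net_gbyte = gbit_to_gbyte(STORAGE_PORTS * PORT_GBIT_PER_S)
nvlink_to_network = NVLINK_GBYTE_PER_S / server_net_gbyte

print(f"CPU cores per server: {CPU_CORES}")
print(f"Local NVMe capacity: {LOCAL_SSD_TB:.1f} TB")
print(f"Inter-server line rate: {server_net_gbyte:.0f} GB/s")
print(f"Storage line rate: {storage_net_gbyte:.0f} GB/s")
print(f"NVLink (per GPU) vs. inter-server network (per server): {nvlink_to_network:.1f}x")
```

The last ratio shows why communication inside a server is much cheaper than communication between servers, which is the motivation for the network design described in the next section.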
2.2 Network specifications
Distributed deep learning requires low-latency, high-bandwidth communication, due to the numerous parameters that need to be communicated between GPUs. For this reason, the system is equipped with NVIDIA Spectrum SN3700 high-speed Ethernet switches, which support 200 Gbps ultrahigh-speed Ethernet.
The connections between all servers are implemented using NVIDIA ConnectX-6 low-latency interconnects installed in the servers, along with NVIDIA Spectrum SN3700 200 GbE high-speed Ethernet switches. The network uses a spine-leaf architecture and RoCEv2 (RDMA over Converged Ethernet version 2) to achieve ultrahigh-speed, low-latency communication, which enables high-speed distributed learning.
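To illustrate why both bandwidth and latency matter here, the sketch below estimates the time for one gradient synchronization using the textbook ring all-reduce cost model: with N workers, the payload is split into N chunks and exchanged in 2(N - 1) steps. This is a generic model, not a description of NEC's implementation. NCCL may choose other algorithms, and the payload size, latency, and single-port link speed used in the example are hypothetical values.

```python
def ring_allreduce_seconds(payload_bytes: float, workers: int,
                           link_bytes_per_s: float, latency_s: float = 0.0) -> float:
    """Estimate ring all-reduce time: 2*(N-1) steps, each moving payload/N bytes."""
    if workers < 2:
        return 0.0                      # nothing to exchange with a single worker
    steps = 2 * (workers - 1)           # reduce-scatter phase plus all-gather phase
    chunk_bytes = payload_bytes / workers
    return steps * (chunk_bytes / link_bytes_per_s + latency_s)

# Hypothetical example: 1 GB of gradients, one 200 Gb/s port (25 GB/s) per server.
payload = 1e9
link = 200e9 / 8
for n in (2, 16, 116):
    for latency in (0.0, 5e-6):
        t = ring_allreduce_seconds(payload, n, link, latency)
        print(f"workers={n:3d} latency={latency * 1e6:.0f} us -> {t * 1e3:.2f} ms")
```

In this model the bandwidth term approaches twice the payload divided by the link speed as the number of workers grows, while the latency term grows linearly with the worker count. That is one reason low-latency transports such as RDMA become more important at larger scale.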
Table 3 Switch specifications
|Switch||NVIDIA Spectrum SN3700 (200 Gbps Ethernet Switch)|
2.3 Storage specifications
With a total capacity of more than 16 PB, the large-scale storage system consists of ES400NVX storage appliances manufactured by DataDirect Networks (DDN), which are equipped with the EXAScaler parallel file system. The high-speed area, which consists of many NVMe SSDs, delivers sequential read speeds of up to 400 GB/s and sequential write speeds of up to 320 GB/s.
The ability to access the storage from all GPU servers makes it possible to quickly retrieve even large datasets with tens of millions of images from GPU servers, thereby enabling high-speed distributed learning. Since AI learning at NEC is conducted with a diverse range of data such as videos, images, audio, and text, a high-performance storage system that can withstand a variety of workloads is required. For this reason, NEC decided to adopt DDN's storage.
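As a rough guide to what these figures mean for a single training job, the sketch below bounds the read rate each server can expect when several servers read from the shared storage at once. It assumes the peak sequential read speed is shared evenly and that each server is capped by the line rate of its dual-port storage adapter from Table 2. It ignores file system overhead, caching, and host bus limits, so these are upper bounds only. The dataset size in the example is a hypothetical value.

```python
ARRAY_READ_GB_PER_S = 400            # peak sequential read speed, from the text
STORAGE_NIC_GB_PER_S = 2 * 200 / 8   # dual-port 200 Gb/s adapter (Table 2), line rate

def per_server_read_gb_per_s(servers_reading: int) -> float:
    """Upper bound on each server's read rate when several servers read at once."""
    if servers_reading < 1:
        raise ValueError("servers_reading must be at least 1")
    fair_share = ARRAY_READ_GB_PER_S / servers_reading
    return min(fair_share, STORAGE_NIC_GB_PER_S)

def read_seconds(dataset_gb: float, servers_reading: int = 1) -> float:
    """Lower bound on the time for every reading server to pull the whole dataset."""
    return dataset_gb / per_server_read_gb_per_s(servers_reading)

# Hypothetical dataset: 20 million images at 150 kB each.
dataset_gb = 20_000_000 * 150e3 / 1e9
for n in (1, 8, 116):
    rate = per_server_read_gb_per_s(n)
    print(f"{n:3d} servers: {rate:6.2f} GB/s each, {read_seconds(dataset_gb, n):7.1f} s")
```

Under these assumptions, up to eight servers are limited by their own storage adapters; beyond that, the storage system's aggregate read speed becomes the shared limit.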
Table 4 Storage specifications
|SSD||ES400NVX||Approx. 1.1 PB|
|HDD||ES400NVX + SS9012||Approx. 14.6 PB|
3. Software configuration of AI research supercomputer
NEC has independently developed a distributed learning environment with Kubernetes as the core software, so that NEC's AI researchers can conduct AI learning without building complex environments themselves. As a result, researchers can perform distributed learning using containers with the latest deep learning frameworks. In addition, NEC has independently developed a Kubernetes job scheduler and optimized physical configurations such as the network topology to enable more efficient scheduling, which allows many researchers to share the system. Furthermore, NEC has applied its own optimizations to software such as the MPI and NCCL communication libraries and the OS, enabling high-speed distributed learning with GPUDirect RDMA over RoCEv2.
By tightly coupling the hardware architecture, including the server architecture and spine-leaf type network configuration, with an advanced software suite, NEC has achieved a high-performance, highly convenient, advanced deep learning environment. It is only through the integration of computer architecture and software architecture at a high level that such a world-class AI research supercomputer is possible.
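The idea of scheduling with awareness of the network topology can be sketched in a few lines. The example below is a hypothetical illustration, not NEC's scheduler and not a Kubernetes API: the function `place_job`, the node names, and the leaf switch labels are all invented for this sketch. It prefers to place a multi-server job under a single leaf switch so that traffic does not have to cross the spine, and otherwise spans as few leaf switches as possible.

```python
from collections import defaultdict

def place_job(free_nodes: dict, nodes_needed: int):
    """Pick nodes for a job, preferring nodes that share a leaf switch.

    free_nodes maps a node name to the leaf switch it is attached to.
    Returns a list of node names, or None if there are not enough free nodes.
    """
    by_leaf = defaultdict(list)
    for node, leaf in sorted(free_nodes.items()):
        by_leaf[leaf].append(node)

    # Best fit: the smallest single leaf that can hold the whole job.
    fitting = [nodes for nodes in by_leaf.values() if len(nodes) >= nodes_needed]
    if fitting:
        return min(fitting, key=len)[:nodes_needed]

    # Otherwise span leaves, filling from the leaves with the most free nodes.
    chosen = []
    for nodes in sorted(by_leaf.values(), key=len, reverse=True):
        chosen.extend(nodes[: nodes_needed - len(chosen)])
        if len(chosen) == nodes_needed:
            return chosen
    return None

free = {"n1": "leafA", "n2": "leafA", "n3": "leafB", "n4": "leafB", "n5": "leafB"}
print(place_job(free, 2))   # fits under one leaf
print(place_job(free, 4))   # has to span two leaves
print(place_job(free, 6))   # not enough free nodes
```

Choosing the smallest leaf that fits keeps larger groups of free nodes intact for later jobs. A production scheduler would also have to account for factors such as GPU counts per node, rack layout, priorities, and preemption.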
Table 5 Software stack of AI research supercomputer
|Operating System||Ubuntu 20.04 LTS Server||Utilizes Ubuntu, which is widely used in deep learning and is familiar to researchers.|
|Container||Kubernetes||Utilizes the de facto standard for container orchestration and enables the use of a cutting-edge deep learning environment.|
|Job Scheduler||Expanded development of Kubernetes scheduler||Optimization based on factors such as rack layout, network topology, and hardware topology enables many jobs to be processed stably and efficiently.|
|Communication library||NCCL, OpenMPI||NEC's unique tuning and optimization enables high-speed performance of distributed learning using many GPUs.|
|Deep Learning||PyTorch, TensorFlow 2, etc.||The latest deep learning frameworks are provided as containers that are built to suit the internal environment.|
Now recruiting researchers and technology developers!
Hundreds of AI researchers at NEC are already using NEC's AI research supercomputer, and new advancements in AI are being developed every day. NEC is currently searching for individuals who can further strengthen our world-class advanced AI and work with us to shape our future. If you are interested in participating in the development and operation of Japan's largest AI research supercomputer and creating innovations, please apply via the links below.
NEC is also recruiting AI researchers and engineers interested in using Japan's top-class AI research and development environment to create new social value.
Person in charge of AI research supercomputer technology development:
Takatoshi Kitano, Digital Technology Development Laboratory