# LSI and Circuit Technologies for the SX-8 Supercomputer

By Jun INASAKA,\* Toshio TANAHASHI,\* Hideaki KOBAYASHI,\* Toshihiro KATOH,\* Mikihiro KAJITA\* and Naoya NAKAYAMA†

**ABSTRACT** This paper describes the LSI and circuit technologies used in the SX-8 supercomputer. The SX-8 achieves high-speed operation and a superior price/performance ratio by using the leading-edge CMOS technology that was developed in cooperation with NEC Electronics. This technology features a high-performance transmission circuit design and noise reduction technology.

# 1. INTRODUCTION

NEC has improved the price/performance of its supercomputers by raising their technical performance. This improvement has been supported by increasing the scale of integration using CMOS technology and by introducing parallel processor operation. In the SX-8, further performance gains have been achieved through significant advances in LSI and circuit technologies. The newly developed LSI applies the most advanced 90nm CMOS process with 9-layer copper wiring, which also allows the supply voltage to be reduced.

To maximize system performance, it is essential to increase the LSI signal transfer speed as well as the processing speed. The SX-8 achieves high-speed inter-chip data transfer through a newly developed low-latency, multi-channel serial interface and a higher level of channel integration, made possible by reducing both the power consumption and the area of the interface circuit.

Noise, which limits increases in the signal transfer rate, is reduced by newly developed countermeasures based on a noise analysis flow and on a technology that enables the observation of noise inside the LSI.

# 2. LSI TECHNOLOGY

High processing capability requires a fast machine cycle. For this purpose, we developed a CMOS LSI incorporating the latest device and process technologies. The LSI specifications, interface technology, internal RAM and clock/PLL technology are described in the following subsections.

## 2.1 LSI Specifications

**Photo 1** and **Table 1** show the external view and specifications of the CPU chip developed for the SX-8. This LSI is fabricated in a 90nm CMOS process and reduces the wiring delay by using 9-layer copper wiring and an interlayer insulation film with a low dielectric constant. In general, the gate insulation film must be made thinner to obtain high performance at a low supply voltage, but thinning the gate oxide film usually increases the gate leakage current and decreases reliability. We therefore developed the "radical nitration process," in which nitrogen is introduced at the surface of the silicon oxide film. This process yields a high-performance transistor by reducing the electrical thickness of the gate insulation film while at the same time reducing the gate leakage current. Because a smaller wiring pitch increases the share of the delay budget taken by wiring resistance and capacitance, a new interlayer film material with a low dielectric constant (K = 2.9) was used to reduce the parasitic capacitance.

Photo 1 External view of the CPU chip.

<sup>\*</sup>Computers Division

<sup>†</sup>NEC Electronics Corporation

## 2.2 High-Speed Multi-Channel Serial Interface

The newly developed low-latency, 3Gbps/CH multi-channel serial interface features the world's highest effective data transfer rate per chip of 2.4Tbps (300Gbytes/sec.) (**Fig. 1**).
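The relationship between the quoted figures can be checked with a short calculation. The sketch below is illustrative only: the per-channel and per-chip rates come from the text, while the implied channel count rests on the assumption that every channel carries payload at the full 3Gbps, which the paper does not state explicitly. The function names are hypothetical.

```python
# Figures quoted in the text.
PER_CHANNEL_GBPS = 3.0   # 3 Gbps per channel
CHIP_RATE_TBPS = 2.4     # effective data transfer rate per chip


def tbps_to_gbytes_per_sec(tbps: float) -> float:
    """Convert Tbit/s to Gbyte/s using decimal prefixes and 8 bits per byte."""
    return tbps * 1000.0 / 8.0


def implied_channels(tbps: float, gbps_per_channel: float) -> float:
    """Channel count implied by the chip rate.

    Assumption (not from the paper): each channel carries payload at its
    full line rate, with no coding overhead.
    """
    return tbps * 1000.0 / gbps_per_channel


chip_gbytes = tbps_to_gbytes_per_sec(CHIP_RATE_TBPS)
channels = implied_channels(CHIP_RATE_TBPS, PER_CHANNEL_GBPS)
print(f"{chip_gbytes:.0f} Gbytes/s over about {channels:.0f} channels")
```

Under that assumption, 2.4Tbps corresponds to 300Gbytes/sec., in agreement with the text, and to roughly 800 channels of 3Gbps each.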

The latency of CPU-MMU transfers has an important influence on performance. To achieve low latency, the clocks for the interface circuits and for the logic circuits are designed to have the same phase, so that synchronization and inter-channel data alignment are performed simultaneously. This made it possible to omit the synchronization circuit and the coding and decoding circuits, to add an adjustment control channel, and to simplify the generation of the timings for the start, execution and completion of the CDR (Clock and Data Recovery) adjustments.

The integration of multiple channels in a single chip requires a decrease in both the power consumption and the circuit area. For this purpose, we optimized the circuit design and layout, reducing the number of circuits by sharing circuits between channels, and reduced the power consumption of the driver circuit. As a result, the power consumption and area of the transmitter block have been reduced to 26 mW/CH and $0.13\ \text{mm}^2$/CH, and those of the receiver block to 22 mW/CH and $0.18\ \text{mm}^2$/CH.

To deal with the power supply noise produced by the operation of the logic circuits, we separated the power supply circuit on the LSI and added on-chip capacitances. This approach, together with the improved noise immunity obtained by digitizing the control signals, has significantly improved the bit error rate.

For LSI testing, a random signal generator and collator are incorporated in the macro, and the signal is looped back from the transmitter circuit to the receiver circuit. This enables circuit function tests and real-time tests at the full speed of 3Gbps.
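As a rough illustration of what these per-channel figures imply, the following sketch adds the transmitter and receiver values and derives an energy per bit at 3Gbps. The power, area and rate values are taken from the text; treating one transmitter plus one receiver as "one link" is an assumption made here for the arithmetic.

```python
# Per-channel figures quoted in the text.
TX_POWER_MW, RX_POWER_MW = 26.0, 22.0
TX_AREA_MM2, RX_AREA_MM2 = 0.13, 0.18
RATE_GBPS = 3.0


def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy per transferred bit: mW / Gbps = 1e-3 W / 1e9 bit/s = pJ/bit."""
    return power_mw / rate_gbps


# Assumption: a "link" is one transmitter block plus one receiver block.
link_power_mw = TX_POWER_MW + RX_POWER_MW
link_area_mm2 = TX_AREA_MM2 + RX_AREA_MM2
link_energy_pj = energy_per_bit_pj(link_power_mw, RATE_GBPS)

print(f"{link_power_mw:.0f} mW, {link_area_mm2:.2f} mm^2, "
      f"{link_energy_pj:.0f} pJ/bit per link")
```

With these inputs, one transmitter/receiver pair consumes 48 mW in 0.31 mm², or 16 pJ per bit at 3Gbps.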
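The loopback test described above can be sketched in software. The paper does not specify the random pattern used, so the PRBS-7 polynomial (x^7 + x^6 + 1), the seed, and the names `prbs7` and `loopback_test` are all assumptions chosen for illustration; the `channel` callable stands in for the transmitter-to-receiver loopback path.

```python
from typing import Callable, Iterator


def prbs7(seed: int = 0x7F) -> Iterator[int]:
    """Yield bits of a PRBS-7 sequence (x^7 + x^6 + 1), period 127.

    Assumption: the actual pattern generator in the macro is not described
    in the paper; PRBS-7 is used here only as a common example.
    """
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def loopback_test(channel: Callable[[int], int], n_bits: int,
                  seed: int = 0x7F) -> int:
    """Send n_bits through `channel` and count mismatches at the checker."""
    tx = prbs7(seed)         # pattern generator on the transmit side
    expected = prbs7(seed)   # checker regenerates the same sequence
    errors = 0
    for _ in range(n_bits):
        received = channel(next(tx))
        if received != next(expected):
            errors += 1
    return errors


# Ideal loopback: every bit returns unchanged.
print(loopback_test(lambda bit: bit, 1000))
```

A faulty path can be modeled by passing a `channel` that flips bits, in which case the returned count is the number of bit errors observed over the test length.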

## 2.3 RAM

The newly developed LSI incorporates multiple types of RAM circuits including a large-capacity cache memory and a multi-port register file. These RAM circuits have been designed specifically to fully exploit the device performance.

The reduction in the supply voltage has made it necessary to quickly increase the amplitude of the signal output from the memory cells in order to achieve high-speed, stable operation. For this purpose, we adopted a bank system that reduces the number of memory cells connected to each bit line to a quarter of the former number, and a per-bit sense amplifier system[1] that reduces the input capacitance of the readout circuit. Together, these two systems reduce the load on the memory cells, which have small driving capabilities, and achieve high-speed, stable operation with larger signal amplitudes than before.

Power consumption is reduced by lowering the dynamic power through simplified readout circuit control signals, in addition to existing measures such as one-shot operation and deactivating non-selected circuits. The static power is also reduced by the effective use of low-leakage transistors.
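The effect of a lighter bit line load on signal amplitude can be illustrated with a first-order model in which a cell discharging a capacitance C with current I for time t develops a swing of I·t/C. This is a simplified sketch, not the authors' circuit model, and every numeric value below (cell current, sensing time, per-cell and sense-amplifier capacitances) is a hypothetical placeholder.

```python
def bitline_swing_mv(i_cell_ua: float, t_ps: float, n_cells: int,
                     c_cell_ff: float, c_sense_ff: float) -> float:
    """First-order bit line swing, dV = I * t / C.

    Units: uA * ps / fF = 1e-18 C / 1e-15 F = 1e-3 V, so the result is in mV.
    """
    c_total_ff = n_cells * c_cell_ff + c_sense_ff
    return i_cell_ua * t_ps / c_total_ff


# Hypothetical values for illustration only.
I_CELL_UA = 50.0    # cell read current
T_SENSE_PS = 100.0  # time allowed for the signal to develop

before = bitline_swing_mv(I_CELL_UA, T_SENSE_PS, n_cells=256,
                          c_cell_ff=1.0, c_sense_ff=40.0)
after = bitline_swing_mv(I_CELL_UA, T_SENSE_PS, n_cells=64,
                         c_cell_ff=1.0, c_sense_ff=10.0)
print(f"swing before: {before:.1f} mV, after: {after:.1f} mV")
```

In this model, cutting the cells per bit line to a quarter and lowering the readout input capacitance both shrink the denominator, so the same cell current produces a larger swing in the same time.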

## 2.4 PLL Circuit and Clock Distribution

High-speed clock operation is made possible by distributing high-speed, low-skew clocks.

Table 1 Specifications of the CPU chip.

| Item                         | CPU Chip            |
|------------------------------|---------------------|
| Technology node              | 90 nm               |
| Number of transistors        | 88 million          |
| Supply voltage               | 1.0 V               |
| Number of pins (signal pins) | 8,210 (1,923)       |
| Wiring layer configuration   | 9 copper layers     |
| I/O interface                | 1.0 V serial        |
| Packaging                    | Bare chip packaging |

Fig. 1 Multi-Channel Serial Interface.

The SX-8 generates a high-speed clock using APLL (Analog Phase-Locked Loop) circuitry that multiplies the clock input from outside the LSI. The APLL circuitry incorporates a VCO (Voltage-Controlled Oscillator) and aligns the phases of the clocks outside and inside the LSI. As a countermeasure against power supply noise, a dedicated, independent power supply is fed to the LSI for jitter reduction.

Low-skew clock distribution is achieved by dividing the LSI into logic and interface circuit domains and by generating and distributing two kinds of synchronous clocks: the logic circuit clock and the interface clock.

The clock distribution adopts a 2-step method: the main clock driver drives several local clock drivers distributed across the chip, and each local clock driver distributes a low-skew clock to the flip-flops. The main clock signal is distributed on a clock-dedicated thick-metal wiring layer that features low resistance. This reduces the signal slope degradation caused by resistance, and equal delays are achieved by taking the resistance (R), capacitance (C) and inductance (L) components of the wiring into consideration. In addition, the loads and wiring lengths are made uniform to minimize the effects of fabrication variance, and shielding with power supply and ground wires reduces the effects of crosstalk noise. The local clock signals are distributed with short, equal delays, and the driving strength of the clock drivers is optimized to reduce the clock skew.
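Why low wiring resistance and uniform wiring lengths help can be seen from a first-order estimate. The sketch below uses the Elmore delay of a distributed RC wire, 0.5·r·c·l², which ignores the inductance that the actual design takes into account; it is a simplification introduced here, and the resistance, capacitance and length values are hypothetical.

```python
from typing import Sequence


def elmore_delay_ps(r_ohm_per_mm: float, c_ff_per_mm: float,
                    length_mm: float) -> float:
    """Elmore delay of a distributed RC wire, 0.5 * r * c * length^2.

    Units: ohm * fF = 1e-15 s = 1e-3 ps. Inductance is ignored (assumption).
    """
    return 0.5 * r_ohm_per_mm * c_ff_per_mm * length_mm ** 2 * 1e-3


def skew_ps(arrival_times_ps: Sequence[float]) -> float:
    """Clock skew as the spread between latest and earliest arrival."""
    return max(arrival_times_ps) - min(arrival_times_ps)


# Hypothetical wire parameters for illustration only.
THIN_R, THICK_R = 100.0, 10.0   # ohm/mm for a thin and a thick metal layer
WIRE_C = 200.0                  # fF/mm

thin = elmore_delay_ps(THIN_R, WIRE_C, 4.0)
thick = elmore_delay_ps(THICK_R, WIRE_C, 4.0)
unequal = [elmore_delay_ps(THICK_R, WIRE_C, mm) for mm in (2.0, 3.0, 4.0)]
equal = [elmore_delay_ps(THICK_R, WIRE_C, 4.0) for _ in range(3)]

print(f"thin {thin:.0f} ps vs thick {thick:.0f} ps")
print(f"skew unequal {skew_ps(unequal):.0f} ps, equal {skew_ps(equal):.0f} ps")
```

Because the delay grows with the square of the length, branches of different lengths arrive at different times, whereas equalized branches on a low-resistance layer show both a smaller delay and no length-induced skew in this model.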

# 3. HIGH-SPEED CIRCUIT TECHNOLOGY

To improve the processing capabilities of such a high-speed system, it is necessary to increase data rates for both inter- and intra-chip links. Countermeasures against power supply noise are also required because noise hinders increases in the signal transfer speed considerably.

## 3.1 Increase in the Inter-LSI Signal Transfer Speed

In recent inter-chip signal transfers, data rates are limited by the high-frequency attenuation of the transmission line due to the skin effect and dielectric loss. When the traces are long, the attenuation becomes so severe that receivers fail to receive signals correctly. To compensate for this attenuation, the SX-8 employs low-loss PWBs (Printed Wiring Boards) and pre-emphasis, which together enable signal transfer over long wiring distances. The pre-emphasis function allows the strength of the output drivers to be adjusted in 2 steps and at 8 levels according to the attenuation of the transmission line. When two consecutive transmitted bits differ, the strength of the output driver is boosted; when they are the same, it is weakened.

To transmit signals at higher speeds, it is also important to reduce waveform distortions such as reflections by adjusting the impedance of the transmission line. For this purpose, we developed new sockets and connectors that result in low waveform distortion. In addition, the receiver circuitry incorporates a circuit for adjusting the termination resistance and impedance so as to match the impedance of the transmission line and to optimize the signal waveform. As a result, reliable data transfer at high data rates has been achieved regardless of manufacturing variations or environmental variations such as supply voltage and temperature.
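The pre-emphasis rule (stronger drive on a bit transition, weaker drive on a repeated bit) can be sketched as follows. The `main` and `boost` amplitudes are hypothetical normalized values; the paper does not give the mapping from its 2 steps and 8 levels to drive strengths, and treating the first bit as a transition is an assumption made here.

```python
from typing import List, Sequence


def pre_emphasis(bits: Sequence[int], main: float = 1.0,
                 boost: float = 0.25) -> List[float]:
    """Return a signed drive amplitude for each transmitted bit.

    A bit that differs from the previous one is driven at main + boost;
    a repeated bit is driven at main - boost. `main` and `boost` are
    hypothetical normalized values, not figures from the paper.
    """
    levels = []
    previous = None
    for bit in bits:
        sign = 1.0 if bit else -1.0
        if previous is None or bit != previous:   # transition: boost
            amplitude = main + boost
        else:                                     # repeated bit: weaken
            amplitude = main - boost
        levels.append(sign * amplitude)
        previous = bit
    return levels


print(pre_emphasis([1, 1, 0, 0, 1]))
```

Boosting only the transitions emphasizes the high-frequency content that the line attenuates most, while the weaker drive on repeated bits keeps the low-frequency content from dominating at the receiver.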
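The benefit of an adjustable termination can be illustrated with the standard reflection coefficient of a line of characteristic impedance Z0 terminated in Zt, Γ = (Zt − Z0)/(Zt + Z0). In the sketch below, the list of selectable termination values and the line impedance are hypothetical; the paper does not describe the adjustment range or step of the receiver circuit.

```python
from typing import Sequence


def reflection_coefficient(z_term_ohm: float, z0_ohm: float) -> float:
    """Voltage reflection coefficient at a termination: (Zt - Z0) / (Zt + Z0)."""
    return (z_term_ohm - z0_ohm) / (z_term_ohm + z0_ohm)


def best_termination(settings_ohm: Sequence[float], z0_ohm: float) -> float:
    """Pick the selectable termination value that minimizes |reflection|."""
    return min(settings_ohm,
               key=lambda z: abs(reflection_coefficient(z, z0_ohm)))


# Hypothetical selectable settings and a line whose impedance has drifted
# from its nominal value through manufacturing variation.
SETTINGS_OHM = [40.0, 45.0, 50.0, 55.0, 60.0]
LINE_Z0_OHM = 52.0

chosen = best_termination(SETTINGS_OHM, LINE_Z0_OHM)
print(chosen, reflection_coefficient(chosen, LINE_Z0_OHM))
```

Selecting the setting closest to the actual line impedance keeps the reflected fraction of the signal small even when the line deviates from its nominal value.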

## 3.2 Noise Reduction Technology

An increase in the transistor speed or in the transient current leads to an increase in the power supply noise. We adopted the following measures to reduce the supply noise.

### (1) Optimized Decoupling Capacitance

We developed and introduced a new analysis flow (**Fig. 2**) capable of simultaneous simulations of the LSI and PWB models in order to estimate the capacitances required for the LSI and PWB to be packaged.

Fig. 2 Flow for the analysis of power supply noise.

Fig. 3 Supply noise distribution map (simulation result). (Noise is highest at the densely displayed areas near the center.)

Fig. 4 Supply current distribution map. (Current is highest at the densely displayed areas.)

Fig. 5 On-chip detector for sensing power supply noise.

Efficient noise reduction is not possible by simply controlling the total amount of decoupling capacitance; the capacitors must also be designed to match the frequency bands of the noise. The high-frequency noise inside the LSI is reduced by adopting MOS gate capacitors with an optimally designed frequency response. The mid-frequency noise is reduced by adopting ceramic capacitors with low parasitic inductance on the LSI package.
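The need to match capacitors to frequency bands follows from the series R-L-C behavior of a real capacitor: it is effective only near its self-resonant frequency, f0 = 1/(2π√(LC)), above which its parasitic inductance dominates. The sketch below is a minimal illustration; the capacitance, inductance and resistance values for the on-chip and package capacitors are hypothetical and are not taken from the paper.

```python
import math


def self_resonant_hz(c_farad: float, esl_henry: float) -> float:
    """Self-resonant frequency of a capacitor with parasitic inductance."""
    return 1.0 / (2.0 * math.pi * math.sqrt(esl_henry * c_farad))


def impedance_ohm(f_hz: float, c_farad: float, esl_henry: float,
                  esr_ohm: float) -> float:
    """Impedance magnitude of a series R-L-C capacitor model."""
    omega = 2.0 * math.pi * f_hz
    reactance = omega * esl_henry - 1.0 / (omega * c_farad)
    return math.hypot(esr_ohm, reactance)


# Hypothetical component values for illustration only.
ON_CHIP = dict(c_farad=10e-9, esl_henry=1e-12, esr_ohm=0.005)
PACKAGE = dict(c_farad=100e-9, esl_henry=200e-12, esr_ohm=0.010)

f0_on_chip = self_resonant_hz(ON_CHIP["c_farad"], ON_CHIP["esl_henry"])
f0_package = self_resonant_hz(PACKAGE["c_farad"], PACKAGE["esl_henry"])
print(f"on-chip f0 = {f0_on_chip / 1e6:.0f} MHz, "
      f"package f0 = {f0_package / 1e6:.0f} MHz")
```

With values of this kind, the low-inductance on-chip capacitor resonates at a much higher frequency than the package capacitor, which is why the two are assigned to the high-frequency and mid-frequency noise bands respectively.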

**Figure 3** shows the result of power supply noise analysis using the newly introduced analysis flow, and **Fig. 4** shows the LSI current source distribution used in the analysis.

The figures show that relatively high noise levels are produced in the logic units that draw large currents. We therefore designed optimum on-chip capacitors so that even the highest noise generated by any logic unit does not exceed the target level.

### (2) On-Chip Supply Noise Measurements

An increase in the LSI speed makes it more difficult to monitor LSI internal noise from the board or package. We therefore developed an on-die detector for sensing power supply noise. This detector is designed to take measurements even while the system is running, which makes it possible to sense differences in the noise level according to the program being executed.

**Figure 5** shows a schematic view of the on-die power supply noise detector circuit. Its features include: 1) no need for a dedicated power supply, because a power supply filter is incorporated in the detector; 2) easy operation, because the reference voltage (Vref) and VCO frequency are adjusted by externally input digital signals; and 3) high-speed noise measurement capability, because the incorporated VCO can generate clocks with a higher frequency than the processing core clock.

In the SX-8, we measured the supply noise of the processor chips using this detector and verified that the noise was reduced sufficiently to satisfy the target level.
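One way such a detector could be used is sketched below. The paper does not describe the measurement procedure, so the following is an assumption for illustration only: the supply is compared against Vref at each VCO clock edge, and Vref is stepped through its digital settings to find the lowest level the supply still dips below. The function names, sample values and Vref settings are all hypothetical, and voltages are held as integer millivolts.

```python
from typing import Optional, Sequence


def count_below(samples_mv: Sequence[int], vref_mv: int) -> int:
    """Count sampled instants at which the supply is below Vref."""
    return sum(1 for v in samples_mv if v < vref_mv)


def estimate_droop_level(samples_mv: Sequence[int],
                         vref_settings_mv: Sequence[int]) -> Optional[int]:
    """Return the lowest Vref setting that the supply still dips below.

    The true minimum supply voltage then lies between this setting and the
    next lower one. Returns None if the supply never drops below any setting.
    """
    for vref in sorted(vref_settings_mv):
        if count_below(samples_mv, vref) > 0:
            return vref
    return None


# Hypothetical supply samples (mV) taken at VCO clock edges, and a
# hypothetical set of digitally selectable Vref values in 20 mV steps.
SAMPLES_MV = [1000, 980, 950, 1020, 970]
VREF_SETTINGS_MV = list(range(900, 1001, 20))

print(estimate_droop_level(SAMPLES_MV, VREF_SETTINGS_MV))
```

In this sketch a faster sampling clock yields more samples per noise event, which is consistent with the stated benefit of a VCO that runs faster than the core clock; repeating the sweep under different workloads would expose program-dependent noise levels.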

# 4. CONCLUSION

This paper has outlined the LSI and circuit technologies incorporated in the SX-8. We will continue to develop even more advanced LSI and circuit technologies for supercomputers with higher performance and improved cost efficiency.

# REFERENCE

[1] K. Takeda et al.: "Per-bit sense amplifier scheme for a GHz SRAM macro in sub-100 nm CMOS technology," IEEE International Solid-State Circuits Conference (ISSCC) 2004, Digest of Technical Papers, Feb. 15-19, 2004, Vol. 1, pp. 502-542.

Received August 18, 2005

\* \* \* \* \* \* \* \* \* \* \* \* \* \*



Jun INASAKA joined NEC Corporation in 1985. He is now a Chief Manager of Hardware Technologies, Computers Division.



Toshihiro KATOH joined NEC Corporation in 1988. He is now a Manager of Computers Division.



Toshio TANAHASHI joined NEC Corporation in 1971. He is now a Manager of Circuit Design Engineering, Computers Division.



Mikihiro KAJITA joined NEC Corporation in 1992. He is now a Manager of Computers Division.



Hideaki KOBAYASHI joined NEC Corporation in 1987. He is now a Manager of Computers Division.



Naoya NAKAYAMA joined NEC Corporation in 1992. He is now an Assistant Manager of Server Systems Division, NEC Electronics Corporation.

\* \* \* \* \* \* \* \* \* \* \* \* \* \*