
Facilitating research on bleeding-edge HPC technologies

The eX3 infrastructure is continuously being built up and reconfigured to keep pace with technological developments. The following hardware resources, acquired in the first-phase procurement, are currently available. For further details, please consult the eX3 wiki.


AMD EPYC

There are four dual processor nodes (Supermicro 2023US-TR4) in the eX3 cluster equipped with the AMD EPYC 7601 processor (Naples family, 64-bit, 32-core, x86). Thus, in total eight processors with 256 cores. Each AMD node has 2 TB DDR4 main memory and 2.8 TB local NVMe scratch storage.

There are 12 single-processor nodes (Gigabyte R272-Z30) with the AMD EPYC 7302P processor (Rome family, 64-bit, 16-core, x86). So, in total 12 processors with 192 cores, primarily intended for networking research. Each of these AMD nodes has 128 GB DDR4 main memory and 256 GB local NVMe scratch storage.

Recently, four additional dual processor nodes (Supermicro 2024US-TRT) with the AMD EPYC 7302 processor (Rome family, 64-bit, 16-core, x86) have been added. That is, in total eight processors with 128 cores, primarily serving as hosts for FPGA accelerators. Each of these AMD nodes has 512 GB DDR4 main memory and 7.6 TB local NVMe scratch storage.

Moreover, we have added four dual processor nodes (Supermicro 2024US-TRT) with the AMD EPYC 7763 processor (Milan family, 64-bit, 64-core, x86). That is, in total eight processors with 512 cores. Two of the nodes have dual AMD MI100 GPUs and two have dual NVIDIA A100 GPUs (see below). The nodes with AMD accelerators are considered “lightweight replicas” of the nodes to appear in the LUMI supercomputer. Each of these AMD nodes has 2 TB DDR4 main memory and 7.6 TB local NVMe scratch storage. There are also two AMD EPYC 7763 processors in a dedicated 8-way NVIDIA A100 system, see below.
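
A minimal OpenMP sketch in C (assuming a GCC or Clang toolchain; compile with -fopenmp) can serve as a sanity check of what a job actually sees on one of these nodes; for example, a full dual-socket EPYC 7763 node should report 128 cores (256 hardware threads with SMT enabled).

    #include <omp.h>
    #include <stdio.h>

    /* Report the hardware parallelism visible to an OpenMP program;
       expect twice the per-socket core count on a full dual-socket node. */
    int main(void) {
        printf("processors available: %d\n", omp_get_num_procs());
    #pragma omp parallel
        {
    #pragma omp single
            printf("threads in team:      %d\n", omp_get_num_threads());
        }
        return 0;
    }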


ARM CAVIUM

There are four dual processor nodes (Gigabyte R281-T94) in the eX3 cluster equipped with the ARM Cavium ThunderX2 CN9980 processor (64-bit, 32-core, ARM). Thus, in total eight processors with 256 cores. Each ARM node has 1 TB DDR4 main memory and 5.3 TB local SSD scratch storage.


HISILICON KUNPENG

There are four dual processor nodes (Huawei TaiShan 200) in the eX3 cluster with the HiSilicon KunPeng 920 processor (64-bit, 64-core, ARM). This makes a total of eight processors with 512 cores. Each Huawei node has 1 TB DDR4 main memory and 4.2 TB local SSD scratch storage.


INTEL XEON

Embedded in the NVIDIA DGX-2 node (see below), eX3 offers access to two Intel Xeon Platinum 8168 CPUs (64-bit, 24-core, x86).

There are four dual processor nodes in the eX3 cluster equipped with the Intel Xeon Gold 6130 processor (64-bit, 16-core, x86). Thus, in total eight processors with 128 cores. Each of these nodes has 384 GB DDR4 main memory and 2 TB local scratch storage. These nodes double as login/management nodes. Two of the nodes are also equipped with an additional 384 GB of NVDIMM memory.

In addition, eX3 has eight single processor nodes equipped with Intel Xeon Silver 4112 processors (64-bit, 4-core, x86), thus in total 32 cores.

There are also two Intel Xeon Platinum 8360Y processors (Ice Lake, 64-bit, 36-core, x86) in a dedicated 8-way NVIDIA A100 system, see below.


AMD MI100 GPU

Two of the recently installed AMD Milan CPU nodes are accelerated by two AMD Radeon Instinct MI100 GPUs each, see above. These GPUs have 32 GB of local memory each.


NVIDIA V100 GPU

The eX3 infrastructure includes a DGX-2 system consisting of 16 NVIDIA Tesla V100 GPUs, allowing simultaneous communication between all eight GPU pairs at 300 GB/s through the 12 integrated NVSwitches. This gives a theoretical system-wide bi-directional bandwidth of 2.4 TB/s. All GPUs have 32 GB of local memory (512 GB in total) and share 1.5 TB of main memory. The total system has 81,920 CUDA cores and 10,240 Tensor cores, delivering 2 petaflops of tensor performance. The peak performance in double precision is 125 teraflops.
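
The all-to-all NVSwitch fabric can be verified from software. The sketch below is host-side C using the CUDA runtime API (compile with nvcc); it asks whether each GPU can access every other GPU as a peer, which should hold for all pairs on this system.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Query peer-to-peer accessibility between every GPU pair;
       with NVSwitch, every pair should be mutually accessible. */
    int main(void) {
        int n = 0;
        if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
            fprintf(stderr, "no CUDA devices visible\n");
            return 1;
        }
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                int ok = 0;
                cudaDeviceCanAccessPeer(&ok, i, j);
                printf("GPU %2d -> GPU %2d : %s\n",
                       i, j, ok ? "peer access" : "NO peer access");
            }
        }
        return 0;
    }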


NVIDIA A100 GPU

Recently, the eX3 infrastructure has added a dedicated 8-way GPU system (Supermicro 4124GO-NART) with NVIDIA A100 GPUs connected by NVLink. The system hosts two AMD Milan CPUs, see above. Each of these A100 GPUs has 80 GB of local memory (640 GB in total), and they share 2 TB of main memory and 30 TB of local NVMe storage. The total system has 55,296 CUDA cores and 3,456 Tensor cores.

In addition, there are two AMD Milan CPU nodes available, each accelerated with two A100 GPUs, see above. These GPUs have 40 GB of local memory each.


XILINX FPGA

Two of the recently installed AMD Rome CPU nodes are each accelerated by one Xilinx U280 and one Xilinx U250 FPGA.


GRAPHCORE IPU-POD64

The eX3 infrastructure includes an IPU-POD64, consisting of 64 Colossus Mk2 GC200 IPUs with 1,472 cores each, making up a total of more than 94,000 cores. The total In-Processor Memory™ is more than 57 GB, supported by a memory bandwidth of 47.5 TB/s per IPU. This IPU-POD is capable of 16 petaflops of mixed-precision peak performance. The IPU processor has been designed from the ground up for machine learning and AI workloads.


INTEL HABANA GAUDI HL-205

The eX3 infrastructure includes an 8-way system designed for AI/ML workloads (Supermicro SYS-420GH-TNGR), based on the Intel Habana Gaudi HL-205 accelerator. The system, which also hosts two Intel Ice Lake CPUs (see above), has 2 TB of main memory and 30 TB of local scratch storage. Each Gaudi accelerator has 32 GB of onboard local memory.


AKIDA NEURAL PROCESSOR

The KunPeng CPU nodes (see above) host four Akida Neural Processors from BrainChip. These processors are designed specifically for neuromorphic computing.


Interconnects

All eX3 nodes are connected through a 200 Gbps InfiniBand HDR network using Mellanox components (Quantum switches and ConnectX-6 VPI HCAs). In addition, some nodes are equipped with 32 GT/s Gen3 PCI Express networking provided by Dolphin Interconnect Solutions (MXS824 switches and Non-Transparent Bridging HCAs). All nodes are also connected to a 10/25/100 Gbps Ethernet network, depending on their interfaces, and are managed through a 1 Gbps management network.
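
A simple way to exercise the HDR fabric is a two-rank MPI ping-pong. The following is a minimal sketch (assuming an MPI library is available and the two ranks are placed on different nodes) that estimates per-direction bandwidth; for large messages this should approach the 25 GB/s (200 Gbps) link rate.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Two-rank ping-pong bandwidth probe; place the ranks on
       different nodes to measure the InfiniBand link. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        const int bytes = 64 * 1024 * 1024;   /* 64 MiB message */
        const int reps = 50;
        char *buf = malloc(bytes);            /* unchecked: sketch only */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* one-way bandwidth derived from round-trip time */
            printf("~%.1f GB/s per direction\n",
                   2.0 * bytes * reps / (t1 - t0) / 1e9);
        free(buf);
        MPI_Finalize();
        return 0;
    }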


Data storage

All nodes in the eX3 infrastructure have access to a NetApp E5760 Enterprise hybrid storage unit through the BeeGFS parallel file system. The total storage capacity is 500 TB, based on a combination of spinning disks and SSDs. Please note that eX3 does not offer long-term storage. This resource is meant only for the temporary storage of input data and computed results during the execution of experiments.
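
Because the scratch capacity is shared, it can be worth checking free space before staging large inputs; a minimal POSIX sketch in C is shown below. The mount point is a hypothetical placeholder, as the actual BeeGFS path is documented on the eX3 wiki.

    #include <stdio.h>
    #include <sys/statvfs.h>

    /* Report free space on the shared scratch file system
       before staging large datasets. */
    int main(void) {
        const char *mount = "/path/to/beegfs";  /* placeholder path */
        struct statvfs s;
        if (statvfs(mount, &s) != 0) {
            perror("statvfs");
            return 1;
        }
        double free_tb = (double)s.f_bavail * s.f_frsize / 1e12;
        printf("%s: %.1f TB available\n", mount, free_tb);
        return 0;
    }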