LLNL Prepares to Deploy 5.4 petaFLOPS Cluster

Magma, a liquid-cooled supercomputer built by Penguin Computing and CoolIT Systems, uses Intel® Xeon® 9200 series processors.

CASE STUDY
Executive Summary
Lawrence Livermore National Laboratory (LLNL) is one of the Tri-Labs of the National Nuclear Security Administration, where High Performance Computing (HPC) clusters in the Commodity Technology Systems (CTS-1) program provide over 25 petaFLOPS of computing capacity across the three labs. For its latest HPC acquisition, LLNL turned to Penguin Computing to build a new system using the latest Intel HPC technologies and a unique liquid cooling solution from CoolIT Systems.

Challenge
The National Nuclear Security Administration’s (NNSA) core mission is the responsible stewardship of America’s nuclear stockpile through the application of unparalleled science, technology, engineering, and manufacturing. Computer simulations are critical to the work done by NNSA scientists. They use high performance computing (HPC) resources at three national laboratories collectively known as the Tri-Labs: Sandia National Laboratories (SNL), Los Alamos National Laboratory (LANL), and Lawrence Livermore National Laboratory (LLNL). In 2016, the Tri-Labs began acquiring a fleet of new HPC systems built by Penguin Computing under the Commodity Technology Systems program (CTS-1), which delivered more than 25 petaFLOPS of computing across the three institutions through 2019.

CTS-1 systems are used as the everyday workhorses for Tri-Lab scientists and engineers researching a range of problems in hydrodynamics, materials science, molecular dynamics, and particle transport. Some of the CTS-1 systems are also dedicated to institutional computing and collaborations with industry and academia. The NNSA needed additional capacity at LLNL dedicated to the simulation of 2D and 3D physical systems for parametric studies. The new system being built is called Magma.

Solution
Magma, with 760 compute and user nodes plus 12 infrastructure nodes, is a liquid-cooled supercomputer built by Penguin Computing using Intel® Server System S9200WK family chassis with Intel® Xeon® Platinum 9242 processors (compute nodes), Intel® Xeon® Platinum 8200 processors (management and file system access nodes), Intel® Omni-Path Architecture (Intel® OPA) fabric, and a unique liquid cooling system designed by CoolIT Systems. Delivered in the first quarter of 2020, it offers LLNL scientists 5.4 petaFLOPS of computational capacity (theoretical).

First Large-Scale Intel® S9200WK Server-Based Supercomputer in the U.S.
When NNSA needed to add compute cycles to their fleet of CTS-1 systems, the Intel Server System S9200WK family and Intel® Xeon® Platinum 9200 processors were in early launch.

“We’re continually tracking advancements in technologies and looking for capable and economic HPC solutions for LLNL scientists,” explained Dr. Matt Leininger, Senior Principal HPC Strategist at LLNL. “Our workloads—although they are intensive on the network—they are most intensive on memory bandwidth. So, we looked at the Intel solutions based on the Intel Xeon 9200 series processors.”

“One of the things we like about the Intel Xeon Platinum processors,” continued Leininger, “is that they have a tremendous amount of memory bandwidth per node, and therefore we can remove that bottleneck from our application and deliver both capable and economical cycles to our mission critical applications.”

The advanced processors offered LLNL’s HPC architects a new level of compute performance compared to their existing CTS-1 resources built on Intel® Xeon® E5-2600 v4 processors in 2016.¹

“We also required the system to be liquid cooled,” added Leininger. “Liquid cooling allows LLNL to utilize the higher performance processors in a high-density solution while also easing the air-cooling requirements within our data centers.”

With liquid-cooled Intel Xeon Platinum 9200 processors, Penguin was able to provide a competitive high-density system offering with outstanding performance per core.²

Innovative cooling technology offers high serviceability and provides very stable liquid cooling across the DIMMs.

Penguin Computing has deployed over 1,000 CTS-1 racks based on its Tundra systems, built on the Open Compute Project (OCP) architecture, which enables very high density in a DC-powered rack. The Penguin Computing Tundra solution included a DC-powered version of the Intel OPA switch and several options for both liquid and air cooling. The OCP design gave the Tri-Labs high capacity in a smaller space than standard rack designs.

“The increased memory bandwidth of the Intel Xeon Platinum 9242 processors was compelling, along with the quick availability allowing for a quick deployment, were the two major driving factors in selecting the configuration,” stated Ken Gudenrath, DOE Director at Penguin Computing. “To ensure a quick and complete cluster solution, we partnered with Intel using our Relion XE2142eAP 2U4N server in a standard EIA rack.”

Penguin Computing purchased fully integrated Intel Server System components, with liquid cooling support, and worked with Intel and CoolIT Systems to design the remaining liquid direct-chip cooling and Cooling Distribution Units (CDU) for the datacenter.

“The quick collaboration amongst all the stakeholders allowed for fast design, contract execution, delivery, and ultimate acceptance of Magma,” added Gudenrath. “Our final goal was achieved when we completed several initial high-performance Linpack runs and submitted these for qualifying on the November 2019 Top500 list.”

Magma comprises 760 dual-socket compute nodes built on Intel S9200WK servers with Intel® Xeon® Platinum 9242 processors (total of 72,960 cores). Twelve more nodes provide management and file system access, using 2nd Generation Intel Xeon Gold Scalable processors. Like other CTS-1 systems, the fabric is based on Intel OPA. To meet the requirements of the higher-performance nodes, LLNL doubled the on-node network performance by adding a second Intel OPA host adapter for each node.
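The core count quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the node and socket counts come from the text, while the figure of 48 cores per Intel Xeon Platinum 9242 processor is an assumption taken from the processor's published specification, not from this case study.

```python
# Figures stated in the case study.
COMPUTE_NODES = 760
SOCKETS_PER_NODE = 2            # "dual-socket compute nodes"
STATED_TOTAL_CORES = 72_960

# Assumption (not stated in the text): 48 cores per Xeon Platinum 9242.
CORES_PER_SOCKET = 48


def total_cores(nodes: int, sockets_per_node: int, cores_per_socket: int) -> int:
    """Return the total number of cores across all compute nodes."""
    return nodes * sockets_per_node * cores_per_socket


computed = total_cores(COMPUTE_NODES, SOCKETS_PER_NODE, CORES_PER_SOCKET)
print(f"computed cores: {computed:,}")
print("matches stated total:", computed == STATED_TOTAL_CORES)
```

Under that assumption, each compute node carries 96 cores, and the product matches the stated total.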

Early performance runs of Magma ranked #69 on the November 2019 Top500 list with 3.4 petaFLOPS from 650 compute nodes.¹ The system’s theoretical peak is 5.4 petaFLOPS using all 760 compute nodes.
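As a rough sanity check, a theoretical peak is commonly estimated as cores × clock rate × floating-point operations per cycle. The sketch below applies that formula under assumptions that do not appear in the case study: 96 cores per node, a 2.3 GHz base clock, and 32 double-precision FLOPs per cycle per core (two AVX-512 fused multiply-add units). The function name and defaults are hypothetical.

```python
def peak_petaflops(
    nodes: int,
    cores_per_node: int = 96,      # assumption: 2 sockets x 48 cores
    base_ghz: float = 2.3,         # assumption: Xeon Platinum 9242 base clock
    flops_per_cycle: int = 32,     # assumption: 2 AVX-512 FMA units, double precision
) -> float:
    """Estimate theoretical peak in petaFLOPS from node count and per-core rates."""
    gigaflops = nodes * cores_per_node * base_ghz * flops_per_cycle
    return gigaflops / 1e6  # 1 petaFLOPS = 1e6 gigaFLOPS


full_system = peak_petaflops(760)   # all compute nodes
top500_run = peak_petaflops(650)    # nodes used for the Top500 submission

# Linpack efficiency: measured result (from the text) over estimated peak.
MEASURED_PETAFLOPS = 3.4
efficiency = MEASURED_PETAFLOPS / top500_run

print(f"estimated peak, 760 nodes: {full_system:.2f} PFLOPS")
print(f"estimated peak, 650 nodes: {top500_run:.2f} PFLOPS")
print(f"implied Linpack efficiency: {efficiency:.0%}")
```

With these assumed values, the 760-node estimate rounds to the 5.4 petaFLOPS quoted above. The implied efficiency of the 650-node run is only as reliable as the assumed clock rate and FLOPs per cycle.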

Innovative Cooling Enables Maximum Performance
HPC architects have integrated liquid cooling into systems for several years. Numerous CTS-1 purchases included direct-to-chip liquid cooling. According to Leininger, LLNL’s experience with liquid cooling shows that it makes a significant difference with modern processors. With earlier CPU designs, air cooling offered adequate thermal protection to run the systems at full performance. Today’s processors require liquid cooling to reach their maximum compute capability.

Direct-to-chip liquid cooling in a system the size of an HPC cluster has always introduced a level of complexity into the design. Since direct-to-chip cooling brings coolant right to the component(s), the cooling solution adds difficulty—and thus service cost—to replacing parts, sometimes bringing an entire server down to move the cooling structure out of the way, replace the component, and bring it back into service. System leaks or failures amplify the difficulties and costs.

CoolIT considers itself a trusted solution provider to the OEMs, working closely with them to deliver cooling solutions that optimize both performance and serviceability.

“We’re used to designing innovative, custom cooling systems for large clusters,” commented Jason Zeiler from CoolIT Systems. “Our own CDU interface between the facility liquid, subfloor piping, and the secondary side technology in the rack. Our offering to the market is as a technology leader, integration collaborator, and solution provider.”

According to Zeiler, memory failure rates in large systems are high across the industry, so easily serviceable memory cooling is a high priority going forward in HPC clusters. According to Leininger, DIMMs are the single most-replaced component in the CTS-1 clusters. So, Magma uniquely brings liquid cooling right to memory.

“Our design allows for high-density memory heat capture,” explains Zeiler. “It provides very stable liquid cooling across the DIMMs. A key objective for our design was to provide both a cost-effective, high heat capture solution for memory while also maintaining very high serviceability. Our design allows for a high number of insertion cycles per DIMM, allowing them to be removed and replaced without any significant impact to the liquid cooling design.”

Given the size of Magma, and especially the number of DIMMs per server, adding traditional liquid cooling directly to the DIMMs had the potential to significantly magnify service complexity. However, CoolIT used innovative blind-mate, dry-break quick disconnect connectors to mate the component piping to the server board in each server and between the server and coolant manifold in the back of the rack.

Blind-mate, dry-break connectors automatically mate with a chassis manifold at the component level, so no one has to connect or disconnect the plumbing by hand. Unplugging a server automatically unplugs the coolant lines without leaking.

“The Intel server design is very user friendly for liquid cooling with blind-mate connectors,” explained Zeiler. “Liquid cooling often adds an element of mild complexity with manual-mate style quick disconnects. But a blind mate design automatically engages with the system, so users are not really interacting with the connector at all. It’s just as simple as electronic blind mate connections.”

“Our admins really like the serviceability around the memory and that it’s being liquid cooled as well,” added Leininger. “LLNL was worried that the liquid cooling serviceability would be complex and potentially messy. However, CoolIT designed a clean and non-invasive solution. So that was one thing that definitely impressed our folks as we were making decisions.”

Result
Magma was deployed in the first quarter of 2020. Of the 11 CTS-1 systems on the latest Top500 list, Magma holds the highest ranking at 69. Magma has 35 percent fewer cores than Jade, a CTS-1 system at LLNL, but it delivers more than 1.2X higher Rmax.¹ Jade is built on Intel® Xeon® E5-2695 v4 processors, so the comparison illustrates the performance benefit of the latest generation of Intel Xeon processors.
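The two ratios in the Jade comparison can be combined into an approximate per-core figure. The sketch below uses only the numbers quoted above (35 percent fewer cores, more than 1.2X Rmax). Because the Rmax ratio is stated as a lower bound, the result is a lower bound too. The function name is hypothetical.

```python
def per_core_rmax_ratio(rmax_ratio: float, fewer_cores_fraction: float) -> float:
    """Rmax per core of the newer system relative to the older one.

    rmax_ratio: newer Rmax divided by older Rmax (1.2 in the text).
    fewer_cores_fraction: fraction by which the newer core count is lower (0.35).
    """
    core_ratio = 1.0 - fewer_cores_fraction  # newer cores / older cores
    return rmax_ratio / core_ratio


ratio = per_core_rmax_ratio(rmax_ratio=1.2, fewer_cores_fraction=0.35)
print(f"Magma delivers at least {ratio:.2f}x the Rmax per core of Jade")
```

This treats "1.2X higher" as a ratio of 1.2 between the two Rmax values, which is an interpretation of the wording rather than a figure given in the text.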

Magma will deliver an additional 5.4 petaFLOPS of theoretical peak performance to NNSA resources, offering more than 25 petaFLOPS total computing capacity to the Tri-Labs for Stockpile Stewardship and scientific discovery.³

Solution Summary
Needing more computing capacity for its national security mission, NNSA funded the building of Magma, a theoretical 5.4 petaFLOPS supercomputer built on the latest Intel® Server System S9200WK family using Intel Xeon Platinum 9242 processors and Intel OPA fabric. Liquid cooling across the chassis includes direct cooling to the memory with a blind-mate, dry-break system that simplifies serviceability. Working with Penguin Computing, which designed Magma, CoolIT provided the unique and innovative cooling solution.

Solution Ingredients

  • Intel Xeon Platinum Processor 9200 series
  • 772 nodes (760 compute nodes with 72,960 cores)
  • Built on Intel Server System S9200WK product family
  • Built by Penguin Computing with CoolIT advanced liquid cooling

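Notices and Disclaimers

Intel® technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Actual performance may vary depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at www.intel.cn. // Software and workloads used in performance tests have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating the product you are considering purchasing, including the performance of that product when combined with other products. For more complete information, visit www.intel.cn/benchmarks. // Performance results are based on testing as of the dates shown in the configurations and may not reflect all publicly available security updates. See the configuration disclosure for details. No product or component can be absolutely secure. // Cost reduction scenarios described are intended only as examples of how a given Intel®-based product, in specific circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. // Intel does not control or audit third-party benchmark data or the websites referenced in this document. You should visit the referenced websites and confirm that the referenced data are accurate. // In some test cases, results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided for informational purposes only. Any differences in your system hardware, software, or configuration may affect your actual performance.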

Product and Performance Information

1 Data provided by Lawrence Livermore National Laboratory
2 https://www.enterpriseai.news/2019/11/18/ai-ready-2nd-generation-intel-xeon-platinum-9200-processors-demonstrate-leadership-performance/ For details, visit http://www.intel.cn/2019xeonconfigs/ (Intel Xeon Scalable processors – claim #31). For additional detail visit https://www.intel.cn/content/www/cn/zh/high-performance-computing/performance-for-hpc-platforms.html