In today’s world, the rapid advancement of technology has given rise to supercomputers: machines that are vastly more powerful than a typical desktop computer. These technological giants play a pivotal role in a wide range of fields, from scientific research and AI development to climate modeling and national security. Yet despite their importance, few people know what a supercomputer actually is. AI data centers have inherited the design philosophy of supercomputers (we cover this in a dedicated guide), so a basic understanding of supercomputers, and in particular of how they differ from regular computing systems, is essential in today’s AI data center revolution. Let’s dive in.
What is a Supercomputer?
A supercomputer is a high-performance computing (HPC) system that can process vast amounts of data and perform complex calculations at incredibly high speeds. These machines are capable of executing quadrillions (thousands of trillions) of operations per second, far surpassing the capabilities of standard computers. Supercomputers are designed to tackle large-scale computations, simulations, and data processing tasks that would be impossible or take far too long on traditional systems.
How Does a Supercomputer Work?
At its core, a supercomputer works similarly to a regular computer, but on a massive scale. The main difference is that supercomputers rely on thousands or even millions of processors working in parallel rather than on a single processor. These processors divide massive computational tasks into smaller pieces and solve them simultaneously. This parallel processing is what enables supercomputers to achieve their staggering performance levels.
A typical supercomputer is housed in a large facility and consists of many interconnected processing units, often referred to as nodes, each tasked with handling a fraction of the overall workload. This interconnected network allows the machine to distribute computational tasks efficiently and process them concurrently.
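As a toy illustration of this divide-and-conquer approach, the sketch below splits one large summation across several worker processes with Python’s multiprocessing module. The worker count and chunking are illustrative stand-ins for how a supercomputer spreads a job across its nodes, nothing more.

```python
# Toy sketch: splitting one large task into pieces solved in parallel,
# loosely analogous to how a supercomputer distributes work across nodes.
from multiprocessing import Pool


def partial_sum(bounds):
    """Sum the squares over a half-open range [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


if __name__ == "__main__":
    n_workers = 8                       # stand-in for nodes/processors
    n = 10_000_000
    step = n // n_workers
    chunks = [(i * step, (i + 1) * step) for i in range(n_workers)]
    chunks[-1] = (chunks[-1][0], n)     # make sure the last chunk reaches n

    with Pool(n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))   # pieces solved concurrently

    print(total)
```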
Cooling systems, such as liquid cooling or advanced techniques like Direct-to-Chip Cooling, are essential to maintain performance, as the system generates significant heat due to its extreme power consumption.
Key Metrics of a Supercomputer
Supercomputers are often measured by the following metrics:
FLOPS (Floating Point Operations Per Second): FLOPS is the standard unit of measurement for supercomputer performance. Modern supercomputers can reach speeds in the range of petaflops (quadrillions of FLOPS) and exaflops (quintillions of FLOPS). It’s worth noting that FLOPS can be expressed in FP8, FP16, FP32, or FP64, where FP stands for floating point and the number indicates how many bits the floating-point format uses. The width of the format affects both the accuracy of the computation and the computational load and time: longer floats mean greater accuracy, but also a higher computational load and longer run times (see the short sketch after this list). In supercomputing, FP64 is considered the benchmark unless specified otherwise.
Parallel Processing: This refers to the ability of the supercomputer to break down and process multiple tasks simultaneously across its many cores and processors.
Energy Efficiency: Given their immense power requirements, energy-efficient cooling and operation are critical factors in designing supercomputers. In fact, many Green Datacenters focus on improving the energy efficiency of supercomputing systems, harnessing renewable energy sources like hydropower and wind.
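To make the precision trade-off concrete, here is the minimal sketch referenced in the FLOPS item above, comparing FP16, FP32, and FP64 in NumPy (FP8 is omitted because base NumPy does not provide it):

```python
# Minimal sketch of how floating-point width affects precision (NumPy dtypes).
import numpy as np

# Machine epsilon: the smallest relative step each format can resolve.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{np.dtype(dtype).name}: {info.bits} bits, eps = {info.eps}")

# The same accumulation in each format: summing 100,000 copies of 0.1
# (exact answer: 10,000). FP16 lands noticeably away from 10,000,
# while FP32 and FP64 stay very close.
values = [0.1] * 100_000
for dtype in (np.float16, np.float32, np.float64):
    arr = np.array(values, dtype=dtype)
    print(f"sum in {np.dtype(dtype).name}: {arr.sum(dtype=dtype)}")
```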
The firepower of a supercomputer is essentially its FLOPS, which is largely the number of GPUs operating in parallel multiplied by the FLOPS of a single GPU. There are limits on how many GPUs can be installed per system, primarily because of the latency between them, so the GPU count per system is most frequently on the order of 10-15 thousand. Hence, most of the competition lies in increasing the FLOPS per GPU and improving the network communication among GPUs so that more of them can work as one.
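As a back-of-the-envelope illustration of that product (the per-GPU figure below is a hypothetical placeholder, not a quoted vendor specification):

```python
# Back-of-the-envelope aggregate performance: per-GPU peak FLOPS x number of GPUs.
per_gpu_fp64_tflops = 50        # hypothetical FP64 peak per GPU, in teraflops
num_gpus = 10_000               # cluster scale in line with the text above

aggregate_pflops = per_gpu_fp64_tflops * num_gpus / 1_000   # 1 petaflop = 1,000 teraflops
print(f"Theoretical peak: {aggregate_pflops:,.0f} petaflops")   # -> 500 petaflops
```

Note that real systems deliver less than this theoretical peak; the TOP500 figures discussed later in this post are measured results rather than peak values.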
Applications of Supercomputers
Supercomputers have applications across a wide range of industries, driving innovation and solving some of the world’s most complex problems. Here are some of the key areas where supercomputers are used:
Scientific Research: Supercomputers are invaluable in fields like physics, chemistry, and biology, where they enable detailed simulations, such as modeling the behavior of particles in quantum mechanics or simulating complex biological systems at the molecular level.
Climate Modeling: Climate scientists rely on supercomputers to model the Earth’s climate, simulating various scenarios of weather patterns and climate change over time. These models are essential for predicting future climate trends and helping to inform environmental policy decisions.
Artificial Intelligence: Supercomputers play a crucial role in training large language models (LLMs). The enormous processing power of these machines is necessary to handle the vast datasets required for AI model training, making supercomputers an integral part of AI development.
National Security: Many governments use supercomputers for tasks related to cryptography, cybersecurity, and defense simulations. For example, they are used to simulate nuclear tests, ensuring the safety and reliability of nuclear stockpiles without the need for actual physical testing.
Medical Research: In healthcare, supercomputers assist in drug discovery, genomic research, and even personalized medicine. By analyzing massive datasets, they can identify patterns that may lead to breakthroughs in understanding diseases and developing treatments.
Supercomputers and Sustainability
As the demand for more computational power grows, so does the need for sustainable supercomputing infrastructure. One of the biggest challenges is the immense energy consumption required to power and cool these machines. This is where companies like Terakraft, with their focus on Green Datacenter solutions, come into play.
Terakraft’s datacenter infrastructure is naturally suited to host GPU workloads and HPC systems like supercomputers, with an emphasis on sustainability. By utilizing 100% renewable energy, such as hydropower and advanced cooling systems like lake water cooling, Terakraft minimizes the environmental impact of high-performance computing. This makes it possible for organizations to harness the power of supercomputers while remaining committed to eco-friendly practices.
Already, most prominent supercomputers show a commitment to the environment by using renewable energy, using lake water for free cooling, or recovering excess heat for district heating.
The Future of Supercomputing
The future of supercomputing lies in exascale computing: systems capable of performing at least one exaflop, or a quintillion (10^18) FP64 calculations per second. Exascale systems will open up new frontiers in fields like AI, scientific simulations, and even climate modeling, solving problems that are currently beyond the reach of today’s most powerful systems.
Moreover, as technology advances, the focus will increasingly shift toward developing more energy-efficient and sustainable supercomputing systems. An example of such a technological leap forward is the Blackwell GPU and architecture, but that will be covered in a dedicated post.
The Race to Supercomputing: How Europe is Leading in Performance and Sustainability
As the global race for supercomputing dominance intensifies, regions around the world are pushing the limits of computational power and efficiency. The TOP500 and Green500 lists track this race, showcasing the fastest and most energy-efficient supercomputers globally. There is speculation that some supercomputers are intentionally kept off these lists so as not to showcase the full potential of certain players, whether governmental or enterprise. Nevertheless, the two lists are widely used to track supercomputer capabilities across the globe.
Europe has emerged as a key player, not only in terms of raw performance but also in sustainability, with several of the continent’s systems standing out in both rankings and creating a competition within the continent itself: its three main supercomputers rank 5th, 6th, and 7th respectively in the TOP500 of June 2024.
LUMI in Finland leads the European race at the 5th position with 380 petaflops, thanks to its 10,240 AMD Radeon Instinct MI250X GPUs. With its 100% renewable power supply (7.1 MW) and advanced liquid cooling system, it is also a regular feature in the Green500, making it one of the most sustainable high-performance systems on the planet. Similarly, Leonardo in Italy, a powerhouse for AI and scientific simulations known for the training of the famous Mistral AI models and for serving the European Centre for Medium-Range Weather Forecasts (ECMWF), demonstrates Europe’s strength in blending computational excellence with liquid cooling technology. With 13,824 NVIDIA A100 GPUs, Leonardo has more GPUs than LUMI and a higher power consumption (7.5 MW), but significantly lower computational power (241 petaflops rather than 380), ranking 7th in the TOP500 and showing the difference that installing the latest GPUs makes.
Finally, the latest European addition is the Alps supercomputer, housed at the Swiss National Supercomputing Centre (CSCS) in Lugano. Officially inaugurated in September 2024, Alps is one of the world’s most powerful supercomputers, ranking 6th in the TOP500 (right in between Leonardo and LUMI).
Built using HPE’s Cray EX architecture, Alps integrates 10,752 NVIDIA Grace Hopper superchips, combining the power of both CPUs and GPUs to reach performance levels of up to 270 petaflops while consuming only 5.2 MW, surpassing Leonardo despite installing fewer GPUs and drawing less power.
In terms of efficiency, it’s interesting to note that LUMI, despite not being the latest supercomputer of the three, ranks as the most efficient supercomputer in Europe:
LUMI: 379.70 Pflops / 7.107 MW = 53.43 Pflops/MW
Alps: 270.00 Pflops / 5.194 MW = 51.98 Pflops/MW
Leonardo: 241.20 Pflops / 7.494 MW = 32.19 Pflops/MW
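The same ratios can be recomputed in a few lines, using the figures from the comparison above:

```python
# Energy efficiency of the three European systems, from the figures quoted above.
systems = {
    "LUMI":     (379.70, 7.107),   # (petaflops, power draw in MW)
    "Alps":     (270.00, 5.194),
    "Leonardo": (241.20, 7.494),
}

for name, (pflops, mw) in systems.items():
    print(f"{name}: {pflops / mw:.2f} Pflops/MW")
```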
These supercomputers are part of Europe’s broader strategy to lead in both performance and environmentally conscious computing, ensuring that the continent remains at the forefront of the supercomputing race while also addressing critical sustainability challenges.
What Does a Supercomputer Architecture Look Like?
Most supercomputers share similar design and architectural traits. They typically require a dense power configuration to minimize distances and reduce latency between GPUs.
This often necessitates liquid cooling solutions, with Direct Liquid Cooling (DLC) commonly employed. In terms of layout, a two-floor design is typical, with liquid cooling pipes housed on the lower floor. This level frequently also accommodates the grey room, which contains various support infrastructure.
The upper floor is reserved for the white room, where the supercomputers are located. In the case of the Leonardo system, power is supplied to the racks from above. However, much of the infrastructure's complexity is concealed, allowing for aesthetically appealing images like those featured in this blog post, where the racks—often decorated—take center stage.
Positioning the supercomputer above the water cooling system is a prudent choice for risk mitigation, but it comes with significant structural challenges. Each rack weighs 1.4 tons, concentrated in less than a square meter of space.
The video below demonstrates how the pillars supporting the white room required reinforced concrete footings to distribute the immense weight. These pillars, placed in close proximity, bear a substantial load, necessitating robust structural support.
Most of the electrical power is consumed by the GPU racks, which are often not backed up by batteries or diesel generators. In contrast, data racks are always supported by redundant power systems. This is because, unlike cloud solutions, HPC systems use checkpointing as a strategy to manage power interruptions. Typically, only 5-20% of a supercomputer's power operates under Tier 3/Tier 4 redundancy, while the rest remains untiered.
In the event of a power failure, the maximum loss would be around 30 minutes of computation, back to the last save point.
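The idea can be sketched in a few lines; the file name, the interval, and the placeholder workload below are illustrative, and real HPC jobs usually rely on the application's or scheduler's own checkpoint/restart facilities:

```python
# Minimal sketch of periodic checkpointing, so that a power loss costs at most
# one checkpoint interval of work. All names and values here are illustrative.
import os
import pickle
import time

CHECKPOINT_FILE = "state.ckpt"
CHECKPOINT_INTERVAL_S = 30 * 60        # save every 30 minutes


def load_state():
    """Resume from the last checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "result": 0.0}


def save_state(state):
    """Write atomically: dump to a temporary file, then rename over the old checkpoint."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)


state = load_state()
last_save = time.monotonic()

while state["step"] < 1_000_000:
    state["result"] += state["step"] ** 0.5     # placeholder for the real computation
    state["step"] += 1
    if time.monotonic() - last_save >= CHECKPOINT_INTERVAL_S:
        save_state(state)
        last_save = time.monotonic()

save_state(state)                               # final save on normal completion
```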
A common misconception is that GPU racks, even when untiered, lack redundancies. In fact, GPUs must function as a cohesive unit, and the failure of a single GPU significantly impacts cluster performance. For this reason, GPU racks always have at least dual redundant power supplies or N+2 configurations.
Where supercomputer designs tend to differ most significantly is in their heat-rejection strategies. These systems generate substantial heat and require water cooling, but the method of cooling the water afterward varies. Typically, dry coolers are used when weather permits, and electric chillers during warmer months. This creates challenges and trade-offs, such as determining pipe sizing versus water temperature, as well as managing the capital (CAPEX) and operational (OPEX) costs of chillers, which can account for up to 40% of electricity expenses.
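A hypothetical sketch of that seasonal trade-off is shown below; the setpoint, approach temperature, and thresholds are illustrative assumptions, not figures from any of the facilities discussed here:

```python
# Hypothetical free-cooling vs. chiller decision, driven by outdoor temperature.
# Setpoint and approach values are illustrative assumptions only.
def select_cooling_mode(outdoor_temp_c: float,
                        supply_setpoint_c: float = 32.0,
                        approach_c: float = 6.0) -> str:
    """Use dry coolers whenever outdoor air can bring the water to its setpoint."""
    if outdoor_temp_c + approach_c <= supply_setpoint_c:
        return "dry cooler (free cooling)"
    return "electric chiller"


for temp in (5, 18, 28, 35):
    print(f"{temp:>2} degC outdoor -> {select_cooling_mode(temp)}")
```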
Alps stands out as the only supercomputer that uses lake water for cooling. This resource was considered so valuable that, after decades of air-cooling tradition, massive pipes were laid across nearly 3 kilometers of the city to bring lake water to the data center—despite significant disruptions to Lugano’s traffic during construction. The results, however, were exceptional: Lugano’s data center (CSCS) now boasts a PUE of less than 1.2, despite its relatively southern location.
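For reference, PUE (Power Usage Effectiveness) is simply the total facility power divided by the power delivered to the IT equipment; the figures below are illustrative, not CSCS measurements:

```python
# PUE = total facility power / IT equipment power.
# Illustrative figures only; a PUE below 1.2 means cooling and other overhead
# add less than 20% on top of the IT load.
it_power_mw = 5.0        # hypothetical power drawn by the computing equipment
overhead_mw = 0.9        # hypothetical cooling, power distribution and other overhead

pue = (it_power_mw + overhead_mw) / it_power_mw
print(f"PUE = {pue:.2f}")   # -> 1.18
```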
Another notable example is LUMI, where heat is reused as district heating for nine months of the year, providing significant benefits to the local community and cost savings for the operations team at LUMI.
Conclusion
Supercomputers represent the pinnacle of modern computing power, enabling groundbreaking advancements in science, technology, and industry. As these machines continue to evolve, they will play an even greater role in driving innovation and solving some of humanity’s biggest challenges.
Beyond that, they represent the knowledge base for AI data centers and their design, particularly when training is involved. We have created a dedicated guide on this topic that builds on the supercomputer concepts explained here.
At the same time, the focus on sustainable computing solutions, like Terakraft’s green datacenter approach, ensures that we can harness this power responsibly.