In our latest article we reviewed the best LLM models of 2024, all trained before NVIDIA's Blackwell GB200, which is expected to significantly boost computational power. In addition, clusters are growing tremendously in size. In this article we will go through the AI factories that rolled out these incredible models, or that are currently rolling out new ones, with the latest addition of xAI's new and much-talked-about AI factory. Finding information about these AI factories has been very difficult, as hyperscalers are very protective about the locations and capabilities of their data centers. Here we will uncover some of those locations, such as West Des Moines for Microsoft, New Albany for Google, and the likely home of Meta's AI Research SuperCluster.
Colossus: The Largest GPU Warehouse on the Planet
xAI's Colossus has emerged as a titan in the realm of AI supercomputers, at least in terms of ambition. Boasting a staggering 100,000 NVIDIA H100 GPUs, with even more ambitious plans to double that number to 200,000, it's easy to see why xAI, with enthusiastic endorsements from NVIDIA, declared it "the fastest supercomputer on the planet." Constructed in a remarkably short span of 122 days within a repurposed industrial building in Memphis, Tennessee, Colossus certainly seemed to embody xAI's commitment to rapid progress in AI development. However, all that glitters is not gold. In a move that defies typical data center practices, xAI seemingly prioritized speed over foresight, essentially buying a warehouse, filling it with GPUs, and declaring victory.
But as the saying goes, "the devil is in the details," such as how to power the servers! As it turns out, xAI wanted to deploy 100,000 GPUs rapidly, which requires a significant amount of electricity: running 100,000 GPUs takes approximately 100 MW of power, and only 8 MW was coming from the grid. Adding the required grid capacity would take at least a year, transformer deliveries being the main bottleneck.
Therefore xAI had to deploy 14 massive portable power generators, adding another 65.8 MW of capacity, just to keep a fraction of Colossus running. Even with this unconventional setup, they are still woefully short of power for their current needs: likely a third of the GPUs are currently on standby, with an impressive CAPEX amortization burn rate, as these devices become obsolete while new models are rolled out (first deliveries of the next-generation B200 are currently ongoing).
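To put rough numbers on that claim, here is a minimal back-of-the-envelope sketch in Python. The ~1 kW of all-in draw per H100 (GPU plus its share of server, network and cooling overhead) is an assumption, not a disclosed figure; the grid and generator values are the ones reported above.

```python
# Back-of-the-envelope check of the Colossus power gap.
GPUS         = 100_000
KW_PER_GPU   = 1.0        # assumed all-in draw per GPU (GPU + host + network share), kW
GRID_MW      = 8          # reported grid feed
GENERATOR_MW = 65.8       # 14 mobile generators, reported

demand_mw    = GPUS * KW_PER_GPU / 1000      # ~100 MW needed
available_mw = GRID_MW + GENERATOR_MW        # ~73.8 MW available
shortfall    = 1 - available_mw / demand_mw  # ~26% of demand unserved

print(f"demand ~{demand_mw:.0f} MW, available ~{available_mw:.1f} MW, "
      f"shortfall ~{shortfall:.0%}")
# Roughly consistent with a quarter to a third of the GPUs sitting idle.
```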
The idea was really original, but finding 66 MW of readily available generators to rent was no joke either, so even with Elon's creativity the 100,000 H100s in Colossus are partially on standby at the time of writing.
xAI has now received a power reservation of 150 MW, which will take time to become effective. This power reservation will come together with the deployment of an additional 100,000 H100s. Even then, the generators will need to keep running to meet a total demand north of 200 MW.
Another funny anecdote for people in the industry is that Colossus started without UPSs. Indeed, according to the video below, xAI discovered the hard way that data centers require uninterruptible power supplies (UPS): voltage fluctuations created by the gas generators and the grid were causing errors during training, forcing them to install a Tesla Megapack as a UPS battery system.
This "build first, power later" approach, with its lack of traditional UPS and power infrastructure typically seen in data centers, highlights a stark contrast to standard industry practices. It seems Elon Musk, in his characteristic fashion, has once again thrown out the rulebook, opting for speed and improvisation over meticulous planning, much like SpaceX's iterative approach to rocket development, embracing "rapid unscheduled explosive disassembly" as a learning opportunity. History has proven him right, being today SpaceX the leading space launch company in the world.
Whether this gamble will pay off this time as well remains to be seen; what is certain is that Musk has proven that Tier 3 is not required for training facilities, as discussed extensively here.
Despite a severe deficit in grey room and power module design, planning, and execution, the white room itself appears to be a technological marvel. It features a state-of-the-art cooling system that combines DLC and rear-door heat exchangers to maintain optimal operating temperatures, a high-speed network powered by NVIDIA BlueField-3 DPUs for rapid data transfer between GPUs and storage, and of course, the crown jewels: 100,000 NVIDIA H100 GPUs. Each rack consumes approximately 80 kW, which is a respectable density compared to other hyperscalers.
This powerful infrastructure is intended to train xAI's Grok, an AI chatbot with evolving capabilities, including recently acquired vision capabilities for analyzing and understanding images. With a recent valuation of $24 billion, xAI has garnered significant funding to support its ambitious projects.
From an environmental perspective, Colossus is currently not doing great either, with roughly 66 MW of gas generators running continuously, polluting the local air and drawing significant criticism of the project, as xAI installed the generators without the required permits. Once connected to the grid, Colossus will be more environmentally friendly, with a good portion of its electricity coming from nuclear. All in all, not up to Tesla's and Elon's green ambitions.
Elon Musk's entrance into the data center space will certainly give rise to unconventional stories, full of improvisation, but also ones that break the rules of the industry (such as building a data center and powering it with rental units).
The Mystery in the Midwest: Unmasking the Secret Home of ChatGPT in West Des Moines
For a while, the location where OpenAI trained its groundbreaking language models, including the famed ChatGPT, currently leading the LLM Rankings, was shrouded in secrecy. Rumors swirled, with whispers of massive server farms hidden in undisclosed locations, consuming vast amounts of energy and water. Then, like a bolt of lightning from the Iowa sky, the truth was revealed: West Des Moines.
Yes, that's right. The heartland of America, known for its cornfields and friendly folks, is also home to a critical component of the AI revolution. Microsoft, OpenAI's close partner and Azure provider, has built several data center campuses in West Des Moines, and it's here that the magic happens. Massive supercomputers, powered by cutting-edge Nvidia GPUs, crunch through mountains of data, tirelessly training the algorithms that fuel ChatGPT's impressive conversational abilities.
This revelation sparked a wave of curiosity and intrigue. Why West Des Moines? What makes this location so special for AI development? The answer lies in a combination of factors: abundant land, a skilled workforce, robust infrastructure, and a commitment to renewable energy.
However, Microsoft's claim of "matching 100% of the energy used by its Iowa datacenters with renewable energy" sounds a bit like greenwashing, as it is effectively achieved via RECs (renewable energy certificates) or PPAs (power purchase agreements), which allow the buyer to claim to be renewable even when the electricity actually consumed comes from burning coal, as long as the emissions are offset. Contractually, the holder of the green PPA or REC is officially green, but realistically the power consumed is that of the grid, or of the nearest power plant if directly connected. For West Des Moines, approximately 66% of consumption is renewable, based on the Iowa power mix.
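A minimal sketch of the difference between the two accounting methods, with placeholder numbers; only the ~66% Iowa grid mix comes from the figure above:

```python
# Market-based vs location-based renewable accounting (toy numbers).
consumption_mwh    = 1_000_000   # hypothetical annual consumption
recs_retired_mwh   = 1_000_000   # certificates/PPAs matched to consumption
grid_renewable_mix = 0.66        # approximate Iowa grid mix (from the text)

market_based   = min(recs_retired_mwh / consumption_mwh, 1.0)  # what the claim reports
location_based = grid_renewable_mix                            # what the grid physically delivers

print(f"claimed (market-based): {market_based:.0%}, "
      f"physical (location-based): {location_based:.0%}")
# claimed (market-based): 100%, physical (location-based): 66%
```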
The power mix, which by industry standards is actually very good, is not really what caught the public's attention in terms of sustainability. The intense computational demands of AI training require vast amounts of water for cooling, raising concerns about environmental impact and resource consumption. Reports indicate a significant increase in water usage at Microsoft's West Des Moines facility, prompting discussions about responsible AI development and the need for sustainable practices.
The West Des Moines facility remains covered in mystery, as no detailed source of information is available online. We will therefore try to assess its capabilities and organization using the power of satellites:
From what we can see from the satellite, West Des Moines consists of the following facilities:
Reception Building on the West side.
Grid Power Substation on the south.
12 Halls: each with 5 gen sets and adiabatic air-to-air cooling. Here the transformers are located outside.
4 small Halls: each with one diesel gen set per hall. Like the 12 large white halls, these small halls have a similar number of generators and transformers (visible outside). They show water-to-air dry coolers/evaporative towers.
2 Large Grey Halls: detectable from the fact that there are 4 evaporative coolers for every generator. They have a different cooling layout, but still use the same air-to-air evaporative cooling technology.
Overall, the West Des Moines facility has been built respecting all the traditional data center design criteria, relying on air cooling, with the only exception of the 4 small halls.
To estimate its capabilities we will count generators. Some assumptions will be needed about generator sizes and redundancy levels.
Generators are enclosed in 40-foot containers, meaning they are standard-sized units with a power rating of around 2.5-3.5 MW per container.
| Hall type | Gen sets count | Installed power MW (assuming 2.5 MW gen sets) | Installed power MW (assuming 3.5 MW gen sets) |
| --- | --- | --- | --- |
| White Halls | 60 | 150 | 210 |
| Small Halls | 4 | 10 | 14 |
| Grey Halls | 20 | 50 | 70 |
| Total | 84 | 210 | 294 |
This yields a total gen-set capacity in the 200-300 MW range. Gen sets are normally used for backup power, as data centers normally draw their power from the grid, with the notorious exception of Colossus.
Not all data center capacity may be backed by generators, and such capacity is normally redundant in some way.
From the analysis, the 12 white halls, each with its own generators, could be in a Tier 3 (N+1) configuration, whilst the grey halls may be in an HPC configuration where 20% of the capacity is Tier 4 and 80% is untiered. Playing around with the assumptions on the different redundancy levels still yields a power capacity in the order of 200 to 300 MW.
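A minimal sketch of the estimate, using the gen-set counts from the satellite survey above and the assumed 2.5-3.5 MW per 40-foot container:

```python
# Site-capacity estimate from the visible gen sets (counts from the table above).
GEN_SETS = {"white_halls": 60, "small_halls": 4, "grey_halls": 20}

for mw_per_set in (2.5, 3.5):
    total_mw = sum(GEN_SETS.values()) * mw_per_set
    print(f"{mw_per_set} MW per set -> ~{total_mw:.0f} MW installed")
# 2.5 MW per set -> ~210 MW installed
# 3.5 MW per set -> ~294 MW installed
# Redundancy assumptions (N+1, partial tiering) shift the usable IT capacity,
# but the order of magnitude stays in the 200-300 MW range.
```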
However, the computing power used to train GPT-4 appears to have been a lot lower, more in the order of 24 MW of IT load (considering evaporative cooling, we need to assume a PUE of at least 1.3, hence roughly 32 MW of total load). This estimate is based on rumors that GPT-4 was trained on a cluster of 25,000 A100 GPUs.
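The same back-of-the-envelope logic, with an assumed ~1 kW of all-in IT draw per A100 (GPU plus host, network and storage share); the per-GPU figure and the PUE are assumptions, not disclosed values:

```python
# Estimated load of the rumored 25,000 x A100 training cluster.
GPUS       = 25_000
KW_PER_GPU = 1.0     # assumed all-in IT draw per GPU, kW
PUE        = 1.3     # typical for evaporative air cooling

it_mw    = GPUS * KW_PER_GPU / 1000   # ~25 MW IT load
total_mw = it_mw * PUE                # ~32 MW at the facility meter

print(f"IT ~{it_mw:.0f} MW, facility ~{total_mw:.0f} MW")
```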
This explains the rumors about Sam Altman (OpenAI's CEO) being concerned about xAI's growing compute capabilities.
Recently Sam Altman pointed out that a lack of compute capacity is slowing down the GPT-5 release. We also know that Microsoft is adding capacity in West Des Moines with its latest project, as well as in other locations. Microsoft has very aggressive plans in both Arizona and Wisconsin: in Arizona, five campuses of more than 200 MW each are being built and rented from neo-cloud providers such as CoreWeave and Crusoe, while in Wisconsin four 300 MW halls are currently under construction (1,200 MW in total). Satellite imagery shows that most of the halls have already been built and that the project is on track for completion.
Meta AI Research SuperCluster: Another American Midland Mystery
Llama 3.1, the best-performing open-source LLM as discussed here, was trained by Meta. If you think that the location of the AI factory that produced it is also "open source," you would be mistaken. On the surface, we know that Llama was trained on the AI Research SuperCluster (RSC).
The RSC is not your average supercomputer. Like all the hyperscalers, Meta designed it to be the most powerful in the world in order to win the race for the best large language models (LLMs). It was originally used to train Llama 2 with 16,000 NVIDIA A100 cards and later to train Llama 3 with 48,000 H100 GPUs, for a total estimated power north of 48 MW. Mark Zuckerberg recently stated that Llama 4 is being trained on more than 100,000 GPUs, making it the largest supercluster known to date.
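Applying the same rough ~1 kW all-in figure per GPU (an assumption) to the three RSC generations mentioned above gives a feel for how quickly the power envelope grows:

```python
# Rough IT power per RSC generation (per-GPU draw is an assumption).
CLUSTERS = {
    "Llama 2 (16,000 A100)": 16_000,
    "Llama 3 (48,000 H100)": 48_000,
    "Llama 4 (100,000+ GPUs)": 100_000,
}
KW_PER_GPU = 1.0

for name, gpus in CLUSTERS.items():
    print(f"{name}: ~{gpus * KW_PER_GPU / 1000:.0f} MW IT")
# The Llama 3 cluster lands right at the "north of 48 MW" quoted above,
# and a 100,000+ GPU cluster pushes past 100 MW before counting PUE.
```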
Unlike Microsoft's West Des Moines, Meta's RSC uses liquid cooling, in particular DLC.
Whilst Meta has taken a unique approach by releasing its Llama models for free, as well as its server designs, nothing is known about Meta's RSC. The location is secret, as are the technical details beyond the rack and cooling solution.
Searching online for the RSC, no information about its size comes up, except Mark Zuckerberg's statement of more than 100,000 H100s, which lets us infer that the RSC is north of 100 MW of capacity. Based on the RSC announcement, it is clear that the RSC was built from a "clean slate" and that works took place around 2020.
Looking at Meta's environmental reports, Stanton Springs shows a rapid increase in electricity consumption, starting from zero in 2020 and reaching 968,565 MWh in 2023, which equates to an average draw of roughly 110 MW across 2023, making it a potential candidate.
The site showcases 50 diesel gen sets, which equate to approximately 125-150 MW of installed capacity. The timing, the installed power, and the consumption ramp-up profile make Stanton Springs a credible candidate to host Meta's prestigious AI RSC.
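A quick cross-check of the candidate, again in Python; the 2.5-3 MW per containerized gen set is an assumption, consistent with the earlier estimates:

```python
# Stanton Springs cross-check: reported annual energy vs visible backup capacity.
annual_mwh   = 968_565
hours_per_yr = 8_760
avg_draw_mw  = annual_mwh / hours_per_yr            # ~110 MW average in 2023

gen_sets = 50
backup_low, backup_high = gen_sets * 2.5, gen_sets * 3.0   # ~125-150 MW installed

print(f"average draw ~{avg_draw_mw:.0f} MW, "
      f"backup ~{backup_low:.0f}-{backup_high:.0f} MW")
# Both figures sit comfortably above the >100 MW implied by a 100,000 H100 cluster.
```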
Google, the AI Data Center Colossus
Google is a step ahead of the whole competition in terms of data center design. They are the only hyperscaler to have historically had liquid-cooled reference designs for their data centers. This advantage derives from the historical need for denser racks driven by recommendation systems (Google) and streaming services (YouTube), and from their commitment to sustainability.
Google's AI data centers have unique designs, starting from their in-house accelerators, called TPUs, through custom servers and custom racking solutions, as seen in the 360° video tour below.
This technological advantage led Google to develop power-intensive data centers early on, with very compact designs where gen sets are located next to each other and white rooms are really small with high-density compute racks, thanks to their established liquid cooling experience.
Below is the satellite picture of New Albany, obtained from this link, as Google Maps still conceals part of the infrastructure. An estimated 450 MW is on site.
Zooming in with Google Earth on the top-right white hall, it is possible to see the density of the generator layout. In the smallest hall alone there are 12 standard 40-foot generators, equating to 30-45 MW. Considering the number and size of the other halls, there are definitely more than 350 MW installed in New Albany.
Moving on to the question "Where was Gemini trained?", the answer is: in more than one location.
Google's introduction of Gemini marks its most significant endeavor to re-establish itself as a dominant force in artificial intelligence. This move comes after a couple of years in which OpenAI's GPT models have led the generative AI sector, dethroning Google from the historical leadership it held with DeepMind.
To develop the Gemini model family, Google undertook an extensive infrastructure expansion, aiming to prove that large-scale AI models can be trained without depending on NVIDIA's GPUs, using its in-house-built TPUs instead.
While Google has been reticent about the specifics of building its Gemini models, DCD has gathered information from multiple sources.
Training Across Multiple Data Centers
Gemini was trained across multiple data centers—a significant advancement from previous models like PaLM-2, which were confined to a single facility.
Google uses a technology called "multi-host" to distribute the training workload.
The reason is to ensure resiliency and redundancy across sites and to distribute workloads in order to achieve larger clusters.
Google later rolled out this multi-host technology to its Cloud customers, naming it Vertex AI.
To facilitate this distributed training, Google connects TPU SuperPods—each containing 4,096 chips—using its intra-cluster and inter-cluster network.
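To make the idea of one training job spanning many hosts concrete, here is a minimal, hypothetical sketch in Python using the JAX library (commonly used on TPUs). It is not Google's internal multi-host stack or Vertex AI; the shapes, axis names and batch contents are placeholders.

```python
# Minimal multi-host data-parallel sketch with JAX (illustrative only).
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# On a real multi-host TPU pod slice, every host runs this same script and
# jax.distributed.initialize() joins them into one job, after which
# jax.devices() lists the chips of *all* hosts, not just the local ones.
# jax.distributed.initialize()   # uncomment when actually running multi-host

devices = np.array(jax.devices())                 # global device view
mesh = Mesh(devices.reshape(-1), axis_names=("data",))
shard_data = NamedSharding(mesh, P("data"))       # split the batch across every chip

global_shape = (8 * devices.size, 1024)           # hypothetical global batch
batch = jax.make_array_from_callback(
    global_shape, shard_data,
    lambda idx: np.ones(global_shape, np.float32)[idx],  # each host builds only its shard
)

@jax.jit
def train_step(x):
    # Stand-in for a real step: per-chip compute followed by a cross-host
    # reduction, which is exactly where the inter-cluster network matters.
    return jnp.mean(x * x)

print(train_step(batch))
```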
Jupiter Networking Platform and Optical Circuit Switching (OCS)
At the core of this advanced infrastructure is Google's networking platform, Jupiter, which leverages an in-house optical switching technology known as Optical Circuit Switching (OCS). The OCS is used as a replacement for traditional data center spine architectures to meet the demanding requirements of large-scale AI training.
Challenges with Traditional Networking
In conventional data center networks, data transmission relies heavily on electronic packet switching. Signals frequently alternate between electrical and optical forms, necessitating multiple conversions:
Electrical to Optical Conversion: Data generated by servers is converted from electrical signals to optical signals for transmission over fiber-optic cables.
Optical to Electrical Conversion: Upon reaching a network switch, the optical signals are converted back to electrical signals for processing and routing.
Switching and Routing: Electronic switches route the data packets to their destination, a process that can introduce latency and consume significant power.
This continual conversion between electrical and optical domains introduces latency, increases energy consumption, and adds significant costs due to the need for high-speed electronic switches and transceivers.
How OCS Transforms Data Center Networking
Google's OCS technology addresses these challenges by maintaining data in the optical domain for as long as possible. Key features of OCS include:
Optical Cross-Connects: Instead of converting optical signals back to electrical form for switching, OCS uses optical components to redirect light signals directly. This is achieved through devices known as optical cross-connects.
MEMS Technology: Micro-Electro-Mechanical Systems (MEMS) mirrors are employed to steer beams of light from input fibers to output fibers. These tiny mirrors can adjust their angles to dynamically establish optical paths between any pair of ports.
Elimination of OEO Conversions: By avoiding optical-electrical-optical conversions, OCS reduces latency and power consumption significantly.
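As an illustration of why this is fast, an optical cross-connect can be thought of as nothing more than a reconfigurable permutation of input fibers to output fibers, with no per-packet processing at all. The toy model below is a sketch of that idea, not Google's hardware or API:

```python
# Toy model of a MEMS-based optical cross-connect as a port permutation.
class OpticalCrossConnect:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.mapping = {}                       # input port -> output port

    def connect(self, in_port: int, out_port: int) -> None:
        # In hardware this is a MEMS mirror tilting to steer a light beam;
        # here it is just an entry in a permutation table.
        if out_port in self.mapping.values():
            raise ValueError("output port already in use")
        self.mapping[in_port] = out_port

    def route(self, in_port: int) -> int:
        # Light entering in_port exits at the configured output port without
        # ever being converted back to an electrical signal.
        return self.mapping[in_port]

ocs = OpticalCrossConnect(num_ports=8)
ocs.connect(0, 5)
ocs.connect(1, 2)
print(ocs.route(0), ocs.route(1))   # 5 2
```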
Dynamic Network Reconfiguration
One of the standout features of OCS is its ability to dynamically reconfigure the network topology based on workload demands:
Rapid Topology Changes: For Gemini Ultra, the OCS was capable of reconfiguring 4x4x4 chip cubes into arbitrary 3D torus topologies within approximately 10 seconds. This rapid reconfiguration allows the network to adapt to the specific communication patterns required by different AI training phases.
Optimized Data Paths: By tailoring the network topology, OCS ensures that data takes the most efficient path, reducing congestion and improving overall performance.
The research paper notes: "We decided to retain a small number of cubes per SuperPod to allow for hot standbys and rolling maintenance." This approach enhances the network's resilience and availability.
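To illustrate what reconfiguring 4x4x4 chip cubes into arbitrary 3D torus topologies means in practice, the sketch below enumerates the wrap-around links of a torus whose nodes are the 64-chip cubes of a 4,096-chip SuperPod. The cube and SuperPod sizes come from the figures above; everything else is illustrative:

```python
# Enumerate the directed wrap-around links of a 3D torus of chip cubes.
from itertools import product

def torus_links(dims):
    """Return the set of (cube, neighbour) links of a torus with the given dims."""
    links = set()
    for coord in product(*(range(d) for d in dims)):
        for axis in range(len(dims)):
            neighbour = list(coord)
            neighbour[axis] = (neighbour[axis] + 1) % dims[axis]   # wrap around
            links.add((coord, tuple(neighbour)))
    return links

# 64 cubes of 4x4x4 chips (4,096 chips per SuperPod) arranged as a 4x4x4 torus
# of cubes; the OCS can realize a different torus in ~10 seconds by re-steering
# its mirrors rather than re-cabling anything.
print(len(torus_links((4, 4, 4))))   # 192 directed wrap-around links
```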
Benefits of OCS in AI Training
High Bandwidth and Low Latency: OCS provides direct optical paths between computing resources, which is crucial for the high-bandwidth, low-latency communication needed in distributed AI training.
Scalability: The optical network supports scaling up to connect thousands of TPU chips across multiple data centers, overcoming the limitations of traditional electronic switching networks.
Energy Efficiency: Optical switches consume less power than electronic switches, leading to reduced energy costs and lower heat generation, which is beneficial for data center cooling requirements.
Cost Reduction: By minimizing the reliance on expensive electronic switches and transceivers, OCS lowers the capital expenditure associated with building and upgrading data center networks.
Integration with Jupiter Networking Platform
The OCS is seamlessly integrated into Google's Jupiter networking platform, which orchestrates the overall network operations:
Hybrid Switching Fabric: Jupiter combines both packet switching (for general-purpose workloads) and circuit switching (via OCS for high-throughput workloads like AI training).
Software Control: Advanced software-defined networking (SDN) techniques are used to manage and optimize the network, dynamically allocating resources based on workload requirements.
Challenges and Solutions
Implementing OCS at scale presents its own set of challenges:
Complexity in Control Mechanisms: Managing a dynamic optical network requires sophisticated control systems to handle the rapid reconfiguration of optical paths without disrupting ongoing data transfers.
Physical Limitations: Optical components must be precisely engineered to maintain signal integrity over various distances and environmental conditions.
Fault Tolerance: Google's approach includes redundant pathways and hot standbys to mitigate the impact of hardware failures or maintenance activities.
OCS and Multi-Host Training
The combination of OCS and multi-host technology enables Google to distribute AI training workloads efficiently:
Distributed Computing Resources: Training tasks can be spread across different clusters and data centers without being bottlenecked by network limitations.
Resilience to Site-Specific Issues: If one site experiences issues like power outages, the network can reconfigure to maintain training operations elsewhere.
Evidence of Success: Other Market Leaders Chose Google to Train Their Models
Google's supremacy in AI data centers is proven by the fact that several other AI companies chose Google to train their models, like Midjourney and Anthropic (before the investment from AWS). And all this was achieved without NVIDIA chips, which are known to be the best-performing platform for AI training.
Conclusion: Forging the Future: Balancing Innovation and Responsibility in the AI Data Center Revolution
The race to develop the most powerful AI models has led to an unprecedented expansion in computational capabilities, pushing the boundaries of data center design, scale, and innovation. From Elon Musk's xAI and its ambitious yet unconventional Colossus supercomputer to Microsoft's secretive facilities in West Des Moines, Meta's enigmatic AI Research SuperCluster, and Google's cutting-edge multi-data center approach, the landscape of AI factories is as diverse as it is colossal.
Elon Musk's entry into the data center space with xAI demonstrates a willingness to challenge industry norms, prioritizing rapid deployment over meticulous planning. While this "build first, power later" approach has led to operational challenges and environmental concerns—such as reliance on massive gas generators and inadequate power infrastructure—it also highlights a disruptive mindset that could reshape industry practices. Musk's history with SpaceX suggests that such unconventional strategies can eventually lead to significant advancements.
Meanwhile, traditional hyperscalers like Microsoft, Meta, and Google are investing heavily in expanding their compute capacities. Microsoft's expansive data centers in West Des Moines and other locations reflect a strategic push to support models like GPT-5, despite environmental critiques over resource consumption and greenwashing claims. Meta's RSC, likely housed in facilities like Stanton Springs, underscores the company's commitment to open-source models like Llama while maintaining secrecy over its infrastructure. Google's development of Gemini across multiple data centers showcases its technological prowess, leveraging innovations like Optical Circuit Switching to overcome the limitations of traditional networking.
However, this relentless pursuit of computational power raises important questions about sustainability and environmental impact. The enormous energy consumption and resource demands of these AI factories necessitate a re-examination of renewable energy commitments and responsible practices. Claims of 100% renewable energy usage must be scrutinized beyond contractual agreements to assess their real-world environmental footprint. The use of vast amounts of water for cooling and the reliance on diesel generators in some facilities highlight the need for more sustainable solutions.
The secrecy surrounding these facilities also underscores the intensely competitive nature of AI development, where proprietary technologies and infrastructure are closely guarded. While this competition drives innovation, it can hinder collaborative efforts to address shared challenges, such as environmental sustainability and ethical considerations in AI deployment.
As the AI arms race intensifies, compute capacity has become a critical differentiator. The ability to train larger, more complex models hinges not just on advanced algorithms but also on the physical infrastructure that supports them. The convergence of AI development and data center innovation will continue to shape the future of technology, influencing everything from energy policies to global competitiveness in AI.
In this dynamic landscape, balancing rapid innovation with operational excellence and environmental responsibility is crucial. Whether through unconventional approaches that break industry norms or through meticulous planning that prioritizes sustainability, leaders in this space must navigate complex challenges. The ultimate success will not only be measured by technological breakthroughs but also by the ability to address broader implications for society and the planet. The future of AI depends on forging a path that harmonizes ambitious goals with responsible stewardship of resources.