
Author: ServerDirect
June 1, 2026
Public attention around AI is almost entirely focused on models, chatbots, and applications. What is rarely discussed is the physical reality that makes all of this software possible. Large AI models are trained on GPU clusters that deliver orders of magnitude more computational power than a standard server. Large-scale inference can demand systems capable of handling hundreds of thousands of requests per second with latency below ten milliseconds. AI data pipelines can process hundreds of terabytes of data every day. Without the right infrastructure, AI does not exist; it is merely code with nowhere to run.
For organizations making serious investments in AI, infrastructure is therefore not an optional consideration. It is the first question that should be asked before a single euro is invested in software, licenses, or data engineering talent. Choosing the wrong hardware, architecture, or vendor relationship can have consequences that persist for years.
This is not an exaggeration. A GPU cluster purchased too early or too late, a system lacking sufficient memory bandwidth for the intended model size, or a network architecture that limits inter-node communication are mistakes that often become visible only once production is already running. By then, redesigning the environment can be extremely costly.
Most AI initiatives begin as experiments: a team of data scientists with access to a cloud platform, a GPU instance, and a dataset. At this stage, infrastructure requirements are relatively manageable. Workloads are intermittent, costs are variable, and the consequences of suboptimal performance are limited.
Once a model proves valuable and the organization decides to deploy it in production, the requirements change fundamentally.
The focus is no longer on achieving maximum performance for a single training run. Instead, availability, predictability, security, and long-term cost management become critical. Cloud costs can rise sharply when workloads become permanent. Latency issues emerge when inference is performed on shared cloud infrastructure. Compliance concerns appear as soon as sensitive data is involved. Governance challenges arise around who has access to the systems running the models.
For many organizations, carefully designed and scalable on-premises infrastructure becomes the most rational choice. For European organizations operating under GDPR, it is often the most compliant choice as well.
One of the most common mistakes in AI infrastructure design is treating training and inference as variations of the same problem. They are fundamentally different workloads with very different hardware requirements. An infrastructure optimized for one will inevitably be suboptimal for the other.
Training large models is a compute-intensive, communication-heavy workload. It requires GPUs with extremely high memory bandwidth, typically HBM (high-bandwidth memory) delivering more than two terabytes per second per GPU, as well as high-speed interconnects between GPUs and nodes.
Training modern large language models containing hundreds of billions of parameters can take months on thousands of GPUs. Infrastructure efficiency directly impacts both development costs and time-to-market.
Inference presents a different challenge. The focus shifts toward low latency and high throughput. In many scenarios, smaller and more energy-efficient GPUs—or specialized inference accelerators—are more effective than training-focused GPUs.
Selecting the right inference hardware depends on model size, numerical precision, and throughput requirements.
In many AI infrastructure designs, storage is treated as a component that simply needs to be "good enough." In reality, storage is often the first and most significant bottleneck affecting actual training performance.
A GPU cluster waiting for data is not training—it is waiting. If storage I/O bandwidth cannot continuously feed data to the GPUs, the most expensive component in the infrastructure remains underutilized.
For typical AI training workloads, storage must deliver data at the rate at which the GPUs consume it. That rate is far lower than GPU memory bandwidth, which is measured in terabytes per second, but across a large cluster the aggregate read throughput is still high enough to require parallel storage built on NVMe drives, high-speed networking between storage and compute nodes, and a parallel file system capable of serving hundreds of nodes simultaneously.
Lustre and BeeGFS are among the most widely deployed file systems in HPC and AI environments.
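The throughput requirement described above can be estimated from the rate at which the GPUs consume training samples. The sketch below is a minimal illustration, not a sizing tool: the function names are hypothetical, and the example figures (64 GPUs, 2,000 samples per second per GPU, 150 KB per sample) are placeholder assumptions rather than measurements.

```python
def required_read_gb_per_s(num_gpus: int,
                           samples_per_sec_per_gpu: float,
                           bytes_per_sample: float) -> float:
    """Aggregate read throughput (GB/s, 10^9 bytes) needed to keep all GPUs fed."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def storage_bound_utilization(available_gb_per_s: float,
                              required_gb_per_s: float) -> float:
    """Upper bound on GPU utilization when storage is the only bottleneck."""
    if required_gb_per_s <= 0:
        return 1.0
    return min(1.0, available_gb_per_s / required_gb_per_s)


if __name__ == "__main__":
    # Placeholder workload: all three inputs are assumptions for illustration.
    needed = required_read_gb_per_s(64, 2_000, 150_000)
    print(f"Required read throughput: {needed:.1f} GB/s")
    # If storage delivers only half of that, GPUs idle roughly half the time.
    print(f"Utilization cap: {storage_bound_utilization(needed / 2, needed):.0%}")
```

In practice the per-GPU sample rate should be measured on the actual model, and caching or data loading from local NVMe can lower the load on shared storage.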
The hardware an organization selects for its AI infrastructure will have implications far beyond the first training run. GPU architecture, memory technology, cooling strategy, power delivery, and rack density collectively define the operational reality for the next three to five years.
Liquid cooling provides a clear example. Modern high-end GPUs typically consume between 400 and 700 watts per card. In an eight-GPU server, that translates into roughly 3.2 to 5.6 kilowatts of heat from the GPUs alone, before CPUs, memory, and networking are counted.
Organizations investing in air-cooled systems today without considering their data center's future cooling capacity may face expensive retrofits within a few years.
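As a rough check on the power figures above, the sketch below multiplies per-card power by the number of GPUs per server and then by the number of servers per rack. It counts GPUs only; CPUs, memory, fans, and power-supply losses come on top. The function names and the figure of four servers per rack are assumptions for illustration.

```python
def gpu_heat_kw(gpus_per_server: int, watts_per_gpu: float) -> float:
    """Heat from the GPUs in one server, in kilowatts (GPU power only)."""
    return gpus_per_server * watts_per_gpu / 1000


def rack_gpu_heat_kw(servers_per_rack: int,
                     gpus_per_server: int,
                     watts_per_gpu: float) -> float:
    """Heat from all GPUs in one rack, in kilowatts (GPU power only)."""
    return servers_per_rack * gpu_heat_kw(gpus_per_server, watts_per_gpu)


if __name__ == "__main__":
    for watts in (400, 700):
        per_server = gpu_heat_kw(8, watts)
        per_rack = rack_gpu_heat_kw(4, 8, watts)  # 4 servers per rack is an assumption
        print(f"{watts} W/GPU: {per_server:.1f} kW per server, "
              f"{per_rack:.1f} kW per rack")
```

Comparing the per-rack figure with the cooling capacity the data center can actually deliver per rack shows whether air cooling remains viable as density grows.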
For European organizations, infrastructure decisions involve an additional dimension that is often less urgent for their American counterparts: data sovereignty.
Organizations that train AI models on cloud platforms located outside the European Union place their training data—and therefore valuable model knowledge—on infrastructure operating outside European jurisdiction.
For organizations in healthcare, finance, government, and other regulated sectors, this is frequently unacceptable.
On-premises infrastructure, assembled and managed within the Netherlands or elsewhere in the EU, provides the level of control necessary to comply with European regulations and maintain internal governance over data, models, and systems.
The amount of GPU memory required depends on model size and numerical precision. A 7-billion-parameter model requires approximately 14 GB of GPU memory for its weights at FP16 precision, excluding activation memory. Larger models (70B and up) need about 140 GB or more at FP16, which exceeds the memory of most single GPUs, so they require multi-GPU configurations connected through high-speed interconnects.
A useful rule of thumb is:
Model parameters × 2 bytes = minimum GPU memory requirement (FP16 weights only)
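The rule of thumb can be generalized to other precisions by changing the number of bytes per parameter. The sketch below is a minimal illustration that covers the weights only; activations, inference caches, and optimizer state during training all add to the total. The function names are hypothetical, and the 80 GB per-GPU capacity in the example is an assumed value.

```python
import math

# Storage width per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Memory needed for the model weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params: float,
                         precision: str,
                         gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)


if __name__ == "__main__":
    print(weight_memory_gb(7e9, "fp16"))          # 14.0
    print(weight_memory_gb(70e9, "fp16"))         # 140.0
    # 80 GB per GPU is an assumed capacity, not a recommendation.
    print(min_gpus_for_weights(70e9, "fp16", 80)) # 2
```

The result is a lower bound. Real deployments need headroom on top of it, so the GPU count that works in practice is often higher.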
On-premises infrastructure makes sense when workloads are predictable and continuous, data is sensitive, latency requirements are strict, or compliance requirements demand greater control.
For research institutions and enterprise environments with stable AI workloads, on-premises infrastructure is often more cost-effective and easier to manage over a one- to three-year period than equivalent cloud capacity.
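The cost comparison can be framed as a break-even calculation: how many months of cloud spending it takes to equal the purchase price of on-premises hardware, once on-premises running costs are subtracted. The sketch below is a simplified model under stated assumptions. The function name is hypothetical, every price in the example is a placeholder rather than a quote, and depreciation, financing, staffing, and cloud discounts are ignored.

```python
from typing import Optional


def breakeven_months(onprem_capex: float,
                     onprem_opex_per_month: float,
                     cloud_cost_per_hour: float,
                     hours_used_per_month: float) -> Optional[float]:
    """Months until cumulative cloud cost exceeds on-premises cost.

    Returns None when cloud is cheaper per month than running on-premises,
    in which case the purchase never pays for itself.
    """
    cloud_monthly = cloud_cost_per_hour * hours_used_per_month
    monthly_saving = cloud_monthly - onprem_opex_per_month
    if monthly_saving <= 0:
        return None
    return onprem_capex / monthly_saving


if __name__ == "__main__":
    # Placeholder figures: 300,000 purchase, 5,000/month power and hosting,
    # 30/hour for comparable cloud capacity.
    for hours in (100, 720):  # occasional use vs. running around the clock
        months = breakeven_months(300_000, 5_000, 30, hours)
        label = "never" if months is None else f"{months:.1f} months"
        print(f"{hours} h/month: break-even {label}")
```

The model makes the dependence on utilization explicit: intermittent workloads may never reach break-even, while continuous workloads reach it much sooner.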
To estimate storage requirements, calculate the size of the dataset, add storage for checkpoints (typically two to five times the model size per checkpoint), account for multiple model versions, and include at least a 30% growth margin.
For parallel storage environments, bandwidth is often more important than capacity.
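The capacity estimate described above can be written out as a short calculation. The sketch below is one possible reading of those steps, not a definitive formula: the function name, the default checkpoint factor of 3 (within the stated range of two to five), and the choice to count each stored model version as one copy of the model are assumptions.

```python
def storage_capacity_tb(dataset_tb: float,
                        model_size_tb: float,
                        checkpoints_kept: int,
                        checkpoint_factor: float = 3.0,
                        model_versions: int = 1,
                        growth_margin: float = 0.30) -> float:
    """Estimated storage capacity in TB, including a growth margin."""
    # Each checkpoint is assumed to take checkpoint_factor times the model size.
    checkpoint_tb = model_size_tb * checkpoint_factor * checkpoints_kept
    # Each retained model version is assumed to take one copy of the model.
    versions_tb = model_size_tb * model_versions
    subtotal = dataset_tb + checkpoint_tb + versions_tb
    return subtotal * (1 + growth_margin)


if __name__ == "__main__":
    # Placeholder inputs: 100 TB dataset, 1 TB model, 10 checkpoints, 2 versions.
    total = storage_capacity_tb(100, 1, checkpoints_kept=10, model_versions=2)
    print(f"Estimated capacity: {total:.1f} TB")
```

This covers capacity only. Read and write bandwidth has to be sized separately.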
Supermicro, Dell Technologies, HPE, and Gigabyte are among the dominant vendors of GPU server platforms.
For training workloads, NVIDIA H100, H200, and Blackwell GPUs currently represent the industry benchmark. AMD Instinct is a strong alternative, although it operates within a smaller software ecosystem.
ServerDirect's engineers support organizations throughout the entire journey—from workload analysis and infrastructure design to deployment, optimization, and 24/7 onsite support.