The Good Tech Companies - The MinIO DataPod: A Reference Architecture for Exascale Computing
Episode Date: August 20, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/the-minio-datapod-a-reference-architecture-for-exascale-computing. MinIO has created a comprehensive blueprint for data infrastructure to support exascale AI and other large-scale data lake workloads. The MinIO DataPod offers an end-to-end architecture that enables infrastructure administrators to deploy cost-efficient solutions for a variety of AI and ML workloads. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #datapod, #exascale, #data, #data-infrastructure, #ai-ml, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
The MinIO DataPod, a reference architecture for exascale computing, by MinIO.
The modern enterprise defines itself by its data. This requires a data infrastructure for AI,
ML as well as a data infrastructure that is the foundation for a modern data lake
capable of supporting business intelligence, data analytics, and data science.
This is true whether they are behind, just getting started, or already using AI for advanced insights.
For the foreseeable future, this will be the way that enterprises are perceived.
There are multiple dimensions or stages to the larger problem of how AI goes to market in the
enterprise. Those include data ingestion, transformation, training, inferencing,
production, and archiving,
with data shared across each stage. As these workloads scale, the complexity of the underlying
AI data infrastructure increases. This creates the need for high-performance infrastructure while
minimizing total cost of ownership (TCO). MinIO has created a comprehensive blueprint for data
infrastructure to support exascale AI and
other large-scale data lake workloads. It is called the MinIO DataPod. The unit of measurement
it uses is 100 pebibytes (PiB). Why? Because the reality is that this scale is common today in the enterprise.
Here are some quick examples: a North American automobile manufacturer with nearly an exabyte
of car video. A German automobile manufacturer with more than
50 petabytes of car telemetry. A biotech firm with more than 50 petabytes of biological,
chemical, and patient-centric data. A cybersecurity company with more than 500
petabytes of log files. A media streaming company with more than 200 petabytes of video.
A defense contractor with more than 80 petabytes
of geospatial, log, and telemetry data from aircraft. Even if they are not at 100 petabytes
today, they will be within a few quarters. The average firm is growing at 42% a year, and
data-centric firms are growing at twice that rate, if not more.
The MinIO DataPod reference architecture can be stacked in different ways to achieve almost any scale. Indeed, we have customers that have built off of this blueprint
all the way past an exabyte, and with multiple hardware vendors. The MinIO DataPod offers an
end-to-end architecture that enables infrastructure administrators to deploy cost-efficient solutions
for a variety of AI and ML workloads. Here's the rationale for our architecture. AI requires
disaggregated storage and compute. AI workloads, especially generative AI, inherently require GPUs
for compute. They are spectacular devices with incredible throughput, memory bandwidth and
parallel processing capabilities. Keeping up with GPUs that are getting faster and faster requires high-speed storage.
This is especially true when training data cannot fit into memory and training loops have to make
more calls to storage. Furthermore, enterprises require more than performance; they also need
security, replication, and resiliency. The enterprise storage requirement demands that
the architecture fully disaggregate storage from compute. This allows storage to scale independently of compute, and given that
storage growth is generally one or more orders of magnitude greater than compute growth,
this approach ensures the best economics through superior capacity utilization.
AI workloads demand a different class of networking. Networking infrastructure has
standardized on 100 gigabits per second (Gbps) bandwidth links for AI workload deployments. Modern-day NVMe drives provide 7 GB/s throughput
on average, making the network bandwidth between the storage servers and the GPU compute servers
the bottleneck for AI pipeline execution performance. Solving this problem with complex
networking solutions like InfiniBand (IB) has real limitations.
We recommend that enterprises leverage existing industry-standard Ethernet-based solutions, e.g.
HTTP over TCP, that work out of the box to deliver data at high throughput for GPUs, for the following
reasons: a much larger and open ecosystem. Reduced network infrastructure cost.
High interconnect speeds, 800 GbE and beyond, with RDMA over Ethernet support, i.e. RoCEv2.
Reuse of existing expertise and tools in deploying, managing, and observing Ethernet.
Innovation around GPU-to-storage-server communication is happening on Ethernet-based
solutions. The requirements of AI demand object storage. It is not a coincidence that AI data
infrastructure in public clouds are all built on top of object stores. Nor is it a coincidence that
every major foundational model was trained on an object store. This is a function of the fact that
POSIX is too chatty to work at the data scale required
by AI, despite what the chorus of legacy filers will claim. The same architecture that delivers
AI in the public cloud should be applied to the private cloud and obviously the hybrid cloud.
Object stores excel at handling various data formats and large volumes of unstructured data
and can effortlessly scale to accommodate growing data without compromising performance. Their flat namespace and metadata capabilities enable efficient data management
and processing that is crucial for AI tasks requiring fast access to large datasets.
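As a back-of-the-envelope check on the network-bottleneck claim above, the per-server arithmetic can be sketched as follows. The figures are assumptions drawn from the hardware specification later in this piece (24 NVMe drives at roughly 7 GB/s each, behind a dual-port 200 GbE NIC); actual sustained throughput will vary by workload.

```python
# Aggregate NVMe throughput vs. network bandwidth per storage server.
# Assumed figures: 24 NVMe drives at ~7 GB/s each (Gen4, gigabytes/s),
# dual-port 200 GbE NIC (gigabits/s, so divide by 8 for GB/s).

DRIVES_PER_SERVER = 24
DRIVE_THROUGHPUT_GBPS = 7        # ~7 GB/s per NVMe drive

NIC_PORTS = 2
NIC_PORT_SPEED_GBITS = 200       # 200 GbE per port

drive_bw = DRIVES_PER_SERVER * DRIVE_THROUGHPUT_GBPS   # 168 GB/s of drives
network_bw = NIC_PORTS * NIC_PORT_SPEED_GBITS / 8      # 50.0 GB/s of network

print(f"Aggregate drive throughput:  {drive_bw} GB/s")
print(f"Aggregate network bandwidth: {network_bw} GB/s")
print(f"Network is the bottleneck:   {network_bw < drive_bw}")  # True
```

Even with generous derating of drive performance, the drives can collectively feed data several times faster than the NIC can ship it, which is why the piece argues the network, not the storage media, gates AI pipeline performance.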
As high-speed GPUs evolve and network bandwidth standardizes at 200, 400, 800 Gbps and beyond,
modern object stores will be the only solution that meets the performance
SLAs and scale of AI workloads. Software-defined everything. We know that GPUs are the star of the
show and that they are hardware. But NVIDIA will tell you the secret sauce is CUDA. Move
outside the chip, however, and the infrastructure world is increasingly software-defined.
Nowhere is this more true than storage. Software-defined
storage solutions are essential for scalability, flexibility, and cloud integration, surpassing
traditional appliance-based models for the following reasons.
Cloud compatibility: software-defined storage aligns with cloud operations, unlike appliances that cannot run across multiple clouds.
Containerization: appliances cannot be containerized, losing cloud-native advantages and preventing Kubernetes orchestration.
Hardware flexibility: software-defined storage supports a wide range of hardware, from edge to core, accommodating diverse IT environments.
Adaptive performance: software-defined storage offers unmatched flexibility, efficiently managing different capacities and performance needs across various chipsets.
At exabyte scale, simplicity and a cloud-based operating model are crucial. Object storage, as a software-defined solution, should work
seamlessly on commodity off-the-shelf (COTS) hardware and any compute platform, be it bare metal, virtual machines, or containers.
Custom-built hardware appliances for object storage often compensate for poorly designed
software with costly hardware and complex solutions, resulting in a high total cost
of ownership (TCO). MinIO DataPod Hardware Specification for AI.
Enterprise customers using Minio for AI initiatives build exabyte-scale data
infrastructure as repeatable units of 100 pebibytes (PiB). This helps infrastructure administrators
ease the process of deployment, maintenance and scaling as the AI data grows exponentially over
a period of time. Below is the bill of materials (BOM) for building data infrastructure at the 100 PiB scale.
Cluster specification (component: quantity):
Total number of racks: 30
Total number of storage servers: 330
Number of storage servers per rack: 11
Total number of TOR switches: 60
Total number of spine switches: 10
Erasure code stripe size: 10
Erasure code parity: 4
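The erasure-code settings in the cluster specification determine how much of the raw capacity is actually usable. Assuming the MinIO convention, in which the stripe size counts the total shards per erasure set (data plus parity), a minimal sketch of the usable-to-raw ratio looks like this; if the stripe size instead counted only data shards, the ratio would be 10/14 rather than 6/10.

```python
# Usable-to-raw ratio implied by the erasure-code settings above.
# Assumption: stripe size 10 = total shards per stripe, of which 4 are
# parity (MinIO EC:4 convention), so 6 shards carry data.

STRIPE_SIZE = 10   # total shards per erasure stripe
PARITY = 4         # parity shards per stripe

data_shards = STRIPE_SIZE - PARITY       # 6 data shards
efficiency = data_shards / STRIPE_SIZE   # usable fraction of raw capacity

print(f"Data shards per stripe: {data_shards}")          # 6
print(f"Storage efficiency:     {efficiency:.0%}")       # 60%
print(f"Tolerated shard losses: {PARITY}")               # any 4 shards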
Single rack specification (component: description, quantity):
Rack enclosure: 42U/45U slot rack, 1
Storage server: 2U form factor, 11
Top-of-the-rack switches: layer 2 switch, 2
Management switch: combined layer 2 and layer 3, 1
Network cables: AOC cables, 30 to 40
Power: dual power supply with rPDU, 17 kW to 20 kW
Storage server specification (component: specification):
Server: 2U, single socket
CPU: 64 cores, 128 PCIe 4.0 lanes
Memory: 256 GB
Network: dual-port 200 GbE NIC
Drive bays: 24 hot-swap 2.5-inch U.2
NVMe drives: 30 TB x 24 NVMe
Power: 1600 W redundant power supplies
Total raw capacity: 720 TB
Storage server references: Dell PowerEdge R7615 rack server; HPE ProLiant DL345 Gen11; Supermicro A+ Server 2114S-WN24RT.
Network switch specification (component: specification):
Top-of-the-rack (TOR) switch: 32 x 100 GbE QSFP28 ports
Cable: 100G QSFP28 AOC
Power: 500 W per switch
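Rolling the per-server numbers up to the cluster gives a sense of how the 100 PiB unit is sized. This is an illustrative sketch under stated assumptions: 24 x 30 TB drives per server, 330 servers, the 10-shard/4-parity erasure layout read as 6 usable shards in 10, decimal terabytes throughout, and 1 PiB taken as roughly 1,125.9 decimal TB. The result lands above the nominal 100 PiB unit, which leaves headroom for metadata and growth.

```python
# Cluster capacity roll-up from the specifications above (assumptions
# stated in the lead-in; all figures in decimal TB unless noted).

DRIVES_PER_SERVER = 24
DRIVE_TB = 30
SERVERS = 330
TB_PER_PIB = 1125.9               # 2**50 bytes expressed in decimal TB

server_raw_tb = DRIVES_PER_SERVER * DRIVE_TB     # 720 TB, matches the spec
cluster_raw_tb = server_raw_tb * SERVERS         # 237,600 TB raw

# Assuming 6 of every 10 erasure-coded shards carry data:
usable_tb = cluster_raw_tb * 6 / 10
usable_pib = usable_tb / TB_PER_PIB

print(f"Per-server raw:  {server_raw_tb} TB")
print(f"Cluster raw:     {cluster_raw_tb:,} TB")
print(f"Approx. usable:  {usable_tb:,.0f} TB (~{usable_pib:.0f} PiB)")
```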
Price. MinIO has validated this architecture with multiple customers and would expect others to see
the following average price per terabyte per month. This is an average street price, and the
actual price may vary depending on the configuration and the hardware vendor relationship.
Scale: 100 PiB. Storage hardware price (per TB per month): $1.50. MinIO software price (per TB per month): $3.54.
Vendor-specific turnkey hardware appliances for AI will result in high TCO and are not scalable
from a unit economics standpoint
for large data AI initiatives at exabyte scale. Conclusion. Data infrastructure setup at exabyte
scale, while meeting the TCO objectives for all AI/ML workloads, can be complex and hard to get right.
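To make the TCO figures concrete, the per-terabyte rates can be combined into a total monthly cost for one 100 PiB unit. The rates used here ($1.50/TB/month for hardware, $3.54/TB/month for MinIO software) are assumptions taken from the average street prices quoted above and will differ by vendor relationship; 1 PiB is taken as roughly 1,125.9 decimal TB.

```python
# Rough monthly cost of one 100 PiB DataPod unit, using the assumed
# average street prices quoted in the piece (your quote may differ).

HW_PER_TB_MONTH = 1.50    # storage hardware, $/TB/month (assumed)
SW_PER_TB_MONTH = 3.54    # MinIO software, $/TB/month (assumed)
TB_PER_PIB = 1125.9       # 2**50 bytes in decimal TB

capacity_tb = 100 * TB_PER_PIB
monthly_cost = capacity_tb * (HW_PER_TB_MONTH + SW_PER_TB_MONTH)

print(f"Capacity:     {capacity_tb:,.0f} TB")
print(f"Monthly cost: ${monthly_cost:,.0f}")   # about $567k per month
```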
MinIO's DataPod infrastructure blueprint makes it simple and straightforward for
infrastructure administrators to set up the required commodity off-the-shelf hardware with the highly scalable, performant, cost-effective, S3-compatible
MinIO Enterprise Object Store, resulting in improved overall time to market and faster time
to value from AI initiatives across organizations within the enterprise landscape. Thank you for
listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com
to read, write, learn and publish.