The Good Tech Companies - The MinIO DataPod: A Reference Architecture for Exascale Computing

Episode Date: August 20, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/the-minio-datapod-a-reference-architecture-for-exascale-computing. MinIO has created a comprehensive blueprint for data infrastructure to support exascale AI and other large-scale data lake workloads. The MinIO DataPod offers an end-to-end architecture that enables infrastructure administrators to deploy cost-efficient solutions for a variety of AI and ML workloads. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #datapod, #exascale, #data, #data-infrastructure, #ai-ml, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. The MinIO DataPod, a reference architecture for exascale computing, by MinIO. The modern enterprise defines itself by its data. This requires a data infrastructure for AI/ML as well as a data infrastructure that is the foundation for a modern data lake capable of supporting business intelligence, data analytics, and data science. This is true whether they are behind, just getting started, or already using AI for advanced insights. For the foreseeable future, this will be the way that enterprises are perceived. There are multiple dimensions or stages to the larger problem of how AI goes to market in the
Starting point is 00:00:39 enterprise. Those include data ingestion, transformation, training, inferencing, production, and archiving, with data shared across each stage. As these workloads scale, the complexity of the underlying AI data infrastructure increases. This creates the need for high-performance infrastructure while minimizing total cost of ownership (TCO). MinIO has created a comprehensive blueprint for data infrastructure to support exascale AI and other large-scale data lake workloads. It is called the MinIO DataPod. The unit of measurement it uses is 100 pebibytes. Why? Because the reality is that this is common today in the enterprise.
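Since the piece uses pebibytes as its unit of measurement, a quick aside on the arithmetic may help. This is just a unit conversion sketch, not anything stated in the article itself:

```python
# A pebibyte (PiB) is 2**50 bytes; a petabyte (PB) is 10**15 bytes.
PIB = 2**50
PB = 10**15

datapod_bytes = 100 * PIB          # the DataPod's 100 PiB unit of measurement
datapod_pb = datapod_bytes / PB    # the same capacity in decimal petabytes

print(round(datapod_pb, 1))        # 112.6
```

In other words, a 100 PiB DataPod is roughly 112.6 PB in the decimal units vendors usually quote, which is why the two prefixes should not be used interchangeably.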
Starting point is 00:01:17 Here are some quick examples: a North American automobile manufacturer with nearly an exabyte of car video. A German automobile manufacturer with more than 50 petabytes of car telemetry. A biotech firm with more than 50 petabytes of biological, chemical, and patient-centric data. A cybersecurity company with more than 500 petabytes of log files. A media streaming company with more than 200 petabytes of video. A defense contractor with more than 80 petabytes of geospatial, log, and telemetry data from aircraft. Even if they are not at 100 petabytes today, they will be within a few quarters. The average firm is growing at 42% a year;
Starting point is 00:01:57 data-centric firms are growing at twice that rate, if not more. The MinIO DataPod reference architecture can be stacked in different ways to achieve almost any scale. Indeed, we have customers that have built off of this blueprint all the way past an exabyte, and with multiple hardware vendors. The MinIO DataPod offers an end-to-end architecture that enables infrastructure administrators to deploy cost-efficient solutions for a variety of AI and ML workloads. Here's the rationale for our architecture. AI requires disaggregated storage and compute. AI workloads, especially generative AI, inherently require GPUs for compute. They are spectacular devices with incredible throughput, memory bandwidth, and parallel processing capabilities. Keeping up with GPUs that are getting faster and faster requires high-speed storage.
Starting point is 00:02:46 This is especially true when training data cannot fit into memory and training loops have to make more calls to storage. Furthermore, enterprises require more than performance; they also need security, replication, and resiliency. The enterprise storage requirement demands that the architecture fully disaggregate storage from compute. This allows storage to scale independently of compute, and given that storage growth is generally one or more orders of magnitude greater than compute growth, this approach ensures the best economics through superior capacity utilization. AI workloads demand a different class of networking. Networking infrastructure has standardized on 100 gigabits per second (Gbps) bandwidth links for AI workload deployments. Modern-day NVMe drives provide 7 gigabytes per second (GB/s) of throughput
Starting point is 00:03:32 on average, making the network bandwidth between the storage servers and the GPU compute servers the bottleneck for AI pipeline execution performance. Solving this problem with complex networking solutions like InfiniBand (IB) has real limitations. We recommend that enterprises leverage existing industry-standard, Ethernet-based solutions, e.g. HTTP over TCP, that work out of the box to deliver data at high throughput for GPUs, for the following reasons: a much larger and open ecosystem; reduced network infrastructure cost; high interconnect speeds, 800 GbE and beyond, with RDMA over Ethernet support, i.e. RoCEv2; and reuse of existing expertise and tools in deploying, managing, and observing Ethernet.
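The bottleneck claim above can be checked with back-of-the-envelope arithmetic. A sketch only: the 24-drive count and dual-port 200 GbE NIC are taken from the storage server specification later in the piece, and real sustained throughput will sit below these line rates:

```python
# Aggregate drive throughput vs. network throughput for one storage server.
DRIVES_PER_SERVER = 24       # hot-swap U.2 NVMe bays, per the BOM
GB_S_PER_DRIVE = 7           # gigabytes/s per NVMe drive, the article's average
NIC_GBITS = 2 * 200          # dual-port 200 GbE NIC, in gigabits/s

drive_throughput_gbytes = DRIVES_PER_SERVER * GB_S_PER_DRIVE   # 168 GB/s from the drives
nic_throughput_gbytes = NIC_GBITS / 8                          # 50 GB/s out the NICs

# The drives can source roughly 3x more data than the network can ship,
# so the network, not the storage media, caps pipeline throughput.
print(drive_throughput_gbytes, nic_throughput_gbytes)
```

This is why the architecture argues about network links rather than drive counts: adding faster drives to a server does nothing once the NICs are saturated.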
Starting point is 00:04:20 Innovation around GPU-to-storage-server communication is happening on Ethernet-based solutions. The requirements of AI demand object storage. It is not a coincidence that AI data infrastructures in public clouds are all built on top of object stores. Nor is it a coincidence that every major foundational model was trained on an object store. This is a function of the fact that POSIX is too chatty to work at the data scale required by AI, despite what the chorus of legacy filers will claim. The same architecture that delivers AI in the public cloud should be applied to the private cloud and, obviously, the hybrid cloud. Object stores excel at handling various data formats and large volumes of unstructured data
Starting point is 00:05:01 and can effortlessly scale to accommodate growing data without compromising performance. Their flat namespace and metadata capabilities enable efficient data management and processing that is crucial for AI tasks requiring fast access to large datasets. As high-speed GPUs evolve and network bandwidth standardizes at 200, 400, 800 GbE and beyond, modern object stores will be the only solution that meets the performance SLAs and scale of AI workloads. Software-defined everything. We know that GPUs are the star of the show and that they are hardware. But NVIDIA will tell you the secret sauce is CUDA. Move outside the chip, however, and the infrastructure world is increasingly software-defined. Nowhere is this more true than storage. Software-defined
Starting point is 00:05:45 storage solutions are essential for scalability, flexibility, and cloud integration, surpassing traditional appliance-based models for the following reasons. Cloud compatibility: software-defined storage aligns with cloud operations, unlike appliances that cannot run across multiple clouds. Containerization: appliances cannot be containerized, losing cloud-native advantages and preventing Kubernetes orchestration. Hardware flexibility: software-defined storage supports a wide range of hardware, from edge to core, accommodating diverse IT environments. Adaptive performance: software-defined storage offers
Starting point is 00:06:26 unmatched flexibility, efficiently managing different capacities and performance needs across various chipsets. At exabyte scale, simplicity and a cloud-based operating model are crucial. Object storage, as a software-defined solution, should work seamlessly on commodity off-the-shelf (COTS) hardware and any compute platform, be it bare metal, virtual machines, or containers. Custom-built hardware appliances for object storage often compensate for poorly designed software with costly hardware and complex solutions, resulting in a high total cost of ownership (TCO). MinIO DataPod hardware specification for AI. Enterprise customers using MinIO for AI initiatives build exabyte-scale data
Starting point is 00:07:10 infrastructure as repeatable units of 100 pebibytes. This helps infrastructure administrators ease the process of deployment, maintenance, and scaling as the AI data grows exponentially over time. Below is the bill of materials (BOM) for building a 100-pebibyte-scale data infrastructure.

Cluster specification (component: quantity):
Total number of racks: 30
Total number of storage servers: 330
Number of storage servers per rack: 11
Total number of TOR switches: 60
Total number of spine switches: 10
Erasure code stripe size: 10
Erasure code parity: 4

Starting point is 00:07:56 Single rack specification (component, description: quantity):
Rack enclosure, 42U/45U slot rack: 1
Storage server, 2U form factor: 11
Top-of-the-rack switches, layer 2 switch: 2
Management switch, combined layer 2 and layer 3: 1
Network cables, AOC cables: 30-40
Power, dual power supply with rPDU: 17-20 kilowatts

Storage server specification (component: specification):
Server: 2U, single socket
CPU: 64 cores, 128 PCIe 4.0 lanes
Memory: 256 gigabytes
Network: dual-port 200 GbE NIC
Drive bays: 24 hot-swap 2.5-inch U.2
NVMe drives: 24 x 30 TB
Power: 1600 W redundant power supplies
Total raw capacity: 720 terabytes

Storage server references: Dell PowerEdge R7615 rack server, HPE ProLiant DL345 Gen11, Supermicro A+ Server 2114S-WN24RT.

Network switch specification (component: specification):
Top-of-the-rack (TOR) switch: 32 x 100 GbE QSFP28 ports
Cable: 100G QSFP28 AOC
Power: 500 W per switch

Price. MinIO has validated this architecture with multiple customers and would expect others to see
Starting point is 00:09:14 the following average price per terabyte per month. This is an average street price, and the actual price may vary depending on the configuration and the hardware vendor relationship. At 100-pebibyte scale: storage hardware price, $1.50 per TB per month; MinIO software price, $3.54 per TB per month. Vendor-specific turnkey hardware appliances for AI will result in high TCO and are not scalable from a unit-economics standpoint for large-data AI initiatives at exabyte scale. Conclusion. Data infrastructure setup at exabyte
Starting point is 00:09:51 scale while meeting the TCO objectives for all AI/ML workloads can be complex and hard to get right. MinIO's DataPod infrastructure blueprint makes it simple and straightforward for infrastructure administrators to set up the required commodity off-the-shelf hardware with the highly scalable, performant, cost-effective, S3-compatible MinIO Enterprise Object Store, resulting in improved overall time to market and faster time to value from AI initiatives across organizations within the enterprise landscape. Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn, and publish.
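As a closing sanity check, the bill of materials above can be tied together with a short calculation. This is a sketch resting on two assumptions the piece does not spell out: that the erasure-code stripe of 10 includes the 4 parity shards, and that the quoted per-TB prices apply to usable capacity:

```python
# Raw capacity from the BOM: 330 servers x 24 NVMe drives x 30 TB each.
SERVERS, DRIVES, TB_PER_DRIVE = 330, 24, 30
STRIPE, PARITY = 10, 4                      # erasure-code settings from the cluster spec
TB_PER_PIB = 2**50 / 10**12                 # ~1125.9 TB in one pebibyte

raw_tb = SERVERS * DRIVES * TB_PER_DRIVE            # 237,600 TB raw
usable_tb = raw_tb * (STRIPE - PARITY) / STRIPE     # assumes parity counts toward the stripe
usable_pib = usable_tb / TB_PER_PIB                 # ~126.6 PiB, headroom over the 100 PiB target

# Monthly street price: hardware plus MinIO software, per usable TB (assumption).
monthly_usd = usable_tb * (1.50 + 3.54)

print(raw_tb, round(usable_pib))
```

Under those assumptions, the 330-server layout lands comfortably above the 100 PiB unit of measurement, which is consistent with the blueprint leaving operational headroom rather than sizing to the exact target.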
