The Good Tech Companies - Architecting a Modern Data Lake in a Post-Hadoop World

Episode Date: September 13, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/architecting-a-modern-data-lake-in-a-post-hadoop-world. This paper discusses the rise and fall of Hadoop HDFS and why high-performance object storage is a natural successor in the big data world. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #modern-datalake, #data-lake, #data-science, #hadoop, #database, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Architecting a Modern Data Lake in a Post-Hadoop World, by MinIO. The modern data lake is one half data warehouse and one half data lake and uses object storage for everything. The use of object storage to build a data warehouse is made possible by open table formats, OTFs, like Apache Iceberg, Apache Hudi, and Delta Lake, which are specifications that, once implemented, make it seamless for object storage to be used as the underlying storage solution for a data warehouse. These specifications also provide features that may not exist in a conventional data warehouse, for example, snapshots, also known as time travel,
Starting point is 00:00:42 schema evolution, partitions, partition evolution, and zero-copy branching. As organizations build modern data lakes, here are some of the key factors we think they should be considering. 1. Disaggregation of compute and storage. 2. Migration from monolithic frameworks to best-of-breed frameworks. 3. Data center consolidation, replacing departmental solutions with a single corporate solution. 4. Seamless performance across small and large files and objects. 5. Software-defined, cloud-native solutions that scale horizontally. This paper discusses the rise and fall of Hadoop HDFS and why high-performance object storage is a natural successor in the big data world.
Starting point is 00:01:24 Adoption of Hadoop. With the expansion of internet applications, the first major data storage and aggregation challenges for advanced tech companies started 15 years ago. Traditional RDBMS, Relational Database Management Systems, could not be scaled to handle large amounts of data. Then came Hadoop, a highly scalable model. In the Hadoop model, a large amount of data is divided across multiple inexpensive machines in a cluster, which then process it in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. Hadoop was open-source and used cost-effective commodity
Starting point is 00:02:01 hardware, which provided a cost-efficient model, unlike traditional relational databases, which require expensive hardware and high-end processors to deal with big data. Because it was so expensive to scale in the RDBMS model, enterprises started to remove the raw data. This led to suboptimal outcomes across a number of vectors. In this regard, Hadoop provided a significant advantage over the RDBMS approach. It was more scalable from a cost perspective, without sacrificing performance. The End of Hadoop. The advent of newer technologies like change data capture, CDC, and streaming data, primarily generated from social media companies like Twitter and Facebook, altered how data is ingested and stored. This triggered challenges in processing
Starting point is 00:02:45 and consuming these even larger volumes of data. A key challenge was with batch processing. Batch processes run in the background and do not interact with the user. Hadoop was efficient with batch processing when it came to very large files but suffered with smaller files, both from an efficiency perspective as well as a latency perspective, effectively rendering it obsolete as enterprises sought out processing and consumption frameworks that could ingest varied datasets large and small in batch, CDC, and real-time. Separating compute and storage simply makes sense today. Storage needs to outpace compute by as much as 10 to 1. This is highly inefficient in the Hadoop world, where you need one compute node for every storage node. Separating them means they can be tuned individually.
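A quick back-of-the-envelope sketch shows why coupled scaling wastes compute when storage demand outpaces compute demand, as described above. The 10:1 ratio is the source's figure; the per-node capacities below are illustrative assumptions, not vendor specifications:

```python
import math

# Illustrative sizing sketch: storage demand grows much faster than compute demand.
# Node capacities below are hypothetical, not vendor figures.
STORAGE_PER_NODE_TB = 100   # assumed dense storage node

def coupled_nodes(storage_tb: float, compute_units: float) -> int:
    """Hadoop-style: every storage node is also a compute node, so the
    cluster is sized by whichever demand is larger."""
    return math.ceil(max(storage_tb / STORAGE_PER_NODE_TB, compute_units))

def disaggregated_nodes(storage_tb: float, compute_units: float) -> tuple[int, int]:
    """Separated tiers: size storage and compute independently."""
    return math.ceil(storage_tb / STORAGE_PER_NODE_TB), math.ceil(compute_units)

# 10 PB of data but only 10 units of compute demand:
print(coupled_nodes(10_000, 10))        # 100 nodes, 90 of them idle on compute
print(disaggregated_nodes(10_000, 10))  # (100, 10): 100 storage + 10 compute nodes
```

With coupled nodes, the 90 extra compute units exist only because the storage had to scale; disaggregation lets each tier be tuned and purchased on its own curve.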
Starting point is 00:03:30 The compute nodes are stateless and can be optimized with more CPU cores and memory. The storage node is already stateful and can be IO-optimized with a greater number of denser drives and higher bandwidth. By disaggregating, enterprises can achieve superior economics, better manageability, improved scalability, and enhanced total cost of ownership. HDFS cannot make this transition. When you move beyond data locality, Hadoop HDFS's strength becomes its weakness. Hadoop was designed for MapReduce computing, where data and compute had to be co-located. As a result, Hadoop needs its own job scheduler, resource manager, storage, and compute. This is fundamentally incompatible
Starting point is 00:04:11 with container-based architectures, where everything is elastic, lightweight, and multi-tenant. In contrast, MinIO was born cloud-native and is designed for containers and orchestration via Kubernetes, making it the ideal technology to transition to when retiring legacy HDFS instances. This has given rise to the modern data lake. It takes advantage of the commodity hardware approach inherited from Hadoop but disaggregates storage and compute, thereby changing how data is processed, analyzed, and consumed. Building a Modern Data Lake with MinIO. MinIO is a high-performance object storage system that was built from scratch to be scalable and cloud-native. The team that built MinIO also
Starting point is 00:04:51 built one of the most successful file systems, GlusterFS, before evolving their thinking on storage. Their deep understanding of file systems and which processes were expensive or inefficient informed the architecture of MinIO, delivering performance and simplicity in the process. MinIO uses erasure coding, which provides a better set of algorithms to manage storage efficiency and provide resiliency. Typically, it requires about 1.5x the raw capacity, unlike the 3x replication in Hadoop clusters. This alone provides storage efficiency and reduces costs compared to Hadoop. From its inception, MinIO was designed for the cloud operating model. As a result, it runs on every cloud, public, private, on-prem, bare metal, and edge. This makes it ideal for multi-cloud and hybrid cloud
Starting point is 00:05:37 deployments. With a hybrid configuration, MinIO enables the migration of data analytics and data science workloads in accordance with approaches like the strangler fig pattern popularized by Martin Fowler. Below are several other reasons why MinIO is the basic building block for a modern data lake capable of supporting your AI data infrastructure as well as other analytical workloads such as business intelligence, data analytics, and data science. Modern Data Ready. Hadoop was purpose-built for data where unstructured data means large, GiB-to-TiB-sized log files. When used as a general-purpose storage platform where true unstructured data is in play, the prevalence of small objects, KB to MB, greatly impairs Hadoop HDFS, as the name nodes were never designed to scale in this fashion. MinIO excels at any file or object size, from 8 KiB to 5 TiB.
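The storage-overhead comparison mentioned earlier, roughly 1.5x for erasure coding versus 3x for replication, can be checked with simple arithmetic. The 8 data + 4 parity shard layout below is one common erasure-coding configuration, used here as an illustrative assumption:

```python
# Compare raw capacity needed to store 1 PB of usable data under
# 3x replication (Hadoop's default) vs. erasure coding with 8 data
# + 4 parity shards (an illustrative EC layout with 1.5x overhead).

def replication_raw_pb(usable_pb: float, copies: int = 3) -> float:
    """Raw capacity under full replication: one complete copy per replica."""
    return usable_pb * copies

def erasure_raw_pb(usable_pb: float, data_shards: int = 8, parity_shards: int = 4) -> float:
    """Raw capacity under erasure coding: overhead = (data + parity) / data."""
    overhead = (data_shards + parity_shards) / data_shards
    return usable_pb * overhead

print(replication_raw_pb(1.0))  # 3.0 PB of raw disk
print(erasure_raw_pb(1.0))      # 1.5 PB of raw disk
```

The erasure-coded layout still tolerates the loss of any 4 of the 12 shards, yet needs half the raw disk of triple replication.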
Starting point is 00:06:30 Open Source. The enterprises that adopted Hadoop did so out of a preference for open source technologies. The ability to inspect, the freedom from lock-in, and the comfort that comes from tens of thousands of users all have real value. MinIO is also 100% open source, ensuring that organizations can stay true to their goals while upgrading their experience. Simple. Simplicity is hard. It takes work, discipline, and above all, commitment. MinIO's simplicity is legendary and is the result of a philosophical commitment to making
Starting point is 00:07:01 our software easy to deploy, use, upgrade, and scale. Even Hadoop's fans will tell you it is complex. To do more with less, you need to migrate to MinIO. Performant. Hadoop rose to prominence because of its ability to deliver big data performance. It was, for the better part of a decade, the benchmark for enterprise-grade analytics. Not anymore. MinIO has proven in multiple benchmarks that it is materially faster than Hadoop. This means better performance for your modern data lake. Lightweight. MinIO's server binary is less than 100 megabytes. Despite its size, it is powerful enough to run the data center, yet still small enough to live comfortably at the edge.
Starting point is 00:07:41 There is no such alternative in the Hadoop world. What this means for enterprises is that your S3 applications can access data anywhere, anytime, and with the same API. By deploying MinIO to an edge location, you can capture and filter data at the edge and use MinIO's replication capabilities to ship it to your modern data lake for aggregation and further analytics. Resilient. MinIO protects data with per-object, inline erasure coding, which is far more efficient than the HDFS alternatives that came after replication but never gained adoption. In addition, MinIO's bitrot detection ensures that it will never read corrupted data, capturing and healing corrupted objects on the fly. MinIO also supports cross-region,
Starting point is 00:08:23 active-active replication. Finally, MinIO supports a complete object locking framework offering both legal hold and retention, with governance and compliance modes. Software-Defined. Hadoop HDFS's successor isn't a hardware appliance; it is software running on commodity hardware. That is what MinIO is: software. Like Hadoop HDFS, MinIO is designed to take full advantage of commodity servers. With the ability to leverage NVMe drives and 100 GbE networking, MinIO can shrink the data center, improving operational efficiency and manageability.
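The "same API anywhere" point above can be sketched with the standard AWS SDK: the same client code works against an edge node, an on-prem cluster, or a public cloud, with only the endpoint changing. The endpoint, bucket, and credentials below are placeholder assumptions, not real values:

```python
# Sketch: one S3 client code path for edge, on-prem, or cloud - only the
# endpoint differs. Endpoint and credentials here are placeholders.

def s3_client_kwargs(endpoint_url: str, access_key: str, secret_key: str) -> dict:
    """Build keyword arguments for an S3-compatible client."""
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }

# Usage (requires boto3 and a reachable S3-compatible endpoint):
# import boto3
# client = boto3.client(**s3_client_kwargs("https://minio.example.com:9000",
#                                          "ACCESS_KEY", "SECRET_KEY"))
# client.put_object(Bucket="lake", Key="events/raw.json", Body=b"{}")
```

Pointing the same code at an edge endpoint versus the core data lake is purely a configuration change, which is what makes edge capture plus replication to the central lake practical.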
Starting point is 00:08:58 Secure. MinIO supports multiple, sophisticated server-side encryption schemes to protect data wherever it may be, in flight or at rest. MinIO's approach assures confidentiality, integrity, and authenticity with negligible performance overhead. Server-side and client-side encryption are supported using AES-256-GCM, ChaCha20-Poly1305, and AES-CBC, ensuring application compatibility. Furthermore, MinIO supports industry-leading key management systems, KMS. Migrating from Hadoop to MinIO. The MinIO team has expertise in migrating from HDFS to MinIO.
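As one concrete example of the server-side encryption mentioned above, the S3 SSE-C scheme lets the client supply its own 256-bit key through three documented request headers. The sketch below builds those headers; the key is randomly generated purely for illustration:

```python
# Build the S3 SSE-C request headers for a customer-supplied 256-bit key.
# Header names follow the S3 SSE-C specification; the key is illustrative.
import base64
import hashlib
import os

def sse_c_headers(key: bytes) -> dict:
    """Return the three SSE-C headers: algorithm, base64 key, base64 MD5 of key."""
    if len(key) != 32:
        raise ValueError("SSE-C requires a 256-bit (32-byte) key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode(),
        "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(
            hashlib.md5(key).digest()
        ).decode(),
    }

headers = sse_c_headers(os.urandom(32))
print(headers["x-amz-server-side-encryption-customer-algorithm"])  # AES256
```

The server never stores the key; the client must present the same headers on every read, which is why the MD5 checksum header exists as a transmission check.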
Starting point is 00:09:38 Customers that purchase an enterprise license can get assistance from our engineers. To learn more about using MinIO to replace HDFS, check out this collection of resources. Conclusion. Every enterprise is a data enterprise at this point. The storage of that data and the subsequent analysis need to be seamless, scalable, secure, and performant. The analytical tools spawned by the Hadoop ecosystem, like Spark, are more effective and efficient when paired with object storage-based data lakes. Technologies like Flink improve overall performance, as they provide a single runtime for streaming as well as batch processing, something that didn't work well in the HDFS model. Frameworks like Apache Arrow are redefining how data is stored and processed, and Iceberg and
Starting point is 00:10:20 Hudi are redefining how table formats allow for the efficient querying of data. These technologies all require a modern, object-storage-based data lake where compute and storage are disaggregated and workload-optimized. If you have any questions while architecting your own modern data lake, please feel free to reach out to us at hello@min.io or on our Slack channel. Thank you for listening to this Hackernoon story, read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.
