The Good Tech Companies - Architecting a Modern Data Lake in a Post-Hadoop World
Episode Date: September 13, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/architecting-a-modern-data-lake-in-a-post-hadoop-world. This paper talks to the rise and fall of Hadoop HDFS and why high-performance object storage is a natural successor in the big data world. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #modern-datalake, #data-lake, #data-science, #hadoop, #database, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Architecting a Modern Data Lake in a Post-Hadoop World, by Minio.
The Modern Data Lake is one half data warehouse and one half data lake and uses object storage
for everything. The use of object storage to build a data warehouse is made possible by
open table formats, OTFs, like Apache Iceberg, Apache Hudi, and Delta Lake, which are specifications
that, once implemented, make it seamless for object storage to be used as the underlying
storage solution for a data warehouse. These specifications also provide features that may
not exist in a conventional data warehouse, for example, snapshots, also known as time travel,
schema evolution, partitions, partition evolution,
and zero-copy branching. As organizations build modern data lakes, here are some of the key
factors we think they should be considering. 1. Disaggregation of compute and storage.
2. Migration from monolithic frameworks to best-of-breed frameworks.
3. Data center consolidation. Re replace departmental solutions with a single corporate
solution. 4. Seamless performance across small and large files, objects. 5. Software-defined,
cloud-native solutions that scale horizontally. This paper talks to the rise and fall of Hadoop
HDFS and why high-performance object storage is a natural successor in the big data world.
Adoption of HADOOP with
the expansion of internet applications, the first major data storage and aggregation challenges for
advanced tech companies started about 15 years ago. Traditional RDBMS, Relational Database Management
Systems, could not scale to handle large amounts of data. Then came Hadoop, a highly
scalable model. In the Hadoop model, a large amount of data is divided across multiple inexpensive machines in a cluster, which is then
processed in parallel. The number of these machines or nodes can be increased or decreased
as per the enterprise's requirements. Hadoop was open-source and used cost-effective commodity
hardware, which provided a cost-efficient model, unlike traditional relational databases, which require expensive hardware and high-end processors to deal with
big data. Because it was so expensive to scale in the RDBMS model, enterprises started to remove
the raw data. This led to suboptimal outcomes across a number of vectors. In this regard,
Hadoop provided a significant advantage over the RDBMS approach.
It was more scalable from a cost perspective, without sacrificing performance.
The end of Hadoop. The advent of newer technologies like change data capture,
CDC, and streaming data, primarily generated from social media companies like Twitter and Facebook,
altered how data is ingested and stored. This triggered challenges in processing
and consuming these even larger volumes of data. A key challenge was with batch processing.
Batch processes run in the background and do not interact with the user.
Hadoop was efficient with batch processing when it came to very large files but suffered with
smaller files, both from an efficiency perspective and a latency perspective, effectively rendering it obsolete as enterprises sought out processing and consumption frameworks
that could ingest varied datasets large and small in batch, CDC, and real-time.
Separating compute and storage simply makes sense today. Storage needs to outpace compute by as much
as 10 to 1. This is highly inefficient in the Hadoop world, where you need one compute
node for every storage node. Separating them means they can be tuned individually.
The compute nodes are stateless and can be optimized with more CPU cores and memory.
The storage node is already stateful and can be IO optimized with a greater number of denser
drives and higher bandwidth. By disaggregating, enterprises can achieve
superior economics, better manageability, improved scalability, and enhanced total
cost of ownership. HDFS cannot make this transition. When you leave data locality,
Hadoop HDFS's strength becomes its weakness. Hadoop was designed for MapReduce computing,
where data and compute had to be co-located. As a result, Hadoop needs its
own job scheduler, resource manager, storage, and compute. This is fundamentally incompatible
with container-based architectures, where everything is elastic, lightweight, and
multi-tenant. In contrast, Minio was born cloud-native and is designed for containers
and orchestration via Kubernetes, making it the ideal technology to transition to when retiring legacy HDFS instances. This has given rise to the modern data lake.
It takes advantage of using the commodity hardware approach inherited from Hadoop but
disaggregates storage and compute, thereby changing how data is processed, analyzed,
and consumed. Building a modern data lake with Minio. MinIO is a high-performance object storage system that was built from scratch to be scalable and cloud-native. The team that built Minio also built one of the most successful file systems, GlusterFS, before evolving their thinking on
storage. Their deep understanding of file systems and which processes were expensive or inefficient
informed the architecture of Minio, delivering performance and simplicity in the process. Minio uses erasure coding and provides a better set of
algorithms to manage storage efficiency and provide resiliency. Typically, the capacity overhead is about 1.5 times the data, for example, 8 data blocks plus 4 parity blocks store 8 units of data in 12 units of raw capacity, unlike the 3 times replication used in Hadoop clusters. This alone provides storage efficiency and reduces costs compared to Hadoop. From its inception,
Minio was designed for the cloud operating model. As a result, it runs on every cloud, public,
private, on-prem, bare metal, and edge. This makes it ideal for multi-cloud and hybrid cloud
deployments. With a hybrid configuration, Minio enables the migration of data analytics and data
science workloads in accordance with approaches like the Strangler fig pattern popularized by Martin Fowler. Below are several
other reasons why Minio is the basic building block for a modern data lake capable of supporting
your AI data infrastructure as well as other analytical workloads such as business intelligence,
data analytics, and data science. Modern data ready. Hadoop was purpose-built for data where unstructured data means large, GiB-to-TiB-sized log files. When used as a general-purpose storage platform where true unstructured data is in play, the prevalence of small objects, KB to MB, greatly impairs Hadoop HDFS, as the name nodes were never designed to scale in this fashion. MinIO excels at any file or object size, from 8 KiB to 5 TiB.
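As a rough illustration of handling small and large objects through one API, here is a short boto3 sketch against an S3-compatible endpoint; the endpoint, credentials, bucket, and file paths are placeholders rather than anything from the article.

# A small sketch using boto3 against an S3-compatible endpoint.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",    # MinIO or any S3-compatible endpoint
    aws_access_key_id="minioadmin",           # placeholder credentials
    aws_secret_access_key="minioadmin",
)

# A tiny object (a few bytes of JSON) and a large file go through the same API;
# upload_file transparently switches to multipart uploads for large files.
s3.put_object(Bucket="datalake", Key="events/small-record.json",
              Body=b'{"id": 1, "source": "sensor-7"}')
s3.upload_file("/tmp/large-dataset.parquet", "datalake",
               "datasets/large-dataset.parquet")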
Open source. The enterprises that adopted Hadoop did so out of a preference for open source technologies. The ability to inspect, the freedom from lock-in, and the comfort that comes from tens of thousands of users have real value. Minio is also 100% open source, ensuring that organizations can stay true to their goals
while upgrading their experience.
Simple.
Simplicity is hard.
It takes work, discipline, and above all, commitment.
Minio's simplicity is legendary and is the result of a philosophical commitment to making
our software easy to deploy, use, upgrade, and scale.
Even Hadoop's fans will tell you it is complex. To do more with less, you need to migrate to MinIO. Performant. Hadoop rose to prominence because of its ability to deliver big data performance. It was, for the better part of a decade, the benchmark for enterprise-grade analytics. Not anymore. MinIO has proven in
multiple benchmarks that it is
materially faster than Hadoop. This means better performance for your modern data lake.
Lightweight. Minio's server binary is less than 100 megabytes. Despite its size, it is powerful enough to run the data center, yet still small enough to live comfortably at the edge. There is no such alternative in the Hadoop world. What this means for enterprises is that your S3 applications can access data anywhere, anytime, and with the same
API. By deploying Minio to an edge location, you can capture and filter data at the edge and use
Minio's replication capabilities to ship it to your modern data lake for aggregation and
further analytics. Resilient. Minio protects data with per-object, inline erasure coding, which is far more efficient than the HDFS alternative, which came after replication
and never gained adoption. In addition, Minio's bitrot detection ensures that it will never read
corrupted data, capturing and healing corrupted objects on the fly. Minio also supports cross-region,
active-active replication. Finally, Minio supports a complete object locking framework offering both legal hold
and retention, with governance and compliance modes.
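For a sense of what that object locking framework looks like from an application, here is a hedged boto3 sketch; the endpoint, credentials, bucket (assumed to have been created with object locking enabled), and object key are placeholders.

# A sketch of object locking with boto3 against an S3-compatible endpoint.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:9000",
                  aws_access_key_id="minioadmin",           # placeholder
                  aws_secret_access_key="minioadmin")

key = "audit/2024-09-report.csv"
s3.upload_file("report.csv", "locked-bucket", key)

# Governance-mode retention blocks deletes until the date passes, unless a user
# with special permission bypasses it; compliance mode allows no overrides.
s3.put_object_retention(
    Bucket="locked-bucket", Key=key,
    Retention={"Mode": "GOVERNANCE",
               "RetainUntilDate": datetime.now(timezone.utc) + timedelta(days=365)})

# A legal hold keeps the object immutable independently of the retention date.
s3.put_object_legal_hold(Bucket="locked-bucket", Key=key,
                         LegalHold={"Status": "ON"})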
Software defined. Hadoop HDFS's successor isn't a hardware appliance.
It is software running on commodity hardware.
That is what Minio is, software.
Like Hadoop HDFS, Minio is designed to take full advantage of commodity servers. With the ability to leverage NVMe drives and 100 GbE networking,
Minio can shrink the data center, improving operational efficiency and manageability.
Secure. Minio supports multiple sophisticated server-side encryption schemes to protect data,
wherever it may be, in flight or at rest. Minio's approach assures confidentiality,
integrity, and authenticity with negligible performance overhead.
Server-side and client-side encryption are supported using AES-256-GCM, ChaCha20-Poly1305, and AES-CBC, ensuring application compatibility.
Furthermore, Minio supports industry-leading key management systems, KMS.
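As one illustration of server-side encryption from the client's point of view, here is a sketch of SSE-C with boto3; the endpoint, credentials, bucket, and key material are placeholders, and SSE-C generally requires a TLS endpoint.

# A sketch of server-side encryption with customer-provided keys (SSE-C).
import os
import boto3

s3 = boto3.client("s3", endpoint_url="https://minio.example.com",  # placeholder
                  aws_access_key_id="minioadmin",                   # placeholder
                  aws_secret_access_key="minioadmin")

key_material = os.urandom(32)   # 256-bit key supplied and kept by the client

s3.put_object(Bucket="secure-bucket", Key="finance/ledger.csv",
              Body=open("ledger.csv", "rb"),
              SSECustomerAlgorithm="AES256",
              SSECustomerKey=key_material)

# Reads must present the same key; the server encrypts and decrypts but never stores it.
obj = s3.get_object(Bucket="secure-bucket", Key="finance/ledger.csv",
                    SSECustomerAlgorithm="AES256",
                    SSECustomerKey=key_material)
data = obj["Body"].read()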
Migrating from Hadoop to Minio
The Minio team has expertise in migrating from HDFS to Minio.
Customers that purchase an enterprise license can get assistance from our engineers.
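The supported path is to work with those engineers, but as a rough illustration of what moving data from HDFS to object storage can look like, here is a pyarrow sketch; the namenode host, paths, bucket, and credentials are placeholders.

# An illustrative sketch, not the supported migration tooling: copy a directory
# tree from HDFS into an S3-compatible bucket with pyarrow.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020, user="hdfs")
minio = fs.S3FileSystem(access_key="minioadmin",                     # placeholder
                        secret_key="minioadmin",
                        endpoint_override="http://localhost:9000")

# Recursively copy an HDFS directory into the datalake bucket.
fs.copy_files("/data/warehouse/events",
              "datalake/warehouse/events",
              source_filesystem=hdfs,
              destination_filesystem=minio)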
To learn more about using Minio to replace HDFS, check out this collection of resources. Conclusion.
Every enterprise is a data enterprise at this point. The storage of that data and the subsequent
analysis need to be seamless, scalable, secure, and performant. The analytical tools spawned by
the Hadoop ecosystem, like Spark, are more effective and efficient when paired with object storage-based data lakes. Technologies like Flink improve overall performance by providing a single runtime for both streaming and batch processing, something that did not work well in the HDFS model.
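As a small illustration of that single-runtime idea, here is a PyFlink Table API sketch; the bucket path and table definition are placeholders, and the same query can run as a batch job by swapping the environment settings.

# A minimal PyFlink sketch: the same Table API code runs as streaming or batch.
from pyflink.table import EnvironmentSettings, TableEnvironment

# Flip between in_streaming_mode() and in_batch_mode(); the query is unchanged.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE events (
        id BIGINT,
        category STRING
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3a://datalake/events',
        'format' = 'parquet'
    )
""")
t_env.execute_sql(
    "SELECT category, COUNT(*) AS cnt FROM events GROUP BY category").print()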
Frameworks like Apache Arrow are redefining how data is stored and processed, and Iceberg and
Hudi are redefining how table formats allow for the efficient querying of data.
These technologies all require a modern, object-storage-based data lake where compute and storage are disaggregated and workload optimized. If you have any questions while
architecting your own modern data lake, please feel free to reach out to us at hello@min.io
or on our Slack channel. Thank you for listening to this Hackernoon story,
read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.