The Good Tech Companies - Migrating From Hadoop Without Rip and Replace Is Possible — Here's How
Episode Date: May 31, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/migrating-from-hadoop-without-rip-and-replace-is-possible-heres-how. Here's how to migrate from Hadoop without the need to completely overhaul your existing systems. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #hadoop, #apache-hadoop, #object-storage, #big-data, #s3-api, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com. MinIO is a modern replacement for HDFS that integrates with Spark and Hive. MinIO encrypts all data using per-object keys, ensuring robust protection against unauthorized access. S3A is an essential endpoint for applications seeking to transition away from Hadoop, offering compatibility with a wide array of applications.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Migrating from Hadoop without rip and replace is possible. Here's how.
By MinIO. We are still amazed at the number of customers who come to us looking to migrate from HDFS to modern object storage such as MinIO. We thought by now that everyone had made the transition, but every week we still speak to a major, highly technical organization that has only just decided to make the move. Quite often, in those discussions, there are elements of their
infrastructure that they want to maintain after their migration. There are frameworks and software
that came out of the HDFS ecosystem that have a lot of developer buy-in and still have a place
in the modern data stack. Indeed, we've often said that a lot of good has come out of the HDFS
ecosystem. The fundamental issue is with the closely coupled storage and compute not necessarily
with the tools and services that came from the big data era. This blog post will focus on how
you can make that migration without ripping out and replacing tools and services that have value.
The reality is that if you don't modernize your infrastructure, you can't make the advancements in AI/ML that your organization requires, but you don't have to throw everything out to get there. Disaggregation of storage and compute with Spark and Hive. We've already gone
through some strategies for a complete rip and replace migration, which in some cases is the
path forward. However, let's take a look at another way to modernize HDFS implementation.
This architecture involves Kubernetes managing Apache Spark and Apache Hive containers for data processing. Spark integrates natively with MinIO, while Hive uses YARN. MinIO handles object storage in stateful containers and, in this architecture, relies on multi-tenant configurations for data isolation.
Architectural overview:
Compute nodes: Kubernetes efficiently manages stateless Apache Spark and Apache Hive containers on compute nodes, ensuring optimal resource utilization and dynamic scaling.
Storage layer: MinIO's erasure coding and bit rot protection mean you may lose up to half of the drives and still recover, all without the need to maintain the three copies of each block of data that Hadoop requires.
Access layer: All access to MinIO object storage is unified through the S3 API, providing a seamless interface for interacting with stored data.
Security layer: Data security is paramount. MinIO encrypts all data using per-object keys, ensuring robust protection against unauthorized access.
Identity management: MinIO Enterprise fully integrates with identity providers such as WSO2, Keycloak, Okta, and Ping Identity to allow applications or users to authenticate.
The result is a fully modernized replacement for Hadoop that allows your organization to keep Hive, YARN, and any other Hadoop-ecosystem data product that can integrate with object storage, which is almost everything in the modern data stack.
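To make the disaggregation concrete, here is a minimal sketch of a Spark job that treats MinIO as its storage layer over S3A. The endpoint, credentials, and bucket name are placeholder assumptions rather than values from the original post, and the hadoop-aws connector jars are assumed to be on Spark's classpath.

```python
from pyspark.sql import SparkSession

# A minimal sketch: Spark (compute) talks to MinIO (storage) over S3A.
# The endpoint, credentials, and bucket below are placeholder assumptions.
spark = (
    SparkSession.builder
    .appName("spark-on-minio-sketch")
    # Point the S3A connector at the MinIO service instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
    # MinIO deployments are typically addressed path-style.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Compute and storage are now independent: the same job could run on any
# Kubernetes-scheduled Spark cluster against the same MinIO bucket.
df = spark.read.parquet("s3a://example-bucket/events/")
df.groupBy("event_type").count().show()
```

Because storage is addressed purely through the S3 API, a job like this runs unchanged whether Spark is scheduled on Kubernetes, on YARN, or locally.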
Interoperability in the access layer. S3A is an essential endpoint for applications seeking to transition away from Hadoop, offering compatibility with a wide array of applications within the Hadoop ecosystem. Since 2006, S3-compatible object storage backends have been seamlessly integrated into numerous data platforms within the Hadoop ecosystem as a default feature. This integration traces back to the incorporation of an S3 client implementation in emerging technologies. Across all Hadoop-related platforms, the adoption of the hadoop-aws module is standard practice, ensuring robust support for the S3 API. This standardized approach facilitates the smooth transition of applications between HDFS and S3 storage backends. By simply specifying the appropriate protocol, developers can effortlessly switch applications from Hadoop to modern object storage. The protocol scheme for S3 is s3a://, while for HDFS it is hdfs://.
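In practice, that switch is often nothing more than the URI scheme on the paths an application reads and writes. A hedged before-and-after, reusing the session from the earlier sketch (both paths are hypothetical):

```python
# Before: the application reads from HDFS.
df = spark.read.parquet("hdfs://namenode:8020/warehouse/events/")

# After: the same application reads from MinIO. Only the scheme and
# authority on the path change; the processing code stays identical.
df = spark.read.parquet("s3a://example-bucket/warehouse/events/")
```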
Benefits of migration. It's possible to talk at length about the benefits of migrating off of Hadoop onto modern object storage.
If you're reading this, you are already largely aware that without migrating off of legacy platforms like Hadoop, advances in AI and other modern data products will likely be off the table.
The reason distills down to performance and scale.
There is absolutely no question that modern workloads require outstanding performance
to compete with the volume of data being processed and the complexity of tasks now being required. When performance is not just
about vanity benchmarking, but a hard requirement, the field of contenders for Hadoop replacements
drops off dramatically. The other element driving migrations forward is cloud native scale.
When the concept of the cloud is less a physical location and more an operating model, it becomes possible to do things like deploy an entire data stack in minutes from a single YAML file, a deployment so swift it would make any Hadoop engineer fall off their chair.
Part and parcel of this concept is the economic benefits that come from the release from vendor
lock-in, which allows an organization to pick and choose best-in-class options for specific
workloads.
Not to mention the release from maintaining three separate copies of data for protection, now a thing of the past thanks to active-active replication and erasure coding.
Investing in future-proof technology usually also means it's easier to find and recruit talented professionals to work on your infrastructure. People want to work on
things that drive a business forward, and there is almost nothing that does that better than data. Together, these factors contribute to
a data stack that is not only faster and cheaper but also better suited for today's and tomorrow's
data-driven needs. Getting started. Before diving into the specifics of our architecture, you'll need to get a few components up and running. To migrate off of Hadoop, you'll obviously have had to have installed it to begin with. If you want to simulate this experience, you can start this tutorial by setting up the Hortonworks distribution of Hadoop here. Otherwise, you can begin with the following installation steps:
1. Set up Ambari. Next, install Ambari, which will simplify the management of your services by automatically configuring YARN for you. Ambari provides a user-friendly dashboard to manage services in the Hadoop ecosystem and keep everything running smoothly.
2. Install Apache Spark.
Spark is essential for processing large-scale data.
Follow the standard installation procedures to get Spark up and running.
3. Install MinIO.
Depending on your environment, you can choose between two installation approaches,
Kubernetes or a Helm chart.
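Once MinIO is running, a quick way to confirm the deployment is reachable is the MinIO Python SDK. A small sketch, with placeholder endpoint, credentials, and bucket name:

```python
from minio import Minio

# Placeholder endpoint and credentials; substitute your deployment's values.
client = Minio(
    "minio.example.internal:9000",
    access_key="MINIO_ACCESS_KEY",
    secret_key="MINIO_SECRET_KEY",
    secure=False,  # set True if MinIO is served over TLS
)

# Create a bucket for the migrated data if it doesn't exist yet.
bucket = "example-bucket"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)
print(f"MinIO reachable; bucket '{bucket}' ready.")
```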
After successfully installing these elements, you can configure Spark and Hive to use MinIO instead of HDFS. Navigate to the Ambari UI at http://<ambari-server>:8080 and log in using Ambari's default credentials. Navigate to Services, then HDFS, then to the Configs panel.
In this section, you are configuring Ambari to use S3A with MinIO instead of HDFS. Scroll down and navigate to the custom core-site settings; this is where you'll configure S3A.
From here, your configuration will depend on your infrastructure, but the below could represent one way to configure S3A with MinIO running on 12 nodes and 1.2 TB of memory. There are quite a few optimizations that can be explored by checking out the documentation on this migration pattern here, and also in Hadoop's documentation on S3A here and here. Restart all services when you're satisfied with the config. You'll also need to navigate to the Spark2 config panel. Scroll down to the custom spark-defaults settings and add the following property to configure Spark with MinIO. Restart all services after the config changes have been applied.
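The audio elides the actual property names and values, so the following is only a plausible sketch of what the custom core-site and spark-defaults entries might look like for MinIO; every value is an illustrative assumption, not the original post's configuration:

```python
# Hedged illustration only: plausible Ambari "Custom core-site" entries for
# S3A against MinIO. All values are assumptions, not the post's originals.
custom_core_site = {
    "fs.s3a.endpoint": "http://minio.example.internal:9000",
    "fs.s3a.access.key": "MINIO_ACCESS_KEY",
    "fs.s3a.secret.key": "MINIO_SECRET_KEY",
    "fs.s3a.path.style.access": "true",   # MinIO is usually path-style
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.connection.maximum": "100",   # tune for your node/memory budget
}

# Likewise for the Spark2 panel: a spark-defaults entry that points Spark's
# Hadoop configuration at the same S3A filesystem implementation.
custom_spark_defaults = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}
```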
Navigate to the Hive panel to finish up the configuration. Scroll down to the custom hive-site settings and add the following property; you can find more fine-tuning configuration information here. Restart all services after the config changes have been made. That's it! You can now test out your integration; a quick end-to-end sketch follows.
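As a hedged end-to-end check, assuming the placeholder endpoint and bucket from earlier, you might write a tiny dataset to MinIO, register it as a Hive external table, and read it back:

```python
from pyspark.sql import SparkSession

# Hedged smoke test: all endpoints, keys, and names are placeholders.
spark = (
    SparkSession.builder
    .appName("hive-on-minio-smoke-test")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .enableHiveSupport()  # requires a reachable Hive metastore
    .getOrCreate()
)

# Write a tiny dataset to MinIO, expose it as a Hive external table,
# then read it back; if this round-trips, the integration works.
spark.createDataFrame([(1, "ok")], ["id", "status"]) \
    .write.mode("overwrite").parquet("s3a://example-bucket/smoke_test/")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS smoke_test (id INT, status STRING)
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/smoke_test/'
""")
spark.sql("SELECT * FROM smoke_test").show()
```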
Explore on your own. This blog post has outlined a modern approach to migrating from Hadoop without the need to completely overhaul your existing systems. By leveraging Kubernetes to manage Apache Spark and Apache Hive, and integrating MinIO for stateful object storage,
organizations can achieve a balanced architecture that supports dynamic scaling and efficient
resource utilization. This setup not only retains but enhances the capabilities of your data
processing environments, making them more robust and future-proof.
With MinIO, you benefit from a storage solution that offers high performance on commodity hardware, reduces costs through erasure coding by eliminating the redundancy of Hadoop's data replication, and bypasses limitations like vendor lock-in and the need for Cassandra-based metadata stores. These advantages are crucial for organizations looking to leverage advanced AI/ML workloads without discarding the core elements of their existing data systems.
Feel free to reach out for more detailed discussions or specific guidance on how you can tailor this migration strategy to meet the unique needs of your organization.
Whether through email at hello at min.io or on our community Slack channel,
we're here to help you make the most
of your data infrastructure investments. Thank you for listening to this Hackernoon story,
read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.