The Good Tech Companies - Migrating From Hadoop Without Rip and Replace Is Possible — Here's How
Episode Date: May 31, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/migrating-from-hadoop-without-rip-and-replace-is-possible-heres-how. Here's how to migrate from Hadoop without the need to completely overhaul your existing systems. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #hadoop, #apache-hadoop, #object-storage, #big-data, #s3-api, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com. MinIO is a modern replacement for HDFS that integrates with Spark and Hive. MinIO encrypts all data using per-object keys, ensuring robust protection against unauthorized access. S3A is an essential endpoint for applications seeking to transition away from Hadoop, offering compatibility with a wide array of applications.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Migrating from Hadoop without rip and replace is possible. Here's how.
By MinIO. We are still amazed at the number of customers who come to us looking to migrate from HDFS to modern object storage such as MinIO. We thought by now that everyone had made the transition, but every week we still speak to a major, highly technical organization that has only just decided to make the move. Quite often, in those discussions, there are elements of their
infrastructure that they want to maintain after their migration. There are frameworks and software
that came out of the HDFS ecosystem that have a lot of developer buy-in and still have a place
in the modern data stack. Indeed, we've often said that a lot of good has come out of the HDFS
ecosystem. The fundamental issue is with the closely coupled storage and compute not necessarily
with the tools and services that came from the big data era. This blog post will focus on how
you can make that migration without ripping out and replacing tools and services that have value.
The reality is that if you don't modernize your infrastructure, you can't make the advancements in AI/ML that your organization requires, but you don't have to throw everything out to get there. Disaggregation of storage and compute with Spark and Hive. We've already gone
through some strategies for a complete rip and replace migration, which in some cases is the
path forward. However, let's take a look at another way to modernize HDFS implementation.
This architecture involves Kubernetes managing Apache Spark and Apache Hive containers for data processing. Spark integrates natively with MinIO, while Hive uses YARN. MinIO handles object storage in stateful containers and, in this architecture, relies on multi-tenant configurations for data isolation.
Architectural overview:
Compute nodes: Kubernetes efficiently manages stateless Apache Spark and Apache Hive containers on compute nodes, ensuring optimal resource utilization and dynamic scaling.
Storage layer: MinIO's erasure coding and bit rot protection mean you may lose up to half of the drives and still recover, all without the need to maintain the three copies of each block of data that Hadoop requires.
Access layer: All access to MinIO object storage is unified through the S3 API, providing a seamless interface for interacting with stored data.
Security layer: Data security is paramount. MinIO encrypts all data using per-object keys, ensuring robust protection against unauthorized access.
Identity management: MinIO Enterprise fully integrates with identity providers such as WSO2, Keycloak, Okta, and Ping Identity to allow applications or users to authenticate.
The result is a fully modernized replacement for Hadoop that allows your organization to keep Hive, YARN, and any other Hadoop-ecosystem data product that can integrate with object storage, which is almost everything in the modern data stack.
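To make the disaggregation concrete, here is a minimal sketch of a Spark job that treats MinIO as its storage layer over S3A. The endpoint, credentials, and bucket name are placeholder assumptions rather than values from the original post, and the hadoop-aws connector jars are assumed to be on Spark's classpath.

```python
from pyspark.sql import SparkSession

# A minimal sketch: Spark (compute) talks to MinIO (storage) over S3A.
# The endpoint, credentials, and bucket below are placeholder assumptions.
spark = (
    SparkSession.builder
    .appName("spark-on-minio-sketch")
    # Point the S3A connector at the MinIO service instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
    # MinIO deployments are typically addressed path-style.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Compute and storage are now independent: the same job could run on any
# Kubernetes-scheduled Spark cluster against the same MinIO bucket.
df = spark.read.parquet("s3a://example-bucket/events/")
df.groupBy("event_type").count().show()
```

Because storage is addressed purely through the S3 API, a job like this runs unchanged whether Spark is scheduled on Kubernetes, on YARN, or locally.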
Interoperability in the access layer. S3A is an essential endpoint for applications seeking to transition away from Hadoop, offering compatibility with a wide array of applications within the Hadoop ecosystem. Since 2006, S3-compatible object storage backends have been seamlessly integrated into numerous data platforms within the Hadoop ecosystem as a default feature. This integration traces back to the incorporation of an S3 client implementation in emerging technologies. Across all Hadoop-related platforms, the adoption of the hadoop-aws module is standard practice, ensuring robust support for the S3 API. This standardized approach facilitates the smooth transition of applications between HDFS and S3 storage backends. By simply specifying the appropriate protocol, developers can effortlessly switch applications from Hadoop to modern object storage. The protocol scheme for S3 is s3a://, while for HDFS it is hdfs://.
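In practice, that switch is often nothing more than the URI scheme on the paths an application reads and writes. A hedged before-and-after, reusing the session from the earlier sketch (both paths are hypothetical):

```python
# Before: the application reads from HDFS.
df = spark.read.parquet("hdfs://namenode:8020/warehouse/events/")

# After: the same application reads from MinIO. Only the scheme and
# authority on the path change; the processing code stays identical.
df = spark.read.parquet("s3a://example-bucket/warehouse/events/")
```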
Benefits of migration. It's possible to talk at length about the benefits of migrating off of Hadoop onto modern object storage.
If you're reading this, you are already largely aware that without migrating off of legacy platforms like Hadoop, advances in AI and other modern data products will likely be off the table.
The reason distills down to performance and scale.
There is absolutely no question that modern workloads require outstanding performance
to compete with the volume of data being processed and the complexity of tasks now being required. When performance is not just
about vanity benchmarking, but a hard requirement, the field of contenders for Hadoop replacements
drops off dramatically. The other element driving migrations forward is cloud native scale.
When the concept of the cloud is less a physical location and more an operating model, it becomes possible to do things like deploy an entire data stack in minutes from a single YAML file, a deployment so swift it would make any Hadoop engineer fall off their chair.
Part and parcel of this concept is the economic benefits that come from the release from vendor
lock-in, which allows an organization to pick and choose best-in-class options for specific
workloads.
Not to mention the release from maintaining three separate copies of data for protection, now a thing of the past thanks to active-active replication and erasure coding.
Investing in future-proof technology usually also means it's easier to find and recruit talented professionals to work on your infrastructure. People want to work on
things that drive a business forward, and there is almost nothing that does that better than data. Together, these factors contribute to
a data stack that is not only faster and cheaper but also better suited for today's and tomorrow's
data-driven needs. Getting started. Before diving into the specifics of our architecture, you'll need to get a few components up and running. To migrate off of Hadoop, you'll obviously have had to have installed it to begin with. If you want to simulate this experience, you can start this tutorial by setting up the Hortonworks distribution of Hadoop here. Otherwise, you can begin with the following installation steps:
1. Set up Ambari. Next, install Ambari, which will simplify the management of your services by automatically configuring YARN for you. Ambari provides a user-friendly dashboard to manage services in the Hadoop ecosystem and keep everything running smoothly.
2. Install Apache Spark.
Spark is essential for processing large-scale data.
Follow the standard installation procedures to get Spark up and running.
3. Install MinIO.
Depending on your environment, you can choose between two installation approaches,
Kubernetes or a Helm chart.
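Once MinIO is running, a quick way to confirm the deployment is reachable is the MinIO Python SDK. A small sketch, with placeholder endpoint, credentials, and bucket name:

```python
from minio import Minio

# Placeholder endpoint and credentials; substitute your deployment's values.
client = Minio(
    "minio.example.internal:9000",
    access_key="MINIO_ACCESS_KEY",
    secret_key="MINIO_SECRET_KEY",
    secure=False,  # set True if MinIO is served over TLS
)

# Create a bucket for the migrated data if it doesn't exist yet.
bucket = "example-bucket"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)
print(f"MinIO reachable; bucket '{bucket}' ready.")
```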
After successfully installing these elements, you can configure Spark and Hive to use MinIO instead of HDFS. Navigate to the Ambari UI at http://<ambari-server>:8080 and log in using Ambari's default credentials. Navigate to Services, then HDFS, then to the Configs panel.
In this section, you are configuring Ambari to use S3A with MinIO instead of HDFS. Scroll down and navigate to the custom core-site settings; this is where you'll configure S3A.
From here, your configuration will depend on your infrastructure, but the below could represent one way to configure S3A with MinIO running on 12 nodes and 1.2 TB of memory. There are quite a few optimizations that can be explored by checking out the documentation on this migration pattern here, and also in Hadoop's documentation on S3A here and here. Restart all services when you're satisfied with the config. You'll also need to navigate to the Spark2 config panel. Scroll down to the custom spark-defaults settings and add the following property to configure Spark with MinIO. Restart all services after the config changes have been applied.
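The audio elides the actual property names and values, so the following is only a plausible sketch of what the custom core-site and spark-defaults entries might look like for MinIO; every value is an illustrative assumption, not the original post's configuration:

```python
# Hedged illustration only: plausible Ambari "Custom core-site" entries for
# S3A against MinIO. All values are assumptions, not the post's originals.
custom_core_site = {
    "fs.s3a.endpoint": "http://minio.example.internal:9000",
    "fs.s3a.access.key": "MINIO_ACCESS_KEY",
    "fs.s3a.secret.key": "MINIO_SECRET_KEY",
    "fs.s3a.path.style.access": "true",   # MinIO is usually path-style
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.connection.maximum": "100",   # tune for your node/memory budget
}

# Likewise for the Spark2 panel: a spark-defaults entry that points Spark's
# Hadoop configuration at the same S3A filesystem implementation.
custom_spark_defaults = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}
```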
Navigate to the Hive panel to finish up the configuration. Scroll down to the custom hive-site settings and add the following property; you can find more fine-tuning configuration information here. Restart all services after the config changes have been made. That's it! You can now test out your integration; a quick end-to-end sketch follows.
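As a hedged end-to-end check, assuming the placeholder endpoint and bucket from earlier, you might write a tiny dataset to MinIO, register it as a Hive external table, and read it back:

```python
from pyspark.sql import SparkSession

# Hedged smoke test: all endpoints, keys, and names are placeholders.
spark = (
    SparkSession.builder
    .appName("hive-on-minio-smoke-test")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .enableHiveSupport()  # requires a reachable Hive metastore
    .getOrCreate()
)

# Write a tiny dataset to MinIO, expose it as a Hive external table,
# then read it back; if this round-trips, the integration works.
spark.createDataFrame([(1, "ok")], ["id", "status"]) \
    .write.mode("overwrite").parquet("s3a://example-bucket/smoke_test/")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS smoke_test (id INT, status STRING)
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/smoke_test/'
""")
spark.sql("SELECT * FROM smoke_test").show()
```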
Explore on your own. This blog post has outlined a modern approach to migrating from Hadoop without the need to completely overhaul your existing systems. By leveraging Kubernetes to manage Apache Spark and Apache Hive, and integrating MinIO for stateful object storage,
organizations can achieve a balanced architecture that supports dynamic scaling and efficient
resource utilization. This setup not only retains but enhances the capabilities of your data
processing environments, making them more robust and future-proof.
With MinIO, you benefit from a storage solution that offers high performance on commodity hardware, reduces costs through erasure coding by eliminating the redundancy of Hadoop's data replication, and bypasses limitations like vendor lock-in and the need for Cassandra-based metadata stores. These advantages are crucial for organizations looking to leverage advanced AI/ML workloads without discarding the core elements of their existing data systems.
Feel free to reach out for more detailed discussions or specific guidance on how you can tailor this migration strategy to meet the unique needs of your organization.
Whether through email at hello at min.io or on our community Slack channel,
we're here to help you make the most
of your data infrastructure investments. Thank you for listening to this Hackernoon story,
read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.