The Good Tech Companies - An Architect’s Guide to the Top 10 Tools Needed to Build the Modern Data Lake

Episode Date: August 27, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/an-architects-guide-to-the-top-10-tools-needed-to-build-the-modern-data-lake. Here is a list of vendors and tools needed to build the modern data lake, with each entry a capability needed to support generative AI. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #generative-ai, #minio, #minio-blog, #modern-datalake, #restful-apis, #data-lakehouse, #data-storage, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. An architect's guide to the top 10 tools needed to build the modern data lake, by MinIO. I previously wrote about the modern data lake reference architecture, addressing the challenges facing every enterprise: more data, aging Hadoop tooling (specifically HDFS), and greater demands for RESTful APIs (S3) and performance. But I want to fill in some gaps. The modern data lake, sometimes referred to as a data lakehouse, is one-half data lake and one-half Open Table Format (OTF)-based data warehouse. Both are built on modern object storage. Concurrently, we have thought deeply about how organizations can build AI data infrastructure
Starting point is 00:00:44 that can support all your AI/ML needs, not just the raw storage of your training sets, validation sets, and test sets. In other words, it should contain the compute needed to train large language models, MLOps tooling, distributed training, and more. As a result of this line of thought, we put together another paper on how to use the modern data lake reference architecture to support your AI/ML needs. The graphic below illustrates the modern data lake reference architecture with the capabilities needed for generative AI highlighted. (Source: AI/ML within a modern data lake.) Both of those papers do not mention specific vendors or tools. I now want to discuss the vendors and tools needed to build the modern data lake. In this top 10 list, each entry is a capability needed to support generative AI.
Starting point is 00:01:32 1. Data Lake. Enterprise data lakes are built on object storage. Not your old-school, appliance-based object storage that served cheap-and-deep archival use cases, but modern, performant, software-defined, and Kubernetes-native object stores, the cornerstones of the modern GenAI stack. They are available as a service (AWS, GCP, Azure), on-premises, or hybrid (both), such as MinIO. These data lakes have to support streaming workloads, must have highly efficient encryption and erasure coding, need to store metadata atomically with the object, and support technologies like lambda compute. Given that these modern alternatives are cloud-native, they will integrate with the entire stack of other cloud-native technologies, from firewalls to observability to user and access management, right out of the box.
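To make the "store metadata atomically with the object" requirement concrete, here is a deliberately toy, in-memory sketch. This is not the API of MinIO or any real object store; it only illustrates the contract that an object's bytes and its metadata are committed together in a single step, so a reader can never observe one without the other.

```python
class ToyObjectStore:
    """Illustrative only: an S3-style key/value store where each object's
    bytes and user metadata are committed together in one atomic step."""

    def __init__(self):
        self._objects = {}  # "bucket/key" -> (data, metadata)

    def put_object(self, bucket, key, data, metadata=None):
        # Data and metadata are stored as a single immutable record, so there
        # is no window where the object exists without its metadata.
        self._objects[f"{bucket}/{key}"] = (bytes(data), dict(metadata or {}))

    def get_object(self, bucket, key):
        data, metadata = self._objects[f"{bucket}/{key}"]
        return data, metadata


# Hypothetical bucket, key, and metadata, purely for illustration.
store = ToyObjectStore()
store.put_object("training-data", "batch-001.parquet", b"\x00\x01",
                 metadata={"dataset": "validation", "rows": "1024"})
data, meta = store.get_object("training-data", "batch-001.parquet")
print(meta["dataset"])  # validation
```

A real object store exposes the same put/get shape over a RESTful S3 API, which is why cloud-native tooling integrates with it so readily.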
Starting point is 00:02:19 2. OTF-Based Data Warehouse. Object storage is also the underlying storage solution for an OTF-based data warehouse. Using object storage for a data warehouse may sound odd, but a data warehouse built this way represents the next generation of data warehouses. This is made possible by the OTF specifications authored by Netflix, Uber, and Databricks, which make it seamless to employ object storage within a data warehouse. The OTFs, Apache Iceberg, Apache Hudi, and Delta Lake, were written because there were no products on the market that could handle their creators' data needs. Essentially, what they all do, in different ways, is define a data warehouse that can be built on top of object storage. Object storage provides the combination of scalable capacity and high performance that
Starting point is 00:03:05 other storage solutions cannot. Since these are modern specifications, they have advanced features that old-fashioned data warehouses do not have, such as partition evolution, schema evolution, and zero-copy branching. Two MinIO partners that can run their OTF-based data warehouse on top of MinIO are Dremio and Starburst: Dremio Sonar (data warehouse processing engine), Dremio Arctic (data warehouse catalog), and the Starburst Open Data Lakehouse (catalog and processing engine). 3. Machine Learning Operations (MLOps). MLOps is to machine learning what DevOps is to traditional software development. Both are a set of practices and principles aimed at improving collaboration between engineering teams (the dev, or ML) and IT operations (ops) teams.
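As a down-to-earth sketch of what MLOps tracking tools do under the hood, here is a minimal, hypothetical experiment tracker in plain Python. Real tools like MLflow, MLRun, and Kubeflow do far more (UIs, model registries, pipelines), and the parameter values and artifact URI below are made up for illustration:

```python
import json
import tempfile
import time
from pathlib import Path


class ToyExperimentTracker:
    """Hypothetical stand-in for an MLOps tracker: it records the params,
    metrics, and artifact locations of each run so that experiments stay
    comparable and reproducible."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def log_run(self, params, metrics, artifacts=()):
        # One JSON record per run; real trackers back this with a database
        # and store large artifacts in object storage such as MinIO.
        run_id = f"run-{int(time.time() * 1000)}"
        record = {"run_id": run_id, "params": params,
                  "metrics": metrics, "artifacts": list(artifacts)}
        (self.root / f"{run_id}.json").write_text(json.dumps(record, indent=2))
        return run_id


tracker = ToyExperimentTracker(tempfile.mkdtemp())
run_id = tracker.log_run(
    params={"lr": 0.001, "epochs": 10},            # hypothetical hyperparameters
    metrics={"val_accuracy": 0.91},                # hypothetical result
    artifacts=["s3://models/classifier-v1.pt"],    # hypothetical artifact URI
)
record = json.loads((tracker.root / f"{run_id}.json").read_text())
print(record["metrics"]["val_accuracy"])
```

The point is the workflow, not the code: every training run leaves behind a queryable record tying hyperparameters to results and artifacts.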
Starting point is 00:03:51 The goal is to streamline the development lifecycle using automation, from planning and development to deployment and operations. One of the primary benefits of these approaches is continuous improvement. MLOps techniques and features are constantly evolving. You want a tool that is backed by a major player, ensuring that the tool is under constant development and improvement and will offer long-term support. Each of these tools uses MinIO under the hood to store artifacts used during a model's lifecycle: MLRun (Iguazio, acquired by McKinsey & Company), MLflow (Databricks), and Kubeflow (Google). 4. Machine Learning Framework. Your machine learning framework is the library, usually for Python, that you use to create your models and write the code that trains them.
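To see what such a framework automates, here is a deliberately framework-free sketch: a one-parameter linear model trained by gradient descent, with the gradient derived by hand. This hand bookkeeping is exactly what the loss functions, optimizers, and auto-differentiation of PyTorch and TensorFlow handle for you.

```python
# Illustrative only: fitting y = w * x with a hand-derived gradient.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true relationship: y = 2x

w = 0.0                      # the single model parameter
lr = 0.01                    # learning rate; the "optimizer" is plain SGD
for _ in range(200):
    # Mean squared error loss: L = mean((w*x - y)^2)
    # dL/dw = mean(2 * (w*x - y) * x)  -- derived by hand here; an ML
    # framework's auto-differentiation computes this for you.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad           # gradient descent step

print(round(w, 3))  # converges to 2.0
```

With a real framework, the same loop becomes a few calls into built-in losses and optimizers, and the tensors involved can live on a GPU.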
Starting point is 00:04:36 These libraries are rich in features, as they provide a collection of different loss functions, optimizers, data transformation tools, and pre-built layers for neural networks. The most important feature these libraries provide is a tensor. Tensors are multi-dimensional arrays that can be moved onto the GPU. They also have auto-differentiation, which is used during model training. The two most popular machine learning frameworks today are PyTorch (from Facebook) and TensorFlow (from Google). 5. Distributed Training. Distributed model training is the process of simultaneously training machine learning models across multiple computational devices or nodes. This approach speeds up the training process,
Starting point is 00:05:18 particularly when large datasets are needed to train complex models. In distributed model training, the dataset is divided into smaller subsets, and each subset is processed by different nodes in parallel. These nodes can be individual machines within a cluster, individual processes or individual pods within a Kubernetes cluster. They may have access to GPUs. Each node independently processes its subset of data and updates the model parameters accordingly. The five libraries below insulate developers from most of the complexity of distributed training. You can run them locally if you do not have a cluster, but you will need a cluster to see a
Starting point is 00:05:55 notable reduction in training time. DeepSpeed (from Microsoft), Horovod (from Uber), Ray (from AnyScale), Spark PyTorch Distributor (from Databricks), and Spark TensorFlow Distributor (from Databricks). 6. Model Hub. A model hub is not really part of the modern data lake reference architecture, but I am including it anyway because it is important for quickly getting started with generative AI. Hugging Face has become the place to go for large language models. Hugging Face hosts a model hub where engineers can download pre-trained models and share models they create themselves. Hugging Face is also the author of the Transformers and the Datasets
Starting point is 00:06:34 libraries, which work with large language models (LLMs) and the data used to train and fine-tune them. There are other model hubs. All the major cloud vendors have some way to upload and share models, but Hugging Face, with its collection of models and libraries, has become the leader in this space. Hugging Face. 7. Application Framework. An application framework helps incorporate an LLM into an application. Using an LLM is dissimilar to using a standard API. Much work must be done to turn a user request into something the LLM can understand and process. For example, if you build a chat application and you want to use retrieval augmented generation (RAG), then you will need to tokenize the request, turn the tokens into
Starting point is 00:07:17 a vector, integrate with a vector database (described below), create a prompt, and then call your LLM. An application framework for generative AI will allow you to chain these actions together. The most widely used application framework today is LangChain. It has integrations with other technologies, for example, the Hugging Face Transformers library and Unstructured's library for document processing. It is feature-rich and can be a little complicated to use, so listed below are some alternatives for those who do not have complex requirements and want something simpler than LangChain: LangChain, AgentGPT, AutoGPT, BabyAGI, Flowise, GradientJ, LlamaIndex, Langdock, TensorFlow (Keras API). 8. Document Processing. Most organizations do not have a single
Starting point is 00:08:03 repository with clean and accurate documents. Rather, documents are spread across the organization in various team portals in many formats. The first step when getting ready for generative AI is to build a pipeline that takes only documents that have been approved for use with generative AI and places them in your vector database. This could potentially be the hardest task of a generative AI solution for large global organizations. A document pipeline should convert the documents to text, chunk the document and run the chunked text through an embedding model so its vector representation can be saved into a vector database. Fortunately, a few open-source libraries can do this for many of the common document formats. A few libraries are
Starting point is 00:08:45 listed below. These libraries could be used with LangChain to build a full document processing pipeline: Unstructured, Open-Parse. 9. Vector Databases. Vector databases facilitate semantic search. Understanding how this is done requires a lot of mathematical background and is complicated. However, semantic search is conceptually easy to understand. Let's say you want to find all documents that discuss anything related to artificial intelligence. To do this in a conventional database, you would need to search for every possible abbreviation, synonym, and related term of artificial intelligence. Your query would look something like this. Not only is this manual similarity search arduous and prone to error, but the search itself is very slow. A vector database can take a request like
Starting point is 00:09:30 below and run the query faster and with greater accuracy. The ability to run semantic queries quickly and accurately is important if you wish to use retrieval augmented generation. Four popular vector databases are listed below: Milvus, pgvector, Pinecone, Weaviate. 10. Data Exploration and Visualization. It is always a good idea to have tools that allow you to wrangle your data and visualize it in different ways. The Python libraries listed below provide data manipulation and visualization capabilities. These may seem like tools you only need for traditional AI, but they also come in handy with generative AI. For example, if you are doing sentiment analysis or emotion detection, then you should check your training, validation, and test sets to make sure you have a proper
Starting point is 00:10:14 distribution across all your classes. Pandas, Matplotlib, Seaborn, Streamlit. Conclusion. There you have it: 10 capabilities that can be found in the modern data lake reference architecture, along with concrete vendor products and libraries for each capability. Below is a table summarizing these tools. 1. Data Lake: MinIO, AWS, GCP, Azure. 2. OTF-Based Data Warehouse: Dremio, Dremio Sonar, Dremio Arctic, Starburst, Starburst Open Data Lakehouse. 3. Machine Learning Framework: PyTorch, TensorFlow. 4. Machine Learning Operations: MLRun (McKinsey & Company), MLflow (Databricks), Kubeflow (Google). 5. Distributed Training: DeepSpeed (from Microsoft), Horovod
Starting point is 00:11:02 (from Uber), Ray (from AnyScale), Spark PyTorch Distributor (from Databricks), Spark TensorFlow Distributor (from Databricks). 6. Model Hub: Hugging Face. 7. Application Framework: LangChain, AgentGPT, AutoGPT, BabyAGI, Flowise, GradientJ, LlamaIndex, Langdock, TensorFlow (Keras API). 8. Document Processing: Unstructured, Open-Parse. 9. Vector Database: Milvus, pgvector, Pinecone, Weaviate. 10. Data Exploration and Visualization: Pandas, Matplotlib, Seaborn, Streamlit. Thank you for listening to this HackerNoon story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn, and publish.
