The Good Tech Companies - How to Deploy MinIO and Trino with Kubernetes
Episode Date: May 23, 2024
This story was originally published on HackerNoon at: https://hackernoon.com/how-to-deploy-minio-and-trino-with-kubernetes. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #trino, #kubernetes, #sql-query-engine, #database, #aiml, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com. With the ability to handle significant workloads across AI/ML and analytics, MinIO effortlessly supports Trino queries and beyond.
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How to deploy MinIO and Trino with Kubernetes, by MinIO.
Trino, formerly Presto, is a SQL query engine, not a SQL database. Trino has eschewed the storage component of the SQL database to focus on only one thing: ultra-fast SQL querying. Trino is just a query engine and does not store data. Instead, Trino interacts with various databases or works directly on object storage. Trino parses and analyzes the SQL query you pass in, creates and optimizes a query execution plan that includes the data sources, and then schedules worker nodes that are able to intelligently query the underlying databases they connect to.

MinIO is frequently used to store data from AI/ML workloads, from data lakes to lakehouses, whether it be Dremio, Hive, Hudi, StarRocks or any of the other dozen or so great AI/ML tool solutions. MinIO is more efficient when used as the primary storage layer, which decreases total cost of ownership for the data stored, plus you get the added benefits of writing data to MinIO that is immutable, versioned and protected by erasure coding. In addition, saving data to MinIO object storage makes it available to other cloud-native machine learning and analytics applications.
In this tutorial, we'll deploy a cohesive system that allows distributed SQL querying across large datasets stored in MinIO, with Trino leveraging metadata from Hive Metastore and table schemas from Redis.

Components
Here are the different components and what they do in the setup process we'll go through next.
MinIO. MinIO can be used to store large datasets, like the ones typically analyzed by Trino.
Hive Metastore. Hive Metastore is a service that stores metadata for Hive tables, like table schema. Trino can use Hive Metastore to determine the schema of tables when querying datasets.
PostgreSQL for Hive Metastore. This is the database backend for the Hive Metastore. It's where the metadata is actually stored.
Redis. In this setup, Redis stores table schemas for Trino.
Trino. Trino, formerly known as Presto, is a high-performance, distributed SQL query engine. It allows querying data across various data sources like SQL databases, NoSQL databases, and even object storage like MinIO.
Prerequisites
Before starting, ensure you have the necessary tools installed for managing your Kubernetes cluster.
kubectl. The primary command line tool for managing Kubernetes clusters. You can use it to inspect, manipulate, and administer cluster resources.
Helm. A package manager for Kubernetes, Helm allows you to deploy, upgrade, and manage applications within your cluster using predefined charts.

Repository cloning
To access the resources needed for deploying Trino on Kubernetes, clone the specific GitHub repository and navigate to the appropriate directory.
Kubernetes namespace creation
Namespaces in Kubernetes provide isolated environments for applications. Create a new namespace for Trino to encapsulate its deployment.

Redis table definition secret
Redis will store table schemas used by Trino. Secure these schemas with a Kubernetes secret. A single command creates a generic secret, sourcing its data from a JSON file.
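The narration skips the commands themselves, so here is a minimal sketch of the namespace and secret steps, written as a small Python script that only prints the commands rather than running them. The table name, column names, file name and secret name are all hypothetical. The JSON layout follows the table description format of the Trino Redis connector as we understand it, so check it against the connector documentation before relying on it.

```python
import json
import shlex

NAMESPACE = "trino"  # namespace name used throughout the walkthrough

# Hypothetical table description: one JSON document per table, telling
# Trino how Redis keys and values map to columns.
table_definition = {
    "tableName": "example_table",   # hypothetical table name
    "schemaName": "default",        # hypothetical schema name
    "key": {
        "dataFormat": "raw",
        "fields": [{"name": "redis_key", "type": "VARCHAR(64)"}],
    },
    "value": {
        "dataFormat": "json",
        "fields": [
            {"name": "id", "mapping": "id", "type": "BIGINT"},
            {"name": "name", "mapping": "name", "type": "VARCHAR"},
        ],
    },
}

# Save this text as example_table.json (hypothetical file name).
definition_json = json.dumps(table_definition, indent=2)

commands = [
    ["kubectl", "create", "namespace", NAMESPACE],
    # "redis-table-definition" is a hypothetical secret name.
    ["kubectl", "create", "secret", "generic", "redis-table-definition",
     "--from-file=example_table.json", "-n", NAMESPACE],
]

print(definition_json)
for command in commands:
    print(shlex.join(command))
```

Printing instead of executing keeps the sketch safe to run anywhere; paste the printed commands into a shell once the names match your setup.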
Add Helm repositories
Helm repositories provide pre-packaged charts that simplify application deployment. Add the Bitnami and Trino repositories to your Helm configuration.

Deploy MinIO for data storage
Initialize MinIO. Prepare MinIO within the Trino namespace.
Create MinIO tenant. Set up a multi-tenant architecture for data storage. The example creates a tenant named tenant1 with 4 servers, 4 storage volumes, and a capacity of 4 GB.

Set up Hive Metastore
Trino utilizes Hive Metastore to store table metadata. Deploy PostgreSQL to manage the metadata, then set up the Hive Metastore.
Install PostgreSQL.
Deploy Hive Metastore. Use a pre-configured Helm chart to deploy Hive Metastore within the Trino namespace.
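The Helm repository, MinIO and Hive Metastore steps above can be sketched as a list of commands, again printed rather than executed. This assumes the MinIO kubectl plugin is installed (newer operator releases may no longer ship it). The Helm release names and the chart path are hypothetical, and the chart path stands in for wherever the Hive Metastore chart lives in the cloned repository.

```python
import shlex

NAMESPACE = "trino"
HIVE_CHART_PATH = "./path/to/hive-metastore-chart"  # hypothetical placeholder

commands = [
    # Register the chart repositories named in the walkthrough.
    ["helm", "repo", "add", "bitnami", "https://charts.bitnami.com/bitnami"],
    ["helm", "repo", "add", "trino", "https://trinodb.github.io/charts/"],
    ["helm", "repo", "update"],
    # Initialize MinIO in the namespace (assumes the kubectl minio plugin).
    ["kubectl", "minio", "init", "--namespace", NAMESPACE],
    # Tenant with 4 servers, 4 volumes and 4 GB of capacity; "4Gi" is the
    # Kubernetes quantity assumed here for that capacity.
    ["kubectl", "minio", "tenant", "create", "tenant1",
     "--servers", "4", "--volumes", "4", "--capacity", "4Gi",
     "--namespace", NAMESPACE],
    # PostgreSQL backend for the metastore; the release name is hypothetical.
    ["helm", "upgrade", "--install", "metastore-postgresql",
     "bitnami/postgresql", "-n", NAMESPACE],
    # Hive Metastore from a local chart; the release name is hypothetical.
    ["helm", "upgrade", "--install", "hive-metastore",
     HIVE_CHART_PATH, "-n", NAMESPACE],
]

for command in commands:
    print(shlex.join(command))
```

In practice each Helm install would also take a values file with database credentials and S3 settings; those are left out because the narration does not describe them.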
Deploying MinIO and Trino with Kubernetes
Trino and MinIO create a powerful combination for distributed SQL querying across large datasets. Follow these steps to deploy and configure the system.

Deploy Redis to store table schemas
Redis is a high-speed in-memory data store used to hold Trino table schemas for enhanced query performance. Deploy it in the Trino namespace using a Helm chart.

Deploy Trino
Deploy Trino as the distributed SQL query engine that will connect to MinIO and other data sources.

Verify deployment
Confirm that all components are running correctly by listing the pods in the Trino namespace.

Security review and adjustments
Review and adjust security settings as needed.
To disable SSL certificate validation for S3 connections, update the additionalCatalogs section of the values.yaml file with the corresponding property.

Testing
Port forward to MinIO tenant service. Port forward to the MinIO service of the tenant, enabling local access.

Create alias and bucket for Trino
1. Create alias. Establish an alias for the tenant using the credentials from the MinIO deployment.
2. Create bucket. Create a new bucket that Trino will use.

Access Trino UI via port forward
1. Obtain pod name. Retrieve the name of the Trino coordinator pod.
2. Port forward. Forward local port 8080 to the coordinator pod.
3. Access UI. Use the Trino UI in your browser by visiting http://127.0.0.1:8080.

Query Trino via CLI
Access the Trino coordinator pod and start querying via the command line.

Confirm data in MinIO bucket
After creating the bucket, confirm that the data is stored in MinIO by listing the contents with the mc command line tool. It's as simple as that.
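The verification and testing steps above might look roughly like the sketch below, which builds the commands in order and prints them. The service name, local ports, alias, bucket name and pod names are all hypothetical, and the credentials are placeholders. The `--insecure` flag on `mc` assumes the tenant serves a self-signed certificate.

```python
import shlex

NAMESPACE = "trino"
ALIAS = "myminio"   # hypothetical mc alias
BUCKET = "tiny"     # hypothetical bucket name


def find_coordinator(pod_names):
    """Return the first pod name containing 'coordinator', or None."""
    return next((name for name in pod_names if "coordinator" in name), None)


def build_commands(coordinator_pod, access_key, secret_key):
    """Return the verification and testing commands in walkthrough order."""
    target = f"{ALIAS}/{BUCKET}"
    return [
        ["kubectl", "get", "pods", "-n", NAMESPACE],
        # Assumes the tenant exposes a service called "minio" on port 443.
        ["kubectl", "port-forward", "svc/minio", "9443:443", "-n", NAMESPACE],
        ["mc", "alias", "set", ALIAS, "https://localhost:9443",
         access_key, secret_key, "--insecure"],
        ["mc", "mb", target, "--insecure"],
        ["kubectl", "port-forward", coordinator_pod, "8080:8080",
         "-n", NAMESPACE],
        # Opens the Trino CLI inside the coordinator pod.
        ["kubectl", "exec", "-it", coordinator_pod, "-n", NAMESPACE,
         "--", "trino"],
        ["mc", "ls", target, "--insecure"],
    ]


# Hypothetical pod names, in the form printed by `kubectl get pods -o name`.
pods = ["pod/trino-worker-abc", "pod/trino-coordinator-xyz"]
coordinator = find_coordinator(pods)
for command in build_commands(coordinator, "ACCESS_KEY", "SECRET_KEY"):
    print(shlex.join(command))
```

The two port-forward commands block while they run, so in a real session each would sit in its own terminal.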
Final thoughts
When troubleshooting configuration issues, especially those concerning security, thoroughly review the values.yaml files for each component to ensure proper settings.

Trino stands out for its ability to optimize queries across various data layers, whether specialized databases or object storage. It aims to minimize data transfer by pushing down queries to retrieve only the essential data required. This enables Trino to join datasets from different sources, perform further processing, or return precise results efficiently.

MinIO pairs exceptionally well with Trino due to its industry-leading scalability and performance. With the ability to handle significant workloads across AI/ML and analytics, MinIO effortlessly supports Trino queries and beyond. In recent benchmarks, MinIO achieved an impressive 325 GiB per second (349 GB per second) for GET operations and 165 GiB per second (177 GB per second) for PUT operations across just 32 nodes. This remarkable performance ensures that data stored in MinIO remains readily accessible, making MinIO a reliable and high-performing choice for Trino without becoming a bottleneck.
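Each operation is quoted with two throughput figures, which appear to be the same measurement in binary and decimal units. A quick check, assuming the first figure of each pair is in GiB per second, converts it to GB per second and also gives a rough per-node average (an illustration only, since the narration does not say how evenly load was spread).

```python
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one gigabyte
NODES = 32    # node count quoted for the benchmark


def gib_to_gb(value_gib):
    """Convert a quantity in GiB to GB."""
    return value_gib * GIB / GB


for operation, gib_per_second in {"GET": 325, "PUT": 165}.items():
    gb_per_second = gib_to_gb(gib_per_second)
    per_node = gb_per_second / NODES
    print(f"{operation}: {gib_per_second} GiB/s is about "
          f"{gb_per_second:.0f} GB/s, or {per_node:.1f} GB/s per node")
```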
If you have any questions on MinIO and Trino, be sure to reach out to us in Slack.
Thank you for listening to this HackerNoon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.
