The Good Tech Companies - How to Build a Production-Grade Text2SQL Engine

Episode Date: August 13, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/how-to-build-a-production-grade-text2sql-engine. Learn about the role of LLMs in text2SQL, discuss the challenges inherent in this capability, and explore SherloQ, a new text2SQL engine from the Skypoint team. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #generative-ai, #sql, #database, #text2sql, #sqlcoder, #gpt-3.5, #palm-sql, #good-company, and more. This story was written by: @datastax. Learn more about this writer by checking @datastax's about page, and for more stories, please visit hackernoon.com.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. How to build a production-grade Text2SQL engine, by DataStax. Interacting with databases often requires a level of technical expertise that can put data out of easy reach for a lot of people. Consider a finance executive who needs to understand the company's financial numbers and trends. Traditionally, this executive would have to rely on SQL analysts to extract the necessary data from the database. This dependency can cause delays and communication gaps, especially if the executive needs to refine their queries multiple times to get the desired insights. But Text2SQL, a capability that converts natural language to
Starting point is 00:00:42 structured query language statements, has changed the game. With Text2SQL, the finance executive can directly interact with the database using natural language. For instance, a user can input a business-facing question such as: What was the average order value for each customer last month? The Text2SQL AI engine will process the question, generate the corresponding SQL query, then execute it against the database and display the results to the user. In this article, we'll explain the role of LLMs in Text2SQL, discuss the challenges inherent in this capability, and explore SherloQ, a highly accurate and robust Text2SQL engine developed by the SkyPoint team. The role of LLMs in Text2SQL. The ability to convert text to SQL has improved significantly with the help of large language models (LLMs). These models use vast
Starting point is 00:01:34 amounts of data and powerful neural network architectures to understand and generate human-like text. By training on diverse datasets, LLMs can generalize across various tasks, including the translation of natural language into SQL queries. For example, the paper, Language Models Are Few-Shot Learners, demonstrates how LLMs can perform tasks with minimal examples, highlighting their ability to adapt to new tasks with limited data. This approach significantly reduces the need for extensive task-specific data, making it easier to deploy LLMs in various applications.
Starting point is 00:02:14 SPIDER, a large-scale, human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL tasks, provides a comprehensive benchmark for training and evaluating models on complex SQL queries across different domains. This dataset has been pivotal in advancing the state of the art in Text2SQL by providing a robust benchmark for model performance. Additionally, PaLM: Scaling Language Modeling with Pathways explores how advanced training techniques, such as scaling up model sizes and optimizing training pathways, can enhance model performance in various applications, including Text2SQL. While these LLMs are highly effective in controlled environments, they often face challenges in production settings. These include handling ambiguous prompts,
Starting point is 00:02:55 managing complex database schemas, and ensuring real-time performance. Additionally, integrating these models into existing systems requires significant effort and ongoing maintenance to adapt to changing data and usage requirements. Here are three LLMs that work well for this task. SQLCoder. The main purpose of SQLCoder is to convert natural language input to SQL queries. Unlike other general-purpose models, SQLCoder has been fine-tuned on SQL-specific data, so it is especially effective at understanding and generating SQL queries. SQLCoder demonstrates notable performance on the SPIDER dataset, a complex and cross-domain benchmark for text-to-SQL systems.
Starting point is 00:03:38 SQLCoder generates the correct SQL query for novel schemas not seen in training with an accuracy of 64.6%. It outperforms GPT-3.5 Turbo and text-davinci-003, models more than 10x its size. This highlights SQLCoder's capability to handle diverse and intricate SQL queries, which is critical for production environments. GPT-3.5 and GPT-4. GPT-3.5 and GPT-4 (generative pre-trained transformers) are some of the most advanced and effective general-purpose language models. Both models excel in few-shot learning, quickly adapting to new tasks with minimal examples, which is ideal for creating SQL queries from limited input. For instance, when evaluated on the SPIDER dataset, GPT-3.5 Turbo correctly generates SQL queries for novel schemas not seen in training 60.6% of the time, while GPT-4 achieves 74.3%. These models exhibit robust performance,
Starting point is 00:04:42 particularly in understanding context and generating accurate SQL queries from complex natural language inputs. However, their general-purpose design sometimes requires additional fine-tuning for optimal results in specialized applications like SQL generation. PaLM-SQL. PaLM (Pathways Language Model) SQL is another powerful model, developed by Google. PaLM-SQL's advanced capabilities and architecture make it highly efficient at translating natural language into SQL queries and handling complex and diverse database schemas with greater accuracy. Despite the advancements in LLMs, relying on a single LLM for text-to-SQL in production can be problematic. A single model will not be able to handle the wide variety of queries, database schemas, and real-time latency requirements of an enterprise
Starting point is 00:05:31 data environment effectively. Production environments demand robustness, adaptability, and the ability to deal with ambiguous prompts from real-world business users. Therefore, a Text2SQL engine should exhibit three properties to be considered production-grade. First, the ability to understand diverse query formulations: syntactically different user prompts can reduce to the same SQL query, so a good Text2SQL engine should understand the motivation behind a user prompt, keeping the context of the data model in mind, and formulate the SQL query accordingly. Second, the ability to work with ambiguous database schemas and data models: production data models tend to be chaotic, with multiple different teams within an
Starting point is 00:06:13 organization contributing to them and data changing ownership multiple times throughout its lifecycle. A good Text2SQL engine should be able to disambiguate the data model for the user and ensure that noise and ambiguity do not lead to hallucinations, a characteristic of current LLMs that deters a lot of production users. Third, low latency: the workings of the engine should not add significant latencies to the execution of the query. Answers that a user expects in real time should return in real time, which means the engine should formulate an optimal query within the first three tries at most. Introducing SherloQ. SherloQ is SkyPoint's text-to-SQL engine that translates queries from natural language to SQL. While leveraging a custom in-house LLM for query generation is a
Starting point is 00:06:57 significant part of its architecture, SherloQ's effectiveness comes from a combination of advanced components designed to enhance data-querying capabilities. SherloQ boasts high accuracy in query translation, robust error handling, and seamless integration with production database systems, making it suitable for large-scale data environments. In the next few sections, we delve into the internal architectural details of SherloQ and share some of the results that we have achieved using it in production settings. Architectural components of SherloQ. SherloQ's architecture contains multiple moving pieces that each work to improve the system's accuracy, reliability, and latency. Here's an overview of the architecture. User input: the user input is a natural language query. State agent: an implementation of a LangChain interface that tracks state
Starting point is 00:07:45 throughout the execution process. It leverages Redis and DataStax Astra DB to track the state of reasoning and memory during execution. The executor manages the flow of operations by coordinating between different modules. It ensures that user inputs are correctly parsed, processed, and forwarded to subsequent components, maintaining a flow from input to SQL query generation. State tool: an extended class of the LangChain base tool that holds a state variable and passes that state to the respective tool. By maintaining a state variable, the state tool keeps track of the data that needs to be forwarded to the respective tools. This ensures consistency in the data flow, preventing any loss of information during transitions between agents.
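The state-passing pattern described above can be sketched in plain Python. This is a minimal illustration, not SherloQ's actual API: the names StateTool, shared_state, and the stub schema are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class StateTool:
    """Illustrative stand-in for a LangChain-style tool that carries shared state."""
    name: str
    func: Callable[[Dict[str, Any], str], str]
    state: Dict[str, Any] = field(default_factory=dict)

    def run(self, tool_input: str) -> str:
        # The tool reads and mutates the shared state dict, so downstream
        # tools see schema info, prior errors, etc. without re-fetching.
        return self.func(self.state, tool_input)

# One shared state object handed to every tool, mimicking the executor's role.
shared_state: Dict[str, Any] = {"db_schema": None, "last_error": None}

def fetch_schema(state: Dict[str, Any], _: str) -> str:
    state["db_schema"] = "stock(item_id, quantity_available)"  # stub schema
    return state["db_schema"]

def generate_query(state: Dict[str, Any], question: str) -> str:
    # A real tool would call an LLM here; we only show state consumption.
    return f"-- schema: {state['db_schema']}\n-- question: {question}"

db_tool = StateTool("db_tool", fetch_schema, shared_state)
query_tool = StateTool("query_generation_tool", generate_query, shared_state)

db_tool.run("")
print(query_tool.run("items low on stock"))
```

Because both tools close over the same dict, the schema fetched by the DB tool is visible to the query generation tool without a second database round trip.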
Starting point is 00:08:34 The state tools of SherloQ: The query generation tool uses the retrieved few-shot queries, data model context, and DB schema to generate the initial SQL query. It enhances the accuracy and relevance of generated SQL queries by using parsed schema information and learning from contextual examples. The retry tool analyzes any error thrown during query execution and regenerates a query that addresses the error. The DB tool fetches the database schema and corresponding metadata (annotations, comments, etc.) that would be helpful for query generation; the retrieved schema information is cached using a time-based caching mechanism to optimize performance and reduce database queries. It also executes the query against the database and returns the results or an error stack trace.
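The time-based schema cache mentioned above can be sketched as a simple TTL wrapper. The fetch function, key shape, and 300-second TTL are illustrative assumptions, not SherloQ internals:

```python
import time
from typing import Any, Callable, Dict, Tuple

def ttl_cache(fetch: Callable[[str], Any], ttl_seconds: float = 300.0):
    """Cache fetch(key) results for ttl_seconds to avoid repeated schema queries."""
    store: Dict[str, Tuple[float, Any]] = {}

    def cached(key: str) -> Any:
        now = time.monotonic()
        hit = store.get(key)
        if hit is not None and now - hit[0] < ttl_seconds:
            return hit[1]           # fresh entry: skip the database round trip
        value = fetch(key)          # expired or missing: fetch and re-cache
        store[key] = (now, value)
        return value

    return cached

calls = []
def fetch_schema(table: str) -> str:
    calls.append(table)             # track real fetches for demonstration
    return f"{table}(id, name)"

get_schema = ttl_cache(fetch_schema, ttl_seconds=300.0)
get_schema("vendors")
get_schema("vendors")               # served from cache; fetch_schema not called again
print(len(calls))                   # → 1
```

The same wrapper could front any metadata lookup; expiring entries keeps the engine from serving a stale schema after a migration.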
Starting point is 00:09:16 Data inputs and techniques for enhancing SherloQ's performance. To enhance SherloQ's performance, we employ several techniques and provide the model with important data inputs. These consist of structured decomposition, few-shot examples, data model context retrieval, reflection, and a retry mechanism. Each part is essential in improving the model's ability to produce precise and relevant SQL queries from natural language inputs. Structured decomposition: the user prompt is broken down into its fundamental parts. Focusing on and identifying the essential components of the query helps the model produce accurate SQL. For example: Input: What is the current quantity
Starting point is 00:09:58 available for all items supplied by XYZ Corp? Output: current quantity available; all items; supplied by XYZ Corp. Few-shot examples: the model is given reference questions in the form of a few-shot SQL examples, which help it generate SQL queries based on similar patterns. These examples improve the model's ability to accurately generate new SQL queries for various prompts by helping it recognize the structure and format of the intended queries. Here are a few examples. Example 1. Input: Show items with a quantity of less than 20 units available. Output: SELECT * FROM stock WHERE quantity_available < 20. Example 2. Input: List vendors with items priced above $100. Output: SELECT * FROM vendors WHERE vendor_id IN
Starting point is 00:10:46 (SELECT vendor_id FROM items WHERE unit_price > 100). We dynamically select the most similar few-shot examples using semantic similarity matching from Astra DB, which is our vector database. The vector database allows us to find the examples that are closest in structure and content to the new input query, ensuring that the model can leverage the most relevant patterns to generate accurate SQL queries. In our production environment, we average about two few-shot examples per query. In our experience, simply adding more few-shot examples to improve accuracy is not a scalable practice. Data model context. Data model context contains
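The dynamic few-shot selection described above can be sketched with cosine similarity over toy embedding vectors. The vectors, helper names, and in-memory list are stand-ins for an embedding model plus a vector database such as Astra DB:

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy (question, SQL, embedding) triples standing in for the vector store.
examples: List[Tuple[str, str, List[float]]] = [
    ("Show items with fewer than 20 units available",
     "SELECT * FROM stock WHERE quantity_available < 20", [0.9, 0.1, 0.0]),
    ("List vendors with items priced above $100",
     "SELECT * FROM vendors WHERE vendor_id IN "
     "(SELECT vendor_id FROM items WHERE unit_price > 100)", [0.1, 0.9, 0.0]),
]

def top_k_examples(query_vec: List[float], k: int = 2):
    """Return the k few-shot examples most similar to the query embedding."""
    ranked = sorted(examples, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return [(q, sql) for q, sql, _ in ranked[:k]]

# A query about stock levels should rank the stock example first.
shots = top_k_examples([0.8, 0.2, 0.0], k=1)
print(shots[0][1])
```

A production version would embed the incoming question with the same model used to index the examples and fetch only the top few, which matches the observation that roughly two well-chosen examples beat a long, undifferentiated list.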
Starting point is 00:11:26 domain-specific details that can be useful in creating a SQL query. For example, in the context of financial data for a hospital network, these can be things like finance metric codes and their descriptions. This context is used by the model to make sure that the SQL queries it creates match the data structure of the domain. The inclusion of data model context is optional and is required only when the table is complex and requires domain knowledge to form a SQL query. For example, the metric code MGMTFEE maps to the description: fees collected for managing properties (management fees). SQL DB schema. The SQL DB schema is the structured representation of the available data. The schema is enriched by annotating both the table and its columns.
Starting point is 00:12:11 It includes table names and descriptions, along with the columns, their descriptions, and data types. Providing the schema helps the model understand the database structure and the meaning associated with each table and column, ensuring that the generated SQL queries are syntactically correct and utilize the correct database elements. Reflection. Reflection refers to the model's ability to assess and evaluate its own past responses, behavior, or generated content. It allows the model to refine its responses by identifying and correcting errors or weaknesses. The process includes considering queries previously seen by the SherloQ engine, along with feedback, during the query generation phase. These scenarios help the model replicate the successes, or avoid the failures, of similar questions that it has seen in
Starting point is 00:12:55 the past. This step also leverages Astra DB to find the most semantically similar queries to the current one. Retry mechanism. SherloQ incorporates a retry mechanism using the errors received from the database. When a generated SQL query results in an error, the engine uses an agent with a predefined template to correct the query: Your task is to fix an incorrect SQL query generated from a question to make it compatible with ANSI SQL. Follow these guidelines: carefully analyze the question, database schema, and the error message received to ensure accurate answers; utilize table aliases to avoid confusion; for example, when calculating ratios,
Starting point is 00:13:36 always convert the numerator to a float. Use the below format. Task: Generate an SQL query to address the question {question}. The query to be corrected is: {sql_query}. The error message received during the last execution is: {error_message}. Response: Based on the above inputs, here is the corrected SQL query: {response_sql}. Comparing results before and after SherloQ implementation. With the use of advanced LLMs and a well-designed system, SherloQ produces SQL queries that are more precise and relevant to the context.
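The retry mechanism described above can be sketched as a small loop. This is a minimal illustration under stated assumptions: execute_sql and fix_query are hypothetical stand-ins for the database call and the LLM correction step, and the three-attempt cap mirrors the "first three tries" latency goal rather than a documented SherloQ constant.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # mirrors the goal of an optimal query within the first three tries

def run_with_retries(
    sql: str,
    execute_sql: Callable[[str], list],
    fix_query: Callable[[str, str], str],
) -> list:
    """Execute sql; on error, ask the fixer (an LLM in practice) for a corrected query."""
    last_error = ""
    for _ in range(MAX_ATTEMPTS):
        try:
            return execute_sql(sql)
        except Exception as exc:      # the DB error feeds the correction prompt
            last_error = str(exc)
            sql = fix_query(sql, last_error)
    raise RuntimeError(f"query still failing after {MAX_ATTEMPTS} attempts: {last_error}")

# Stub executor: rejects raw integer division, accepts the "fixed" query.
def execute_sql(sql: str) -> list:
    if "CAST" not in sql:
        raise ValueError("integer division by zero")
    return [("Q1", 12.5)]

def fix_query(sql: str, error: str) -> str:
    # A real implementation would fill the retry template with the question,
    # the failing query, and the error message, then call the LLM.
    return sql.replace("a / b", "CAST(a AS FLOAT) / NULLIF(b, 0)")

rows = run_with_retries("SELECT a / b FROM t", execute_sql, fix_query)
print(rows)  # → [('Q1', 12.5)]
```

Feeding the raw error string back into the correction prompt is what lets the model target the specific failure instead of regenerating blindly.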
Starting point is 00:14:12 Here, we compare the results of user prompts before and after SherloQ using the metrics of latency and accuracy. Comparison using one production question: here, we look at one of our production customers, a well-known senior living operator. One of their use cases is their finance team using SkyPoint's platform to understand and reason over the consolidated financial metrics of their different senior living facilities. User prompt: Determine the quarterly trend of housekeeping expenses for facility CM throughout 2022. Before SherloQ, generated query: SELECT SUM(total_amount_numerator) / SUM(total_amount_denominator) AS housekeeping_expenses FROM skypoint_metric_fact_denormalized_vw WHERE facility_name = 'CM' AND year = 2022 AND metric_name = 'Housekeeping Expense' GROUP BY quarter_number.
Starting point is 00:14:52 Issues identified: the query groups by quarter_number but does not select it, which can cause incomplete results; it uses metric_name instead of a more specific identifier like metric_code; and potential division by zero is not handled in SUM(total_amount_denominator). Average latency: 29 seconds. The before-SherloQ setup used a combination of GPT-3.5 along with prompt engineering and five-plus few-shot queries per
Starting point is 00:15:39 user prompt. With SherloQ, generated query: SELECT year, quarter_number, SUM(total_amount_numerator) / COALESCE(SUM(NULLIF(total_amount_denominator, 0)), 1) AS housekeeping_expenses FROM skypoint_metric_fact_denormalized_vw WHERE metric_code = 'EXPOTHOU' AND facility_name = 'CM' AND year = 2022 GROUP BY quarter_number, year ORDER BY quarter_number ASC. Improvements noted: the query includes quarter_number, providing the necessary quarterly breakdown; the metric_code field is used, providing a more precise identifier for the metric; the COALESCE(SUM(NULLIF(
Starting point is 00:16:31 total_amount_denominator, 0)), 1) expression handles potential division-by-zero errors; and the results are ordered by quarter_number to reflect the quarterly trend. Average latency: 10 seconds. Based on benchmarks on production workloads, below are the results that we report on accuracy and reliability before and with SherloQ. Before SherloQ: accuracy 65%, reliability 60%. With SherloQ: accuracy 92%, reliability 90%. The above results were drawn from an internal benchmark suite that executes every prompt 100 times with separate identifiers to negate the effects of caching in our internal systems as well as the models. The suite measures accuracy by comparing the
Starting point is 00:17:16 returned response against a benchmark response, and reliability by measuring how often it would return similar responses. The comparison clearly illustrates the advantages of SherloQ in transforming natural language queries into accurate SQL queries. Overall performance improved by 30% after SherloQ. The queries generated earlier suffered from issues such as incomplete results and a lack of error handling, impacting both accuracy and reliability. With SherloQ, the generated queries are more precise, efficient, and robust, with notable improvements in latency, accuracy, and reliability. This enhancement shows SherloQ's capability to deliver dependable data retrieval,
Starting point is 00:17:56 making it a valuable tool for organizations seeking to optimize their data querying processes. To explore other parts of the SkyPoint platform or to book a demo of SkyPoint AI, visit the SkyPoint website. By Alec Raj, Lead AI Engineer, SkyPoint, and Sayan Dipser-Kar, Head of Engineering, SkyPoint. Thank you for listening to this HackerNoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn and publish.
