SQLPilot — Text2SQL application with RAG

Shobhit Bhosure
3 min read · Aug 10, 2024


Hi everyone!

I want to introduce you to SQLPilot, an AI SQL editor, and walk you through how it works.

Motivation behind building SQLPilot

While writing queries, users often take help from ChatGPT or other LLMs, and the model gives better results if you provide it with information about the schema or tables you are generating the query for. Copying and pasting this manually becomes repetitive and tedious, and even after providing the schema, you are not guaranteed accurate results.

There are many tools out there that solve this problem by letting you share the entire schema of your database, which is then passed to the LLM. The problem here is that most of these services are web applications, so you end up sharing your schema with third-party apps other than OpenAI.

So, to highlight the problems:

  1. You are sharing your schema with a third party (privacy concern).
  2. If you connect your database to these web apps, you are trusting them with your credentials (security issue).
  3. LLMs don’t have much context about your database: which table serves what purpose, and how to answer business-specific queries. For example, “urgent orders” in real life could mean priority=0 instead of is_urgent=true in the orders table (inaccurate and inconsistent results).

SQLPilot solves these problems by implementing a RAG pipeline. Let’s see how.

The main component enabling RAG here is the knowledge base.

Our knowledge base consists of three components:

  1. Labelled database — here we label the tables we mostly work with, describing what each table is for and the purpose it serves. This helps the LLM understand the database better and allows it to pick the relevant tables, along with their schemas, for the user’s query.
  2. Examples — if we already have queries that users run on a daily basis to get results, we should add them here. SQLPilot will pick the examples most relevant to the user’s question and feed these top examples to the LLM, so that while generating a query it is clearer to the LLM what needs to be done.
  3. Rules / Instructions — you can add custom instructions or rules for the LLM. We often have business-specific terms the LLM might not know; as in the earlier example, “urgent orders” could mean priority=0 rather than is_urgent=true. You can add such instructions here.
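Concretely, the three components above could be represented with a simple structure like the following. This is a hypothetical sketch; the field names, tables, and format are illustrative, not SQLPilot’s actual internal representation.

```python
# A hypothetical sketch of the three knowledge-base components.
# Table names, schemas, and structure are illustrative examples only.

knowledge_base = {
    # 1. Labelled database: each table paired with a description of its purpose
    "tables": {
        "orders": {
            "description": "One row per customer order; priority=0 marks urgent orders.",
            "schema": "CREATE TABLE orders (id INT, customer_id INT, priority INT, created_at TIMESTAMP)",
        },
        "customers": {
            "description": "Customer master data.",
            "schema": "CREATE TABLE customers (id INT, name TEXT, region TEXT)",
        },
    },
    # 2. Examples: known-good question/query pairs used as few-shot examples
    "examples": [
        {
            "question": "How many urgent orders were placed today?",
            "query": "SELECT COUNT(*) FROM orders WHERE priority = 0 AND created_at::date = CURRENT_DATE;",
        },
    ],
    # 3. Rules / instructions: business-specific terminology for the LLM
    "rules": [
        "Urgent orders are rows where priority = 0, not is_urgent = true.",
    ],
}
```

The rule entry encodes exactly the “urgent orders” mismatch described above, so the LLM no longer has to guess which column represents urgency.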

Process for constructing the prompt for the LLM

  1. We take all the examples and create embeddings for them using OpenAI’s embedding API. These embeddings are stored in an in-memory vector database.
  2. We then create a vector for the user’s question and query our vector database with it. This returns the examples that best match the question, so we now have the examples most relevant to it.
  3. We send our labelled database along with the user’s question to the LLM and ask it to return the tables relevant to the question.
  4. From the table names returned, we now have the schemas of all the relevant tables, the few-shot examples, and the rules / instructions for the LLM.
  5. We then send all of these details in a prompt to the LLM and get the SQL query in return!
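The steps above can be sketched roughly as follows. This is a minimal illustration: the `embed` function below is a toy bag-of-words stand-in for OpenAI’s embedding API, the table-selection LLM call (steps 3–4) is assumed to have already returned its result, and the prompt assembly is simplified.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; in the real pipeline this would be
    # a call to OpenAI's embedding API.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: embed every stored example into an in-memory "vector database"
examples = [
    {"question": "How many urgent orders were placed today?",
     "query": "SELECT COUNT(*) FROM orders WHERE priority = 0 AND created_at::date = CURRENT_DATE;"},
    {"question": "List customers by region",
     "query": "SELECT region, name FROM customers ORDER BY region;"},
]
vector_db = [(embed(ex["question"]), ex) for ex in examples]

def build_prompt(question: str, schemas: dict, rules: list, top_k: int = 1) -> str:
    # Step 2: embed the user's question and retrieve the closest examples
    qvec = embed(question)
    ranked = sorted(vector_db, key=lambda pair: cosine(qvec, pair[0]), reverse=True)
    few_shot = [ex for _, ex in ranked[:top_k]]

    # Steps 3-4: in the real pipeline an LLM call picks the relevant tables
    # from the labelled database; here we assume it already returned `schemas`.

    # Step 5: assemble schemas, rules, and few-shot examples into one prompt
    parts = ["Relevant table schemas:"]
    parts += schemas.values()
    parts.append("Rules:")
    parts += rules
    parts.append("Examples:")
    parts += [f"Q: {ex['question']}\nSQL: {ex['query']}" for ex in few_shot]
    parts.append(f"Q: {question}\nSQL:")
    return "\n".join(parts)

prompt = build_prompt(
    "Count urgent orders from last week",
    {"orders": "CREATE TABLE orders (id INT, priority INT, created_at TIMESTAMP)"},
    ["Urgent orders are rows where priority = 0."],
)
```

Because retrieval ranks the stored “urgent orders” example closest to the new question, that example lands in the prompt as a few-shot demonstration, alongside the schema and the priority=0 rule.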

So, through the above pipeline, we get the output we need while giving the LLM maximum context about our database, without sharing any actual data with third parties.

You can add your own OpenAI key and start using SQLPilot! Or create an account on https://sqlpilot.ai and start with free credits.
