CHESS - Contextual Harnessing for Efficient SQL Synthesis
Reproduction: Sources
Original repo: Link
Paper: Link
Datasets
BIRD: Link
Spider: Link
- We need a path to save our dataset(s) to. The CHESS repo provides a data directory for this purpose, where the BIRD dev set can be downloaded and used more or less out of the box, but we use the placeholder <dataset-path> for the download path to allow for any other storage option. We create the subdirectories BIRD and Spider, one for each dataset:
mkdir <dataset-path>/BIRD <dataset-path>/Spider
- We download the dataset(s) with curl or with wget, depending on which one is installed on our system, using the <dataset-path> from above (skip BIRD train, the second line, if not needed):
With curl:
curl -L 'https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip' -o <dataset-path>/BIRD/dev.zip ; \
curl -L 'https://bird-bench.oss-cn-beijing.aliyuncs.com/train.zip' -o <dataset-path>/BIRD/train.zip ; \
curl -L 'https://drive.usercontent.google.com/download?id=1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J&confirm=t' -o <dataset-path>/Spider/spider_data.zip
With wget:
wget -O <dataset-path>/BIRD/dev.zip 'https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip' ; \
wget -O <dataset-path>/BIRD/train.zip 'https://bird-bench.oss-cn-beijing.aliyuncs.com/train.zip' ; \
wget -O <dataset-path>/Spider/spider_data.zip 'https://drive.usercontent.google.com/download?id=1403EGqzIDoHMdQF4c9Bkyl7dZLZ5Wt6J&confirm=t'
- Extracting the zip archives we just downloaded (skip the BIRD/train.zip line if not needed):
unzip <dataset-path>/BIRD/dev.zip -d <dataset-path>/BIRD
unzip <dataset-path>/BIRD/train.zip -d <dataset-path>/BIRD
unzip <dataset-path>/Spider/spider_data.zip -d <dataset-path>/Spider
- For the BIRD dataset, we rename the extracted subdirectory to dev for ease of use, then we extract the databases into a subdirectory separately (skip BIRD train if not needed: drop the third line and remove the trailing \):
mv <dataset-path>/BIRD/dev_20240627 <dataset-path>/BIRD/dev ; \
unzip <dataset-path>/BIRD/dev_databases.zip -d <dataset-path>/BIRD/dev ; \
unzip <dataset-path>/BIRD/train_databases.zip -d <dataset-path>/BIRD/train
- To configure the datasets, we copy the dotenv_copy file provided by the CHESS authors to a .env file (or we copy the .env.llama3.2.example provided by the replication authors to .env):
Copy dotenv_copy to .env:
cp CHESS/dotenv_copy CHESS/.env
- Editing the .env config with our favourite command line editor, we can now set the dataset locations as follows:
- DB_ROOT_PATH should be the parent directory of the database directory, e. g. "<dataset-path>/BIRD/dev" for the BIRD dev dataset, with the subdirectory dev_databases (where the .sqlite and description .csv files are stored)
- DATA_MODE takes the values "dev" or "train" and switches between inference mode and train mode
- DATA_PATH should be the json file providing the user questions and the expected SQL results, e. g. "<dataset-path>/BIRD/dev/dev.json" for the BIRD dev dataset, or the reduced SDS dataset provided by the authors of CHESS (in the repository under "data/dev/sub_sampled_bird_dev_set.json")
- DB_ROOT_DIRECTORY should be the database directory, e. g. "<dataset-path>/BIRD/dev/dev_databases" for the BIRD dev dataset (where the .sqlite and description .csv files are stored in subdirectories for each database)
- INDEX_SERVER_HOST: TODO: find out what this does; left at the provided value "localhost"
- INDEX_SERVER_PORT: TODO: find out what this does; left at the provided value 12345
- OPENAI_API_KEY should be the OpenAI API key, if you plan on using OpenAI; otherwise leave it at any other (non-empty) value
- GCP_PROJECT should be set to the Google Cloud project used for Gemini requests (left empty, i. e. set to '', by the replication project; unnecessary for the Ollama setup)
- GCP_REGION should be set to the region of the Google Cloud project used for Gemini requests (left at the provided value 'us-central1' by the replication project; unnecessary for the Ollama setup)
- GCP_CREDENTIALS should be set to the credentials needed to authorize with the Google Cloud project used for Gemini requests (left empty, i. e. set to '', by the replication project; unnecessary for the Ollama setup)
- GOOGLE_CLOUD_PROJECT probably also should be set to the Google Cloud project used for Gemini requests (left empty, i. e. set to '', by the replication project; unnecessary for the Ollama setup)
- PATH, OLLAMA_HOST and OLLAMA_MODELS are optional in .env.llama3.2 for ease of use with Ollama and can stay commented out
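For orientation, a filled-in .env for the BIRD dev set with the Ollama setup might look roughly like this (a sketch using the placeholder paths from above; adjust to your own storage locations):
DB_ROOT_PATH="<dataset-path>/BIRD/dev"
DATA_MODE="dev"
DATA_PATH="<dataset-path>/BIRD/dev/dev.json"
DB_ROOT_DIRECTORY="<dataset-path>/BIRD/dev/dev_databases"
INDEX_SERVER_HOST='localhost'
INDEX_SERVER_PORT=12345
OPENAI_API_KEY='dummy'
GCP_PROJECT=''
GCP_REGION='us-central1'
GCP_CREDENTIALS=''
GOOGLE_CLOUD_PROJECT=''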
Installation of Ollama (for Linux)
Downloads for macOS and Windows can be found here: Link
Documentation for installation under Linux: general GitHub-Link, GitHub, Permalink
FAQ on execution with a different path to where the model is stored (possibly needed for large persistent data storage on an institute server): general GitHub-Link, GitHub, Permalink
- Figuring out which one of curl or wget is installed with which curl or which wget (should print an install path), we then download with the appropriate command below. We assume that <download-path> is a placeholder for the path where the archive file will be stored. We suggest using the provided ollama directory for testing, but we can specify any other directory that fits our needs.
With curl:
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o <download-path>/ollama-linux-amd64.tgz
With wget:
wget -O <download-path>/ollama-linux-amd64.tgz https://ollama.com/download/ollama-linux-amd64.tgz
- Extracting the tar file to a path of our choice by replacing the placeholder <install-path> with our path. We suggest the provided ollama directory for testing purposes. Apply sudo as needed if we prefer following the official documentation.
tar -C <install-path> -xzf <download-path>/ollama-linux-amd64.tgz
- Setting the OLLAMA_MODELS environment variable to the path where we would like to store our models (placeholder <model-path>), we then start the Ollama web service as a background job in bash. Here's a reminder that we can add environment variables to our .bashrc.
export OLLAMA_MODELS=<model-path> ; <install-path>/./bin/ollama serve &
- Then, downloading the model of our choice (placeholder <model>), e. g. llama3:70b (48 GB VRAM) or llama3.2:3b (3 GB VRAM); only necessary once, as long as OLLAMA_MODELS is set to the correct <model-path>.
<install-path>/./bin/ollama pull <model>
- Ollama is now ready to receive requests to <model> from CHESS (see the quick check below). We can also start a chat session with ollama run <model> to check that everything works. To stop the web service, bring the job to the foreground with fg, then stop it with Ctrl+C. To restart the web service, simply run (only) step 3 again. Remove/uninstall with rm <download-path>/ollama-linux-amd64.tgz, rm -r <install-path>/* and rm -r <model-path>/*.
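To verify that the web service is reachable before wiring it into CHESS, we can query its HTTP API (a quick check, assuming Ollama listens on the default localhost:11434; it should list the pulled models):
curl http://localhost:11434/api/tags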
Setting up the venv
- We change to the CHESS directory with cd CHESS. There we create a python virtual environment named venv with the command (using the appropriate python version on our system):
python -m venv venv
- We activate the environment with the following command, which adds (venv) as a prefix to our command prompt:
source venv/bin/activate
- We install the required python packages with:
pip install -r requirements.txt
- We are ready to execute any further steps below. Once we're finished, we can deactivate the virtual environment (which removes the (venv) prefix) by running:
deactivate
Configuring the preprocessing
The CHESS framework uses the langchain python package to connect to an LLM via the API of a web service. The Ollama integration for langchain was added to the requirements.txt file as langchain_ollama==0.1.3 (version 0.1.3 because of its compatibility with the existing requirements).
The preprocessing calls the LLM to embed the database and column descriptions, so the file CHESS/src/database_utils/db_catalog/preprocessing.py was edited, adding the import from langchain_ollama import OllamaEmbeddings and commenting out the existing EMBEDDING_FUNCTION to replace it with EMBEDDING_FUNCTION = OllamaEmbeddings(model="llama3.2").
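For reference, the edited lines in that file look roughly like this (a sketch; the original OpenAI-based definition simply stays in the file, commented out above the new one):
from langchain_ollama import OllamaEmbeddings

# EMBEDDING_FUNCTION = OpenAIEmbeddings(...)  # original definition, commented out
EMBEDDING_FUNCTION = OllamaEmbeddings(model="llama3.2")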
- To use a different model with Ollama for embedding, the parameter model="llama3.2" must be edited to model="<model>".
- Ensure that the Ollama web service is running (export OLLAMA_MODELS=<model-path> ; <install-path>/./bin/ollama serve &, as in the Ollama installation above).
- Run the preprocessing by changing into the CHESS directory with cd CHESS, assuming you are in the replication's repository root. Then run:
run/./run_preprocess.sh
Configuring the agents
To configure the model for the agents, we need to add a new engine/model configuration to CHESS/src/llm/engine_configs.py. For Llama3.2 with Ollama, we added the following configuration to the ENGINE_CONFIGS:
"meta-llama/llama3-2": {
"constructor": ChatOllama,
"params": {
"model": "llama3.2",
"temperature": 0,
"model_kwargs": {
"stop": [""],
},
"num_ctx": 32768
}
}
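The config above uses the ChatOllama constructor from the langchain_ollama package that was added to requirements.txt; we assume the corresponding import is present at the top of engine_configs.py (a one-line sketch):
from langchain_ollama import ChatOllama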
- The constructor will be the langchain constructor ChatOllama for API calls to the LLM.
- The params are the constructor parameters.
- The model is the model used by Ollama.
- The temperature is the default model temperature. This gets overwritten by the config of the agents.
- The model_kwargs are copied from the existing Llama config, including the stop entry.
- The num_ctx is the context size used by the model. Ollama defaults to a context size of 2048 tokens. We observed context sizes of about 15 000 tokens in the warnings from Ollama; therefore, we set a context of about twice that. Note that Llama3.2 allows for a context size of up to 128 000 tokens, whereas Llama3-70B only allows for a context size of 8192 tokens. Check with the model you would like to run with Ollama.
- For any other parameters, check the langchain_ollama documentation on ChatOllama.
- Adding another model for Ollama works analogously to the above config (see the sketch below).
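Following the last point, a config for the larger Llama3-70B model might look like this (a sketch; the dictionary key is a name of our own choosing, the stop entry is copied from the existing Llama config, and num_ctx respects the 8192-token limit noted above):
"meta-llama/llama3-70b": {
    "constructor": ChatOllama,
    "params": {
        "model": "llama3:70b",
        "temperature": 0,
        "model_kwargs": {
            "stop": [""],
        },
        "num_ctx": 8192
    }
}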
To configure the agents, a .yaml configuration file and a shell script are needed. For testing purposes, the authors of the replication copied the CHESS/run/configs/CHESS_IR_CG_UT.yaml config file to CHESS/run/configs/CHESS_IR_CG_UT_LLAMA3-2.yaml and the CHESS/run/configs/CHESS_IR_SS_CG.yaml config file to CHESS/run/configs/CHESS_IR_SS_CG_LLAMA3-2.yaml, then replaced every engine and engine_name entry with the meta-llama/llama3-2 model configured above.
- Copying the appropriate .yaml file in CHESS/run/configs and setting all engine and engine_name parameters to the model of choice configures the agents for one of the two workflows from the original CHESS paper with that model (a bulk-replacement sketch follows below).
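One way to do that replacement in bulk is with sed (a sketch, assuming the engine and engine_name values in the copied .yaml are single-quoted; review the file afterwards):
sed -i "s|engine: '[^']*'|engine: 'meta-llama/llama3-2'|; s|engine_name: '[^']*'|engine_name: 'meta-llama/llama3-2'|" CHESS/run/configs/CHESS_IR_CG_UT_LLAMA3-2.yaml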
Similarly, the shell scripts to run the agents were copied for testing purposes by the authors of the replication: CHESS/run/run_main_ir_cg_ut.sh to CHESS/run/run_main_ig_cg_ut_llama3.2.sh and CHESS/run/run_main_ir_ss_cg.sh to CHESS/run/run_main_ir_ss_cg_llama3.2.sh in CHESS/run. The config variable was changed to the appropriate path of the agent configuration file:
In CHESS/run/run_main_ig_cg_ut_llama3.2.sh, the line:
config="./run/config/CHESS_IR_CG_UT.yaml"
Was changed to the line:
config="./run/config/CHESS_IR_CG_UT_LLAMA3-2.yaml"
- Copying a run script in the directory CHESS/run and adjusting the config variable to the appropriate config file created in step 2 makes the framework runnable with a custom configuration.
In the information retriever agent (IR) there is another call to the embed API that is not covered by the config in the previous steps. In the retrieve_entity tool, in the file CHESS/src/workflow/agents/information_retriever/tool_kit/retrieve_entity.py, the replication authors added the import from langchain_ollama import OllamaEmbeddings, and the property embedding_function of class RetrieveEntity (line 34, self.embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")) was adapted to OllamaEmbeddings: self.embedding_function = OllamaEmbeddings(model="llama3.2")
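The relevant edit then looks roughly like this (a sketch of the two changed spots; everything else in the class stays as is):
from langchain_ollama import OllamaEmbeddings

# inside the RetrieveEntity class, around line 34:
# self.embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")  # original
self.embedding_function = OllamaEmbeddings(model="llama3.2")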
- Changing the model for the embedding calls of the retrieve_entity tool in the information retriever agent to the model of choice ensures all API calls are directed to the appropriate LLMs.
- Run the shell script for your configuration from the CHESS directory, e. g. for the Llama3.2 testing config of the replication authors:
run/./run_main_ir_ss_cg_llama3.2.sh