1. We need a path to save our dataset(s) to. The `CHESS` repo provides a `data` directory for this purpose, where the BIRD dev set can be downloaded and used more or less out of the box, but we use the placeholder `<dataset-path>` for the download path to allow for other storage options. We create the subdirectories `BIRD` and `Spider` for the two datasets:
```bash
mkdir <dataset-path>/BIRD <dataset-path>/Spider
```
2. We download the dataset(s) with `curl` or with `wget`, depending on which one is installed on our system, using the `<dataset-path>` from above (skip BIRD train, the second line of the snippet below, if not needed).
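A hedged sketch of the `curl` variant: the BIRD archive URLs below are the ones published on the BIRD benchmark page and may change, and Spider is distributed via a Google Drive link on its project page, so it typically has to be fetched manually or with a helper such as `gdown`:
```bash
# BIRD dev set (first line) and train set (second line; skip if not needed)
curl -L https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip -o <dataset-path>/BIRD/dev.zip
curl -L https://bird-bench.oss-cn-beijing.aliyuncs.com/train.zip -o <dataset-path>/BIRD/train.zip
# wget equivalent, e.g.: wget -P <dataset-path>/BIRD https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip
```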
4. For the BIRD dataset, we rename the extracted subdirectory to `dev` for ease of use, then we extract the databases into a separate subdirectory (skip BIRD train as needed by dropping the third line and the trailing `\`), as sketched below.
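A minimal sketch, assuming the dev archive unpacks to a date-stamped directory such as `dev_20240627` and ships the databases as `dev_databases.zip` (adjust the names to whatever your archive actually contains):
```bash
# rename the extracted subdirectory, then unpack the databases in place
mv <dataset-path>/BIRD/dev_20240627 <dataset-path>/BIRD/dev
unzip <dataset-path>/BIRD/dev/dev_databases.zip -d <dataset-path>/BIRD/dev \
  && unzip <dataset-path>/BIRD/train/train_databases.zip -d <dataset-path>/BIRD/train
```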
5. To configure the datasets, we copy the `dotenv_copy` file provided by the CHESS authors to a `.env` file (or we copy the `.env.llama3.2.example` provided by the replication authors to `.env`):
```bash
cp CHESS/dotenv_copy CHESS/.env
```
6. Editing the `.env` config with our favourite command-line editor, we can now set the dataset locations as follows (a sample configuration is shown after this list):
- `DB_ROOT_PATH` should be the parent directory of the database directory, e.g. `"<dataset-path>/BIRD/dev"` for the BIRD `dev` dataset, which contains the subdirectory `dev_databases` (where the `.sqlite` and description `.csv` files are stored)
- `DATA_MODE` can be `"dev"` or `"train"`; this switches between inference mode and training mode
- `DATA_PATH` should be the JSON file providing the user questions and the expected SQL results, e.g. `"<dataset-path>/BIRD/dev/dev.json"` for the BIRD dev dataset, or the reduced SDS dataset provided by the authors of CHESS (in the repository under `"data/dev/sub_sampled_bird_dev_set.json"`)
- `DB_ROOT_DIRECTORY` should be the database directory itself, e.g. `"<dataset-path>/BIRD/dev/dev_databases"` for the BIRD `dev` dataset (where the `.sqlite` and description `.csv` files are stored in subdirectories for each database)
- `INDEX_SERVER_HOST` purpose not yet determined (TODO); left at the provided value `"localhost"`
- `INDEX_SERVER_PORT` purpose not yet determined (TODO); left at the provided value `12345`
- `OPENAI_API_KEY` should be your OpenAI API key if you plan on using OpenAI models; otherwise, any other (non-empty) value works
- `GCP_PROJECT` should be set to the Google Cloud Project used for Gemini requests (left empty, i.e. set to `''`, by the replication project; unnecessary for the Ollama setup)
- `GCP_REGION` should be set to the region of the Google Cloud Project used for Gemini requests (left at the provided value `'us-central1'` by the replication project; unnecessary for the Ollama setup)
- `GCP_CREDENTIALS` should be set to the credentials needed to authorize with the Google Cloud Project used for Gemini requests (left empty, i.e. set to `''`, by the replication project; unnecessary for the Ollama setup)
- `GOOGLE_CLOUD_PROJECT` presumably also refers to the Google Cloud Project used for Gemini requests (left empty, i.e. set to `''`, by the replication project; unnecessary for the Ollama setup)
- `PATH`, `OLLAMA_HOST` and `OLLAMA_MODELS` are optional entries in `.env.llama3.2` for ease of use with Ollama and can stay commented out
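Putting the values above together, a sample `.env` for running the BIRD dev set against Ollama might look like this (the `OPENAI_API_KEY` value is an arbitrary non-empty placeholder):
```bash
DB_ROOT_PATH="<dataset-path>/BIRD/dev"
DATA_MODE="dev"
DATA_PATH="<dataset-path>/BIRD/dev/dev.json"
DB_ROOT_DIRECTORY="<dataset-path>/BIRD/dev/dev_databases"
INDEX_SERVER_HOST="localhost"
INDEX_SERVER_PORT=12345
OPENAI_API_KEY="placeholder"
GCP_PROJECT=''
GCP_REGION='us-central1'
GCP_CREDENTIALS=''
GOOGLE_CLOUD_PROJECT=''
```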
## Installation of Ollama (for Linux)
>Downloads for macOS and Windows can be found here: [Link](https://ollama.com/download)
>Documentation for installation under Linux: [general GitHub link](https://github.com/ollama/ollama/blob/main/docs/linux.md), [GitHub permalink](https://github.com/ollama/ollama/blob/1c198977ecdd471aee827a378080ace73c02fa8d/docs/linux.md)
>FAQ entry on storing the models under a different path (possibly needed for large persistent data storage on an institute server): [general GitHub link](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-set-them-to-a-different-location), [GitHub permalink](https://github.com/ollama/ollama/blob/1c198977ecdd471aee827a378080ace73c02fa8d/docs/faq.md#how-do-i-set-them-to-a-different-location)
1. We figure out which of `curl` or `wget` is installed with `which curl` or `which wget` (this should print an install path), then download with the appropriate command below. We assume that `<download-path>` is a placeholder for the path where the archive file will be stored. We suggest using the provided `ollama` directory for testing, but any other directory that fits our needs works as well.
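A sketch of the download step, using the archive URL from Ollama's manual-install documentation (verify against the linked docs in case it has changed):
```bash
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o <download-path>/ollama-linux-amd64.tgz
# or, with wget:
wget -O <download-path>/ollama-linux-amd64.tgz https://ollama.com/download/ollama-linux-amd64.tgz
```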
2. We extract the tar file to a path of our choice by replacing the placeholder `<install-path>` with our path. We suggest the provided `ollama` directory for testing purposes. Apply `sudo` as needed when following the [official documentation](https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install).
```bash
tar -C <install-path> -xzf <download-path>/ollama-linux-amd64.tgz
```
3. We set the `OLLAMA_MODELS` environment variable to the path where we would like to store our models (placeholder `<model-path>`), then start the Ollama web service as a background job in bash, as shown below. As a reminder, environment variables can also be added to `.bashrc` to persist them.
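A minimal sketch of this step (`ollama serve` starts the web service; the trailing `&` backgrounds it in bash):
```bash
export OLLAMA_MODELS=<model-path>
<install-path>/bin/ollama serve &
```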
4. Then, we download the model of our choice (placeholder `<model>`), e.g. `llama3:70b` (48 GB VRAM) or `llama3.2:3b` (3 GB VRAM); this is only necessary once, as long as `OLLAMA_MODELS` is set to the correct `<model-path>`:
```bash
<install-path>/bin/ollama pull <model>
```
5. Ollama is now ready to receive requests to `<model>` from CHESS. We can also start a chat session with `ollama run <model>` to check that everything works. To stop the web service, bring the job to the foreground with `fg`, then stop it with `Ctrl` + `C`. To restart the web service, simply run (only) step 3 again. To remove/uninstall, delete the artifacts with `rm <download-path>/ollama-linux-amd64.tgz`, `rm -r <install-path>/*` and `rm -r <model-path>/*`.