Original-repo: [Link](https://github.com/ShayanTalaei/CHESS)

Paper: [Link](https://arxiv.org/abs/2405.16755)

## Datasets

BIRD: [Link](https://bird-bench.github.io/)

Spider: [Link](https://yale-lily.github.io/spider)

## Repo-Structure

There are three branches alongside the `main` branch:

- [2-llama3-2-konfigurationstest](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/tree/2-llama3-2-konfigurationstest) with the Llama3.2 setup for local testing on graphics cards with 8 GB of VRAM or more; note that tool calls will not work properly because the Llama3.2-3B model is too small
- [3-konfigurieren-fur-llama3-70b-und-slrum](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/tree/3-konfigurieren-fur-llama3-70b-und-slrum) with the setup for the subsampled BIRD dev set
- [5-konfiguration-an-spider-datensatz-anpassen](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/tree/5-konfiguration-an-spider-datensatz-anpassen) with the setup for the subsampled Spider test set and the full Spider test set

> Note that the documentation on the last two branches is unchanged from the first one and does not include documentation specific to those branches.
> Also note that the BIRD branch does not contain the changes necessary to run the Spider test set, whereas the Spider branch's changes should be compatible with the BIRD dev set (the same applies to the subsampled variants).

## Results/data

The raw output of the CHESS framework and the stdout/stderr of the runs used for the report can be found here: [Link](https://cloud.informatik.uni-halle.de/s/x5xY2pNMceYGs2P)

## Setting up `sampling_count` for the revise tool

To set the `sampling_count` for the revise tool on the BIRD SDS (subsampled dev set), switch to branch [3-konfigurieren-fur-llama3-70b-und-slrum](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/tree/3-konfigurieren-fur-llama3-70b-und-slrum), then edit the [CHESS/run/configs/CHESS_IR_SS_CG_BIRD_OSS.yaml](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/blob/3-konfigurieren-fur-llama3-70b-und-slrum/CHESS/run/configs/CHESS_IR_SS_CG_BIRD_OSS.yaml?ref_type=heads#L63) file at line 63 (at the end of the file, under `revise` > `sampling_count`) and set it to the desired value:

```yaml
  revise:
    template_name: 'revise_two'
    engine_config:
      engine_name: 'meta-llama/Meta-Llama-3-70B-Instruct'
      temperature: 0.0
    parser_name: 'revise'
    sampling_count: 3
```

Here, the `sampling_count` for revise is 3. For the ablation studies we used 1 and 0; a value of 0 results in no revisions being generated.

## Setting up the embedding models

The setup is the same for both the Spider and the BIRD versions; we simply provide links to both branches below for ease of access.
Embedding models are used in two places: 1. during preprocessing and 2. in the `retrieve_entity` tool call.

1. During preprocessing: In the `CHESS/src/database_utils/db_catalog/preprocess.py` file, we used `mxbai-embed-large` but also tried `Llama3-70B` for embedding; the latter is commented out. To switch between the models, simply uncomment the one you would like to use and comment out the others (lines 31-35).
Note that Ollama needs to have the models downloaded separately before they can be used:

```python
# EMBEDDING_FUNCTION = VertexAIEmbeddings(model_name="text-embedding-004")
# EMBEDDING_FUNCTION = OpenAIEmbeddings(model="text-embedding-3-large")
# EMBEDDING_FUNCTION = OllamaEmbeddings(model="llama3.2")
EMBEDDING_FUNCTION = OllamaEmbeddings(model="mxbai-embed-large")
# EMBEDDING_FUNCTION = OllamaEmbeddings(model="llama3:70b")
```

Here, the embedding during preprocessing is provided by `mxbai-embed-large`.

- [Link](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/blob/3-konfigurieren-fur-llama3-70b-und-slrum/CHESS/src/database_utils/db_catalog/preprocess.py?ref_type=heads#L34) to the file on the BIRD branch
- [Link](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/blob/5-konfiguration-an-spider-datensatz-anpassen/CHESS/src/database_utils/db_catalog/preprocess.py?ref_type=heads#L34) to the file on the Spider branch

2. In `retrieve_entity`: In the `CHESS/src/workflow/agents/information_retriever/tool_kit/retrieve_entity.py` file, we used `nomic-embed-text` but also tried `Llama3-70B` for embedding; the latter is commented out. To switch between the models, simply uncomment the one you would like to use and comment out the others (lines 35-38). Note that Ollama needs to have the models downloaded separately before they can be used:

```python
        # self.embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")
        # self.embedding_function = OllamaEmbeddings(model="llama3.2")
        self.embedding_function = OllamaEmbeddings(model="nomic-embed-text")
        # self.embedding_function = OllamaEmbeddings(model="llama3:70b")
```

Here, the embedding in the `retrieve_entity` tool call is provided by `nomic-embed-text`.
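Since the model name is hard-coded in both files, switching models always means editing source. As a convenience sketch (not part of the repo; the `EMBEDDING_MODEL` variable and the `default_embedding_model` helper are hypothetical names), the choice could instead be resolved from an environment variable, with per-stage defaults matching the models above:

```python
import os

# Hypothetical helper (not part of the repo): per-stage defaults matching the
# models used in the report; EMBEDDING_MODEL, if set, overrides both stages.
DEFAULT_EMBEDDING_MODELS = {
    "preprocess": "mxbai-embed-large",      # db_catalog/preprocess.py
    "retrieve_entity": "nomic-embed-text",  # tool_kit/retrieve_entity.py
}

def default_embedding_model(stage: str) -> str:
    """Return the Ollama model name to use for embeddings at the given stage."""
    return os.environ.get("EMBEDDING_MODEL", DEFAULT_EMBEDDING_MODELS[stage])
```

Line 34 of `preprocess.py` could then read `EMBEDDING_FUNCTION = OllamaEmbeddings(model=default_embedding_model("preprocess"))`, and similarly for `retrieve_entity.py`, so runs can switch models without touching the source.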
- [Link](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/blob/3-konfigurieren-fur-llama3-70b-und-slrum/CHESS/src/workflow/agents/information_retriever/tool_kit/retrieve_entity.py#L37) to the file on the BIRD branch
- [Link](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/blob/5-konfiguration-an-spider-datensatz-anpassen/CHESS/src/workflow/agents/information_retriever/tool_kit/retrieve_entity.py?ref_type=heads#L37) to the file on the Spider branch

## CHESS's `.env`-file

On each branch, we provide a `.env` file that is compatible with the CHESS setup using Slurm.

The [3-konfigurieren-fur-llama3-70b-und-slrum](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/tree/3-konfigurieren-fur-llama3-70b-und-slrum) branch is set up for the subsampled BIRD dev set. To use the full BIRD dev set with our Slurm script, set `DATA_PATH` to `"./data/BIRD/dev/dev.json"`:

```bash
OPENAI_API_KEY="OPEN AI API KEY"

DB_ROOT_PATH="./data/BIRD/dev" # this directory should be the parent of test_databases

DATA_MODE="dev"
# DATA_PATH="./data/BIRD/dev/dev.json"
DATA_PATH="./data/BIRD/dev/sub_sampled_bird_dev_set.json"
DB_ROOT_DIRECTORY="./data/BIRD/dev/dev_databases"
DATA_TABLES_PATH="./data/BIRD/dev/dev_tables.json"
INDEX_SERVER_HOST='localhost'
INDEX_SERVER_PORT=12345

OPENAI_API_KEY='EMPTY'
GCP_PROJECT=''
GCP_REGION='us-central1'
GCP_CREDENTIALS=''
GOOGLE_CLOUD_PROJECT=''

# PATH="$PATH:$PWD/ollama/bin"
# OLLAMA_HOST="127.0.0.1:11434"
# OLLAMA_MODELS="~/.ollama/models"
```

The [5-konfiguration-an-spider-datensatz-anpassen](https://gitlab.informatik.uni-halle.de/aktxt/re-chess/-/tree/5-konfiguration-an-spider-datensatz-anpassen) branch is set up for the full Spider test set.
To use the subsampled Spider test set with our Slurm script, set `DATA_PATH` to `"./data/Spider/spider_data/sub_sampled_spider_test_set.json"`:

```bash
OPENAI_API_KEY="OPEN AI API KEY"

DB_ROOT_PATH="./data/Spider/spider_data" # this directory should be the parent of test_databases

DATA_MODE="test"
DATA_PATH="./data/Spider/spider_data/test.json"
# DATA_PATH="./data/Spider/spider_data/sub_sampled_spider_test_set.json"
DB_ROOT_DIRECTORY="./data/Spider/spider_data/test_databases" # changed in slurm script
DATA_TABLES_PATH="./data/Spider/spider_data/test_tables.json"
INDEX_SERVER_HOST='localhost'
INDEX_SERVER_PORT=12345

OPENAI_API_KEY='EMPTY'
GCP_PROJECT=''
GCP_REGION='us-central1'
GCP_CREDENTIALS=''
GOOGLE_CLOUD_PROJECT=''

# PATH="$PATH:$PWD/ollama/bin"
# OLLAMA_HOST="127.0.0.1:11434"
# OLLAMA_MODELS="~/.ollama/models"
```

> Note that the `test_databases` directory will be renamed from `test_database` by our Slurm script. Adjust as necessary.

## Using the repo with McGarret/Slurm

1. Download or clone the repo to a place in your home directory on the uni's servers and take note of where you left it; `<repo-path>` will be the placeholder for the repo's root.
2. Select the branch you would like to work on (BIRD or Spider, see above) from the drop-down menu in GitLab and go to the root of the repository if you are not there already.
3. Download a zip file of the repo: click the blue `Code` button in the top right corner and, under `Download source code`, select `zip`.
4. Wait for the download and extract the zip file.
5. Rename the parent directory of the extracted `CHESS` directory to `re-chess` (it must have this exact name to work with the provided Slurm scripts).
6. Rezip the renamed `re-chess` directory to any zip file name, e.g. using `zip -r re-chess.zip re-chess` from where the `re-chess` directory is located. Note that the root in the created zip file will now be named `re-chess`, without the branch name attached to it.
7. Upload the newly created zip file to your home directory on the uni's servers as well, and take note of the path you left it at; this will be the placeholder `<zip-path>` below.
8. Log in to McGarret and change directory to `<repo-path>/scripts`.
9. Start the script, using your uni-provided shorthand as a directory name for the placeholder `<shorthand>`:

```bash
sbatch copyrepo.sh <shorthand> <zip-path>
```

> Note that the stdout and stderr files `slurm-<jobid>.out` and `slurm-<jobid>.err`, as well as the `results-<jobid>.zip` file, will be written to the `<repo-path>/scripts` directory.
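Steps 5 and 6 above can also be done without the `zip` CLI. The following is a minimal local sketch using only the Python standard library; `re-chess-main` is an illustrative stand-in for whatever directory name the GitLab archive actually extracts to:

```python
# Minimal sketch of steps 5-6 (rename, then rezip) without the `zip` CLI.
# "re-chess-main" is an illustrative stand-in for the extracted directory name.
import pathlib
import zipfile

extracted = pathlib.Path("re-chess-main")
extracted.mkdir(parents=True, exist_ok=True)      # stand-in for the real extraction
(extracted / "CHESS").mkdir(exist_ok=True)

renamed = extracted.rename("re-chess")            # step 5: name must be exactly "re-chess"

# step 6: rezip so that the archive root is named "re-chess"
with zipfile.ZipFile("re-chess.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in sorted(renamed.rglob("*")):
        zf.write(path, path.as_posix())

print(zipfile.ZipFile("re-chess.zip").namelist())  # every entry starts with "re-chess/"
```

The archive root is then `re-chess` regardless of how the extracted directory was originally named, which is what the provided Slurm scripts expect.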