
Finetuning LLMs for Text2SQL

Basic Frameworks

DB-GPT-Hub and Spider

Packages

"transformers>=4.41.2",
"datasets>=2.14.6",
"tiktoken>=0.7.0",
"torch>=2.2.1",
"peft>=0.4.0",
"trl>=0.5.0",
"prettytable",
"func-timeout",
"sqlparse",
"jsonlines",
"rouge-chinese>=1.0.3",
"jieba>=0.42.1",
"nltk>=3.8.1",
"matplotlib>=3.8.1",
"bitsandbytes>=0.39.0",
"accelerate"

Formatting the Dataset

First, download the Spider dataset and copy it into the data/ directory. Then run the following command:

python -m src.sql_data_process

Optionally, pass --code_representation to use the code representation format.
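
For example, the same preprocessing step with the flag applied:

python -m src.sql_data_process --code_representation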

Pipeline

See pipeline.sh:

# FINETUNING
wandb offline # disable wandb logging

current_date=$(date +"%Y%m%d_%H%M")
train_log="output/logs/train_sft_test_${current_date}.log"
start_time=$(date +%s)
echo " Train Start time: $(date -d @$start_time +'%Y-%m-%d %H:%M:%S')" >>${train_log}

dataset_train="example_text2sql_train"
dataset_dev="data/example_text2sql_dev.json"

model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct"
output_dir="output/adapter/llama3_instruct_qlora"

# The default parameter set fits on a single A100 (40 GB) GPU; if your server
# has less memory, reduce parameters such as lora_rank or use QLoRA with 4-bit quantization.
CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=memory.free,index --format=csv,nounits,noheader | sort -nr | head -1 | awk '{ print $NF }') python -m src.sft_train \
    --model_name_or_path $model_name_or_path \
    --quantization_bit 4 \
    --do_train \
    --dataset $dataset_train \
    --max_source_length 2048 \
    --max_target_length 512 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --template llama3 \
    --lora_rank 64 \
    --lora_alpha 32 \
    --output_dir $output_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine_with_restarts \
    --logging_steps 50 \
    --save_steps 2000 \
    --learning_rate 2e-4 \
    --num_train_epochs 8 \
    --plot_loss \
    --bf16  >> ${train_log}
    # Note: V100 GPUs do not support bf16; remove the --bf16 flag on such hardware.
    
echo "############train end###############" >>${train_log}
echo "Train End time: $(date)" >>${train_log}
end_time=$(date +%s)
duration=$((end_time - start_time))
hours=$((duration / 3600))
min=$(( (duration % 3600) / 60))
echo "Time elapsed: ${hour}  hour $min min " >>${train_log}

# PREDICT

current_date=$(date +"%Y%m%d_%H%M")
pred_log="output/logs/pred_test_${current_date}.log"
start_time=$(date +%s)
echo " Pred Start time: $(date -d @$start_time +'%Y-%m-%d %H:%M:%S')" >>${pred_log}

CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=memory.free,index --format=csv,nounits,noheader | sort -nr | head -1 | awk '{ print $NF }')  python -m src.predict \
    --model_name_or_path $model_name_or_path \
    --template llama3 \
    --quantization_bit 4 \
    --finetuning_type lora \
    --predicted_input_filename $dataset_dev \
    --checkpoint_dir $output_dir \
    --predicted_out_filename output/pred/pred_llama3_instruct_qlora.sql >> ${pred_log}

echo "############pred end###############" >>${pred_log}
echo "pred End time: $(date)" >>${pred_log}
end_time=$(date +%s)
duration=$((end_time - start_time))
hours=$((duration / 3600))
min=$(( (duration % 3600) / 60))
echo "Time elapsed: ${hour}  hour $min min " >>${pred_log}

Finetuning

  1. Adjust dataset_train and dataset_dev to your dataset (found in data/); a combined sketch follows this list
  2. Set model_name_or_path to the underlying Hugging Face base model
  3. Rename output_dir (the output/adapter/ prefix stays fixed)
  4. Use --quantization_bit for QLoRA (4 or 8 bit); for plain LoRA, simply remove this parameter
  5. Use --finetuning_type lora for LoRA and QLoRA, full for full-parameter finetuning
  6. Set --template to either llama2 or llama3 accordingly (see src/data_utils.py or src/chat_model.py for more templates)
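
As a combined sketch, a plain-LoRA run on CodeLlama might look like this (model name and output path are illustrative, not from the repo):

# Hypothetical values for steps 1-6: plain LoRA (no --quantization_bit),
# CodeLlama base model, llama2 template.
dataset_train="example_text2sql_train"
dataset_dev="data/example_text2sql_dev.json"
model_name_or_path="codellama/CodeLlama-7b-Instruct-hf"
output_dir="output/adapter/codellama_lora"
# In the training call: keep --finetuning_type lora, set --template llama2,
# and remove the --quantization_bit 4 line.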

Predict

  1. Adjust --template
  2. Use --quantization_bit again for QLoRA
  3. Change --predicted_out_filename to the desired file name for the generated queries (a sketch with all three adjustments follows this list)
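
A sketch with all three adjustments applied to the predict call from pipeline.sh (adapter and file names are illustrative; --quantization_bit is omitted here, matching a plain-LoRA adapter):

python -m src.predict \
    --model_name_or_path codellama/CodeLlama-7b-Instruct-hf \
    --template llama2 \
    --finetuning_type lora \
    --predicted_input_filename data/example_text2sql_dev.json \
    --checkpoint_dir output/adapter/codellama_lora \
    --predicted_out_filename output/pred/pred_codellama_lora.sql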

An example output for CodeLlama 7B can be found in output/pred.

Submitting a Job to a GPU

See batch_pipeline.sh:

#!/bin/bash
#SBATCH --job-name=pipe1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-gpu=8
#SBATCH --mem-per-cpu=4G

# Change into the repository and activate the Python virtual environment
cd /path/to/repo
source dbgpt-hub/bin/activate

sh pipeline.sh

--gres=gpu:1 specifies the number of requested GPUs. Adjust /path/to/repo to the location of your repository.

Then submit the job to a GPU with sbatch:

sbatch --nodelist=workg01 batch_pipeline.sh

Currently available nodes are workg01 (40 GB/GPU) and workg02 (80 GB/GPU), each with 4 GPUs.

Running jobs can then be checked with squeue.
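
For example, using standard Slurm commands:

squeue -u $USER      # show only your own running and queued jobs
scancel <jobid>      # cancel a job by its ID if needed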

Evaluation

cd data/spider
python spider/evaluation.py --gold [gold file] --pred [predicted file] --etype [evaluation type] --db [database dir] --table [table file]

arguments:
  [gold file]        gold.sql file where each line is `a gold SQL \t db_id`
  [predicted file]   predicted sql file where each line is a predicted SQL
  [evaluation type]  "match" for exact set matching score, "exec" for execution score, and "all" for both
  [database dir]     directory which contains sub-directories where each SQLite3 database is stored
  [table file]       table.json file which includes foreign key info of each database

Example:

python spider/evaluation.py --gold dev_gold.sql --pred ../../output/pred/pred_sqlcoder_llama3_2.sql --etype "match" --db database --table tables.json
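
To compute both exact-set-match and execution scores in one run, the same call with --etype "all" should work:

python spider/evaluation.py --gold dev_gold.sql --pred ../../output/pred/pred_sqlcoder_llama3_2.sql --etype "all" --db database --table tables.json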