# Finetuning LLMs for Text2SQL
## Basic Frameworks

DB-GPT-Hub and Spider
## Packages
"transformers>=4.41.2",
"datasets>=2.14.6",
"tiktoken>=0.7.0",
"torch>=2.2.1",
"peft>=0.4.0",
"trl>=0.5.0",
"prettytable",
"func-timeout",
"sqlparse",
"jsonlines",
"rouge-chinese>=1.0.3",
"jieba>=0.42.1",
"nltk>=3.8.1",
"matplotlib>=3.8.1",
"bitsandbytes>=0.39.0",
"accelerate"
## Formatting the Dataset

First, download the Spider dataset and copy it into the `data/` directory. Then run:

```bash
python -m src.sql_data_process
```

Optionally, pass `--code_representation` to use the code representation format.
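The official Spider release unpacks into the layout below; this is a sketch of what `data/` should contain after copying (the exact expectations of `src.sql_data_process` may differ):

```
data/spider/
├── database/           # one sub-directory per SQLite database
├── tables.json         # schemas and foreign-key info
├── train_spider.json   # training examples
└── dev.json            # dev examples
```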
## Pipeline

See `pipeline.sh`:
```bash
# FINETUNING
wandb offline  # disable wandb syncing
current_date=$(date +"%Y%m%d_%H%M")
train_log="output/logs/train_sft_test_${current_date}.log"
start_time=$(date +%s)
echo "Train start time: $(date -d @$start_time +'%Y-%m-%d %H:%M:%S')" >>${train_log}
dataset_train="example_text2sql_train"
dataset_dev="data/example_text2sql_dev.json"
model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct"
output_dir="output/adapter/llama3_instruct_qlora"
# The default parameters fit on a single A100 (40 GB) GPU. If your server has
# less memory, use smaller values (e.g. a lower lora_rank) or QLoRA with
# --quantization_bit 4.
# Select the GPU with the most free memory:
CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=memory.free,index --format=csv,nounits,noheader | sort -nr | head -1 | awk '{ print $NF }') python -m src.sft_train \
    --model_name_or_path $model_name_or_path \
    --quantization_bit 4 \
    --do_train \
    --dataset $dataset_train \
    --max_source_length 2048 \
    --max_target_length 512 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --template llama3 \
    --lora_rank 64 \
    --lora_alpha 32 \
    --output_dir $output_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine_with_restarts \
    --logging_steps 50 \
    --save_steps 2000 \
    --learning_rate 2e-4 \
    --num_train_epochs 8 \
    --plot_loss \
    --bf16 >> ${train_log}
    # Note: V100 GPUs do not support bf16; drop --bf16 there.
echo "############ train end ###############" >>${train_log}
echo "Train end time: $(date)" >>${train_log}
end_time=$(date +%s)
duration=$((end_time - start_time))
hours=$((duration / 3600))
min=$(( (duration % 3600) / 60))
echo "Time elapsed: ${hours} hour $min min" >>${train_log}

# PREDICT
current_date=$(date +"%Y%m%d_%H%M")
pred_log="output/logs/pred_test_${current_date}.log"
start_time=$(date +%s)
echo "Pred start time: $(date -d @$start_time +'%Y-%m-%d %H:%M:%S')" >>${pred_log}
CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=memory.free,index --format=csv,nounits,noheader | sort -nr | head -1 | awk '{ print $NF }') python -m src.predict \
    --model_name_or_path $model_name_or_path \
    --template llama3 \
    --quantization_bit 4 \
    --finetuning_type lora \
    --predicted_input_filename $dataset_dev \
    --checkpoint_dir $output_dir \
    --predicted_out_filename output/pred/pred_llama3_instruct_qlora.sql >> ${pred_log}
echo "############ pred end ###############" >>${pred_log}
echo "Pred end time: $(date)" >>${pred_log}
end_time=$(date +%s)
duration=$((end_time - start_time))
hours=$((duration / 3600))
min=$(( (duration % 3600) / 60))
echo "Time elapsed: ${hours} hour $min min" >>${pred_log}
```
### Finetuning

- Adjust `dataset_train` and `dataset_dev` to your dataset (found in `data/`).
- `model_name_or_path`: pick the underlying HuggingFace model.
- Rename `output_dir` (the `output/adapter/` prefix stays fixed).
- `--quantization_bit` enables QLoRA (4 or 8 bit); for plain LoRA, simply remove this parameter (see the sketch after this list).
- `--finetuning_type lora` covers both LoRA and QLoRA; use `full` for full-parameter finetuning.
- Set `--template` to either `llama2` or `llama3` to match the model (see `src/data_utils.py` or `src/chat_model.py` for more templates).
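As a quick reference, the relevant flag combinations are (a sketch; all flags appear in `pipeline.sh` above):

```bash
# QLoRA (4-bit quantization), as used in pipeline.sh:
#   --finetuning_type lora --quantization_bit 4
# Plain LoRA: same, but drop the quantization flag:
#   --finetuning_type lora
# Full-parameter finetuning (needs far more GPU memory):
#   --finetuning_type full
```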
### Predict

- Adjust `--template`.
- `--quantization_bit` again enables QLoRA.
- Change the file name in `--predicted_out_filename` for the generated queries.

An example output for CodeLlama 7B can be found in `output/pred`.
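The output file contains one predicted SQL query per line (the format the Spider evaluation below expects), so a quick sanity check is:

```bash
head -n 3 output/pred/pred_llama3_instruct_qlora.sql
```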
## Submitting a Job to a GPU

See `batch_pipeline.sh`:
```bash
#!/bin/bash
#SBATCH --job-name=pipe1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-gpu=8
#SBATCH --mem-per-cpu=4G
cd /path/to/repo
source dbgpt-hub/bin/activate  # activate the Python virtual environment
sh pipeline.sh
```
- `--gres=gpu:1` sets the number of requested GPUs.
- Adjust `/path/to/repo`.

Then submit the job to a GPU with `sbatch`:

```bash
sbatch --nodelist=workg01 batch_pipeline.sh
```

Currently available are `workg01` (40 GB/GPU) and `workg02` (80 GB/GPU), each with 4 GPUs. Running jobs can then be checked with `squeue`.
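Some standard SLURM commands for keeping an eye on the job (`<jobid>` is a placeholder):

```bash
squeue -u $USER            # list only your own pending/running jobs
tail -f slurm-<jobid>.out  # follow the job's output (SLURM's default output file)
scancel <jobid>            # cancel a job
```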
## Evaluation

```bash
cd data/spider
python spider/evaluation.py --gold [gold file] --pred [predicted file] --etype [evaluation type] --db [database dir] --table [table file]
```

Arguments:

- `[gold file]`: gold.sql file where each line is `a gold SQL \t db_id`
- `[predicted file]`: predicted SQL file where each line is a predicted SQL
- `[evaluation type]`: `"match"` for exact set matching score, `"exec"` for execution score, and `"all"` for both
- `[database dir]`: directory which contains sub-directories where each SQLite3 database is stored
- `[table file]`: `tables.json` file which includes foreign key info of each database
Example:

```bash
python spider/evaluation.py --gold dev_gold.sql --pred ../../output/pred/pred_sqlcoder_llama3_2.sql --etype "match" --db database --table tables.json
```
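To get both exact-match and execution accuracy in a single run, set `--etype` to `"all"` (the predicted file name here matches the pipeline above):

```bash
python spider/evaluation.py --gold dev_gold.sql --pred ../../output/pred/pred_llama3_instruct_qlora.sql --etype "all" --db database --table tables.json
```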