
Training Script Documentation

Overview

This script runs model training under a variety of configurations. Users specify the training mode, optimizer, model type, and other training settings via command-line parameters.

Usage

```bash
python train.py --mode <mode> --optimizer <optimizer> --model <model> [other options]
```

Input Parameters

Required Parameters

| Parameter | Type | Choices | Description |
|---|---|---|---|
| --mode | string | pretraining, finetuning | Specifies the training mode. |
| --optimizer | string | lora, galore, galore8bit, lora+galore8bit, baseline | Selects the optimizer type. |
| --model | string | llama_60m, llama_1b, llama_7b, roberta, gpt2 | Defines the model to train. |

Optional Parameters

| Parameter | Type | Default | Choices | Description |
|---|---|---|---|---|
| --batch_size | int | 16 | N/A | Number of samples per batch. |
| --num_epochs | int | 30 | N/A | Number of training epochs. |
| --max_length | int | 512 | N/A | Maximum token length per input. |
| --num_training_tokens | int | 1e9 | N/A | Number of training tokens (pretraining only). |
| --shuffle | string | true | true, false | Whether to shuffle the training data (not applicable in streaming mode). |
| --dtype | string | fp16 | bf16, fp16 | Data type for training (currently only bf16 works). |
| --lr | float | 4e-4 | N/A | Learning rate for the optimizer. |
| --weight_decay | float | 0.01 | N/A | Weight decay for the optimizer. |
| --tmax | int | 30 | N/A | T_max for the learning-rate scheduler. |
| --lora_config | string | config/lora_config.json | N/A | Path to the LoRA configuration file. |
| --galore_config | string | config/galore_config.json | N/A | Path to the GaLore configuration file. |
| --test | string | false | true, false | Whether to enable test mode: uses only 1000 tokens of the dataset for pretraining and runs the accelerator without bf16 (useful only for A100 GPUs). |
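
For reference, the snippet below sketches how the documented flags could be declared with argparse. It is illustrative only: the actual argument handling lives in train.py, and the names, types, and defaults shown here are taken solely from the tables above.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented train.py options."""
    p = argparse.ArgumentParser(description="Model training entry point")

    # Required parameters
    p.add_argument("--mode", required=True, choices=["pretraining", "finetuning"])
    p.add_argument("--optimizer", required=True,
                   choices=["lora", "galore", "galore8bit", "lora+galore8bit", "baseline"])
    p.add_argument("--model", required=True,
                   choices=["llama_60m", "llama_1b", "llama_7b", "roberta", "gpt2"])

    # Optional parameters (defaults taken from the table above)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--num_epochs", type=int, default=30)
    p.add_argument("--max_length", type=int, default=512)
    p.add_argument("--num_training_tokens", type=int, default=int(1e9))
    p.add_argument("--shuffle", choices=["true", "false"], default="true")
    p.add_argument("--dtype", choices=["bf16", "fp16"], default="fp16")
    p.add_argument("--lr", type=float, default=4e-4)
    p.add_argument("--weight_decay", type=float, default=0.01)
    p.add_argument("--tmax", type=int, default=30)
    p.add_argument("--lora_config", default="config/lora_config.json")
    p.add_argument("--galore_config", default="config/galore_config.json")
    p.add_argument("--test", choices=["true", "false"], default="false")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```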

Example Command

```bash
python train.py --mode pretraining --optimizer lora --model llama_1b --batch_size 32 --num_epochs 20 --shuffle false --lr 3e-4
```

This command runs the script in pretraining mode with the LoRA optimizer on the llama_1b model, using a batch size of 32, 20 epochs, no data shuffling, and a learning rate of 3e-4.
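
To launch several runs programmatically (for example, a small learning-rate sweep), a plain subprocess call is enough. The sketch below assumes train.py is in the current working directory and accepts the flags documented above; the chosen learning rates are purely illustrative.

```python
import subprocess

# Hypothetical learning-rate sweep over the documented train.py flags.
for lr in ["1e-4", "3e-4", "4e-4"]:
    cmd = [
        "python", "train.py",
        "--mode", "pretraining",
        "--optimizer", "lora",
        "--model", "llama_1b",
        "--batch_size", "32",
        "--num_epochs", "20",
        "--shuffle", "false",
        "--lr", lr,
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # raises CalledProcessError if a run fails
```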