`demo` - a directory that you will have to create yourself. In addition, you will define an environment variable called `PARENT_CS` to return to this parent directory at any time.
1. Create a parent directory, `demo`, to include all the data, code, and checkpoints. Export an environment variable `PARENT_CS` with the full path to the parent directory. This environment variable will be helpful when pointing to the absolute path during execution.
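This setup can be sketched as the following shell commands (the starting location is up to you):

```shell
# Create the parent directory that will hold all data, code, and
# checkpoints, then export its absolute path for later use.
mkdir -p demo
cd demo
export PARENT_CS=$(pwd)
echo "$PARENT_CS"
```

You can then return to the parent directory at any time with `cd $PARENT_CS`.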
Clone the Cerebras Model Zoo repository, `modelzoo`, inside the `demo` parent directory.
Cerebras Model Zoo provides the `create_hdf5_dataset.py` script, which tokenizes the input documents and creates token IDs, labels, and attention masks. For more information, refer to `hdf5_dataset_gpt`. Create a `data` folder, and in the folder, download the `zip` file containing the dataset.
Inside the `wikitext-2-raw-v1` folder, you will find three files corresponding to the train, validation, and test splits. Explore these files to understand the raw dataset format. The data is in raw text format.
The `create_hdf5_dataset.py` script expects all the files in the same folder to belong to the same split. Therefore, create three folders - `train`, `test`, and `valid` - and move the corresponding files to each folder. In addition, add a `.txt` file extension to each file, since each file is in raw format.
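These steps can be sketched as follows. The sketch uses placeholder files created with `touch`; in practice, the extracted WikiText-2 files take their place, and their names may differ from the `wiki.*.raw` names assumed here:

```shell
# Placeholder raw files standing in for the extracted dataset
mkdir -p data/wikitext-2-raw-v1
cd data/wikitext-2-raw-v1
touch wiki.train.raw wiki.valid.raw wiki.test.raw  # stand-ins for the real files

# One folder per split, since create_hdf5_dataset.py treats every file
# in a folder as part of the same split
mkdir -p train valid test

# Move each file into its split folder, adding a .txt extension
mv wiki.train.raw train/wiki.train.raw.txt
mv wiki.valid.raw valid/wiki.valid.raw.txt
mv wiki.test.raw  test/wiki.test.raw.txt
```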
The `create_hdf5_dataset.py` script also parses the file types `.json`, `.jsonl.zst`, and `.jsonl.zst.tar`. For more information on the input format, refer to the `hdf5_dataset_gpt` demo. For more information on the HDF5 format, see the HDF5 documentation.
Make sure you are working in the `(venv_cerebras_pt)` environment. Once the `venv_cerebras_pt` virtual environment has been set up and activated, launch the `create_hdf5_dataset.py` script. This script resides in the `modelzoo` folder of the Cerebras Model Zoo repository. Launch this script for every data split you intend to use. In this case, you will preprocess the `train` and `valid` splits and save the preprocessed data in HDF5 format in the `data/wikitext-2-hdf5/` folder.
Prepare the data for the train set:
`create_hdf5_dataset.py` is instrumented with multiple flags. Our version of the Python script uses the following flags:
Flag | Description |
---|---|
`LMData` | For processing language modelling datasets that are in `.jsonl` or `.txt` format. |
`--params ./configs/autoregressive_lm_preprocessing.yaml` | Path to a YAML config file for setting dataset preprocessing parameters. Optional alternative to providing command-line arguments. |
`--input_dir data/wikitext-2-raw/train/` | Folder where all the `.txt` files are located. In this case, you only have one `.txt` file. |
`--output_dir data/wikitext-2-hdf5/train/` | Destination folder. |
`--files_per_record 100` | Files per record is set to 100, compared to the default of 50,000. Given that there are 1180 training samples, the number of files per record should be smaller than the total number of samples. After preprocessing, the total number of samples will be floor(#samples / files_per_record) * files_per_record. |
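Putting the flags from the table together gives a command like the following sketch. The exact location of `create_hdf5_dataset.py` inside your Model Zoo checkout is an assumption and may differ between releases, so the command is only echoed here:

```shell
# Assemble the preprocessing command from the flag table above.
CMD="python create_hdf5_dataset.py LMData \
  --params ./configs/autoregressive_lm_preprocessing.yaml \
  --input_dir data/wikitext-2-raw/train/ \
  --output_dir data/wikitext-2-hdf5/train/ \
  --files_per_record 100"
echo "$CMD"

# Sanity-check the retained-sample arithmetic from the table:
# floor(1180 / 100) * 100 samples survive preprocessing.
echo $(( 1180 / 100 * 100 ))   # prints 1100
```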
The preprocessed data is stored in `data/wikitext-2-hdf5/train/`. The following list shows the content of the output directory:
To preprocess the validation split, change `--input_dir` to `data/wikitext-2-raw/valid/` and `--output_dir` to `data/wikitext-2-hdf5/valid/`:
The preprocessed data is stored in `data/wikitext-2-hdf5/valid/`. The following list shows the content of the output directory:
The `create_hdf5_dataset.py` script only tokenizes your dataset. Perform data cleaning and shuffling before preparing the data in the HDF5 format, as model quality depends on the quality of your dataset. Additional resources available in Cerebras Model Zoo can be found in Data processing and dataloaders.
Make sure `venv_cerebras_pt` is active.
The training data path must point to `data/wikitext-2-hdf5/train`. You can obtain the absolute path using `realpath`, or by appending `data/wikitext-2-hdf5/train/` to the absolute path in `$PARENT_CS`.
Save these changes in a `custom_config_GPT111M.yaml` file.
Update the `data_dir` flag for the training and evaluation inputs:
Also set `max_steps` for training, `eval_steps` for evaluation, and the checkpoint frequency.
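A sketch of the relevant sections of `custom_config_GPT111M.yaml` after these edits; the absolute path and the step counts below are illustrative placeholders, not prescribed values:

```yaml
train_input:
  data_dir: "/path/to/demo/data/wikitext-2-hdf5/train/"   # $PARENT_CS expanded
eval_input:
  data_dir: "/path/to/demo/data/wikitext-2-hdf5/valid/"
runconfig:
  max_steps: 10          # training steps (illustrative)
  eval_steps: 10         # evaluation steps (illustrative)
  checkpoint_steps: 10   # checkpoint frequency (illustrative)
```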
5. Create a screen session called `train_wsc` by running `screen -S train_wsc`.
6. Inside this screen session, if not already active, activate the Cerebras virtual environment `venv_cerebras_pt`.
7. Launch the `run.py` script associated with the GPT-3 models in Cerebras Model Zoo.
Each model in Cerebras Model Zoo contains a `run.py` script instrumented to easily launch training and evaluation on the Cerebras Wafer-Scale cluster and other AI accelerators. To learn more about how to launch a training job, visit Launch your job; to view the `run.py` code for the GPT-3 model, see the Cerebras Model Zoo repository.
Flag | Description |
---|---|
`--params custom_config_GPT111M.yaml` | Points to the YAML configuration file you customized with the training data paths |
`--num_csx=1` | Number of CS-X systems used in the run |
`--model_dir train_from_scratch_GPT111M` | New directory containing all the checkpoints and logging information |
`--mode train` | Specifies that you are training the model, as opposed to evaluating it |
`--mount_dirs $PARENT_CS $PARENT_CS/modelzoo` | Mounts directories to the Cerebras Wafer-Scale cluster. In this case, all the data and code are in the parent directory `demo` (with absolute path `$PARENT_CS`) |
`--python_paths $PARENT_CS/modelzoo` | Adds the Cerebras Model Zoo to the list of paths to be exported to `PYTHONPATH` |
`--job_labels model=GPT111M` | A list of equal-sign-separated key-value pairs that serve as job labels, which you can query using `csctl` |
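Assembled from the flag table above, the launch looks like this sketch. The location of the GPT-3 `run.py` inside the Model Zoo checkout is an assumption, so the command is only echoed rather than executed:

```shell
# Build the training launch command from the flags described above.
# \$PARENT_CS is kept literal so the sketch works regardless of your path.
CMD="python run.py \
  --params custom_config_GPT111M.yaml \
  --num_csx=1 \
  --model_dir train_from_scratch_GPT111M \
  --mode train \
  --mount_dirs \$PARENT_CS \$PARENT_CS/modelzoo \
  --python_paths \$PARENT_CS/modelzoo \
  --job_labels model=GPT111M"
echo "$CMD"
```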
You can detach the screen session with `CTRL+A D` and reattach it using `screen -r train_wsc`.
Once the training job is complete, you will find the following inside the `train_from_scratch_GPT111M` folder:
`latest_compile` and `cerebras_logs` contain Cerebras-specific information collected while compiling and executing the model; `checkpoint_10.mdl` is the checkpoint saved after ten steps; `train` contains the metrics logged during execution and the YAML configuration file; and `run_<date>_######.log` contains the command line output during execution.
To run evaluation during training, set the `--mode` flag to `train_and_eval` and use the following additional flags in the `runconfig` to configure the run:
Runconfig Param | Description |
---|---|
`eval_frequency` | Frequency at which the evaluation loop runs during training |
`eval_steps` | Number of steps to run in each evaluation loop |
If `eval_steps` isn't specified, the evaluation will loop over the entire evaluation dataloader if the dataloader has a length; otherwise, it will error out.
In `train_and_eval` mode, make sure that `checkpoint_steps` is either a multiple of `eval_frequency` or vice versa, so that the frequency at which evaluations occur lines up with the frequency at which checkpoints are taken.
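An aligned configuration might look like the following sketch (the values are illustrative):

```yaml
runconfig:
  max_steps: 20
  checkpoint_steps: 5   # take a checkpoint every 5 steps
  eval_frequency: 5     # evaluate every 5 steps, aligned with checkpoints
  eval_steps: 10        # run 10 steps of the eval dataloader each time
```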
- `eval_frequency=5`, `checkpoint_steps=5`: Take a checkpoint every 5 steps and run an evaluation every 5 steps.
- `eval_frequency=5`, `checkpoint_steps=10`: Take a checkpoint every 10 steps, but perform an evaluation every 5 steps. Note that at steps 5, 15, 25, … there are no checkpoints available, but evaluation results are available.
- `eval_frequency=10`, `checkpoint_steps=5`: Take a checkpoint every 5 steps, but perform an evaluation every 10 steps.
A checkpoint will always be taken at the final step when `checkpoint_steps` is set to a positive number, regardless of whether `num_steps` is a multiple of `checkpoint_steps`. Additionally, in the "Train and Eval" mode, an evaluation will always run at the final step, regardless of whether `num_steps` is a multiple of `eval_frequency`.
2. Evaluation Code in the Model: Ensure that evaluation-specific code (such as eval metrics) is guarded by the PyTorch module's `training` flag to distinguish between training and evaluation.
For example:
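A minimal sketch of such a guard, using a toy `nn.Module` in which a simple counter stands in for real eval-metric updates:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)
        self.eval_batches = 0  # stand-in for eval-metric state

    def forward(self, x):
        logits = self.linear(x)
        # Guard evaluation-only work behind the module's training flag,
        # so it is skipped when the model is in training mode.
        if not self.training:
            self.eval_batches += 1
        return logits

model = TinyModel()
x = torch.randn(1, 4)
model.train()
model(x)                   # training mode: eval-only branch is skipped
model.eval()
with torch.no_grad():
    model(x)               # eval mode: eval-only branch runs
print(model.eval_batches)  # prints 1
```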
Note that eval metrics in `train_and_eval` mode are not yet supported.
3. Runconfig Limitations: Settings in `runconfig` (such as `precision_opt_level`, or POL) affect both the training and evaluation graphs; there are no nested configurations. This means that the settings specified in `runconfig` apply to both training and evaluation (e.g., POL). Note, however, that `micro_batch_size` is part of `train_input`/`eval_input`. Therefore, it is possible to specify different micro-batch size settings for the train and eval modes, provided that the `micro_batch_size` for evaluation is less than or equal to the `micro_batch_size` for training.
4. Cluster Configuration: It's not currently possible to run training and evaluation on separate portions of a cluster. Separate `run.py` jobs must be executed manually for this.
1. To reattach the `train_wsc` screen session, use `screen -r train_wsc`.
2. Inside this screen session, if not already active, activate the Cerebras virtual environment `venv_cerebras_pt`. You will launch the same `run.py` script used during the training from scratch, but with new flags.
3. Change the following flags:
Flag | Description |
---|---|
`--model_dir finetune_GPT111M` | Specifies a different model directory to save checkpoints and logging information |
`--checkpoint_path Cerebras-checkpoint/cerebras-gpt-dense-111m-sp-checkpoint_final.mdl` | Specifies the path to the Cerebras-GPT 111M checkpoint. You will use this checkpoint to initialize the model weights |
`--load_checkpoint_states="model"` | Flag used in conjunction with `checkpoint_path` to enforce resetting of optimizer states and training steps after loading a given checkpoint. By setting this flag, all the model weights are initialized from the checkpoint provided by `checkpoint_path`, training starts from step 0, and optimizer states present in the checkpoint are ignored. This is useful for fine-tuning runs on different tasks (e.g., classification, Q&A) where weights from a model pre-trained on a language modeling (LM) task are loaded, or for fine-tuning on a different dataset for the same LM task |
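Combining these with the unchanged training flags gives a fine-tuning launch like the following sketch; as before, the `run.py` location is an assumption, so the command is only echoed:

```shell
# Fine-tuning launch command, assembled from the flags above.
CMD="python run.py \
  --params custom_config_GPT111M.yaml \
  --num_csx=1 \
  --model_dir finetune_GPT111M \
  --mode train \
  --checkpoint_path Cerebras-checkpoint/cerebras-gpt-dense-111m-sp-checkpoint_final.mdl \
  --load_checkpoint_states=\"model\" \
  --mount_dirs \$PARENT_CS \$PARENT_CS/modelzoo \
  --python_paths \$PARENT_CS/modelzoo"
echo "$CMD"
```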
Once the fine-tuning job is complete, you will find similar contents inside the `finetune_GPT111M` folder: `run_<date>_######.log` contains the command line output generated during execution.
You will use the `run.py` script found in the Cerebras Model Zoo for evaluation purposes (i.e., only the forward pass). GPT-style models use the data specified in `eval_input.data_dir`, which you set up in step 2. The `run.py` script provides the following types of evaluation with the `--mode` flag:
Flag | Description |
---|---|
`eval` | Evaluates a specific checkpoint. The latest checkpoint will be used if you don't provide the `--checkpoint_path` flag |
`eval_all` | Evaluates all the checkpoints inside a model directory once the model has been trained |
`train` | Evaluates the model periodically during the training process |
`train_and_eval` | Evaluates a model at a fixed frequency during training. This is convenient for identifying issues early in long training runs |
The `train` and `eval` modes require different fabric programming in the CS-X system. Therefore, using `train_and_eval` mode in the Cerebras Wafer-Scale cluster results in additional overhead every time training is stopped to perform evaluation. When possible, we recommend using the `eval_all` mode instead. In this tutorial, you will use the `eval_all` mode. To learn more about the different types of evaluation, visit eval.
When running `run.py`, the latest saved checkpoint will be used by default. If no checkpoint exists, the weights will be initialized as stated in the YAML file, and the model will be evaluated using these weights. If you want to evaluate a previously trained model, make sure that the checkpoints are available in the `model_dir`, or provide the `--checkpoint_path` flag.
Reattach the `train_wsc` screen session.
If not already active, activate the Cerebras virtual environment `venv_cerebras_pt`.
Launch the `run.py` script associated with the GPT-3 models in the Cerebras Model Zoo. This is the same `run.py` script used during training from scratch, but you will change the following flag:
Flag | Description |
---|---|
`--mode eval_all` | Specifies the evaluation mode |
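The evaluation launch is the training command with the mode changed; sketched (and only echoed) under the same assumption about the `run.py` location:

```shell
# Evaluate every checkpoint saved during training from scratch.
CMD="python run.py \
  --params custom_config_GPT111M.yaml \
  --num_csx=1 \
  --model_dir train_from_scratch_GPT111M \
  --mode eval_all \
  --mount_dirs \$PARENT_CS \$PARENT_CS/modelzoo \
  --python_paths \$PARENT_CS/modelzoo"
echo "$CMD"
```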
Once the job is complete, you will find new content in the `train_from_scratch_GPT111M` folder, including an `eval` folder that contains the metrics logged during the model evaluation and a new `run_<date>_######.log` with the command line output during model evaluation. Here is an example of the command line output.
Launch the `run.py` script associated with the GPT-3 models in the Cerebras Model Zoo. This is the same `run.py` script used when evaluating the model trained from scratch, but you will change the following flag:
Flag | Description |
---|---|
`--model_dir finetune_GPT111M` | Specifies the model directory that contains the checkpoints from the fine-tuned model |
Once the job is complete, you will find new content in the `finetune_GPT111M` folder, including an `eval` folder that contains the metrics logged during the model evaluation and a new `run_<date>_######.log` with the command line output during model evaluation. Here is an example of the command line output.
When launching TensorBoard, you may encounter the error `Port 6006 is in use by another program. Either identify and stop that program, or start the server with a different port.` As a workaround, you can launch TensorBoard on a different port with the `--port` flag.
Compare training from scratch (`train_from_scratch_GPT111M`) and fine-tuning from the Cerebras-GPT checkpoint (`finetune_GPT111M`)