Downstream Validation Using Eleuther Eval Harness

EleutherAI’s Evaluation Harness (EEH) is a popular framework for evaluating large language models across various different datasets and tasks.

You can configure EEH as part of your training workflow. See Pretraining with Downstream Validation.

The examples in this guide will perform downstream validation on LLaMA3 8B.

Prerequisites

Please ensure that you have installed the Cerebras Model Zoo package by going through the installation guide. Note that EEH version tested and packaged in the Cerebras Model Zoo is the official release v0.4.7.

Please also read through the Trainer Overview and Trainer Configuration Overview, as these guides will help understand how to configure running EEH standalone.

Configure the Run

This section covers the required steps for setting up an EEH run to perform standalone downstream validation on various tasks.

In particular, you will need to write a YAML configuration file to configure an instance of the Trainer callback.

The example in this section configures evaluation for LLaMA3 8B via the multiple choice (non-generative) eval harness task winogrande using a single CSX.

If you aren’t interested in seeing the break down of the configuration, feel free to skip ahead to the Putting it All Together section to see the full YAML configuration.

Configure the CSX Backend

The first step is to specify the CSX backend and resources required for the run.

Create a YAML configuration file with the following cluster config:

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1

This example uses a single CSX, but you can readily update num_csx to run EEH on multiple CSXs for improved performance.

Configure the Model

Next, please add the following model configuration in the YAML for LLaMA3 8B with 8K context length:

trainer:
  init:
    backend:  # CSX
      ...
    model:
      name: llama # This setting is required

      # Embedding
      vocab_size: 128256
      hidden_size: 4096
      position_embedding_type: "rotary"
      pos_scaling_factor: 1.0
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      max_position_embeddings: 8192
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false

      # Decoder
      num_hidden_layers: 32
      dropout_rate: 0.0
      layer_norm_epsilon: 1.0e-5
      norm_type: "rmsnorm"

      # Decoder - Attention
      num_heads: 32
      attention_type: "scaled_dot_product"
      attention_module: "multiquery_attention"
      attention_dropout_rate: 0.0
      use_projection_bias_in_attention: false
      use_ffn_bias_in_attention: false
      extra_attention_params:
        num_kv_groups: 8

      # Decoder - ffn
      filter_size: 14336
      nonlinearity: "swiglu"
      use_ffn_bias: false

      # Task-specific
      use_bias_in_output: false
      loss_scaling: "num_tokens"
      loss_weight: 1.0

      # Initializer
      initializer_range: 0.02

To run downstream validation harness, you must specify the name setting in the model configuration. Valid names corresponding to the supported models include:

btlm
bloom
gpt2
gptj
falcon
gpt3
gpt-neox
llama
mistral
mpt
jais
santacoder
starcoder

Configure the EEH Callback

EEH is implemented as an extension to the Trainer callback.

Add the following section in the YAML to set up the EleutherEvalHarness callback:

trainer:
  init:
    backend: # CSX
      ...
    model: # Llama3-8B
      name: llama # This setting is required
      ...
    callbacks:
    - EleutherEvalHarness:
        # Eleuther Eval Harness settings (also exposed via CLI)
        eeh_args:
          tasks: winogrande
          num_fewshot: 0
        # CSX-specific eval harness settings (also exposed via CLI)
        keep_data_dir: false
        # Dataloader settings
        batch_size: 4
        shuffle: false
        max_sequence_length: 8192
        num_workers: 1
        data_dir: <path_to_mounted_dir>
        tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
        eos_id: 128001
        pretrained_model_name_or_path: null
        # Eval Harness Flags
        flags:
          csx.performance.micro_batch_size: null

The eeh_args section exposes the following settings to configure the EEH run:

Eleuther Eval Harness CLI Arguments	Description
`--tasks`	Comma separated string specifying Eleuther Eval Harness tasks. To get full list of tasks, use the command `lm-eval --tasks list` from within your python venv.
`--num_fewshot`	Number of examples to be added to the fewshot context string. Defaults to 0
`--output_path`	The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used.
`--limit`	Accepts an integer, or a float between 0.0 and 1.0. This limits the number of documents to evaluate per task to the first X documents (if an integer) or first X% of documents. This is useful for debugging.
`--use_cache`	A path to a sqlite db file for caching model responses. None if not caching.
`--cache_requests {true,refresh,delete}`	Speed up evaluation by caching the building of dataset requests. None if not caching.
`--check_integrity`	Whether to run the relevant part of the test suite for the tasks.
`--write_out`	Prints the prompt for the first few documents. Defaults to False.
`--log_samples`	If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Defaults to False.
`--show_config`	If True, shows the the full config of all tasks at the end of the evaluation. Defaults to False.
`--include_path`	Additional path to include if there are external tasks to include.
`--predict_only`	Use with –log_samples. Only model outputs will be saved and metrics will not be evaluated.
`--seed`	Set seed for python’s random, numpy and torch.
`--temperature`	Sampling temperature used for generation (autoregressive, generate_until tasks only).
`--top_p`	Top-p parameter used for nucleus sampling (autoregressive, generate_until tasks only).
`--top_k`	Top-k parameter used for generation (autoregressive, generate_until tasks only).

You can either specify the settings here or pass them via CLI arguments to the standalone EEH run script.

The callback configuration also accepts dataloader settings that you must specify in the YAML to set up input data preprocessing for the run:

DataLoader Settings	Description
`data_dir`	This setting is required. Provide a path to the mounted directory visible to the worker containers where eval harness task data samples are dumped after preprocessing. Use the `mount_dirs` argument to specify a dir mount, similar to our existing flows.
`tokenizer_file_path`	Path to a custom tokenizer (JSON) file. If you provide a custom tokenizer, then you must also specify `eos_id`; otherwise, you must provide a pretrained tokenizer from Hugging Face in `pretrained_model_name_or_path`.
`pretrained_model_name_or_path`	Hugging Face (HF) pretrained model name or path. This setting is required if you do not specify `tokenizer_file_path`. For detailed description, see HF AutoTokenizers.
`eos_id`	End-of-sentence (eos) token ID to signal the termination of a sequence. This setting is required if you specify a custom tokenizer in `tokenizer_file_path`. You can set this by looking for the ID corresponding to the eos token in the custom tokenizer JSON file.
`max_sequence_length`	Maximum length of the input sequence. This setting is required for preprocessing input data samples from the specified eval harness tasks. You should align the `max_sequence_length` field to the `max_position_embeddings` value in the model configuration of the YAML. If you don’t specify `max_sequence_length`, the flow defaults to this `max_position_embeddings` setting.

Additionally, you may optionally specify the following, CSX-specific eval harness setting:

keep_data_dir: Use this to preserve the preprocessed eval harness task data samples, i.e. the directory specified under data_dir. Defaults to False, i.e. data samples are deleted after the run.

(Optional) Configure HuggingFace (HF) Cache Directory

EEH utilizes HF’s APIs to download task data and other configurations. This data is by default cached under $HOME/.cache/huggingface.

However, you may choose to specify a different directory for this cached data via the HFCacheDir callback:

trainer:
  init:
    backend: # CSX
      ...
    model: # Llama3-8B
      ...
    callbacks:
    - EleutherEvalHarness:
        ...
    - HFCacheDir:
        cache_dir: <path_to_directory_for_caching_HF_data>

Putting it All Together

Here’s what the full YAML configuration looks like once you follow this guide for configuring the individual pieces:

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1
        mount_dirs: <path(s)_to_mount_to_appliance_containers>
    model:
      name: llama

      # Embedding
      vocab_size: 128256
      hidden_size: 4096
      position_embedding_type: "rotary"
      pos_scaling_factor: 1.0
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      max_position_embeddings: 8192
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false

      # Decoder
      num_hidden_layers: 32
      dropout_rate: 0.0
      layer_norm_epsilon: 1.0e-5
      norm_type: "rmsnorm"

      # Decoder - Attention
      num_heads: 32
      attention_type: "scaled_dot_product"
      attention_module: "multiquery_attention"
      attention_dropout_rate: 0.0
      use_projection_bias_in_attention: false
      use_ffn_bias_in_attention: false
      extra_attention_params:
        num_kv_groups: 8

      # Decoder - ffn
      filter_size: 14336
      nonlinearity: "swiglu"
      use_ffn_bias: false

      # Task-specific
      use_bias_in_output: false
      loss_scaling: "num_tokens"
      loss_weight: 1.0

      # Initializer
      initializer_range: 0.02
    callbacks:
    - EleutherEvalHarness:
        eeh_args:
          tasks: winogrande
          num_fewshot: 0
        keep_data_dir: false
        # Dataloader settings
        batch_size: 4
        shuffle: false
        max_sequence_length: 8192
        num_workers: 1
        data_dir: <path_to_mounted_dir>
        tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
        eos_id: 128001
        pretrained_model_name_or_path: null
        # Eval Harness Flags
        flags:
          csx.performance.micro_batch_size: null

Running EEH on CS-X

Now that the the YAML configuration is complete, use the Model Zoo CLI to run EEH on various tasks.

This script accepts the following arguments:

lm_eval

cszoo lm_eval --help

usage: cszoo lm_eval [-h] [--tasks task1,task2] [--num_fewshot N] [--output_path DIR|DIR/file.json] [--limit N|0<N<1]
                     [--use_cache DIR] [--cache_requests {true,refresh,delete}] [--check_integrity] [--write_out]
                     [--log_samples] [--system_instruction SYSTEM_INSTRUCTION] [--apply_chat_template]
                     [--fewshot_as_multiturn] [--show_config] [--include_path DIR] [--predict_only] [--seed SEED]
                     [--trust_remote_code] [--max_tokens MAX_TOKENS] [--temperature TEMPERATURE] [--top_p TOP_P]
                     [--top_k TOP_K] [--keep_data_dir] [--target_device {CSX}] [-o MODEL_DIR]
                     [--checkpoint_path CHECKPOINT_PATH] [--load_checkpoint_states LOAD_CHECKPOINT_STATES]
                     [--logging LOGGING] [--compile_only] [--validate_only] [--job_labels JOB_LABELS [JOB_LABELS ...]]
                     [--job_priority {p1,p2,p3}] [--mount_dirs MOUNT_DIRS [MOUNT_DIRS ...]]
                     [--python_paths PYTHON_PATHS [PYTHON_PATHS ...]] [--credentials_path CREDENTIALS_PATH]
                     [--mgmt_address MGMT_ADDRESS] [--disable_version_check] [--num_csx NUM_CSX]
                     [--debug_args [DEBUG_ARGS [DEBUG_ARGS ...]]] [--debug_args_path DEBUG_ARGS_PATH]
                     [--ini [INI [INI ...]]] [--transfer_processes TRANSFER_PROCESSES] [--config CONFIG]
                     params

positional arguments:
  params                Path to .yaml file with model parameters.

optional arguments:
  -h, --help            show this help message and exit
  --system_instruction SYSTEM_INSTRUCTION
                        System instruction to be used in the prompt
  --apply_chat_template
                        If True, applies the chat template to the prompt
  --fewshot_as_multiturn
                        If True, uses the fewshot as a multi-turn conversation
  --target_device {CSX}
                        Target device to run on. Can be one of CSX.
  -o MODEL_DIR, --model_dir MODEL_DIR
                        Model directory where checkpoints will be written.
  --checkpoint_path CHECKPOINT_PATH
                        Checkpoint to initialize weights from.
  --load_checkpoint_states LOAD_CHECKPOINT_STATES
                        Comma-separated string of keys to explicitly specify the components whose state should be loaded if present in a checkpoint. If this flag is used, then all component states that exist in a checkpoint, but are not specified to load via the flag will be ignored. For example, for fine-tuning runs on a different dataset, setting `--load_checkpoint_states="model" will only load the model state; any `optimizer` or `dataloader` state present in the checkpoint will not be loaded. By default, the config is `all`, i.e. everything present in the checkpoint is loaded.
  --logging LOGGING     Specifies the default logging level. Defaults to INFO.
  --compile_only        Enables compile only workflow.
  --validate_only       Enables validate only workflowvalidate_only stops the compilation at ws_km stage for weight streaming mode.
  --job_labels JOB_LABELS [JOB_LABELS ...]
                        A list of equal-sign-separated key value pairs served as job labels.
  --job_priority {p1,p2,p3}
                        Priority of the job. When launching jobs, valid priority should be between p1 and p3, where p1 is highest priority.
  --mount_dirs MOUNT_DIRS [MOUNT_DIRS ...]
                        A list of paths to be mounted to the appliance containers. It should generally contain path to the directory containing the Cerebras modelzoo.
  --python_paths PYTHON_PATHS [PYTHON_PATHS ...]
                        A list of paths to be exported into PYTHONPATH for worker containers. It should generally contain path to the directory containing the Cerebras modelzoo, as well as any external python packages needed.
  --credentials_path CREDENTIALS_PATH
                        Credentials for cluster access. Defaults to None. If None, the value from a pre-configured location will be used if available.
  --mgmt_address MGMT_ADDRESS
                        <host>:<port> for cluster management. If None, the value from a pre-configured location will be used if available. Defaults to None.
  --disable_version_check
                        Disable version check for local experimentation and debugging
  --num_csx NUM_CSX     Number of CS nodes. Defaults to 1
  --debug_args [DEBUG_ARGS [DEBUG_ARGS ...]]
                        DebugArgs to pass to the Cerebras compile and execution, pass as --debug_args sub.object.key=value, where value can be bool int, float or str
  --debug_args_path DEBUG_ARGS_PATH
                        Path to debugs args file. Defaults to None.
  --ini [INI [INI ...]]
                        Debug INI settings to pass to the Cerebras compile and execution, pass as --ini key=value, where value can be bool, int, float or str
  --transfer_processes TRANSFER_PROCESSES
                        Number of processes to use when transferring weights.
  --config CONFIG       Specifies a specific key of the params file to return.

Eleuther Eval Harness Arguments:
  --tasks task1,task2, -t task1,task2
                        Comma-separated list of task names or task groupings to evaluate on.
                        To get full list of tasks, use one of the commands `lm-eval --tasks {{list_groups,list_subtasks,list_tags,list}}` to list out all available names for task groupings; only (sub)tasks; tags; or all of the above
  --num_fewshot N, -f N
                        Number of examples in few-shot context
  --output_path DIR|DIR/file.json
                        The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used.
  --limit N|0<N<1, -L N|0<N<1
                        Limit the number of examples per task. If <1, limit is a percentage of the total number of examples.
  --use_cache DIR       A path to a sqlite db file for caching model responses. `None` if not caching.
  --cache_requests {true,refresh,delete}
                        Speed up evaluation by caching the building of dataset requests. `None` if not caching.
  --check_integrity     Whether to run the relevant part of the test suite for the tasks.
  --write_out, -w       Prints the prompt for the first few documents.
  --log_samples, -s     If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Use with --output_path.
  --show_config         If True, shows the the full config of all tasks at the end of the evaluation.
  --include_path DIR    Additional path to include if there are external tasks to include.
  --predict_only, -x    Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated.
  --seed SEED           Set seed for python's random, numpy, torch, and fewshot sampling.
                        Accepts a comma-separated list of 4 values for python's random, numpy, torch, and fewshot sampling seeds, respectively, or a single integer to set the same seed for all four.
                        The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234,1234` (for backward compatibility).
                        E.g. `--seed 0,None,8,52` sets `random.seed(0)`, `torch.manual_seed(8)`, and fewshot sampling seed to 52. Here numpy's seed is not set since the second value is `None`.
                        E.g, `--seed 42` sets all four seeds to 42.
  --trust_remote_code   Sets trust_remote_code to True to execute code to create HF Datasets from the Hub
  --max_tokens MAX_TOKENS
                        Maximum number of tokens to generate.
  --temperature TEMPERATURE
                        Sampling temperature used for generation.
  --top_p TOP_P         Top-p parameter used for nucleus sampling.
  --top_k TOP_K         Top-k parameter used for generation.
  --keep_data_dir       Specifies whether dumped data samples should be kept for reuse. Defaults to False, i.e. data samples are deleted after the run.

We support a subset of Eleuther’s command line interface (CLI) arguments above. For a more detailed descrition of these supported arguments, see the EEH documentation.
You may also specify these arguments in the YAML under the eeh_args key of the EleutherEvalHarness configuration, but please note that the CLI setting will override the settings in the YAML.
The params CLI argument is required. Use it to specify the path to the YAML configuration file.
Use the --checkpoint_path CLI argument to specify the path to the checkpoint file to load model weights from. If a checkpoint path is not provided, we support checkpoint autoloading in this flow such that the latest checkpoint file will be picked up from the specified model_dir.

Supported Tasks

We support lm_eval@v0.4.7 .
You may perform downstream validation on all EEH tasks with output_type: loglikelihood or output_type: multiple_choice in the task specification. See asdiv and arc_easy for respective examples. You may specify each of these types of tasks separately or together in a single EleutherEvalHarness callback.
We currently do not support eval harness tasks with output_type: loglikelihood_rolling.

Adding New Tasks

Please refer to Eleuther’s new task implementation guide here to add new tasks.

Limitations

We currently do not support running multiple generative eval harness tasks in the same callback.
EEH task groups, such as agieval, comprise multiple generative sub tasks that you will have to configure in the YAML via separate callbacks.
Please turn on grad accumulation and choose a small micro batch size (between 16 to 32) under the flags configuration of the EleutherEvalHarness callback of the YAML,

Examples

Single Non-generative Task

Let’s assume that the YAML configuration file above is written to ./llama3_8B_eeh.yaml. Then, to run evaluation for task winogrande, please set up a bash script as follows:

cszoo lm_eval ./llama3_8B_eeh.yaml \
  --target_device CSX \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \

The output logs are as follows:

...
2024-06-24 11:06:49,596 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=632, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-06-24 11:06:49,625 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=633, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-06-24 11:07:04,333 INFO:   <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7ff89ae3ceb0> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-24 11:07:04,696 INFO:
|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande|      1|none  |     0|acc   |0.7284|±  |0.0141|

Multiple Non-generative Tasks

To run evaluation on more non-generative tasks, you may update the run script to the following:

cszoo lm_eval ./llama3_8B_eeh.yaml \
  --target_device CSX \
  --tasks arc_challenge,hellaswag,openbookqa,piqa,winogrande \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \

The output logs are as follows:

...
2024-06-24 12:40:44,896 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=2979, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
2024-06-24 12:40:57,106 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=2980, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
2024-06-24 12:41:10,333 INFO:   <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7ff89ae3ceb0> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-24 12:41:10,700 INFO:
|    Tasks    |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------|-------|------|-----:|--------|-----:|---|-----:|
|arc_challenge|Yaml   |none  |     0|acc     |0.5334|±  |0.0145|
|             |       |none  |     0|acc_norm|0.5300|±  |0.0146|
|hellaswag    |Yaml   |none  |     0|acc     |0.5716|±  |0.0049|
|             |       |none  |     0|acc_norm|0.7917|±  |0.0043|
|openbookqa   |Yaml   |none  |     0|acc     |0.3210|±  |0.0207|
|             |       |none  |     0|acc_norm|0.4520|±  |0.0223|
|piqa         |Yaml   |none  |     0|acc     |0.7997|±  |0.0097|
|             |       |none  |     0|acc_norm|0.8090|±  |0.0095|
|winogrande   |Yaml   |none  |     0|acc     |0.7284|±  |0.0130|

Generative Task

The EEH flow also supports running generative (autoregressive) eval harness tasks, i.e. tasks that specify output_type: generate_until in the task specification, such as triviaqa or drop. Refer to task triviqa for an example specification from the official EEH repository.

In order to run generative inference on CSX, you must specify the following inference settings in the model config of YAML file:

start_token - ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID. If you do, the model will start inference at the first token that matches any one of the provided IDs. The model will pad inferred predictions with the first ID in the list.
stop_sequences - List of sequences (each one being a list of token IDs). If any one of these sequences is emitted by the model, inference will stop for that sample. For example, suppose you would like to stop inferring after either a newline character (e.g. token id 1), or a combination of a period (e.g. token id 2) followed by a space (e.g. token id 3). In this case, set stop_sequences to [[1], [2, 3]]. To stop inferring after seeing a newline character only, set stop_sequences to [[1]]. To disable this feature, set stop_sequences to an empty list []. Additionally, the following optional parameters may be set:
max_tokens - Maximum tokens to infer for each sample.
loop_dim - Indicates the sequence dimension in the input and output data. Default value is 1. If set to 0, indicates that both input and output data is transposed (i.e. sequence X samples instead of samples X sequence).

For your LLaMA3 8B example, please update the YAML file ./llama3_8B_eeh.yaml to add these inference settings:

trainer:
  init:
    backend: # CSX
      ...
    model: # Llama3-8B
      name: llama
      ...
      # Inference Settings
      start_token: 128256                 # Set to `vocab_size`
      stop_sequences: [[198], [13], [11]] # Respective tokens for "\n", "." and ","
      max_tokens: 256                     # Default from HF implementations
      loop_dim: 1
    callbacks:
    - EleutherEvalHarness:
        # Eleuther Eval Harness settings
        eeh_args:
          tasks: triviaqa
          num_fewshot: 0
      ...
  ...

For start_token, it is ideal to choose a value that’s not going to be generated by the model, i.e. vocab_size in the example above.
The generative task itself defines stop_sequences under setting generation_kwargs.until of the task spec. For instance, triviqa specifies "\n", "." and "," as the stop tokens. The EEH flow will internally override the stop_sequences config with the value from the task, so you can also specify an arbitrary value in the YAML.

Finally, please update the bash script as follows:

cszoo lm_eval ./llama3_8B_eeh.yaml \
  --target_device CSX \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \

The output logs are as follows:

...
2024-06-26 11:22:47,530 INFO:   | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=4480, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 11:22:47,553 INFO:   | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=4485, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 11:38:15,868 INFO:   <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7f9e305fa730> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-26 11:38:16,232 INFO:
| Tasks  |Version|     Filter      |n-shot|  Metric   |Value |   |Stderr|
|--------|------:|-----------------|-----:|-----------|-----:|---|-----:|
|triviaqa|      3|remove_whitespace|     0|exact_match|0.1120|±  |0.0001|

By default, the model will perform greedy sampling of the inferred tokens, i.e. for all of the model’s outputs, pick the token with the highest probability.

In order to perform non-greedy sampling, you can pass in temperature, top_k or top_p to either the bash script or under eeh_args of the YAML. For example:

cszoo lm_eval ./llama3_8B_eeh.yaml \
  --target_device CSX \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \
  --temperature 0.7 \
  --top_k 10 \
  --top_p 0.95 \

Non-generative and Generative Tasks

You can combine evaluation for generative and non-generative eval harness tasks simply via updating the bash script as follows:

cszoo lm_eval ./llama3_8B_eeh.yaml \
  --target_device CSX \
  --tasks winogrande,drop \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \

The output logs are as follows:

...
2024-06-26 13:57:52,177 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=632, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-06-26 13:57:52,206 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=633, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-06-26 13:57:52,333 INFO:   Running generate_until requests
...
2024-06-26 14:00:03,499 INFO:   | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=2382, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 14:00:03,519 INFO:   | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=2383, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 14:00:31,890 INFO:   <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7f156522c490> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-26 14:00:32,351 INFO:
|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande|      1|none  |     0|acc   |0.7284|±  |0.0141|
|drop      |      3|none  |     0|em    |0.0072|±  |0.0008|
|          |       |none  |     0|f1    |0.0371|±  |0.0013|

Since inference settings are baked into the model configuration and that inference requires different resources, there is a separate compile and execution for running downstream validation on generative tasks.
You may specify at most one generative task at a time (per callback).

Run Multiple Generative Tasks

In order to run multiple generative tasks, you may update the run script to the following:

cszoo lm_eval ./llama3_8B_eeh.yaml \
  --target_device CSX \
  --tasks drop,gsm8k \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \

What’s next?

To run downstream validation on code generation tasks, please see check out:

Downstream Validation using BigCode Eval Harness

You can also perform downstream validation using EEH as part of your pretraining runs with upstream validation. Check out the following guide:

Pretraining with Downstream Validation

Get Started

Setup and Installation

Models

Data Preparation

Model Configuration

Training and Eval

Configure and Run Jobs

Monitoring and Troubleshooting

Convert and Port

Advanced Usage

Downstream Validation Using Eleuther Eval Harness

Prerequisites

Configure the Run

Configure the CSX Backend

Configure the Model

Configure the EEH Callback

(Optional) Configure HuggingFace (HF) Cache Directory

Putting it All Together

Running EEH on CS-X

Supported Tasks

Adding New Tasks

Limitations

Examples

What’s next?

Get Started

Setup and Installation

Models

Data Preparation

Model Configuration

Training and Eval

Configure and Run Jobs

Monitoring and Troubleshooting

Convert and Port

Advanced Usage

​Prerequisites

​Configure the Run

​Configure the CSX Backend

​Configure the Model

​Configure the EEH Callback

​(Optional) Configure HuggingFace (HF) Cache Directory

​Putting it All Together

​Running EEH on CS-X

​Supported Tasks

​Adding New Tasks

​Limitations

​Examples

​What’s next?

Prerequisites

Configure the Run

Configure the CSX Backend

Configure the Model

Configure the EEH Callback

(Optional) Configure HuggingFace (HF) Cache Directory

Putting it All Together

Running EEH on CS-X

Supported Tasks

Adding New Tasks

Limitations

Examples

What’s next?