Semantic regions provide a generalized approach to data preprocessing for machine learning models, particularly for instruction-following datasets. Traditional methods hard-code loss masking for the user and assistant sections; semantic regions offer more flexibility.

What is a Semantic Region?

Typically, when machine learning models are trained on instruction-following datasets, the loss over user input is masked out so that training focuses on generating assistant responses. Existing libraries often implement this masking in a rigid, hard-coded way.

In our data preprocessing, we generalized this idea with the concept of semantic regions. A semantic region is a section of data that shares a common meaning. This allows for more nuanced data preprocessing across different types of datasets.

Why Use Semantic Regions?

Let’s consider the user and assistant sections of a dialogue as two special cases of a semantic region. The user section shows the model an example question, which is distinct from the assistant response that demonstrates the desired output of the model. For pedagogical purposes, we introduce a simple representation to visualize semantic regions.

Below, each tuple contains (text, loss_mask_value):


    [
        ("Hello how are you doing?", 0),
        ("I'm great, how are you?",  1),
    ]

The above case fits easily into existing frameworks. However, consider a medical question-answering dataset with three key components:

  • A medical passage

  • A related question

  • An answer

In older, hard-coded systems, you would have to:

  1. Combine the passage and question into a single “user” region

  2. Thereby lose the ability to learn from the medical passage during training

Our semantic regions approach solves this by allowing granular separation:


    [
        ("The patient's TSH levels are elevated due to hypothyroidism", 1),
        ("What is the relation between TSH and hypothyroidism?",        0),
        ("Hypothyroidism is associated with elevated TSH levels",       1),
    ]

Similarly, consider question answering over Confluence documents. These contain ‘structure tokens’ representing section headers and metadata (Date, Author, or Last Updated) that are not useful for the model to learn to predict. We can separate the structure tokens into loss-masked semantic regions while still computing loss over the useful content in the user section:


    [
        ("Date: 2024-05-06",                         0),
        ("Author: jquxizop",                         0),
        ("This feature improves performance by 3x.", 1),
        ("Viewed by: 1,520",                         0),
        ("How does the feature work?",               0),
        ("The feature works by...",                  1),
    ]
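
In each of these examples, the region-level mask is conceptually broadcast to token level during preprocessing: every token produced from a region's text inherits that region's mask value. Below is a minimal sketch of that expansion, assuming a generic tokenize callable; the function names are illustrative, not part of the library's API:

    def build_token_loss_mask(regions, tokenize):
        """Expand per-region (text, loss_mask_value) pairs to token level."""
        token_ids, loss_mask = [], []
        for text, mask in regions:
            ids = tokenize(text)                 # token ids for this region
            token_ids.extend(ids)
            loss_mask.extend([mask] * len(ids))  # broadcast the region's mask
        return token_ids, loss_mask

    # For illustration, with a whitespace "tokenizer":
    ids, mask = build_token_loss_mask(
        [("How does the feature work?", 0), ("The feature works by...", 1)],
        tokenize=lambda s: s.split(),
    )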

We currently do not offer the ability to divide inputs into different semantic regions in pretraining mode. We offer this capability in finetuning mode, for both text and multi-modal datasets.

Semantic Data Arrays

We also introduce a data specification for our processing pipeline, called the semantic data array. Input data can come in a variety of formats, but we require a standard format to correctly parse it into semantic regions so that we can apply the corresponding attributes such as loss mask.


    [
        {
            "type": "...",
            "content": [...],
            "semantic_loss_mask" (Optional): [...],
            "semantic_drop_mask" (Optional): [...],
            "semantic_attention_mask" (Optional): [...]
        },
        ...
        {
            "type": "...",
            "content": [...],
            "semantic_loss_mask" (Optional): [...],
            "semantic_drop_mask" (Optional): [...],
            "semantic_attention_mask" (Optional): [...]
        }
    ]
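
For concreteness, the same schema can be expressed with Python type hints. This is a sketch of our reading of the spec above, not a type definition exported by the library:

    from typing import TypedDict

    class _RequiredKeys(TypedDict):
        type: str                      # "system" | "prompt" | "completion" | "user" | "assistant"
        content: list[dict[str, str]]  # [{region_name: region_content}, ...]

    class SemanticRegionEntry(_RequiredKeys, total=False):
        # Optional fields; defaults are derived from "type" (see below).
        semantic_loss_mask: list[int]        # one 0/1 value per content entry
        semantic_drop_mask: list[bool]       # one True/False value per content entry
        semantic_attention_mask: list[bool]  # one True/False value per content entry

    # A semantic data array is simply a list of such entries.
    SemanticDataArray = list[SemanticRegionEntry]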

The type field controls behavior for chat templates (more details here), and can take the values "system", "prompt", "completion", "user", or "assistant". The difference between prompt/completion and user/assistant is whether we apply a chat template: prompt/completion does not apply the template, while user/assistant does.
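
To illustrate the distinction, here is a toy formatter using a ChatML-style template. The actual template is model-specific; this only sketches the behavioral difference between the two pairs of types:

    def render_region(entry_type: str, text: str) -> str:
        # Toy illustration only: real chat templates are model-specific.
        if entry_type in ("user", "assistant"):
            # Chat-template types: wrap the text in role markers.
            return f"<|im_start|>{entry_type}\n{text}<|im_end|>\n"
        # "prompt" / "completion": the text is used verbatim.
        return text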

The content field is a list of dictionaries, where each key is the name of a semantic region and the value is that region's content. Currently, the image semantic region is special: its content must be a string containing the path to the image. Region names other than image are interpreted as text.

The semantic_{loss/drop/attention}_mask fields are optional, with default values determined by the type field. If specified, each must be a list with the same number of entries as the content list.

By default, completion and assistant are not loss-masked, i.e. they have semantic_loss_mask = 1. The system, prompt, and user types default to semantic_loss_mask = 0.

All types default to semantic_attention_mask = 1, i.e. attention is paid to them.

All types also default to semantic_drop_mask = False, which means no region is dropped. The popular LLaVA model dropped the text from its prompts in one phase of training, so we introduced this feature to support dropping arbitrary semantic regions according to a desired scheme (more details here).
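
Summarizing these defaults in code (the constant names here are ours, for illustration only):

    # Default mask values by type, as described above.
    DEFAULT_SEMANTIC_LOSS_MASK = {
        "system": 0, "prompt": 0, "user": 0,  # input regions: no loss
        "completion": 1, "assistant": 1,      # output regions: loss applied
    }
    DEFAULT_SEMANTIC_ATTENTION_MASK = 1       # every region is attended to
    DEFAULT_SEMANTIC_DROP_MASK = False        # no region is dropped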

Now let us express the medical question-answering example from above as a real semantic data array:


    [
        {
            "type": "user",
            "content": [
                {"passage": "The patient's TSH levels are elevated due to hypothyroidism"},
                {"question": "What is the relation between TSH and hypothyroidism?"}
            ],
            "semantic_loss_mask": [1, 0]
        },
        {
            "type": "assistant",
            "content": [
                {"text": "Hypothyroidism is associated with elevated TSH levels"}
            ],
            "semantic_loss_mask": [1]
        }
    ]

Read Hooks to Organize Semantic Data Arrays

We use read hooks to convert different input formats into our semantic data array. Pre-built hooks are provided for standard input formats and masking schemes, but hooks also let users write code that transforms arbitrary inputs into any valid configuration of the semantic data array.


    [
        {
            "type": "prompt",
            "content": [
                {"image": "path/to/image.jpg"},
                {"text": "User's text before and after image"}
            ],
            "semantic_drop_mask": [False, True],
            "semantic_attention_mask": [True, False],
            "semantic_loss_mask": [0, 1],
        },
        {
            "type": "completion",
            "content": [{"text": "Assistant's response"}],
            "semantic_drop_mask": [False],
            "semantic_attention_mask": [True],
            "semantic_loss_mask": [1],
        }
    ]

In this example, the drop mask is set to True for the text region of “prompt,” indicating that this text portion will be dropped from the dataset and not tokenized. The semantic attention mask determines which regions contribute to the final attention mask passed to the model. A loss mask of 0 for a region means that the label tokens corresponding to that region will not be included in the loss calculation.

The values of the semantic loss mask must be either 0 or 1.
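
To make the custom-hook idea concrete, here is a hypothetical read hook for the medical-QA example from earlier. The function signature and the raw record's field names are our assumptions for illustration, not the library's actual hook API:

    def medical_qa_read_hook(example: dict) -> list[dict]:
        """Map a raw record with 'passage', 'question', and 'answer'
        fields into a semantic data array."""
        return [
            {
                "type": "user",
                "content": [
                    {"passage": example["passage"]},
                    {"question": example["question"]},
                ],
                "semantic_loss_mask": [1, 0],  # learn the passage, mask the question
            },
            {
                "type": "assistant",
                "content": [{"text": example["answer"]}],
                "semantic_loss_mask": [1],     # learn the answer
            },
        ]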

Learn more about pre-built and custom read hooks here.