Model Description

The Vision Transformer (ViT) architecture applies transformer-based modeling, originally developed for NLP, to sequences of image patches for visual tasks. Instead of using convolutional layers, ViT treats an image as a sequence of non-overlapping patches, embeds them, and feeds them into a standard transformer encoder.
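
As a concrete illustration of this patchify-and-embed step, here is a minimal sketch (assuming PyTorch; it is not the ModelZoo implementation) that uses a strided convolution to split an image into non-overlapping patches and project each patch to the hidden dimension:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each one."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, hidden_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel_size == stride == patch_size is equivalent
        # to slicing the image into patches and applying a linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, channels, height, width)
        x = self.proj(x)                     # (batch, hidden_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, hidden_dim)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # shape: (1, 196, 768)
```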

This implementation supports ViT models of various sizes trained on ImageNet-1K and provides flexible configuration options for patch size, model depth, and hidden dimension. The transformer layers operate over patch embeddings augmented with positional information, enabling strong image classification performance when the model is pretrained on large datasets.
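
To make this flow concrete, the following is a compact, illustrative sketch (again assuming PyTorch, and not this repository's code) that combines patch embedding, a learned [CLS] token, learned positional embeddings, a standard transformer encoder, and a linear classification head. The hyperparameter names are placeholders, not this implementation's config keys.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier for illustration purposes only."""

    def __init__(self, image_size=224, patch_size=16, hidden_dim=192,
                 depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patch projection (as in the previous sketch).
        self.proj = nn.Conv2d(3, hidden_dim, kernel_size=patch_size,
                              stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=4 * hidden_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                     # classify from [CLS]

logits = TinyViT()(torch.randn(2, 3, 224, 224))            # shape: (2, 1000)
```

Classifying from the [CLS] token rather than pooling over the patch tokens follows the original ViT design.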

Code Structure

The code for this model is located in the vit directory within ModelZoo. Here’s how it’s organized:

  • configs/: Contains YAML configuration files for different ViT variants.
  • model.py: Entry point that initializes and builds the model components used for training and evaluation.
  • ViTModel.py: Core implementation of the ViT architecture, including patch embedding, transformer encoder blocks, and classification head.
  • ViTClassificationModel.py: Wraps ViTModel for classification tasks, managing preprocessing, logits generation, and loss computation (a rough sketch of this wrapping appears after this list).
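
The sketch below is a hypothetical illustration of the wrapper's role of producing logits and a loss around a backbone; the class and method names are assumptions for illustration, not the actual ModelZoo API.

```python
import torch
import torch.nn as nn

class ClassificationWrapperSketch(nn.Module):
    """Illustrative wrapper: backbone forward pass, then logits and loss."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone              # e.g. the TinyViT sketch above
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, images, labels=None):
        logits = self.backbone(images)        # (batch, num_classes)
        if labels is None:
            return logits                     # inference path
        return logits, self.loss_fn(logits, labels)  # training path
```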

Available Configurations

Configuration | Description
params_vit_base_patch_16_imagenet_1k.yaml | ViT-Base model with a 16×16 patch size, trained on ImageNet-1K.
params_vit_huge_patch_16_imagenet_1k.yaml | ViT-Huge model with a 16×16 patch size, trained on ImageNet-1K.
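
To inspect the parameters defined in one of these files, the config can be loaded with PyYAML, for example (illustrative only; this assumes PyYAML is installed and the script runs from the vit directory):

```python
import yaml

# Load one of the provided configuration files into a Python dict.
with open("configs/params_vit_base_patch_16_imagenet_1k.yaml") as f:
    params = yaml.safe_load(f)

print(params)  # the file's parameters as a nested Python dict
```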
