Model Description

The Vision Transformer (ViT) architecture applies transformer-based modeling, originally developed for NLP, to sequences of image patches for visual tasks. Instead of using convolutional layers, ViT treats an image as a sequence of non-overlapping patches, embeds them, and feeds them into a standard transformer encoder.
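
As a concrete illustration of this patchify-and-embed step, here is a minimal sketch (assuming PyTorch; it is not the ModelZoo implementation) that uses a strided convolution to split an image into non-overlapping patches and project each patch to the hidden dimension:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each one."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, hidden_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel_size == stride == patch_size is equivalent
        # to slicing the image into patches and applying a linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, channels, height, width)
        x = self.proj(x)                     # (batch, hidden_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, hidden_dim)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # shape: (1, 196, 768)
```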

This implementation supports ViT models of various sizes trained on ImageNet-1K and provides flexible configuration options for patch size, model depth, and hidden dimension. The transformer layers operate over patch embeddings augmented with positional information, enabling strong image classification performance when the model is pretrained on large datasets.
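
To make this flow concrete, the following is a compact, illustrative sketch (again assuming PyTorch, and not this repository's code) that combines patch embedding, a learned [CLS] token, learned positional embeddings, a standard transformer encoder, and a linear classification head. The hyperparameter names are placeholders, not this implementation's config keys.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier for illustration purposes only."""

    def __init__(self, image_size=224, patch_size=16, hidden_dim=192,
                 depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patch projection (as in the previous sketch).
        self.proj = nn.Conv2d(3, hidden_dim, kernel_size=patch_size,
                              stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=4 * hidden_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                     # classify from [CLS]

logits = TinyViT()(torch.randn(2, 3, 224, 224))            # shape: (2, 1000)
```

Classifying from the [CLS] token rather than pooling over the patch tokens follows the original ViT design.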

Code Structure

The code for this model is located in the vit directory within ModelZoo. Here’s how it’s organized:

  • configs/: Contains YAML configuration files for different ViT variants.
  • model.py: Entry point that initializes and builds the model components used for training and evaluation.
  • ViTModel.py: Core implementation of the ViT architecture, including patch embedding, transformer encoder blocks, and classification head.
  • ViTClassificationModel.py: Wraps ViTModel for classification tasks, managing preprocessing, logits generation, and loss computation (a rough sketch of this wrapping appears after this list).
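
The sketch below is a hypothetical illustration of the wrapper's role of producing logits and a loss around a backbone; the class and method names are assumptions for illustration, not the actual ModelZoo API.

```python
import torch
import torch.nn as nn

class ClassificationWrapperSketch(nn.Module):
    """Illustrative wrapper: backbone forward pass, then logits and loss."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone              # e.g. the TinyViT sketch above
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, images, labels=None):
        logits = self.backbone(images)        # (batch, num_classes)
        if labels is None:
            return logits                     # inference path
        return logits, self.loss_fn(logits, labels)  # training path
```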

Available Configurations

Configuration | Description
params_vit_base_patch_16_imagenet_1k.yaml | ViT-Base model with a 16×16 patch size, trained on ImageNet-1K.
params_vit_huge_patch_16_imagenet_1k.yaml | ViT-Huge model with a 16×16 patch size, trained on ImageNet-1K.
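
To inspect the parameters defined in one of these files, the config can be loaded with PyYAML, for example (illustrative only; this assumes PyYAML is installed and the script runs from the vit directory):

```python
import yaml

# Load one of the provided configuration files into a Python dict.
with open("configs/params_vit_base_patch_16_imagenet_1k.yaml") as f:
    params = yaml.safe_load(f)

print(params)  # the file's parameters as a nested Python dict
```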
