Open Source AI Project




The ViTAE project proposes a new design for vision transformers, a neural network architecture that has become prominent across a wide range of computer vision tasks. It is a collaboration between the University of Sydney and a major e-commerce company in China. Its core idea is a redesigned transformer block built from two new structural units: reduction cells and normal cells.

Reduction cells and normal cells are designed to embed two inductive biases, locality and scale-invariance, directly into the transformer architecture. An inductive bias is an assumption about the data that is built into a model so that it learns more efficiently and generalizes better. Locality is the assumption that elements close to each other in the input, such as neighboring pixels, are likely to be semantically related, which underlies many computer vision methods. Scale-invariance means that the model's behavior should stay consistent when the same content appears at different scales or sizes, which matters for real-world images that vary widely in size and detail.
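To make the two biases concrete, here is a minimal toy sketch in plain Python. It is not the ViTAE code: the function names are hypothetical, the input is a 1-D signal rather than an image, and using dilated convolutions to gather multi-scale context is an illustrative assumption. A small kernel that only looks at nearby positions expresses locality; applying the same kernel at several dilation rates gives every position a view of its surroundings at several scales.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded, same-length 1-D cross-correlation with a dilated kernel.

    Each output position only reads inputs within a small window around it
    (locality); the dilation rate controls how wide that window is.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # input index tapped by kernel weight k
            if 0 <= j < len(signal):       # positions outside the signal count as zero
                acc += w * signal[j]
        out.append(acc)
    return out


def multi_scale_features(signal, kernel, dilations=(1, 2, 3)):
    """Apply one kernel at several dilation rates and stack the responses.

    Returns one feature vector per position, with one entry per scale.
    """
    per_scale = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [list(values) for values in zip(*per_scale)]


if __name__ == "__main__":
    impulse = [0, 0, 1, 0, 0]
    for position, features in enumerate(multi_scale_features(impulse, [1, 1, 1])):
        print(position, features)
```

With dilation 1 the kernel responds only right next to the impulse, while larger dilation rates respond at positions farther away, so the stacked features describe the same content at more than one scale.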

By integrating these inductive biases, the ViTAE model aims to overcome a known limitation of standard transformer architectures, which often need large amounts of data and computational resources to train effectively. According to the project, the reduction and normal cells make the ViTAE transformer both simpler and more effective than baseline transformer models and concurrent works in the field.
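One way to picture how a transformer block can carry a locality bias is to run a small convolution alongside self-attention and add the two results. The sketch below is a toy under stated assumptions, not the project's implementation: tokens are single numbers instead of vectors, there are no learned weights, and the names `self_attention`, `local_conv`, and `parallel_block` are hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head dot-product attention over scalar tokens (no learned weights).

    Every output mixes information from all tokens, so this branch is global.
    """
    out = []
    for query in tokens:
        scores = [query * key for key in tokens]
        top = max(scores)                               # subtract max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, tokens)) / total)
    return out


def local_conv(tokens, kernel=(0.25, 0.5, 0.25)):
    """Zero-padded 3-tap convolution: each output only sees its direct neighbours."""
    n = len(tokens)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - 1                               # taps at i-1, i, i+1
            if 0 <= j < n:
                acc += w * tokens[j]
        out.append(acc)
    return out


def parallel_block(tokens):
    """Residual fusion of the global (attention) and local (convolution) branches."""
    global_branch = self_attention(tokens)
    local_branch = local_conv(tokens)
    return [t + g + c for t, g, c in zip(tokens, global_branch, local_branch)]


if __name__ == "__main__":
    print(parallel_block([1.0, 1.0, 1.0]))
```

The attention branch lets every token see the whole sequence, while the convolution branch restricts each token to its neighbours, so the summed output combines long-range and local information without changing the number of tokens.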

The project claims superior performance of the ViTAE model over existing baseline transformers and concurrent models, particularly in the context of ImageNet, a large-scale dataset widely used for benchmarking the performance of computer vision models. Additionally, the model demonstrates impressive capabilities on various downstream tasks, indicating its versatility and effectiveness across a range of vision-related applications.

To support the broader research community and facilitate further advancements in the field, the project team has committed to sharing the source code and pre-trained models on GitHub. This open-source approach not only enables other researchers to reproduce and validate the reported results but also provides a foundation for future innovations in vision transformer design and application.
