Open Source AI Project




The project, named Refiner and developed at the National University of Singapore, improves Vision Transformers (ViTs) for image classification. ViTs have emerged as a powerful tool in this field, known for their impressive accuracy. However, they have a notable limitation: they require a substantial amount of data for pre-training.

Refiner addresses this limitation by focusing on the self-attention mechanism, a key component that differentiates ViTs from conventional Convolutional Neural Networks (CNNs). The self-attention mechanism in ViTs is crucial for understanding the global context of images, which is vital for accurate image classification.

Refiner's primary innovation is enhancing the self-attention mechanism by expanding the self-attention maps into a higher-dimensional space. This expansion is designed to encourage diversity in the attention patterns the ViT model learns. With more diverse patterns, the model can understand and classify a wider range of images, even with less pre-training data.
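The expansion idea can be sketched in a few lines: the attention maps produced by a small number of heads are linearly mixed into a larger number of maps. This is a minimal NumPy sketch under stated assumptions; the function name, weight name, and shapes are illustrative, not taken from the Refiner codebase.

```python
import numpy as np

def expand_attention(attn_logits, w_expand):
    """Linearly mix H attention maps into H_out > H maps (hypothetical sketch)."""
    # attn_logits: (H, N, N) raw attention scores from H heads
    # w_expand:    (H_out, H) learned linear map across the head dimension
    return np.einsum('oh,hij->oij', w_expand, attn_logits)

rng = np.random.default_rng(0)
H, H_out, N = 4, 12, 16                 # e.g. expand 4 heads into 12 maps
logits = rng.normal(size=(H, N, N))
w_expand = rng.normal(size=(H_out, H))

expanded = expand_attention(logits, w_expand)
print(expanded.shape)                   # (12, 16, 16)
```

Because the mixing is a plain linear map over the head dimension, each new map is a learned combination of the original heads, which is what lets the model discover more diverse attention patterns than the original heads alone provide.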

Another significant feature of Refiner is the use of convolutions to strengthen local patterns within the attention maps. This approach is beneficial because it allows the model to first focus on and aggregate local features within an image. Once these local features are aggregated, the model then applies the self-attention mechanism for global aggregation. This two-step process of first focusing on local details and then on the global context helps in achieving more precise image classification.
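The two-step process described above can be illustrated with a small NumPy sketch: a depthwise 3x3 convolution reinforces local structure in each attention map, and only then is the softmax-weighted global aggregation applied. The names, kernel size handling, and zero padding here are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_reinforce(attn, kernel):
    """Depthwise 3x3 convolution over each attention map, zero-padded (sketch)."""
    # attn:   (H, N, N) expanded attention maps
    # kernel: (H, 3, 3) one learned 3x3 filter per map
    H, N, _ = attn.shape
    padded = np.pad(attn, ((0, 0), (1, 1), (1, 1)))
    out = np.empty_like(attn)
    for h in range(H):
        for i in range(N):
            for j in range(N):
                out[h, i, j] = np.sum(padded[h, i:i+3, j:j+3] * kernel[h])
    return out

rng = np.random.default_rng(0)
H, N, d = 4, 8, 16
attn = rng.normal(size=(H, N, N))
kernel = rng.normal(size=(H, 3, 3))
values = rng.normal(size=(H, N, d))

refined = local_reinforce(attn, kernel)   # step 1: aggregate local features
output = softmax(refined) @ values        # step 2: global aggregation
print(output.shape)                       # (4, 8, 16)
```

The loops keep the sketch dependency-free; in practice this step would be a grouped 2D convolution in a deep learning framework, applied to the attention maps before they weight the value vectors.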

The effectiveness of Refiner is demonstrated by its performance: it achieves 86% top-1 classification accuracy on the ImageNet dataset, a standard benchmark for image classification, with only 81 million parameters. This is a significant achievement because obtaining high accuracy with relatively few parameters indicates that the model is not just powerful but also resource-efficient.

In summary, Refiner enhances the capabilities of Vision Transformers in image classification by innovatively refining their self-attention mechanism. This is achieved through the expansion of self-attention maps into a higher-dimensional space and the incorporation of convolutions to emphasize local patterns. These improvements result in a more efficient and effective model, capable of high accuracy with fewer parameters.
