Open Source AI Project


Tricksy enables fast approximate inference on a single GPU, with support for sparsity-aware offloading.


The Tricksy project on GitHub is designed to enhance the efficiency and speed of inference on a single GPU by utilizing sparsity-aware offloading. It exploits the natural sparsity in the MLP (multi-layer perceptron) layers of large language models: in ReLU-based models, a large fraction of a layer's neurons produce exactly zero for a given token, so they contribute nothing to the output and can be skipped, creating an opportunity for optimization.
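As a rough illustration of where this sparsity comes from, the sketch below applies ReLU to randomly drawn pre-activations. The values and the 4096-neuron width are made up for the example; trained models typically show far higher sparsity than a symmetric random draw, because the learned weights push most pre-activations below zero.

```python
import random

def relu(x):
    return max(0.0, x)

# Toy stand-in for one token's MLP pre-activations (hypothetical 4096-wide layer).
random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(4096)]
post_activations = [relu(x) for x in pre_activations]

# Every pre-activation <= 0 is zeroed by ReLU, so that neuron is "inactive"
# for this token and its weight rows never needed to be on the GPU.
sparsity = sum(1 for v in post_activations if v == 0.0) / len(post_activations)
print(f"inactive neurons: {sparsity:.0%}")
```

With a symmetric distribution about half the neurons go inactive; in real ReLU language models the inactive fraction per token is often much larger, which is what makes offloading the full MLP worthwhile.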

Tricksy significantly reduces CPU-GPU data transfer by storing a subset of each MLP layer and full attention layers on the GPU, while keeping the full MLP layers in CPU RAM. It predicts active MLP neurons based on the attention layer input and asynchronously prepares the required data on the CPU to be transferred to the GPU. This approach includes a caching mechanism for neuron indices currently on the GPU, optimizing the process of updating and managing this data.
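The caching idea described above can be sketched as a simple set difference: given the neuron indices already resident on the GPU and the indices predicted active for the next token, only the missing rows need to be transferred, and stale rows are evicted to make room. The function below is a minimal illustration of that bookkeeping, not Tricksy's actual implementation; the names and eviction policy are assumptions.

```python
def plan_transfer(cached_ids, predicted_ids, capacity):
    """Decide which neuron rows to copy to the GPU and which cached rows to evict.

    cached_ids:    neuron indices currently resident in the GPU buffer
    predicted_ids: neuron indices predicted active for the upcoming token
    capacity:      maximum number of neuron rows the GPU buffer can hold
    """
    cached, predicted = set(cached_ids), set(predicted_ids)
    to_load = predicted - cached            # only transfer rows not already cached
    free = capacity - len(cached)
    n_evict = max(0, len(to_load) - free)   # evict just enough to make room
    evictable = sorted(cached - predicted)  # hypothetical policy: evict lowest index first
    return sorted(to_load), evictable[:n_evict]

to_load, to_evict = plan_transfer([0, 1, 2, 3], [2, 3, 4, 5], capacity=4)
print(to_load, to_evict)  # -> [4, 5] [0, 1]
```

The key saving is that neurons 2 and 3 are already on the GPU, so only half the predicted-active rows cross the CPU-GPU boundary; the CPU side can assemble those rows asynchronously while the attention layer runs.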

Despite its efficiencies, the project does have limitations. It relies on approximate inference, leading to potential slight accuracy degradation, and it benefits mostly from models using the ReLU activation function. However, the project suggests that similar strategies could be applied to models without ReLU by focusing on neurons with the highest output norms.
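The highest-output-norm idea for non-ReLU models could look like the sketch below: rank neurons by the L2 norm of their down-projection rows and keep only the top k, on the assumption that small-norm rows contribute little to the layer output. This is an illustrative interpretation of the suggestion, not code from the project.

```python
def top_norm_neurons(down_proj_rows, k):
    """Select the k neurons whose output (down-projection) rows have the largest L2 norm.

    down_proj_rows: one weight row per hidden neuron (hypothetical layout)
    """
    norms = [(sum(w * w for w in row) ** 0.5, i)
             for i, row in enumerate(down_proj_rows)]
    norms.sort(reverse=True)                 # largest norm first
    return sorted(i for _, i in norms[:k])   # return kept indices in order

rows = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]  # norms 5, 1, 10
print(top_norm_neurons(rows, 2))  # -> [0, 2]
```

Unlike ReLU sparsity, this is a fixed approximation per layer rather than a per-token prediction, so the accuracy trade-off would need separate evaluation.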

Potential improvements for Tricksy include enhancing evaluations for measuring accuracy degradation, optimizing indexing from CPU RAM, and modifying neuron buffer allocations based on layer sparsity. Moreover, addressing the issue with intermediate copies in PyTorch when applying an advanced index to a pinned tensor could also yield performance improvements.
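On the last point, the usual remedy for an unwanted intermediate copy is to gather rows directly into a preallocated staging buffer instead of materializing a fresh fancy-indexed result each step. The pure-Python sketch below only illustrates that pattern with lists standing in for the pinned CPU buffer; the real fix would involve PyTorch tensor operations and is an assumption about the intended optimization, not a description of Tricksy's code.

```python
def gather_into(src, indices, out):
    """Gather selected rows of src into a preallocated output buffer.

    Writing in place into a reusable buffer (analogous to a pinned staging
    tensor) avoids allocating a new intermediate result on every token.
    """
    for dst_row, src_idx in enumerate(indices):
        out[dst_row][:] = src[src_idx]  # in-place row copy, no per-call allocation
    return out

weights = [[float(i)] * 4 for i in range(8)]  # toy 8x4 weight matrix in "CPU RAM"
buffer = [[0.0] * 4 for _ in range(3)]        # reusable staging buffer
gather_into(weights, [6, 2, 5], buffer)
print(buffer[0])  # -> [6.0, 6.0, 6.0, 6.0]
```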

For more details, you can visit the Tricksy repository on GitHub.
