Sunday, July 2, 2023

FlashAttention

FlashAttention is an **IO-aware exact attention algorithm** that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It is designed to be **fast and memory-efficient**¹.


Its number of HBM accesses is optimal over a range of SRAM sizes, and it requires far fewer HBM accesses than standard attention¹. FlashAttention trains Transformers faster than existing baselines and enables longer context lengths, yielding higher-quality models¹.
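To make the tiling idea concrete, here is a minimal NumPy sketch of blocked attention with an online softmax, which is the trick that lets the algorithm avoid ever materializing the full N×N score matrix. The function name, block size, and the sanity check are illustrative assumptions, not the paper's implementation; the real FlashAttention is a fused CUDA kernel that also tiles over queries and manages SRAM explicitly.

```python
# Illustrative sketch only: exact attention computed over key/value blocks
# with a running (online) softmax, so no N x N score matrix is ever stored.
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact softmax attention, accumulated one K/V block at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))            # running (unnormalized) output
    row_max = np.full(N, -np.inf)   # running row-wise max of scores
    row_sum = np.zeros(N)           # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]   # one key block (the "SRAM tile")
        Vb = V[start:start + block_size]   # matching value block
        S = (Q @ Kb.T) * scale             # scores for this block only

        new_max = np.maximum(row_max, S.max(axis=1))
        # Rescale previously accumulated statistics to the new max
        # (the online-softmax correction).
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])

        row_sum = row_sum * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        row_max = new_max

    return O / row_sum[:, None]            # normalize once at the end

# Sanity check against standard attention that materializes the full matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

Because each pass touches only one key/value block plus per-row running statistics, the working set stays small enough to live in fast on-chip memory, which is where the reduction in HBM traffic comes from.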


Source: Conversation with Bing, 7/2/2023

(1) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. https://arxiv.org/abs/2205.14135

(2) arXiv:2205.14135v2 [cs.LG], 23 Jun 2022 (paper PDF). https://arxiv.org/pdf/2205.14135.pdf

(3) Introducing Lightning Flash — From Deep Learning Baseline To ... (Medium). https://medium.com/pytorch/introducing-lightning-flash-the-fastest-way-to-get-started-with-deep-learning-202f196b3b98

(4) Attention in Neural Networks - 1. Introduction to attention mechanism. https://buomsoo-kim.github.io/attention/2020/01/01/Attention-mechanism-1.md/
