How to Build Memory-Efficient Transformers with xFormers Using Packed Sequences, GQA, ALiBi, SwiGLU, and Causal Attention
In this tutorial, we implement xFormers: a practical toolkit for building fast, memory-efficient Transformer models on GPUs. We begin by validating memory-efficient attention against a standard attention implementation, then compare their speed and memory consumption across different sequence lengths. We then examine causal masking, packed variable-length sequences, grouped-query attention, and custom ALiBi positional biases. Finally, […]










