Efficient Transformers: A Survey
Introduction
Transformers have reshaped natural language processing and many other fields by enabling powerful sequence modeling through attention mechanisms. However, their self-attention operation has quadratic complexity with respect to input sequence length, which creates significant computational and memory challenges, especially for long sequences. As applications grow in scale, the demand for more efficient transformer architectures increases. This survey reviews recent advances in model design and architecture that improve transformer efficiency, and it introduces a taxonomy that categorizes these approaches by their core techniques and primary use cases.
Background
The standard transformer architecture, introduced by Vaswani et al. (2017), uses self-attention to capture dependencies between all tokens in a sequence. While effective, self-attention scales quadratically, \(O(N^2)\), with sequence length \(N\), making it impractical for long documents, high-resolution images, or real-time applications. Additionally, the two-layer feed-forward network (FFN) in each block, while linear in sequence length, still contributes significantly to overall computation.
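To make the quadratic cost concrete, here is a minimal NumPy sketch of standard (single-head, unmasked) scaled dot-product attention; the shapes and sizes are illustrative assumptions, and the \(N \times N\) score matrix is exactly what the methods surveyed below try to avoid materializing.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard scaled dot-product attention.
    Q, K, V have shape (N, d); the score matrix Q @ K.T has shape (N, N),
    so time and memory both grow quadratically with sequence length N."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (N, N): the quadratic bottleneck
    return softmax(scores, axis=-1) @ V

# Illustrative sizes: N = 1024 tokens, d = 64 dimensions per head.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
out = full_attention(Q, K, V)              # shape (1024, 64)
```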
Transformer Modes
Transformers can operate in three main modes:
- Encoder-Only: For tasks like text classification and named entity recognition (e.g., BERT).
- Decoder-Only: For autoregressive tasks such as language modeling and text generation (e.g., GPT).
- Encoder-Decoder: For sequence-to-sequence tasks like machine translation.
The choice of mode depends on the application and task requirements. Encoders use only self-attention, while decoders in encoder-decoder models use self-attention together with cross-attention over the encoder outputs; decoder-only models rely on self-attention alone. In decoders, self-attention must be causal (attending only to previous tokens), whereas in encoders it can be non-causal. This difference influences the design of efficient self-attention mechanisms and presents unique challenges.
Taxonomy of Efficient Transformers
Efficient Transformers use a variety of strategies to reduce the computational and memory demands of self-attention. These strategies can be grouped into the following categories:
- Fixed Attention Patterns: Restrict each token’s attention to a predefined subset of tokens rather than the entire sequence.
  - Blockwise: Divide the sequence into blocks; tokens attend only within their block (see the sketch after this list).
  - Strided: Tokens attend to others at fixed intervals.
  - Compressed: Reduce the sequence length before attention (e.g., pooling or strided convolution).
- Combined Patterns: Integrate multiple attention patterns to enhance coverage and flexibility.
- Learned Attention Patterns: The model learns which tokens to attend to during training, often by assigning tokens to clusters or buckets based on relevance.
- Neural Memory Modules: Introduce learnable memory components that aggregate information from the entire sequence, acting as global tokens or memory slots.
- Low-Rank Approximations: Use low-rank factorization to approximate the attention matrix, reducing computational cost.
- Kernel-Based Methods: Reformulate attention as kernel operations, avoiding explicit computation of the full attention matrix.
- Recurrent Memory Connections: Extend blockwise attention by connecting blocks in a recurrent manner, allowing information to flow across blocks.
- Downsampling Techniques: Shorten the input sequence using pooling or striding before applying attention.
- Sparse Models: Activate only a subset of model parameters for each input, often using adaptive or learned sparsity mechanisms (e.g., Mixture-of-Experts).
These categories provide a framework for understanding the diverse approaches to making Transformers more efficient, and help guide the selection of models for specific tasks and constraints.
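As a concrete illustration of the simplest category, the following NumPy sketch implements blockwise local attention under the assumption that the sequence length is divisible by the block size; it is meant to show the shape of the computation, not a production implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_attention(Q, K, V, block_size=64):
    """Fixed-pattern blockwise attention: the sequence is split into
    contiguous blocks and attention is computed independently inside each
    block, reducing cost from O(N^2) to O(N * block_size).
    Assumes N is divisible by block_size."""
    N, d = Q.shape
    nb = N // block_size
    Qb = Q.reshape(nb, block_size, d)
    Kb = K.reshape(nb, block_size, d)
    Vb = V.reshape(nb, block_size, d)
    scores = Qb @ Kb.transpose(0, 2, 1) / np.sqrt(d)   # (nb, b, b)
    return (softmax(scores, axis=-1) @ Vb).reshape(N, d)
```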
Key Models and Approaches
This section highlights notable efficient transformer models, summarizing their core ideas, techniques, and how they fit into the taxonomy above.
Memory Compressed Transformer (Liu et al., 2018)
- Main idea: Combines local blockwise attention with sequence compression to reduce complexity.
- Techniques:
- Local attention within blocks (Fixed Attention Pattern)
- Strided convolution to compress keys/values (Compressed/Downsampling)
- Complexity: \(O(Nb)\) for local attention with block size \(b\); \(O(N \cdot N/k)\) for compressed attention with compression factor \(k\)
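A rough sketch of the compressed-attention half of the model is shown below; mean pooling with stride \(k\) stands in for the learned strided convolution, so the example illustrates the shape of the computation rather than the exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_compressed_attention(Q, K, V, k=4):
    """Memory-compressed attention sketch: keys and values are shortened
    by a factor of k before attention, so the score matrix has shape
    (N, N/k) instead of (N, N). Mean pooling stands in for the learned
    strided convolution; assumes N is divisible by k."""
    N, d = Q.shape
    K_c = K.reshape(N // k, k, d).mean(axis=1)   # (N/k, d)
    V_c = V.reshape(N // k, k, d).mean(axis=1)   # (N/k, d)
    scores = Q @ K_c.T / np.sqrt(d)              # (N, N/k)
    return softmax(scores, axis=-1) @ V_c
```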
Image Transformer (Parmar et al., 2018)
- Main idea: Restricts attention to local neighborhoods for image data. Input is partitioned into query blocks and each query block attends to a memory block that contains the query block and its surrounding pixels.
- Techniques:
- 1D local attention: The image is flattened into a 1D sequence in raster order and divided into non-overlapping query blocks of length \(l_q\). Each query block attends to a memory block containing the query block itself and a fixed number of pixels, \(l_m\), generated before the query block.
- 2D local attention: The image is divided into non-overlapping 2D query blocks of size \(l_q = w_q \times h_q\). Each query block attends to a memory block that extends the query block by \(h_m\) pixels above and \(w_m\) pixels to the left and right, covering pixels generated before the query block.
- Complexity: \(O(Nm)\), where \(m\) is the memory block size
Set Transformer (Lee et al., 2019)
- Main idea: Uses attention and pooling for permutation-invariant set processing.
- Techniques:
- Induced set attention blocks (Neural Memory Modules)
Sparse Transformer (Child et al., 2019)
- Main idea: Applies fixed sparse patterns to reduce attention computation. The idea is to reduce the dense attention matrix to a sparse version by computing attention only over a sparse set of query-key pairs.
- Techniques:
- Combines two types of attention patterns: strided and local attention heads. Half of the heads are dedicated to strided attention and the other half to local attention.
- Complexity: \(O(N\sqrt{N})\)
- Limitation: Requires custom GPU kernel implementation for efficient block-sparse variant matrix multiplication.
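The sketch below builds the two boolean attention patterns as dense matrices purely for illustration (stride and length are assumptions); the efficiency in practice comes from never materializing these masks and instead using block-sparse kernels.

```python
import numpy as np

def sparse_transformer_masks(N, stride):
    """Masks for the two Sparse Transformer head types (causal case).
    mask[i, j] == True means query i may attend to key j.
    Local heads see the previous `stride` positions; strided heads see
    positions that are a multiple of `stride` behind the query.
    With stride ~ sqrt(N), each query attends to O(sqrt(N)) keys."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    causal = j <= i
    local = causal & (i - j < stride)
    strided = causal & ((i - j) % stride == 0)
    return local, strided

local_mask, strided_mask = sparse_transformer_masks(N=16, stride=4)
```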
Longformer (Beltagy et al., 2020)
- Main idea: Combines dilated sliding window and global attention tokens for long documents.
- Techniques:
- Dilated sliding window (Fixed Pattern)
- Global attention tokens (Neural Memory) - like [CLS] for classification, or question tokens in QA
- Complexity: \(O(Nw)\), \(w\) = window size
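A minimal sketch of the resulting attention pattern is below; it builds a dense boolean mask with a plain (non-dilated) sliding window and illustrative positions for the global tokens, whereas a real implementation never materializes the full \(N \times N\) matrix.

```python
import numpy as np

def longformer_mask(N, window, global_idx):
    """Longformer-style pattern: every token attends to `window` tokens on
    each side; tokens listed in `global_idx` (e.g. [CLS] or question
    tokens) attend to, and are attended by, all positions."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    mask = np.abs(i - j) <= window          # sliding window
    mask[global_idx, :] = True              # global tokens attend everywhere
    mask[:, global_idx] = True              # everyone attends to global tokens
    return mask

mask = longformer_mask(N=512, window=64, global_idx=[0])
```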
Extended Transformer Construction (ETC) (Ainslie et al., 2020)
- Main idea: Introduces global-local attention with auxiliary global tokens. The global tokens serve as memory hubs that summarize and redistribute information.
- Techniques:
- Global and local attention split (Combined Patterns, Neural Memory)
- The global tokens can attend to all tokens in the sequence, while local tokens can only attend to a fixed window of surrounding tokens.
- Complexity: \(O(n_g^2 + n_g N)\), where \(n_g\) is the number of global tokens
- Limitation: Not suitable for autoregressive decoding
BigBird (Zaheer et al., 2020)
- Main idea: Extends ETC with global, sliding window, and random attention.
- Techniques:
- Global attention: A subset of indices is selected as global tokens.
- Sliding window attention: Each query attends to \(w/2\) tokens on each side.
- Random attention: Each query attends to \(r\) random tokens in the sequence.
- Complexity: \(O(N)\)
- Limitation: Not suitable for autoregressive decoding
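For illustration only, the sketch below combines the three components into one dense boolean mask at the token level; BigBird itself applies these patterns blockwise and never builds the full matrix. The parameter choices are assumptions.

```python
import numpy as np

def bigbird_mask(N, window, n_global, n_random, seed=0):
    """BigBird-style pattern: a sliding window of w/2 tokens per side,
    the first n_global tokens acting as global tokens, and r random keys
    per query. Token-level sketch of the block-level construction."""
    rng = np.random.default_rng(seed)
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    mask = np.abs(i - j) <= window // 2
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    for q in range(N):
        mask[q, rng.choice(N, size=n_random, replace=False)] = True
    return mask

mask = bigbird_mask(N=512, window=64, n_global=2, n_random=3)
```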
Routing Transformer (Roy et al., 2021)
- Main idea: Learns sparse attention via online clustering of queries and keys.
- Techniques:
- Content-based clustering (Learned Patterns): Queries and keys are projected with a shared \(d \times d\) orthonormal matrix \(W_R\) to form a routing matrix \(R = [QW_R; KW_R]\), which is then clustered with online k-means; each query attends only to keys assigned to the same cluster.
- Complexity: \(O(N^{1.5})\)
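A rough sketch of the routing step is given below; it uses a random (rather than orthonormal) projection and a few rounds of batch k-means in place of the paper's online mini-batch k-means, and it stops at the cluster assignments rather than computing the attention itself.

```python
import numpy as np

def route_to_clusters(Q, K, n_clusters=8, n_iter=10, seed=0):
    """Routing Transformer sketch: project queries and keys with a shared
    matrix W_R, cluster the projections R = [Q W_R ; K W_R] with k-means,
    and return cluster ids. Each query then attends only to keys that fall
    in the same cluster."""
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    W_R = rng.standard_normal((d, d))                 # orthonormal in the paper
    R = np.concatenate([Q @ W_R, K @ W_R], axis=0)
    centroids = R[rng.choice(len(R), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((R[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = R[assign == c].mean(axis=0)
    return assign[:len(Q)], assign[len(Q):]           # query ids, key ids
```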
Reformer (Kitaev et al., 2020)
- Main idea: Uses locality-sensitive hashing (LSH) to sparsify attention.
- Techniques:
- LSH bucketing (Learned/Fixed Patterns): Queries and keys are hashed into buckets using a random projection matrix \(R \in \mathbb{R}^{d_k \times b/2}\), where \(b\) is the number of buckets. The hash function is \(\mathrm{hash}(x) = \arg\max([xR; -xR])\), where \(x\) is the query or key vector. Attention is then computed only within each bucket.
- Complexity: \(O(N \log N)\)
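The hash function itself is only a few lines; the sketch below computes bucket assignments for a set of vectors, with the bucket count chosen arbitrarily for illustration.

```python
import numpy as np

def lsh_buckets(X, n_buckets, seed=0):
    """Reformer-style LSH hashing: project vectors with a random matrix of
    shape (d, n_buckets / 2) and take hash(x) = argmax([xR ; -xR]), which
    yields a bucket id in [0, n_buckets). Vectors pointing in similar
    directions tend to share a bucket, and attention is then restricted
    to (sorted, chunked) buckets."""
    rng = np.random.default_rng(seed)
    d = X.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))
    proj = X @ R                                     # (N, n_buckets/2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

buckets = lsh_buckets(np.random.default_rng(1).standard_normal((1024, 64)), n_buckets=32)
```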
Sinkhorn Transformer (Tay et al., 2020)
- Main idea: Learns block-sparse patterns by sorting input tokens.
- Techniques:
- Learned soft permutation (Learned Patterns)
- Blockwise local attention
Linformer (Wang et al., 2020)
- Main idea: Projects keys/values to lower dimension for low-rank approximation.
- Techniques:
- Low-rank projection (Low-Rank Approximations)
- Complexity: Linear in sequence length
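A minimal sketch of the idea is below; the projection matrices \(E\) and \(F\) are learned parameters in Linformer, but are drawn at random here, and the projected length \(k\) is an illustrative choice.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, k=64, seed=0):
    """Linformer sketch: project K and V along the sequence dimension from
    length N down to k, so the score matrix has shape (N, k) and the cost
    is linear in N. E and F are learned in the real model; random here."""
    rng = np.random.default_rng(seed)
    N, d = Q.shape
    E = rng.standard_normal((k, N)) / np.sqrt(N)
    F = rng.standard_normal((k, N)) / np.sqrt(N)
    scores = Q @ (E @ K).T / np.sqrt(d)          # (N, k)
    return softmax(scores, axis=-1) @ (F @ V)    # (N, d)
```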
Performer (Choromanski et al., 2020)
- Main idea: Approximates attention with kernel methods and random features.
- Techniques:
- FAVOR+ kernel approximation (Kernel-Based Methods)
- Complexity: Linear in sequence length
- Limitation: Causal masking is less efficient
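The sketch below approximates non-causal softmax attention with positive random features in the spirit of FAVOR+; using plain Gaussian features instead of the orthogonal features of the paper, and a fixed feature count, are simplifications.

```python
import numpy as np

def performer_attention(Q, K, V, m=256, seed=0):
    """Performer-style sketch: positive random features
    phi(x) = exp(x @ W - ||x||^2 / 2) / sqrt(m) approximate the softmax
    kernel, so attention becomes phi(Q) @ (phi(K).T @ V), computed in
    O(N * m * d) instead of O(N^2 * d). Non-causal case only."""
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    Q, K = Q / d ** 0.25, K / d ** 0.25          # fold in the 1/sqrt(d) scaling
    W = rng.standard_normal((d, m))

    def phi(X):
        return np.exp(X @ W - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

    Qf, Kf = phi(Q), phi(K)                      # (N, m)
    numerator = Qf @ (Kf.T @ V)                  # (N, d), no N x N matrix
    normalizer = Qf @ Kf.sum(axis=0)             # (N,)
    return numerator / normalizer[:, None]
```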
Linear Transformer (Katharopoulos et al., 2020)
- Main idea: Uses kernel-based attention and associative property for linear complexity.
- Techniques:
- Kernel-based attention (Kernel-Based Methods)
- RNN-style recurrence for causal masking
- Complexity: Linear in sequence length
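A minimal non-causal sketch using the \(\mathrm{elu}(x) + 1\) feature map from the paper is shown below; the causal variant replaces the two sums over keys with running prefix sums, which is what yields the RNN-style recurrence mentioned above.

```python
import numpy as np

def elu_plus_one(x):
    """Feature map phi(x) = elu(x) + 1 used by the Linear Transformer."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linear Transformer sketch (non-causal): with similarity
    phi(q) . phi(k), the output is
        (phi(Q) @ (phi(K).T @ V)) / (phi(Q) @ sum_j phi(k_j)),
    so the N x N attention matrix is never formed and cost is O(N).
    The causal version accumulates phi(K).T @ V and sum_j phi(k_j) as
    running sums, one token at a time, like an RNN state."""
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)
    KV = Kf.T @ V                              # (d, d_v)
    Z = Qf @ Kf.sum(axis=0)                    # (N,)
    return (Qf @ KV) / Z[:, None]
```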
Transformer-XL (Dai et al., 2019)
- Main idea: Connects blocks with recurrence to capture long-term dependencies.
- Techniques:
- Recurrent memory connections (Recurrent Memory)
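A bare-bones sketch of the recurrence is shown below; it only illustrates how cached keys and values from the previous segment extend the current segment's attention, and omits Transformer-XL's causal masking, relative positional encodings, and stop-gradient on the memory.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(Q, K, V, mem_K, mem_V):
    """Segment-level recurrence sketch: keys/values cached from the
    previous segment are prepended so queries in the current segment can
    attend across the segment boundary."""
    K_ext = np.concatenate([mem_K, K], axis=0)
    V_ext = np.concatenate([mem_V, V], axis=0)
    d = Q.shape[-1]
    scores = Q @ K_ext.T / np.sqrt(d)           # (N_cur, N_mem + N_cur)
    return softmax(scores, axis=-1) @ V_ext
```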
Axial Transformer (Ho et al., 2019)
- Main idea: Applies self-attention along individual axes of multidimensional inputs (e.g., rows and then columns of an image) instead of over the flattened sequence.
- Techniques:
- Attention along single axes (Fixed Attention Pattern)
- Complexity: \(O(N\sqrt{N})\) for a 2D input with \(N\) pixels
Discussion
While efficient transformer models have made significant progress in reducing computational and memory costs, several challenges and open questions remain:
- Benchmarking and Comparability:
  - Many papers use different datasets, tasks, and evaluation metrics, making direct comparison difficult.
  - Hyperparameter choices, training regimes, and implementation details can significantly affect results.
  - There is a need for standardized benchmarks and fair evaluation protocols to assess efficiency and performance trade-offs.
- Task and Mode Specialization:
  - Some models are designed primarily for encoder-only tasks (e.g., classification), while others target autoregressive generation or sequence-to-sequence tasks.
  - Not all efficient attention mechanisms support causal masking, limiting their applicability for generative models.
- Trade-offs:
  - Reducing attention complexity often comes at the cost of reduced expressiveness or global context.
  - Some methods introduce additional architectural complexity or require custom hardware/software support.
  - There is often a balance to strike between efficiency, accuracy, and model flexibility.
- Scalability and Real-World Use:
  - While many models show promising results on synthetic or academic benchmarks, real-world deployment may reveal new bottlenecks.
  - Memory usage, inference speed, and ease of integration into existing pipelines are important practical considerations.
- Future Directions:
  - Continued development of hybrid models that combine multiple efficiency techniques.
  - Exploration of adaptive and data-dependent attention mechanisms.
  - Better understanding of the theoretical limits of efficient attention and their impact on downstream tasks.
Orthogonal Efforts
In addition to architectural innovations in attention mechanisms, several orthogonal strategies can further improve the efficiency and practicality of transformer models:
- Weight Sharing: Reusing parameters across layers to reduce model size and memory footprint.
- Quantization: Lowering the precision of weights and activations (e.g., from float32 to int8) to speed up computation and reduce memory usage.
- Inference Optimization: Techniques such as pruning, distillation, and specialized hardware acceleration to make inference faster and more resource-efficient.
- Knowledge Distillation: Training smaller models (students) to mimic larger models (teachers), achieving similar performance with fewer resources.
- Task Adapters: Adding lightweight, task-specific modules to a shared backbone, enabling efficient multi-task learning and transfer.
- Alternative Architectures: Exploring non-transformer models or hybrid approaches that may offer better efficiency for certain tasks or hardware.