Attention
🔹 Motivation Standard RNNs, LSTMs, and GRUs struggle with very long sequences because they must compress everything into a fixed-size hidden state. As the sequence grows, it becomes harder to retain relevant information. Attention mechanisms help by allowing the model to focus on different parts of the input sequence when generating each output.
🔹 Key Idea Instead of compressing all input information into a single context vector, attention provides dynamic weighting of all encoder hidden states.
🔹 Basic Attention Mechanism (Bahdanau Attention) Given a sequence of encoder hidden states \((h_1, h_2, \dots, h_T)\), attention computes a context vector \(c_t\) for each decoder step:
- Compute alignment scores (similarity between the current decoder state and each encoder state): \(e_{ti} = \mathrm{score}(s_t, h_i)\)
- Normalize the scores with a softmax to obtain attention weights: \(\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})}\)
- Compute the context vector: \(c_t = \sum_i \alpha_{ti} h_i\)
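As an illustration of the three steps above, here is a minimal NumPy sketch of a single decoder step. The dimensions and weight names (`W_s`, `W_h`, `v`) are hypothetical and chosen only for the example; they are not from the source.

```python
import numpy as np

# Illustrative sizes: T encoder steps, hidden/attention dimensions are arbitrary
T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 16

rng = np.random.default_rng(0)
H = rng.normal(size=(T, enc_dim))   # encoder hidden states h_1 ... h_T
s_t = rng.normal(size=(dec_dim,))   # current decoder state s_t

# Additive (Bahdanau-style) score: small feedforward net over s_t and each h_i
W_s = rng.normal(size=(attn_dim, dec_dim))
W_h = rng.normal(size=(attn_dim, enc_dim))
v = rng.normal(size=(attn_dim,))

# Step 1: alignment scores e_{ti}
e_t = np.array([v @ np.tanh(W_s @ s_t + W_h @ h_i) for h_i in H])

# Step 2: softmax normalization -> attention weights alpha_{ti}
alpha_t = np.exp(e_t - e_t.max())
alpha_t /= alpha_t.sum()

# Step 3: context vector c_t = sum_i alpha_{ti} h_i
c_t = alpha_t @ H

print(alpha_t.round(3))  # weights sum to 1
print(c_t.shape)         # (enc_dim,)
```

In a trained model the weights `W_s`, `W_h`, `v` are learned jointly with the encoder and decoder; here they are random purely to show the shapes and the flow of the computation.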
🔹 Variants of Attention
- Additive attention (Bahdanau): computes the score with a small feedforward network over the decoder and encoder states
- Dot-product attention (Luong): computes the score as a dot product between the decoder and encoder states; simpler and more efficient (see the sketch below)
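The difference between the variants is only in the score function; everything else (softmax, context vector) stays the same. Below is a hedged sketch of the two score functions, plus Luong's "general" bilinear variant; the parameter names (`W_s`, `W_h`, `v`, `W_a`) are illustrative.

```python
import numpy as np

def additive_score(s_t, h_i, W_s, W_h, v):
    """Bahdanau (additive): feedforward net over the two states."""
    return v @ np.tanh(W_s @ s_t + W_h @ h_i)

def dot_score(s_t, h_i):
    """Luong (dot): a single dot product; requires matching dimensions."""
    return s_t @ h_i

def general_score(s_t, h_i, W_a):
    """Luong (general): bilinear form s_t^T W_a h_i, allowing different dims."""
    return s_t @ (W_a @ h_i)
```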
🔹 Impact
- Improves performance in seq2seq tasks (e.g., translation)
- Forms the foundation of Transformers