Some Math Notes About Self-Attention
If you have followed the recent frenzy about ChatGPT, Bard and other large language models, you probably already know about the Transformer, the key building block behind these models. The core algorithm inside a Transformer block is self-attention, which is commonly described by the following equation: \begin{matrix} \mathrm{Attn}(Q, K, V) = \mathrm{softmax}(QK^T)V \end{matrix} (NOTE: the scaling factor inside the softmax is omitted for brevity.) Sometimes this equation is also written in an iterative way like this: ...
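As a concrete reference for the matrix form of the attention equation, here is a minimal sketch in plain Python using only the standard library. The helper names (`softmax`, `matmul`, `transpose`, `attn`) and the toy inputs are illustrative assumptions, not taken from any particular library. The softmax is applied to each row of QK^T, which is the usual convention, and the scaling factor is left out to match the equation as written.

```python
import math


def softmax(row):
    """Softmax over one list of floats, shifted by the max for stability."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def attn(q, k, v):
    """Unscaled attention: softmax(Q K^T) V, with softmax taken per row."""
    scores = matmul(q, transpose(k))            # one row of scores per query
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v)                   # weighted average of the rows of V


# Hypothetical toy inputs: 2 queries, 2 keys/values, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attn(Q, K, V)
```

Each output row is a weighted average of the rows of V, with weights given by the softmax of that query's scores against every key. A query that scores all keys equally therefore returns the plain mean of the rows of V.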