May 02, 2023

Transformer Design Formulas

The Transformer is a neural network architecture that has been widely used in natural language processing tasks such as language translation, language modeling, and text classification. Here are some important design formulas for the Transformer (minimal NumPy sketches of each component follow the list):
  1. Attention mechanism: The attention mechanism used in the Transformer can be defined as follows:
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
    Q, K, and V are matrices representing queries, keys, and values, respectively.
    d_k is the dimensionality of the keys (and queries); dividing by sqrt(d_k) keeps the dot products from growing too large before the softmax.

     
  2. Multi-head attention: The multi-head attention mechanism in the Transformer can be defined as follows: 
    MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
    head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
    W_i^Q, W_i^K, and W_i^V are the projection matrices for the i-th head, and
    W^O is the weight matrix that projects the concatenated heads back to the model dimension.

     
  3. Positional encoding: To incorporate the positional information of the input sequence, the Transformer uses positional encodings.
    The positional encoding function can be defined as follows:

    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos is the position of the token in the sequence
    i is the dimension index
    d_model is the dimensionality of the model.

     
  4. Feed-forward network: The feed-forward network used in the Transformer can be defined as follows:
    FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
    W_1 and W_2 are weight matrices, b_1 and b_2 are bias vectors, and max(0, ·) is the ReLU activation applied between the two linear transformations.

     
  5. Layer normalization: To normalize the output of each layer, the Transformer uses layer normalization.
    The layer normalization function can be defined as follows:

    LayerNorm(x) = (x - mean(x)) / sqrt(var(x) + epsilon) * gamma + beta
    mean(x) and var(x) are the mean and variance of x (computed over the feature dimension), epsilon is a small constant that avoids division by zero, and
    gamma and beta are learnable scale and shift parameters.
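
Below are minimal NumPy sketches of the five components above, in the same order. The function and variable names (softmax, attention, and so on) are illustrative choices rather than a reference implementation, and the shapes follow the single-sequence, batch-free case for clarity.

Scaled dot-product attention (formula 1), Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V:

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax: subtract the per-row max before exponentiating.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) scaled dot products
        weights = softmax(scores, axis=-1)   # each query's weights sum to 1
        return weights @ V                   # (n_q, d_v) weighted sum of the values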
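
Multi-head attention (formula 2), reusing the attention() function from the sketch above. Passing the per-head projections as lists of matrices is an assumed calling convention, not the only way to organize the weights:

    def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
        # W_q[i], W_k[i], W_v[i]: projection matrices for head i
        # W_o: output projection of shape (h * d_v, d_model)
        heads = [
            attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
            for i in range(len(W_q))
        ]
        # Concat(head_1, ..., head_h) W^O
        return np.concatenate(heads, axis=-1) @ W_o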
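
Sinusoidal positional encoding (formula 3). The sketch assumes an even d_model so the sine and cosine dimensions pair up exactly:

    def positional_encoding(seq_len, d_model):
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
        pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]    # (1, d_model / 2)
        angles = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)   # even dimensions
        pe[:, 1::2] = np.cos(angles)   # odd dimensions
        return pe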
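
Position-wise feed-forward network (formula 4): two linear layers with a ReLU in between, applied independently at every position:

    def feed_forward(x, W1, b1, W2, b2):
        # x: (seq_len, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model)
        hidden = np.maximum(0.0, x @ W1 + b1)   # max(0, xW_1 + b_1), i.e. ReLU
        return hidden @ W2 + b2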
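
Layer normalization (formula 5) over the feature dimension, with learnable gamma and beta. The default epsilon of 1e-5 below is an assumed value; implementations vary:

    def layer_norm(x, gamma, beta, eps=1e-5):
        # Normalize each position's feature vector to zero mean and unit variance,
        # then apply the learnable scale (gamma) and shift (beta).
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps) * gamma + beta

    # Toy usage: a 4-token sequence with d_model = 8.
    x = np.random.randn(4, 8) + positional_encoding(4, 8)
    y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))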