Publication-quality transformer block diagram with self-attention, cross-attention, and residual connections.
For NeurIPS / ICML / ICLR papers introducing or extending transformer-based architectures. Works well as Figure 1 of a methods section.
A decoder-only transformer architecture in the style of GPT. 12 stacked layers, each with masked multi-head self-attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm. Token embedding + positional encoding at the bottom; linear + softmax language modeling head on top. Show residual connections as curved arrows. Annotate hidden dim 768, num heads 12. Clean vector style, navy/teal palette.
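As a reference while checking the generated figure, here is a minimal PyTorch sketch of the layer stack this prompt describes (post-LN ordering, hidden dim 768, 12 heads, 12 layers). The vocabulary size, feed-forward width, and context length are illustrative assumptions, not part of the prompt.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):  # d_ff is an assumed width
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions.
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + a)            # residual connection + LayerNorm
        x = self.ln2(x + self.ff(x))   # residual connection + LayerNorm
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab=50257, d_model=768, n_layers=12, max_len=1024):  # assumed sizes
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)      # token embedding
        self.pos = nn.Embedding(max_len, d_model)    # learned positional encoding
        self.blocks = nn.ModuleList([DecoderBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab)        # linear LM head; softmax applied at loss time

    def forward(self, idx):
        x = self.tok(idx) + self.pos(torch.arange(idx.size(1), device=idx.device))
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)
```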
A Vision Transformer architecture. The input image is split into 16x16 patches; each patch is linearly projected and given a positional embedding, and a learnable [CLS] token is prepended to the sequence. The sequence is fed through 12 transformer encoder layers (multi-head self-attention + MLP + LayerNorm). The [CLS] token output goes to a classification head. Show the patch grid clearly on the left, the encoder stack in the center, and the MLP head on the right. Academic style, white background.
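Again for reference, a minimal PyTorch sketch of the data flow this prompt describes: 16x16 patch embedding, a prepended [CLS] token, positional embeddings, 12 encoder layers, and a classification head on the [CLS] output. The input resolution, class count, and MLP width below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, d_model=768, n_layers=12, n_heads=12, n_classes=1000):
        super().__init__()
        n_patches = (img // patch) ** 2
        # Patchify + linear projection in one strided convolution.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))              # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))  # positional embedding
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True, norm_first=True,        # pre-norm, as in ViT
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)                        # classification head

    def forward(self, x):                                # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, n_patches, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos        # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                        # classify from the [CLS] token

# Example: logits = TinyViT()(torch.randn(2, 3, 224, 224))  -> shape (2, 1000)
```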