
Transformer Encoder-Decoder Architecture

Publication-quality transformer block diagram with self-attention, cross-attention, and residual connections.

When to use this prompt

For NeurIPS / ICML / ICLR papers introducing or extending transformer-based architectures. Works well as Figure 1 of a methods section.

The prompt

A transformer encoder-decoder architecture for sequence-to-sequence modeling.

Layout: two vertical stacks side by side, encoder in the left column, decoder in the right column, connected by horizontal cross-attention arrows between them.

Encoder (6 stacked layers):
- Input embedding + positional encoding at the bottom
- Each layer contains: multi-head self-attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm
- Show residual (skip) arrows around each sub-layer with curved dashed lines

Decoder (6 stacked layers):
- Output embedding (shifted right) + positional encoding at the bottom
- Each layer contains: masked multi-head self-attention -> Add & LayerNorm -> cross-attention to encoder output -> Add & LayerNorm -> feed-forward -> Add & LayerNorm
- Linear + softmax head at the top

Style: clean academic vector style, minimal palette (navy blue, teal, light gray), thin borders on rounded boxes, monospace font for tensor shape annotations, white background. Follow NeurIPS figure conventions.
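
If you want to sanity-check the diagram against the model it will illustrate, the sketch below mirrors the sub-layer ordering in the prompt, written in PyTorch as an assumed framework; the layer widths are illustrative and not part of the prompt.

```python
# Minimal sketch of the post-norm "Add & LayerNorm" ordering described above.
# Sizes are illustrative assumptions, not required by the prompt.
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048

class EncoderLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # multi-head self-attention -> Add & LayerNorm (residual around the sub-layer)
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        # feed-forward -> Add & LayerNorm
        return self.norm2(x + self.ff(x))

class DecoderLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, memory, causal_mask):
        # masked multi-head self-attention -> Add & LayerNorm
        y = self.norm1(y + self.self_attn(y, y, y, attn_mask=causal_mask)[0])
        # cross-attention to the encoder output -> Add & LayerNorm
        y = self.norm2(y + self.cross_attn(y, memory, memory)[0])
        # feed-forward -> Add & LayerNorm
        return self.norm3(y + self.ff(y))

src = torch.randn(2, 10, d_model)                # (batch, source length, d_model)
tgt = torch.randn(2, 7, d_model)                 # (batch, target length, d_model)
mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)   # causal mask
out = DecoderLayer()(tgt, EncoderLayer()(src), mask)                # (2, 7, d_model)
```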

Variations

Decoder-only (GPT style)

A decoder-only transformer architecture in the style of GPT. 12 stacked layers, each with masked multi-head self-attention -> Add & LayerNorm -> feed-forward -> Add & LayerNorm. Token embedding + positional encoding at the bottom; linear + softmax language modeling head on top. Show residual connections as curved arrows. Annotate hidden dim 768, num heads 12. Clean vector style, navy/teal palette.
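
For cross-checking the annotated numbers, here is a rough PyTorch stand-in for the quoted configuration (12 layers, 12 heads, hidden dim 768), built from the stock encoder layer plus a causal mask; the vocabulary size and feed-forward width are assumptions, and this is a sketch rather than a faithful GPT reimplementation.

```python
# Stand-in for the labeled decoder-only configuration; vocab size is an assumption.
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab = 768, 12, 12, 50257

stack = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                               batch_first=True),
    num_layers=n_layers)
lm_head = nn.Linear(d_model, vocab)           # the "linear + softmax" head in the diagram

tokens = torch.randn(1, 16, d_model)          # already embedded + positionally encoded
causal = torch.triu(torch.full((16, 16), float("-inf")), diagonal=1)
logits = lm_head(stack(tokens, mask=causal))  # softmax is applied at sampling/loss time
```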

Vision Transformer (ViT)

A Vision Transformer architecture. The input image is split into 16x16 patches; each patch is linearly projected, positional embeddings are added, and a learnable [CLS] token is prepended to the sequence. The sequence is fed through 12 transformer encoder layers (multi-head self-attention + MLP + LayerNorm). The [CLS] token output goes to a classification head. Show the patch grid clearly on the left, the encoder stack in the center, and the MLP head on the right. Academic style, white background.
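
The patch-embedding pipeline can be sanity-checked with a short sketch along these lines; the image size, class count, and widths assume the common ViT-Base configuration rather than anything stated in the prompt.

```python
# Minimal sketch of the ViT front end: 16x16 patch projection, learnable [CLS] token,
# positional embeddings, then a standard encoder stack. Sizes assume ViT-Base.
import torch
import torch.nn as nn

img, patch, d_model, n_heads, n_layers = 224, 16, 768, 12, 12
n_patches = (img // patch) ** 2                          # 14 x 14 = 196 patches

proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)   # linear patch projection
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))            # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                               batch_first=True, norm_first=True),  # ViT uses pre-norm
    num_layers=n_layers)
head = nn.Linear(d_model, 1000)                          # classification head

x = torch.randn(2, 3, img, img)                          # a batch of RGB images
patches = proj(x).flatten(2).transpose(1, 2)             # (B, 196, d_model)
seq = torch.cat([cls_token.expand(2, -1, -1), patches], dim=1) + pos_embed
logits = head(encoder(seq)[:, 0])                        # [CLS] output -> classifier
```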

Tips

  • State the number of layers, heads, and hidden dim — generators reproduce these as labels.
  • Use the words "residual", "Add & LayerNorm", "cross-attention" exactly — they map to standard visual primitives.
  • Avoid mixing top-down and left-right flow in one prompt. Pick one and stay consistent.
