Stacked decoder blocks with masked self-attention and a language modeling head.
For LLM / fine-tuning / instruction-tuning papers introducing or modifying a decoder-only model.
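A minimal sketch of the described architecture, assuming PyTorch; the class and parameter names (DecoderBlock, TinyDecoderLM, d_model, n_heads) are illustrative, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked (causal) self-attention + feed-forward, each with a residual."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        T = x.size(1)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.ff(self.ln2(x))
        return x

class TinyDecoderLM(nn.Module):
    """Stacked decoder blocks followed by a language modeling head over the vocabulary."""
    def __init__(self, vocab_size: int, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # positional embeddings omitted for brevity
        self.blocks = nn.ModuleList(DecoderBlock(d_model, n_heads) for _ in range(n_layers))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len) int64
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)                 # logits: (batch, seq_len, vocab_size)

logits = TinyDecoderLM(vocab_size=1000)(torch.randint(0, 1000, (1, 8)))
```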
Same architecture, but annotate the KV-cache flow: highlight where keys and values are cached at each layer during autoregressive decoding, and add a side note showing how cache reuse skips re-computation for earlier positions.
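A minimal sketch of that KV-cache flow, assuming PyTorch and single-head attention for brevity; the names (CachedSelfAttention, caches) are illustrative. The comments mark where keys and values are cached at each layer and where cache reuse avoids re-projecting earlier positions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedSelfAttention(nn.Module):
    """Single-head masked self-attention that appends each new position's K/V to a per-layer cache."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x_new, cache):
        # x_new: (batch, 1, d_model) -- only the newest position is projected.
        q = self.q_proj(x_new)
        k = self.k_proj(x_new)
        v = self.v_proj(x_new)
        if cache is not None:
            # Cache reuse: K/V for earlier positions come from the cache,
            # so their projections are never re-computed.
            k = torch.cat([cache[0], k], dim=1)
            v = torch.cat([cache[1], v], dim=1)
        # Causality is implicit: the query is the newest position and the
        # cache holds only earlier positions, so no mask is needed here.
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_model), dim=-1)
        return attn @ v, (k, v)                 # output for the new position + updated cache

# One cache entry per layer; each decoding step feeds only the new token.
layers = [CachedSelfAttention(16) for _ in range(2)]
caches = [None] * len(layers)
for step in range(5):
    x = torch.randn(1, 1, 16)                  # embedding of the newly generated token
    for i, layer in enumerate(layers):
        x, caches[i] = layer(x, caches[i])     # keys and values cached at each layer
```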