Patch embedding, position encoding, transformer encoder stack and a classification head.
For computer-vision papers using transformer encoders for classification, segmentation or detection.
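When labeling the diagram, it can help to annotate each block with its output tensor shape. The sketch below is a rough illustration only: it traces shapes through the four blocks using plain arithmetic. The function name `vit_shapes` is hypothetical. The defaults (224-pixel input, 16-pixel patches, 768-dim embeddings, 1000 classes) and the prepended class token are assumptions in the style of a ViT-Base configuration, not values taken from this entry.

```python
def vit_shapes(image_size=224, patch_size=16, embed_dim=768, num_classes=1000):
    """Return (block name, output shape) pairs for a plain ViT classifier.

    All default values are illustrative assumptions, not fixed by the template.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    grid = image_size // patch_size   # patches along one side of the image
    num_patches = grid * grid         # non-overlapping square patches
    num_tokens = num_patches + 1      # assumption: one class token is prepended

    return [
        ("patch embedding", (num_patches, embed_dim)),
        ("class token + position encoding", (num_tokens, embed_dim)),
        ("transformer encoder stack", (num_tokens, embed_dim)),
        ("classification head", (num_classes,)),
    ]


for name, shape in vit_shapes():
    print(f"{name}: {shape}")
```

The encoder stack keeps the token count and width unchanged, which is why a plain ViT is usually drawn as a column of identical blocks.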
Same flow, but show a hierarchical Swin-style ViT with four stages of decreasing spatial resolution and increasing channel dimension. Add window-attention boxes inside each stage.
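To size the stage boxes consistently, the per-stage resolution and channel width can be worked out with simple arithmetic. This is a minimal sketch, assuming a Swin-Tiny-like setup: 224-pixel input, 4-pixel patches, a base width of 96 channels, 7x7 attention windows, and patch merging between stages that halves the resolution and doubles the channels. The function name `swin_stage_shapes` is hypothetical, and none of these numbers come from this entry.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96,
                      window_size=7, num_stages=4):
    """Return one dict per stage with resolution, channels and window count.

    All default values are illustrative assumptions (Swin-Tiny-like).
    """
    resolution = image_size // patch_size  # feature-map side after patch embedding
    channels = embed_dim
    stages = []

    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Patch merging between stages: halve each side, double the channels.
            resolution //= 2
            channels *= 2
        if resolution % window_size != 0:
            raise ValueError("feature map must tile evenly into windows")
        windows = (resolution // window_size) ** 2  # non-overlapping windows
        stages.append({
            "stage": stage,
            "resolution": (resolution, resolution),
            "channels": channels,
            "windows": windows,
        })
    return stages


for s in swin_stage_shapes():
    print(s)
```

Under these assumptions the number of attention windows shrinks from stage to stage while the window size stays fixed, so the window-attention boxes can be drawn at the same size in every stage with fewer of them in later stages.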