Per-modality encoders, projection to a shared embedding space, a fusion module, and a downstream classifier.
Intended for multimodal classification papers (hate speech, medical, retrieval, etc.).
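A minimal sketch of this pipeline, assuming PyTorch and pooled feature vectors from off-the-shelf encoders. The dimensions (`img_dim`, `txt_dim`, `shared_dim`) and the single cross-attention block used as the fusion module are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, shared_dim=512, num_classes=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        # Fusion module: one cross-attention block (text attends to image).
        self.fusion = nn.MultiheadAttention(shared_dim, num_heads=8, batch_first=True)
        # Downstream classifier head.
        self.classifier = nn.Sequential(
            nn.Linear(shared_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, img_dim), txt_feat: (B, txt_dim) pooled encoder outputs.
        z_img = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, shared_dim)
        z_txt = self.txt_proj(txt_feat).unsqueeze(1)   # (B, 1, shared_dim)
        fused, _ = self.fusion(query=z_txt, key=z_img, value=z_img)
        return self.classifier(fused.squeeze(1))       # (B, num_classes)
```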
Replace the cross-attention fusion with simple concatenation of image and text embeddings followed by an MLP. Note that this is a "late-fusion" baseline for comparison.
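A sketch of that late-fusion baseline, under the same assumptions as above: the projected image and text embeddings are concatenated and passed through a small MLP in place of the cross-attention fusion. The hidden width is an arbitrary choice.

```python
import torch
import torch.nn as nn

class ConcatFusionBaseline(nn.Module):
    def __init__(self, shared_dim=512, hidden_dim=512, num_classes=2):
        super().__init__()
        # Late fusion: concatenate modality embeddings, then classify with an MLP.
        self.mlp = nn.Sequential(
            nn.Linear(2 * shared_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, z_img, z_txt):
        # z_img, z_txt: (B, shared_dim) projected embeddings.
        return self.mlp(torch.cat([z_img, z_txt], dim=-1))
```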
Add a contrastive alignment loss between image and text embeddings before fusion (CLIP-style InfoNCE). Show this as an auxiliary loss arrow alongside the classification loss.
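One way the auxiliary objective could look: a symmetric CLIP-style InfoNCE loss on the projected, pre-fusion embeddings, added to the classification loss with a weight. The temperature and the weighting scheme here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def clip_infonce(z_img, z_txt, temperature=0.07):
    # Normalize so the dot product is a cosine similarity.
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0), device=z_img.device)
    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Total objective (lam is an assumed auxiliary-loss weight):
# total_loss = F.cross_entropy(class_logits, labels) + lam * clip_infonce(z_img, z_txt)
```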