The standard RAG pipeline has four stages: query embedding, vector retrieval, prompt augmentation, and LLM response generation.
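A minimal sketch of those four stages, assuming a toy bag-of-words "embedding" and a stubbed LLM in place of real models (all function names here are hypothetical):

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=2):
    # Rank stored chunks by similarity to the query embedding, keep top-k.
    scored = sorted(index, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in scored[:k]]

def answer(query, index, llm):
    # 1) embed the query, 2) retrieve chunks, 3) augment the prompt, 4) generate.
    chunks = retrieve(embed(query), index)
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    return llm(prompt)

corpus = ["the capital of France is Paris", "Go compiles to native code"]
index = [{"text": t, "vec": embed(t)} for t in corpus]
# The lambda stands in for an LLM call; it just echoes the augmented prompt.
result = answer("what is the capital of France", index, llm=lambda p: p)
```

The key structural point is that the LLM never sees the index directly; it only sees whatever the retriever placed into the prompt.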
For RAG / question-answering / knowledge-grounded generation papers and engineering blog posts.
Insert a re-ranking stage between vector retrieval and prompt construction. The re-ranker (a cross-encoder) scores each retrieved chunk jointly with the query, reorders the chunks, and keeps only the top-k′.
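A sketch of that re-ranking stage. A real cross-encoder (e.g. a BERT-style model) would score the concatenated (query, chunk) pair; here token overlap stands in for the model score, and `k_prime` is the top-k′ cutoff from the text:

```python
def cross_encoder_score(query, chunk):
    # Stand-in for a cross-encoder score: fraction of query tokens in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query, retrieved, k_prime=2):
    # Re-score every retrieved chunk against the query, reorder, keep top-k'.
    scored = sorted(retrieved,
                    key=lambda ch: cross_encoder_score(query, ch),
                    reverse=True)
    return scored[:k_prime]

candidates = [
    "cats are popular pets",
    "Paris is the capital of France",
    "a short note about European geography",
]
top = rerank("capital of France", candidates, k_prime=1)
```

The design rationale: the first-stage retriever is cheap but scores query and chunk independently, while the cross-encoder is expensive but attends over the pair, so running it only on the small retrieved set buys precision without scanning the whole corpus.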
Replace the single vector retriever with two parallel retrievers, sparse BM25 retrieval and dense embedding retrieval, and merge their result lists via reciprocal rank fusion (RRF) before prompt construction.
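A sketch of the fusion step, assuming each retriever has already produced a ranked list of document IDs. RRF scores each document as a sum of 1/(k + rank) over the rankings it appears in; k = 60 is the constant from the original RRF paper, not a tuned value:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # score(d) = sum over rankers of 1 / (k + rank_i(d)), with ranks starting at 1.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d3", "d2"]    # from the sparse (BM25) retriever
dense_ranking = ["d2", "d1", "d4"]   # from the dense embedding retriever
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.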