A closed-loop evaluation harness that runs an LLM agent against benchmark problems, handling deployment, scoring, and reporting.
For LLM benchmarking, agent evaluation, and AIOps-style automated test harnesses.
Add a "Safety Filter" node between Run Episode and Score & Log that screens each agent action for unsafe operations (e.g., destructive shell commands) before it executes. When an unsafe action is detected, it is logged and the episode is terminated.
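A minimal sketch of what such a filter node could look like, assuming agent actions arrive as dicts with `type` and `command` fields (the field names, the `safety_filter` function, and the pattern list are all illustrative assumptions, not part of the harness):

```python
import re

# Hypothetical denylist of destructive shell patterns; extend per deployment.
UNSAFE_PATTERNS = [
    r"\brm\s+-[a-z]*r[a-z]*f",    # recursive force delete, e.g. rm -rf /
    r"\bmkfs(\.\w+)?\b",          # filesystem formatting
    r"\bdd\s+.*\bof=/dev/",       # raw writes to block devices
    r":\(\)\s*\{.*\};\s*:",       # classic fork bomb
    r"\bshutdown\b|\breboot\b",   # host power control
]

def is_unsafe(command: str) -> bool:
    """Return True if the shell command matches any destructive pattern."""
    return any(re.search(p, command) for p in UNSAFE_PATTERNS)

def safety_filter(action: dict) -> dict:
    """Screen one agent action before execution.

    Returns a verdict dict the harness can use to log the violation
    and terminate the episode instead of executing the action.
    """
    if action.get("type") == "shell" and is_unsafe(action.get("command", "")):
        return {"allowed": False,
                "reason": "destructive shell command",
                "action": action}
    return {"allowed": True, "action": action}
```

In the workflow, Run Episode would route each proposed action through this node; a `{"allowed": False, ...}` verdict branches to Score & Log with a termination record instead of executing the command. A pattern denylist is only a first line of defense; sandboxed execution remains the stronger guarantee.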