
LLM Evaluation Framework

A closed-loop evaluation harness that runs an LLM agent against benchmark problems, covering environment deployment, scoring and reporting.

When to use this prompt

For LLM benchmarking, agent evaluation and AIOps-style automated test harnesses.

The prompt

An LLM evaluation framework drawn as a closed-loop system with a central Orchestrator.

Center — Evaluation Orchestrator (drawn as a hub):
- Manages the full experiment lifecycle.

Right side — LLM Agent under test:
- Receives problem specifications and produces actions/answers.

Surrounding the Orchestrator (clockwise from top), four phases:
1. Register Agent — the agent registers itself with the orchestrator before the experiment.
2. Initialise Problem — orchestrator deploys a fresh sandboxed environment for one benchmark instance.
3. Run Episode — orchestrator forwards the problem to the agent, collects the agent's action sequence, and applies it to the environment.
4. Score & Log — orchestrator evaluates outcomes against ground truth and logs results to a metrics store.

A "next problem" arrow loops back from Score & Log to Initialise Problem.

Outside the loop, on the left: a benchmark database (cylinder) feeding the orchestrator with problem specifications.
Style: clean academic vector, navy / amber palette, white background, suitable for systems-AI conferences.
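
Before generating the figure, it can help to see the loop as code. The sketch below is a minimal Python rendering of the four phases and the "next problem" arrow; the agent, benchmark database, metrics store, problem spec and environment interfaces are hypothetical placeholders, not any particular framework's API.

```python
# Minimal sketch of the loop described in the prompt. All collaborating
# objects (agent, benchmark_db, metrics_store, spec, env) are assumptions
# made for illustration, not a specific framework's API.
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    problem_id: str
    score: float
    actions: list = field(default_factory=list)


class Orchestrator:
    """Hub of the figure: owns the full experiment lifecycle."""

    def __init__(self, benchmark_db, metrics_store):
        self.benchmark_db = benchmark_db    # cylinder on the left of the figure
        self.metrics_store = metrics_store  # sink for Score & Log
        self.agent = None

    def register(self, agent):
        # Phase 1: Register Agent
        self.agent = agent

    def run_experiment(self):
        for spec in self.benchmark_db.problems():         # "next problem" loop
            env = spec.deploy_sandbox()                    # Phase 2: Initialise Problem
            actions = self.agent.solve(spec.statement())   # Phase 3: Run Episode
            outcome = env.apply(actions)
            score = spec.score(outcome)                    # Phase 4: Score & Log
            self.metrics_store.log(EpisodeResult(spec.problem_id, score, actions))
            env.teardown()
```

Each numbered phase in the prompt maps to one step in run_experiment, which is what makes the hub-and-spokes layout easy to label.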

Variations

With safety guardrails

Add a "Safety Filter" node between Run Episode and Score & Log that screens agent actions for unsafe operations (e.g., destructive shell commands) before they execute. Unsafe actions are logged and the episode is terminated.
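
If it helps to pin down what the Safety Filter node actually does before drawing it, here is a small Python sketch of the screening step. The deny-list patterns and the log-then-terminate behaviour are illustrative assumptions matching the variation text, not a prescribed policy.

```python
import re

# Illustrative deny-list of destructive shell commands (an assumption for the
# sketch; a real harness would carry a richer policy).
UNSAFE_PATTERNS = [
    r"\brm\s+-rf\b",  # recursive deletes
    r"\bmkfs\b",      # reformatting filesystems
    r"\bdd\s+if=",    # raw disk writes
]


class UnsafeActionError(Exception):
    pass


def safety_filter(action: str) -> str:
    """Return the action unchanged if it looks safe, otherwise raise."""
    for pattern in UNSAFE_PATTERNS:
        if re.search(pattern, action):
            raise UnsafeActionError(f"blocked unsafe action: {action!r}")
    return action


def screen(actions, on_violation):
    """Screen a whole action sequence before it reaches the environment."""
    safe = []
    for action in actions:
        try:
            safe.append(safety_filter(action))
        except UnsafeActionError as err:
            on_violation(str(err))  # unsafe action is logged...
            return None             # ...and the episode is terminated
    return safe
```

In the loop sketch above, this screening would run just before env.apply(actions): a blocked action is logged to the metrics store and the episode ends, as the variation describes.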

Tips

  • Number the phases — readers expect phase ordering in evaluation harnesses.
  • Place the benchmark database outside the loop. Mixing it into the loop muddles the figure.
  • Show a metrics store explicitly. Without one, readers can't see where the evaluation results end up.
