A closed-loop evaluation harness that runs an LLM agent against benchmark problems, handling deployment, scoring, and reporting.
For LLM benchmarking, agent evaluation, and AIOps-style automated test harnesses.
Add a "Safety Filter" node between Run Episode and Score & Log that screens each agent action for unsafe operations (e.g., destructive shell commands) before it executes. When an unsafe action is detected, it is logged and the episode is terminated.
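A minimal sketch of what such a filter node could look like, assuming agent actions arrive as dicts with `type` and `command` fields (the field names, the `safety_filter` function, and the pattern list are all illustrative assumptions, not part of the harness):

```python
import re

# Hypothetical denylist of destructive shell patterns; extend per deployment.
UNSAFE_PATTERNS = [
    r"\brm\s+-[a-z]*r[a-z]*f",    # recursive force delete, e.g. rm -rf /
    r"\bmkfs(\.\w+)?\b",          # filesystem formatting
    r"\bdd\s+.*\bof=/dev/",       # raw writes to block devices
    r":\(\)\s*\{.*\};\s*:",       # classic fork bomb
    r"\bshutdown\b|\breboot\b",   # host power control
]

def is_unsafe(command: str) -> bool:
    """Return True if the shell command matches any destructive pattern."""
    return any(re.search(p, command) for p in UNSAFE_PATTERNS)

def safety_filter(action: dict) -> dict:
    """Screen one agent action before execution.

    Returns a verdict dict the harness can use to log the violation
    and terminate the episode instead of executing the action.
    """
    if action.get("type") == "shell" and is_unsafe(action.get("command", "")):
        return {"allowed": False,
                "reason": "destructive shell command",
                "action": action}
    return {"allowed": True, "action": action}
```

In the workflow, Run Episode would route each proposed action through this node; a `{"allowed": False, ...}` verdict branches to Score & Log with a termination record instead of executing the command. A pattern denylist is only a first line of defense; sandboxed execution remains the stronger guarantee.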