Evals

Datasets

A dataset is a saved set of agent steps you want to keep an eye on, plus the checks each one has to pass. It's how you turn a bug you fixed once into a test that catches it if it ever comes back.

1. Add a step. Open a run, pick an LLM step, and click + Add to dataset on its Replay tab.
2. Set the checks. Choose what the step's output must always do, e.g. must not error, must call refund(), or satisfy a plain-English rubric.
3. Run an eval. Runback replays every step and scores it, so a regression shows up as a red row before it ships.
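
The three steps above can be pictured as a small loop: saved steps, a list of checks per step, and a replay that scores each one. The sketch below is illustrative only and is not Runback's API. StepOutput, must_not_error, must_call, run_eval, and the fake replay function are all hypothetical names invented for this example. The plain-English rubric check is left out because it would need a model-based judge.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StepOutput:
    """Hypothetical shape of a replayed LLM step's output (not Runback's schema)."""
    text: str = ""
    tool_calls: List[str] = field(default_factory=list)
    error: Optional[str] = None


# A check takes a step output and returns True when the output passes.
Check = Callable[[StepOutput], bool]


def must_not_error(out: StepOutput) -> bool:
    """Pass when the step finished without an error."""
    return out.error is None


def must_call(tool: str) -> Check:
    """Build a check that passes when the step called the named tool."""
    def check(out: StepOutput) -> bool:
        return tool in out.tool_calls
    return check


def run_eval(
    dataset: Dict[str, List[Check]],
    replay: Callable[[str], StepOutput],
) -> Dict[str, bool]:
    """Replay each saved step once; a step passes only if every check passes."""
    results = {}
    for step_id, checks in dataset.items():
        out = replay(step_id)
        results[step_id] = all(check(out) for check in checks)
    return results


# Stand-in for a real replay: canned outputs keyed by step id (assumed data).
fake_outputs = {
    "step-1": StepOutput(text="Refund issued.", tool_calls=["refund"]),
    "step-2": StepOutput(text="Sorry, I can't help with that.", tool_calls=[]),
}

dataset = {
    "step-1": [must_not_error, must_call("refund")],
    "step-2": [must_not_error, must_call("refund")],
}

results = run_eval(dataset, fake_outputs.__getitem__)
for step_id, passed in results.items():
    print(step_id, "pass" if passed else "FAIL")
```

In this sketch the second step fails because it never calls refund, which corresponds to the red row an eval would surface.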

Refund-flow policy guardrails
The checks the support agent must keep passing before any release.
4 items