Evals

Evals

1

An eval re-runs every step in a dataset and scores it against your checks. Green means your change is safe; a red row is a regression you caught before it shipped. Start one from any dataset's Run eval button.

StatusEvalDatasetModelPassStarted
donePre-release · refund-flowRefund-flow policy guardrailscaptured
3/4
31m ago