How it works

From a black-box agent to a governed one.

Runback captures everything an AI agent does, lets you reproduce any decision, gate releases against policy, and keep an audit-ready record. Here's the whole lifecycle — in plain language, no prior knowledge assumed.

First, the problem

Why you can't govern an agent with logs.

A normal program follows fixed instructions. An AI agent is different: you give it a goal and it decides for itself how to get there — searching, reading, calling tools, sending messages — one step at a time, choosing each move from what it has learned so far.

That autonomy is the point, and the problem. When it does something wrong, there's no single line of code to blame — the mistake is buried in a chain of decisions it made on its own. Logs show what happened, never why. To govern an agent you need to see the decision, reproduce it, and prove it. That's the lifecycle below.

01 · Observe

See exactly what the model saw.

Every step records the precise context the model was handed — system prompt, conversation, tools — captured at the model boundary, with PII redacted in-process. A failed run opens on the step that broke. Click any step:

research-email-agent · run failedclick any step ↓

Tool call · send_email — failed here

tool: send_email
status: error

input

{ "to": "alex[at]acme.co", "body": "Next.js 16 summary…" }

Invalid recipient — the model wrote the address in plain text ("[at]") instead of a valid email. The run failed here.

02 · Replay

Reproduce any decision.

Re-run a single step exactly as it happened — or change the prompt or model and compare. An investigation takes minutes, not a war room.

Original · gpt-4opass

Escalated the disputed charge to a specialist.

→ escalate({ reason: "dispute" })

Replay · llama-3.3regression

Refunded the full $250 — broke policy.

→ issue_refund({ amount: 250 })

03 · Gate

Turn a fix into a guardrail.

Save the step to a dataset with the checks it must keep passing, then run it as an eval before release. A policy breach shows up as a red row — not a production incident.

✓refund within $100 limitpass
✗disputed charge → must escalatefail
✓out-of-scope → declinepass
✓stays in policy tonepass

3 / 4 passed — 1 regression caught before deploy

04 · Audit

Prove it, after the fact.

Export a complete, tamper-evident record of any run — every event in a SHA-256 hash chain, signed. Change one byte and verification fails. The artifact an auditor actually asks for.

$schemarunback.audit/v1

run_idsupport-refund-agent

events6 · chained

content_digest3e68cdbbf372df98…

signatureHMAC-SHA256 ✓ signed

Recompute the chain to verify — POST to /api/audit/verify.

Connect your agent

About three lines, any agent.

Use the SDK for the deepest capture, or point existing OpenTelemetry traces at Runback — it sits above whatever framework you build in.

import { withDebugger } from "@runback/sdk";
import { generateText, stepCountIs } from "ai";

const dbg = withDebugger(model, { runName: "support-agent", redact: "standard" });
const res = await generateText({
  model: dbg.model,
  tools: dbg.tools(myTools),
  stopWhen: stepCountIs(8),
  prompt: task,
});
await dbg.finish({ output: res.text, status: "success" });

Every run now shows up ready to observe, replay, gate, and audit. See all integrations →