Governance
The gateway emits three kinds of governance signal for every request:
- Audit events — structured records of who did what, when, with what outcome.
- Prometheus metrics — aggregated counters for auth, policy, credentials, runs, tool calls, and LLM tokens.
- Structured logs — JSON-formatted application logs with secret scrubbing and correlation IDs.
Together they cover compliance (immutable audit trail), operational debugging (metrics + logs), and security detection (failed auth, policy denials, access violations) from one coherent source.
The Audit Subsystem
audit/ is a pluggable subsystem that sits alongside auth/, enforcement/, and credentials/. Every non-exempt request and every significant internal action produces an AuditEvent that flows through a single AuditRouter out to one or more AuditSink implementations in parallel.
emit_audit_event(action=..., outcome=..., ...)
        │
        ▼
 ┌─────────────┐
 │ AuditRouter │   one queue + worker per sink
 └──┬──┬──┬──┬─┘
    │  │  │  │
    ▼  ▼  ▼  ▼
stdout_json  file  redis_stream  observability
Design properties:
- Non-blocking emit. router.emit() is a synchronous put_nowait per sink. The request path pays microseconds.
- Per-sink failure isolation. A slow or broken sink doesn't hold up others or the response.
- Backpressure. Queue full drops the event for that sink and increments gateway_audit_events_dropped_total. Per-sink timeouts bound each emit() call.
- Graceful shutdown. The FastAPI lifespan drains every queue and calls sink.close() on teardown.
- Best-effort delivery. At-most-once semantics. Use the redis_stream sink when durability matters.
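The design properties above can be sketched with plain asyncio. This is a minimal illustration, not the gateway's actual AuditRouter: the class name SketchRouter, its constructor parameters, and the in-memory dropped counter (standing in for gateway_audit_events_dropped_total) are all hypothetical.

```python
import asyncio
from collections import Counter


class SketchRouter:
    """Hypothetical stand-in for AuditRouter: one bounded queue and one worker per sink."""

    def __init__(self, sinks, maxsize=1000, sink_timeout=1.0):
        self.sinks = sinks  # sink name -> async callable(event)
        self.queues = {name: asyncio.Queue(maxsize=maxsize) for name in sinks}
        self.dropped = Counter()  # stands in for gateway_audit_events_dropped_total
        self.sink_timeout = sink_timeout
        self._workers = []

    def start(self):
        self._workers = [asyncio.create_task(self._run(name)) for name in self.sinks]

    def emit(self, event):
        # Synchronous and non-blocking: a full queue drops the event for that sink only.
        for name, queue in self.queues.items():
            try:
                queue.put_nowait(event)
            except asyncio.QueueFull:
                self.dropped[name] += 1

    async def _run(self, name):
        queue, sink = self.queues[name], self.sinks[name]
        while True:
            event = await queue.get()
            try:
                # The timeout bounds how long one sink call may take.
                await asyncio.wait_for(sink(event), timeout=self.sink_timeout)
            except Exception:
                pass  # a slow or failing sink never affects the other sinks
            finally:
                queue.task_done()

    async def close(self):
        # Drain every queue, then stop the workers.
        await asyncio.gather(*(q.join() for q in self.queues.values()))
        for task in self._workers:
            task.cancel()
        await asyncio.gather(*self._workers, return_exceptions=True)


async def demo():
    received = []

    async def good_sink(event):
        received.append(event)

    async def broken_sink(event):
        raise RuntimeError("sink down")

    router = SketchRouter({"good": good_sink, "broken": broken_sink}, maxsize=2)
    # Workers are not running yet, so the third emit overflows both queues.
    for n in range(3):
        router.emit({"action": "http.request", "n": n})
    router.start()
    await router.close()
    return received, dict(router.dropped)


received, dropped = asyncio.run(demo())
```

In the demo, the broken sink raises on every event, yet the good sink still receives the two events that fit in its queue, and each sink records exactly one drop.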
See Audit API Reference for the AuditEvent schema and the full action taxonomy.
Signals at a Glance
Every primitive emits structured audit events for its mutations.
Read-only operations are covered by the universal provider.call emit
from MetricsProxy. Subsystems (auth, policy, credentials) emit at the
middleware layer.
Cross-cutting:
| Layer | Signal | Where it fires |
|---|---|---|
| HTTP | http.request | AuditMiddleware wraps every non-exempt request |
| Auth | auth.success / auth.failure / auth.logout | AuthenticationMiddleware |
| Policy | policy.allow / policy.deny / policy.load | PolicyEnforcementMiddleware + CedarPolicyEnforcer.load_policies() |
| Credentials | credential.resolve / credential.read / credential.write / credential.delete | CredentialResolutionMiddleware + routes/{credentials,identity}.py |
| Ownership | resource.access.denied | auth/access.py::require_access / require_owner_or_admin |
| Network | network.access.denied | gateway egress guards (URL-shaped credential rejection) |
| Primitive (universal) | provider.call | MetricsProxy: every wrapped primitive method (coroutine + async generator) |
Per-primitive CRUD + lifecycle:
| Primitive | Resource type(s) | Actions | Where it fires |
|---|---|---|---|
| Agents | agent | agent.{create,update,delete}, agent.run.{start,complete,failed,cancelled}, agent.delegate, agent.version.{create,propose,approve,reject,deploy}, agent.fork | routes/agents.py, AgentRunner, agents/tools/delegation.py |
| Teams | team | team.{create,update,delete}, team.run.{start,complete,failed,cancelled}, team.version.{create,propose,approve,reject,deploy}, team.fork | routes/teams.py, TeamRunner |
| Policy | policy, policy_engine | policy.{create,update,delete}, policy.load | routes/policy.py, CedarPolicyEnforcer |
| Credentials | credential, user | credential.{resolve,read,write,delete} | credentials/middleware.py, routes/credentials.py, routes/identity.py |
| Identity | identity | identity.credential_provider.{create,update,delete}, identity.workload.{create,update,delete} | routes/identity.py |
| Memory | memory | memory.resource.{create,delete}, memory.strategy.{create,delete}, memory.branch.create, memory.event.{append,delete}, memory.record.{write,delete} | routes/memory.py |
| Tools | tool | tool.{register,delete}, tool.server.register, tool.call (per execution) | routes/tools.py, agents/tools/catalog.execute_tool |
| LLM | llm | llm.generate (with input/output tokens) | LLMProvider.__init_subclass__ (automatic on every backend) |
| Evaluations | evaluator | evaluator.{create,update,delete}, evaluator.score.{create,delete}, evaluator.online_config.{create,delete} | routes/evaluations.py |
| Observability | trace | observability.trace.{ingest,update}, observability.trace.generation.log, observability.trace.score.create, observability.log.ingest, observability.flush | routes/observability.py |
| Browser | session, page | session.{create,terminate} (with metadata.primitive=browser), browser.{navigate,click,type,evaluate} | routes/browser.py |
| Code-interpreter | session, code_execution, file | session.{create,terminate} (with metadata.primitive=code_interpreter), code_interpreter.execute, code_interpreter.file.{upload,download} | routes/code_interpreter.py |
| Tasks | task | task.{create,claim,update,note} | agents/tools/handlers.py (when invoked via agent tool calls) |
Correlation IDs
Every request carries two identifiers:
- request_id: unique per HTTP request. Generated if not supplied via X-Request-Id. Returned on x-request-id.
- correlation_id: threaded across sub-agent calls, background runs, and checkpoint resumes. Generated from the request ID if not supplied via X-Correlation-Id. Returned on x-correlation-id.
Both IDs are automatically attached to every audit event, every JSON log line, and response headers — so a single correlation_id query ties together the HTTP request, its auth outcome, its policy decision, the agent run it triggered, every sub-agent it delegated to, every tool call, and every LLM request.
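The header handling described above can be sketched as follows. The function names resolve_ids and response_headers are hypothetical, and the sketch assumes that "generated from the request ID" means reusing the request ID verbatim. The gateway may derive the correlation ID differently.

```python
import uuid


def resolve_ids(headers):
    """Return (request_id, correlation_id) for an incoming request's headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    request_id = lowered.get("x-request-id") or uuid.uuid4().hex
    # Assumption: a missing correlation ID falls back to the request ID itself.
    correlation_id = lowered.get("x-correlation-id") or request_id
    return request_id, correlation_id


def response_headers(request_id, correlation_id):
    """Echo both IDs back to the caller on the response."""
    return {"x-request-id": request_id, "x-correlation-id": correlation_id}


# A sub-agent call forwards the parent's correlation ID but gets its own request ID.
parent_rid, parent_cid = resolve_ids({"X-Request-Id": "req-1"})
child_rid, child_cid = resolve_ids({"X-Correlation-Id": parent_cid})
```

Because the child request carries the parent's correlation ID, filtering audit events or log lines on that one value returns both requests.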
Middleware Order
CORS
→ RequestContextMiddleware (sets request_id + correlation_id)
→ AuditMiddleware (emits one http.request event per request)
→ AuthenticationMiddleware (auth.success / auth.failure)
→ CredentialResolutionMiddleware (credential.resolve)
→ PolicyEnforcementMiddleware (policy.allow / policy.deny)
→ route handler
AuditMiddleware wraps the auth + policy chain so the http.request event sees the final response status and the authenticated principal — and still runs inside RequestContextMiddleware so request/correlation IDs are populated before any emit.
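The effect of this nesting can be shown with a framework-free sketch. The helpers make_middleware and build_chain are hypothetical and only record entry and exit order. They do not reproduce the gateway's middleware classes or the framework's registration API.

```python
def make_middleware(name, trace):
    """Build a hypothetical middleware that records when it is entered and exited."""
    def middleware(next_handler):
        def handle(request):
            trace.append(f"enter {name}")
            response = next_handler(request)
            trace.append(f"exit {name}")
            return response
        return handle
    return middleware


def build_chain(names, route_handler, trace):
    # names are listed outermost first, matching the order shown above.
    handler = route_handler
    for name in reversed(names):
        handler = make_middleware(name, trace)(handler)
    return handler


ORDER = [
    "RequestContext",
    "Audit",
    "Authentication",
    "CredentialResolution",
    "PolicyEnforcement",
]

trace = []
app = build_chain(ORDER, lambda request: {"status": 200}, trace)
response = app({"path": "/agents"})
```

The trace shows RequestContext entering before Audit, and Audit exiting only after Authentication and PolicyEnforcement have finished. That ordering is why the http.request event can include both the IDs and the final outcome.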
Structured Logs
Application logs are separate from audit events. Set logging.format: json to get one JSON object per log line, enriched with:
- timestamp, level, logger, module, message
- request_id, correlation_id, principal_id, principal_type
LogSanitizationFilter is installed by default (logging.sanitize: true) and scrubs Bearer tokens, AWS access keys, JWTs, and apg.* key=value pairs from rendered messages before the formatter sees them. This protects against accidental secret leakage in exception tracebacks and debug logs.
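A scrubbing filter of this kind can be sketched with the standard logging module. The class name SketchSanitizer and the regular expressions are illustrative assumptions. The patterns used by the real LogSanitizationFilter are not documented here, and this sketch rewrites only the rendered message, not traceback text.

```python
import logging
import re

# Illustrative patterns only; the real filter may match differently.
_PATTERNS = [
    (re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "[REDACTED_JWT]"),
]


class SketchSanitizer(logging.Filter):
    """Hypothetical filter that scrubs secrets from the rendered log message."""

    def filter(self, record):
        message = record.getMessage()  # render %-style args into the final text
        for pattern, replacement in _PATTERNS:
            message = pattern.sub(replacement, message)
        # Store the scrubbed text so later formatters never see the raw args.
        record.msg, record.args = message, None
        return True  # keep the record; only its content is rewritten


record = logging.LogRecord(
    "demo", logging.INFO, "demo.py", 1,
    "auth header: %s key=%s", ("Bearer abc.def", "AKIAABCDEFGHIJKLMNOP"), None,
)
SketchSanitizer().filter(record)
scrubbed = record.getMessage()
```

Attaching such a filter to a handler with handler.addFilter(...) runs it before that handler's formatter, which matches the "before the formatter sees them" behavior described above.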
Prometheus Metrics
All metrics are exposed at GET /metrics. Cardinality is intentionally bounded: actor_id and resource_id are never labels — only bounded enums (outcome, decision, principal_type, kind) and configuration-bounded identifiers (agent_name, team_name, model, tool_name).
| Metric | Labels | Source |
|---|---|---|
| gateway_auth_events_total | backend, outcome, principal_type | Auth middleware |
| gateway_policy_decisions_total | decision, action_category | Policy middleware |
| gateway_credential_operations_total | op, service, outcome | Credential resolver + writer |
| gateway_agent_runs_total | agent_name, status | AgentRunner |
| gateway_team_runs_total | team_name, status | TeamRunner |
| gateway_tool_calls_total | tool_name, status | execute_tool |
| gateway_llm_requests_total | model, status | LLMProvider ABC |
| gateway_llm_tokens_total | model, kind (input/output/total) | LLMProvider ABC |
| gateway_access_denials_total | resource_type | require_access / require_owner_or_admin |
| gateway_audit_events_total | action_category, outcome | emit_audit_event |
| gateway_audit_sink_events_total | sink, outcome | Per-sink worker |
| gateway_audit_sink_queue_depth | sink | Per-sink gauge |
| gateway_audit_events_dropped_total | sink, reason | Router backpressure |
Wire Everything, Filter At Config
APG's default is emit-everything — every mutation, every primitive
call, every session lifecycle event — so the audit stream is the
authoritative record of what happened, not just what somebody
remembered to audit. High-volume deployments tame the stream at the
router with audit.filter:
- exclude_actions: exact action drops (provider.call).
- exclude_action_categories: category prefix drops (memory → every memory.*).
- sample_rates: per-action keep fraction in [0.0, 1.0].
Filtered events increment
gateway_audit_events_dropped_total{sink="__router__",reason="filtered"}
so operators can see how aggressive the filter is. Keep the
compliance-relevant events unfiltered — auth, policy, ownership,
version/fork, and credential events. See the Observability
Guide for recipes.
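The three filter keys can be read as a single keep-or-drop decision per event. The sketch below is an assumption about how they combine: the function name should_keep is hypothetical, the category is taken to be the first dot-separated segment of the action, and exclusions are checked before sampling. The router's real precedence may differ.

```python
import random


def should_keep(action, filter_cfg, rng=random.random):
    """Return True if an event with this action passes the (sketched) audit filter."""
    if action in filter_cfg.get("exclude_actions", ()):
        return False
    # Assumption: the category is the first dot-separated segment of the action.
    category = action.split(".", 1)[0]
    if category in filter_cfg.get("exclude_action_categories", ()):
        return False
    # Actions without an explicit rate are always kept.
    rate = filter_cfg.get("sample_rates", {}).get(action, 1.0)
    return rng() < rate


cfg = {
    "exclude_actions": ["provider.call"],
    "exclude_action_categories": ["memory"],
    "sample_rates": {"tool.call": 0.25},
}

kept_auth = should_keep("auth.failure", cfg)            # no rule applies, so kept
kept_provider = should_keep("provider.call", cfg)       # exact-match exclusion
kept_memory = should_keep("memory.record.write", cfg)   # category exclusion
```

With this configuration, roughly one in four tool.call events would survive sampling, while auth, policy, and credential events pass through untouched because no rule names them.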
Choosing Sinks
| Sink | Good for | Multi-replica |
|---|---|---|
| stdout_json (always-on) | k8s log shipping (Fluent Bit → Loki/Datadog/CloudWatch) | ✓ (each pod writes its own stream) |
| file | Single-node dev; sidecar tail to SIEM | One file per pod |
| redis_stream | Durable cross-replica audit bus; SIEM consumer via XREAD | ✓ |
| observability | Route audit into Langfuse / AgentCore trace explorer | ✓ |
See the Observability Guide for deployment recipes and the Compliance Guide for SOC2 / GDPR alignment.
See Also
- Audit API Reference: AuditEvent schema, action taxonomy, redaction rules
- Observability Guide: SIEM / Loki / Datadog / CloudWatch integration
- Compliance Guide: retention, PII, right-to-erasure
- Policy Enforcement: how policy.allow / policy.deny decisions are made