Observability Guide¶
Wire the gateway's governance signal into the observability stack you already run. This guide covers three deployment patterns: log-shipping from stdout, a durable audit bus on Redis, and routing audit events into a trace backend.
For the conceptual picture see Governance; for the event schema see the Audit API Reference.
Pattern 1: Stdout JSON → Log Shipper → SIEM¶
The default setup. StdoutJsonSink is always on unless explicitly disabled, so every deployment writes one JSON line per audit event to stdout. Any log collector can tail it.
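One event might look like this (illustrative only — the field names are drawn from the queries later in this guide, and the authoritative schema is in the Audit API Reference):
{"action": "tool.call", "outcome": "success", "actor_type": "user", "actor_id": "u-42", "request_id": "req-7f3a", "correlation_id": "c-9b1e", "http_path": "/api/v1/tools/call"}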
Server config:
audit:
stdout_json: true # default
logging:
format: json # application logs also go to JSON
sanitize: true # scrub secrets in log messages
Kubernetes + Fluent Bit → Loki:
# fluent-bit.conf (DaemonSet; reads pod stdout from /var/log/containers)
[INPUT]
Name tail
Path /var/log/containers/*apg*.log
Parser cri
Tag kube.apg.*
[FILTER]
Name parser
Match kube.apg.*
Key_Name log
Parser json
Reserve_Data On
[OUTPUT]
Name loki
Match kube.apg.*
host loki.monitoring
labels job=apg, action=$action, outcome=$outcome
label_keys $action,$outcome,$actor_type
Query audit events in Grafana — a LogQL sketch against the labels defined in the OUTPUT block above (adjust label names to your pipeline):
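# Recent policy denials (indexed label from the Fluent Bit output above)
{job="apg", action="policy.deny"}
# One request end-to-end, by correlation ID parsed from the JSON line
{job="apg"} | json | correlation_id = "c-9b1e"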
Kubernetes + Fluent Bit → Datadog:
[OUTPUT]
Name datadog
Match kube.apg.*
Host http-intake.logs.datadoghq.com
TLS on
apikey ${DATADOG_API_KEY}
dd_service apg
dd_source apg-audit
dd_tags env:prod
AWS ECS / Fargate → CloudWatch:
The awslogs driver forwards stdout to CloudWatch automatically. Query with CloudWatch Logs Insights:
fields @timestamp, action, outcome, actor_id, correlation_id, http_path
| filter action = "policy.deny"
| sort @timestamp desc
Pattern 2: Durable Cross-Replica Audit Bus on Redis¶
For multi-replica deployments where you need a single consumable stream across pods, with at-least-once delivery (via consumer groups) and short-term durability (a capped stream), without standing up an external logging tier.
Server config:
audit:
stdout_json: true # keep the k8s-native log story
sinks:
- name: bus
backend: redis_stream
config:
redis_url: "${REDIS_URL:=redis://localhost:6379/0}"
stream: "gateway:audit"
maxlen: 100000 # ring-buffer cap; old entries evicted on write
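To check that events are landing, inspect the stream directly:
# Entry count and stream metadata
redis-cli XLEN gateway:audit
redis-cli XINFO STREAM gateway:audit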
Consume with redis-cli for ad-hoc inspection:
# Latest 20 events
redis-cli XREVRANGE gateway:audit + - COUNT 20
# Tail new events as they arrive (BLOCK 0 waits; without it, $ returns nothing)
redis-cli XREAD BLOCK 0 COUNT 100 STREAMS gateway:audit $
Dedicated consumer group (Python):
import asyncio
import json

import redis.asyncio as redis

async def consume() -> None:
    r = redis.from_url("redis://localhost:6379/0", decode_responses=True)
    try:
        await r.xgroup_create("gateway:audit", "siem-shipper", id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists
    while True:
        entries = await r.xreadgroup(
            "siem-shipper", "worker-1",
            streams={"gateway:audit": ">"},
            count=100, block=5000,
        )
        for _stream, events in entries or []:
            for eid, fields in events:
                event = json.loads(fields["event"])
                await ship_to_siem(event)  # your delivery function
                await r.xack("gateway:audit", "siem-shipper", eid)

asyncio.run(consume())
Pair this with a stdout sink so operators can tail kubectl logs for quick checks while the durable consumer handles compliance delivery.
Pattern 3: Audit Events into Your Trace Backend¶
Route audit into the already-configured observability provider (Langfuse or AgentCore) so the audit stream appears in the same trace explorer your team uses for agent runs.
audit:
stdout_json: true
sinks:
- name: traces
backend: observability # delegates to registry.observability.ingest_log
No extra setup beyond configuring the observability primitive itself. Audit events arrive as log entries in the configured backend — Langfuse, or CloudWatch when using AgentCore — with request_id and correlation_id already attached.
Prometheus Metrics¶
The gateway exposes Prometheus metrics at GET /metrics — exempt from auth and policy, so scraping is always allowed. Add it to your ServiceMonitor or Prometheus scrape config:
scrape_configs:
- job_name: apg
metrics_path: /metrics
static_configs:
- targets: ['apg.monitoring:8000']
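If you run the Prometheus Operator, the equivalent ServiceMonitor — a sketch assuming a Service labeled app: apg with a port named http:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: apg
spec:
  selector:
    matchLabels:
      app: apg        # assumed Service label — match your own
  endpoints:
    - port: http      # assumed port name on the Service
      path: /metrics
      interval: 30s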
Useful Alerts¶
# Sudden spike in auth failures (potential brute force).
- alert: APGAuthFailureSpike
expr: |
sum(rate(gateway_auth_events_total{outcome="failure"}[5m])) > 5
for: 2m
annotations:
summary: "Authentication failures elevated on APG"
# Policy denials (possible misconfig or attack).
- alert: APGPolicyDenySpike
expr: |
sum(rate(gateway_policy_decisions_total{decision="deny"}[5m])) > 1
for: 5m
# Sink is dropping events — audit pipeline broken.
- alert: APGAuditDrops
expr: |
rate(gateway_audit_events_dropped_total[5m]) > 0
for: 2m
labels:
severity: warning
# Sink queue filling — backpressure building.
- alert: APGAuditQueueDepth
expr: gateway_audit_sink_queue_depth > 1500
for: 10m
Useful Dashboard Queries¶
# LLM token usage by model (input + output per second)
sum by (model, kind) (rate(gateway_llm_tokens_total[5m]))
# Agent run error rate
sum by (agent_name) (rate(gateway_agent_runs_total{status="failed"}[5m]))
/
sum by (agent_name) (rate(gateway_agent_runs_total{status="start"}[5m]))
# Top tools by call rate (for latency, use the provider duration metrics)
topk(10, sum by (tool_name) (rate(gateway_tool_calls_total[5m])))
# Policy decision mix
sum by (decision, action_category) (rate(gateway_policy_decisions_total[5m]))
Taming Audit Volume¶
APG emits audit events for every mutation and for every primitive call
(via the MetricsProxy), so high-traffic deployments can generate a lot
of events. The audit.filter config drops events at the router — before
any sink sees them — giving you a single knob to cut noise without
losing compliance-relevant signal.
audit:
enabled: true
stdout_json: true
filter:
# Drop the firehose — every provider call emits one, so it's
# typically the biggest source of volume.
exclude_actions:
- "provider.call"
# Or drop a whole category. Drops memory.record.write /
# memory.record.delete / memory.event.* / memory.branch.create etc.
exclude_action_categories:
- "memory"
# Or keep a statistical sample of the high-volume actions.
# Probabilistic per-event, independent sampling.
sample_rates:
tool.call: 0.1 # keep 10% of tool calls
llm.generate: 0.25 # keep 25% of LLM calls
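Sampling is decided per event, independently — conceptually equivalent to this sketch (not the gateway's actual implementation):
import random

def keep(action: str, sample_rates: dict[str, float]) -> bool:
    # Actions without a configured rate are always kept.
    rate = sample_rates.get(action, 1.0)
    return random.random() < rate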
Filtered events increment
gateway_audit_events_dropped_total{sink="__router__",reason="filtered"}
so you can monitor how aggressively the filter is trimming — for example:
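sum(rate(gateway_audit_events_dropped_total{sink="__router__",reason="filtered"}[5m]))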
Keep auth / policy / version / fork / credential events unfiltered — those are the compliance-critical ones.
Correlating Signals¶
Every audit event, log line, and response header carries request_id and correlation_id. Given a user complaint with a response header:
- Grep the audit stream for correlation_id to see the full chain of events for that request and any sub-agent calls it triggered.
- Grep application logs for the same correlation_id to see diagnostic output from middleware and providers.
- Use Prometheus to see aggregate behavior at the time of the request.
This is the intended workflow: one identifier, three lenses.
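For example (the deployment name and the ID itself are illustrative):
CID="c-9b1e"
# Lens 1: audit stream (redis_stream sink from Pattern 2)
redis-cli XREVRANGE gateway:audit + - COUNT 1000 | grep "$CID"
# Lens 2: application logs
kubectl logs deploy/apg | grep "$CID"
# Lens 3: Prometheus, scoped to the same time window in your dashboards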
Web UI audit viewer¶
When the redis_stream sink is configured (see Pattern 2 above), admins
get an in-browser audit viewer at /ui/audit:
- Live tail — SSE feed of new events as they're written, with pause/clear controls and a client-side ring buffer (1000 events max).
- Historical browse — paginated XREVRANGE over the stream, newest-first, with filters for action, outcome, actor, correlation ID, and resource type.
- Empty state — if the sink isn't configured yet, the page shows a ready-to-paste YAML snippet and a link back to this guide.
The page is gated by the admin scope on the principal (returned by
GET /api/v1/auth/whoami). Non-admins get a 403 panel; the nav link is
hidden entirely. Regular users have no visibility into the audit stream
via the UI — consumption by non-admins happens through the audit sinks
themselves (log shipper, SIEM, etc.) according to your retention policy.
See Also¶
- Governance — conceptual overview
- Audit API Reference — AuditEvent schema
- Compliance Guide — SOC2 / GDPR alignment and retention