Evaluations API¶

/api/v1/evaluations

Evaluator management, LLM-as-a-judge evaluation, and score recording. All endpoints require authentication.

Backends: NoopEvaluationsProvider, LangfuseEvaluationsProvider, AgentCoreEvaluationsProvider

Evaluator Management¶

Method	Path	Description
`POST`	`/evaluators`	Create an evaluator. Returns 201.
`GET`	`/evaluators`	List evaluators. Query params: `max_results` (default 100), `next_token`.
`GET`	`/evaluators/{evaluator_id}`	Get an evaluator.
`PUT`	`/evaluators/{evaluator_id}`	Update an evaluator.
`DELETE`	`/evaluators/{evaluator_id}`	Delete an evaluator. Returns 204.

curl -X POST http://localhost:8000/api/v1/evaluations/evaluators \
  -H "Content-Type: application/json" \
  -d '{
    "name": "helpfulness",
    "evaluator_type": "numeric",
    "config": {"min_value": 0.0, "max_value": 1.0},
    "description": "Rates response helpfulness"
  }'

Evaluator types vary by backend:

Backend	Evaluator types
AgentCore	`TRACE`, `TOOL_CALL`, `SESSION` (LLM-as-a-judge configs)
Langfuse	`numeric`, `boolean`, `categorical` (Score Config data types)
Noop	Any string (in-memory only)

Evaluate¶

Method	Path	Description
`POST`	`/evaluate`	Run an evaluation or record a score.

The behavior depends on the backend:

AgentCore: Runs LLM-as-a-judge server-side — sends input/output to AgentCore, which calls the LLM and returns computed scores.
Langfuse: Records a score against a trace. Pass metadata.value to set the score. For LLM-as-a-judge, configure evaluators in the Langfuse UI — Langfuse runs evaluations automatically.
Noop: Returns a placeholder score.

# AgentCore: LLM-as-a-judge (computes a score)
curl -X POST http://localhost:8000/api/v1/evaluations/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "evaluator_id": "eval-123",
    "input_data": "What is the capital of France?",
    "output_data": "The capital of France is Paris.",
    "expected_output": "Paris"
  }'

# Langfuse: Record a score against a trace
curl -X POST http://localhost:8000/api/v1/evaluations/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "evaluator_id": "helpfulness",
    "target": "trace-abc123",
    "output_data": "The capital of France is Paris.",
    "metadata": {"value": 0.95}
  }'

Request body:

Field	Type	Required	Description
`evaluator_id`	string	yes	ID of the evaluator (AgentCore) or score name (Langfuse).
`target`	string	no	Trace ID to associate the evaluation with.
`input_data`	string	no	The input/question.
`output_data`	string	no	The model's output.
`expected_output`	string	no	The expected/reference output.
`metadata`	object	no	Additional context. Langfuse: set `value` for the score.

Built-in evaluators (AgentCore): Builtin.Helpfulness, Builtin.Coherence, Builtin.Relevance, Builtin.Correctness

Scores¶

Record, retrieve, and manage pre-computed evaluation scores. Returns 501 if not supported by the configured provider. Only LangfuseEvaluationsProvider supports score CRUD.

Method	Path	Description
`POST`	`/scores`	Record a score. Returns 201 or 501.
`GET`	`/scores`	List scores. Query params: `trace_id`, `name`, `config_id`, `data_type`, `page`, `limit`. Returns 501 if not supported.
`GET`	`/scores/{score_id}`	Get a score by ID. Returns 404/501.
`DELETE`	`/scores/{score_id}`	Delete a score. Returns 204 or 501.

Record a score¶

curl -X POST http://localhost:8000/api/v1/evaluations/scores \
  -H "Content-Type: application/json" \
  -d '{
    "name": "accuracy",
    "value": 0.92,
    "trace_id": "trace-abc123",
    "comment": "Correct and concise",
    "data_type": "NUMERIC"
  }'

Request body:

Field	Type	Required	Description
`name`	string	yes	Score name (e.g., `accuracy`, `helpfulness`).
`value`	float or string	yes	Score value. Numeric for `NUMERIC`/`BOOLEAN`, string for `CATEGORICAL`.
`trace_id`	string	no	Trace ID to associate the score with.
`observation_id`	string	no	Observation ID within the trace.
`comment`	string	no	Explanation or notes.
`data_type`	string	no	`NUMERIC`, `BOOLEAN`, or `CATEGORICAL`.
`config_id`	string	no	Score config (evaluator) ID.
`metadata`	object	no	Key-value metadata.

List scores¶

# All scores for a trace
curl "http://localhost:8000/api/v1/evaluations/scores?trace_id=trace-abc123"

# Filter by name
curl "http://localhost:8000/api/v1/evaluations/scores?name=accuracy&limit=10"

Online Evaluation Configs (Optional)¶

Returns 501 if not supported. Only AgentCoreEvaluationsProvider supports online configs.

Method	Path	Description
`POST`	`/online-configs`	Create an online eval config. Returns 201 or 501.
`GET`	`/online-configs`	List online eval configs. Returns 501 if not supported.
`GET`	`/online-configs/{config_id}`	Get an online eval config. Returns 501 if not supported.
`DELETE`	`/online-configs/{config_id}`	Delete an online eval config. Returns 204 or 501.

Backend Support¶

Feature	Noop	Langfuse	AgentCore
Evaluator CRUD	yes (in-memory)	yes (Score Configs)	yes (LLM-as-a-judge configs)
Evaluate	yes (placeholder)	yes (record score)	yes (LLM-as-a-judge)
Score CRUD	no	yes	no
Online eval configs	no	no	yes