feat(evaluators): add standalone evaluator create/list/get/update/run SDK methods #908
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository settings.
📝 Walkthrough
Adds a standalone evaluator API to the SDK: create, list, get, update, and run.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Application
    participant EC as Evaluator
    participant TC as TraceloopClient
    participant API as Traceloop API
    rect rgba(100,150,255,0.5)
        Note over App,API: Create Evaluator
        App->>EC: create(options)
        EC->>EC: buildPayload (schema & snake_case)
        EC->>TC: post("/v2/evaluators", payload)
        TC->>API: POST /v2/evaluators (auth headers)
        API-->>TC: { evaluatorId, slug }
        TC-->>EC: response
        EC-->>App: { id, slug }
    end
    rect rgba(150,200,100,0.5)
        Note over App,API: List / Get Evaluator
        App->>EC: list(source?) / get(identifier)
        EC->>TC: get("/v2/evaluators" or "/v2/evaluators/:id")
        TC->>API: GET (auth)
        API-->>TC: { evaluators | evaluator }
        TC-->>EC: response
        EC->>EC: toEvaluatorData (convert schemas, validate llm config)
        EC-->>App: EvaluatorCatalogItem | EvaluatorData
    end
    rect rgba(200,150,100,0.5)
        Note over App,API: Run Execution
        App->>EC: run(identifier, { input })
        EC->>TC: post("/v2/evaluators/:id/executions", { input })
        TC->>API: POST execution (auth)
        API-->>TC: { executionId, result }
        TC-->>EC: response
        EC-->>App: { executionId, result }
    end
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Note: Unit test generation is a public access feature. Expect some limitations and changes as we gather feedback and continue to improve it. Generating unit tests... This may take up to 20 minutes.
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/traceloop-sdk/test/standalone-evaluators.test.ts (1)
333-361: Use raw API payload shapes in the response-mapping tests.
`BaseDatasetEntity.handleResponse()` normalizes successful JSON via `transformApiResponse()`, but these cases stub already-camelCased `inputSchema`/`outputSchema`/`enumValues`. That means the tests won't catch a regression in the actual wire-format mapping path.

🧪 Example fixture change

```diff
- inputSchema: [{ name: "text", type: "string" }],
- outputSchema: [
-   { name: "label", type: "enum", enumValues: ["good", "bad"] },
- ],
+ input_schema: [{ name: "text", type: "string" }],
+ output_schema: [
+   { name: "label", type: "enum", enum_values: ["good", "bad"] },
+ ],
```

Also applies to: 591-621
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/traceloop-sdk/test/standalone-evaluators.test.ts` around lines 333 - 361: the test stubs use already-camelCased fields (inputSchema/outputSchema/enumValues), so they bypass the API-to-model mapping in BaseDatasetEntity.handleResponse(), which uses transformApiResponse(). Update the test fixture in standalone-evaluators.test.ts to return the raw wire-format (snake_case) payload (e.g., input_schema, output_schema, enum_values) so the response-mapping path is exercised and the assertions still verify the post-mapping properties (inputSchema, outputSchema) on the result of client.evaluator.list().
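The normalization step this nitpick refers to can be sketched roughly as follows. This is a minimal stand-in, not the SDK's real `transformApiResponse()`, which may handle more cases; the point is only that a snake_case fixture exercises the key-conversion path while a camelCased one does not:

```typescript
// Minimal sketch of recursive snake_case -> camelCase response normalization.

function snakeToCamel(key: string): string {
  return key.replace(/_([a-z])/g, (_m, c: string) => c.toUpperCase());
}

function transform(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(transform);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        snakeToCamel(k),
        transform(v),
      ]),
    );
  }
  return value;
}

// A raw snake_case fixture maps onto the camelCase model shape:
const mapped = transform({
  input_schema: [{ name: "text", type: "string" }],
  output_schema: [{ name: "label", type: "enum", enum_values: ["good", "bad"] }],
}) as { inputSchema: unknown[]; outputSchema: Array<{ enumValues: string[] }> };

console.log(mapped.outputSchema[0].enumValues); // [ 'good', 'bad' ]
```

A test that stubs `inputSchema` directly never touches `snakeToCamel`, which is why the fixture change matters.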
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts`:
- Around line 64-66: The evaluator methods get, update, and run interpolate
arbitrary evaluator identifiers directly into the URL, which allows reserved
characters to break the path; fix this by wrapping the identifier with
encodeURIComponent(...) wherever it's used in the URL templates (e.g., change
this.client.get(`/v2/evaluators/${identifier}`) and the similar
this.client.put(...) and this.client.post(...) calls in Evaluator class methods
get, update, and run to use `/v2/evaluators/${encodeURIComponent(identifier)}`)
so the identifier is treated as a single path segment.
- Around line 205-220: The getter in evaluator.ts is stripping explicit zero
numeric LLM settings by only copying llmConfig fields into result when they are
non-zero; update the code that builds result (the block that checks
llmConfig.temperature, maxTokens, topP, frequencyPenalty, presencePenalty) to
copy values whenever the field is defined (!== undefined) regardless of being 0
so zero is preserved on read-after-write — locate the logic around llmConfig and
result in the get() path of Evaluator (evaluator.ts) and change the conditional
checks to test for undefined only.
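The two inline fixes above can be sketched together as follows. `evaluatorPath`, `copyLlmConfig`, and the `LlmConfig` shape are illustrative names chosen for this sketch, not the SDK's actual internals:

```typescript
// Sketch of the two suggested fixes (illustrative names, not SDK internals).

interface LlmConfig {
  temperature?: number;
  maxTokens?: number;
  topP?: number;
}

// Fix 1: validate the identifier, then encode it so reserved characters
// (e.g. "/" in a slug) stay inside a single path segment.
function evaluatorPath(identifier: string): string {
  if (identifier.trim().length === 0) {
    throw new Error("Invalid evaluator identifier");
  }
  return `/v2/evaluators/${encodeURIComponent(identifier)}`;
}

// Fix 2: copy numeric LLM settings whenever they are defined, so an
// explicit 0 (e.g. temperature=0) survives a read-after-write.
function copyLlmConfig(llmConfig: LlmConfig): LlmConfig {
  const result: LlmConfig = {};
  if (llmConfig.temperature !== undefined) result.temperature = llmConfig.temperature;
  if (llmConfig.maxTokens !== undefined) result.maxTokens = llmConfig.maxTokens;
  if (llmConfig.topP !== undefined) result.topP = llmConfig.topP;
  return result;
}

console.log(evaluatorPath("my/slug"));          // /v2/evaluators/my%2Fslug
console.log(copyLlmConfig({ temperature: 0 })); // { temperature: 0 }
```

A truthiness check like `if (llmConfig.temperature)` is the bug being fixed: it treats `0` the same as `undefined`.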
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e5ca8cf5-492d-4019-8dac-15f703557978
📒 Files selected for processing (5)
- packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts
- packages/traceloop-sdk/src/lib/client/traceloop-client.ts
- packages/traceloop-sdk/src/lib/interfaces/evaluator.interface.ts
- packages/traceloop-sdk/src/lib/node-server-sdk.ts
- packages/traceloop-sdk/test/standalone-evaluators.test.ts
Request timed out after 900000ms (requestId=9a405c62-4ad9-4f45-976c-b48aed23c314)
Force-pushed 077fdb5 → 7276037 (Compare)
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts (1)
85-98: Extract shared schema mapping to one helper.
`update()` and `buildPayload()` duplicate the same `PropertySchema` → API field mapping. Centralizing this avoids future drift.

♻️ Suggested refactor

```diff
+  private toApiPropertySchema(p: PropertySchema): Record<string, unknown> {
+    return {
+      name: p.name,
+      type: p.type,
+      description: p.description,
+      enum_values: p.enumValues,
+    };
+  }
+
   async update(
 @@
     if (patch.inputSchema !== undefined)
-      payload.input_schema = patch.inputSchema.map((p) => ({
-        name: p.name,
-        type: p.type,
-        description: p.description,
-        enum_values: p.enumValues,
-      }));
+      payload.input_schema = patch.inputSchema.map((p) =>
+        this.toApiPropertySchema(p),
+      );
     if (patch.outputSchema !== undefined)
-      payload.output_schema = patch.outputSchema.map((p) => ({
-        name: p.name,
-        type: p.type,
-        description: p.description,
-        enum_values: p.enumValues,
-      }));
+      payload.output_schema = patch.outputSchema.map((p) =>
+        this.toApiPropertySchema(p),
+      );
 @@
       model: options.model,
-      input_schema: options.inputSchema.map((p) => ({
-        name: p.name,
-        type: p.type,
-        description: p.description,
-        enum_values: p.enumValues,
-      })),
-      output_schema: options.outputSchema.map((p) => ({
-        name: p.name,
-        type: p.type,
-        description: p.description,
-        enum_values: p.enumValues,
-      })),
+      input_schema: options.inputSchema.map((p) => this.toApiPropertySchema(p)),
+      output_schema: options.outputSchema.map((p) => this.toApiPropertySchema(p)),
     };
```

Also applies to: 144-155
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts` around lines 85 - 98, Both update() and buildPayload() duplicate mapping logic from PropertySchema objects (patch.inputSchema / patch.outputSchema) into API fields (payload.input_schema / payload.output_schema); extract that mapping into a single helper function (e.g., mapPropertySchemaToApiField or formatSchemaFields) that takes a PropertySchema and returns { name, type, description, enum_values }, then replace the inline .map(...) calls in both buildPayload() and update() to call this helper for inputSchema and outputSchema; ensure the helper is exported/placed near the evaluator utilities so both call sites can reuse it and update any types if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts`:
- Around line 64-66: Validate that the evaluator identifier is non-empty and not
just whitespace before building URLs: in methods like get, listExecutions, and
getExecution (and any other methods that accept an identifier) check that
identifier.trim().length > 0 and throw a clear Error (e.g., "Invalid evaluator
identifier") if it fails; only after validation use
encodeURIComponent(identifier) to construct paths such as
`/v2/evaluators/${encodeURIComponent(identifier)}` or
`/v2/evaluators/${encodeURIComponent(identifier)}/executions`.
- Around line 49-58: The list() output mapping assumes data.evaluators is an
array; guard against non-array payloads by checking
Array.isArray(data.evaluators) before mapping. In evaluator.ts update the items
assignment used by the mapping in list() (currently `items: any[] =
data.evaluators ?? []`) to set items to data.evaluators only when
Array.isArray(...) is true, otherwise use an empty array; keep the mapping that
returns id, slug, name, description, source, inputSchema and outputSchema
unchanged.
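The suggested guard in `list()` can be sketched like this; `mapListResponse` is a hypothetical stand-in for the SDK's mapping code, with the field names taken from the review text:

```typescript
// Hypothetical stand-in for the list() response mapping, guarding against
// payloads where `evaluators` is missing or not an array.

interface EvaluatorCatalogItem {
  id: string;
  slug: string;
  name: string;
}

function mapListResponse(data: { evaluators?: unknown }): EvaluatorCatalogItem[] {
  // Only map when the payload actually carries an array; anything else
  // (undefined, null, an error object) yields an empty list.
  const items = Array.isArray(data.evaluators) ? data.evaluators : [];
  return items.map((e: any) => ({ id: e.id, slug: e.slug, name: e.name }));
}

console.log(mapListResponse({ evaluators: "unexpected" })); // []
console.log(mapListResponse({ evaluators: [{ id: "1", slug: "s", name: "n" }] }));
```

Using `Array.isArray` rather than `?? []` also covers the case where the field is present but has the wrong type.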
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1429ab7f-59a6-4b03-9223-c28c0c7e8e4e
📒 Files selected for processing (5)
- packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts
- packages/traceloop-sdk/src/lib/client/traceloop-client.ts
- packages/traceloop-sdk/src/lib/interfaces/evaluator.interface.ts
- packages/traceloop-sdk/src/lib/node-server-sdk.ts
- packages/traceloop-sdk/test/standalone-evaluators.test.ts
✅ Files skipped from review due to trivial changes (4)
- packages/traceloop-sdk/src/lib/client/traceloop-client.ts
- packages/traceloop-sdk/src/lib/node-server-sdk.ts
- packages/traceloop-sdk/test/standalone-evaluators.test.ts
- packages/traceloop-sdk/src/lib/interfaces/evaluator.interface.ts
Also addressed the nitpick about snake_case test fixtures — updated the
Force-pushed 4f40f41 → 19c6f72 (Compare)
Force-pushed a93341e → 85eacd9 (Compare)
- Add a sample app that demonstrates the flow of creating and patching an evaluator
- In JS you need to add JSDocs to make it user-friendly - as detailed as possible
JSDocs Added.
Sample app results:

```text
lena.badinter@MREMD84ACDDC sample-app % npm run run:standalone_evaluators

sample-app@0.0.1 run:standalone_evaluators
npm run build && node dist/src/sample_standalone_evaluators.js

sample-app@0.0.1 build
tsc --build tsconfig.json

🚀 Standalone Evaluator SDK — Full Coverage Sample App
API: https://api.traceloop.dev

── 1. CREATE ──────────────────────────────────────────────────
✅ create with all fields → {"id":"cmnbnq18z000g01siegvop9m8","slug":"quality-eval-1774782576827"}
✅ create with minimal fields → {"id":"cmnbnq1n1000i01siea0uelog","slug":"minimal-evaluator-1774782577670"}
✅ create with enum output schema → {"id":"cmnbnq1rj000k01siuxgyolo0","slug":"enum-evaluator-1774782578164"}
✅ create with zero LLM params (temperature=0) → {"id":"cmnbnq1w2000m01sim5qtt6rl"}
✅ create missing name (correctly threw) → Key: 'CreateCustomEvaluatorInput.Name' Error:Field validation for 'Name' failed on the 'required' tag
✅ create missing provider (correctly threw) → Key: 'CreateCustomEvaluatorInput.Provider' Error:Field validation for 'Provider' failed on the 'required' tag
✅ create missing model (correctly threw) → Key: 'CreateCustomEvaluatorInput.Model' Error:Field validation for 'Model' failed on the 'required' tag
✅ create missing messages (correctly threw) → messages is required
✅ create missing input_schema (correctly threw) → input_schema is required
✅ create missing output_schema (correctly threw) → output_schema is required
✅ create input_schema property missing name (correctly threw) → input_schema validation failed: property name is required

── 2. LIST ────────────────────────────────────────────────────
✅ list all evaluators (316 found)
✅ list item has required fields → {"id":"cmhvvwu9w000001w72ccqzc4l","name":"cc","slug":"cc","type":"WORD_COUNT","source":"custom"}
✅ list custom evaluators (281 found)
✅ all listed items have source=custom → true
✅ list prebuilt evaluators (35 found)
✅ all listed items have source=prebuilt → true
✅ list with invalid source (correctly threw) → invalid source parameter: must be 'prebuilt' or 'custom'

── 3. GET ─────────────────────────────────────────────────────
✅ get by ID → {"id":"cmnbnq18z000g01siegvop9m8","slug":"quality-eval-1774782576827","provider":"openai"}
✅ get by slug → {"id":"cmnbnq18z000g01siegvop9m8","slug":"quality-eval-1774782576827"}
✅ get returns all required fields → "yes"
✅ get non-existent ID (correctly threw) → Evaluator not found
✅ get non-existent slug (correctly threw) → Evaluator not found
✅ get empty identifier (correctly threw) → Evaluator identifier must be a non-empty string
✅ get whitespace identifier (correctly threw) → Evaluator identifier must be a non-empty string

── 4. UPDATE ──────────────────────────────────────────────────
✅ update name only → {"id":"cmnbnq18z000g01siegvop9m8"}
✅ update by slug → {"id":"cmnbnq18z000g01siegvop9m8"}
✅ update with duplicate name (correctly threw) → evaluator with this name already exists in this project
✅ update config and output_schema → {"id":"cmnbnq18z000g01siegvop9m8"}
✅ update input_schema only → {"id":"cmnbnq18z000g01siegvop9m8"}
✅ update output_schema and config → {"id":"cmnbnq18z000g01siegvop9m8"}
✅ update all fields → {"id":"cmnbnq18z000g01siegvop9m8"}
✅ update with temperature=0 → {"id":"cmnbnq18z000g01siegvop9m8"}
✅ update config without output_schema (correctly threw) (correctly threw) → config and output_schema must be provided together
✅ update output_schema without config (correctly threw) (correctly threw) → config and output_schema must be provided together
✅ update with empty body (correctly threw) → at least one field must be provided for evaluator patch
✅ update non-existent evaluator (correctly threw) → Evaluator not found
✅ update empty identifier (correctly threw) → Evaluator identifier must be a non-empty string

── 5. RUN ─────────────────────────────────────────────────────
✅ run by ID with minimal input → {"result":{"passed":true,"reason":"The statement is clear, accurate, and provides a scientific explanation for the color of the sky."}}
✅ run by slug → {"result":{"passed":true,"reason":"The statement is factually correct and clearly conveys accurate information."}}
✅ run with multiple input fields → {"result":{"passed":true,"reason":"The statement is correct; water boils at 100 degrees Celsius at standard atmospheric pressure (1 atm)."}}
✅ run second evaluator → {"result":{"error":"Error code: 400"}}
✅ run with empty input (correctly threw) → input is required for custom evaluator execution
✅ run non-existent evaluator (correctly threw) → Evaluator not found
✅ run empty identifier (correctly threw) → Evaluator identifier must be a non-empty string
✅ run whitespace identifier (correctly threw) → Evaluator identifier must be a non-empty string

══════════════════════════════════════════════════════════════
Results: 46 passed, 0 failed
✅ All tests passed!
lena.badinter@MREMD84ACDDC sample-app %
```
Force-pushed 1c33acb → 97b92f7 (Compare)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts`:
- Around line 235-239: The run() return currently sets result: data which
returns the whole API response; change it to return the actual evaluation
payload by using data.result (with a safe fallback) so that executionId remains
executionId and result contains only the evaluation output; update the return in
the run() function to set result: data.result ?? data to preserve backward
compatibility if result is missing.
- Around line 207-209: The returned object is extracting the id from a nested
evaluator object but the API/test uses a flat evaluatorId field; in evaluator.ts
update the extraction after handleResponse to read data.evaluatorId (e.g. return
{ id: data.evaluatorId }) instead of data.evaluator?.id, and adjust any related
typing in the enclosing method (the function that currently does "const data =
await this.handleResponse(response); return { id: ... }") so it matches the
create() behavior and the test mock.
- Around line 175-201: The update() payload is incorrectly nesting fields under
payload.config and payload.config.llm_config (see payload.config,
config.llm_config) while tests expect the same flat structure as
create()/buildPayload(); fix by flattening the patch into the top-level payload
(move messages, provider, model, temperature, max_tokens, top_p,
frequency_penalty, presence_penalty, etc. out of payload.config and
payload.config.llm_config into payload) or simply reuse buildPayload(patch) from
buildPayload() to construct the request body so update() and create() share the
same flat contract.
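Two of the fixes above (returning the flat evaluatorId and unwrapping the run result with a safe fallback) can be sketched as pure mapping helpers. The response shapes here follow the review comments and may differ from the real API:

```typescript
// Sketches of the suggested response mappings; the shapes are illustrative,
// taken from the review comments rather than the real API contract.

interface RunResponse {
  executionId: string;
  result?: unknown;
  [key: string]: unknown;
}

function mapRunResponse(data: RunResponse) {
  // Return only the evaluation payload, falling back to the whole body
  // when `result` is missing, to stay backward compatible.
  return { executionId: data.executionId, result: data.result ?? data };
}

function mapUpdateResponse(data: { evaluatorId: string }) {
  // The API returns a flat evaluatorId, not a nested evaluator object.
  return { id: data.evaluatorId };
}

console.log(mapRunResponse({ executionId: "ex1", result: { passed: true } }));
console.log(mapUpdateResponse({ evaluatorId: "ev1" })); // { id: 'ev1' }
```

The `??` fallback keeps older servers working: if they omit `result`, callers still receive the full body instead of `undefined`.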
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9402e111-d7da-48b5-917a-7a8b41e2a6e1
📒 Files selected for processing (5)
- packages/traceloop-sdk/src/lib/client/evaluator/evaluator.ts
- packages/traceloop-sdk/src/lib/client/traceloop-client.ts
- packages/traceloop-sdk/src/lib/interfaces/evaluator.interface.ts
- packages/traceloop-sdk/src/lib/node-server-sdk.ts
- packages/traceloop-sdk/test/standalone-evaluators.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/traceloop-sdk/src/lib/client/traceloop-client.ts
- packages/traceloop-sdk/src/lib/node-server-sdk.ts
Force-pushed ee0d65c → 8a5e1e7 (Compare)
Force-pushed 349c861 → c7f199d (Compare)
Summary
Adds SDK support for managing standalone custom evaluators (not bound to any project or environment): create, list, get, update, and run.
Fixed TLP-1998
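The flow this PR enables can be sketched end to end as follows. Method names and payload shapes are taken from the sequence diagram above; the in-memory client below is a hypothetical stand-in so the sketch runs without the real Traceloop API:

```typescript
// Runnable sketch of the create -> get -> run flow with an in-memory
// stand-in client; the real SDK talks to /v2/evaluators over HTTP.

interface PropertySchema { name: string; type: string; enumValues?: string[] }

interface CreateOptions {
  name: string;
  provider: string;
  model: string;
  inputSchema: PropertySchema[];
  outputSchema: PropertySchema[];
}

class FakeEvaluatorClient {
  private store = new Map<string, CreateOptions>();

  async create(options: CreateOptions) {
    if (!options.name) throw new Error("name is required");
    const id = `ev_${this.store.size + 1}`;
    this.store.set(id, options);
    return { id, slug: options.name };
  }

  async get(identifier: string) {
    if (identifier.trim().length === 0) {
      throw new Error("Evaluator identifier must be a non-empty string");
    }
    const found = this.store.get(identifier);
    if (!found) throw new Error("Evaluator not found");
    return found;
  }

  async run(identifier: string, body: { input: Record<string, unknown> }) {
    await this.get(identifier); // reuse identifier validation
    return { executionId: "exec_1", result: { passed: true } };
  }
}

async function demo() {
  const evaluator = new FakeEvaluatorClient();
  const { id, slug } = await evaluator.create({
    name: "quality-eval",
    provider: "openai",
    model: "gpt-4o-mini",
    inputSchema: [{ name: "text", type: "string" }],
    outputSchema: [{ name: "label", type: "enum", enumValues: ["good", "bad"] }],
  });
  const { result } = await evaluator.run(id, { input: { text: "The sky is blue." } });
  console.log(id, slug, result);
}

demo();
```

The validation behavior mirrors the sample-app output above (non-empty identifier, "Evaluator not found"), but the stand-in is intentionally minimal.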
Test plan
`pnpm test` passes (all existing + new standalone evaluator tests)

Summary by CodeRabbit
New Features
Tests