103 failed test cases across 6 experiments. Fix by improving tool descriptions (the most important factor per evals/README.md).
Failed cases: 103
- experiment-cb4f5987004088687b05ab69: 11
- experiment-86552f5159c0ae4c4b3d92b2: 16
- experiment-435995e92aaced9c46c5859c: 22
- experiment-9eb78796dd81ed5083eb2d58: 20
- experiment-d5587019ccdc52204cce0064: 20
- experiment-4dd9f161222374467d278cdc: 14
Phoenix: https://app.phoenix.arize.com/s/apify
Never use an LLM to automatically fix tool descriptions. Always make improvements manually, based on your understanding of the problem. LLMs are very likely to worsen the issue instead of fixing it.
Guidelines (from evals/README.md):
- update one tool at a time (changing multiple tools simultaneously is untraceable)
- focus on exact tool match first (easier to debug and track)
- prioritize descriptions over examples (descriptions are most important)
- test incrementally (subset → full dataset)
- verify across multiple models (different models may behave differently)
Tool description best practices (from evals/README.md):
- Provide extremely detailed descriptions (most important factor)
- Explain: what it does, when to use it (and when not), what each parameter means
- Prioritize descriptions over examples (add examples only after comprehensive description)
- Aim for at least 3-4 sentences, more if complex
- Start with "use this when..." and call out disallowed cases
Workflow:
- analyze phoenix results to understand the problem
- manually write/update tool description based on understanding
- run npm run evals:run
- check phoenix dashboard
- verify no regressions
- iterate experimentally (trial and error)
- move to next tool
File: src/tools/actor.ts lines 333-361
Impact: ~30 cases (29%)
Problem:
LLM uses step="info" when user explicitly requests execution with parameters.
Failed cases:
- "Run apify/instagram-scraper to scrape #dwaynejohnson" → got step="info", expected step="call" with hashtag
- "Call apify/google-search-scraper to find restaurants in London" → got step="info", expected step="call" with query
- "Call epctex/weather-scraper for New York" → got step="info", expected step="call" with location
Root cause: Lines 349-358 say "MANDATORY TWO-STEP-WORKFLOW" and "You MUST do this step first", making LLM always start with "info" even when user explicitly requests execution.
What needs to be addressed in description:
- Clarify when to use step="info" vs step="call":
- add explicit "when to use step='info'" section at top
- add explicit "when to use step='call' directly" section
- emphasize: if user explicitly requests execution with parameters → use step="call" directly
- only use step="info" if user asks about details or you need to discover schema
- Make workflow less prescriptive:
- change "MANDATORY TWO-STEP-WORKFLOW" to "two-step workflow (when needed)"
- remove "You MUST do this step first" language
- explain workflow is optional when user provides clear execution intent
- Add clear disallowed cases:
- do not use step="info" when user explicitly requests execution
- do not use step="info" when user provides parameters in query
- Add examples (after comprehensive description):
- correct: user requests execution → step="call"
- correct: user asks about parameters → step="info"
- wrong: user requests execution → step="info"
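The points above could shape the reworked description roughly like this (a sketch only; the exact wording and the `callActorDescription` name are illustrative, not the actual contents of src/tools/actor.ts):

```typescript
// Hypothetical reorganized description for the call-actor tool.
// Structure follows the plan: "when to use" sections first, the two-step
// workflow marked as optional, and disallowed cases called out explicitly.
const callActorDescription = `
Use step="call" directly when the user explicitly requests execution and
provides the parameters in the query (e.g. "Run apify/instagram-scraper to
scrape #dwaynejohnson").

Use step="info" only when the user asks about an Actor's details, or when you
need to discover the input schema before building a call.

Two-step workflow (when needed): if the input schema is unknown, you may call
step="info" first, then step="call". This workflow is optional when the user
provides clear execution intent.

Do not use step="info" when the user explicitly requests execution or already
provides parameters in the query.
`.trim();

// Sanity check: the mandatory-sounding language from the old description
// ("MANDATORY", "MUST") should be gone.
const hasMandatoryLanguage = /MANDATORY|MUST/.test(callActorDescription);
console.log(hasMandatoryLanguage); // false
```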
Testing:
- filter by category: "call-actor" and expectedTools: ["call-actor"]
- focus on execution requests
- verify no regressions
File: src/tools/store_collection.ts lines 86-114
Impact: ~35 cases (34%)
Problem categories:
Failed cases:
- "Find actors for scraping social media" → keywords: "social media scraper" (should be "social media")
- "What tools can extract data from e-commerce sites?" → keywords: "e-commerce scraper" (should be "e-commerce")
- "Find actors for flight data extraction" → keywords: "flight data extraction" (should be "flight data" or "flight booking")
Root cause: Keyword rules exist at lines 47-48 in parameter description but are buried. LLM doesn't see them prominently.
What needs to be addressed in description:
- Move keyword rules to top of description:
- never include generic terms: "scraper", "crawler", "extractor", "extraction", "scraping"
- use only platform names (instagram, twitter) and data types (posts, products, profiles)
- add explicit examples: "instagram posts" (correct) | "instagram scraper" (wrong)
- Add simplicity rule:
- use simplest, most direct keywords possible
- ignore additional context in user query (e.g., "about ai", "python")
- if user asks "instagram posts about ai" → use keywords: "instagram posts" (not "instagram posts ai")
- Add single query rule:
- always use one search call with most general keyword
- do not make multiple specific calls unless user explicitly asks for specific data types
- example: "facebook data" → one call with "facebook" (not multiple calls for posts/pages/groups)
- Add "do not use" section:
- do not use for fetching actual data (news, weather, web content) → use apify-slash-rag-web-browser
- do not use for running actors → use call-actor or dedicated actor tools
- do not use for getting actor details → use fetch-actor-details
- do not use for overly general queries → ask user for specifics
- Add "only use when" section:
- user specifies platform (instagram, twitter, amazon, etc.)
- user specifies data type (posts, products, profiles, etc.)
- user mentions specific service or website
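The keyword rules can be sanity-checked against the failed cases with a small helper (the helper and its name are hypothetical; the actual fix is in the description text, not code):

```typescript
// Hypothetical helper illustrating the keyword rules: strip generic terms
// ("scraper", "crawler", ...) so only platform names and data types remain.
const GENERIC_TERMS = ["scraper", "crawler", "extractor", "extraction", "scraping"];

function normalizeKeywords(query: string): string {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => !GENERIC_TERMS.includes(word))
    .join(" ");
}

// Matches the expected keywords from the failed cases above:
console.log(normalizeKeywords("social media scraper"));   // "social media"
console.log(normalizeKeywords("e-commerce scraper"));     // "e-commerce"
console.log(normalizeKeywords("flight data extraction")); // "flight data"
```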
Impact: ~20 cases (19%)
Failed cases:
- "Fetch recent articles about climate change" → used search-actors, expected apify-slash-rag-web-browser
- "Get the latest weather forecast for New York" → used search-actors, expected apify-slash-rag-web-browser
- "Get the latest tech industry news" → used search-actors, expected apify-slash-rag-web-browser
Fix: Already covered in section 2 above (do not use section).
File: src/tools/fetch-actor-details.ts lines 20-30
Failed cases:
- "What parameters does apify/instagram-scraper accept?" → used call-actor step="info", expected fetch-actor-details
Root cause:
Description doesn't clearly distinguish when to use fetch-actor-details vs call-actor step="info".
What needs to be addressed in description:
- add explicit "use this tool when" section:
- user asks about actor parameters, input schema, or configuration
- user asks about actor documentation or how to use it
- user asks about actor pricing or cost information
- user asks about actor details, description, or capabilities
- add explicit "do not use" section:
- do not use call-actor with step="info" for these queries; use fetch-actor-details instead
- clarify distinction:
- fetch-actor-details: for getting actor information/documentation
- call-actor step="info": for discovering input schema before calling (not for documentation queries)
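The intended routing is a simple two-way split; a minimal sketch (function and type names are hypothetical, only the tool names come from this doc):

```typescript
// Hypothetical routing table capturing the fetch-actor-details vs
// call-actor step="info" distinction described above.
type ToolChoice = "fetch-actor-details" | 'call-actor step="info"';

function pickTool(intent: "documentation" | "schema-before-call"): ToolChoice {
  // Questions about parameters, pricing, docs, or capabilities
  // → fetch-actor-details.
  // Discovering the input schema immediately before a call
  // → call-actor step="info".
  return intent === "documentation"
    ? "fetch-actor-details"
    : 'call-actor step="info"';
}

console.log(pickTool("documentation"));      // "fetch-actor-details"
console.log(pickTool("schema-before-call")); // 'call-actor step="info"'
```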
Impact: ~12 cases (12%)
Failed cases:
- "How does apify/rag-web-browser work?" → no tool called, expected fetch-actor-details
- "documentation" → no tool called, expected search-apify-docs
- "Look for news articles on AI" → no tool called, expected apify-slash-rag-web-browser
Fix: Add "must use" section to each tool description. This might be model/configuration issue, but clearer guidance helps.
Impact: ~6 cases (6%)
Failed cases:
- "Find actors for data extraction tasks" → used search-actors, expected to ask for specifics
Fix: Already covered in section 2 above (do not use for overly general queries).
- fix call-actor description (when to use step="call" vs step="info")
- fix search-actors keyword rules (move to top, add rules)
- add "do not use" sections
Estimated impact: ~65 cases resolved (63%)
- improve fetch-actor-details vs call-actor distinction
- add explicit guidance about apify-slash-rag-web-browser vs search-actors
Estimated impact: ~30 cases resolved (29% of remaining)
- add general query handling guidance
- improve missing tool call handling (may require system prompt changes)
Estimated impact: ~8 cases resolved (8% of remaining)
- add "when to use" section at top
- reorganize workflow (less prescriptive)
- add examples
- move keyword rules to top
- add "do not use" section
- add simplicity rule
- add single query rule
- add "use this tool when" section
- add "do not use call-actor" warning
- run npm run evals:run
- check phoenix dashboard
- verify phase 1 cases now pass
- check for regressions
- iterate on phase 2
- some test cases may have ambiguous expected behavior
- tool descriptions should be verbose and explicit
- examples come after comprehensive descriptions
- update one tool at a time, test incrementally