Skip to content

hyper-mcp-rs/arxiv-plugin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

arXiv API Plugin

A hyper-mcp WebAssembly plugin that exposes the arXiv API as a single MCP tool with structured input and output.

Tool

query

Searches arXiv.org for e-prints via the arXiv query interface and returns structured metadata for each matching article.

Input Schema (all parameters are optional, but at least one of search_query or id_list must be provided):

Parameter Type Description
search_query string An arXiv search query, e.g. all:electron, ti:"quantum criticality", or a boolean expression like au:del_maestro AND ti:checkerboard. See the query construction appendix.
id_list string A comma-delimited list of arXiv ids, e.g. 2301.00001,hep-ex/0307015.
start int The 0-based index of the first returned result (default 0).
max_results int The maximum number of results to return (default 10, max 2000).
sortBy enum One of relevance, lastUpdatedDate, submittedDate.
sortOrder enum One of ascending, descending.
verify_source_url bool Whether to verify each entry's e-print source_url with a HEAD request (default true). Set false for large/unbounded queries — see below.

If both search_query and id_list are provided, the results are the articles in id_list that also match search_query (i.e. id_list acts as a filter).

Performance note: source_url verification issues one HTTP HEAD request per returned entry. For large or unbounded result sets this is slow (and can time out), so set verify_source_url to false for bulk queries — source_url is then omitted from every entry while all other fields are returned as usual. Prefer small slices (via start/max_results) when you do want verified source links.

Output: The arXiv API returns an Atom 1.0 XML feed, which the plugin deserializes into a structured response:

{
  "total_results": 182239,   // total matches for the query (not just this page)
  "start_index": 0,          // 0-based index of the first returned result
  "items_per_page": 2,       // number of results in this page
  "entries": [
    {
      "id": "cond-mat/0011267v1",       // parsed from the <id> tag's …/abs/ URL
      "title": "...",
      "summary": "...",                  // the abstract
      "published": "2000-11-15T16:19:15Z",
      "updated": "2000-11-15T16:19:15Z",
      "authors": [{ "name": "...", "affiliations": ["..."] }],
      "primary_category": "cond-mat.supr-con",
      "categories": ["cond-mat.supr-con", "cond-mat.str-el"],
      "comment": "...",                  // optional
      "journal_ref": "...",              // optional
      "dois": [                          // optional (may be several)
        { "doi": "10.1234/example", "url": "https://doi.org/10.1234/example" }
      ],
      "abstract_url": "https://arxiv.org/abs/cond-mat/0011267v1",
      "pdf_url": "https://arxiv.org/pdf/cond-mat/0011267v1",
      "source_url": "https://arxiv.org/e-print/cond-mat/0011267v1"  // optional
    }
  ]
}

Notes on the parsing:

  • id is parsed from the entry's <id> tag by stripping the http(s)://arxiv.org/abs/ prefix. arXiv ids may be new-style (2301.00001) or old-style containing a slash (hep-ex/0307015), so the prefix is stripped rather than splitting on the last /.
  • primary_category / categories come from the term attribute of the <arxiv:primary_category> and <category> elements.
  • pdf_url is taken from the entry's <link title="pdf"> element.
  • source_url is the article's e-print (source) bundle at https://arxiv.org/e-print/<id>. It is included only when a HEAD request to that URL returns a 2xx status, so it is omitted for articles without a published source bundle.
  • dois is a list of { doi, url } objects. An article may have several DOIs (e.g. an original plus errata). Each DOI string comes from an <arxiv:doi> element and is paired with its resolved URL from the matching <link title="doi"> element (omitted if absent or malformed).

Every URL field (abstract_url, pdf_url, source_url, and each dois[].url) is parsed and validated as a well-formed URL before being returned; any malformed URL is omitted rather than passed through. The values are still serialized as JSON strings, and the output schema marks them with "format": "uri".

arXiv reports query errors (such as malformed ids) as an Atom feed containing a single error entry; the plugin detects these and returns them as a tool error.

Configuration

The plugin needs network access to the arXiv API host and the arXiv site (for e-print HEAD checks):

{
  "plugins": {
    "arxiv": {
      "url": "oci://your-registry/arxiv-plugin:latest",
      "runtime_config": {
        "allowed_hosts": ["export.arxiv.org", "arxiv.org"]
      }
    }
  }
}

For local development, point the plugin at the built WASM file:

{
  "plugins": {
    "arxiv": {
      "url": "file:///path/to/target/wasm32-wasip1/release/plugin.wasm",
      "runtime_config": {
        "allowed_hosts": ["export.arxiv.org", "arxiv.org"]
      }
    }
  }
}

Development

Building

rustup target add wasm32-wasip1
cargo build --release --target wasm32-wasip1
# Output: target/wasm32-wasip1/release/plugin.wasm

Testing

Tests run against the native host target and exercise the XML deserialization and transformation logic using captured fixtures in tests/fixtures/:

cargo test

Code Quality

cargo fmt -- --check
cargo clippy --all-targets -- -D warnings

License

Apache 2.0. See LICENSE.

About

A plugin that let you search for papers on arXiv and download them.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors