Enhance arXiv retrieval robustness: add retry logic for HTTP errors #261

Open
dekrt wants to merge 3 commits into TideDra:main from dekrt:main

Conversation

dekrt commented Jun 6, 2026

This pull request improves the robustness of the arXiv paper retrieval process by handling additional retryable HTTP errors and adds tests to ensure correct behavior in these scenarios. The main changes are the introduction of a set of retryable status codes, enhanced error handling logic during batch retrieval, and new tests for these cases.

Error handling improvements:

  • Introduced RETRYABLE_ARXIV_STATUSES in arxiv_retriever.py to define which HTTP status codes (429, 500, 502, 503, 504) should trigger a retry when communicating with the arXiv API.
  • Updated the _retrieve_raw_papers method to retry on any status in RETRYABLE_ARXIV_STATUSES, log appropriate warnings, and skip batches after maximum retries, ensuring only truly unrecoverable errors are raised.
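
The retry-then-skip behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function `retrieve_batches`, the `fetch` callable, the `delay_seconds` parameter, and the use of `urllib.error.HTTPError` are all assumptions made for the example. Only the name `RETRYABLE_ARXIV_STATUSES` and its five status codes come from the description.

```python
import logging
import time
from urllib.error import HTTPError

logger = logging.getLogger(__name__)

# Status codes listed in the PR description as worth retrying.
RETRYABLE_ARXIV_STATUSES = frozenset({429, 500, 502, 503, 504})


def retrieve_batches(batches, fetch, max_batch_retries=3, delay_seconds=0.0):
    """Hypothetical stand-in for _retrieve_raw_papers.

    `fetch` is an assumed callable that returns a list of papers for one
    batch or raises HTTPError. Each batch gets one initial attempt plus
    up to `max_batch_retries` retries.
    """
    papers = []
    for batch_index, batch in enumerate(batches):
        for attempt in range(max_batch_retries + 1):
            try:
                papers.extend(fetch(batch))
                break  # batch succeeded, move on to the next one
            except HTTPError as exc:
                if exc.code not in RETRYABLE_ARXIV_STATUSES:
                    raise  # unrecoverable error: surface it to the caller
                if attempt == max_batch_retries:
                    logger.warning(
                        "Skipping batch %d after %d retries due to arXiv API %d",
                        batch_index, max_batch_retries, exc.code,
                    )
                    break  # give up on this batch only
                logger.warning(
                    "arXiv API returned %d for batch %d; retrying",
                    exc.code, batch_index,
                )
                time.sleep(delay_seconds)
    return papers
```

Under these assumptions, a batch that keeps returning 503 is dropped while the remaining batches are still processed, and a 400 propagates immediately.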

Testing improvements:

  • Added pytest as a test dependency.
  • Added tests to verify that batches are skipped after retryable HTTP errors and that non-retryable errors are raised, ensuring the new error handling logic works as intended.
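
As a rough idea of what such tests might look like, the sketch below fakes the HTTP failures instead of calling arXiv. It is not the PR's test file: `fetch_or_skip` is a tiny hypothetical stand-in for the code under test, `always_fail` is an invented helper, and `urllib.error.HTTPError` is assumed as the exception type. The test functions use plain asserts so that pytest can collect them without extra imports.

```python
from urllib.error import HTTPError

# Status codes listed in the PR description as retryable.
RETRYABLE_ARXIV_STATUSES = frozenset({429, 500, 502, 503, 504})


def fetch_or_skip(fetch, max_retries=2):
    """Hypothetical code under test: returns None when the batch is skipped."""
    for _ in range(max_retries + 1):
        try:
            return fetch()
        except HTTPError as exc:
            if exc.code not in RETRYABLE_ARXIV_STATUSES:
                raise  # non-retryable errors propagate
    return None  # retries exhausted, batch skipped


def always_fail(status):
    """Build a fake fetcher that always raises HTTPError with `status`."""
    def fetch():
        raise HTTPError("https://example.invalid", status, "simulated", None, None)
    return fetch


def test_batch_skipped_after_retryable_error():
    # 503 is retryable, so the batch is skipped rather than raising.
    assert fetch_or_skip(always_fail(503)) is None


def test_non_retryable_error_is_raised():
    # 400 is not retryable, so the error must reach the caller.
    try:
        fetch_or_skip(always_fail(400))
    except HTTPError as exc:
        assert exc.code == 400
    else:
        raise AssertionError("expected HTTPError for status 400")
```

In a real pytest suite the second test would more commonly use `pytest.raises(HTTPError)`; the try/except form is used here only to keep the sketch free of third-party imports.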

Copilot AI and others added 3 commits June 5, 2026 06:15
Copilot AI review requested due to automatic review settings June 6, 2026 03:59

Copilot AI left a comment

Contributor


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves arXiv batch retrieval resilience by retrying additional transient HTTP errors (not just 429), skipping batches after exhausting retries, and adding tests to validate retryable vs non-retryable behavior.

Changes:

  • Add a shared set of retryable arXiv HTTP statuses and use it in _retrieve_raw_papers.
  • Skip a batch after max retries for retryable status codes and continue processing.
  • Add pytest coverage for retryable (503) and non-retryable (400) HTTP error handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/retriever/test_arxiv_retriever.py | Adds tests covering skipped batches after retryable errors and raising on non-retryable HTTP errors. |
| src/zotero_arxiv_daily/retriever/arxiv_retriever.py | Expands retry logic to multiple transient HTTP statuses and adds batch-skipping behavior after retries. |

Comment on lines +157 to +165
        elif status in RETRYABLE_ARXIV_STATUSES:
            logger.warning(
                f"Skipping batch {i // 20} after {max_batch_retries} retries due to arXiv API {status}"
            )
            break
        else:
            raise
if not batch_succeeded:
    logger.warning(f"No papers retrieved for batch {i // 20}")
3 participants