Skip to content

Commit 9d53bd1

Browse files
authored
Merge branch 'main' into main
2 parents 2514540 + 2e37aad commit 9d53bd1

35 files changed

Lines changed: 1940 additions & 495 deletions

.github/copilot-instructions.md

Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
1+
# Copilot Instructions
2+
3+
## Project Overview
4+
5+
Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via LLM, and delivers results by email. Designed to run as a GitHub Actions workflow at zero cost.
6+
7+
## Commands
8+
9+
```bash
10+
# Install/sync dependencies
11+
uv sync
12+
13+
# Run the application
14+
uv run src/zotero_arxiv_daily/main.py
15+
16+
# Run tests (excludes slow tests by default)
17+
uv run pytest
18+
19+
# Run all tests including slow ones
20+
uv run pytest -m ""
21+
22+
# Run a single test
23+
uv run pytest tests/test_utils.py::TestGlobMatch -v
24+
```
25+
26+
No linter or formatter is configured.
27+
28+
## Architecture
29+
30+
The app is a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
31+
32+
1. **Fetch Zotero corpus** → pyzotero API
33+
2. **Filter corpus**`include_path` / `ignore_path` glob patterns
34+
3. **Retrieve new papers** → from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
35+
4. **Rerank** → weighted embedding similarity to corpus (newer Zotero papers weighted higher)
36+
5. **Generate TLDRs + affiliations** → OpenAI-compatible LLM API
37+
6. **Render + send email** → HTML email via SMTP
38+
39+
### Plugin Systems
40+
41+
**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever("name")` decorator on a `BaseRetriever` subclass. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`. Discovered at runtime via `get_retriever_cls(name)` from a module-level `registered_retrievers` dict.
42+
43+
**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker("name")` decorator on a `BaseReranker` subclass. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint). Discovered via `get_reranker_cls(name)`.
44+
45+
When adding a new retriever or reranker, follow the existing pattern: create a new file, subclass the base, apply the registration decorator, and implement the abstract methods.
46+
47+
### Configuration
48+
49+
Uses Hydra + OmegaConf. Config composes from `config/base.yaml` (defaults with `???` placeholders for required values) + `config/custom.yaml` (user overrides). The composition order is defined in `config/default.yaml`. Environment variables are interpolated via `${oc.env:VAR_NAME,default}` syntax. Entry point uses `@hydra.main(config_name="default")`.
50+
51+
### Data Classes
52+
53+
`Paper` and `CorpusPaper` in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly with `tiktoken`-based token truncation.
54+
55+
## Testing Conventions
56+
57+
- Tests use **pytest monkeypatch + `SimpleNamespace`** for stubs — not `unittest.mock`.
58+
- A session-scoped Hydra config in `tests/conftest.py` is deep-copied per test via the `config` fixture.
59+
- Canned response factories live in `tests/canned_responses.py` (e.g., `make_stub_openai_client()`, `make_stub_zotero_client()`).
60+
- Tests marked `@pytest.mark.slow` require heavy dependencies (model downloads) and are excluded by default (`addopts = "-m 'not slow'"` in pyproject.toml).
61+
- Monkeypatching targets the module-level import path (e.g., `"zotero_arxiv_daily.executor.zotero.Zotero"`).
62+
63+
## Coding Conventions
64+
65+
- **Logging:** `loguru.logger` throughout — never `print()` or stdlib `logging`.
66+
- **Type hints:** Modern Python 3.10+ syntax (`list[Paper]`, `str | None`).
67+
- **Constants:** Module-level `UPPER_SNAKE_CASE`.
68+
- **Private methods:** Prefixed with `_` (e.g., `_retrieve_raw_papers`).
69+
- **Error handling:** Graceful degradation with try/except and fallback logic; log warnings rather than raising.
70+
- **Config injection:** All major components receive `DictConfig` at init and store it as `self.config`.
71+
72+
## Git Workflow
73+
74+
- PRs should target the **`dev`** branch, not `main`.

.github/keep-alive.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,2 @@
1-
Last run: 2026-05-01 01:29:37 UTC
1+
Last run: 2026-05-01 01:01:08 UTC
22
This file is automatically updated to keep the repository active and prevent GitHub Actions from disabling scheduled workflows.

.github/workflows/ci.yml

Lines changed: 1 addition & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -12,31 +12,13 @@ on:
1212
jobs:
1313
pytest:
1414
runs-on: ubuntu-latest
15-
services:
16-
mailhog:
17-
image: mailhog/mailhog:latest
18-
ports:
19-
- 1025:1025 # SMTP
20-
openai:
21-
image: tidedra/mock_openai:latest
22-
ports:
23-
- 30000:30000
2415
steps:
2516
- name: Checkout
2617
uses: actions/checkout@v6
2718

2819
- name: Setup uv
2920
uses: astral-sh/setup-uv@v7.1.4
3021

31-
3222
- name: Run Pytest
33-
env:
34-
ZOTERO_ID: "0"
35-
ZOTERO_KEY: "AbCdEfGhIjKlMnOpQrStUvWx"
36-
SENDER: "test@example.com"
37-
RECEIVER: "test@example.com"
38-
SENDER_PASSWORD: "test"
39-
OPENAI_API_KEY: "sk-xxx"
40-
OPENAI_API_BASE: "http://openai:30000/v1"
4123
run: |
42-
uv run pytest -m ""
24+
uv run pytest -m "" --cov=src/zotero_arxiv_daily --cov-report=term-missing

CLAUDE.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
# CLAUDE.md
2+
3+
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4+
5+
## Project Overview
6+
7+
Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via LLM, and delivers results by email. Designed to run as a GitHub Actions workflow at zero cost.
8+
9+
## Commands
10+
11+
```bash
12+
# Run the application
13+
uv run src/zotero_arxiv_daily/main.py
14+
15+
# Run tests (excludes slow tests by default)
16+
uv run pytest
17+
18+
# Run all tests including slow ones
19+
uv run pytest -m ""
20+
21+
# Run a single test
22+
uv run pytest tests/test_utils.py::TestGlobMatch -v
23+
24+
# Install/sync dependencies
25+
uv sync
26+
```
27+
28+
No linter or formatter is configured.
29+
30+
## Architecture
31+
32+
The app follows a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
33+
34+
1. **Fetch Zotero corpus** — retrieves user's library papers via pyzotero API
35+
2. **Filter corpus** — applies `include_path` glob patterns to select relevant collections
36+
3. **Retrieve new papers** — fetches from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
37+
4. **Rerank** — scores candidates by weighted similarity to corpus (newer Zotero papers weighted higher)
38+
5. **Generate TLDRs + affiliations** — via OpenAI-compatible LLM API
39+
6. **Render + send email** — HTML email via SMTP
40+
41+
### Plugin Systems
42+
43+
**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever` decorator, discovered by `get_retriever_cls()`. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`.
44+
45+
**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker` decorator, discovered by `get_reranker_cls()`. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint).
46+
47+
### Configuration
48+
49+
Uses Hydra + OmegaConf. Config is composed from `config/base.yaml` (defaults) + `config/custom.yaml` (user overrides). Environment variables are interpolated via `${oc.env:VAR_NAME,default}` syntax. Entry point uses `@hydra.main`.
50+
51+
### Data Classes
52+
53+
`Paper` and `CorpusPaper` in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly.
54+
55+
## Testing
56+
57+
Tests marked `@pytest.mark.slow` require heavy dependencies (e.g., sentence-transformers model download) and are skipped locally by default (`addopts = "-m 'not slow'"` in pyproject.toml). All other tests run with pure Python stubs (no Docker containers needed).
58+
59+
```bash
60+
# Run tests (excludes slow tests)
61+
uv run pytest
62+
63+
# Run all tests including slow ones
64+
uv run pytest -m ""
65+
66+
# Run with coverage
67+
uv run pytest --cov=src/zotero_arxiv_daily --cov-report=term-missing
68+
```
69+
70+
## gstack
71+
72+
Use the `/browse` skill from gstack for all web browsing. Never use `mcp__claude-in-chrome__*` tools.
73+
74+
Available skills: `/office-hours`, `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, `/design-consultation`, `/design-shotgun`, `/design-html`, `/review`, `/ship`, `/land-and-deploy`, `/canary`, `/benchmark`, `/browse`, `/connect-chrome`, `/qa`, `/qa-only`, `/design-review`, `/setup-browser-cookies`, `/setup-deploy`, `/retro`, `/investigate`, `/document-release`, `/codex`, `/cso`, `/autoplan`, `/plan-devex-review`, `/devex-review`, `/careful`, `/freeze`, `/guard`, `/unfreeze`, `/gstack-upgrade`, `/learn`.
75+
76+
If gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.
77+
78+
## Git Workflow
79+
80+
- PRs should target the `dev` branch, not `main`
81+
- Current development branch: `dev`

README.md

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -192,8 +192,6 @@ This project is in active development. You can subscribe this repo via `Watch` s
192192
- The recommendation algorithm is very simple, it may not accurately reflect your interest. Welcome better ideas for improving the algorithm!
193193
- High `MAX_PAPER_NUM` can lead the execution time exceed the limitation of Github Action runner (6h per execution for public repo, and 2000 mins per month for private repo). Commonly, the quota given to public repo is definitely enough for individual use. If you have special requirements, you can deploy the workflow in your own server, or use a self-hosted Github Action runner, or pay for the exceeded execution time.
194194

195-
## 👯‍♂️ Contribution
196-
Any issue and PR are welcomed! But remember that **each PR should merge to the `dev` branch**.
197195

198196
## 📃 License
199197
Distributed under the AGPLv3 License. See `LICENSE` for detail.

config/base.yaml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@ zotero:
22
user_id: ??? # User ID of your Zotero account.
33
api_key: ??? # An Zotero API key with read access.
44
include_path: null # A list of glob patterns marking the Zotero collections that should be included. Example: ["2026/survey/**","2026/reading-group/**"]
5+
ignore_path: null # A list of glob patterns marking the Zotero collections that should be excluded. Example: ["2026/ignore/**","archive/**"]
56

67
source:
78
arxiv:
@@ -31,7 +32,7 @@ llm:
3132

3233
reranker:
3334
local:
34-
model: jinaai/jina-embeddings-v5-text-nano # The Hugging Face model name of the local embedding model. Example: jinaai/jina-embeddings-v5-text-nano
35+
model: jinaai/jina-embeddings-v5-text-nano-retrieval # The Hugging Face model name of the local embedding model. Example: jinaai/jina-embeddings-v5-text-nano
3536
encode_kwargs:
3637
# The kwargs for the encode method of the local embedding model. Details see [here](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode)
3738
task: retrieval

pyproject.toml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -37,9 +37,9 @@ url = "https://download.pytorch.org/whl/cpu"
3737
explicit = true
3838

3939
[tool.pytest.ini_options]
40-
addopts = "-m 'not ci'"
40+
addopts = "-m 'not slow'"
4141
markers = [
42-
"ci: tests that only run in CI (require external services)",
42+
"slow: tests that are slow (e.g. download models)",
4343
]
4444
filterwarnings = [
4545
"ignore::DeprecationWarning:multiprocessing",
@@ -49,4 +49,5 @@ filterwarnings = [
4949
dev = [
5050
"ipykernel>=7.1.0",
5151
"pytest>=8.4.1",
52+
"pytest-cov>=6.0",
5253
]

src/zotero_arxiv_daily/executor.py

Lines changed: 33 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -13,26 +13,27 @@
1313
from tqdm import tqdm
1414

1515

16-
def normalize_include_path_patterns(include_path: list[str] | ListConfig | None) -> list[str] | None:
17-
if include_path is None:
16+
def normalize_path_patterns(patterns: list[str] | ListConfig | None, config_key: str) -> list[str] | None:
17+
if patterns is None:
1818
return None
1919

20-
if not isinstance(include_path, (list, ListConfig)):
20+
if not isinstance(patterns, (list, ListConfig)):
2121
raise TypeError(
22-
"config.zotero.include_path must be a list of glob patterns or null, "
22+
f"config.zotero.{config_key} must be a list of glob patterns or null, "
2323
'for example ["2026/survey/**"]. Single strings are not supported.'
2424
)
2525

26-
if any(not isinstance(pattern, str) for pattern in include_path):
27-
raise TypeError("config.zotero.include_path must contain only glob pattern strings.")
26+
if any(not isinstance(pattern, str) for pattern in patterns):
27+
raise TypeError(f"config.zotero.{config_key} must contain only glob pattern strings.")
2828

29-
return list(include_path)
29+
return list(patterns)
3030

3131

3232
class Executor:
3333
def __init__(self, config:DictConfig):
3434
self.config = config
35-
self.include_path_patterns = normalize_include_path_patterns(config.zotero.include_path)
35+
self.include_path_patterns = normalize_path_patterns(config.zotero.include_path, "include_path")
36+
self.ignore_path_patterns = normalize_path_patterns(config.zotero.ignore_path, "ignore_path")
3637
self.retrievers = {
3738
source: get_retriever_cls(source)(config) for source in config.executor.source
3839
}
@@ -62,22 +63,31 @@ def get_collection_path(col_key:str) -> str:
6263
) for c in corpus]
6364

6465
def filter_corpus(self, corpus:list[CorpusPaper]) -> list[CorpusPaper]:
65-
if not self.include_path_patterns:
66-
return corpus
67-
new_corpus = []
68-
logger.info(f"Selecting zotero papers matching include_path: {self.include_path_patterns}")
69-
for c in corpus:
70-
match_results = [
71-
glob_match(path, pattern)
72-
for path in c.paths
73-
for pattern in self.include_path_patterns
66+
if self.include_path_patterns:
67+
logger.info(f"Selecting zotero papers matching include_path: {self.include_path_patterns}")
68+
corpus = [
69+
c for c in corpus
70+
if any(
71+
glob_match(path, pattern)
72+
for path in c.paths
73+
for pattern in self.include_path_patterns
74+
)
75+
]
76+
if self.ignore_path_patterns:
77+
logger.info(f"Excluding zotero papers matching ignore_path: {self.ignore_path_patterns}")
78+
corpus = [
79+
c for c in corpus
80+
if not any(
81+
glob_match(path, pattern)
82+
for path in c.paths
83+
for pattern in self.ignore_path_patterns
84+
)
7485
]
75-
if any(match_results):
76-
new_corpus.append(c)
77-
samples = random.sample(new_corpus, min(5, len(new_corpus)))
78-
samples = '\n'.join([c.title + ' - ' + '\n'.join(c.paths) for c in samples])
79-
logger.info(f"Selected {len(new_corpus)} zotero papers:\n{samples}\n...")
80-
return new_corpus
86+
if self.include_path_patterns or self.ignore_path_patterns:
87+
samples = random.sample(corpus, min(5, len(corpus)))
88+
samples = '\n'.join([c.title + ' - ' + '\n'.join(c.paths) for c in samples])
89+
logger.info(f"Selected {len(corpus)} zotero papers:\n{samples}\n...")
90+
return corpus
8191

8292

8393
def run(self):

src/zotero_arxiv_daily/retriever/arxiv_retriever.py

Lines changed: 24 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
import multiprocessing
1010
import os
1111
from queue import Empty
12+
from time import sleep
1213
from typing import Any, Callable, TypeVar
1314
from loguru import logger
1415
import requests
@@ -95,11 +96,11 @@ def _extract_text_from_html_worker(html_url: str) -> str | None:
9596
return text
9697

9798

98-
def _extract_text_from_tar_worker(source_url: str, paper_id: str) -> str | None:
99+
def _extract_text_from_tar_worker(source_url: str, paper_id: str, paper_title: str | None = None) -> str | None:
99100
with TemporaryDirectory() as temp_dir:
100101
path = os.path.join(temp_dir, "paper.tar.gz")
101102
_download_file(source_url, path)
102-
file_contents = extract_tex_code_from_tar(path, paper_id)
103+
file_contents = extract_tex_code_from_tar(path, paper_id, paper_title=paper_title)
103104
if not file_contents or "all" not in file_contents:
104105
raise ValueError("Main tex file not found.")
105106
return file_contents["all"]
@@ -132,11 +133,25 @@ def _retrieve_raw_papers(self) -> list[ArxivResult]:
132133

133134
# Get full information of each paper from arxiv api
134135
bar = tqdm(total=len(all_paper_ids))
136+
max_batch_retries = 5
137+
batch_retry_delay = 30
135138
for i in range(0, len(all_paper_ids), 20):
136139
search = arxiv.Search(id_list=all_paper_ids[i:i + 20])
137-
batch = list(client.results(search))
138-
bar.update(len(batch))
139-
raw_papers.extend(batch)
140+
for attempt in range(max_batch_retries):
141+
try:
142+
batch = list(client.results(search))
143+
bar.update(len(batch))
144+
raw_papers.extend(batch)
145+
break
146+
except arxiv.HTTPError as exc:
147+
if exc.status == 429 and attempt < max_batch_retries - 1:
148+
wait = batch_retry_delay * (attempt + 1)
149+
logger.warning(f"arXiv API 429 on batch {i // 20}, retry {attempt + 1}/{max_batch_retries} in {wait}s")
150+
sleep(wait)
151+
else:
152+
raise
153+
if i + 20 < len(all_paper_ids):
154+
sleep(3)
140155
bar.close()
141156

142157
return raw_papers
@@ -146,11 +161,11 @@ def convert_to_paper(self, raw_paper: ArxivResult) -> Paper:
146161
authors = [a.name for a in raw_paper.authors]
147162
abstract = raw_paper.summary
148163
pdf_url = raw_paper.pdf_url
149-
full_text = extract_text_from_html(raw_paper)
164+
full_text = extract_text_from_tar(raw_paper)
150165
if full_text is None:
151-
full_text = extract_text_from_pdf(raw_paper)
166+
full_text = extract_text_from_html(raw_paper)
152167
if full_text is None:
153-
full_text = extract_text_from_tar(raw_paper)
168+
full_text = extract_text_from_pdf(raw_paper)
154169
return Paper(
155170
source=self.name,
156171
title=title,
@@ -191,7 +206,7 @@ def extract_text_from_tar(paper: ArxivResult) -> str | None:
191206
return None
192207
return _run_with_hard_timeout(
193208
_extract_text_from_tar_worker,
194-
(source_url, paper.entry_id),
209+
(source_url, paper.entry_id, paper.title),
195210
timeout=TAR_EXTRACT_TIMEOUT,
196211
operation="Tar extraction",
197212
paper_title=paper.title,

0 commit comments

Comments
 (0)