You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via LLM, and delivers results by email. Designed to run as a GitHub Actions workflow at zero cost.
6
+
7
+
## Commands
8
+
9
+
```bash
10
+
# Install/sync dependencies
11
+
uv sync
12
+
13
+
# Run the application
14
+
uv run src/zotero_arxiv_daily/main.py
15
+
16
+
# Run tests (excludes slow tests by default)
17
+
uv run pytest
18
+
19
+
# Run all tests including slow ones
20
+
uv run pytest -m ""
21
+
22
+
# Run a single test
23
+
uv run pytest tests/test_utils.py::TestGlobMatch -v
24
+
```
25
+
26
+
No linter or formatter is configured.
27
+
28
+
## Architecture
29
+
30
+
The app is a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
3.**Retrieve new papers** → from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
35
+
4.**Rerank** → weighted embedding similarity to corpus (newer Zotero papers weighted higher)
36
+
5.**Generate TLDRs + affiliations** → OpenAI-compatible LLM API
37
+
6.**Render + send email** → HTML email via SMTP
38
+
39
+
### Plugin Systems
40
+
41
+
**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever("name")` decorator on a `BaseRetriever` subclass. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`. Discovered at runtime via `get_retriever_cls(name)` from a module-level `registered_retrievers` dict.
42
+
43
+
**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker("name")` decorator on a `BaseReranker` subclass. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint). Discovered via `get_reranker_cls(name)`.
44
+
45
+
When adding a new retriever or reranker, follow the existing pattern: create a new file, subclass the base, apply the registration decorator, and implement the abstract methods.
46
+
47
+
### Configuration
48
+
49
+
Uses Hydra + OmegaConf. Config composes from `config/base.yaml` (defaults with `???` placeholders for required values) + `config/custom.yaml` (user overrides). The composition order is defined in `config/default.yaml`. Environment variables are interpolated via `${oc.env:VAR_NAME,default}` syntax. Entry point uses `@hydra.main(config_name="default")`.
50
+
51
+
### Data Classes
52
+
53
+
`Paper` and `CorpusPaper` in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly with `tiktoken`-based token truncation.
54
+
55
+
## Testing Conventions
56
+
57
+
- Tests use **pytest monkeypatch + `SimpleNamespace`** for stubs — not `unittest.mock`.
58
+
- A session-scoped Hydra config in `tests/conftest.py` is deep-copied per test via the `config` fixture.
59
+
- Canned response factories live in `tests/canned_responses.py` (e.g., `make_stub_openai_client()`, `make_stub_zotero_client()`).
60
+
- Tests marked `@pytest.mark.slow` require heavy dependencies (model downloads) and are excluded by default (`addopts = "-m 'not slow'"` in pyproject.toml).
61
+
- Monkeypatching targets the module-level import path (e.g., `"zotero_arxiv_daily.executor.zotero.Zotero"`).
62
+
63
+
## Coding Conventions
64
+
65
+
-**Logging:**`loguru.logger` throughout — never `print()` or stdlib `logging`.
66
+
-**Type hints:** Modern Python 3.10+ syntax (`list[Paper]`, `str | None`).
67
+
-**Constants:** Module-level `UPPER_SNAKE_CASE`.
68
+
-**Private methods:** Prefixed with `_` (e.g., `_retrieve_raw_papers`).
69
+
-**Error handling:** Graceful degradation with try/except and fallback logic; log warnings rather than raising.
70
+
-**Config injection:** All major components receive `DictConfig` at init and store it as `self.config`.
71
+
72
+
## Git Workflow
73
+
74
+
- PRs should target the **`dev`** branch, not `main`.
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
+
5
+
## Project Overview
6
+
7
+
Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via LLM, and delivers results by email. Designed to run as a GitHub Actions workflow at zero cost.
8
+
9
+
## Commands
10
+
11
+
```bash
12
+
# Run the application
13
+
uv run src/zotero_arxiv_daily/main.py
14
+
15
+
# Run tests (excludes slow tests by default)
16
+
uv run pytest
17
+
18
+
# Run all tests including slow ones
19
+
uv run pytest -m ""
20
+
21
+
# Run a single test
22
+
uv run pytest tests/test_utils.py::TestGlobMatch -v
23
+
24
+
# Install/sync dependencies
25
+
uv sync
26
+
```
27
+
28
+
No linter or formatter is configured.
29
+
30
+
## Architecture
31
+
32
+
The app follows a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
33
+
34
+
1.**Fetch Zotero corpus** — retrieves user's library papers via pyzotero API
3.**Retrieve new papers** — fetches from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
37
+
4.**Rerank** — scores candidates by weighted similarity to corpus (newer Zotero papers weighted higher)
38
+
5.**Generate TLDRs + affiliations** — via OpenAI-compatible LLM API
39
+
6.**Render + send email** — HTML email via SMTP
40
+
41
+
### Plugin Systems
42
+
43
+
**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever` decorator, discovered by `get_retriever_cls()`. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`.
44
+
45
+
**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker` decorator, discovered by `get_reranker_cls()`. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint).
46
+
47
+
### Configuration
48
+
49
+
Uses Hydra + OmegaConf. Config is composed from `config/base.yaml` (defaults) + `config/custom.yaml` (user overrides). Environment variables are interpolated via `${oc.env:VAR_NAME,default}` syntax. Entry point uses `@hydra.main`.
50
+
51
+
### Data Classes
52
+
53
+
`Paper` and `CorpusPaper` in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly.
54
+
55
+
## Testing
56
+
57
+
Tests marked `@pytest.mark.slow` require heavy dependencies (e.g., sentence-transformers model download) and are skipped locally by default (`addopts = "-m 'not slow'"` in pyproject.toml). All other tests run with pure Python stubs (no Docker containers needed).
58
+
59
+
```bash
60
+
# Run tests (excludes slow tests)
61
+
uv run pytest
62
+
63
+
# Run all tests including slow ones
64
+
uv run pytest -m ""
65
+
66
+
# Run with coverage
67
+
uv run pytest --cov=src/zotero_arxiv_daily --cov-report=term-missing
68
+
```
69
+
70
+
## gstack
71
+
72
+
Use the `/browse` skill from gstack for all web browsing. Never use `mcp__claude-in-chrome__*` tools.
Copy file name to clipboardExpand all lines: README.md
-2Lines changed: 0 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -192,8 +192,6 @@ This project is in active development. You can subscribe this repo via `Watch` s
192
192
- The recommendation algorithm is very simple, it may not accurately reflect your interest. Welcome better ideas for improving the algorithm!
193
193
- High `MAX_PAPER_NUM` can lead the execution time exceed the limitation of Github Action runner (6h per execution for public repo, and 2000 mins per month for private repo). Commonly, the quota given to public repo is definitely enough for individual use. If you have special requirements, you can deploy the workflow in your own server, or use a self-hosted Github Action runner, or pay for the exceeded execution time.
194
194
195
-
## 👯♂️ Contribution
196
-
Any issue and PR are welcomed! But remember that **each PR should merge to the `dev` branch**.
197
195
198
196
## 📃 License
199
197
Distributed under the AGPLv3 License. See `LICENSE` for detail.
Copy file name to clipboardExpand all lines: config/base.yaml
+2-1Lines changed: 2 additions & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -2,6 +2,7 @@ zotero:
2
2
user_id: ??? # User ID of your Zotero account.
3
3
api_key: ??? # An Zotero API key with read access.
4
4
include_path: null # A list of glob patterns marking the Zotero collections that should be included. Example: ["2026/survey/**","2026/reading-group/**"]
5
+
ignore_path: null # A list of glob patterns marking the Zotero collections that should be excluded. Example: ["2026/ignore/**","archive/**"]
5
6
6
7
source:
7
8
arxiv:
@@ -31,7 +32,7 @@ llm:
31
32
32
33
reranker:
33
34
local:
34
-
model: jinaai/jina-embeddings-v5-text-nano # The Hugging Face model name of the local embedding model. Example: jinaai/jina-embeddings-v5-text-nano
35
+
model: jinaai/jina-embeddings-v5-text-nano-retrieval# The Hugging Face model name of the local embedding model. Example: jinaai/jina-embeddings-v5-text-nano
35
36
encode_kwargs:
36
37
# The kwargs for the encode method of the local embedding model. Details see [here](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode)
0 commit comments