TideDra
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
Lines changed: 74 additions & 0 deletions
diff --git a/.github/keep-alive.txt b/.github/keep-alive.txt
Lines changed: 1 addition & 1 deletion
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 1 addition & 19 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 81 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 0 additions & 2 deletions
diff --git a/config/base.yaml b/config/base.yaml
Lines changed: 2 additions & 1 deletion
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 3 additions & 2 deletions
diff --git a/src/zotero_arxiv_daily/executor.py b/src/zotero_arxiv_daily/executor.py
Lines changed: 33 additions & 23 deletions
diff --git a/src/zotero_arxiv_daily/retriever/arxiv_retriever.py b/src/zotero_arxiv_daily/retriever/arxiv_retriever.py
Lines changed: 24 additions & 9 deletions
@@ -0,0 +1,74 @@
+# Copilot Instructions
+
+## Project Overview
+
+Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via LLM, and delivers results by email. Designed to run as a GitHub Actions workflow at zero cost.
+
+## Commands
+
+```bash
+# Install/sync dependencies
+uv sync
+
+# Run the application
+uv run src/zotero_arxiv_daily/main.py
+
+# Run tests (excludes slow tests by default)
+uv run pytest
+
+# Run all tests including slow ones
+uv run pytest -m ""
+
+# Run a single test
+uv run pytest tests/test_utils.py::TestGlobMatch -v
+```
+
+No linter or formatter is configured.
+
+## Architecture
+
+The app is a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
+
+1. **Fetch Zotero corpus** → pyzotero API
+2. **Filter corpus** → `include_path` / `ignore_path` glob patterns
+3. **Retrieve new papers** → from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
+4. **Rerank** → weighted embedding similarity to corpus (newer Zotero papers weighted higher)
+5. **Generate TLDRs + affiliations** → OpenAI-compatible LLM API
+6. **Render + send email** → HTML email via SMTP
+
+### Plugin Systems
+
+**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever("name")` decorator on a `BaseRetriever` subclass. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`. Discovered at runtime via `get_retriever_cls(name)` from a module-level `registered_retrievers` dict.
+
+**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker("name")` decorator on a `BaseReranker` subclass. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint). Discovered via `get_reranker_cls(name)`.
+
+When adding a new retriever or reranker, follow the existing pattern: create a new file, subclass the base, apply the registration decorator, and implement the abstract methods.
+
+### Configuration
+
+Uses Hydra + OmegaConf. Config composes from `config/base.yaml` (defaults with `???` placeholders for required values) + `config/custom.yaml` (user overrides). The composition order is defined in `config/default.yaml`. Environment variables are interpolated via `${oc.env:VAR_NAME,default}` syntax. Entry point uses `@hydra.main(config_name="default")`.
+
+### Data Classes
+
+`Paper` and `CorpusPaper` in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly with `tiktoken`-based token truncation.
+
+## Testing Conventions
+
+- Tests use **pytest monkeypatch + `SimpleNamespace`** for stubs — not `unittest.mock`.
+- A session-scoped Hydra config in `tests/conftest.py` is deep-copied per test via the `config` fixture.
+- Canned response factories live in `tests/canned_responses.py` (e.g., `make_stub_openai_client()`, `make_stub_zotero_client()`).
+- Tests marked `@pytest.mark.slow` require heavy dependencies (model downloads) and are excluded by default (`addopts = "-m 'not slow'"` in pyproject.toml).
+- Monkeypatching targets the module-level import path (e.g., `"zotero_arxiv_daily.executor.zotero.Zotero"`).
+
+## Coding Conventions
+
+- **Logging:** `loguru.logger` throughout — never `print()` or stdlib `logging`.
+- **Type hints:** Modern Python 3.10+ syntax (`list[Paper]`, `str | None`).
+- **Constants:** Module-level `UPPER_SNAKE_CASE`.
+- **Private methods:** Prefixed with `_` (e.g., `_retrieve_raw_papers`).
+- **Error handling:** Graceful degradation with try/except and fallback logic; log warnings rather than raising.
+- **Config injection:** All major components receive `DictConfig` at init and store it as `self.config`.
+
+## Git Workflow
+
+- PRs should target the **`dev`** branch, not `main`.
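The decorator-based plugin registration described in the instructions above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `register_retriever`, `get_retriever_cls`, `registered_retrievers`, `BaseRetriever`, `_retrieve_raw_papers`, and `convert_to_paper` come from the text, while their bodies, the plain `dict` config, the dict-based paper shape, and `DummyRetriever` are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class BaseRetriever(ABC):
    """Assumed shape of the base class; the real one may differ."""

    name: str = ""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def _retrieve_raw_papers(self) -> list[dict]: ...

    @abstractmethod
    def convert_to_paper(self, raw_paper: dict) -> dict: ...


# Module-level registry mapping a source name to its retriever class.
registered_retrievers: dict[str, type[BaseRetriever]] = {}


def register_retriever(name: str):
    """Class decorator that records the class under `name`."""
    def decorator(cls: type[BaseRetriever]) -> type[BaseRetriever]:
        registered_retrievers[name] = cls
        cls.name = name
        return cls
    return decorator


def get_retriever_cls(name: str) -> type[BaseRetriever]:
    if name not in registered_retrievers:
        raise ValueError(f"Retriever {name!r} is not registered")
    return registered_retrievers[name]


# Hypothetical retriever used only to exercise the registry.
@register_retriever("dummy")
class DummyRetriever(BaseRetriever):
    def _retrieve_raw_papers(self) -> list[dict]:
        return [{"title": "An example paper"}]

    def convert_to_paper(self, raw_paper: dict) -> dict:
        return {"source": self.name, "title": raw_paper["title"]}


retriever = get_retriever_cls("dummy")({})
papers = [retriever.convert_to_paper(p) for p in retriever._retrieve_raw_papers()]
print(papers)
```

Keeping the registry in a module-level dict means a new source only needs its own file and a decorator; the lookup code never changes.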
@@ -1,2 +1,2 @@
-Last run: 2026-05-01 01:29:37 UTC
+Last run: 2026-05-01 01:01:08 UTC
 This file is automatically updated to keep the repository active and prevent GitHub Actions from disabling scheduled workflows.
@@ -12,31 +12,13 @@ on:
 jobs:
   pytest:
     runs-on: ubuntu-latest
-    services:
-      mailhog:
-        image: mailhog/mailhog:latest
-        ports:
-          - 1025:1025  # SMTP
-      openai:
-        image: tidedra/mock_openai:latest
-        ports:
-          - 30000:30000
     steps:
       - name: Checkout
         uses: actions/checkout@v6
 
       - name: Setup uv
         uses: astral-sh/setup-uv@v7.1.4
 
-
       - name: Run Pytest
-        env:
-          ZOTERO_ID: "0"
-          ZOTERO_KEY: "AbCdEfGhIjKlMnOpQrStUvWx"
-          SENDER: "test@example.com"
-          RECEIVER: "test@example.com"
-          SENDER_PASSWORD: "test"
-          OPENAI_API_KEY: "sk-xxx"
-          OPENAI_API_BASE: "http://openai:30000/v1"
         run: |
-          uv run pytest -m ""
+          uv run pytest -m "" --cov=src/zotero_arxiv_daily --cov-report=term-missing
@@ -0,0 +1,81 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via LLM, and delivers results by email. Designed to run as a GitHub Actions workflow at zero cost.
+
+## Commands
+
+```bash
+# Run the application
+uv run src/zotero_arxiv_daily/main.py
+
+# Run tests (excludes slow tests by default)
+uv run pytest
+
+# Run all tests including slow ones
+uv run pytest -m ""
+
+# Run a single test
+uv run pytest tests/test_utils.py::TestGlobMatch -v
+
+# Install/sync dependencies
+uv sync
+```
+
+No linter or formatter is configured.
+
+## Architecture
+
+The app follows a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
+
+1. **Fetch Zotero corpus** — retrieves user's library papers via pyzotero API
+2. **Filter corpus** — applies `include_path` and `ignore_path` glob patterns to select relevant collections
+3. **Retrieve new papers** — fetches from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
+4. **Rerank** — scores candidates by weighted similarity to corpus (newer Zotero papers weighted higher)
+5. **Generate TLDRs + affiliations** — via OpenAI-compatible LLM API
+6. **Render + send email** — HTML email via SMTP
+
+### Plugin Systems
+
+**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever` decorator, discovered by `get_retriever_cls()`. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`.
+
+**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker` decorator, discovered by `get_reranker_cls()`. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint).
+
+### Configuration
+
+Uses Hydra + OmegaConf. Config is composed from `config/base.yaml` (defaults) + `config/custom.yaml` (user overrides). Environment variables are interpolated via `${oc.env:VAR_NAME,default}` syntax. Entry point uses `@hydra.main`.
+
+### Data Classes
+
+`Paper` and `CorpusPaper` in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly.
+
+## Testing
+
+Tests marked `@pytest.mark.slow` require heavy dependencies (e.g., sentence-transformers model download) and are skipped locally by default (`addopts = "-m 'not slow'"` in pyproject.toml). All other tests run with pure Python stubs (no Docker containers needed).
+
+```bash
+# Run tests (excludes slow tests)
+uv run pytest
+
+# Run all tests including slow ones
+uv run pytest -m ""
+
+# Run with coverage
+uv run pytest --cov=src/zotero_arxiv_daily --cov-report=term-missing
+```
+
+## gstack
+
+Use the `/browse` skill from gstack for all web browsing. Never use `mcp__claude-in-chrome__*` tools.
+
+Available skills: `/office-hours`, `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, `/design-consultation`, `/design-shotgun`, `/design-html`, `/review`, `/ship`, `/land-and-deploy`, `/canary`, `/benchmark`, `/browse`, `/connect-chrome`, `/qa`, `/qa-only`, `/design-review`, `/setup-browser-cookies`, `/setup-deploy`, `/retro`, `/investigate`, `/document-release`, `/codex`, `/cso`, `/autoplan`, `/plan-devex-review`, `/devex-review`, `/careful`, `/freeze`, `/guard`, `/unfreeze`, `/gstack-upgrade`, `/learn`.
+
+If gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.
+
+## Git Workflow
+
+- PRs should target the `dev` branch, not `main`
+- Current development branch: `dev`
@@ -192,8 +192,6 @@ This project is in active development. You can subscribe this repo via `Watch` s
 - The recommendation algorithm is very simple and may not accurately reflect your interests. Better ideas for improving the algorithm are welcome!
 - A high `MAX_PAPER_NUM` can cause the execution time to exceed the limit of the GitHub Actions runner (6h per execution for public repos, and 2000 minutes per month for private repos). The quota given to public repos is usually more than enough for individual use. If you have special requirements, you can deploy the workflow on your own server, use a self-hosted GitHub Actions runner, or pay for the extra execution time.
 
-## 👯‍♂️ Contribution
-Any issue and PR are welcomed! But remember that **each PR should merge to the `dev` branch**.
 
 ## 📃 License
 Distributed under the AGPLv3 License. See `LICENSE` for detail.
 
@@ -2,6 +2,7 @@ zotero:
   user_id: ??? # User ID of your Zotero account.
   api_key: ??? # A Zotero API key with read access.
   include_path: null # A list of glob patterns marking the Zotero collections that should be included. Example: ["2026/survey/**","2026/reading-group/**"]
+  ignore_path: null # A list of glob patterns marking the Zotero collections that should be excluded. Example: ["2026/ignore/**","archive/**"]
 
 source:
   arxiv:
@@ -31,7 +32,7 @@ llm:
 
 reranker:
   local:
-    model: jinaai/jina-embeddings-v5-text-nano # The Hugging Face model name of the local embedding model. Example: jinaai/jina-embeddings-v5-text-nano
+    model: jinaai/jina-embeddings-v5-text-nano-retrieval # The Hugging Face model name of the local embedding model. Example: jinaai/jina-embeddings-v5-text-nano
     encode_kwargs:
     # The kwargs for the encode method of the local embedding model. Details see [here](https://www.sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode)
       task: retrieval
 
@@ -37,9 +37,9 @@ url = "https://download.pytorch.org/whl/cpu"
 explicit = true
 
 [tool.pytest.ini_options]
-addopts = "-m 'not ci'"
+addopts = "-m 'not slow'"
 markers = [
-    "ci: tests that only run in CI (require external services)",
+    "slow: tests that are slow (e.g. download models)",
 ]
 filterwarnings = [
     "ignore::DeprecationWarning:multiprocessing",
@@ -49,4 +49,5 @@ filterwarnings = [
 dev = [
     "ipykernel>=7.1.0",
     "pytest>=8.4.1",
+    "pytest-cov>=6.0",
 ]
@@ -13,26 +13,27 @@
 from tqdm import tqdm
 
 
-def normalize_include_path_patterns(include_path: list[str] | ListConfig | None) -> list[str] | None:
-    if include_path is None:
+def normalize_path_patterns(patterns: list[str] | ListConfig | None, config_key: str) -> list[str] | None:
+    if patterns is None:
         return None
 
-    if not isinstance(include_path, (list, ListConfig)):
+    if not isinstance(patterns, (list, ListConfig)):
         raise TypeError(
-            "config.zotero.include_path must be a list of glob patterns or null, "
+            f"config.zotero.{config_key} must be a list of glob patterns or null, "
             'for example ["2026/survey/**"]. Single strings are not supported.'
         )
 
-    if any(not isinstance(pattern, str) for pattern in include_path):
-        raise TypeError("config.zotero.include_path must contain only glob pattern strings.")
+    if any(not isinstance(pattern, str) for pattern in patterns):
+        raise TypeError(f"config.zotero.{config_key} must contain only glob pattern strings.")
 
-    return list(include_path)
+    return list(patterns)
 
 
 class Executor:
     def __init__(self, config:DictConfig):
         self.config = config
-        self.include_path_patterns = normalize_include_path_patterns(config.zotero.include_path)
+        self.include_path_patterns = normalize_path_patterns(config.zotero.include_path, "include_path")
+        self.ignore_path_patterns = normalize_path_patterns(config.zotero.ignore_path, "ignore_path")
         self.retrievers = {
             source: get_retriever_cls(source)(config) for source in config.executor.source
         }
@@ -62,22 +63,31 @@ def get_collection_path(col_key:str) -> str:
         ) for c in corpus]
 
     def filter_corpus(self, corpus:list[CorpusPaper]) -> list[CorpusPaper]:
-        if not self.include_path_patterns:
-            return corpus
-        new_corpus = []
-        logger.info(f"Selecting zotero papers matching include_path: {self.include_path_patterns}")
-        for c in corpus:
-            match_results = [
-                glob_match(path, pattern)
-                for path in c.paths
-                for pattern in self.include_path_patterns
+        if self.include_path_patterns:
+            logger.info(f"Selecting zotero papers matching include_path: {self.include_path_patterns}")
+            corpus = [
+                c for c in corpus
+                if any(
+                    glob_match(path, pattern)
+                    for path in c.paths
+                    for pattern in self.include_path_patterns
+                )
+            ]
+        if self.ignore_path_patterns:
+            logger.info(f"Excluding zotero papers matching ignore_path: {self.ignore_path_patterns}")
+            corpus = [
+                c for c in corpus
+                if not any(
+                    glob_match(path, pattern)
+                    for path in c.paths
+                    for pattern in self.ignore_path_patterns
+                )
             ]
-            if any(match_results):
-                new_corpus.append(c)
-        samples = random.sample(new_corpus, min(5, len(new_corpus)))
-        samples = '\n'.join([c.title + ' - ' + '\n'.join(c.paths) for c in samples])
-        logger.info(f"Selected {len(new_corpus)} zotero papers:\n{samples}\n...")
-        return new_corpus
+        if self.include_path_patterns or self.ignore_path_patterns:
+            samples = random.sample(corpus, min(5, len(corpus)))
+            samples = '\n'.join([c.title + ' - ' + '\n'.join(c.paths) for c in samples])
+            logger.info(f"Selected {len(corpus)} zotero papers:\n{samples}\n...")
+        return corpus
 
 
     def run(self):
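The include-then-ignore order in the new `filter_corpus` has one consequence worth spelling out: a paper filed in several collections is dropped as soon as any one of its paths matches an `ignore_path` pattern, even if another path matched `include_path`. The sketch below reproduces that logic under stated assumptions: `CorpusPaper` is reduced to two fields, and `glob_match` is replaced by a stand-in built on `fnmatch.fnmatchcase`, which may not match the project's real helper in every edge case.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class CorpusPaper:
    # Reduced to the two fields the filter needs; the real class has more.
    title: str
    paths: list[str]


def glob_match(path: str, pattern: str) -> bool:
    # Stand-in for the project's glob_match. In fnmatch, "*" also matches "/".
    return fnmatchcase(path, pattern)


def filter_corpus(
    corpus: list[CorpusPaper],
    include: list[str] | None = None,
    ignore: list[str] | None = None,
) -> list[CorpusPaper]:
    if include:
        # Keep papers with at least one path matching an include pattern.
        corpus = [
            c for c in corpus
            if any(glob_match(p, pat) for p in c.paths for pat in include)
        ]
    if ignore:
        # Then drop papers with any path matching an ignore pattern.
        corpus = [
            c for c in corpus
            if not any(glob_match(p, pat) for p in c.paths for pat in ignore)
        ]
    return corpus


corpus = [
    CorpusPaper("A", ["2026/survey/llm"]),
    CorpusPaper("B", ["2026/survey/llm", "archive/old"]),
    CorpusPaper("C", ["2026/misc"]),
]
kept = filter_corpus(corpus, include=["2026/**"], ignore=["archive/**"])
print([c.title for c in kept])  # ['A', 'C']
```

Paper `B` passes the include step through its `2026/survey/llm` path but is removed by the ignore step because of `archive/old`.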
 
@@ -9,6 +9,7 @@
 import multiprocessing
 import os
 from queue import Empty
+from time import sleep
 from typing import Any, Callable, TypeVar
 from loguru import logger
 import requests
@@ -95,11 +96,11 @@ def _extract_text_from_html_worker(html_url: str) -> str | None:
     return text
 
 
-def _extract_text_from_tar_worker(source_url: str, paper_id: str) -> str | None:
+def _extract_text_from_tar_worker(source_url: str, paper_id: str, paper_title: str | None = None) -> str | None:
     with TemporaryDirectory() as temp_dir:
         path = os.path.join(temp_dir, "paper.tar.gz")
         _download_file(source_url, path)
-        file_contents = extract_tex_code_from_tar(path, paper_id)
+        file_contents = extract_tex_code_from_tar(path, paper_id, paper_title=paper_title)
         if not file_contents or "all" not in file_contents:
             raise ValueError("Main tex file not found.")
         return file_contents["all"]
@@ -132,11 +133,25 @@ def _retrieve_raw_papers(self) -> list[ArxivResult]:
 
         # Get full information of each paper from arxiv api
         bar = tqdm(total=len(all_paper_ids))
+        max_batch_retries = 5
+        batch_retry_delay = 30
         for i in range(0, len(all_paper_ids), 20):
             search = arxiv.Search(id_list=all_paper_ids[i:i + 20])
-            batch = list(client.results(search))
-            bar.update(len(batch))
-            raw_papers.extend(batch)
+            for attempt in range(max_batch_retries):
+                try:
+                    batch = list(client.results(search))
+                    bar.update(len(batch))
+                    raw_papers.extend(batch)
+                    break
+                except arxiv.HTTPError as exc:
+                    if exc.status == 429 and attempt < max_batch_retries - 1:
+                        wait = batch_retry_delay * (attempt + 1)
+                        logger.warning(f"arXiv API 429 on batch {i // 20}, retry {attempt + 1}/{max_batch_retries} in {wait}s")
+                        sleep(wait)
+                    else:
+                        raise
+            if i + 20 < len(all_paper_ids):
+                sleep(3)
         bar.close()
 
         return raw_papers
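The retry loop added in this hunk waits `batch_retry_delay * (attempt + 1)` seconds after each 429, so the delays grow linearly (30s, 60s, 90s, 120s) and the last failed attempt re-raises. The same pattern can be isolated into a small helper, sketched here with hypothetical names: `RateLimitedError` stands in for `arxiv.HTTPError` with status 429, and `call_with_linear_backoff` and its injectable `sleep` parameter are illustrative and not part of the PR.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def call_with_linear_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitedError with delays of 1x, 2x, 3x... base_delay."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (attempt + 1))
    raise ValueError("max_retries must be at least 1")


# Demo: fail twice, then succeed; record the waits instead of sleeping.
calls = {"n": 0}
waits: list[float] = []


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitedError
    return "batch"


result = call_with_linear_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)  # batch [30.0, 60.0]
```

Passing `sleep` as a parameter lets a test observe the delays without actually waiting, which fits the stub-based testing style the PR documents.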
@@ -146,11 +161,11 @@ def convert_to_paper(self, raw_paper: ArxivResult) -> Paper:
         authors = [a.name for a in raw_paper.authors]
         abstract = raw_paper.summary
         pdf_url = raw_paper.pdf_url
-        full_text = extract_text_from_html(raw_paper)
+        full_text = extract_text_from_tar(raw_paper)
         if full_text is None:
-            full_text = extract_text_from_pdf(raw_paper)
+            full_text = extract_text_from_html(raw_paper)
         if full_text is None:
-            full_text = extract_text_from_tar(raw_paper)
+            full_text = extract_text_from_pdf(raw_paper)
         return Paper(
             source=self.name,
             title=title,
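This hunk reorders the full-text fallback chain to tar source first, then HTML, then PDF, with each step tried only when the previous one returned `None`. A generic version of that chain is sketched below; `first_available_text` and the three stub extractors are hypothetical and only illustrate the control flow, not the real extraction functions.

```python
from typing import Callable, Optional


def first_available_text(
    extractors: list[Callable[[], Optional[str]]],
) -> Optional[str]:
    """Return the first non-None result, trying extractors in order."""
    for extract in extractors:
        text = extract()
        if text is not None:
            return text
    return None


called: list[str] = []


def from_tar() -> Optional[str]:
    called.append("tar")
    return None  # e.g. no LaTeX source could be extracted


def from_html() -> Optional[str]:
    called.append("html")
    return "text from HTML"


def from_pdf() -> Optional[str]:
    called.append("pdf")
    return "text from PDF"


full_text = first_available_text([from_tar, from_html, from_pdf])
print(full_text, called)  # text from HTML ['tar', 'html']
```

Because the chain stops at the first success, the PDF extractor is never invoked once the HTML step yields text.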
@@ -191,7 +206,7 @@ def extract_text_from_tar(paper: ArxivResult) -> str | None:
         return None
     return _run_with_hard_timeout(
         _extract_text_from_tar_worker,
-        (source_url, paper.entry_id),
+        (source_url, paper.entry_id, paper.title),
         timeout=TAR_EXTRACT_TIMEOUT,
         operation="Tar extraction",
         paper_title=paper.title,