Merge pull request #237 from TideDra/fix/prioritize_tex

TideDra · web-flow · commit b2f6460349e8 · 2026-04-14T20:02:26.000+08:00
Use tex extraction as first priority
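
The reordering in `convert_to_paper` amounts to a priority-ordered fallback chain: try the TeX source first, then HTML, then PDF, and stop at the first extractor that returns text. The sketch below illustrates that pattern only. `first_successful` and the three stub extractors are hypothetical names, not code from this PR.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable


def first_successful(extractors: Iterable[Callable[[], str | None]]) -> str | None:
    """Return the first non-None result, trying extractors in priority order."""
    for extract in extractors:
        result = extract()
        if result is not None:
            return result
    return None


# Hypothetical stand-ins for the tar, HTML, and PDF extractors.
def from_tar() -> str | None:
    return None  # pretend the TeX source is unavailable for this paper


def from_html() -> str | None:
    return "text from html"


def from_pdf() -> str | None:
    return "text from pdf"


# Order encodes priority: tar (TeX) first, then HTML, then PDF.
full_text = first_successful([from_tar, from_html, from_pdf])
print(full_text)  # text from html
```

Because the tar stub returns `None`, the chain falls through to the HTML stub and never calls the PDF one, which mirrors the `if full_text is None` checks in the diff.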
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -0,0 +1,74 @@
+# Copilot Instructions
+
+## Project Overview
+
+Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via an LLM, and delivers the results by email. It is designed to run as a GitHub Actions workflow at zero cost.
+
+## Commands
+
+```bash
+# Install/sync dependencies
+uv sync
+
+# Run the application
+uv run src/zotero_arxiv_daily/main.py
+
+# Run tests (excludes slow tests by default)
+uv run pytest
+
+# Run all tests including slow ones
+uv run pytest -m ""
+
+# Run a single test
+uv run pytest tests/test_utils.py::TestGlobMatch -v
+```
+
+No linter or formatter is configured.
+
+## Architecture
+
+The app is a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
+
+1. **Fetch Zotero corpus** → pyzotero API
+2. **Filter corpus** → `include_path` / `ignore_path` glob patterns
+3. **Retrieve new papers** → from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
+4. **Rerank** → weighted embedding similarity to corpus (newer Zotero papers weighted higher)
+5. **Generate TLDRs + affiliations** → OpenAI-compatible LLM API
+6. **Render + send email** → HTML email via SMTP
+
+### Plugin Systems
+
+**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever("name")` decorator on a `BaseRetriever` subclass. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`. Discovered at runtime via `get_retriever_cls(name)` from a module-level `registered_retrievers` dict.
+
+**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker("name")` decorator on a `BaseReranker` subclass. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint). Discovered via `get_reranker_cls(name)`.
+
+When adding a new retriever or reranker, follow the existing pattern: create a new file, subclass the base, apply the registration decorator, and implement the abstract methods.
+
+### Configuration
+
+Configuration uses Hydra and OmegaConf. The config is composed from `config/base.yaml` (defaults, with `???` placeholders for required values) and `config/custom.yaml` (user overrides); the composition order is defined in `config/default.yaml`. Environment variables are interpolated with the `${oc.env:VAR_NAME,default}` syntax. The entry point uses `@hydra.main(config_name="default")`.
+
+### Data Classes
+
+`Paper` and `CorpusPaper` are defined in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly and truncate input by token count using `tiktoken`.
+
+## Testing Conventions
+
+- Tests use **pytest monkeypatch + `SimpleNamespace`** for stubs — not `unittest.mock`.
+- A session-scoped Hydra config in `tests/conftest.py` is deep-copied per test via the `config` fixture.
+- Canned response factories live in `tests/canned_responses.py` (e.g., `make_stub_openai_client()`, `make_stub_zotero_client()`).
+- Tests marked `@pytest.mark.slow` require heavy dependencies (model downloads) and are excluded by default (`addopts = "-m 'not slow'"` in `pyproject.toml`).
+- Monkeypatching targets the module-level import path (e.g., `"zotero_arxiv_daily.executor.zotero.Zotero"`).
+
+## Coding Conventions
+
+- **Logging:** `loguru.logger` throughout — never `print()` or stdlib `logging`.
+- **Type hints:** Modern Python 3.10+ syntax (`list[Paper]`, `str | None`).
+- **Constants:** Module-level `UPPER_SNAKE_CASE`.
+- **Private methods:** Prefixed with `_` (e.g., `_retrieve_raw_papers`).
+- **Error handling:** Graceful degradation with try/except and fallback logic; log warnings rather than raising.
+- **Config injection:** All major components receive `DictConfig` at init and store it as `self.config`.
+
+## Git Workflow
+
+- PRs should target the **`dev`** branch, not `main`.
diff --git a/src/zotero_arxiv_daily/retriever/arxiv_retriever.py b/src/zotero_arxiv_daily/retriever/arxiv_retriever.py
@@ -95,11 +95,11 @@ def _extract_text_from_html_worker(html_url: str) -> str | None:
     return text
 
 
-def _extract_text_from_tar_worker(source_url: str, paper_id: str) -> str | None:
+def _extract_text_from_tar_worker(source_url: str, paper_id: str, paper_title: str | None = None) -> str | None:
     with TemporaryDirectory() as temp_dir:
         path = os.path.join(temp_dir, "paper.tar.gz")
         _download_file(source_url, path)
-        file_contents = extract_tex_code_from_tar(path, paper_id)
+        file_contents = extract_tex_code_from_tar(path, paper_id, paper_title=paper_title)
         if not file_contents or "all" not in file_contents:
             raise ValueError("Main tex file not found.")
         return file_contents["all"]
@@ -146,11 +146,11 @@ def convert_to_paper(self, raw_paper: ArxivResult) -> Paper:
         authors = [a.name for a in raw_paper.authors]
         abstract = raw_paper.summary
         pdf_url = raw_paper.pdf_url
-        full_text = extract_text_from_html(raw_paper)
+        full_text = extract_text_from_tar(raw_paper)
         if full_text is None:
-            full_text = extract_text_from_pdf(raw_paper)
+            full_text = extract_text_from_html(raw_paper)
         if full_text is None:
-            full_text = extract_text_from_tar(raw_paper)
+            full_text = extract_text_from_pdf(raw_paper)
         return Paper(
             source=self.name,
             title=title,
@@ -191,7 +191,7 @@ def extract_text_from_tar(paper: ArxivResult) -> str | None:
         return None
     return _run_with_hard_timeout(
         _extract_text_from_tar_worker,
-        (source_url, paper.entry_id),
+        (source_url, paper.entry_id, paper.title),
         timeout=TAR_EXTRACT_TIMEOUT,
         operation="Tar extraction",
         paper_title=paper.title,
diff --git a/src/zotero_arxiv_daily/utils.py b/src/zotero_arxiv_daily/utils.py
@@ -1,7 +1,9 @@
 import tarfile
 import re
 import glob
+import math
 import smtplib
+from collections import Counter
 from email.header import Header
 from email.mime.text import MIMEText
 from email.utils import parseaddr, formataddr
@@ -15,7 +17,43 @@
 
 import pymupdf4llm  # noqa: E402
 
-def extract_tex_code_from_tar(file_path:str, paper_id:str) -> dict[str,str]:
+_TOKEN_RE = re.compile(r'[a-zA-Z0-9]+')
+
+def _tokenize(text: str) -> list[str]:
+    return [t.lower() for t in _TOKEN_RE.findall(text)]
+
+
+def _bm25_pick(query: str, candidates: dict[str, str], k1: float = 1.5, b: float = 0.75) -> str:
+    """Return the candidate key whose content best matches *query* by BM25."""
+    query_tokens = _tokenize(query)
+    if not query_tokens:
+        return next(iter(candidates))
+
+    doc_tokens = {name: _tokenize(content) for name, content in candidates.items()}
+    N = len(doc_tokens)
+    avgdl = sum(len(t) for t in doc_tokens.values()) / max(N, 1)
+
+    df: Counter[str] = Counter()
+    for tokens in doc_tokens.values():
+        df.update(set(tokens))
+
+    best_name, best_score = None, -1.0
+    for name, tokens in doc_tokens.items():
+        tf = Counter(tokens)
+        dl = len(tokens)
+        score = 0.0
+        for q in query_tokens:
+            n_q = df.get(q, 0)
+            idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
+            f_q = tf.get(q, 0)
+            score += idf * (f_q * (k1 + 1)) / (f_q + k1 * (1 - b + b * dl / max(avgdl, 1)))
+        if score > best_score:
+            best_score = score
+            best_name = name
+    return best_name
+
+
+def extract_tex_code_from_tar(file_path:str, paper_id:str, paper_title:str | None = None) -> dict[str,str]:
     try:
         tar = tarfile.open(file_path)
     except tarfile.ReadError:
@@ -48,25 +86,34 @@ def extract_tex_code_from_tar(file_path:str, paper_id:str) -> dict[str,str]:
 
     if main_tex is None:
         logger.debug(f"Trying to choose tex file containing the document block as main tex file of {paper_id}")
-    #read all tex files
+
     file_contents = {}
+    doc_block_candidates: list[str] = []
     for t in tex_files:
         f = tar.extractfile(t)
         content = f.read().decode('utf-8',errors='ignore')
-        #remove comments
         content = re.sub(r'%.*\n', '\n', content)
         content = re.sub(r'\\begin{comment}.*?\\end{comment}', '', content, flags=re.DOTALL)
         content = re.sub(r'\\iffalse.*?\\fi', '', content, flags=re.DOTALL)
-        #remove redundant \n
         content = re.sub(r'\n+', '\n', content)
         content = re.sub(r'\\\\', '', content)
-        #remove consecutive spaces
         content = re.sub(r'[ \t\r\f]{3,}', ' ', content)
-        if main_tex is None and re.search(r'\\begin\{document\}', content) and not any(w in t for w in ['example', 'sample']):
-            main_tex = t
-            logger.debug(f"Choose {t} as main tex file of {paper_id}")
+        if main_tex is None and re.search(r'\\begin\{document\}', content) and not any(w in t for w in ['example', 'sample', 'template']):
+            doc_block_candidates.append(t)
         file_contents[t] = content
-    
+
+    if main_tex is None:
+        if len(doc_block_candidates) == 1:
+            main_tex = doc_block_candidates[0]
+            logger.debug(f"Choose {main_tex} as main tex file of {paper_id}")
+        elif len(doc_block_candidates) > 1:
+            if paper_title:
+                main_tex = _bm25_pick(paper_title, {c: file_contents[c] for c in doc_block_candidates})
+                logger.debug(f"Multiple document blocks found in {paper_id}; BM25 selected {main_tex} from {doc_block_candidates}")
+            else:
+                main_tex = doc_block_candidates[0]
+                logger.debug(f"Multiple document blocks found in {paper_id}; no title provided, using first candidate {main_tex}")
+
     if main_tex is not None:
         main_source:str = file_contents[main_tex]
         #find and replace all included sub-files
diff --git a/tests/test_utils.py b/tests/test_utils.py
@@ -6,7 +6,7 @@
 
 import pytest
 
-from zotero_arxiv_daily.utils import glob_match, send_email, extract_tex_code_from_tar
+from zotero_arxiv_daily.utils import glob_match, send_email, extract_tex_code_from_tar, _bm25_pick
 from tests.canned_responses import make_stub_smtp
 
 
@@ -248,3 +248,43 @@ def test_extract_tex_multiple_tex_no_bbl(make_tar):
     result = extract_tex_code_from_tar(path, "test-paper")
     assert result is not None
     assert "Main content" in result["all"]
+
+
+def test_extract_tex_multiple_document_blocks_bm25(make_tar):
+    """When multiple tex files contain \\begin{document}, BM25 picks the one matching paper_title."""
+    path = make_tar({
+        "appendix.tex": "\\begin{document}\n\\title{Supplementary Material}\nAppendix stuff\n\\end{document}",
+        "main.tex": "\\begin{document}\n\\title{Quantum Entanglement in Neural Networks}\nReal content here\n\\end{document}",
+    })
+    result = extract_tex_code_from_tar(path, "test-paper", paper_title="Quantum Entanglement in Neural Networks")
+    assert result is not None
+    assert "Real content here" in result["all"]
+
+
+def test_extract_tex_multiple_document_blocks_no_title(make_tar):
+    """Without paper_title, falls back to the first candidate."""
+    path = make_tar({
+        "a.tex": "\\begin{document}\nFirst doc\n\\end{document}",
+        "b.tex": "\\begin{document}\nSecond doc\n\\end{document}",
+    })
+    result = extract_tex_code_from_tar(path, "test-paper")
+    assert result is not None
+    assert result["all"] is not None
+
+
+class TestBm25Pick:
+    def test_picks_best_match(self):
+        candidates = {
+            "a.tex": "This paper discusses cats and dogs in the wild",
+            "b.tex": "Quantum entanglement in neural network architectures",
+        }
+        assert _bm25_pick("Quantum entanglement neural networks", candidates) == "b.tex"
+
+    def test_single_candidate(self):
+        candidates = {"only.tex": "Some content here"}
+        assert _bm25_pick("anything", candidates) == "only.tex"
+
+    def test_empty_query_returns_first(self):
+        candidates = {"a.tex": "hello", "b.tex": "world"}
+        result = _bm25_pick("", candidates)
+        assert result == "a.tex"
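
Review note: the scoring in `_bm25_pick` can be reproduced outside the repository to see why a title picks one file over another. The sketch below assumes the same tokenizer, idf form, and `k1`/`b` defaults shown in the diff. `bm25_scores` is a hypothetical helper that returns every score instead of only the winner; it is not part of this PR.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-zA-Z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; no stemming or stop-word removal."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def bm25_scores(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Score every document against the query (hypothetical helper)."""
    doc_tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(doc_tokens)
    avgdl = sum(len(t) for t in doc_tokens.values()) / max(n_docs, 1)

    # Document frequency: in how many documents each token appears.
    df: Counter[str] = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))

    scores: dict[str, float] = {}
    for name, tokens in doc_tokens.items():
        tf = Counter(tokens)
        # Length normalization term shared by all query tokens of this document.
        norm = k1 * (1 - b + b * len(tokens) / max(avgdl, 1))
        score = 0.0
        for q in tokenize(query):
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1)
            score += idf * tf[q] * (k1 + 1) / (tf[q] + norm)
        scores[name] = score
    return scores


docs = {
    "a.tex": "This paper discusses cats and dogs in the wild",
    "b.tex": "Quantum entanglement in neural network architectures",
}
scores = bm25_scores("Quantum entanglement neural networks", docs)
best = max(scores, key=scores.get)
print(best)  # b.tex
```

Matching is on exact tokens, so the query term `networks` does not match `network` in `b.tex`; the file still wins on `quantum`, `entanglement`, and `neural`, while `a.tex` shares no query tokens and scores 0.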