Commit b2f6460

Merge pull request #237 from TideDra/fix/prioritize_tex
Use tex extraction as first priority
2 parents e5139fa + a42b328 commit b2f6460

4 files changed

Lines changed: 177 additions & 16 deletions

.github/copilot-instructions.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+# Copilot Instructions
+
+## Project Overview
+
+Zotero-arXiv-Daily recommends new arXiv/bioRxiv/medRxiv papers based on a user's Zotero library. It computes embedding similarity between new papers and the user's existing library, generates TLDRs via LLM, and delivers results by email. Designed to run as a GitHub Actions workflow at zero cost.
+
+## Commands
+
+```bash
+# Install/sync dependencies
+uv sync
+
+# Run the application
+uv run src/zotero_arxiv_daily/main.py
+
+# Run tests (excludes slow tests by default)
+uv run pytest
+
+# Run all tests including slow ones
+uv run pytest -m ""
+
+# Run a single test
+uv run pytest tests/test_utils.py::TestGlobMatch -v
+```
+
+No linter or formatter is configured.
+
+## Architecture
+
+The app is a linear pipeline orchestrated by `Executor` (`src/zotero_arxiv_daily/executor.py`):
+
+1. **Fetch Zotero corpus** → pyzotero API
+2. **Filter corpus** → `include_path` / `ignore_path` glob patterns
+3. **Retrieve new papers** → from configured sources (arXiv RSS, bioRxiv/medRxiv REST API)
+4. **Rerank** → weighted embedding similarity to corpus (newer Zotero papers weighted higher)
+5. **Generate TLDRs + affiliations** → OpenAI-compatible LLM API
+6. **Render + send email** → HTML email via SMTP
+
+### Plugin Systems
+
+**Retrievers** (`src/zotero_arxiv_daily/retriever/`): Register via `@register_retriever("name")` decorator on a `BaseRetriever` subclass. Each retriever implements `_retrieve_raw_papers()` and `convert_to_paper()`. Discovered at runtime via `get_retriever_cls(name)` from a module-level `registered_retrievers` dict.
+
+**Rerankers** (`src/zotero_arxiv_daily/reranker/`): Register via `@register_reranker("name")` decorator on a `BaseReranker` subclass. Two implementations: `local` (sentence-transformers) and `api` (OpenAI-compatible embeddings endpoint). Discovered via `get_reranker_cls(name)`.
+
+When adding a new retriever or reranker, follow the existing pattern: create a new file, subclass the base, apply the registration decorator, and implement the abstract methods.
+
+### Configuration
+
+Uses Hydra + OmegaConf. Config composes from `config/base.yaml` (defaults with `???` placeholders for required values) + `config/custom.yaml` (user overrides). The composition order is defined in `config/default.yaml`. Environment variables are interpolated via `${oc.env:VAR_NAME,default}` syntax. Entry point uses `@hydra.main(config_name="default")`.
+
+### Data Classes
+
+`Paper` and `CorpusPaper` in `src/zotero_arxiv_daily/protocol.py`. `Paper` has LLM-powered methods (`generate_tldr`, `generate_affiliations`) that call the OpenAI API directly with `tiktoken`-based token truncation.
+
+## Testing Conventions
+
+- Tests use **pytest monkeypatch + `SimpleNamespace`** for stubs — not `unittest.mock`.
+- A session-scoped Hydra config in `tests/conftest.py` is deep-copied per test via the `config` fixture.
+- Canned response factories live in `tests/canned_responses.py` (e.g., `make_stub_openai_client()`, `make_stub_zotero_client()`).
+- Tests marked `@pytest.mark.slow` require heavy dependencies (model downloads) and are excluded by default (`addopts = "-m 'not slow'"` in pyproject.toml).
+- Monkeypatching targets the module-level import path (e.g., `"zotero_arxiv_daily.executor.zotero.Zotero"`).
+
+## Coding Conventions
+
+- **Logging:** `loguru.logger` throughout — never `print()` or stdlib `logging`.
+- **Type hints:** Modern Python 3.10+ syntax (`list[Paper]`, `str | None`).
+- **Constants:** Module-level `UPPER_SNAKE_CASE`.
+- **Private methods:** Prefixed with `_` (e.g., `_retrieve_raw_papers`).
+- **Error handling:** Graceful degradation with try/except and fallback logic; log warnings rather than raising.
+- **Config injection:** All major components receive `DictConfig` at init and store it as `self.config`.
+
+## Git Workflow
+
+- PRs should target the **`dev`** branch, not `main`.
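
Note on the "Plugin Systems" section of the file above: the decorator-plus-registry pattern it describes can be sketched as follows. This is a minimal illustration, not the project's code. The names `register_retriever`, `get_retriever_cls`, `registered_retrievers`, `BaseRetriever`, `_retrieve_raw_papers`, and `convert_to_paper` come from the instructions file, but their bodies here are assumptions, and `DemoRetriever`, `retrieve_papers`, and the dict-based paper records are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseRetriever(ABC):
    """Assumed shape of the base class; the real one may differ."""

    name: str = ""

    def __init__(self, config: dict):
        self.config = config  # mirrors the "config injection" convention

    @abstractmethod
    def _retrieve_raw_papers(self) -> list[dict]: ...

    @abstractmethod
    def convert_to_paper(self, raw_paper: dict) -> dict: ...

    def retrieve_papers(self) -> list[dict]:
        # Hypothetical driver method: fetch raw items, then convert each one.
        return [self.convert_to_paper(raw) for raw in self._retrieve_raw_papers()]


# Module-level registry mapping a source name to its retriever class.
registered_retrievers: dict[str, type[BaseRetriever]] = {}


def register_retriever(name: str):
    """Class decorator that records the class under `name`."""
    def decorator(cls: type[BaseRetriever]) -> type[BaseRetriever]:
        cls.name = name
        registered_retrievers[name] = cls
        return cls
    return decorator


def get_retriever_cls(name: str) -> type[BaseRetriever]:
    if name not in registered_retrievers:
        raise ValueError(f"Unknown retriever: {name}")
    return registered_retrievers[name]


@register_retriever("demo")  # hypothetical source name
class DemoRetriever(BaseRetriever):
    def _retrieve_raw_papers(self) -> list[dict]:
        return [{"t": "A Paper"}]

    def convert_to_paper(self, raw_paper: dict) -> dict:
        return {"source": self.name, "title": raw_paper["t"]}


papers = get_retriever_cls("demo")(config={}).retrieve_papers()
```

The lookup by name is what lets a config value select the source at runtime: a new source only needs a new file with a decorated subclass, and the calling code does not change.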

src/zotero_arxiv_daily/retriever/arxiv_retriever.py

Lines changed: 6 additions & 6 deletions
@@ -95,11 +95,11 @@ def _extract_text_from_html_worker(html_url: str) -> str | None:
     return text
 
 
-def _extract_text_from_tar_worker(source_url: str, paper_id: str) -> str | None:
+def _extract_text_from_tar_worker(source_url: str, paper_id: str, paper_title: str | None = None) -> str | None:
     with TemporaryDirectory() as temp_dir:
         path = os.path.join(temp_dir, "paper.tar.gz")
         _download_file(source_url, path)
-        file_contents = extract_tex_code_from_tar(path, paper_id)
+        file_contents = extract_tex_code_from_tar(path, paper_id, paper_title=paper_title)
         if not file_contents or "all" not in file_contents:
             raise ValueError("Main tex file not found.")
         return file_contents["all"]
@@ -146,11 +146,11 @@ def convert_to_paper(self, raw_paper: ArxivResult) -> Paper:
         authors = [a.name for a in raw_paper.authors]
         abstract = raw_paper.summary
         pdf_url = raw_paper.pdf_url
-        full_text = extract_text_from_html(raw_paper)
+        full_text = extract_text_from_tar(raw_paper)
         if full_text is None:
-            full_text = extract_text_from_pdf(raw_paper)
+            full_text = extract_text_from_html(raw_paper)
         if full_text is None:
-            full_text = extract_text_from_tar(raw_paper)
+            full_text = extract_text_from_pdf(raw_paper)
         return Paper(
             source=self.name,
             title=title,
@@ -191,7 +191,7 @@ def extract_text_from_tar(paper: ArxivResult) -> str | None:
         return None
     return _run_with_hard_timeout(
         _extract_text_from_tar_worker,
-        (source_url, paper.entry_id),
+        (source_url, paper.entry_id, paper.title),
         timeout=TAR_EXTRACT_TIMEOUT,
         operation="Tar extraction",
         paper_title=paper.title,
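
Note on the reordering in `convert_to_paper`: the change moves the fallback chain from HTML → PDF → tar to tar → HTML → PDF, with each later extractor tried only when the earlier one returns `None`. A minimal sketch of that "first non-`None` wins" chain follows. The helper `first_available_text` and the three stand-in extractors are hypothetical and only mimic the `str | None` return contract shown in the diff.

```python
from __future__ import annotations

from typing import Callable


def first_available_text(
    paper: dict,
    extractors: list[tuple[str, Callable[[dict], str | None]]],
) -> tuple[str | None, str | None]:
    """Try each extractor in order; return (extractor name, text) for the first hit."""
    for name, extract in extractors:
        text = extract(paper)
        if text is not None:
            return name, text
    return None, None


# Hypothetical stand-ins: each returns None when its source is unavailable.
def from_tar(paper: dict) -> str | None:
    return paper.get("tex")


def from_html(paper: dict) -> str | None:
    return paper.get("html")


def from_pdf(paper: dict) -> str | None:
    return paper.get("pdf")


# Order after this commit: TeX source first, then HTML, then PDF.
EXTRACTION_ORDER = [("tar", from_tar), ("html", from_html), ("pdf", from_pdf)]

source, text = first_available_text({"html": "<p>body</p>", "pdf": "body"}, EXTRACTION_ORDER)
```

Keeping the order in one list makes the priority explicit and easy to change, which the chained `if full_text is None` blocks in the diff express less directly.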

src/zotero_arxiv_daily/utils.py

Lines changed: 56 additions & 9 deletions
@@ -1,7 +1,9 @@
 import tarfile
 import re
 import glob
+import math
 import smtplib
+from collections import Counter
 from email.header import Header
 from email.mime.text import MIMEText
 from email.utils import parseaddr, formataddr
@@ -15,7 +17,43 @@
 
 import pymupdf4llm # noqa: E402
 
-def extract_tex_code_from_tar(file_path:str, paper_id:str) -> dict[str,str]:
+_TOKEN_RE = re.compile(r'[a-zA-Z0-9]+')
+
+def _tokenize(text: str) -> list[str]:
+    return [t.lower() for t in _TOKEN_RE.findall(text)]
+
+
+def _bm25_pick(query: str, candidates: dict[str, str], k1: float = 1.5, b: float = 0.75) -> str:
+    """Return the candidate key whose content best matches *query* by BM25."""
+    query_tokens = _tokenize(query)
+    if not query_tokens:
+        return next(iter(candidates))
+
+    doc_tokens = {name: _tokenize(content) for name, content in candidates.items()}
+    N = len(doc_tokens)
+    avgdl = sum(len(t) for t in doc_tokens.values()) / max(N, 1)
+
+    df: Counter[str] = Counter()
+    for tokens in doc_tokens.values():
+        df.update(set(tokens))
+
+    best_name, best_score = None, -1.0
+    for name, tokens in doc_tokens.items():
+        tf = Counter(tokens)
+        dl = len(tokens)
+        score = 0.0
+        for q in query_tokens:
+            n_q = df.get(q, 0)
+            idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
+            f_q = tf.get(q, 0)
+            score += idf * (f_q * (k1 + 1)) / (f_q + k1 * (1 - b + b * dl / max(avgdl, 1)))
+        if score > best_score:
+            best_score = score
+            best_name = name
+    return best_name
+
+
+def extract_tex_code_from_tar(file_path:str, paper_id:str, paper_title:str | None = None) -> dict[str,str]:
     try:
         tar = tarfile.open(file_path)
     except tarfile.ReadError:
@@ -48,25 +86,34 @@ def extract_tex_code_from_tar(file_path:str, paper_id:str) -> dict[str,str]:
 
     if main_tex is None:
         logger.debug(f"Trying to choose tex file containing the document block as main tex file of {paper_id}")
-    #read all tex files
+
     file_contents = {}
+    doc_block_candidates: list[str] = []
     for t in tex_files:
         f = tar.extractfile(t)
         content = f.read().decode('utf-8',errors='ignore')
-        #remove comments
         content = re.sub(r'%.*\n', '\n', content)
         content = re.sub(r'\\begin{comment}.*?\\end{comment}', '', content, flags=re.DOTALL)
         content = re.sub(r'\\iffalse.*?\\fi', '', content, flags=re.DOTALL)
-        #remove redundant \n
         content = re.sub(r'\n+', '\n', content)
         content = re.sub(r'\\\\', '', content)
-        #remove consecutive spaces
         content = re.sub(r'[ \t\r\f]{3,}', ' ', content)
-        if main_tex is None and re.search(r'\\begin\{document\}', content) and not any(w in t for w in ['example', 'sample']):
-            main_tex = t
-            logger.debug(f"Choose {t} as main tex file of {paper_id}")
+        if main_tex is None and re.search(r'\\begin\{document\}', content) and not any(w in t for w in ['example', 'sample', 'template']):
+            doc_block_candidates.append(t)
         file_contents[t] = content
-
+
+    if main_tex is None:
+        if len(doc_block_candidates) == 1:
+            main_tex = doc_block_candidates[0]
+            logger.debug(f"Choose {main_tex} as main tex file of {paper_id}")
+        elif len(doc_block_candidates) > 1:
+            if paper_title:
+                main_tex = _bm25_pick(paper_title, {c: file_contents[c] for c in doc_block_candidates})
+                logger.debug(f"Multiple document blocks found in {paper_id}; BM25 selected {main_tex} from {doc_block_candidates}")
+            else:
+                main_tex = doc_block_candidates[0]
+                logger.debug(f"Multiple document blocks found in {paper_id}; no title provided, using first candidate {main_tex}")
+
     if main_tex is not None:
         main_source:str = file_contents[main_tex]
         #find and replace all included sub-files
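
Note on the BM25 selection: when several `.tex` files contain `\begin{document}`, the paper title is used as the query and each candidate file as a document. The sketch below restates the same scoring formula as `_bm25_pick` but returns every candidate's score, which makes it easier to see why one file wins. The function `bm25_scores` is a hypothetical helper written for illustration and is not part of the commit. The sample inputs are copied from the new test in `tests/test_utils.py`.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-zA-Z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercased alphanumeric tokens, as in the diff's _tokenize."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def bm25_scores(query: str, candidates: dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Hypothetical helper: BM25 score of every candidate against the query."""
    query_tokens = tokenize(query)
    docs = {name: tokenize(content) for name, content in candidates.items()}
    n_docs = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / max(n_docs, 1)

    # Document frequency: in how many candidates each token appears.
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores: dict[str, float] = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        # Length normalisation: longer-than-average files are penalised.
        norm = k1 * (1 - b + b * len(tokens) / max(avgdl, 1))
        score = 0.0
        for q in query_tokens:
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1)
            score += idf * tf[q] * (k1 + 1) / (tf[q] + norm)
        scores[name] = score
    return scores


candidates = {
    "appendix.tex": "\\begin{document}\n\\title{Supplementary Material}\nAppendix stuff\n\\end{document}",
    "main.tex": "\\begin{document}\n\\title{Quantum Entanglement in Neural Networks}\nReal content here\n\\end{document}",
}
scores = bm25_scores("Quantum Entanglement in Neural Networks", candidates)
best = max(scores, key=scores.get)
```

In this input `appendix.tex` shares no token with the title, so its score is zero and `main.tex` is selected. The `+ 1` inside the logarithm keeps the IDF term positive even when a title word appears in every candidate, so such words still add to the score instead of subtracting from it.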

tests/test_utils.py

Lines changed: 41 additions & 1 deletion
@@ -6,7 +6,7 @@
 
 import pytest
 
-from zotero_arxiv_daily.utils import glob_match, send_email, extract_tex_code_from_tar
+from zotero_arxiv_daily.utils import glob_match, send_email, extract_tex_code_from_tar, _bm25_pick
 from tests.canned_responses import make_stub_smtp
 
 
@@ -248,3 +248,43 @@ def test_extract_tex_multiple_tex_no_bbl(make_tar):
     result = extract_tex_code_from_tar(path, "test-paper")
     assert result is not None
     assert "Main content" in result["all"]
+
+
+def test_extract_tex_multiple_document_blocks_bm25(make_tar):
+    """When multiple tex files contain \\begin{document}, BM25 picks the one matching paper_title."""
+    path = make_tar({
+        "appendix.tex": "\\begin{document}\n\\title{Supplementary Material}\nAppendix stuff\n\\end{document}",
+        "main.tex": "\\begin{document}\n\\title{Quantum Entanglement in Neural Networks}\nReal content here\n\\end{document}",
+    })
+    result = extract_tex_code_from_tar(path, "test-paper", paper_title="Quantum Entanglement in Neural Networks")
+    assert result is not None
+    assert "Real content here" in result["all"]
+
+
+def test_extract_tex_multiple_document_blocks_no_title(make_tar):
+    """Without paper_title, falls back to the first candidate."""
+    path = make_tar({
+        "a.tex": "\\begin{document}\nFirst doc\n\\end{document}",
+        "b.tex": "\\begin{document}\nSecond doc\n\\end{document}",
+    })
+    result = extract_tex_code_from_tar(path, "test-paper")
+    assert result is not None
+    assert result["all"] is not None
+
+
+class TestBm25Pick:
+    def test_picks_best_match(self):
+        candidates = {
+            "a.tex": "This paper discusses cats and dogs in the wild",
+            "b.tex": "Quantum entanglement in neural network architectures",
+        }
+        assert _bm25_pick("Quantum entanglement neural networks", candidates) == "b.tex"
+
+    def test_single_candidate(self):
+        candidates = {"only.tex": "Some content here"}
+        assert _bm25_pick("anything", candidates) == "only.tex"
+
+    def test_empty_query_returns_first(self):
+        candidates = {"a.tex": "hello", "b.tex": "world"}
+        result = _bm25_pick("", candidates)
+        assert result in candidates
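
Note on test style: the testing conventions in `.github/copilot-instructions.md` ask for pytest `monkeypatch` plus `SimpleNamespace` stubs rather than `unittest.mock`. A minimal sketch of that style follows, for tests that need to replace an external client. Everything here is hypothetical: `send_ping` stands in for code under test, and `make_fake_smtp` is an illustrative factory, not the repository's `make_stub_smtp`.

```python
import smtplib
from types import SimpleNamespace


def send_ping(host: str) -> str:
    """Hypothetical code under test: connect, send NOOP, report the outcome."""
    server = smtplib.SMTP(host)
    code, _ = server.noop()
    server.quit()
    return "ok" if code == 250 else "failed"


def make_fake_smtp(calls: list[str]):
    """Return a factory whose product is a SimpleNamespace posing as an SMTP client."""
    def factory(host: str) -> SimpleNamespace:
        calls.append(host)  # record the host so the test can assert on it
        return SimpleNamespace(noop=lambda: (250, b"OK"), quit=lambda: None)
    return factory


def test_send_ping_uses_stub(monkeypatch):
    calls: list[str] = []
    # Patch the attribute where the code under test looks it up.
    monkeypatch.setattr(smtplib, "SMTP", make_fake_smtp(calls))
    assert send_ping("smtp.example.com") == "ok"
    assert calls == ["smtp.example.com"]
```

Under pytest, `monkeypatch` is injected as a fixture and the patch is undone after the test. The stub exposes only the attributes the code actually touches, so a missing attribute fails loudly instead of being auto-created the way a mock object would.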
