Use tex extraction as first priority by TideDra · Pull Request #237 · TideDra/zotero-arxiv-daily

TideDra · 2026-04-14T11:57:02Z

No description provided.

Copilot

Pull request overview

This PR makes TeX source extraction the preferred path for arXiv papers and improves main `.tex` selection when multiple files contain `\begin{document}` by using a BM25-based match against the paper title.

Changes:

Add BM25-based candidate selection for TeX sources when multiple document blocks are present, optionally guided by paper_title.
Reorder arXiv full-text extraction priority to try tar/TeX first, then HTML, then PDF; pass paper.title into tar extraction.
Extend unit tests for TeX extraction and BM25 picking; add a repository Copilot instructions document.
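
As a rough illustration of the selection flow described above, the sketch below filters the files that contain `\begin{document}` and only consults a ranking function when more than one candidate remains and a title is available. The names `find_main_candidates`, `choose_main_tex`, and the `rank` parameter are hypothetical and are not taken from the PR; the real logic lives in `src/zotero_arxiv_daily/utils.py` and may differ.

```python
from __future__ import annotations

from typing import Callable

DOCUMENT_MARKER = r"\begin{document}"


def find_main_candidates(file_contents: dict[str, str]) -> dict[str, str]:
    """Keep only .tex files whose text contains \\begin{document}."""
    return {
        name: text
        for name, text in file_contents.items()
        if name.endswith(".tex") and DOCUMENT_MARKER in text
    }


def choose_main_tex(
    file_contents: dict[str, str],
    paper_title: str | None = None,
    rank: Callable[[str, dict[str, str]], str] | None = None,
) -> str | None:
    """Return the name of the likely main file, or None if nothing qualifies."""
    candidates = find_main_candidates(file_contents)
    if not candidates:
        return None
    # A single candidate needs no ranking; without a title (or a ranker)
    # this sketch falls back to the first candidate in insertion order.
    if len(candidates) == 1 or not paper_title or rank is None:
        return next(iter(candidates))
    return rank(paper_title, candidates)
```

Here `rank` stands in for a title-based scorer such as the PR's `_bm25_pick`, which takes the same `(query, candidates)` arguments.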

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/zotero_arxiv_daily/utils.py` | Adds `_bm25_pick` and updates TeX main-file selection logic with optional `paper_title`. |
| `src/zotero_arxiv_daily/retriever/arxiv_retriever.py` | Switches extraction order to tar-first and threads `paper_title` through the tar worker. |
| `tests/test_utils.py` | Adds tests for multiple `\begin{document}` handling and BM25 selection behavior. |
| `.github/copilot-instructions.md` | Documents project commands, architecture, and conventions for contributors/tools. |

Comments suppressed due to low confidence (1)

src/zotero_arxiv_daily/utils.py:61

extract_tex_code_from_tar is annotated as returning dict[str, str], but it returns None on read errors / missing tex files, and it can also set file_contents["all"] = None. This is a real type contract mismatch and will confuse callers and type checkers. Update the return type to something like dict[str, str | None] | None (or split out a dedicated result type) so the signature matches actual behavior.

def extract_tex_code_from_tar(file_path:str, paper_id:str, paper_title:str | None = None) -> dict[str,str]:
    try:
        tar = tarfile.open(file_path)
    except tarfile.ReadError:
        logger.debug(f"Failed to find main tex file of {paper_id}: Not a tar file.")
        return None
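
One way the signature could be made to match the behavior is sketched below. It is a simplified stand-in, not the PR's function: `read_tex_sources` is a hypothetical helper that only reads `.tex` members, and the choice of `dict[str, str | None] | None` follows the suggestion in the comment above.

```python
from __future__ import annotations

import tarfile


def read_tex_sources(file_path: str) -> dict[str, str | None] | None:
    """Map each .tex member to its text; None signals "no usable TeX source"."""
    try:
        tar = tarfile.open(file_path)
    except tarfile.ReadError:
        # Not a tar archive: the annotation now admits this outcome.
        return None
    with tar:
        contents: dict[str, str | None] = {}
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".tex")):
                continue
            handle = tar.extractfile(member)
            if handle is None:
                contents[member.name] = None
                continue
            try:
                contents[member.name] = handle.read().decode("utf-8")
            except UnicodeDecodeError:
                # Undecodable members are kept as None rather than dropped.
                contents[member.name] = None
    # An archive without .tex files is treated like a missing source.
    return contents or None
```

With the optional return spelled out, a type checker can flag callers that index the result without first checking for `None`.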


Copilot · 2026-04-14T12:00:28Z

+    def test_empty_query_returns_first(self):
+        candidates = {"a.tex": "hello", "b.tex": "world"}
+        result = _bm25_pick("", candidates)
+        assert result in candidates


test_empty_query_returns_first currently asserts only that the result is one of the candidates, but _bm25_pick is specifically implemented to return the first candidate when the query tokenizes to empty. If that behavior matters, assert equality to the first key to make the test catch regressions.

Suggested change

-        assert result in candidates
+        assert result == "a.tex"

Copilot · 2026-04-14T12:00:29Z

+def _bm25_pick(query: str, candidates: dict[str, str], k1: float = 1.5, b: float = 0.75) -> str:
+    """Return the candidate key whose content best matches *query* by BM25."""
+    query_tokens = _tokenize(query)
+    if not query_tokens:
+        return next(iter(candidates))
+


_bm25_pick assumes candidates is non-empty (next(iter(candidates)) and best_name initialization). If it is ever called with an empty dict, it will raise StopIteration or return None despite the -> str annotation. Add an explicit guard (e.g., raise ValueError with a clear message) and/or adjust the return type so the function’s contract is unambiguous.
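
A minimal sketch of what the guarded function could look like follows. Only the signature, the docstring, and the empty-query branch come from the diff above; the tokenizer, the scoring loop, and the `ValueError` message are assumptions, and the function is named `bm25_pick` here to make clear it is not the PR's implementation.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def _tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase alphanumeric runs. The PR's _tokenize may differ.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_pick(query: str, candidates: dict[str, str], k1: float = 1.5, b: float = 0.75) -> str:
    """Return the candidate key whose content best matches *query* by BM25."""
    if not candidates:
        # Explicit guard: fail loudly instead of raising StopIteration.
        raise ValueError("candidates must contain at least one entry")
    query_tokens = _tokenize(query)
    if not query_tokens:
        return next(iter(candidates))

    docs = {name: Counter(_tokenize(text)) for name, text in candidates.items()}
    lengths = {name: sum(tf.values()) for name, tf in docs.items()}
    n_docs = len(docs)
    avg_len = sum(lengths.values()) / n_docs or 1.0

    best_name = next(iter(candidates))
    best_score = float("-inf")
    for name, tf in docs.items():
        score = 0.0
        for term in set(query_tokens):
            freq = tf[term]
            if freq == 0:
                continue
            df = sum(1 for other in docs.values() if term in other)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            norm = freq + k1 * (1 - b + b * lengths[name] / avg_len)
            score += idf * freq * (k1 + 1) / norm
        # Strict comparison keeps the earliest candidate on ties.
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Raising on an empty dict keeps the `-> str` annotation honest; the alternative of returning `None` would require widening the return type and a check at every call site.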

Copilot · 2026-04-14T12:00:29Z

+        full_text = extract_text_from_tar(raw_paper)
        if full_text is None:
-            full_text = extract_text_from_pdf(raw_paper)
+            full_text = extract_text_from_html(raw_paper)
        if full_text is None:
-            full_text = extract_text_from_tar(raw_paper)
+            full_text = extract_text_from_pdf(raw_paper)


convert_to_paper changes the extraction priority order (tar → html → pdf). There’s existing test coverage for ArxivRetriever, but no test asserting the fallback order, so regressions (e.g., accidentally reverting to html-first) would be easy to miss. Consider adding a unit test that monkeypatches extract_text_from_tar/html/pdf to record call order and returns None/text to verify the intended priority.
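
The kind of test suggested here might look like the sketch below. It is self-contained, so the retriever module is replaced by a hypothetical stand-in namespace (`extractors`) and `extract_full_text` mirrors only the three-step fallback from the diff; a real test would patch the functions where `ArxivRetriever` looks them up.

```python
from __future__ import annotations

import types
from unittest import mock

# Hypothetical stand-in for the module that owns the three extractors.
extractors = types.SimpleNamespace(
    extract_text_from_tar=lambda raw_paper: None,
    extract_text_from_html=lambda raw_paper: None,
    extract_text_from_pdf=lambda raw_paper: None,
)


def extract_full_text(raw_paper: object) -> str | None:
    """Mirror of the fallback chain in the diff: tar, then HTML, then PDF."""
    full_text = extractors.extract_text_from_tar(raw_paper)
    if full_text is None:
        full_text = extractors.extract_text_from_html(raw_paper)
    if full_text is None:
        full_text = extractors.extract_text_from_pdf(raw_paper)
    return full_text


def run_with_fakes(results: dict[str, str | None]) -> tuple[list[str], str | None]:
    """Patch all three extractors with fakes that record the call order."""
    calls: list[str] = []

    def fake(kind: str):
        def _extract(raw_paper: object) -> str | None:
            calls.append(kind)
            return results.get(kind)
        return _extract

    with mock.patch.object(extractors, "extract_text_from_tar", fake("tar")), \
         mock.patch.object(extractors, "extract_text_from_html", fake("html")), \
         mock.patch.object(extractors, "extract_text_from_pdf", fake("pdf")):
        text = extract_full_text(object())
    return calls, text


def test_fallback_order() -> None:
    assert run_with_fakes({"tar": "tex"}) == (["tar"], "tex")
    assert run_with_fakes({"html": "page"}) == (["tar", "html"], "page")
    assert run_with_fakes({"pdf": "scan"}) == (["tar", "html", "pdf"], "scan")
    assert run_with_fakes({}) == (["tar", "html", "pdf"], None)
```

Recording the call list, rather than only checking the returned text, is what makes an accidental reordering visible.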

Copilot · 2026-04-14T12:00:29Z

+    })
+    result = extract_tex_code_from_tar(path, "test-paper")
+    assert result is not None
+    assert result["all"] is not None


test_extract_tex_multiple_document_blocks_no_title docstring says it "falls back to the first candidate", but the assertions only check that result["all"] is not None. This doesn’t actually validate the intended fallback behavior. Strengthen the test to assert the extracted content comes from the first candidate (e.g., contains "First doc").

Suggested change

     assert result["all"] is not None
+    assert "First doc" in result["all"]
+    assert "Second doc" not in result["all"]
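
For context, a fixture of the kind this test relies on could be built as sketched below. The helper `make_tex_tar` and the file names are hypothetical; the test file in the PR has its own helper, of which only the closing `})` is visible in the hunk above.

```python
from __future__ import annotations

import io
import tarfile
import tempfile


def make_tex_tar(files: dict[str, str]) -> str:
    """Write *files* (name -> text) into a temporary .tar and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".tar", delete=False)
    tmp.close()
    with tarfile.open(tmp.name, "w") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return tmp.name


# Two files that both qualify as a main document; member order matches dict order.
path = make_tex_tar({
    "first.tex": r"\begin{document}First doc\end{document}",
    "second.tex": r"\begin{document}Second doc\end{document}",
})
```

Giving each candidate a distinct marker string ("First doc", "Second doc") is what lets the assertions tell which file was selected.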

TideDra added 3 commits April 14, 2026 11:32

add copilot instruction

e7626bf

prioritize tar extraction

62ed87d

use bm25 to select main tex

a42b328

Copilot AI review requested due to automatic review settings April 14, 2026 11:57

Copilot started reviewing on behalf of TideDra April 14, 2026 11:57

Copilot AI reviewed Apr 14, 2026

TideDra merged commit b2f6460 into main Apr 14, 2026
5 checks passed

TideDra deleted the fix/prioritize_tex branch April 14, 2026 12:02
