Skip to content

Use tex extraction as first priority#237

Merged
TideDra merged 3 commits into
mainfrom
fix/prioritize_tex
Apr 14, 2026
Merged

Use tex extraction as first priority#237
TideDra merged 3 commits into
mainfrom
fix/prioritize_tex

Conversation

@TideDra

@TideDra TideDra commented Apr 14, 2026

Copy link
Copy Markdown
Owner

No description provided.

Copilot AI review requested due to automatic review settings April 14, 2026 11:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes TeX source extraction the preferred path for arXiv papers and improves main .tex selection when multiple files contain \\begin{document} by using a BM25-based match against the paper title.

Changes:

  • Add BM25-based candidate selection for TeX sources when multiple document blocks are present, optionally guided by paper_title.
  • Reorder arXiv full-text extraction priority to try tar/TeX first, then HTML, then PDF; pass paper.title into tar extraction.
  • Extend unit tests for TeX extraction and BM25 picking; add a repository Copilot instructions document.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
src/zotero_arxiv_daily/utils.py Adds _bm25_pick and updates TeX main-file selection logic with optional paper_title.
src/zotero_arxiv_daily/retriever/arxiv_retriever.py Switches extraction order to tar-first and threads paper_title through the tar worker.
tests/test_utils.py Adds tests for multiple \\begin{document} handling and BM25 selection behavior.
.github/copilot-instructions.md Documents project commands, architecture, and conventions for contributors/tools.
Comments suppressed due to low confidence (1)

src/zotero_arxiv_daily/utils.py:61

  • extract_tex_code_from_tar is annotated as returning dict[str, str], but it returns None on read errors / missing tex files, and it can also set file_contents["all"] = None. This is a real type contract mismatch and will confuse callers and type checkers. Update the return type to something like dict[str, str | None] | None (or split out a dedicated result type) so the signature matches actual behavior.
def extract_tex_code_from_tar(file_path:str, paper_id:str, paper_title:str | None = None) -> dict[str,str]:
    try:
        tar = tarfile.open(file_path)
    except tarfile.ReadError:
        logger.debug(f"Failed to find main tex file of {paper_id}: Not a tar file.")
        return None

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_utils.py
def test_empty_query_returns_first(self):
candidates = {"a.tex": "hello", "b.tex": "world"}
result = _bm25_pick("", candidates)
assert result in candidates

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_empty_query_returns_first currently asserts only that the result is one of the candidates, but _bm25_pick is specifically implemented to return the first candidate when the query tokenizes to empty. If that behavior matters, assert equality to the first key to make the test catch regressions.

Suggested change
assert result in candidates
assert result == "a.tex"

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +31
def _bm25_pick(query: str, candidates: dict[str, str], k1: float = 1.5, b: float = 0.75) -> str:
"""Return the candidate key whose content best matches *query* by BM25."""
query_tokens = _tokenize(query)
if not query_tokens:
return next(iter(candidates))

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_bm25_pick assumes candidates is non-empty (next(iter(candidates)) and best_name initialization). If it is ever called with an empty dict, it will raise StopIteration or return None despite the -> str annotation. Add an explicit guard (e.g., raise ValueError with a clear message) and/or adjust the return type so the function’s contract is unambiguous.

Copilot uses AI. Check for mistakes.
Comment on lines +149 to +153
full_text = extract_text_from_tar(raw_paper)
if full_text is None:
full_text = extract_text_from_pdf(raw_paper)
full_text = extract_text_from_html(raw_paper)
if full_text is None:
full_text = extract_text_from_tar(raw_paper)
full_text = extract_text_from_pdf(raw_paper)

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_to_paper changes the extraction priority order (tar → html → pdf). There’s existing test coverage for ArxivRetriever, but no test asserting the fallback order, so regressions (e.g., accidentally reverting to html-first) would be easy to miss. Consider adding a unit test that monkeypatches extract_text_from_tar/html/pdf to record call order and returns None/text to verify the intended priority.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_utils.py
})
result = extract_tex_code_from_tar(path, "test-paper")
assert result is not None
assert result["all"] is not None

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_extract_tex_multiple_document_blocks_no_title docstring says it "falls back to the first candidate", but the assertions only check that result["all"] is not None. This doesn’t actually validate the intended fallback behavior. Strengthen the test to assert the extracted content comes from the first candidate (e.g., contains "First doc").

Suggested change
assert result["all"] is not None
assert "First doc" in result["all"]
assert "Second doc" not in result["all"]

Copilot uses AI. Check for mistakes.
@TideDra TideDra merged commit b2f6460 into main Apr 14, 2026
5 checks passed
@TideDra TideDra deleted the fix/prioritize_tex branch April 14, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants