Use tex extraction as first priority#237
Conversation
There was a problem hiding this comment.
Pull request overview
This PR makes TeX source extraction the preferred path for arXiv papers and improves main .tex selection when multiple files contain \\begin{document} by using a BM25-based match against the paper title.
Changes:
- Add BM25-based candidate selection for TeX sources when multiple document blocks are present, optionally guided by
paper_title. - Reorder arXiv full-text extraction priority to try tar/TeX first, then HTML, then PDF; pass
paper.titleinto tar extraction. - Extend unit tests for TeX extraction and BM25 picking; add a repository Copilot instructions document.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/zotero_arxiv_daily/utils.py |
Adds _bm25_pick and updates TeX main-file selection logic with optional paper_title. |
src/zotero_arxiv_daily/retriever/arxiv_retriever.py |
Switches extraction order to tar-first and threads paper_title through the tar worker. |
tests/test_utils.py |
Adds tests for multiple \\begin{document} handling and BM25 selection behavior. |
.github/copilot-instructions.md |
Documents project commands, architecture, and conventions for contributors/tools. |
Comments suppressed due to low confidence (1)
src/zotero_arxiv_daily/utils.py:61
extract_tex_code_from_taris annotated as returningdict[str, str], but it returnsNoneon read errors / missing tex files, and it can also setfile_contents["all"] = None. This is a real type contract mismatch and will confuse callers and type checkers. Update the return type to something likedict[str, str | None] | None(or split out a dedicated result type) so the signature matches actual behavior.
def extract_tex_code_from_tar(file_path:str, paper_id:str, paper_title:str | None = None) -> dict[str,str]:
try:
tar = tarfile.open(file_path)
except tarfile.ReadError:
logger.debug(f"Failed to find main tex file of {paper_id}: Not a tar file.")
return None
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_empty_query_returns_first(self): | ||
| candidates = {"a.tex": "hello", "b.tex": "world"} | ||
| result = _bm25_pick("", candidates) | ||
| assert result in candidates |
There was a problem hiding this comment.
test_empty_query_returns_first currently asserts only that the result is one of the candidates, but _bm25_pick is specifically implemented to return the first candidate when the query tokenizes to empty. If that behavior matters, assert equality to the first key to make the test catch regressions.
| assert result in candidates | |
| assert result == "a.tex" |
| def _bm25_pick(query: str, candidates: dict[str, str], k1: float = 1.5, b: float = 0.75) -> str: | ||
| """Return the candidate key whose content best matches *query* by BM25.""" | ||
| query_tokens = _tokenize(query) | ||
| if not query_tokens: | ||
| return next(iter(candidates)) | ||
|
|
There was a problem hiding this comment.
_bm25_pick assumes candidates is non-empty (next(iter(candidates)) and best_name initialization). If it is ever called with an empty dict, it will raise StopIteration or return None despite the -> str annotation. Add an explicit guard (e.g., raise ValueError with a clear message) and/or adjust the return type so the function’s contract is unambiguous.
| full_text = extract_text_from_tar(raw_paper) | ||
| if full_text is None: | ||
| full_text = extract_text_from_pdf(raw_paper) | ||
| full_text = extract_text_from_html(raw_paper) | ||
| if full_text is None: | ||
| full_text = extract_text_from_tar(raw_paper) | ||
| full_text = extract_text_from_pdf(raw_paper) |
There was a problem hiding this comment.
convert_to_paper changes the extraction priority order (tar → html → pdf). There’s existing test coverage for ArxivRetriever, but no test asserting the fallback order, so regressions (e.g., accidentally reverting to html-first) would be easy to miss. Consider adding a unit test that monkeypatches extract_text_from_tar/html/pdf to record call order and returns None/text to verify the intended priority.
| }) | ||
| result = extract_tex_code_from_tar(path, "test-paper") | ||
| assert result is not None | ||
| assert result["all"] is not None |
There was a problem hiding this comment.
test_extract_tex_multiple_document_blocks_no_title docstring says it "falls back to the first candidate", but the assertions only check that result["all"] is not None. This doesn’t actually validate the intended fallback behavior. Strengthen the test to assert the extracted content comes from the first candidate (e.g., contains "First doc").
| assert result["all"] is not None | |
| assert "First doc" in result["all"] | |
| assert "Second doc" not in result["all"] |
No description provided.