
Commit 5e75a1e

JihaoXin and claude committed
Fix page count: filter headers/footers by position only, not text length
The previous filter rejected blocks with fewer than 10 characters of text, which dropped chart labels, table cells, and figure annotations. Page numbers are now excluded by position only (top 6% and bottom 4% of the page).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
1 parent ff88e98 commit 5e75a1e

1 file changed: +17 -4 lines changed


ark/compiler.py

Lines changed: 17 additions & 4 deletions
@@ -335,10 +335,16 @@ def _count_body_pages(self, pdf_path: Path) -> float:
         page_height = last_body_page.rect.height
 
         # Detect dual-column by checking if text exists in both halves
+        # Filter out headers (top 6%) and footers (bottom 4%): page numbers, running titles
         blocks = last_body_page.get_text("blocks")
+        content_blocks = [
+            b for b in blocks
+            if b[3] > page_height * 0.06
+            and b[1] < page_height * 0.96
+        ]
         mid_x = page_width / 2
-        left_blocks = [b for b in blocks if b[0] < mid_x and b[3] > page_height * 0.1]
-        right_blocks = [b for b in blocks if b[0] >= mid_x and b[3] > page_height * 0.1]
+        left_blocks = [b for b in content_blocks if b[0] < mid_x]
+        right_blocks = [b for b in content_blocks if b[0] >= mid_x]
 
         is_dual_column = len(left_blocks) > 0 and len(right_blocks) > 0
 
@@ -348,8 +354,15 @@ def _count_body_pages(self, pdf_path: Path) -> float:
             fill_ratio = right_last_y / page_height
         elif ref_at_top:
             # Single column, References on separate page: check last body page fill
-            if blocks:
-                last_y = max(b[3] for b in blocks if b[3] > page_height * 0.1)
+            # Filter out headers/footers by position only: ignore blocks in the
+            # top 6% and bottom 4% of the page (page numbers, running titles)
+            body_blocks = [
+                b for b in blocks
+                if b[3] > page_height * 0.06  # below header
+                and b[1] < page_height * 0.96  # above footer
+            ]
+            if body_blocks:
+                last_y = max(b[3] for b in body_blocks)
                 fill_ratio = last_y / page_height
             else:
                 fill_ratio = 0.0
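
The position-only filter and the resulting fill ratio can be sketched outside the compiler as follows. This is a minimal illustration, not the code in ark/compiler.py: the helper names `filter_body_blocks` and `fill_ratio`, the sample page height, and the sample blocks are all hypothetical, and the tuples only mimic the `(x0, y0, x1, y1, text, block_no, block_type)` layout that PyMuPDF returns from `get_text("blocks")`.

```python
from typing import List, Tuple

# Assumed block layout, mirroring PyMuPDF's get_text("blocks") tuples:
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]

HEADER_FRACTION = 0.06  # blocks ending within the top 6% count as header
FOOTER_FRACTION = 0.96  # blocks starting below 96% count as footer


def filter_body_blocks(blocks: List[Block], page_height: float) -> List[Block]:
    """Keep blocks by position only; text length is never inspected."""
    return [
        b for b in blocks
        if b[3] > page_height * HEADER_FRACTION   # bottom edge below header band
        and b[1] < page_height * FOOTER_FRACTION  # top edge above footer band
    ]


def fill_ratio(blocks: List[Block], page_height: float) -> float:
    """Fraction of the page height reached by the lowest body block."""
    body = filter_body_blocks(blocks, page_height)
    if not body:
        return 0.0
    return max(b[3] for b in body) / page_height


# Hypothetical page: 800 pt tall, so the header band ends at 48 pt
# and the footer band starts at 768 pt.
page_height = 800.0
blocks: List[Block] = [
    (72, 20, 300, 32, "Running title", 0, 0),         # dropped: ends at 32 pt
    (72, 100, 540, 400, "Body paragraph ...", 1, 0),  # kept
    (200, 420, 230, 432, "Fig. 2", 2, 0),             # kept despite short text
    (300, 780, 312, 790, "7", 3, 0),                  # dropped: starts at 780 pt
]

kept = filter_body_blocks(blocks, page_height)
print([b[4] for b in kept])
print(fill_ratio(blocks, page_height))
```

In this sketch the six-character figure label survives because only its coordinates are tested, while the page number is excluded because it starts inside the footer band. An empty page falls through to a fill ratio of 0.0 instead of calling `max` on an empty sequence.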
