fix: restore mempalace compress after stats rename (#159) #162

Closed
mvalentsev wants to merge 2 commits into MemPalace:develop from mvalentsev:fix/cli-compress-stats-keys

Conversation

@mvalentsev
Contributor

@mvalentsev mvalentsev commented Apr 7, 2026

What does this PR do?

Fixes points 1 and 2 from #159. mempalace compress was dead after #147 renamed the Dialect.compression_stats() keys:

  • ratio became size_ratio
  • compressed_chars became summary_chars
  • original_tokens / compressed_tokens became original_tokens_est / summary_tokens_est

cmd_compress still read all four old names, so the first drawer it touched raised KeyError: 'compressed_chars' and the whole command crashed before writing anything.

While in there, the summary line at the bottom of cmd_compress was also wrong in a subtler way. It tried to turn the char totals into token totals by calling Dialect.count_tokens("x" * total_original). count_tokens is word-based (max(1, int(len(text.split()) * 1.3))), and a long string of repeated xs is a single "word", so both totals were always 1 and the user saw Total: 1t -> 1t (1.0x compression) regardless of the actual workload. The fix accumulates the per-drawer original_tokens_est / summary_tokens_est during the main loop and uses a token-based ratio so the summary is self-consistent with the per-drawer dry-run line.
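
To make the failure mode concrete, here is a minimal sketch that reproduces the word-based estimator quoted above as a standalone function. The real method lives on Dialect; the helper names `old_summary_total` and `new_summary_total` are illustrative and are not taken from cli.py.

```python
def count_tokens(text: str) -> int:
    """Word-based estimate, using the formula quoted in this PR."""
    return max(1, int(len(text.split()) * 1.3))


def old_summary_total(total_chars: int) -> int:
    """Old approach: rebuild a string of the right length and count it."""
    # "x" * n contains no whitespace, so split() yields exactly one word
    # and the estimate collapses to 1 for any n >= 1.
    return count_tokens("x" * total_chars)


def new_summary_total(drawer_texts: list[str]) -> int:
    """New approach: sum the per-drawer estimates directly."""
    return sum(count_tokens(text) for text in drawer_texts)


if __name__ == "__main__":
    drawers = ["alpha beta gamma delta epsilon " * 60] * 3
    total_chars = sum(len(d) for d in drawers)
    print(old_summary_total(total_chars))  # 1, regardless of input size
    print(new_summary_total(drawers))      # 1170 (3 drawers x 390)
```

The old path discards all word boundaries before counting, which is why the summary could never report anything other than 1t.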

The storage metadata keys on the mempalace_compressed collection (compression_ratio, original_tokens) keep their old names so anything already reading them still works. Only the source of the values is updated.
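
The mapping between the renamed stats fields and the unchanged storage keys can be sketched as follows. `build_comp_meta` is a hypothetical helper written for illustration; in the PR the two assignments are described as happening inline in cmd_compress, and the shape of the source metadata is an assumption.

```python
def build_comp_meta(source_meta: dict, stats: dict) -> dict:
    """Copy a drawer's metadata and attach compression stats.

    The stored key names stay as they were before #147; only the
    stats fields they are read from use the new names.
    """
    comp_meta = dict(source_meta)  # avoid mutating the caller's dict
    comp_meta["compression_ratio"] = stats["size_ratio"]
    comp_meta["original_tokens"] = stats["original_tokens_est"]
    return comp_meta
```

Readers of the mempalace_compressed collection keep looking up `compression_ratio` and `original_tokens`, so no migration is implied by this change.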

How to test

# 1. A tiny palace to compress against
mkdir -p /tmp/compress_smoke && cd /tmp/compress_smoke
printf 'wing: test\nrooms:\n  - {name: general}\n' > mempalace.yaml
printf 'We decided to use Postgres because JSONB is a better fit. ' > content.md
printf '%s\n' "$(python -c 'print("alpha beta gamma delta epsilon " * 60)')" >> content.md
mempalace init .
mempalace mine .

# 2. Dry-run, previously crashed with KeyError on the first drawer
mempalace compress --dry-run
# Expected: per-drawer line shows positive token counts (not 1t),
# and Total: Nt -> Mt (Rx compression) with N, M, R all > 1

# 3. Real compress, writes the compressed collection
mempalace compress
# Expected: "Stored N compressed drawers in 'mempalace_compressed' collection."

# 4. Linter + existing tests
ruff check mempalace/cli.py
ruff format --check mempalace/cli.py
python -m pytest tests/ -v

Checklist

  • Linter passes (ruff check ., ruff format --check .)
  • No hardcoded paths
  • Python 3.9 compatible
  • Existing tests still pass (pytest tests/test_miner.py tests/test_config.py tests/test_normalize.py tests/test_version_consistency.py -q - 21 passed)

Notes on testing scope

There are no unit tests for cmd_compress today and adding one would need a fake ChromaDB collection fixture that the repo does not have yet. The fix is verified end to end via the dry-run and real-compress walkthroughs above. Can follow up with a CLI test harness in a separate PR if useful.
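
As a rough idea of what such a fixture could look like, here is a minimal in-memory stand-in. It is a sketch only: `FakeCollection` is a hypothetical name, it mirrors just a keyword-only subset of the `get`, `upsert`, and `count` methods that ChromaDB collections expose, and which of those methods cmd_compress actually calls is an assumption that would need checking against cli.py.

```python
class FakeCollection:
    """In-memory stand-in for a ChromaDB collection (test helper sketch)."""

    def __init__(self):
        # Maps id -> (document, metadata); dicts preserve insertion order.
        self._rows = {}

    def upsert(self, *, ids, documents, metadatas):
        """Insert or overwrite rows by id."""
        for row_id, doc, meta in zip(ids, documents, metadatas):
            self._rows[row_id] = (doc, dict(meta))

    def get(self, *, limit=None, offset=0):
        """Return rows in insertion order, shaped like a ChromaDB get() result."""
        ids = list(self._rows)
        end = None if limit is None else offset + limit
        page = ids[offset:end]
        return {
            "ids": page,
            "documents": [self._rows[i][0] for i in page],
            "metadatas": [self._rows[i][1] for i in page],
        }

    def count(self):
        """Number of stored rows."""
        return len(self._rows)
```

A CLI test could seed one fake with drawers, hand a second empty fake to the code under test as the compressed collection, and assert on what was written to it.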

Unrelated: tests/test_dialect.py has two pre-existing failing assertions from the same PR #147 rename (test_stats / test_count_tokens). Those are addressed in #150 and are not touched here to keep this PR focused.

@adv3nt3 adv3nt3 mentioned this pull request Apr 7, 2026
@bgauryy

bgauryy commented Apr 8, 2026

PR Review: fix: restore mempalace compress after stats rename (#159)

Executive Summary

| Aspect | Value |
| --- | --- |
| PR Goal | Fix mempalace compress crash caused by stale compression_stats() key names after PR #147 rename |
| Files Changed | 1 (mempalace/cli.py) |
| Risk Level | 🟢 LOW: single-file key rename alignment, no logic changes |
| Review Effort | 1 (trivial, mechanical fix) |
| Recommendation | ✅ APPROVE |

Affected Areas: cmd_compress() in cli.py — the CLI compress command that reads drawers and writes AAAK-compressed versions.

Business Impact: mempalace compress was completely broken (crashed on first drawer with KeyError: 'compressed_chars'). This restores the feature.

Flow Changes: The summary section switches from char-based accumulation (re-estimating tokens at the end) to direct token-estimate accumulation. Output format and stored metadata are functionally identical.

Ratings

| Aspect | Score |
| --- | --- |
| Correctness | 5/5 |
| Security | 5/5 |
| Performance | 5/5 |
| Maintainability | 5/5 |

PR Health

  • Has clear description
  • References ticket/issue (Bugs and improvements #159)
  • Appropriate size (23 lines changed)
  • Has relevant tests — no new tests, but existing test_dialect.py covers compression_stats()

Key Changes Verified

1. Stats Key Alignment (Correct)

All old → new key mappings match what compression_stats() returns on main:

| Old Key (broken) | New Key (PR fix) | Verified in dialect.py |
| --- | --- | --- |
| original_chars | original_tokens_est | "original_tokens_est": orig_tokens |
| compressed_chars | summary_tokens_est | "summary_tokens_est": comp_tokens |
| original_tokens | original_tokens_est | ✅ same |
| compressed_tokens | summary_tokens_est | ✅ same |
| ratio | size_ratio | "size_ratio": round(orig_tokens / max(comp_tokens, 1), 1) |

2. Accumulator Simplification (Improvement)

Before: Accumulated chars, then re-estimated tokens via Dialect.count_tokens("x" * total_chars) — creating a massive temporary string just to divide by 3.

After: Directly sums per-drawer token estimates. Avoids O(n) memory allocation for the summary string, and stays consistent with per-drawer output.

3. Stored Metadata (Correct)

  • comp_meta["compression_ratio"] now reads stats["size_ratio"]
  • comp_meta["original_tokens"] now reads stats["original_tokens_est"]
  • The ChromaDB metadata keys themselves (compression_ratio, original_tokens) are unchanged — no migration needed.

4. Other Callers

  • dialect.py __main__ block: already updated to new keys on main
  • tests/test_dialect.py: tests compression_stats() return values directly — no cli.py integration tests affected

Issues Found

None. This is a clean, well-scoped bug fix.


Created by Octocode MCP https://octocode.ai 🔍🐙

@mvalentsev mvalentsev force-pushed the fix/cli-compress-stats-keys branch 7 times, most recently from 5212c4c to eef9b6f Compare April 10, 2026 15:50
@mvalentsev mvalentsev force-pushed the fix/cli-compress-stats-keys branch from eef9b6f to 085423e Compare April 10, 2026 16:40

@web3guru888 web3guru888 left a comment

Review: fix: restore mempalace compress after stats rename (#159)

This is the most thorough of the three PRs fixing the cmd_compress KeyError regression introduced when #147 renamed Dialect.compression_stats() keys. Here's my analysis comparing all three approaches.

Background: Three PRs, One Bug

Three PRs address the same KeyError: 'compressed_chars' crash in cmd_compress:

  • #162 (this PR, @mvalentsev) — comprehensive fix with token accumulator refactor
  • #569 (@arnoldwender) — minimal key-swap only, adds a live Dialect integration test
  • #588 (@helhindi) — minimal key-swap only, updates test mocks only

All three correctly replace the four stale keys (compressed_chars → summary_chars, original_tokens → original_tokens_est, compressed_tokens → summary_tokens_est, ratio → size_ratio). The crash fix itself is identical in all three.

What #162 Does Better

Fixes a second, pre-existing bug. The summary line at the end of cmd_compress was accumulating total_original and total_compressed from character counts (original_chars, summary_chars), then passing those char totals to Dialect.count_tokens("x" * total_original) to compute token estimates. But count_tokens is word-based: a long string of repeated x characters is one "word", so the summary always printed Total: 1t -> 1t (1.0x) regardless of actual workload. This is a genuine silent correctness bug that #569 and #588 both miss entirely.

The fix — renaming accumulators to total_orig_tokens / total_comp_tokens and summing original_tokens_est / summary_tokens_est from per-drawer stats — is semantically correct and self-consistent with the per-drawer output lines. This is the right approach.

Backward-compatible metadata keys. The PR is explicit that comp_meta["compression_ratio"] and comp_meta["original_tokens"] keep their old names even though the values now come from the renamed stats fields. This is a thoughtful choice — downstream consumers reading the mempalace_compressed collection won't break.

Issues and Suggestions

Missing test coverage — acknowledged by the author. The cmd_compress path has zero unit tests. #569 adds an integration test (test_compression_stats_keys_match_dialect) that calls the real Dialect class and asserts the key set. That test would immediately catch regressions like the one in #147. I'd strongly suggest cherry-picking or adapting that test for this PR, or filing a follow-up issue to track it. Shipping a fix without a regression guard leaves the door open for #147-style renames to slip through again.
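
A regression guard along these lines could look like the sketch below. It is not the test from #569: `stub_compression_stats` is a hypothetical stand-in for Dialect.compression_stats() so the example runs on its own, and `CLI_REQUIRED_KEYS` lists the keys that this thread says the fixed cmd_compress reads. In the real test suite the stub would be replaced by a call to the actual Dialect class.

```python
# Keys the fixed cmd_compress is described as reading in this thread.
CLI_REQUIRED_KEYS = {
    "original_tokens_est",
    "summary_tokens_est",
    "size_ratio",
    "summary_chars",
}


def _count_tokens(text: str) -> int:
    return max(1, int(len(text.split()) * 1.3))


def stub_compression_stats(original: str, summary: str) -> dict:
    """Hypothetical stand-in that returns the post-#147 key names."""
    orig_tokens = _count_tokens(original)
    comp_tokens = _count_tokens(summary)
    return {
        "original_chars": len(original),
        "summary_chars": len(summary),
        "original_tokens_est": orig_tokens,
        "summary_tokens_est": comp_tokens,
        "size_ratio": round(orig_tokens / max(comp_tokens, 1), 1),
    }


def test_cli_keys_present(stats_fn=stub_compression_stats):
    """Fail loudly if a key the CLI depends on disappears or is renamed."""
    stats = stats_fn("alpha beta gamma delta epsilon", "alpha")
    missing = CLI_REQUIRED_KEYS - stats.keys()
    assert not missing, f"compression_stats() no longer returns: {sorted(missing)}"
```

Because the assertion compares key sets rather than values, it stays stable when the estimation formula changes but breaks as soon as a key is renamed.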

Summary ratio computation. After the accumulator fix, the summary ratio is computed as total_orig_tokens / total_comp_tokens. This will ZeroDivisionError if total_comp_tokens is 0 (e.g., all drawers compressed to empty strings). Worth guarding: ratio = total_orig_tokens / total_comp_tokens if total_comp_tokens else 0.

Pre-existing test failures flagged — the author notes test_stats and test_count_tokens in test_dialect.py are already failing from #147 and are addressed in #150. Good disclosure; just confirming this PR doesn't touch those and won't make them worse.

Diff Quality

The cli.py changes are clean. The accumulator rename from total_original/total_compressed to total_orig_tokens/total_comp_tokens improves readability — it's now obvious what unit is being tracked. The f-string split across two lines is also fine for 88-char line limits.

Comparison Verdict

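| Criterion | #162 (this PR) | #569 | #588 |
| --- | --- | --- | --- |
| Fixes KeyError crash | ✅ | ✅ | ✅ |
| Fixes broken summary totals | ✅ | ❌ | ❌ |
| Backward-compat metadata keys | ✅ explicit | ✅ implicit | ✅ implicit |
| Adds regression test | ❌ | ✅ integration test | ❌ |
| Test mock updates | | | ✅ |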

Recommendation: Merge #162 and close #569 and #588 as superseded — but only after adding a regression guard for the zero-division edge case and ideally pulling in the integration test from #569. #162 is the only one that actually improves the correctness of the feature beyond just not crashing.

Overall

APPROVE with minor suggestions. The core fix is correct, the accumulator refactor is genuinely better, and the backward-compatibility note on metadata keys shows careful thinking. Add a guard for the zero-division edge case and a regression test before merge.


Reviewed by MemPalace-AGI — autonomous research system with perfect memory

@mvalentsev
Contributor Author

Quick correction on the ZeroDivisionError flag -- the ratio line already uses max(total_comp_tokens, 1) as the denominator (cli.py:389), so the zero case is handled. The comparison with #569 and #588 is otherwise accurate.
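
For reference, the guard described here can be sketched in isolation. The function names are illustrative, and the output format is reconstructed from the "Total: 1t -> 1t (1.0x compression)" line quoted in the PR body rather than copied from cli.py.

```python
def summary_ratio(total_orig_tokens: int, total_comp_tokens: int) -> float:
    """Token-based ratio with the denominator clamped to at least 1."""
    return total_orig_tokens / max(total_comp_tokens, 1)


def format_summary(total_orig_tokens: int, total_comp_tokens: int) -> str:
    """Render the summary line in the format quoted in the PR description."""
    ratio = summary_ratio(total_orig_tokens, total_comp_tokens)
    return (
        f"Total: {total_orig_tokens}t -> {total_comp_tokens}t "
        f"({ratio:.1f}x compression)"
    )
```

With the clamp in place, a run where every summary is empty reports the original token count as the ratio instead of raising ZeroDivisionError.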

@web3guru888

Clean fix. All four stale key references are updated consistently (compressed_chars → summary_tokens_est, original_tokens → original_tokens_est, ratio → size_ratio) and the test mocks are updated in both test cases.

The summary simplification is a nice bonus — the old code was converting token estimates back to a character count and then re-running count_tokens() on a synthetic string of that length, which was a lossy round-trip. Using the direct original_tokens_est / summary_tokens_est from compression_stats() is more accurate.

On the ZeroDivisionError flag @mvalentsev noted: confirmed — max(total_comp_tokens, 1) is already in the ratio line, so the zero case is handled. The PR is correct as written.

One minor observation: comp_meta["original_tokens"] = stats["original_tokens_est"] stores the value under the old key name. For palaces that already have compressed entries (created before this fix), the metadata key name will be inconsistent. Not a blocker for this PR — it's a cosmetic inconsistency — but worth noting for a follow-up migration note or a key rename in the metadata write.

LGTM otherwise.

The honest-stats rename in PR MemPalace#147 changed the keys returned by
Dialect.compression_stats() (ratio -> size_ratio, compressed_chars ->
summary_chars, original_tokens / compressed_tokens ->
original_tokens_est / summary_tokens_est). cmd_compress still reads
the old names, so mempalace compress throws KeyError on the first
drawer it touches and the feature is effectively dead.

Also fix the summary line at the bottom of cmd_compress. It called
count_tokens("x" * total_original), but count_tokens is word-based
(max(1, int(len(text.split()) * 1.3))), and a string of repeated
xs is a single "word", so both totals were always 1. Accumulate
the per-drawer estimates during the main loop instead, and use a
token-based ratio so the summary line is self-consistent with the
per-drawer dry-run output.

The storage metadata key names on the compressed collection
(compression_ratio, original_tokens) stay the same for compatibility
with anything already reading them. Only the source of the values
is updated.

Fixes MemPalace#159 (points 1 and 2)

The new test_cmd_compress_dry_run and test_cmd_compress_stores_results
tests (added upstream after this branch diverged) mock
compression_stats() with the old key names. Update the mocks to use
the post-MemPalace#147 keys (original_tokens_est, summary_tokens_est,
size_ratio, summary_chars) so they match what the fixed cmd_compress
actually reads.
@mvalentsev mvalentsev force-pushed the fix/cli-compress-stats-keys branch from 085423e to 9cc1bcf Compare April 11, 2026 11:25
@mvalentsev
Contributor Author

The comp_meta["original_tokens"] key is deliberately kept under its old name for backward compat -- noted in the PR body. Renaming would break anything already reading the mempalace_compressed collection, so the mismatch between source field (original_tokens_est) and stored key (original_tokens) is intentional.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:23
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:23
@mvalentsev
Contributor Author

Closing -- #569 and #609 on develop together cover all three fixes this PR addressed: the stats key alignment, the backward-compat metadata key preservation, and the token summary bug.

One minor note for anyone touching the compress path later: the merged #609 estimates summary tokens via chars / 3.8, while the per-drawer lines still use Dialect.compression_stats() word-based estimates. Not a regression, just a small inconsistency between the two outputs. Easy follow-up if anyone cares about making them consistent.
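
The size of that inconsistency is easy to see with a small comparison. This is a sketch under assumptions: the word-based formula is the one quoted earlier in this PR, while the rounding and the lower bound of 1 in `char_based_estimate` are guesses, since only the chars / 3.8 divisor is stated here.

```python
def word_based_estimate(text: str) -> int:
    """Per-drawer style estimate: words * 1.3, as quoted in this PR."""
    return max(1, int(len(text.split()) * 1.3))


def char_based_estimate(text: str, chars_per_token: float = 3.8) -> int:
    """Summary style estimate: chars / 3.8 (truncation is an assumption)."""
    return max(1, int(len(text) / chars_per_token))


if __name__ == "__main__":
    sample = "alpha beta gamma delta epsilon " * 60
    print(word_based_estimate(sample))  # 390
    print(char_based_estimate(sample))  # 489
```

On this sample the two estimators differ by roughly a quarter, so per-drawer lines and the summary total would not add up exactly if they use different methods.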

@mvalentsev mvalentsev closed this Apr 11, 2026