fix: restore mempalace compress after stats rename (#159) #162
mvalentsev wants to merge 2 commits into MemPalace:develop from
Conversation
PR Review: fix: restore mempalace compress after stats rename (#159)
Executive Summary
Affected Areas:
Business Impact:
Flow Changes: The summary section switches from char-based accumulation (re-estimating tokens at the end) to direct token-estimate accumulation. Output format and stored metadata are functionally identical.
Ratings
PR Health
Key Changes Verified
1. Stats Key Alignment (Correct)
All old → new key mappings match what
2. Accumulator Simplification (Improvement)
Before: Accumulated chars, then re-estimated tokens via count_tokens.
After: Directly sums per-drawer token estimates. Avoids O(n) memory allocation for the summary string, and stays consistent with per-drawer output.
3. Stored Metadata (Correct)
4. Other Callers
Issues Found
None. This is a clean, well-scoped bug fix.
Created by Octocode MCP https://octocode.ai 🔍🐙
web3guru888 left a comment
Review: fix: restore mempalace compress after stats rename (#159)
This is the most thorough of the three PRs fixing the cmd_compress KeyError regression introduced when #147 renamed Dialect.compression_stats() keys. Here's my analysis comparing all three approaches.
Background: Three PRs, One Bug
Three PRs address the same KeyError: 'compressed_chars' crash in cmd_compress:
- #162 (this PR, @mvalentsev) — comprehensive fix with token accumulator refactor
- #569 (@arnoldwender) — minimal key-swap only, adds a live Dialect integration test
- #588 (@helhindi) — minimal key-swap only, updates test mocks only
All three correctly replace the four stale keys (compressed_chars → summary_chars, original_tokens → original_tokens_est, compressed_tokens → summary_tokens_est, ratio → size_ratio). The crash fix itself is identical in all three.
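For reference, the four renames can be captured as a simple mapping. This is an illustrative sketch, not code from the repo; the key names are taken verbatim from the review:

```python
# Old -> new key names from the #147 Dialect.compression_stats() rename
# (illustrative mapping; not code from the repo).
KEY_RENAMES = {
    "compressed_chars": "summary_chars",
    "original_tokens": "original_tokens_est",
    "compressed_tokens": "summary_tokens_est",
    "ratio": "size_ratio",
}

def migrate_stats_keys(stats: dict) -> dict:
    """Rewrite pre-#147 key names to post-#147 ones, passing others through."""
    return {KEY_RENAMES.get(k, k): v for k, v in stats.items()}
```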
What #162 Does Better
Fixes a second, pre-existing bug. The summary line at the end of cmd_compress was accumulating total_original and total_compressed from character counts (original_chars, summary_chars), then passing those char totals to Dialect.count_tokens("x" * total_original) to compute token estimates. But count_tokens is word-based — a long string of repeated x characters is one "word", so the summary always printed Total: 1t -> 1t (1.0x) regardless of actual workload. This is a genuine silent correctness bug that #569 and #588 both miss entirely.
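The broken summary math is easy to reproduce with the word-based estimator quoted in this review (a sketch; the real Dialect.count_tokens may differ beyond the quoted formula):

```python
def count_tokens(text: str) -> int:
    # Word-based estimator as quoted in the review: ~1.3 tokens per whitespace word.
    return max(1, int(len(text.split()) * 1.3))

# Old summary path: rebuild a synthetic string from the char total, then estimate.
total_original = 48_000  # chars accumulated across all drawers
estimate = count_tokens("x" * total_original)
# "x" * 48_000 contains no whitespace, so it splits into one "word".
print(estimate)  # 1, regardless of the actual workload
```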
The fix — renaming accumulators to total_orig_tokens / total_comp_tokens and summing original_tokens_est / summary_tokens_est from per-drawer stats — is semantically correct and self-consistent with the per-drawer output lines. This is the right approach.
Backward-compatible metadata keys. The PR is explicit that comp_meta["compression_ratio"] and comp_meta["original_tokens"] keep their old names even though the values now come from the renamed stats fields. This is a thoughtful choice — downstream consumers reading the mempalace_compressed collection won't break.
Issues and Suggestions
Missing test coverage — acknowledged by the author. The cmd_compress path has zero unit tests. #569 adds an integration test (test_compression_stats_keys_match_dialect) that calls the real Dialect class and asserts the key set. That test would immediately catch regressions like the one in #147. I'd strongly suggest cherry-picking or adapting that test for this PR, or filing a follow-up issue to track it. Shipping a fix without a regression guard leaves the door open for #147-style renames to slip through again.
Summary ratio computation. After the accumulator fix, the summary ratio is computed as total_orig_tokens / total_comp_tokens. This will raise ZeroDivisionError if total_comp_tokens is 0 (e.g., all drawers compressed to empty strings). Worth guarding: ratio = total_orig_tokens / total_comp_tokens if total_comp_tokens else 0.
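A minimal sketch of the suggested guard (variable names follow the PR; the values here are made up):

```python
# Guard against ZeroDivisionError when every drawer compressed to nothing.
total_orig_tokens = 120
total_comp_tokens = 0  # e.g. all summaries were empty strings
ratio = total_orig_tokens / total_comp_tokens if total_comp_tokens else 0
print(f"Total: {total_orig_tokens}t -> {total_comp_tokens}t ({ratio:.1f}x compression)")
# prints "Total: 120t -> 0t (0.0x compression)" instead of crashing
```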
Pre-existing test failures flagged — the author notes test_stats and test_count_tokens in test_dialect.py are already failing from #147 and are addressed in #150. Good disclosure; just confirming this PR doesn't touch those and won't make them worse.
Diff Quality
The cli.py changes are clean. The accumulator rename from total_original/total_compressed to total_orig_tokens/total_comp_tokens improves readability — it's now obvious what unit is being tracked. The f-string split across two lines is also fine for 88-char line limits.
Comparison Verdict
| Criterion | #162 (this PR) | #569 | #588 |
|---|---|---|---|
| Fixes KeyError crash | ✅ | ✅ | ✅ |
| Fixes broken summary totals | ✅ | ❌ | ❌ |
| Backward-compat metadata keys | ✅ explicit | ✅ implicit | ✅ implicit |
| Adds regression test | ❌ | ✅ integration test | ❌ |
| Test mock updates | ✅ | ✅ | ✅ |
Recommendation: Merge #162 and close #569 and #588 as superseded — but only after adding a regression guard for the zero-division edge case and ideally pulling in the integration test from #569. #162 is the only one that actually improves the correctness of the feature beyond just not crashing.
Overall
APPROVE with minor suggestions. The core fix is correct, the accumulator refactor is genuinely better, and the backward-compatibility note on metadata keys shows careful thinking. Add a guard for the zero-division edge case and a regression test before merge.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
Clean fix. All four stale key references are updated consistently. The summary simplification is a nice bonus — the old code was converting token estimates back to a character count and then re-running count_tokens. LGTM otherwise.
The honest-stats rename in PR MemPalace#147 changed the keys returned by Dialect.compression_stats() (ratio -> size_ratio, compressed_chars -> summary_chars, original_tokens / compressed_tokens -> original_tokens_est / summary_tokens_est). cmd_compress still reads the old names, so mempalace compress throws KeyError on the first drawer it touches and the feature is effectively dead.

Also fix the summary line at the bottom of cmd_compress. It called count_tokens("x" * total_original), but count_tokens is word-based (max(1, int(len(text.split()) * 1.3))), and a string of repeated xs is a single "word", so both totals were always 1. Accumulate the per-drawer estimates during the main loop instead, and use a token-based ratio so the summary line is self-consistent with the per-drawer dry-run output.

The storage metadata key names on the compressed collection (compression_ratio, original_tokens) stay the same for compatibility with anything already reading them. Only the source of the values is updated.

Fixes MemPalace#159 (points 1 and 2)
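The accumulator change described above can be sketched like this (the stats dict shape is assumed from the key names in this PR; the numbers are made up):

```python
# Per-drawer stats as returned by the post-#147 Dialect.compression_stats()
# (only the two keys the summary needs; values are illustrative).
drawer_stats = [
    {"original_tokens_est": 800, "summary_tokens_est": 120},
    {"original_tokens_est": 450, "summary_tokens_est": 90},
]

# Sum the per-drawer token estimates directly instead of re-estimating
# from a synthetic "x" * total_original string at the end.
total_orig_tokens = sum(s["original_tokens_est"] for s in drawer_stats)
total_comp_tokens = sum(s["summary_tokens_est"] for s in drawer_stats)
ratio = total_orig_tokens / total_comp_tokens if total_comp_tokens else 0
print(f"Total: {total_orig_tokens}t -> {total_comp_tokens}t ({ratio:.1f}x compression)")
# -> Total: 1250t -> 210t (6.0x compression)
```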
The new test_cmd_compress_dry_run and test_cmd_compress_stores_results tests (added upstream after this branch diverged) mock compression_stats() with the old key names. Update the mocks to use the post-MemPalace#147 keys (original_tokens_est, summary_tokens_est, size_ratio, summary_chars) so they match what the fixed cmd_compress actually reads.
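A hedged sketch of what the mock update looks like (using unittest.mock; the actual tests in the repo may structure this differently):

```python
from unittest.mock import MagicMock

# Stand-in for the Dialect instance used by cmd_compress in the tests.
dialect = MagicMock()
dialect.compression_stats.return_value = {
    "original_chars": 1200,
    "summary_chars": 300,
    "original_tokens_est": 320,
    "summary_tokens_est": 80,
    "size_ratio": 0.25,
}

stats = dialect.compression_stats("original text", "summary")
assert "compressed_chars" not in stats    # pre-#147 key must be gone
assert stats["summary_tokens_est"] == 80  # post-#147 key is what cmd_compress reads
```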
Closing -- #569 and #609 on develop together cover all three fixes this PR addressed: the stats key alignment, the backward-compat metadata key preservation, and the token summary bug. One minor note for anyone touching the compress path later: the merged #609 estimates summary tokens via
What does this PR do?
Fixes points 1 and 2 from #159.
mempalace compress was dead after #147 renamed the Dialect.compression_stats() keys:

- ratio became size_ratio
- compressed_chars became summary_chars
- original_tokens / compressed_tokens became original_tokens_est / summary_tokens_est

cmd_compress still read all four old names, so the first drawer it touched raised KeyError: 'compressed_chars' and the whole command crashed before writing anything.

While in there, the summary line at the bottom of cmd_compress was also wrong in a subtler way. It tried to turn the char totals into token totals by calling Dialect.count_tokens("x" * total_original). count_tokens is word-based (max(1, int(len(text.split()) * 1.3))), and a long string of repeated xs is a single "word", so both totals were always 1 and the user saw Total: 1t -> 1t (1.0x compression) regardless of the actual workload. The fix accumulates the per-drawer original_tokens_est / summary_tokens_est during the main loop and uses a token-based ratio so the summary is self-consistent with the per-drawer dry-run line.

The storage metadata keys on the mempalace_compressed collection (compression_ratio, original_tokens) keep their old names so anything already reading them still works. Only the source of the values is updated.

How to test
Checklist
- ruff check ., ruff format --check .
- pytest tests/test_miner.py tests/test_config.py tests/test_normalize.py tests/test_version_consistency.py -q (21 passed)

Notes on testing scope
There are no unit tests for cmd_compress today, and adding one would need a fake ChromaDB collection fixture that the repo does not have yet. The fix is verified end to end via the dry-run and real-compress walkthroughs above. Can follow up with a CLI test harness in a separate PR if useful.

Unrelated: tests/test_dialect.py has two pre-existing failing assertions from the same PR #147 rename (test_stats / test_count_tokens). Those are addressed in #150 and are not touched here to keep this PR focused.