feat: add Pi agent JSONL session normalizer #169

Open
adv3nt3 wants to merge 1 commit into MemPalace:develop from adv3nt3:feat/pi-cli-normalizer

Conversation

@adv3nt3
Contributor

@adv3nt3 adv3nt3 commented Apr 7, 2026

Summary

Add a _try_pi_jsonl parser for Pi agent session files stored at ~/.config/pi/agent/sessions/{encoded-cwd}/{timestamp}_{uuid}.jsonl. This is the eighth format MemPalace can normalize, alongside Claude AI JSON, ChatGPT JSON, Claude Code JSONL, Codex CLI JSONL (#61), Gemini CLI JSON (#155), Slack JSON, and plain text.

Pi session format

Pi stores sessions as JSONL files with a tree-structured message history. Sessions are project-scoped — folder names encode the working directory path (e.g. --home-arcka-openclaude--).

Path: ~/.config/pi/agent/sessions/{encoded-cwd}/{timestamp}_{uuid}.jsonl

Structure (one JSON object per line):

{"type":"session","version":3,"id":"c9db2d16-...","timestamp":"2026-04-02T23:48:11.257Z","cwd":"/home/arcka/openclaude"}
{"type":"model_change","id":"62d8b4f0","parentId":null,"timestamp":"...","provider":"github-copilot","modelId":"claude-opus-4.6"}
{"type":"thinking_level_change","id":"c7f4db51","parentId":"62d8b4f0","timestamp":"...","thinkingLevel":"high"}
{"type":"message","id":"m1","parentId":"c7f4db51","timestamp":"...","message":{"role":"user","content":[{"type":"text","text":"Explain the architecture"}]}}
{"type":"message","id":"m2","parentId":"m1","timestamp":"...","message":{"role":"assistant","content":[{"type":"text","text":"The project uses..."}],"provider":"github-copilot","model":"claude-opus-4.6","stopReason":"stop"}}
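As a rough illustration of the one-object-per-line layout above, a reader might walk a session file like this. This is only a sketch, not the code in this PR: iter_entries and SAMPLE are hypothetical names, and the sample lines are shortened, made-up data.

```python
import json

# Made-up, shortened sample in the shape shown above (not real session data).
SAMPLE = "\n".join([
    '{"type":"session","version":3,"id":"s1","cwd":"/home/arcka/openclaude"}',
    '{"type":"model_change","id":"a1","parentId":null,"provider":"github-copilot"}',
    '',
    'not json',
    '{"type":"message","id":"m1","parentId":"a1","message":{"role":"user","content":"hi"}}',
])


def iter_entries(text):
    """Yield one dict per well-formed JSON object line; skip everything else."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a corrupt or truncated line
        if isinstance(entry, dict):
            yield entry


entry_types = [entry.get("type") for entry in iter_entries(SAMPLE)]
print(entry_types)
```

Skipping unparseable lines instead of raising is a choice made for the sketch; a stricter reader could reject the whole file.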

Event types

| type value | Contains | Extracted? |
|---|---|---|
| session | Session header: version, id, cwd | Fingerprint only |
| message (role user) | User prompts: content as string or [{type, text}] blocks | Yes |
| message (role assistant) | Assistant replies: content as [{type, text}] blocks, may include thinking blocks | Yes (text only, thinking skipped) |
| message (role toolResult) | Tool outputs: toolCallId, toolName, content | Skipped |
| model_change | Provider/model switches | Skipped |
| thinking_level_change | Reasoning level adjustments | Skipped |
| compaction | Context summarization events | Skipped |
| branch_summary | Branch points in tree history | Skipped |
| label | Bookmark labels on messages | Skipped |
| custom / custom_message | Extension data | Skipped |

TypeScript types (from Pi source)

interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
  timestamp: number;
}

interface AssistantMessage {
  role: "assistant";
  content: (TextContent | ThinkingContent | ToolCall)[];
  provider: string;
  model: string;
  usage: Usage;
  stopReason: "stop" | "length" | "toolUse" | "error" | "aborted";
}

Design decisions

Only user and assistant messages extracted

The parser extracts type: "message" entries where message.role is "user" or "assistant". Tool results, model changes, thinking level changes, compaction events, and branch summaries are skipped — they're operational metadata, not conversation content.
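The filtering rule described above could be sketched roughly as follows. conversation_messages is a hypothetical name used only for illustration, and the entries are made-up sample data (including the toolName value); the actual parser in normalize.py may be structured differently.

```python
def conversation_messages(entries):
    """Return the inner message dicts whose role is "user" or "assistant"."""
    kept = []
    for entry in entries:
        # Only type "message" entries can carry conversation content.
        if not isinstance(entry, dict) or entry.get("type") != "message":
            continue
        message = entry.get("message")
        if isinstance(message, dict) and message.get("role") in ("user", "assistant"):
            kept.append(message)
    return kept


entries = [
    {"type": "session", "version": 3},
    {"type": "model_change", "provider": "github-copilot"},
    {"type": "message", "message": {"role": "user", "content": "Explain the architecture"}},
    {"type": "message", "message": {"role": "toolResult", "toolName": "read", "content": []}},
    {"type": "message", "message": {"role": "assistant",
                                    "content": [{"type": "text", "text": "The project uses..."}]}},
    {"type": "compaction"},
]
roles = [message["role"] for message in conversation_messages(entries)]
print(roles)
```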

thinking blocks in assistant content are automatically skipped

Assistant content can include {"type": "thinking", ...} blocks alongside {"type": "text", ...} blocks. The shared _extract_content helper only picks up type == "text", so thinking is naturally filtered out.

Aborted/empty messages are skipped

If an assistant message has empty content ([]) or only thinking blocks, _extract_content returns an empty string and the message is not added to the transcript.
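To make the two rules above concrete, here is a minimal stand-in for the text extraction step. extract_text is a hypothetical function written for this illustration; it is not the shared _extract_content helper, whose exact behavior (for example how it joins multiple text blocks) may differ. The "thinking" payload field name in the sample is also an assumption.

```python
def extract_text(content):
    """Collect text from a string or a list of content blocks; ignore the rest."""
    if isinstance(content, str):
        return content.strip()
    if not isinstance(content, list):
        return ""
    parts = []
    for block in content:
        # Only blocks typed "text" contribute; thinking/tool blocks fall through.
        if isinstance(block, dict) and block.get("type") == "text":
            text = block.get("text")
            if isinstance(text, str):
                parts.append(text)
    return " ".join(parts).strip()


messages = [
    {"role": "user", "content": "Explain the architecture"},
    {"role": "assistant", "content": [
        {"type": "thinking", "thinking": "plan the answer"},  # assumed field name
        {"type": "text", "text": "The project uses..."},
    ]},
    {"role": "assistant", "content": []},  # aborted or empty reply
    {"role": "assistant", "content": [{"type": "thinking", "thinking": "only planning"}]},
]

transcript = []
for message in messages:
    text = extract_text(message.get("content"))
    if text:  # empty results are dropped rather than stored as blank turns
        transcript.append((message["role"], text))
print(transcript)
```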

Fingerprints on session header with version key

The parser requires a type: "session" line with a version field to positively identify Pi session files. This distinguishes Pi sessions from:

  • Codex JSONL — uses type: "session_meta" (no version key)
  • Claude Code JSONL — uses type: "human" / type: "assistant" at top level
  • Other JSONL — unlikely to have both type: "session" and version
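The fingerprint check could look something like the sketch below. looks_like_pi_session is a hypothetical name, and scanning every line for the header (rather than, say, only the first line) is an assumption of this sketch, not a statement about how _try_pi_jsonl does it. The Codex and Claude Code samples are minimal made-up lines that only reproduce the type values listed above.

```python
import json


def looks_like_pi_session(text):
    """Return True if any line is a {"type": "session", "version": ...} header."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain text and broken lines never match
        if isinstance(entry, dict) and entry.get("type") == "session" and "version" in entry:
            return True
    return False


pi_sample = '{"type":"session","version":3,"cwd":"/home/arcka/openclaude"}'
codex_sample = '{"type":"session_meta"}'
claude_code_sample = '{"type":"human"}'

print(looks_like_pi_session(pi_sample))
print(looks_like_pi_session(codex_sample))
```

Requiring both keys keeps the check cheap while making an accidental match on unrelated JSONL less likely.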

Uses shared _extract_content helper

Pi's user content blocks use the standard {"type": "text", "text": "..."} format (same as Claude/OpenAI), so the shared helper works directly, unlike Gemini, which needed custom extraction.

What's NOT handled (and why)

  • Tree structure / branching: Pi sessions support branching via parentId chains. The parser reads messages linearly (file order) without reconstructing the branch tree. This matches how all other parsers work — linear extraction.
  • Compaction summaries: Pi can compact history mid-session. The parser skips compaction events — they're internal context management, not user conversation.
  • Tool call details: Only the text portion of assistant messages is extracted. Tool names, arguments, and results are skipped.
  • Image content: Pi supports ImageContent in messages. The parser skips these (no text to extract).
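For context on the first point, this is roughly what following parentId links would involve, compared with reading in file order. It is an illustration of the approach the parser does not take, under two assumptions: the last entry in the file is the leaf of the active branch, and the session header sits outside the tree. active_branch and the sample entries are hypothetical.

```python
def active_branch(entries):
    """Return entries from the root to the last entry, following parentId links."""
    # The session header has an id but no parentId, so leave it out of the tree.
    nodes = [e for e in entries if e.get("type") != "session" and "id" in e]
    if not nodes:
        return []
    by_id = {e["id"]: e for e in nodes}
    path, seen = [], set()
    node = nodes[-1]  # assumption: the last entry is the active leaf
    while node is not None and node["id"] not in seen:  # seen guards against cycles
        seen.add(node["id"])
        path.append(node)
        node = by_id.get(node.get("parentId"))
    path.reverse()
    return path


def msg(entry_id, parent_id, role):
    """Build a minimal made-up message entry."""
    return {"type": "message", "id": entry_id, "parentId": parent_id,
            "message": {"role": role, "content": "..."}}


entries = [
    {"type": "session", "version": 3, "id": "s"},
    msg("m1", None, "user"),
    msg("m2", "m1", "assistant"),
    msg("m3", "m2", "user"),
    msg("m4", "m3", "assistant"),
    msg("m5", "m2", "user"),       # branch: a different follow-up to m2
    msg("m6", "m5", "assistant"),
]

file_order_ids = [e["id"] for e in entries if e.get("type") == "message"]
branch_ids = [e["id"] for e in active_branch(entries)]
print(file_order_ids)
print(branch_ids)
```

In this sample, file-order reading keeps all six messages, including the abandoned m3/m4 exchange, while the branch walk keeps only m1, m2, m5, m6.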

Verification sources

Changes

1 file changed (mempalace/normalize.py), 52 insertions:

  • New _try_pi_jsonl() parser function
  • Registered in _try_normalize_json() dispatcher after Codex JSONL
  • Module docstring updated to list Pi agent JSONL as supported format

Test plan

  • ruff check mempalace/normalize.py passes clean
  • ruff format --check reports the file is already formatted
  • python3 -m py_compile mempalace/normalize.py compiles OK
  • Tested against sample Pi session data — produces correct > marker transcripts
  • False positive check — returns None for Codex JSONL, plain text, empty input
  • _extract_content correctly handles Pi's [{"type":"text","text":"..."}] content blocks
  • thinking blocks in assistant content are automatically filtered out
  • Pyright reports 0 new diagnostics
  • Format verified against official Pi session docs via Context7
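For readers unfamiliar with the "> marker" transcripts mentioned in the test plan, the sketch below shows one plausible rendering. It assumes user turns are prefixed with "> " and turns are separated by a blank line; to_transcript is a hypothetical helper, and the exact layout MemPalace produces may differ.

```python
def to_transcript(turns):
    """Render (role, text) pairs, marking user turns with a leading "> "."""
    lines = []
    for role, text in turns:
        lines.append(f"> {text}" if role == "user" else text)
        lines.append("")  # blank line between turns
    return "\n".join(lines).strip()


turns = [
    ("user", "Explain the architecture"),
    ("assistant", "The project uses..."),
]
rendered = to_transcript(turns)
print(rendered)
```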

Refs: #59

cc @tunnckoCore — this implements the Pi parser based on your session data. If you can test against your full sessions, that would help validate.

@tunnckoCore

tunnckoCore commented Apr 8, 2026

@adv3nt3 thanks, I'll try it a bit later or tomorrow.

a note tho: i think that thinking and compaction should be able to be included? maybe behind a flag or env var. It just makes sense to me for some reason.. 😅🤷‍♂️ but agree it shouldn't be the default.

I've started a Rust impl several hours ago, if you are interested 😉 I'll add this ingest normalizer thing there too. https://github.com/tunnckoCore/mempalace-rust

@adv3nt3
Contributor Author

adv3nt3 commented Apr 8, 2026

@tunnckoCore I thought about this, I'd keep thinking/compaction out of the default for now.

Thinking blocks are mostly the model planning its next step, and the actual conclusions already show up in the assistant's text. Including them would add a lot of noise to drawers without much recall value. Compaction summaries have a similar problem, they're lossy summaries Pi makes to manage its own context window, and MemPalace already has AAAK for compression, so you'd end up storing a summary of a summary.

That said, happy to add an opt-in flag later if we run into cases where the thinking context is genuinely useful. For now keeping it clean feels right.

Add _try_pi_jsonl parser for Pi agent session files stored at
~/.config/pi/agent/sessions/{encoded-cwd}/{timestamp}_{uuid}.jsonl.

Uses type "message" entries with role "user"/"assistant". Skips
toolResult messages, model_change, thinking_level_change, and other
operational events. Requires session header (type "session" with
"version" key) to avoid false positives.

Format documented at github.com/badlogic/pi-mono session.md and
verified via Context7. Sample data provided by tunnckoCore in MemPalace#59.

Refs: MemPalace#59
@adv3nt3 force-pushed the feat/pi-cli-normalizer branch from 3592183 to 5f46ff7 on April 9, 2026 17:53

@web3guru888 web3guru888 left a comment

Review of #169 (feat: add Pi agent JSONL session normalizer)

Scope: +52/−0 · 1 file(s)

  • mempalace/normalize.py (modified: +52/−0)

Suggestions

  • 💡 No tests included — consider adding coverage for the new code paths

🟢 Approved — clean, well-structured PR. Good work @adv3nt3!


🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis

@bensig bensig changed the base branch from main to develop April 11, 2026 22:23
@igorls added the area/mining (File and conversation mining) and enhancement (New feature or request) labels Apr 14, 2026
bensig added a commit that referenced this pull request Apr 18, 2026
Draft plugin specification for source adapters, mirroring RFC 001's
role for storage backends. Formalizes the contract six community
ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's
metadata-only mode have been reinventing ad-hoc, so adapter authors
can build to a stable surface.

Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata
  ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim
  promise with a verifiable one; byte_preserving adapters declare
  the empty set, declared_lossy adapters enumerate. Existing
  miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata);
  no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the
  filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as
  json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat
  closet_hints. Closes existing gap where convo drawers get no
  closets
- No per-drawer field renames: source_file, filed_at, source_mtime,
  added_by, normalize_version, entities, ingest_mode all preserved.
  Spec adds adapter_name, adapter_version, privacy_class

§9 enumerates the cleanup PR prerequisites (mempalace/sources/
module, PalaceContext facade, KnowledgeGraph.add_triple gaining
backwards-compatible source_drawer_id + adapter_name params).

Tracking issue: #989
jphein pushed a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
…Code, MemPalace#274/MemPalace#232 Cursor, MemPalace#169 Pi, MemPalace#702 Cursor+factory.ai)

Updates the multi-agent-support bullet to cite the actual upstream
work instead of just gesturing at it. RFC 002 itself is PR MemPalace#990
(tracking issue MemPalace#989). Existing third-party prototypes already
proposed against the spec:

* OpenCode SQLite — PR MemPalace#23
* Cursor SQLite — issue MemPalace#274
* Cursor JSONL (earlier variant) — PR MemPalace#232
* Pi agent JSONL — PR MemPalace#169
* Combined Cursor + factory.ai — PR MemPalace#702

Each becomes a mempalace-source-<agent> package once RFC 002 lands.
Names the path explicitly: fork unblocks the pattern by helping land
RFC 002; per-agent adapter PRs land from their respective authors.

Aider, Gemini CLI, Codex CLI, and Warp are roadmap targets without
existing adapter PRs and are listed as such (no fabricated PR refs).

https://claude.ai/code/session_01GvwducFnFtN8KYmfbWKMR6

Labels

area/mining (File and conversation mining), enhancement (New feature or request)

4 participants