Custom tools written during recovery of a 12TB multi-device BTRFS pool with
severe extent tree corruption that the native commands (btrfs check --repair,
--init-extent-tree, etc.) could not repair.
See INCIDENT-ANALYSIS.md for a structured case
study of the recovery, a root cause classification, and a set of constructive
proposals for upstream btrfs-progs improvements that would have prevented
the need for most of these tools.
Use these tools ONLY if btrfs check --repair segfaults, enters an infinite
loop, or leaves the filesystem in worse shape than before.
Documented cases where they help:
- btrfs check --repair segfaults at [3/8] checking extents (Issue #525)
- btrfs check --init-extent-tree deadlocks
- btrfs check --repair enters an infinite loop repeating the same repairs
- Extent tree with thousands of METADATA_ITEMs carrying wrong owner/level/generation
- FS_TREE with stale child pointers that reference blocks reused by other trees
- Pool only mounts with rescue=all,ro and fails to mount RW
These tools are NOT for light corruption. For normal damage, try
btrfs check --repair first.
- BACK UP metadata BEFORE running any tool with --write:

      for DEV in sda1 sdb1 sdc1; do
        sudo dd if=/dev/$DEV of=sb_${DEV}.bin bs=4096 count=1 skip=16
      done

- These tools make irreversible changes to the filesystem
- All default to scan-only mode (--write is opt-in)
- The filesystem must be unmounted when running these tools
- Requires btrfs-progs v6.19.1 or similar, with the EEXIST patch applied
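A superblock backup is only useful if the copy is actually a superblock. The sketch below wraps the dd command above and adds a sanity check on the result. It assumes bash and the standard on-disk layout (primary superblock at 64 KiB, magic string _BHRfS_M at byte offset 64 inside it). Both function names are hypothetical and not part of this repo.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repo).

# Copy the primary superblock: one 4 KiB block at offset 64 KiB (16 * 4096).
# Works on a block device or on an image file.
backup_superblock() {
    local src=$1 dst=$2
    dd if="$src" of="$dst" bs=4096 count=1 skip=16 2>/dev/null
}

# Succeed only if the saved copy is 4096 bytes long and carries the
# btrfs magic "_BHRfS_M" at byte offset 64.
verify_superblock_copy() {
    local copy=$1 size magic
    size=$(wc -c < "$copy")
    # tr strips NUL bytes so an all-zero copy compares cleanly as "no magic".
    magic=$(dd if="$copy" bs=1 skip=64 count=8 2>/dev/null | tr -d '\0')
    [ "$size" -eq 4096 ] && [ "$magic" = "_BHRfS_M" ]
}
```

A typical call would be backup_superblock /dev/sdX1 backup/sdX1.sb followed by verify_superblock_copy backup/sdX1.sb, run as root and repeated for every member device.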
The tools use the internal btrfs-progs API and must be built inside the btrfs-progs source tree:
# 1. Clone btrfs-progs
git clone --depth 1 --branch v6.19.1 https://github.com/kdave/btrfs-progs.git
cd btrfs-progs
# 2. Apply the EEXIST patch (required for batch backref injection)
patch -p1 < path/to/btrfs_fixes/patches/alloc_reserved_tree_block_eexist.patch
# 3. Configure and build base btrfs-progs
./autogen.sh
./configure
make -j$(nproc)
# 4. Copy the .c files from this repo into the btrfs-progs directory
cp path/to/btrfs_fixes/programs/*.c .
# 5. For each program, append a link rule to the Makefile
#    (printf emits the TAB that make requires before each recipe line):
printf '\nPROGNAME: PROGNAME.o $(objects) $(libs_shared)\n\t@echo "    [LD]     $@"\n\t$(Q)$(CC) -o $@ PROGNAME.o $(objects) $(libs_shared) $(LDFLAGS) $(LIBS)\n' >> Makefile
# 6. Build
make PROGNAME

Recommended execution order:
The most important tool. Recursively walks every tree in the filesystem (ROOT, CHUNK, EXTENT, FS, DEV, CSUM, UUID, FREE_SPACE) and detects metadata blocks that are missing a METADATA_ITEM backref in the extent tree. Injects all missing backrefs in a single transaction to avoid the "root tree moves between commits" problem.
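Conceptually, the detection step is a set difference: blocks reached by walking the trees, minus blocks that already have a METADATA_ITEM. The real tool does this through the btrfs-progs API. The sketch below only illustrates the idea on two hypothetical text files with one bytenr per line, and the function name is made up for the example.

```bash
#!/usr/bin/env bash
# Conceptual illustration only; the real tool works on the trees directly.
# $1: file listing bytenrs of every metadata block reached by the tree walk
# $2: file listing bytenrs that already have a METADATA_ITEM in the extent tree
# Prints the bytenrs that appear in $1 but not in $2 (the missing backrefs).
missing_backrefs() {
    local walked=$1 extent_items=$2
    LC_ALL=C comm -23 \
        <(LC_ALL=C sort -u "$walked") \
        <(LC_ALL=C sort -u "$extent_items")
}
```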
Usage:
sudo ./scan_and_fix_all_backrefs /dev/sdX # scan only
sudo ./scan_and_fix_all_backrefs /dev/sdX --write # scan + inject

Fixes the owner in the inline TREE_BLOCK_REF when it doesn't match the
block's actual btrfs_header_owner(). Mismatches occur when blocks are
reassigned between trees during failed repairs.
sudo ./fix_owner_refs /dev/sdX # scan
sudo ./fix_owner_refs /dev/sdX --write # fix

Fixes METADATA_ITEM and EXTENT_ITEM entries with an incorrect level. Corrupt
levels (e.g. 50, 55, 237) are garbage left behind by btrfs check --repair
entering a loop. Verified against the block's real btrfs_header_level().
sudo ./fix_bad_levels /dev/sdX # scan
sudo ./fix_bad_levels /dev/sdX --write # fix

Deletes duplicate METADATA_ITEMs (same bytenr, different levels in the key).
Keeps the one whose level matches btrfs_header_level and deletes the other.
sudo ./fix_duplicate_extents /dev/sdX # scan
sudo ./fix_duplicate_extents /dev/sdX --write # delete duplicates

Scans every level-1 node of the FS_TREE. Detects stale child pointers using
three checks: owner mismatch, first_key mismatch, or a first key whose type
is invalid for the FS_TREE (e.g. BLOCK_GROUP_ITEM). Removes them with
btrfs_del_ptr.
sudo ./remove_stale_ptrs /dev/sdX # scan
sudo ./remove_stale_ptrs /dev/sdX --write # remove

Creates an empty leaf for the UUID tree or the CSUM tree, respectively. Useful when the ROOT_ITEM points to a block that was reassigned to another tree. The kernel regenerates the UUID tree automatically on RW mount. With an empty CSUM tree, files flagged NODATASUM do not fail verification.
sudo ./fix_uuid_tree /dev/sdX
sudo ./fix_csum_tree /dev/sdX

Sets the BTRFS_INODE_NODATASUM flag on regular file inodes. Use this if
the csum tree is empty but files are still expected to have checksums, which
causes read errors. With NODATASUM, the kernel skips csum lookups.
sudo ./set_nodatasum /dev/sdX # scan
sudo ./set_nodatasum /dev/sdX --write # apply

Version with a hardcoded list of stale blocks. Prefer remove_stale_ptrs,
which detects them automatically. Only use this if you need manual control
over which specific blocks to remove.
Initial version with a hardcoded list of missing backrefs. Prefer
scan_and_fix_all_backrefs, which detects them automatically.
When the baseline tools above were insufficient (pool with 200K+ errors spread across multiple trees), these additional tools were built:
Pass 1 and Pass 2 scanners that walk the FS_TREE and extent tree respectively,
producing TSV files with every ref/extent mapping. Used to build input for
rebuild_extent_tree_apply when the extent tree needs to be rebuilt from scratch.
The main Phase 3 writer. Takes a pre-folded list of refs (from the scan_fstree_extents and scan_extent_tree diff) and injects 3M+ EXTENT_DATA_REFs into the extent tree in chunks of 5000 per transaction. Throttles every 50K items to avoid DM-SMR re-shingle stalls. Verified successful at 3,248,617 inserts in ~34 min on 3× WD40EFAX SMR disks.
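To size a run before starting it, the chunking arithmetic can be sketched as follows. This is a hypothetical planning helper, not something the tool ships. The defaults are the chunk size and throttle interval quoted above, and it assumes one pause after every full throttle interval.

```bash
#!/usr/bin/env bash
# Hypothetical planning helper: transactions and throttle pauses needed
# for a chunked insert run. Defaults mirror the figures quoted above.
plan_chunked_run() {
    local total=$1 chunk=${2:-5000} throttle_every=${3:-50000}
    # Ceiling division: the last chunk may be partial.
    local txns=$(( (total + chunk - 1) / chunk ))
    # Assumption: one pause after each full throttle interval.
    local pauses=$(( total / throttle_every ))
    echo "transactions=$txns pauses=$pauses"
}

plan_chunked_run 3248617   # prints: transactions=650 pauses=64
```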
sudo ./rebuild_extent_tree_apply /dev/sdX1 refs_folded.txt to_insert.txt watermark.txt --dryrun
sudo ./rebuild_extent_tree_apply /dev/sdX1 refs_folded.txt to_insert.txt watermark.txt --write

Surgical single-field patcher for BLOCK_GROUP_ITEM.used, for when the Phase 3 writer
leaves a specific block group with an overshoot caused by pre-existing overlapping
file_extent_items. Uses the btrfs_set_block_group_used setter directly to bypass the
btrfs_update_block_group space_info accounting (which is NOT wanted here).
Pre-validates that flags & BTRFS_BLOCK_GROUP_DATA is set.
sudo ./patch_block_group_used /dev/sdX1 <bg_bytenr> <bg_length> <new_used> --write

Deletes a hardcoded list of (bytenr, num_bytes, expected_inode) EXTENT_ITEMs from
the extent tree. Used to clean up overlapping stale extents in a single leaf that
prevent RO mount. Per-item sanity checks before delete (7 invariants including
inode allowlist). Runs with rebuilding_extent_tree=1 + reinit_extent_tree=true
to skip space accounting (caller patches used manually first via
patch_block_group_used).
Cleans orphan DIR_ITEM + DIR_INDEX entries from the FS_TREE, in chunks of 100
entries per transaction. Updates the parent INODE_ITEM i_size, decrementing it by
namelen × 2. This fixes a critical bug: v1 decremented by namelen only,
leaving directories in an invalid state. Carries a hardcoded exclusion list for
critical top-level directory names (e.g. pelis, series, music, backups, homestorage).
NEVER decrement i_size by raw namelen: BTRFS accounts namelen × 2 per directory entry.
Walks FS_TREE for INODE_REF items whose key.offset (parent inode) is in an
orphan parent list. Skips INODE_EXTREF to avoid false positives (EXTREF's
key.offset is a hash, not a parent ID). Chunks of 32 per transaction.
Recomputes i_size = sum(name_len × 2) and nlink = 1 for DIR inodes whose
counts were corrupted by previous orphan cleanup bugs. CRITICAL for safety:
if any DIR has nlink = 2, a single rm -rf on its path will silently delete
thousands of subdirectories (rmdir bomb). Walks DIR_INDEX entries, cross-checks
DIR_ITEM for hash collision detection (0 collisions verified empirically).
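The i_size rule is easy to get wrong, so here is the arithmetic as a small sketch. It is a hypothetical illustration in bash that works on plain name strings rather than on the filesystem. Name lengths are counted in bytes.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of directory i_size accounting.

# Length of a name in bytes (not characters).
name_bytes() { printf '%s' "$1" | wc -c; }

# Expected i_size of a directory holding the given entry names:
# every entry contributes name_len * 2.
expected_dir_isize() {
    local total=0 name
    for name in "$@"; do
        total=$(( total + 2 * $(name_bytes "$name") ))
    done
    echo "$total"
}

# i_size after removing one entry: subtract name_len * 2, never name_len alone.
isize_after_unlink() {
    local isize=$1 name=$2
    echo $(( isize - 2 * $(name_bytes "$name") ))
}

expected_dir_isize pelis series music   # prints: 32
isize_after_unlink 32 music             # prints: 22
```

Decrementing by the raw name length instead would give 27 in the second call, which is the invalid state the v1 bug produced.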
Removes orphan inode subtrees (DIR families + standalone REGs) from the FS_TREE. For each target: walks and deletes EXTENT_DATA, INODE_REF, INODE_EXTREF, XATTR, and finally INODE_ITEM. Transaction per DIR family (atomic per subtree), chunks of 50 for standalone REGs. Hardcoded paranoid exclusion list.
Improved version of remove_stale_ptrs: detects empty leaves with parent expected_key
(v1 skipped this case), recursive 2-level scan (root→level1 + level1→leaves),
dynamic buffer (no 512 limit), tolerates read_tree_block failures.
PoC for single extent insertion with validation. Used to validate the API path
before running rebuild_extent_tree_apply.
During the 2026-04-05 session, remove_orphan_inode_subtrees crashed twice on
the same BUG_ON assertion for two different reasons:
Crash vector 1: Direct btrfs_cow_block(leaf) over MIXED leaf (gen 3601,
contains both orphan and live inodes) → update_ref_for_cow walks children →
__btrfs_mod_ref(inc=1) over stale sibling children → btrfs_free_extent(phantom)
returns -ENOENT → BUG_ON → SIGABRT.
Crash vector 2 (discovered later, evaded by filtering):
btrfs_del_items post-purge drains a leaf below LEAF_DATA_SIZE/4 = 4096 bytes →
invokes push_leaf_left(sibling) or push_leaf_right(sibling) → if sibling has
gen ≤ last_snapshot = 3701, btrfs_block_can_be_shared returns 1 →
update_ref_for_cow enters refs > 1 path → btrfs_inc_ref(cow_sibling, 0) →
__btrfs_mod_ref(cow, level=0, inc=1) → iterates all EXTENT_DATAs of stale sibling →
btrfs_inc_extent_ref(phantom_bytenr) → BUG_ON(err) in extent-tree.c:1302 → SIGABRT.
The flags fs_info->rebuilding_extent_tree = 1 and trans->reinit_extent_tree = true
do NOT save the INC path: they only exempt BTRFS_DROP_DELAYED_REF (verified in
extent-tree.c:3885). BTRFS_ADD_DELAYED_REF (from btrfs_inc_ref) is fatal.
Bulletproof criterion for any target inode that will be deleted:
1. The leaf hosting the inode's INODE_ITEM has gen > 3701 (post-crash)
2. The leaf's parent level-1 has gen > 3701
3. Post-purge estimated used bytes > 4096 (no rebalance trigger)
4. ALL immediate siblings in the parent node have gen > 3701 (even if condition 3 fails, rebalance to post-crash siblings is safe)
5. Any backref targets (EXTENT_DATA disk_bytenr) resolve in the current extent tree (no -ENOENT on backref lookup)
Violating conditions 3 and 4 together triggers crash vector 2: the leaf drains
below the rebalance threshold and the push lands on a stale sibling. Condition 5
is waived by reinit_extent_tree for DROP but NOT for INC (which is what
push_leaf_left invokes).
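The five conditions can be turned into a pre-flight check. The sketch below is hypothetical and not the validation code used by the tool: it takes the per-leaf facts as arguments, which you would extract from dump-tree output yourself. It treats conditions 3 and 4 as jointly violated, and defaults to the generation 3701 and threshold 4096 from this incident.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight classifier for one target leaf.
# Args: leaf_gen parent_gen est_used_after_purge min_sibling_gen backrefs_ok(0|1)
# Env:  LAST_SNAPSHOT (default 3701), REBALANCE_THRESHOLD (default 4096)
# Prints "safe" and returns 0, or prints the failed condition and returns 1.
classify_leaf() {
    local leaf_gen=$1 parent_gen=$2 est_used=$3 min_sib_gen=$4 backrefs_ok=$5
    local snap=${LAST_SNAPSHOT:-3701} thresh=${REBALANCE_THRESHOLD:-4096}

    (( leaf_gen > snap ))   || { echo "unsafe: condition 1 (leaf gen)"; return 1; }
    (( parent_gen > snap )) || { echo "unsafe: condition 2 (parent gen)"; return 1; }

    # Conditions 3 and 4: a rebalance is only dangerous when at least one
    # immediate sibling is stale (min sibling gen <= last_snapshot).
    if (( est_used <= thresh && min_sib_gen <= snap )); then
        echo "unsafe: conditions 3+4 (rebalance into stale sibling)"
        return 1
    fi

    # Condition 5 is enforced unconditionally here, which is the conservative choice.
    (( backrefs_ok == 1 )) || { echo "unsafe: condition 5 (unresolved backref)"; return 1; }
    echo "safe"
}
```

For example, classify_leaf 3800 3800 1000 3600 1 reports the conditions 3+4 failure, while the same leaf with a minimum sibling generation of 3800 is reported as safe.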
For any candidate orphan inode set, walk the FS_TREE dump and classify each target leaf by the 5 bulletproof conditions. Example pattern (anonymized):
| Leaf | Gen | Orphan items / total | Post-purge used (est) | Rebalance? | Immediate siblings | Verdict |
|---|---|---|---|---|---|---|
| $LEAF_A | post-crash | mostly orphan, heavy purge | below threshold | YES | all post-crash | ✓ safe |
| $LEAF_B | post-crash | mostly live, light purge | above threshold | NO | clean parent | ✓ safe |
| $LEAF_C | post-crash | nearly 100% orphan | far below 4096 | YES forced | pre-crash stale | ❌ CRASH |
Leaves where ≥90% of items are orphan are the danger zone: they will drain
below the rebalance threshold (LEAF_DATA_SIZE/4 = 4096 bytes) with certainty,
forcing push_leaf_left/right. If any immediate sibling in the parent node has
gen ≤ last_snapshot, the push triggers CoW on that sibling, which enters the
btrfs_block_can_be_shared → refs > 1 → btrfs_inc_ref → __btrfs_mod_ref(inc=1)
path and crashes with BUG_ON(err) in btrfs_inc_extent_ref.
Mitigation: exclude the offending inodes from the input file. The tool processes whatever passes the pre-flight validation; leaves with mixed safe/unsafe targets can be partially processed by only listing the safe subset. Per-family transaction semantics mean each safe family commits atomically even if other families are excluded.
Empirical result from one session: starting from N candidate orphans,
after applying all 5 conditions the final safe subset was ~14% of the input,
but that subset committed without a single BUG_ON, with a 0-byte diff on
a baseline sha256 of live files captured pre-write.
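A baseline of this kind can be captured with standard tools. The sketch below assumes GNU find, sort, xargs and sha256sum, and a pool mounted read-only (for example with rescue=all,ro). The function names are hypothetical. Store the manifests outside the pool.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a before/after integrity check of live files.

# Print a "hash  path" line for every regular file under $1, in a stable
# order so that two manifests can be compared byte for byte.
sha256_manifest() {
    local root=$1
    ( cd "$root" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum )
}

# Succeed only if the two manifest files are byte-identical.
manifests_match() {
    cmp -s "$1" "$2"
}
```

Capture with sha256_manifest /mnt/pool > before.sha256 ahead of any --write run, repeat into after.sha256 afterwards, and treat a failing manifests_match before.sha256 after.sha256 as a reason to roll back.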
patches/alloc_reserved_tree_block_eexist.patch modifies btrfs-progs so that
when alloc_reserved_tree_block finds the METADATA_ITEM already exists it
returns 0 instead of propagating EEXIST. This is required for batch backref
injection to work: when injecting many backrefs, the delayed refs system also
tries to create METADATA_ITEMs for blocks newly allocated via COW and collides
with the ones we already inserted.
# 1. Backup
mkdir -p backup
for DEV in /dev/sdX1 /dev/sdY1; do
sudo dd if=$DEV of=backup/$(basename $DEV).sb bs=4096 count=1 skip=16
done
# 2. Make sure the filesystem is unmounted
sudo umount /mnt/pool 2>/dev/null
# 3. Zero the log tree (if applicable)
sudo btrfs rescue zero-log /dev/sdX1
# 4. Scan + fix everything (in order)
sudo ./scan_and_fix_all_backrefs /dev/sdX1 --write
sudo ./fix_bad_levels /dev/sdX1 --write
sudo ./fix_owner_refs /dev/sdX1 --write
sudo ./fix_duplicate_extents /dev/sdX1 --write
sudo ./remove_stale_ptrs /dev/sdX1 --write
# 5. Re-scan to verify convergence
sudo ./scan_and_fix_all_backrefs /dev/sdX1
sudo ./remove_stale_ptrs /dev/sdX1
# 6. If the csum tree is broken:
sudo ./fix_csum_tree /dev/sdX1
sudo ./set_nodatasum /dev/sdX1 --write
# 7. Try mounting RW
sudo mount -o rw /dev/sdX1 /mnt/pool
# 8. If it mounts, verify with btrfs check readonly
sudo btrfs check --force /dev/sdX1
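Step 2 of the workflow discards umount errors, so it can be worth confirming that nothing from the pool is still mounted before step 4. The sketch below is a hypothetical guard that looks up a device or mount point in the kernel mount table. The second argument exists only so the function can be exercised against a test file.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed if $1 appears in the mount table as either
# the mount source (field 1) or the mount point (field 2).
# A multi-device pool lists only ONE member device as its source, so check
# the mount point as well as each device.
# Note: paths with spaces are escaped (\040) in the table and will not match.
appears_in_mount_table() {
    local target=$1 table=${2:-/proc/self/mounts}
    awk -v t="$target" '$1 == t || $2 == t { found = 1 } END { exit !found }' "$table"
}
```

A cautious wrapper would refuse to continue when appears_in_mount_table /mnt/pool or appears_in_mount_table /dev/sdX1 succeeds.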
- Each repair can create new problems through COW: when a tool modifies the extent tree, btrfs COWs the affected nodes. The new nodes copy pointers from the old ones, which can propagate stale pointers. Multiple passes may be required.
- Data extent ref mismatches are not fixed: these tools only touch metadata backrefs. Incorrect ref counts on data extents (common after failed btrfs check --repair runs) are not cleaned up.
- Orphan inodes are not cleaned: orphan directory entries in the FS_TREE (references to inodes that no longer exist) are not removed.
- Does not replace btrfs check --repair: these tools target specific scenarios. For light or moderate damage, btrfs check --repair is better.
- NEVER hard power-cycle a multi-device BTRFS filesystem: combined free space tree + extent tree corruption is extremely hard to repair.
- NEVER run btrfs check --repair multiple times in a row if the first run did not resolve everything: it can enter an infinite loop and make the filesystem dramatically worse.
- Always back up the superblocks before every write operation.
- trans->reinit_extent_tree = true is key to ignoring DROP failures in delayed refs for blocks without backrefs.
- fs_info->rebuilding_extent_tree = 1 disables space checks during repairs.
- One large commit with many inserts is better than many small commits, because intermediate commits move the root tree.
- backup_slots in the SB are NOT historical backups: they are a sliding window of the most recent 4 commits only. A btrfs check --repair loop of 46,000+ commits will rotate every slot ~11,000 times in minutes, obliterating any pre-crash state recoverable from the kernel. For real retention you need explicit btrfs subvolume snapshot or btrfs send streams to another device.
- reinit_extent_tree is ASYMMETRIC: it only exempts BTRFS_DROP_DELAYED_REF, NOT BTRFS_ADD_DELAYED_REF. Any code path that calls btrfs_inc_ref on a stale leaf (including push_leaf_left/right during rebalance) will still crash via btrfs_inc_extent_ref → BUG_ON(err).
- The safety criterion for processing inodes in a damaged FS_TREE must include siblings, not just the target leaf itself. See the "Bulletproof subset criterion" section.
- Baseline sha256 of LIVE files is the only empirical proof of invariants. Capture it before any write operation, diff after. Any mismatch = rollback.
- DIR i_size is stored as sum(name_len × 2), NOT sum(name_len). Any orphan cleanup tool that decrements i_size on entry removal must decrement by namelen × 2. Getting this wrong leaves DIRs in an invalid state that can manifest as nlink = 2 later, which triggers an rmdir bomb if the pool is mounted RW (a single rm -rf on a parent can delete thousands of subdirs silently).
- Expert reviewer agents with empirical evidence are critical. The 2026-04-05 session used two parallel Opus reviewers (btrfs internals + ops) that analyzed the proposed plan against dump-tree output. They caught a deterministic crash vector (push_leaf_left → stale sibling) that would have repeated the previous failures. A textual plan review without empirical dump-tree analysis would have missed this.
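The backup_slots point in the list above comes down to ring-buffer arithmetic, sketched here with hypothetical helpers. Commits are simply numbered 1..N for illustration. The slot count of 4 is the figure stated above.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of a 4-slot sliding window of backup roots.

# Which commit numbers are still present in the slots after $1 commits?
surviving_commits() {
    local commits=$1 slots=${2:-4} first
    first=$(( commits > slots ? commits - slots + 1 : 1 ))
    # Unquoted on purpose: joins the seq output into one space-separated line.
    echo $(seq "$first" "$commits")
}

# How many times has each slot been overwritten after $1 commits?
overwrites_per_slot() {
    echo $(( $1 / ${2:-4} ))
}

surviving_commits 46000     # prints: 45997 45998 45999 46000
overwrites_per_slot 46000   # prints: 11500
```

After a 46,000-commit repair loop only the last four commits remain, all of them produced by the loop itself, which is why the slots cannot serve as a pre-crash restore point.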
These tools were written for a specific recovery case where the native tools were failing. They are not tested for general use cases. Only use them if you understand the code and accept the risk of data loss.
Always copy your data before attempting any repair if at all possible.
GPL-2.0 (compatible with btrfs-progs, whose internal API these tools use).