What is a persistent B-tree?

Aug 20, 2025

A persistent data structure preserves old versions after updates. Instead of mutating in place, each update returns a new root while structurally sharing most of the old nodes. Reads can keep using the old root forever; writes create a new version.

There are flavors:

Partial persistence: you can read any old version, but only update the latest.
Full persistence: you can branch from any version, producing multiple future timelines.

Most systems use partial persistence for simplicity (e.g., MVCC snapshots, copy-on-write filesystems).

A persistent B-tree is just a B-tree where nodes are treated as immutable objects. Updates do path copying: along the root-to-leaf path you touch, you allocate fresh copies of nodes with your changes; untouched subtrees are shared.

Why use this instead of a normal (mutable) B-tree?

Stable references to past versions: perfect for snapshots, point-in-time queries, time-travel debugging, and repeatable analytics.
Lock-free readers: readers pin a root (e.g., Arc/shared_ptr) and traverse without locks while writers build a new version.
Crash safety (when combined with COW to stable storage): never overwrite live data; write new blocks, then atomically flip the root.

How it works (path copying / COW)

Imagine inserting into a B-tree:

Find the leaf: like a normal B-tree.
Copy the leaf and insert the key/value (split if needed).
Bubble up: for each ancestor on the search path:

Copy the ancestor,
Update the child pointer that changed (and insert a separator key if a split happened),
If the ancestor overflows, split and continue upward.

If the root splits, create a new root with two children.
Return the new root pointer as the new version.

Only O(height) nodes are copied — i.e., O(log_B N) where B is branching factor.

Deletes follow the same pattern (merge/borrow while copying the path).

Complexity & trade-offs

Time per update: still O(log_B N). Constant factors grow a bit (alloc/copy per level).
Space per update: O(log_B N) new nodes; everything else is shared.
Read concurrency: trivial — readers just pin a root and traverse; no writer interference.
Write amplification: higher than in-place mutation, but often acceptable with large nodes (e.g., 256–4096 bytes).
Garbage collection: reference counting, epochs, or snapshot lifetimes determine when old nodes can be reclaimed.

Design choices that matter

Node size: Pick to match cache lines for in-memory (e.g., 256–1,024 bytes) or disk pages for on-disk (e.g., 4–16 KiB).
B-tree vs B+-tree: B+ keeps values in leaves and makes internal nodes pure indexes (faster scans). Both persist well; B+ is common for DBs.
Memory management: Arc in Rust or shared_ptr in C++ is easy; arenas + intrusive refcounts are faster.
Version management: Keep a vector/log of root pointers; oldest unpinned root bounds reclamation.
Crash consistency (on disk): write new nodes first, then atomically publish the root (e.g., by updating a superblock pointer).

A minimal persistent B+-tree in Rust (path-copying with `Arc`)

This sketch shows: immutable nodes, path copying on insert, snapshots as root Arc<Node>. It omits deletion and bulk ops to keep it readable, but the structure is right.

use std::sync::Arc;

// Tunables
const MIN_KEYS: usize = 8;        // Minimal occupancy per node (except root)
const MAX_KEYS: usize = 2 * MIN_KEYS;

#[derive(Clone)]
enum Node<K: Ord + Clone, V: Clone> {
    Internal(Internal<K, V>),
    Leaf(Leaf<K, V>),
}

#[derive(Clone)]
struct Internal<K: Ord + Clone, V: Clone> {
    keys: Vec<K>,                   // separators: len = children.len() - 1
    children: Vec<Arc<Node<K, V>>>, // len >= 2
}

#[derive(Clone)]
struct Leaf<K: Ord + Clone, V: Clone> {
    keys: Vec<K>,
    vals: Vec<V>,                   // same length as keys
}

#[derive(Clone)]
pub struct BPlusTree<K: Ord + Clone, V: Clone> {
    root: Arc<Node<K, V>>,
}

impl<K: Ord + Clone, V: Clone> BPlusTree<K, V> {
    pub fn new() -> Self {
        let leaf = Leaf { keys: Vec::new(), vals: Vec::new() };
        Self { root: Arc::new(Node::Leaf(leaf)) }
    }

    pub fn get(&self, key: &K) -> Option<V> {
        let mut node = self.root.clone();
        loop {
            match &*node {
                Node::Leaf(leaf) => {
                    match leaf.keys.binary_search(key) {
                        Ok(i) => return Some(leaf.vals[i].clone()),
                        Err(_) => return None,
                    }
                }
                Node::Internal(internal) => {
                    let i = match internal.keys.binary_search(key) {
                        Ok(i) => i + 1,
                        Err(i) => i,
                    };
                    node = internal.children[i].clone();
                }
            }
        }
    }

    /// Returns a new tree with (key, val) inserted (replaces value if key exists).
    pub fn insert(&self, key: K, val: V) -> Self {
        match insert_rec(self.root.clone(), key, val) {
            InsertResult::NoSplit(new_root) => Self { root: new_root },
            InsertResult::Split(left, sep, right) => {
                // Root split: create a new root with two children.
                let new_root = Node::Internal(Internal {
                    keys: vec![sep],
                    children: vec![left, right],
                });
                Self { root: Arc::new(new_root) }
            }
        }
    }
}

// The result of inserting into a node.
enum InsertResult<K: Ord + Clone, V: Clone> {
    NoSplit(Arc<Node<K, V>>),
    Split(Arc<Node<K, V>>, K, Arc<Node<K, V>>), // left, separator, right
}

fn insert_rec<K: Ord + Clone, V: Clone>(
    node: Arc<Node<K, V>>,
    key: K,
    val: V,
) -> InsertResult<K, V> {
    match &*node {
        Node::Leaf(leaf) => {
            // Copy-on-write: make a fresh leaf with the update.
            let mut keys = leaf.keys.clone();
            let mut vals = leaf.vals.clone();
            match keys.binary_search(&key) {
                Ok(i) => { vals[i] = val; } // replace
                Err(i) => {
                    keys.insert(i, key);
                    vals.insert(i, val);
                }
            }
            if keys.len() <= MAX_KEYS {
                InsertResult::NoSplit(Arc::new(Node::Leaf(Leaf { keys, vals })))
            } else {
                // Split leaf.
                let mid = keys.len() / 2;
                let right_keys = keys.split_off(mid);
                let right_vals = vals.split_off(mid);
                let left = Arc::new(Node::Leaf(Leaf { keys, vals }));
                let right = Arc::new(Node::Leaf(Leaf { keys: right_keys.clone(), vals: right_vals }));
                // In B+ trees, the separator promoted to the parent is the first key in the right node.
                let sep = right_keys[0].clone();
                InsertResult::Split(left, sep, right)
            }
        }
        Node::Internal(internal) => {
            // Descend to the target child.
            let i = match internal.keys.binary_search(&key) {
                Ok(i) => i + 1,
                Err(i) => i,
            };
            let child = internal.children[i].clone();
            match insert_rec(child, key, val) {
                InsertResult::NoSplit(new_child) => {
                    // Copy-on-write internal: replace one child pointer.
                    let mut keys = internal.keys.clone();
                    let mut children = internal.children.clone();
                    children[i] = new_child;
                    InsertResult::NoSplit(Arc::new(Node::Internal(Internal { keys, children })))
                }
                InsertResult::Split(left, sep, right) => {
                    // Copy and splice sep + new right child.
                    let mut keys = internal.keys.clone();
                    let mut children = internal.children.clone();
                    keys.insert(i, sep);
                    children[i] = left;
                    children.insert(i + 1, right);
                    if keys.len() <= MAX_KEYS {
                        InsertResult::NoSplit(Arc::new(Node::Internal(Internal { keys, children })))
                    } else {
                        // Split internal node.
                        let mid = keys.len() / 2;
                        let mut right_keys = keys.split_off(mid + 1);
                        let sep_up = keys.pop().unwrap(); // median promoted upward
                        let mut right_children = children.split_off(mid + 1);
                        right_children.insert(0, children.pop().unwrap());

                        let left_node = Arc::new(Node::Internal(Internal {
                            keys,
                            children,
                        }));
                        let right_node = Arc::new(Node::Internal(Internal {
                            keys: std::mem::take(&mut right_keys),
                            children: std::mem::take(&mut right_children),
                        }));
                        InsertResult::Split(left_node, sep_up, right_node)
                    }
                }
            }
        }
    }
}

Snapshotting: every time you call insert, you get a new BPlusTree with a new root: Arc<Node<…>>. Keep the old one around for a read-only snapshot. Cloning a snapshot is O(1) (just bumps an Arc count).

Extending this:

Add remove, range scans/iterators, and bulk loads.
Track statistics and node counts for GC/telemetry.
For on-disk: replace Arc<Node> with IDs and a page cache; writes become “write new pages, then publish root”.

Practical concerns & tips

Iterator stability: In a persistent tree, an iterator over version V stays valid as long as V’s root is alive. Great for long scans and analytical queries.
Space reclamation: If you use refcounting, space is reclaimed as soon as the last snapshot releasing those nodes is dropped. For high write rates, consider epoch-based reclamation or generational GC to reduce refcount churn.
Batching updates: Group many inserts into a single “version” to amortize path copying.
Compression/compaction: With large node sizes, lightweight compression (varint keys, prefix compression) significantly reduces memory/IO.
Concurrency: Readers just grab a root pointer. Writers build a new version privately, then publish a single pointer swap in a version registry.
On-disk variants: Use fixed-size pages, checksums, and a double-indirect superblock for atomic root updates. This is the essence of COW filesystems and LSM/B-tree hybrids.

When should you not use a persistent B-tree?

If you need to mutate in place to minimize write amplification (e.g., small embedded devices with very limited flash endurance).
If you never read older versions and memory is tight (persistent overhead won’t help you).
If your workload is heavy on sequential writes and massive ingest, an LSM-tree may be a better fit.

TL;DR

A persistent B-tree gives you stable, zero-copy snapshots and lock-free readers by treating nodes as immutable and doing path copying on updates. You pay O(log_B N) fresh nodes per write; you get simple concurrency and time-travel reads. The Rust sketch above show the core technique; from here, add deletion, range scans, and—if needed—an on-disk page layer with copy-on-write publishing of the root.

Andrew’s Substack

Discussion about this post

Ready for more?