Skip to content

MemTable & WAL Specification (Experimental)

Lance MemTable & WAL (MemWAL) specification describes a Log-Structured-Merge (LSM) tree architecture for Lance tables, enabling high-performance streaming write workloads while maintaining indexed read performance for key workloads including scan, point lookup, vector search and full-text search.

Overall Architecture

MemWAL Overview

A Lance table is called a base table under the context of the MemWAL spec. It must have an unenforced primary key defined in the table schema.

On top of the base table, the MemWAL spec defines a set of shards. Writers write to shards, and data in each shard is merged into the base table asynchronously. An index is kept in the base table for readers to quickly discover the state of all shards at a point of time.

MemWAL Shard

A MemWAL Shard is the main unit to horizontally scale out writes.

Each shard has exactly one active writer at any time. Writers claim a shard and then write data to that shard. Data in each shard is expected to be merged into the base table asynchronously.

Rows of the same primary key must be written to one and only one shard. If two shards contain rows with the same primary key, the following scenario can cause data corruption:

  1. Shard A receives a write with primary key pk=1 at time T1
  2. Shard B receives a write with primary key pk=1 at time T2 (T2 > T1)
  3. The row in shard B is merged into the base table first
  4. The row in shard A is merged into the base table second
  5. The row from Shard A (older) now overwrites the row from Shard B (newer)

This violates the expected "last write wins" semantics. By ensuring each primary key is assigned to exactly one shard via the shard spec, merge order between shards becomes irrelevant for correctness.

See MemWAL Shard Architecture for the complete shard architecture.

MemWAL Index

A MemWAL Index is the centralized structure for all MemWAL metadata on top of a base table. A table has at most one MemWAL index. It stores:

  • Configuration: Shard specs defining how rows map to shards, and which indexes to maintain
  • Merge progress: Last generation merged to base table for each shard
  • Index catchup progress: Which merged generation each base table index has been rebuilt to cover
  • Shard snapshots: Snapshot of all shard states for read optimization

The index is the source of truth for configuration, merge progress and index catchup progress Writers and mergers read the MemWAL index to get these configurations before writing.

Each shard's manifest is authoritative for its own state. Readers use shard snapshots is a read-only optimization to see a point-in-time view of all shards without the need to open each shard manifest.

See MemWAL Index Details for the complete structure.

Shard Architecture

Shard Architecture

Within a shard, writes are stored in an in-memory table (MemTable). It is also written to the shard's Write-Ahead Log (WAL) for durability guarantee. The MemTable is periodically flushed to storage based on memory pressure and other conditions. Flushed MemTables in storage are then asynchronously merged into the base table.

MemTable

A MemTable holds rows inserted into the shard before flushing to storage. It serves 2 purposes:

  1. build up data and related indexes to be flushed to storage as a flushed MemTable
  2. allow a reader to potentially access data that is not flushed to storage yet

MemTable Format

The complete in-memory format of a MemTable is implementation-specific and out of the scope of this spec. The Lance core Rust SDK maintains one default implementation and is available through all its language binding SDKs, but integrations are free to build their own MemTable format depending on the specific use cases, as long as it follows the MemWAL storage layout, reader and writer requirements when flushing MemTable.

Conceptually, because Lance uses Arrow as its in-memory data exchange format, for the ease of explanation in this spec, we will treat MemTable as a list of Arrow record batches, and each write into the MemTable is a new Arrow record batch.

MemTable Generation

Based on conditions like memory limit and durability requirements, a MemTable needs to be flushed to storage and discarded. When that happens, new writes go to a new MemTable and the cycle repeats. Each MemTable is assigned a monotonically increasing generation number starting from 1. When MemTable of generation N is discarded, the next MemTable gets assigned generation N+1.

WAL

WAL serves as the durable storage of all MemTables in a shard. It consists of data in MemTables ordered by generation. Every time we write to the WAL, we call it a WAL Flush.

WAL Durability

When a write is flushed to WAL, the specific write becomes durable. Otherwise, if the MemTable is lost, data is also lost.

Multiple writes can be batched together in a single WAL flush to reduce WAL flush frequency and improve throughput. The more writes a single WAL flush batches, the longer it takes for a write to be durable.

The whole LSM tree's durability is determined by the durability of the WAL. For example, if WAL is stored in Amazon S3, it has 99.999999999% durability. If it is stored in local disk, the data will be lost if the local disk is damaged.

WAL Entry

Each time a WAL flush happens, it adds a new WAL Entry to the WAL. In other words, a WAL consists of an ordered list of WAL entries starting from position 0. Writer must flush WAL entries in sequential order from lower to higher position. If WAL entry N is not flushed fully, WAL entry N+1 must not exist in storage.

WAL Replay

Replaying a WAL means to read data in the WAL from a lower to a higher position. This is commonly used to recover the latest MemTable after it is lost, by reading from the start position of the latest MemTable generation till the highest position in the WAL, assuming proper fencing to guard against multiple writers to the same shard.

See Writer Fencing for the full fencing mechanism.

WAL Entry Format

Each WAL entry is a file in storage following the Apache Arrow IPC stream format to store the batch of writes in the MemTable. The writer epoch is stored in the stream's Arrow schema metadata with key writer_epoch for fencing validation during replay.

WAL Storage Layout

Each WAL entry is stored within the WAL directory of the shard located at _mem_wal/{shard_id}/wal.

WAL files use bit-reversed 64-bit binary naming to distribute files evenly across the directory keyspace. This optimizes S3 throughput by spreading sequential writes across S3's internal partitions, minimizing throttling. The filename is the bit-reversed binary representation of the entry ID with suffix .lance. For example, entry ID 5 (binary 000...101) becomes 1010000000000000000000000000000000000000000000000000000000000000.arrow.

Flushed MemTable

A flushed MemTable is created by flushing the MemTable to storage. In Lance MemWAL spec, a flushed MemTable must be a Lance table following the Lance table format spec.

Note

This is called Sorted String Table (SSTable) or Sorted Run in many LSM-tree literatures and implementations. However, since our MemTable is not sorted, we just use the term flushed MemTable to avoid confusion.

Flushed MemTable Storage Layout

The MemTable of generation i is flushed to _mem_wal/{region_uuid}/{random_hex}_gen_{i}/ directory, where {random_hex} is a random 8-character hex value generated at flush time. The random hex value is necessary to ensure if one MemTable flush attempt fails, The retry can use another directory. The content within the generation directory follows the Lance table storage layout.

Merging MemTable to Base Table

Generation numbers determine merge order of flushed MemTable into base table: lower numbers represent older data and must be merged to the base table first to preserve correct upsert semantics.

Within a single flushed MemTable, if there are multiple rows of the same primary key, the row that is last inserted wins.

Shard Manifest

Each shard has a manifest file. This is the source of truth for the state of a shard.

Shard Manifest Contents

The manifest contains:

  • Fencing state: writer_epoch as the latest writer fencing token, see Writer Fencing for more details.
  • WAL pointers: replay_after_wal_entry_position (last entry position flushed to MemTable, 0-based), wal_entry_position_last_seen (last entry position seen at manifest update, 0-based)
  • Generation trackers: current_generation (next generation to flush), flushed_generations list of generation number and directory path pairs (e.g., generation 1 at a1b2c3d4_gen_1)

Note: wal_entry_position_last_seen is a hint that may be stale since it's not updated on WAL write. It is updated opportunistically by any reader that can update the shard manifest. The manifest itself is atomically written, but recovery must try to get newer WAL files to find the actual state beyond this hint.

The manifest is serialized as a protobuf binary file using the ShardManifest message.

ShardManifest protobuf message
message ShardManifest {
  // Shard identifier (UUID v4).
  UUID shard_id = 11;

  // Manifest version number.
  // Matches the version encoded in the filename.
  uint64 version = 1;

  // Shard spec ID this shard was created with.
  // Set at shard creation and immutable thereafter.
  // A value of 0 indicates a manually-created shard not governed by any spec.
  uint32 shard_spec_id = 10;

  // Writer fencing token - monotonically increasing.
  // A writer must increment this when claiming the shard.
  uint64 writer_epoch = 2;

  // The most recent WAL entry position (0-based) that has been flushed to a MemTable.
  // During recovery, replay starts from replay_after_wal_entry_position + 1.
  uint64 replay_after_wal_entry_position = 3;

  // The most recent WAL entry position (0-based) at the time manifest was updated.
  // This is a hint, not authoritative - recovery must list files to find actual state.
  uint64 wal_entry_position_last_seen = 4;

  // Next generation ID to create (incremented after each MemTable flush).
  uint64 current_generation = 6;

  // Field 7 removed: merged_generation moved to MemWalIndexDetails.merged_generations
  // which is the authoritative source for merge progress.

  // List of flushed MemTable generations and their directory paths.
  repeated FlushedGeneration flushed_generations = 8;

}

Shard Manifest Versioning

Manifests are versioned starting from 1 and immutable. Each update creates a new manifest file at the next version number. Updates use put-if-not-exists or file rename to ensure atomicity depending on the storage system. If two processes compete, one wins and the other retries.

To commit a manifest version:

  1. Compute the next version number
  2. Write the manifest to {bit_reversed_version}.binpb using put-if-not-exists
  3. In parallel best-effort write to version_hint.json with {"version": <new_version>} (failure is acceptable)

To read the latest manifest version:

  1. Read version_hint.json to get the latest version hint. If not found, start from version 1
  2. Check existence for subsequent versions from the starting version
  3. Continue until a version is not found
  4. The latest version is the last found version

Note

This works because the write rate to shard manifests is significantly lower than read rates. Shard manifests are only updated when shard metadata changes (MemTable flush), not on every write. This ensures HEAD requests will eventually terminate and find the latest version.

Shard Manifest Storage Layout

All shard manifest versions are stored in _mem_wal/{shard_id}/manifest directory.

Each shard manifest version file uses bit-reversed 64-bit binary naming, the same scheme as WAL files. For example, version 5 becomes 1010000000000000000000000000000000000000000000000000000000000000.binpb.

MemWAL Index Details

The MemWAL Index uses the standard index storage at _indices/{UUID}/.

The index stores its data in two parts:

  1. Index details (index_details in IndexMetadata): Contains configuration, merge progress, and snapshot metadata
  2. Shard snapshots: Stored as a Lance file or inline, depending on shard count

Index Details

The index_details field in IndexMetadata contains a MemWalIndexDetails protobuf message with the following key fields:

  • Configuration fields (shard_specs, maintained_indexes) are the source of truth for MemWAL configuration. Writers read these fields to determine how to partition data and which indexes to maintain.
  • Merge progress (merged_generations) tracks the last generation merged to the base table for each shard. This field is updated atomically with merge-insert data commits, enabling conflict resolution when multiple mergers operate concurrently. Each entry contains the shard UUID and generation number.
  • Index catchup progress (index_catchup) tracks which merged generation each base table index has been rebuilt to cover. When data is merged from a flushed MemTable to the base table, the base table's indexes may be rebuilt asynchronously. During this window, queries should use the flushed MemTable's pre-built indexes instead of scanning unindexed data in the base table. See Indexed Read Plan for details.
  • Shard snapshot fields (snapshot_ts_millis, num_shards, inline_snapshots) provide a snapshot of shard states. The actual shard manifests remain authoritative for shard state. When num_shards is 0, the inline_snapshots field may be None or an empty Lance file with 0 rows but proper schema.
MemWalIndexDetails protobuf message
message MemWalIndexDetails {
  // Snapshot timestamp (Unix timestamp in milliseconds).
  int64 snapshot_ts_millis = 1;

  // Number of shards in the snapshot.
  // Used to determine storage format without reading the snapshot data.
  uint32 num_shards = 2;

  // Inline shard snapshots for small shard counts.
  // When num_shards <= threshold (implementation-defined, e.g., 100),
  // snapshots are stored inline as serialized bytes.
  // Format: Lance file bytes with the shard snapshot schema.
  optional bytes inline_snapshots = 3;

  // Shard specs defining how to derive shard identifiers.
  // This configuration determines how rows are partitioned into shards.
  repeated ShardSpec shard_specs = 7;

  // Indexes from the base table to maintain in MemTables.
  // These are index names referencing indexes defined on the base table.
  // The primary key btree index is always maintained implicitly and
  // should not be listed here.
  //
  // For vector indexes, MemTables inherit quantization parameters (PQ codebook,
  // SQ params) from the base table index to ensure distance comparability.
  repeated string maintained_indexes = 8;

  // Last generation merged to base table for each shard.
  // This is updated atomically with merge-insert data commits, enabling
  // conflict resolution when multiple mergers operate concurrently.
  //
  // Note: This is separate from shard snapshots because:
  // 1. merged_generations is updated by mergers (atomic with data commit)
  // 2. shard snapshots are updated by background index builder
  repeated MergedGeneration merged_generations = 9;

  // Per-index catchup progress tracking.
  // When data is merged to the base table, base table indexes are rebuilt
  // asynchronously. This field tracks which generation each index covers.
  //
  // For indexed queries, if an index's caught_up_generation < merged_generation,
  // readers should use flushed MemTable indexes for the gap instead of
  // scanning unindexed data in the base table.
  //
  // If an index is not present in this list, it is assumed to be fully caught up.
  repeated IndexCatchupProgress index_catchup = 10;

}

Shard Identifier

Each shard has a unique identifier across all shards following UUID v4 standard. When a new shard is created, it is assigned a new identifier.

Shard Spec

A Shard Spec defines how all rows in a table are logically divided into different shards, enabling automatic shard assignment and query-time shard pruning.

Each shard spec has:

  • Spec ID: A positive integer that uniquely identifies this spec within the MemWAL index. IDs are never reused.
  • Shard fields: An array of field definitions that determine how to compute shard values.

Each shard is bound to a specific shard spec ID, recorded in its manifest. Shards without a spec ID (spec_id = 0) are manually-created shards not governed by any spec.

A shard spec's field array consists of shard field definitions. Each shard field has the following properties:

Property Description
field_id Unique string identifier for this shard field
source_ids Array of field IDs referencing source columns in the schema
transform A well-known shard expression, specify this or expression
expression A DataFusion SQL expression for custom logic, specify this or transform
result_type The output type of the shard value

Shard Expression

A Shard Expression is a DataFusion SQL expression that derives a shard value from source column(s). Source columns are referenced as col0, col1, etc., corresponding to the order of field IDs in source_ids.

Shard expressions must satisfy the following requirements:

  1. Deterministic: The same input value must always produce the same output value.
  2. Stateless: The expression must not depend on external state (e.g., current time, random values, session variables).
  3. Type-promotion resistant: The expression must produce the same result for equivalent values regardless of their numeric type (e.g., int32(5) and int64(5) must yield the same shard value).
  4. Column removal resistant: If a source field ID is not found in the schema, the column should be interpreted as NULL.
  5. NULL-safe: The expression should properly handle NULL inputs and have defined behavior (e.g., return NULL if input is NULL for single-column expressions).
  6. Consistent with result type: The expression's return type must be consistent with result_type in non-NULL cases.

Shard Transform

A Shard Transform is a well-known shard expression with a predefined name. When a transform is specified, the expression is derived automatically.

Transform Parameters Shard Expression Result Type
identity (none) col0 same as source
year (none) date_part('year', col0) int32
month (none) date_part('month', col0) int32
day (none) date_part('day', col0) int32
hour (none) date_part('hour', col0) int32
bucket num_buckets abs(murmur3(col0)) % N int32
multi_bucket num_buckets abs(murmur3_multi(col0, col1, ...)) % N int32
truncate width left(col0, W) (string) or col0 - (col0 % W) (numeric) same as source

The bucket and multi_bucket transforms use Murmur3 hash functions:

  • murmur3(col): Computes the 32-bit Murmur3 hash (x86 variant, seed 0) of a single column. Returns a signed 32-bit integer. Returns NULL if input is NULL.
  • murmur3_multi(col0, col1, ...): Computes the Murmur3 hash across multiple columns. Returns a signed 32-bit integer. NULL fields are ignored during hashing; returns NULL only if all inputs are NULL.

The hash result is wrapped with abs() and modulo N to produce a non-negative bucket number in the range [0, N).

Shard Snapshot Storage

Shard snapshots are stored using one of two strategies based on the number of shards:

Shard Count Storage Strategy Location
<= 100 (threshold) Inline inline_snapshots field in index details
> 100 External Lance file _indices/{UUID}/index.lance

The threshold (100 shards) is implementation-defined and may vary.

Inline storage: For small shard counts, snapshots are serialized as a Lance file and stored in the inline_snapshots field. This keeps the index metadata compact while avoiding an additional file read for common cases.

External Lance file: For large shard counts, snapshots are stored as a Lance file at _indices/{UUID}/index.lance. This file uses standard Lance format with the shard snapshot schema, enabling efficient columnar access and compression.

Shard Snapshot Arrow Schema

Shard snapshots are stored as a Lance file with one row per shard. The schema has one column per ShardManifest field plus shard spec columns:

Column Type Description
shard_id fixed_size_binary(16) Shard UUID bytes
version uint64 Shard manifest version
shard_spec_id uint32 Shard spec ID (0 if manual)
writer_epoch uint64 Writer fencing token
replay_after_wal_entry_position uint64 Last WAL entry position (0-based) flushed to MemTable
wal_entry_position_last_seen uint64 Last WAL entry position (0-based) seen (hint)
current_generation uint64 Next generation to flush
flushed_generations list<struct<generation: uint64, path: string>> Flushed MemTable paths
region_field_{field_id} varies Shard field value (one column per field in shard spec)

For example, with a shard spec containing a field user_bucket of type int32:

Column Type Description
... ... (base columns above)
region_field_user_bucket int32 Bucket value for this shard

This schema directly corresponds to the fields in the ShardManifest protobuf message plus the computed shard field values.

Storage Layout

Here is a recap of the storage layout with all the files and concepts defined so far:

{table_path}/
├── _indices/
│   └── {index_uuid}/                    # MemWAL Index (uses standard index storage)
│       └── index.lance                  # Serialized shard snapshots (Lance file)
└── _mem_wal/
    └── {region_uuid}/                   # Shard directory (UUID v4)
        ├── manifest/
        │   ├── {bit_reversed_version}.binpb     # Serialized shard manifest (bit-reversed naming)
        │   └── version_hint.json                # Version hint file
        ├── wal/
        │   ├── {bit_reversed_entry_id}.lance    # WAL data files (bit-reversed naming)
        │   └── ...
        └── {random_hash}_gen_{i}/        # Flushed MemTable (generation i, random prefix)
            ├── _versions/
            │   └── {version}.manifest    # Table manifest (V2 naming scheme)
            ├── _indices/                 # Indexes
            │   ├── {vector_index}/
            │   └── {scalar_index}/
            └── bloom_filter.bin          # Primary key bloom filter

Implementation Expectation

This specification describes the storage layout for the LSM tree architecture. Implementations are free to use any approach to fulfill the storage layout requirements. Once data is written to the expected storage layout, the reader and writer expectations apply.

The specification defines:

  • Storage layout: The directory structure, file formats, and naming conventions for WAL entries, flushed MemTables, shard manifests, and the MemWAL index
  • Durability guarantees: How data is persisted through WAL entries and flushed MemTables
  • Consistency model: How readers and writers coordinate through manifests and epoch-based fencing

Implementations may choose different approaches for:

  • In-memory data structures and indexing
  • Buffering strategies before WAL flush
  • Background task scheduling and concurrency
  • Query execution strategies

As long as the storage layout is correct and the documented invariants are maintained, implementations can optimize for their specific use cases.

Writer Expectations

A writer operates on a single shard and is responsible for:

  1. Claiming the shard using epoch-based fencing
  2. Writing data to WAL entries and flushed MemTables following the storage layout
  3. Maintaining the shard manifest to track WAL and generation progress

Writer Fencing

Writers use epoch-based fencing to ensure single-writer semantics per shard.

To claim a shard:

  1. Load the latest shard manifest
  2. Increment writer_epoch by one
  3. Atomically write a new manifest version
  4. If the write fails (another writer claimed the epoch), reload and retry with a higher epoch

Before any manifest update, a writer must verify its writer_epoch remains valid:

  • If local_writer_epoch == stored_writer_epoch: The writer is still active and may proceed
  • If local_writer_epoch < stored_writer_epoch: The writer has been fenced and must abort

For a concrete example, see Appendix 1: Writer Fencing Example.

Background Job Expectations

Background jobs handle merging flushed MemTables to the base table and garbage collection.

MemTable Merger

Flushed MemTables must be merged to the base table in ascending generation order within each shard. This ordering is essential for correct upsert semantics: newer generations must overwrite older ones.

The merge uses Lance's merge-insert operation with atomic transaction semantics:

  • merged_generations[shard_id] is updated atomically with the data commit
  • On commit conflict, check the conflicting commit's merged_generations to determine if the generation was already merged

For a concrete example, see Appendix 2: Concurrent Merger Example.

Garbage Collector

The garbage collector removes obsolete data from shard directories. Flushed MemTables and their referenced WAL files may be deleted after:

  1. The generation has been merged to the base table (generation <= merged_generations[shard_id])
  2. All maintained indexes have caught up (generation <= min(index_catchup[I].caught_up_generation))
  3. No retained base table version references the generation for time travel

Reader Expectations

LSM Tree Merging Read

Readers MUST merge results from multiple data sources (base table, flushed MemTables, in-memory MemTables) by primary key to ensure correctness.

When the same primary key exists in multiple sources, the reader must keep only the newest version based on:

  1. Generation number (_gen): Higher generation wins. The base table has generation 0, MemTables have positive integers starting from 1.
  2. Row address (_rowaddr): Within the same generation, higher row address wins (later writes within a batch overwrite earlier ones).

The ordering for "newest" is: highest _gen first, then highest _rowaddr.

This deduplication is essential because:

  • A row updated in a MemTable also exists (with older data) in the base table
  • A flushed MemTable that has been merged to the base table may not yet be garbage collected, causing the same row to appear in both
  • A single write batch may contain multiple updates to the same primary key

Without proper merging, queries would return duplicate or stale rows.

Reader Consistency

Reader consistency depends on two factors:

  1. access to in-memory MemTables
  2. the source of shard metadata (either through MemWAL index or shard manifests)

Strong consistency requires access to in-memory MemTables for all shards involved in the query and reading shard manifests directly. Otherwise, the query is eventually consistent due to missing unflushed data or stale MemWAL Index snapshots.

Note

Reading a stale MemWAL Index does not impact correctness, only freshness:

- **Merged MemTable still in index**: If a flushed MemTable has been merged to the base table but still shows in the MemWAL index, readers query both. This results in some inefficiency for querying the same data twice, but [LSM-tree merging](#lsm-tree-merging-read) ensures correct results since both contain the same data. The inefficiency is also compensated by the fact that the data is covered by index and we rarely end up scanning both data.
- **Garbage collected MemTable still in index**: If a flushed MemTable has been garbage collected, but is still in the MemWAL index, readers would fail to open it and skip it. This is also safe because if it is garbage collected, the data must already exist in the base table.
- **Newly flushed MemTable not in index**: If a newly flushed MemTable is added after the snapshot was built, it is not queried. The result is eventually consistent but correct for the snapshot's point in time.

Query Planning

MemTable Collection

The query planner collects datasets from multiple sources and assembles them for unified query execution. Datasets come from:

  1. base table (representing already-merged data)
  2. flushed MemTables (persisted but not yet merged)
  3. optionally in-memory MemTables (if accessible).

Each dataset is tagged with a generation number: 0 for the base table, and positive integers for MemTable generations. Within a shard, the generation number determines data freshness, with higher numbers representing newer data. Rows from different shards do not need deduplication since each primary key maps to exactly one shard.

The planner also collects bloom filters from each generation for staleness detection during search queries.

Shard Pruning

Before executing queries, if shard spec is available, the planner evaluates filter predicates against shard specs to determine which shards may contain matching data. This pruning step reduces the number of shards to scan.

For each filter predicate:

  1. Extract predicates on columns used in shard specs
  2. Evaluate which shard values can satisfy the predicate
  3. Prune shards whose values cannot match

For example, with a shard spec using bucket(user_id, 10) and a filter user_id = 123:

  1. Compute bucket(123, 10) = 3
  2. Only scan shards with bucket value 3
  3. Skip all other shards

Shard pruning applies to both scan queries and prefilters in search queries.

Indexed Read Plan

When data is merged from a flushed MemTable to the base table, the base table's indexes are rebuilt asynchronously by the base table index builders. During this window, the merged data exists in the base table but is not yet covered by the base table's indexes.

Without special handling, indexed queries would fall back to expensive full scans for the unindexed part of the base table. To maintain indexed read performance, the query planner should use index_catchup progress to determine the optimal data source for each query.

The key insight is that flushed MemTables serve as a bridge between the base table's index catchup and the current merged state. For a query that requires a specific index for acceleration, when index_gen < merged_gen, the generations in the gap (index_gen, merged_gen] have data already merged in the base table but are not covered by the base table's index. Since flushed MemTables contain pre-built indexes (created during MemTable flush), queries can use these indexes instead of scanning unindexed data in the base table. This ensures all reads remain indexed regardless of how far behind the async index builder is.

Appendices

Appendix 1: Writer Fencing Example

This example demonstrates how epoch-based fencing prevents data corruption when two writers compete for the same shard.

Initial State

Shard manifest (version 1):
  writer_epoch: 5
  replay_after_wal_entry_position: 10
  wal_entry_position_last_seen: 12

Scenario

Step Writer A Writer B Manifest State
1 Loads manifest, sees epoch=5 epoch=5, version=1
2 Increments to epoch=6, writes manifest v2 epoch=6, version=2
3 Starts writing WAL entries 13, 14, 15
4 Loads manifest v2, sees epoch=6 epoch=6, version=2
5 Increments to epoch=7, writes manifest v3 epoch=7, version=3
6 Starts writing WAL entries 16, 17
7 Tries to flush MemTable, loads manifest
8 Sees epoch=7, but local epoch=6
9 Writer A is fenced! Aborts all operations
10 Continues writing normally epoch=7, version=3

What Happens to Writer A's WAL Entries?

Writer A wrote WAL entries 13, 14, 15 with writer_epoch=6 in their schema metadata.

When Writer B performs crash recovery or MemTable flush:

  1. Reads WAL entries sequentially starting from replay_after_wal_entry_position + 1 (entry 11, since positions are 0-based)
  2. For each entry, checks existence using HEAD request on the bit-reversed filename
  3. Continues until an entry is not found (e.g., entry 18 doesn't exist)
  4. Finds entries 13, 14, 15, 16, 17
  5. Reads each file's writer_epoch from schema metadata
  6. Entries 13, 14, 15 have writer_epoch=6 which is <= current epoch (7) -> valid, will be replayed
  7. Entries 16, 17 have writer_epoch=7 -> valid, will be replayed

Key Points

  1. No data loss: Writer A's entries are not discarded. They were written with a valid epoch at the time and will be included in recovery.

  2. Consistency preserved: Writer A is prevented from making further writes that could conflict with Writer B.

  3. Orphaned files are safe: WAL files from fenced writers remain on storage and are replayed by the new writer. They are only garbage collected after being included in a flushed MemTable that has been merged.

  4. Epoch validation timing: Writers check their epoch before manifest updates (MemTable flush), not on every WAL write. This keeps the hot path fast while ensuring consistency at commit boundaries.

Appendix 2: Concurrent Merger Example

This example demonstrates how MemWAL Index and conflict resolution handle concurrent mergers safely.

Initial State

MemWAL Index:
  merged_generations: {shard: 5}

Shard manifest (version 1):
  current_generation: 8
  flushed_generations: [(6, "abc123_gen_6"), (7, "def456_gen_7")]

Scenario 1: Racing on the Same Generation

Two mergers both try to merge generation 6 concurrently.

Step Merger A Merger B MemWAL Index
1 Reads index: merged_gen=5 merged_gen=5
2 Reads shard manifest
3 Starts merging gen 6
4 Reads index: merged_gen=5 merged_gen=5
5 Reads shard manifest
6 Starts merging gen 6
7 Commits (merged_gen=6) merged_gen=6
8 Tries to commit
9 Conflict: reads new index
10 Sees merged_gen=6 >= 6, aborts
11 Reloads, continues to gen 7

Merger B's conflict resolution detected that generation 6 was already merged by checking the MemWAL Index in the conflicting commit.

Scenario 2: Crash After Table Commit

Merger A crashes after committing to the table.

Step Merger A Merger B MemWAL Index
1 Reads index: merged_gen=5 merged_gen=5
2 Merges gen 6, commits merged_gen=6
3 CRASH merged_gen=6
4 Reads index: merged_gen=6 merged_gen=6
5 Reads shard manifest
6 Skips gen 6 (already merged)
7 Merges gen 7, commits merged_gen=7

The MemWAL Index is the single source of truth. Merger B correctly used it to determine that generation 6 was already merged.

Key Points

  1. Single source of truth: merged_generations is the authoritative source for merge progress, updated atomically with data.

  2. Conflict resolution uses MemWAL Index: When a commit conflicts, the merger checks the conflicting commit's MemWAL Index.

  3. No progress regression: Because MemWAL Index is updated atomically with data, concurrent mergers cannot regress the merge progress.