durability: Flush batches #4478

Merged: clockwork-labs-bot merged 3 commits into kim/commitlog/append-commit-reopen from kim/durability/flush-batch, Feb 27, 2026

Conversation

@kim (Contributor) commented Feb 26, 2026

Instead of periodic flush + sync, the simplified commitlog buffering makes it easy to just pop a batch of transactions from the queue, commit them, and then flush + sync after each batch.

This may lead to an overall higher number of fsyncs on the system, but also adapts to the tx throughput of the individual database (up to the batch size).

After taking some measurements, we may want to make the batch size configurable at runtime.

Depends-on: #4404

@cloutiertyler (Contributor)

I think we need both an interval and a batch, no? If a database has very low throughput, this will fsync very infrequently and spike latency. (e.g. one client doing a txn every minute or so).

@kim (Contributor, Author) commented Feb 26, 2026

The idea is to fsync every batch (we did flush each tx before), amortising the cost by batching up to 32 txs.

recv_many may return a single element if there aren’t more in the channel.

@cloutiertyler (Contributor) commented Feb 26, 2026

The batching approach improves latency under low load (no waiting for timer ticks), but I'm concerned the fixed batch size of 32 couples throughput to fsync latency in a way the current time-based approach doesn't.

With time-based sync, throughput is bounded by disk write speed. With batch-capped sync, throughput is bounded by batch_size / fsync_latency.

Example NVME-ish numbers: Assume 1KB per transaction, 3 GB/s disk throughput, and 1ms fsync latency.

  • Time-based (current): Sync every 10ms. Disk can write 30MB in that interval = 30,000 tx. Throughput: 3,000,000 tx/sec (disk-limited)
  • Batch-capped (new): Sync after 32 tx (32KB), fsync takes 1ms. Throughput: 32,000 tx/sec (fsync-latency-limited)

The batch-capped approach would run at ~1% disk utilization under sustained load, whereas I think time-based would be close to 100%. If fsync takes longer (e.g. 3ms), it's more like a theoretical max of ~10,000 tx/second. And the smaller the transactions, the lower the disk utilization.

The current implementation also appears to adapt to slow fsync via MissedTickBehavior::Delay, if I'm understanding correctly. If fsync takes longer than the interval, it looks like the next tick fires immediately when fsync completes, so we fsync continuously. But each sync includes all transactions that accumulated during the previous fsync. So in a high transaction throughput scenario, I think removing the 32 transaction bound would have the same behavior as the current logic, but with the added benefit of fsyncing continuously for lower throughput scenarios (rather than waiting for the interval), right?

If the concern is pulling too many transactions off the queue at once, the current implementation already does this, no? It uses an unbounded channel and syncs everything that has accumulated during the interval, with no cap.

recv_many already returns early when fewer than 32 transactions are queued, so removing or significantly increasing the cap would:

  • Preserve the latency improvement under low load (sync as soon as data arrives)
  • Not artificially limit throughput under high load (sync everything queued, then continue)

The cap acts like backpressure, but it's not responding to any actual signal (disk saturation, memory pressure). Is there a specific reason to bound it at 32 rather than letting it drain whatever's queued?

@cloutiertyler (Contributor) left a review:

New changes resolve my throughput concerns.

    clog.flush()?;
    tx_buf = spawn_blocking(move || -> io::Result<Vec<Transaction<Txdata<T>>>> {
        for tx in tx_buf.drain(..) {
            clog.commit([tx])?;
Contributor:

Why do we only commit one at a time, when we may have multiple transactions ready to be committed?

Contributor (Author):

This only appends the commit to a BufWriter, it doesn't translate to a write. That way, if a BufWriter flush results in a torn write, we don't lose the entire batch of transactions (we can't recover partially written commits).

        break;
    }
    // Reclaim burst capacity.
    tx_buf.shrink_to(self.batch_capacity.get());
Contributor:

Nit: We should probably only shrink if n < batch_capacity, so that we aren't constantly allocating and deallocating under high load.

Contributor (Author):

We shrink to the batch capacity -- you mean maybe define a high watermark capacity?

@clockwork-labs-bot clockwork-labs-bot merged commit d6e2ba5 into kim/commitlog/append-commit-reopen Feb 27, 2026
40 of 50 checks passed
@kim kim deleted the kim/durability/flush-batch branch February 28, 2026 05:21

4 participants