durability: Flush batches #4478

clockwork-labs-bot merged 3 commits into kim/commitlog/append-commit-reopen
Conversation
Instead of periodic flush + sync, the simplified commitlog buffering makes it easy to just pop a batch of transactions from the queue, commit them, and then flush + sync after each batch. This may lead to an overall higher number of fsyncs on the system, but it also adapts to the tx throughput of the individual database (up to the batch size). After taking some measurements, we may want to make the batch size configurable at runtime.

Depends-on: #4404
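A minimal sketch of the batch-per-fsync loop described above. The `Commitlog` stand-in, `drain_batches`, and the counters are illustrative, not the actual PR code; the point is only the shape of the loop: commit each popped transaction, then flush + sync once per batch of up to 32.

```rust
use std::collections::VecDeque;

const BATCH_SIZE: usize = 32;

/// Illustrative stand-in for the commitlog (names are hypothetical).
struct Commitlog {
    committed: usize,
    fsyncs: usize,
}

impl Commitlog {
    fn commit(&mut self, _tx: u64) {
        self.committed += 1;
    }
    /// One flush + fsync per batch, amortising the sync cost.
    fn flush_and_sync(&mut self) {
        self.fsyncs += 1;
    }
}

fn drain_batches(queue: &mut VecDeque<u64>, clog: &mut Commitlog) {
    while !queue.is_empty() {
        // Pop up to BATCH_SIZE transactions and commit each one.
        for _ in 0..BATCH_SIZE {
            match queue.pop_front() {
                Some(tx) => clog.commit(tx),
                None => break,
            }
        }
        // Then sync once for the whole batch.
        clog.flush_and_sync();
    }
}

fn main() {
    let mut queue: VecDeque<u64> = (0..100).collect();
    let mut clog = Commitlog { committed: 0, fsyncs: 0 };
    drain_batches(&mut queue, &mut clog);
    // 100 txs -> batches of 32+32+32+4 -> 4 fsyncs
    println!("committed={} fsyncs={}", clog.committed, clog.fsyncs);
}
```

Under this scheme a lone transaction syncs immediately (no timer to wait for), while a full queue pays one fsync per 32 transactions.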
I think we need both an interval and a batch, no? If a database has very low throughput, this will fsync very infrequently and spike latency (e.g. one client doing a txn every minute or so).
The idea is to fsync every batch (we did flush each tx before), amortising the cost by batching up to 32 txs.
The batching approach improves latency under low load (no waiting for timer ticks), but I'm concerned the fixed batch size of 32 couples throughput to fsync latency in a way the current time-based approach doesn't. With time-based sync, throughput is bounded by disk write speed. With batch-capped sync, throughput is bounded by batch size divided by fsync latency.

Example NVMe-ish numbers: assume 1 KB per transaction, 3 GB/s disk throughput, and 1 ms fsync latency. The batch-capped approach would be ~1% disk utilization under sustained load, whereas I think time-based would be close to 100%. If fsync takes longer (e.g. 3 ms), it's more like a theoretical max of 10,000 tx/second. And the smaller the transactions are, the less disk utilization we'd have. The current implementation also appears to adapt to slow fsync.

If the concern is pulling too many transactions off the queue at once, the current implementation already does this, no? It uses an unbounded channel and syncs everything that has accumulated during the interval, with no cap. The cap acts like backpressure, but it's not responding to any actual signal (disk saturation, memory pressure). Is there a specific reason to bound it at 32 rather than letting it drain whatever's queued?
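Working through the example numbers above: with at most one batch of 32 per fsync, the cap is batch / fsync latency, which comes out to ~32,000 tx/s and roughly 1% of the assumed 3 GB/s disk bandwidth. A small sketch of the arithmetic (the constants are the comment's assumptions, not measured values):

```rust
fn main() {
    let tx_bytes: f64 = 1024.0; // assumed 1 KB per transaction
    let disk_bw: f64 = 3e9;     // assumed 3 GB/s sequential write bandwidth
    let batch: f64 = 32.0;      // fixed batch cap from the PR
    let fsync_s: f64 = 1e-3;    // assumed 1 ms fsync latency

    // Batch-capped sync: at most one batch per fsync.
    let max_tps = batch / fsync_s;                   // 32,000 tx/s
    let bytes_per_s = max_tps * tx_bytes;            // ~32.8 MB/s
    let utilization = bytes_per_s / disk_bw * 100.0; // ~1.1% of disk bandwidth

    println!("max {:.0} tx/s, {:.1}% disk utilization", max_tps, utilization);

    // With a 3 ms fsync the cap drops to ~10,667 tx/s.
    let max_tps_slow = batch / 3e-3;
    println!("slow fsync: {:.0} tx/s", max_tps_slow);
}
```

Time-based sync has no such per-batch cap: the interval accumulates however many transactions arrive, so sustained throughput is bounded by write bandwidth instead.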
cloutiertyler left a comment:

New changes resolve my throughput concerns.
Review thread on this hunk:

```rust
clog.flush()?;
tx_buf = spawn_blocking(move || -> io::Result<Vec<Transaction<Txdata<T>>>> {
    for tx in tx_buf.drain(..) {
        clog.commit([tx])?;
```
Why do we only commit one at a time, when we may have multiple transactions ready to be committed?
This only appends the commit to a BufWriter; it doesn't translate to a write. That way, if a BufWriter flush results in a torn write, we don't lose the entire batch of transactions (we can't recover partially written commits).
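To illustrate the semantics being relied on here: `write_all` on a `BufWriter` only appends to an in-memory buffer, and nothing reaches the underlying writer until `flush()`. A `Vec<u8>` stands in for the log file; this is a sketch of the behaviour, not the PR's actual I/O path.

```rust
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // A Vec<u8> stands in for the commitlog segment file.
    let sink: Vec<u8> = Vec::new();
    let mut w = BufWriter::with_capacity(64 * 1024, sink);

    for tx in 0..32u8 {
        // "commit": buffered append only, no write to the sink yet.
        w.write_all(&[tx])?;
    }
    // One flush per batch pushes the whole buffer down at once.
    // If this flush tears, we lose at most this un-synced batch,
    // not anything flushed and synced earlier.
    w.flush()?;
    let sink = w.into_inner().expect("flush already succeeded");
    println!("{} bytes written in one flush", sink.len());
    Ok(())
}
```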
Review thread on this hunk:

```rust
    break;
}
// Reclaim burst capacity.
tx_buf.shrink_to(self.batch_capacity.get());
```
Nit: We should probably only shrink if n < batch_capacity, so that we aren't constantly allocating and deallocating under high load.
We do shrink to the batch capacity -- do you mean we should define a high-watermark capacity instead?
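For context on the exchange above: `Vec::shrink_to` only ever lowers capacity, and never below the given bound, so `shrink_to(batch_capacity)` is a no-op when the buffer hasn't grown past the cap; the nit is about repeated grow/shrink churn under sustained bursts. A sketch contrasting the two, where `BATCH_CAPACITY` and the high-watermark variant are illustrative values, not the PR's:

```rust
fn main() {
    const BATCH_CAPACITY: usize = 32;

    let mut tx_buf: Vec<u64> = Vec::with_capacity(BATCH_CAPACITY);
    tx_buf.extend(0..100); // a burst grows the allocation past the cap
    let grown = tx_buf.capacity();
    tx_buf.clear();

    // shrink_to only lowers capacity, never below the bound, so this
    // reclaims the burst allocation back down toward BATCH_CAPACITY.
    tx_buf.shrink_to(BATCH_CAPACITY);
    println!("capacity: {} -> {}", grown, tx_buf.capacity());

    // A (hypothetical) high-watermark variant would skip the shrink
    // while bursts stay under the watermark, trading a little memory
    // for fewer alloc/dealloc cycles under sustained high load:
    let high_watermark = 4 * BATCH_CAPACITY;
    tx_buf.reserve(100);
    if tx_buf.capacity() > high_watermark {
        tx_buf.shrink_to(high_watermark);
    }
    println!("high-watermark capacity: {}", tx_buf.capacity());
}
```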
Merged commit d6e2ba5 into kim/commitlog/append-commit-reopen