[GLUTEN-11605][VL] Write per-block column statistics in shuffle writer by acvictor · Pull Request #11769 · apache/gluten

acvictor · 2026-03-16T10:53:59Z

What changes are proposed in this pull request?

This PR adds per-block column statistics (min/max/hasNull) to the shuffle writer pipeline as a prerequisite for block-level pruning with dynamic filters at the shuffle reader. When spark.gluten.sql.columnar.backend.velox.valueStream.dynamicFilter.enabled is true, the shuffle writer computes per-column min/max statistics from the raw Arrow buffers during evictBuffers() and serializes them as a kStatisticsPayload block before each non-dictionary payload in the output file. This mirrors how Parquet row group statistics enable predicate pushdown.
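To illustrate how such statistics might be consumed, here is a minimal sketch of the pruning decision a reader could make against a range-style dynamic filter. `ColumnStats`, `RangeFilter`, and `canSkipBlock` are hypothetical names for illustration only; they are not identifiers from this PR, and the actual filter types may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-column statistics for one block (not the PR's actual type).
struct ColumnStats {
  int64_t min;
  int64_t max;
  bool hasNull;
  bool hasValue; // false when every row in the block is null
};

// Hypothetical inclusive range filter, e.g. derived from a join's build side.
struct RangeFilter {
  int64_t lower;
  int64_t upper;
  bool nullAllowed;
};

// Returns true only when no row in the block can pass the filter.
inline bool canSkipBlock(const ColumnStats& stats, const RangeFilter& filter) {
  assert(filter.lower <= filter.upper);
  if (stats.hasNull && filter.nullAllowed) {
    return false; // a null row may pass the filter
  }
  if (!stats.hasValue) {
    return true; // no non-null values, and nulls cannot pass
  }
  // Skip when the block's [min, max] does not overlap [lower, upper].
  return stats.max < filter.lower || stats.min > filter.upper;
}
```

The check is deliberately conservative: whenever the statistics cannot rule a block out, the block is read.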

How was this patch tested?

Added new tests and also ran CI with the config set to true.

Was this patch authored or co-authored using generative AI tooling?

No

Related issue: #11605

acvictor · 2026-03-18T18:13:33Z

@marin-ma @zhztheplayer this is ready for review. I will push one more commit to disable it by default again.

marin-ma

@acvictor Thanks for adding this feature! Please check my comments below.

Since adding the statistics can increase the shuffle data size, have you measured the overall growth in shuffled data size for the TPC-H/TPC-DS benchmarks?

marin-ma · 2026-03-19T09:28:46Z

cpp/core/shuffle/BlockStatistics.cc

+  }
+  // Check each bit — return early on first null found.
+  for (uint32_t i = 0; i < numRows; ++i) {
+    if (!arrow::bit_util::GetBit(validityBuffer->data(), i)) {


Perhaps checking each byte by comparing it with 0xff would be faster.

Done, thanks!
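A minimal sketch of the byte-wise check suggested above, written against a raw bitmap pointer rather than the Arrow buffer type used in the PR. The function name `hasNullBytewise` is hypothetical, and the sketch assumes Arrow's validity layout (least-significant bit first, 1 means valid).

```cpp
#include <cassert>
#include <cstdint>

// Returns true if any of the first numRows validity bits is 0 (null).
// Assumes LSB-first bit order within each byte, as in Arrow validity bitmaps.
inline bool hasNullBytewise(const uint8_t* validity, uint32_t numRows) {
  assert(validity != nullptr || numRows == 0);
  const uint32_t fullBytes = numRows / 8;
  for (uint32_t i = 0; i < fullBytes; ++i) {
    if (validity[i] != 0xff) {
      return true; // at least one of these 8 rows is null
    }
  }
  const uint32_t tailBits = numRows % 8;
  if (tailBits != 0) {
    // Only the low tailBits bits are meaningful; padding bits are ignored.
    const uint8_t mask = static_cast<uint8_t>((1u << tailBits) - 1);
    if ((validity[fullBytes] & mask) != mask) {
      return true;
    }
  }
  return false;
}
```

The trailing partial byte needs a mask because padding bits past numRows are not guaranteed to be set.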

marin-ma · 2026-03-19T09:38:32Z

cpp/core/jni/JniWrapper.cc

+  // Reuse the dynamic filter config to also enable block statistics collection,
+  // since stats are only useful when dynamic filter pushdown is active.
+  const auto& confMap = ctx->getConfMap();
+  auto it = confMap.find("spark.gluten.sql.columnar.backend.velox.valueStream.dynamicFilter.enabled");


Other option values are passed through function arguments. Can you add a new argument, enableBlockStatistics?

marin-ma · 2026-03-19T09:42:07Z

cpp/core/shuffle/Payload.cc

+    mergedStats.merge(*append->blockStats_);
+    result->setBlockStats(std::move(mergedStats));
+  } else if (source->hasBlockStats()) {
+    result->setBlockStats(*source->blockStats_);


If only source or append has blockStats, should we either discard them or compute them for the missing side and merge?

Done. Stats are kept only when both payloads have them; if either side is missing stats, they are discarded.
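The resolution described above can be sketched as follows. `Stats` and `mergeStats` are hypothetical stand-ins for the PR's block statistics type and merge path, reduced to a single int64 column for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical single-column statistics; the PR's actual type may differ.
struct Stats {
  int64_t min;
  int64_t max;
  bool hasNull;

  void merge(const Stats& other) {
    min = std::min(min, other.min);
    max = std::max(max, other.max);
    hasNull = hasNull || other.hasNull;
  }
};

// Keeps statistics only when both inputs carry them. Otherwise the merged
// payload has none, so a reader never prunes on partial information.
inline std::optional<Stats> mergeStats(
    const std::optional<Stats>& source,
    const std::optional<Stats>& append) {
  if (!source.has_value() || !append.has_value()) {
    return std::nullopt;
  }
  assert(source->min <= source->max && append->min <= append->max);
  Stats merged = *source;
  merged.merge(*append);
  return merged;
}
```

Discarding is the cheaper of the two options raised in the review; computing statistics for the missing side would keep pruning available at the cost of an extra pass over that payload.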

weiting-chen · 2026-03-26T07:41:30Z

cpp/core/shuffle/BlockStatistics.cc

+    if (!isRowValid(validityBuffer, i)) {
+      continue;
+    }
+    T val = values[i];


Please check whether NaN needs to be handled here.
Comparisons against NaN are always false, so a NaN may be silently skipped (updating neither min nor max), or it may leave minVal = maxVal = NaN.
Consider adding a NaN check:
if constexpr (std::is_floating_point_v<T>) {
  if (std::isnan(val)) continue; // Skip NaN values
}
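For context, a self-contained sketch of a min/max loop with the suggested NaN guard might look like this. `computeMinMax` is a hypothetical helper, not the function in BlockStatistics.cc, and it takes raw pointers instead of Arrow buffers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>
#include <type_traits>
#include <utility>

// Computes {min, max} over valid, non-NaN values; returns nullopt if none exist.
// validity may be null (all rows valid); bits are LSB-first as in Arrow bitmaps.
template <typename T>
std::optional<std::pair<T, T>> computeMinMax(
    const T* values, const uint8_t* validity, uint32_t numRows) {
  assert(values != nullptr || numRows == 0);
  std::optional<std::pair<T, T>> result;
  for (uint32_t i = 0; i < numRows; ++i) {
    if (validity != nullptr && ((validity[i / 8] >> (i % 8)) & 1) == 0) {
      continue; // null row
    }
    const T val = values[i];
    if constexpr (std::is_floating_point_v<T>) {
      if (std::isnan(val)) {
        continue; // NaN compares false with everything, so skip it explicitly
      }
    }
    if (!result.has_value()) {
      result = std::make_pair(val, val); // first usable value seeds both bounds
    } else {
      if (val < result->first) result->first = val;
      if (val > result->second) result->second = val;
    }
  }
  return result;
}
```

Whether skipping NaN is safe for pruning depends on how the reader's filters treat NaN. Spark orders NaN above all other numeric values, so a block containing NaN may need to be flagged rather than silently excluded from the bounds.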

weiting-chen · 2026-03-26T07:50:27Z

cpp/core/shuffle/Payload.cc

+    case Type::kCompressed: {
+      int64_t size = sizeof(Type) + sizeof(uint32_t) + sizeof(uint32_t); // type + numRows + numBuffers
+      if (!buffers_.empty() && buffers_[0]) {
+        size += buffers_[0]->size();


Missing buffer size field?
size += sizeof(int64_t) + buffers_[0]->size(); // buffer size field + data
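One way to catch this kind of mismatch is to assert that the computed size equals the number of bytes actually written. The sketch below assumes a hypothetical layout of type (1 byte), numRows, numBuffers, a 64-bit buffer size field, and then the buffer bytes; it is not the PR's actual wire format, and `serializedSize` and `serialize` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: [type: u8][numRows: u32][numBuffers: u32][bufferSize: i64][bytes].
inline int64_t serializedSize(const std::vector<uint8_t>& buffer) {
  int64_t size = sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint32_t); // type + numRows + numBuffers
  size += sizeof(int64_t) + buffer.size(); // buffer size field + data
  return size;
}

// Writes the assumed layout into a byte vector and cross-checks the size.
inline std::vector<uint8_t> serialize(
    uint8_t type, uint32_t numRows, const std::vector<uint8_t>& buffer) {
  std::vector<uint8_t> out;
  auto append = [&out](const void* src, size_t len) {
    const auto* p = static_cast<const uint8_t*>(src);
    out.insert(out.end(), p, p + len);
  };
  const uint32_t numBuffers = 1;
  const int64_t bufferSize = static_cast<int64_t>(buffer.size());
  append(&type, sizeof(type));
  append(&numRows, sizeof(numRows));
  append(&numBuffers, sizeof(numBuffers));
  append(&bufferSize, sizeof(bufferSize)); // the length prefix the review asks about
  append(buffer.data(), buffer.size());
  // The size estimate and the bytes written must agree.
  assert(static_cast<int64_t>(out.size()) == serializedSize(buffer));
  return out;
}
```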

acvictor · 2026-03-26T12:24:18Z

@acvictor Thanks for adding this feature! Please check my comments below.

As adding the statistics can grow the shuffle data size, have you tested the overall growth of shuffled data size for tpch/tpcds benchmarks?

I will get back to you on this soon.

github-actions bot added the VELOX label Mar 16, 2026

acvictor force-pushed the acvictor/writerChanges branch 6 times, most recently from cb073fd to 19b8d5a on March 16, 2026 11:46

Initial changes

50e0444

acvictor force-pushed the acvictor/writerChanges branch from 19b8d5a to 50e0444 on March 16, 2026 13:51

Test with true

336f6de

acvictor marked this pull request as ready for review March 17, 2026 13:57

marin-ma reviewed Mar 19, 2026


acvictor added 2 commits March 20, 2026 08:33

Address PR comments

52018ad

Set to false

a61deb5

weiting-chen mentioned this pull request Mar 25, 2026

Gluten 2026 Roadmap #11827

Open

60 tasks

weiting-chen reviewed Mar 26, 2026

acvictor wants to merge 4 commits into apache:main from acvictor:acvictor/writerChanges


3 participants
