[Sort Pushdown · Future B] Page-level dynamic prune at RG boundary: refresh `PagePruningPredicate` using runtime `DynamicFilter` (follow-up to #22450) #23216

## Summary

Today page-level pruning in Parquet (`opener/mod.rs:1314` → `PagePruningPredicate::prune_plan_with_page_index_and_metrics`) runs **once at file open** with the static query predicate. #22450 added dynamic RG-level pruning at every RG boundary (`should_prune` in `push_decoder.rs:183`), but its rebuild path never re-evaluates the page-level predicate.

This issue extends #22450's "refresh at RG boundary" pattern to **also refresh the `PagePruningPredicate`**, so the page-level `RowSelection` of upcoming RGs is tightened by the latest TopK threshold.

## Current state (source-confirmed)

| Prune type | Where | Data | Dynamic? |
|---|---|---|---|
| RG-level (#22450) | `push_decoder.rs:183 should_prune` (RG boundary) | RG metadata min/max | ✅ rebuilt every RG boundary |
| **Page-level** | `opener/mod.rs:1314` (**file open only**) | page index | ❌ snapshot at file open |
| Row-level (RowFilter) | per batch | filter column values | ✅ reads latest threshold |

Gap: after #22450, RG-level pruning is dynamic but page-level pruning is still static. If the TopK heap tightens after file open, surviving RGs keep their initial (loose) page-level `RowSelection`, so pages whose min/max can no longer pass the new threshold are still fetched, decompressed, and decoded for filter-column evaluation.
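
To make the gap concrete, here is a minimal, self-contained sketch. It does not use DataFusion or arrow-rs types: `PageStats`, `surviving_pages`, and `rows_to_decode` are hypothetical names, and the dynamic filter is assumed to have the shape `value > threshold` (as it would for an `ORDER BY value DESC LIMIT k` TopK). It only illustrates how many rows a stale threshold leaves in the selection.

```rust
/// Hypothetical stand-in for one page's entry in the Parquet page index.
/// Only the max is needed for a `value > threshold` filter.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    max: i64,
    rows: usize,
}

/// Indices of pages that may still contain a row with `value > threshold`.
/// A page is dead once its max is at or below the threshold.
fn surviving_pages(pages: &[PageStats], threshold: i64) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| p.max > threshold)
        .map(|(i, _)| i)
        .collect()
}

/// Rows that would still be fetched, decompressed, and decoded
/// if pages were pruned against `threshold`.
fn rows_to_decode(pages: &[PageStats], threshold: i64) -> usize {
    surviving_pages(pages, threshold)
        .iter()
        .map(|&i| pages[i].rows)
        .sum()
}
```

With four pages of 1000 rows each and maxes of 100, 40, 75, and 20, a threshold of 10 taken at file open keeps all four pages, while a threshold tightened to 50 keeps only the first and third. A selection computed once at file open never sees the second result.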

## Proposal

At every RG boundary (`PushDecoderStreamState::transition`):

1. `tracker.changed()` — same single atomic load #22450 uses
2. If changed: rebuild a fresh `PagePruningPredicate` from latest filter
3. Walk remaining RGs in access plan; refine each `RowSelection` via `prune_plan_with_page_index_and_metrics`
4. Apply via existing `into_builder() → with_row_groups(...) → build()`

Errors fall back to "keep current selection" (mirrors `should_prune`).
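
The steps above can be sketched as a toy model. Everything here is an assumption made for illustration: `DynamicThreshold`, `Tracker`, `RowGroupPlan`, `refine`, and `refresh_at_rg_boundary` are hypothetical names, the filter is assumed to be `value > threshold`, and a per-page `bool` stands in for the real `RowSelection`. Step 4 (the `into_builder() → with_row_groups(...) → build()` rebuild) is not modeled.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical stand-in for the TopK-driven dynamic filter `value > threshold`.
struct DynamicThreshold {
    threshold: AtomicI64,
    generation: AtomicU64,
}

impl DynamicThreshold {
    fn new(threshold: i64) -> Self {
        Self {
            threshold: AtomicI64::new(threshold),
            generation: AtomicU64::new(0),
        }
    }

    /// Publish a tighter threshold, then bump the generation counter.
    fn update(&self, threshold: i64) {
        self.threshold.store(threshold, Ordering::Release);
        self.generation.fetch_add(1, Ordering::Release);
    }
}

/// Hypothetical tracker: remembers the last generation it observed.
struct Tracker {
    seen: u64,
}

impl Tracker {
    /// Step 1: a single atomic load decides whether any work is needed.
    fn changed(&mut self, filter: &DynamicThreshold) -> bool {
        let current = filter.generation.load(Ordering::Acquire);
        let changed = current != self.seen;
        self.seen = current;
        changed
    }
}

/// One remaining row group: per-page max plus the current page selection.
struct RowGroupPlan {
    page_max: Vec<i64>,
    selected: Vec<bool>,
}

/// Steps 2-3 for one row group: intersect the current selection with
/// the pages that can still pass the latest threshold.
fn refine(plan: &RowGroupPlan, threshold: i64) -> Result<Vec<bool>, String> {
    if plan.page_max.len() != plan.selected.len() {
        return Err("page index does not match selection".to_string());
    }
    Ok(plan
        .page_max
        .iter()
        .zip(&plan.selected)
        .map(|(max, keep)| *keep && *max > threshold)
        .collect())
}

/// Called at an RG boundary; returns how many row groups were refined.
fn refresh_at_rg_boundary(
    tracker: &mut Tracker,
    filter: &DynamicThreshold,
    remaining: &mut [RowGroupPlan],
) -> usize {
    if !tracker.changed(filter) {
        return 0;
    }
    let threshold = filter.threshold.load(Ordering::Acquire);
    let mut refined = 0;
    for plan in remaining.iter_mut() {
        // On error, keep the current selection untouched.
        if let Ok(selection) = refine(plan, threshold) {
            plan.selected = selection;
            refined += 1;
        }
    }
    refined
}
```

Refinement only ever intersects with the existing selection, so a refresh can drop pages but never resurrect one that was already pruned. A row group whose page index cannot be evaluated simply keeps its looser selection, which is always safe because the row-level filter still applies.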

## Expected wins

Saves filter-column **IO + decompress + decode** for individual dead pages — extends #22450's "chip away Layer B residue" philosophy from RG to page granularity.

Most useful when:
- RGs are large (many pages each)
- Threshold tightens significantly mid-scan (e.g. after first few RGs fill the heap)
- Page index is enabled (prerequisite — without it, no-op)

## Prerequisites

- `datafusion.execution.parquet.enable_page_index = true`
- Filter column present in file schema
- Predicate chain contains a `DynamicFilter` (TopK source)

## Open design questions

1. **Refresh frequency**: every RG boundary, or only when `tracker.changed()` returns true?
2. **Granularity**: refresh access plan for *all* surviving RGs, or only the next one to be touched?
3. **arrow-rs API gap**: does the existing `with_row_groups(...)` path accept an updated per-RG `RowSelection`, or do we need a new arrow-rs API hook? (May overlap with arrow-rs#10158 territory.)
4. **Stretch goal · mid-RG refresh**: refresh *between* pages of the same RG, not just at RG boundary. Needs a brand-new arrow-rs "mid-RG predicate adapt" callback hook.

## Related

- #22450 — RG-level dynamic prune (the foundation this extends)
- #23067 — Per-RG `fully_matched` RowFilter skip
- arrow-rs#10158 — `peek_next_row_group` (related rebuild surface)
- arrow-rs#9937 — Page-level reverse iteration (independent but adjacent)

Part of the Sort Pushdown EPIC #23036, future direction.
