Skip to content

Support dynamic filter pushdown in FilteredReadExec #7562

Description

@wjones127

DataFusion 53.1 (our vendored version) implements dynamic filter pushdown for hash joins: HashJoinExec builds a DynamicFilterPhysicalExpr from the build side's join keys once the build phase completes, and propagates it to the probe side via ExecutionPlan::gather_filters_for_pushdown / handle_child_pushdown_result (datafusion-physical-plan-53.1.0/src/joins/hash_join/{exec.rs,shared_bounds.rs}; see also the DataFusion blog post).

FilteredReadExec (rust/lance/src/io/exec/filtered_read.rs) does not implement either trait method, so this pushdown is currently a no-op for any plan where a Lance scan is the probe side of a join — the filter is generated and offered but has nowhere to land.

Why this matters

merge_insert (#7367) plans the target scan as a FilteredReadExec probed against a hash join built from the (usually smaller) source side. If FilteredReadExec accepted the pushed-down dynamic filter, it could skip decoding — or, if per-page/fragment key statistics are available, skip reading entirely — target rows that can't match any source key, in a single sequential scan pass. That would avoid the join → TakeExec round trip (and its random-I/O cost) that late-materialization approaches have to reason about, for at least the case where a dynamic filter usefully narrows the scan.

This is likely useful beyond merge_insert for any join where a Lance scan is the probe side and the build side is small/selective.

Scope

  • Implement ExecutionPlan::gather_filters_for_pushdown / handle_child_pushdown_result on FilteredReadExec to accept a dynamically-populated PhysicalExpr.
  • Apply the filter during scan execution: at minimum evaluate it per-batch/row before returning results; if feasible, use existing column statistics (if any) for page/fragment-level pruning.
  • Benchmark against a join-heavy workload (e.g. merge_insert backfill) to confirm it reduces I/O without regressing plans where the filter doesn't help (non-clustered keys, filters that can't prune anything).

Surfaced while assessing follow-up work for #7367 (merge_insert late materialization).

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions