
Projection pushdown into file scan duplicates non-deterministic functions (regression in 52.0.0) #23220

@deepyaman

Describe the bug

When a non-deterministic (volatile) function such as random() or uuid() is computed once in a subquery and then referenced multiple times in the outer projection, DataFusion >= 52.0.0 pushes the outer projection into the file-scan DataSourceExec and inlines the subquery alias. The single call becomes N independent calls, one per reference.

Two references to what should be the same "locked-in" value then diverge. This worked correctly in 51.0.0 and regressed in 52.0.0 (both 52.0.0 and 53.0.0 are affected).

It only reproduces with a file scan (Parquet/CSV); an in-memory MemTable is not affected, which points at projection pushdown into the file source.
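To make the divergence concrete, here is a minimal pure-Python model of the two evaluation strategies. It does not use DataFusion at all: `evaluate_once` and `evaluate_inlined` are hypothetical names that only mimic "compute r per row, then reuse it" versus "call the volatile function once per output column".

```python
import random


def evaluate_once(rows, rng):
    # Models the two-stage plan: the inner projection computes r once per row,
    # and the outer projection reuses that value for both x and y.
    out = []
    for _ in rows:
        r = rng.random()
        out.append((r, r))
    return out


def evaluate_inlined(rows, rng):
    # Models the inlined plan: each output column makes its own call
    # to the volatile function, so x and y are drawn independently.
    return [(rng.random(), rng.random()) for _ in rows]


rows = [1, 2, 3]
once = evaluate_once(rows, random.Random(0))
inlined = evaluate_inlined(rows, random.Random(0))

print(all(x == y for x, y in once))     # x and y agree on every row
print(all(x == y for x, y in inlined))  # x and y differ
```

The second strategy is what the inlined projection amounts to semantically, which is why the two references stop agreeing.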

To Reproduce

datafusion-cli:

COPY (SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 3) TO 't.parquet';
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 't.parquet';

EXPLAIN
SELECT s.r AS x, s.r AS y
FROM (SELECT random() AS r FROM t) AS s;

51.0.0 — correct (random() evaluated once, then reused):

ProjectionExec: expr=[r@0 as x, r@0 as y]
  ProjectionExec: expr=[random() as r]
    DataSourceExec: file_groups={...t.parquet}, file_type=parquet

52.0.0 / 53.0.0 — incorrect (random() inlined and duplicated):

DataSourceExec: file_groups={...t.parquet}, projection=[random() as x, random() as y], file_type=parquet

Executing the query confirms x != y on 53.0.0, whereas x == y on 51.0.0.
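For a quick regression check on this specific query, one option is to count how many times random() appears in the EXPLAIN text. The sketch below is only an illustration: `count_calls` is a hypothetical helper, the plan strings are copied from this report (including the elided file path), and the textual EXPLAIN format is assumed here rather than guaranteed to be stable.

```python
import re

# Plans copied verbatim from the report above.
PLAN_51 = """\
ProjectionExec: expr=[r@0 as x, r@0 as y]
  ProjectionExec: expr=[random() as r]
    DataSourceExec: file_groups={...t.parquet}, file_type=parquet
"""

PLAN_52 = """\
DataSourceExec: file_groups={...t.parquet}, projection=[random() as x, random() as y], file_type=parquet
"""


def count_calls(plan_text: str, func: str = "random") -> int:
    # Count textual occurrences of `func(` anywhere in the plan.
    return len(re.findall(rf"\b{re.escape(func)}\(", plan_text))


# The query calls random() once, so more than one occurrence means duplication.
print(count_calls(PLAN_51))
print(count_calls(PLAN_52))
```

For this query the expected count is 1, so a count of 2 flags the duplicated call.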

Expected behavior

A volatile/non-deterministic expression aliased in a subquery should be evaluated once and reused by later references, as in 51.0.0. The optimizer should not inline/duplicate a volatile expression when pushing a projection into a scan (cf. #10337 for the CTE analogue).
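The kind of guard this asks for can be sketched on a toy model. This is not DataFusion's optimizer code: `Expr`, `can_inline`, and `merge_projections` are hypothetical names, and the rule shown (refuse to inline a volatile expression that is referenced more than once) is an assumption about one reasonable fix, not the project's actual implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    sql: str
    volatile: bool = False


def can_inline(inner, outer_refs):
    # Inlining is safe only if no volatile inner expression
    # is referenced more than once by the outer projection.
    uses = Counter(outer_refs)
    return all(not inner[name].volatile or n <= 1 for name, n in uses.items())


def merge_projections(inner, outer):
    # `inner` maps alias -> Expr; `outer` is a list of (output_name, inner_alias).
    # Returns the merged projection, or None to keep the two stages separate.
    refs = [alias for _, alias in outer]
    if not can_inline(inner, refs):
        return None
    return [(out, inner[alias].sql) for out, alias in outer]


inner = {"r": Expr("random()", volatile=True)}
print(merge_projections(inner, [("x", "r"), ("y", "r")]))  # None: keep both stages
print(merge_projections(inner, [("x", "r")]))              # single use can be inlined
```

Deterministic expressions are still merged freely in this model, so the guard only blocks the case reported here.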

Additional context

  • Regression introduced in 52.0.0 (51.0.0 correct; 52.0.0 and 53.0.0 affected).
  • Reproduces with Parquet and CSV file scans; not with in-memory tables.
  • Surfaced downstream in Ibis, which relies on the subquery-aliasing pattern to "lock in" random()/uuid() values (ibis/backends/tests/test_impure.py::test_impure_correlated and ::test_chained_selections). Equivalent Ibis reproducer:
import ibis
from ibis import _

con = ibis.datafusion.connect()
ibis.memtable({"id": [1, 2, 3]}).to_parquet("t.parquet")  # file-backed; bug needs a file scan
t = con.read_parquet("t.parquet")

expr = t.select(common=ibis.random()).select(x=_.common, y=_.common)
df = expr.execute()
print((df.x == df.y).all())   # True on 51.0.0, False on >= 52.0.0

Generated-by: Claude Opus 4.8 noreply@anthropic.com
