Describe the bug
When a non-deterministic/volatile function (e.g. random(), uuid()) is computed once in a subquery and then referenced multiple times in the outer projection, DataFusion >= 52.0.0 pushes the outer projection into the file-scan DataSourceExec and inlines the subquery alias, turning the single call into N independent calls.
Two references to what should be the same "locked-in" value then diverge. This worked correctly in 51.0.0 and regressed in 52.0.0 (both 52.0.0 and 53.0.0 are affected).
It only reproduces with a file scan (Parquet/CSV); an in-memory MemTable is not affected, which points at projection pushdown into the file source.
To Reproduce
datafusion-cli:
COPY (SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 3) TO 't.parquet';
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 't.parquet';
EXPLAIN
SELECT s.r AS x, s.r AS y
FROM (SELECT random() AS r FROM t) AS s;
51.0.0 — correct (random() evaluated once, then reused):
ProjectionExec: expr=[r@0 as x, r@0 as y]
  ProjectionExec: expr=[random() as r]
    DataSourceExec: file_groups={...t.parquet}, file_type=parquet
52.0.0 / 53.0.0 — incorrect (random() inlined and duplicated):
DataSourceExec: file_groups={...t.parquet}, projection=[random() as x, random() as y], file_type=parquet
Executing the query confirms x != y on 53.0.0, whereas x == y on 51.0.0.
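To make the difference between the two plans concrete, here is a toy Python model of their shapes. It is not DataFusion code: the function names `plan_51` and `plan_52` are hypothetical, and Python's `random.Random` stands in for the SQL `random()` function. It only illustrates why evaluating once and reusing keeps x and y equal, while inlining the alias makes them diverge.

```python
import random


def plan_51(num_rows, rng):
    """Mimic the 51.0.0 plan shape: the inner projection evaluates
    random() once per row and the outer projection reuses that column."""
    rows = []
    for _ in range(num_rows):
        r = rng.random()     # inner projection: random() as r
        rows.append((r, r))  # outer projection: r as x, r as y
    return rows


def plan_52(num_rows, rng):
    """Mimic the >= 52.0.0 plan shape: the alias is inlined, so each
    output column calls random() independently."""
    return [(rng.random(), rng.random()) for _ in range(num_rows)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows_51 = plan_51(3, rng)
rows_52 = plan_52(3, rng)

print(all(x == y for x, y in rows_51))  # True: one call per row, reused
print(all(x != y for x, y in rows_52))  # two independent calls per row
```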
Expected behavior
A volatile/non-deterministic expression aliased in a subquery should be evaluated once and reused by later references, as in 51.0.0. The optimizer should not inline/duplicate a volatile expression when pushing a projection into a scan (cf. #10337 for the CTE analogue).
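One possible shape for such a guard is sketched below in plain Python. This is an assumption for discussion, not DataFusion's optimizer code: the `Call` and `Column` types, the `VOLATILE_FUNCTIONS` set, and `can_inline` are all hypothetical names. The rule it encodes is that an aliased expression may be inlined into the outer projection unless it is volatile and referenced more than once. A stricter variant could refuse to inline any volatile expression.

```python
from dataclasses import dataclass

# Hypothetical volatility registry, for this sketch only.
VOLATILE_FUNCTIONS = {"random", "uuid"}


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    name: str
    args: tuple = ()


def is_volatile(expr):
    """True if the expression or any of its arguments calls a volatile function."""
    if isinstance(expr, Call):
        return expr.name in VOLATILE_FUNCTIONS or any(
            is_volatile(arg) for arg in expr.args
        )
    return False


def count_refs(expr, alias):
    """Count how many times the column `alias` appears inside `expr`."""
    if isinstance(expr, Column):
        return 1 if expr.name == alias else 0
    if isinstance(expr, Call):
        return sum(count_refs(arg, alias) for arg in expr.args)
    return 0


def can_inline(alias, inner_expr, outer_exprs):
    """Allow inlining unless it would duplicate a volatile expression."""
    if not is_volatile(inner_expr):
        return True
    total_refs = sum(count_refs(expr, alias) for expr in outer_exprs)
    return total_refs <= 1


# The reproducer: SELECT s.r AS x, s.r AS y FROM (SELECT random() AS r ...)
outer = [Column("r"), Column("r")]
print(can_inline("r", Call("random"), outer))                 # False
print(can_inline("r", Call("abs", (Column("id"),)), outer))   # True
```

With a check of this kind, the reproducer would keep the inner projection as a separate step, as in the 51.0.0 plan, while deterministic expressions could still be pushed into the scan.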
Additional context
- Regression introduced in 52.0.0 (51.0.0 correct; 52.0.0 and 53.0.0 affected).
- Reproduces with Parquet and CSV file scans; not with in-memory tables.
- Surfaced downstream in Ibis, which relies on the subquery-aliasing pattern to "lock in" random()/uuid() values (ibis/backends/tests/test_impure.py::test_impure_correlated and ::test_chained_selections). Equivalent Ibis reproducer:
import ibis
from ibis import _
con = ibis.datafusion.connect()
ibis.memtable({"id": [1, 2, 3]}).to_parquet("t.parquet") # file-backed; bug needs a file scan
t = con.read_parquet("t.parquet")
expr = t.select(common=ibis.random()).select(x=_.common, y=_.common)
df = expr.execute()
print((df.x == df.y).all()) # True on 51.0.0, False on >= 52.0.0
Generated-by: Claude Opus 4.8 noreply@anthropic.com