Skip to content

Improve Subquery Joins #79

@theory

Description

@theory

#58 improved quite a few subquery joins, specifically SEMI JOINs, but others still don't push down consistently. An example using the taxi data from the tutorial:

create table trips_local as SELECT * FROM TRIPS limit 100;
explain (verbose)
SELECT *
 FROM (select vendor_id, count(*) num FROM trips group by vendor_id) AS agg
 JOIN trips_local l ON agg.vendor_id = l.vendor_id;

Produces this plan:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     QUERY PLAN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=9.32..22.02 rows=100 width=287)
   Output: trips.vendor_id, (count(*)), l.trip_id, l.vendor_id, l.pickup_date, l.pickup_datetime, l.dropoff_date, l.dropoff_datetime, l.store_and_fwd_flag, l.rate_code_id, l.pickup_longitude, l.pickup_latitude, l.dropoff_longitude, l.dropoff_latitude, l.passenger_count, l.trip_distance, l.fare_amount, l.extra, l.mta_tax, l.tip_amount, l.tolls_amount, l.ehail_fee, l.improvement_surcharge, l.total_amount, l.payment_type, l.trip_type, l.pickup, l.dropoff, l.cab_type, l.pickup_nyct2010_gid, l.pickup_ctlabel, l.pickup_borocode, l.pickup_ct2010, l.pickup_boroct2010, l.pickup_cdeligibil, l.pickup_ntacode, l.pickup_ntaname, l.pickup_puma, l.dropoff_nyct2010_gid, l.dropoff_ctlabel, l.dropoff_borocode, l.dropoff_ct2010, l.dropoff_boroct2010, l.dropoff_cdeligibil, l.dropoff_ntacode, l.dropoff_ntaname, l.dropoff_puma
   Merge Cond: (trips.vendor_id = l.vendor_id)
   ->  GroupAggregate  (cost=1.00..12.50 rows=200 width=40)
         Output: trips.vendor_id, count(*)
         Group Key: trips.vendor_id
         ->  Foreign Scan on taxi.trips  (cost=1.00..5.50 rows=1000 width=32)
               Output: trips.trip_id, trips.vendor_id, trips.pickup_date, trips.pickup_datetime, trips.dropoff_date, trips.dropoff_datetime, trips.store_and_fwd_flag, trips.rate_code_id, trips.pickup_longitude, trips.pickup_latitude, trips.dropoff_longitude, trips.dropoff_latitude, trips.passenger_count, trips.trip_distance, trips.fare_amount, trips.extra, trips.mta_tax, trips.tip_amount, trips.tolls_amount, trips.ehail_fee, trips.improvement_surcharge, trips.total_amount, trips.payment_type, trips.trip_type, trips.pickup, trips.dropoff, trips.cab_type, trips.pickup_nyct2010_gid, trips.pickup_ctlabel, trips.pickup_borocode, trips.pickup_ct2010, trips.pickup_boroct2010, trips.pickup_cdeligibil, trips.pickup_ntacode, trips.pickup_ntaname, trips.pickup_puma, trips.dropoff_nyct2010_gid, trips.dropoff_ctlabel, trips.dropoff_borocode, trips.dropoff_ct2010, trips.dropoff_boroct2010, trips.dropoff_cdeligibil, trips.dropoff_ntacode, trips.dropoff_ntaname, trips.dropoff_puma
               Remote SQL: SELECT vendor_id FROM taxi.trips ORDER BY vendor_id ASC NULLS LAST
   ->  Sort  (cost=8.32..8.57 rows=100 width=247)
         Output: l.trip_id, l.vendor_id, l.pickup_date, l.pickup_datetime, l.dropoff_date, l.dropoff_datetime, l.store_and_fwd_flag, l.rate_code_id, l.pickup_longitude, l.pickup_latitude, l.dropoff_longitude, l.dropoff_latitude, l.passenger_count, l.trip_distance, l.fare_amount, l.extra, l.mta_tax, l.tip_amount, l.tolls_amount, l.ehail_fee, l.improvement_surcharge, l.total_amount, l.payment_type, l.trip_type, l.pickup, l.dropoff, l.cab_type, l.pickup_nyct2010_gid, l.pickup_ctlabel, l.pickup_borocode, l.pickup_ct2010, l.pickup_boroct2010, l.pickup_cdeligibil, l.pickup_ntacode, l.pickup_ntaname, l.pickup_puma, l.dropoff_nyct2010_gid, l.dropoff_ctlabel, l.dropoff_borocode, l.dropoff_ct2010, l.dropoff_boroct2010, l.dropoff_cdeligibil, l.dropoff_ntacode, l.dropoff_ntaname, l.dropoff_puma
         Sort Key: l.vendor_id
         ->  Seq Scan on taxi.trips_local l  (cost=0.00..5.00 rows=100 width=247)
               Output: l.trip_id, l.vendor_id, l.pickup_date, l.pickup_datetime, l.dropoff_date, l.dropoff_datetime, l.store_and_fwd_flag, l.rate_code_id, l.pickup_longitude, l.pickup_latitude, l.dropoff_longitude, l.dropoff_latitude, l.passenger_count, l.trip_distance, l.fare_amount, l.extra, l.mta_tax, l.tip_amount, l.tolls_amount, l.ehail_fee, l.improvement_surcharge, l.total_amount, l.payment_type, l.trip_type, l.pickup, l.dropoff, l.cab_type, l.pickup_nyct2010_gid, l.pickup_ctlabel, l.pickup_borocode, l.pickup_ct2010, l.pickup_boroct2010, l.pickup_cdeligibil, l.pickup_ntacode, l.pickup_ntaname, l.pickup_puma, l.dropoff_nyct2010_gid, l.dropoff_ctlabel, l.dropoff_borocode, l.dropoff_ct2010, l.dropoff_boroct2010, l.dropoff_cdeligibil, l.dropoff_ntacode, l.dropoff_ntaname, l.dropoff_puma
(14 rows)

Time: 33.099 ms

Which seems wild, frankly. Figure out whether it's possible to convince PostgreSQL to push down the list of l.vendor_id values to query directly against the trips table in ClickHouse.

Another

Another variant:

taxi=# explain (verbose) WITH agg AS (select vendor_id, count(*) num FROM trips group by vendor_id) select agg.vendor_id from agg;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     QUERY PLAN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on agg  (cost=1.00..12.00 rows=200 width=32)
   Output: agg.vendor_id
   ->  GroupAggregate  (cost=1.00..10.00 rows=200 width=40)
         Output: trips.vendor_id, NULL::bigint
         Group Key: trips.vendor_id
         ->  Foreign Scan on taxi.trips  (cost=1.00..5.50 rows=1000 width=32)
               Output: trips.trip_id, trips.vendor_id, trips.pickup_date, trips.pickup_datetime, trips.dropoff_date, trips.dropoff_datetime, trips.store_and_fwd_flag, trips.rate_code_id, trips.pickup_longitude, trips.pickup_latitude, trips.dropoff_longitude, trips.dropoff_latitude, trips.passenger_count, trips.trip_distance, trips.fare_amount, trips.extra, trips.mta_tax, trips.tip_amount, trips.tolls_amount, trips.ehail_fee, trips.improvement_surcharge, trips.total_amount, trips.payment_type, trips.trip_type, trips.pickup, trips.dropoff, trips.cab_type, trips.pickup_nyct2010_gid, trips.pickup_ctlabel, trips.pickup_borocode, trips.pickup_ct2010, trips.pickup_boroct2010, trips.pickup_cdeligibil, trips.pickup_ntacode, trips.pickup_ntaname, trips.pickup_puma, trips.dropoff_nyct2010_gid, trips.dropoff_ctlabel, trips.dropoff_borocode, trips.dropoff_ct2010, trips.dropoff_boroct2010, trips.dropoff_cdeligibil, trips.dropoff_ntacode, trips.dropoff_ntaname, trips.dropoff_puma
               Remote SQL: SELECT vendor_id FROM taxi.trips ORDER BY vendor_id ASC NULLS LAST
(8 rows)

LIMIT Subquery

Here's another one:

taxi=# explain SELECT * FROM (select vendor_id, count(*) num FROM trips group by vendor_id) LIMIT 10;
                        QUERY PLAN                        
----------------------------------------------------------
 Limit  (cost=1.00..1.04 rows=10 width=40)
   ->  Foreign Scan  (cost=1.00..5.10 rows=1000 width=40)
         Relations: Aggregate on (trips)

Could LIMIT 10 be pushed down? Maybe it's enough to tell users to put the LIMIT in the subquery:

taxi=# explain SELECT * FROM (select vendor_id, count(*) num FROM trips group by vendor_id LIMIT 10);
                    QUERY PLAN                     
---------------------------------------------------
 Foreign Scan  (cost=0.00..-10.00 rows=1 width=40)
   Relations: Aggregate on (trips)

Metadata

Metadata

Labels

pushdownImprovements to query pushdown
No fields configured for Feature.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions