Parallel Full Join and Right Join#1635

Open
avamingli wants to merge 5 commits into apache:main from avamingli:p_full_join
Conversation

@avamingli
Contributor

Parallel Hash Join has been in PostgreSQL for a while, but FULL and RIGHT outer joins were
deliberately left out of the initial implementation: the per-batch barrier protocol had a
deadlock risk that nobody had solved yet. Upstream PostgreSQL finally cracked it by adding a
dedicated scan phase (PHJ_BATCH_SCAN): after all workers finish probing, one elected worker
walks the hash table looking for unmatched inner rows and emits them, while the rest move on
to other batches or finish early. It is a clean solution, and the upstream fix is compact.
We are in the process of cherry-picking upstream work into CBDB, and this PR brings in those
three upstream commits as a foundation.
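
The election above can be sketched with a toy simulation. This is not PostgreSQL code; `BatchBarrier`, `worker`, and the row representation are illustrative stand-ins for the real Barrier machinery, but they show the key property: every worker arrives-and-detaches, and only the last one to detach is elected to emit the unmatched inner rows.

```python
# Toy simulation of the PHJ_BATCH_SCAN election (illustrative names,
# not PostgreSQL's actual Barrier API).
import threading

class BatchBarrier:
    """Minimal stand-in for a batch barrier: arrive_and_detach()
    returns True only for the last attached participant."""
    def __init__(self, participants):
        self.lock = threading.Lock()
        self.attached = participants

    def arrive_and_detach(self):
        with self.lock:
            self.attached -= 1
            return self.attached == 0

def worker(barrier, hash_table, matched, out, out_lock):
    # ... probing phase would run here ...
    if barrier.arrive_and_detach():
        # Elected: emit NULL-extended rows for unmatched inner tuples.
        for row in hash_table:
            if row not in matched:
                with out_lock:
                    out.append((None, row))
    # Non-elected workers simply move on to other batches or finish early.

hash_table = [1, 2, 3, 4]   # inner-side rows in this batch
matched = {1, 3}            # rows that found a join partner while probing
out, out_lock = [], threading.Lock()
barrier = BatchBarrier(participants=3)
threads = [threading.Thread(target=worker,
                            args=(barrier, hash_table, matched, out, out_lock))
           for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(out))  # [(None, 2), (None, 4)]
```

Exactly one thread sees `arrive_and_detach()` return True, so the unmatched rows are emitted once, which is why the scan phase is serial but deadlock-free.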

Performance

Benchmark on a 3-segment cluster with parallel_workers = 2 per segment (6 workers total),
6 million rows per table, 50% row overlap between the two sides.

Query        Parallel   Serial    Speedup
FULL JOIN    4040 ms    6347 ms   1.57×
RIGHT JOIN   3039 ms    5568 ms   1.83×
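
The speedup column is serial time divided by parallel time; a quick sanity check of the arithmetic:

```python
# Verify the reported speedups: speedup = serial_ms / parallel_ms.
timings = {"FULL JOIN": (4040, 6347), "RIGHT JOIN": (3039, 5568)}
for query, (parallel_ms, serial_ms) in timings.items():
    print(f"{query}: {serial_ms / parallel_ms:.2f}x")
# FULL JOIN: 1.57x
# RIGHT JOIN: 1.83x
```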

Example plans (3 segments, parallel_workers=2):

PARALLEL FULL JOIN:

      -- FULL JOIN: result locus is HashedOJ with Parallel Workers: 2
      EXPLAIN(costs off, locus)
      SELECT count(*) FROM t1 FULL JOIN t2 USING (id);

       Finalize Aggregate
         Locus: Entry
         ->  Gather Motion 6:1  (slice1; segments: 6)
               ->  Partial Aggregate
                     Locus: HashedOJ
                     Parallel Workers: 2
                     ->  Parallel Hash Full Join
                           Locus: HashedOJ
                           Parallel Workers: 2
                           Hash Cond: (t1.id = t2.id)
                           ->  Parallel Seq Scan on t1
                                 Locus: HashedWorkers
                           ->  Parallel Hash
                                 ->  Parallel Seq Scan on t2
                                       Locus: HashedWorkers

PARALLEL RIGHT JOIN:

      -- RIGHT JOIN: when t1 is larger the planner hashes the smaller t2
      --             and probes with t1; result locus HashedWorkers
      EXPLAIN(costs off, locus)
      SELECT count(*) FROM t1 RIGHT JOIN t2 USING (id);

       Finalize Aggregate
         Locus: Entry
         ->  Gather Motion 6:1  (slice1; segments: 6)
               ->  Partial Aggregate
                     Locus: HashedWorkers
                     Parallel Workers: 2
                     ->  Parallel Hash Right Join
                           Locus: HashedWorkers
                           Parallel Workers: 2
                           Hash Cond: (t1.id = t2.id)
                           ->  Parallel Seq Scan on t1
                                 Locus: HashedWorkers
                           ->  Parallel Hash
                                 ->  Parallel Seq Scan on t2
                                       Locus: HashedWorkers

On top of the upstream work, CBDB's distributed execution model needed its own adaptations.
The tricky part is what happens to the result of a parallel full outer join: in a distributed
cluster the NULL-extended rows for unmatched tuples can land on any segment, so the result
cannot carry a plain Hashed locus — it needs HashedOJ. Before this PR the distributed
planner simply rejected FULL and RIGHT joins from the parallel path entirely, so every
such query fell back to serial execution on all segments. The planner now accepts both join
types, assigns the correct HashedOJ locus to a full join's output, and — importantly —
remembers to carry the parallel_workers count on that locus. Without that last detail,
any aggregate or further join sitting above the full join would see a locus with zero workers
and silently go serial, wiping out the benefit.
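
A deliberately simplified model can make the "carry the worker count" point concrete. The names below (`Locus`, `full_join_locus`) are illustrative, not CBDB's actual CdbPathLocus API; the sketch only shows why resetting `parallel_workers` to zero on the join's output locus would serialize everything above it.

```python
# Hypothetical, simplified locus model (illustrative names, not CBDB's
# real planner structures).
from dataclasses import dataclass

@dataclass
class Locus:
    kind: str              # e.g. "Hashed", "HashedOJ", "HashedWorkers"
    parallel_workers: int  # 0 means the node plans serially

def full_join_locus(outer: Locus, inner: Locus) -> Locus:
    # NULL-extended rows can land on any segment, so the result cannot
    # keep a plain "Hashed" locus; it must be "HashedOJ".  Crucially,
    # the worker count is carried over rather than reset to 0.
    return Locus("HashedOJ", parallel_workers=outer.parallel_workers)

join = full_join_locus(Locus("HashedWorkers", 2), Locus("HashedWorkers", 2))
print(join)                       # Locus(kind='HashedOJ', parallel_workers=2)
print(join.parallel_workers > 0)  # True: an aggregate above stays parallel
```

If `full_join_locus` returned `Locus("HashedOJ", 0)` instead, any node copying that locus would see zero workers and plan serially, which is exactly the silent fallback the PR description warns about.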

There is also a crash fix bundled here. A parallel NOT IN subquery join (LASJ_NOTIN) has
a special fast path: the moment any inner row with a NULL key is found, the whole join can
return empty, so the worker exits early. The problem is that this early exit happened before
the worker ever attached to the probing barrier for the current batch. Later, during shutdown,
the code still tried to arrive-and-detach from that barrier, which had zero participants and
asserted. The fix is straightforward: skip the barrier call when the early-exit condition was
triggered.
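
The shape of that fix can be sketched as a guard on shutdown. This is an illustrative Python model, not the executor code; `HashJoinBatchState` and `shutdown` are hypothetical names standing in for the worker's batch state and its cleanup path.

```python
# Illustrative sketch of the LASJ_NOTIN fix: only detach from the batch
# barrier if this worker actually attached to it.
class HashJoinBatchState:
    def __init__(self):
        self.attached_to_barrier = False
        self.participants = 0

    def attach(self):
        self.attached_to_barrier = True
        self.participants += 1

    def detach(self):
        # Detaching from a barrier with zero participants is the
        # assertion failure the PR fixes.
        assert self.participants > 0, "detach from empty barrier"
        self.participants -= 1
        self.attached_to_barrier = False

def shutdown(batch):
    # The fix: skip the arrive-and-detach when the NULL-key early exit
    # fired before the worker ever attached.
    if batch.attached_to_barrier:
        batch.detach()

batch = HashJoinBatchState()   # worker saw a NULL inner key, exited early
shutdown(batch)                # no assertion: the detach is skipped
print("clean shutdown")
```

Without the `if batch.attached_to_barrier` guard, the early-exit worker would call `detach()` on a barrier it never joined, tripping the assertion.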

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

@avamingli added the labels type: Performance, planner on Mar 24, 2026
@avamingli changed the title from "Parallel Hash Full Join and Right Join" to "Parallel Full Join and Right Join" on Mar 24, 2026
macdice and others added 5 commits March 25, 2026 20:30
Full and right outer joins were not supported in the initial
implementation of Parallel Hash Join because of deadlock hazards (see
discussion).  Therefore FULL JOIN inhibited parallelism, as the other
join strategies can't do that in parallel either.

Add a new PHJ phase PHJ_BATCH_SCAN that scans for unmatched tuples on
the inner side of one batch's hash table.  For now, sidestep the
deadlock problem by terminating parallelism there.  The last process to
arrive at that phase emits the unmatched tuples, while others detach and
are free to go and work on other batches, if there are any, but
otherwise they finish the join early.

That unfairness is considered acceptable for now, because it's better
than no parallelism at all.  The build and probe phases are run in
parallel, and the new scan-for-unmatched phase, while serial, is usually
applied to the smaller of the two relations and is either limited by
some multiple of work_mem, or it's too big and is partitioned into
batches and then the situation is improved by batch-level parallelism.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com
Hash join tuples reuse the HOT status bit to indicate match status
during hash join execution. Correct reuse requires clearing the bit in
all tuples. Serial hash join and parallel multi-batch hash join do so
upon inserting the tuple into the hashtable. Single batch parallel hash
join and batch 0 of unexpected multi-batch hash joins forgot to do this.

It hadn't come up before because hashtable tuple match bits are only
used for right and full outer joins and parallel ROJ and FOJ were
unsupported. 11c2d6f introduced support for parallel ROJ/FOJ but
neglected to ensure the match bits were reset.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/flat/CAMbWs48Nde1Mv%3DBJv6_vXmRKHMuHZm2Q_g4F6Z3_pn%2B3EV6BGQ%40mail.gmail.com
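
The bit-reuse described in this commit can be illustrated with a toy bitmask. The flag value and function names below are hypothetical, not PostgreSQL's actual HeapTupleHeader constants; the point is only that a bit reused as the match flag must be cleared on every insertion path.

```python
# Toy illustration of the match-bit reset (hypothetical flag value,
# not PostgreSQL's real tuple header bits).
HEAP_TUPLE_HAS_MATCH = 0x1

def insert_into_hashtable(table, tuple_flags):
    # The fix: always clear the reused bit on insertion.  Single-batch
    # parallel hash join (and batch 0 of unexpected multi-batch joins)
    # previously skipped this step.
    table.append(tuple_flags & ~HEAP_TUPLE_HAS_MATCH)

table = []
for flags in (0x0, 0x1, 0x3):   # some tuples arrive with the bit still set
    insert_into_hashtable(table, flags)
print(table)  # [0, 0, 2] -- match bit cleared on every tuple
```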
As reported by buildfarm member conchuela, one of the regression tests
added by 558c9d7 is having some ordering issues.  This commit adds an
ORDER BY clause to make the output more stable for the problematic
query.

Fix suggested by Tom Lane.  The plan of the query updated still uses a
parallel hash full join.

Author: Melanie Plageman
Discussion: https://postgr.es/m/623596.1684541098@sss.pgh.pa.us
PostgreSQL originally excluded FULL and RIGHT outer joins from parallel
hash join because of deadlock hazards in the per-batch barrier protocol.
PG 14 resolved this by introducing a dedicated PHJ_BATCH_SCAN phase: one
elected worker emits unmatched inner-side rows after probing, while the
others detach and move on.

In CBDB, distributed execution adds a second dimension: after a full
outer join the unmatched NULL-filled rows may come from any segment, so
the result carries a HashedOJ locus rather than a plain Hashed locus.
This change teaches the parallel planner about that:

  - FULL JOIN and RIGHT JOIN are now valid parallel join types in the
    distributed planner.  Previously they were unconditionally rejected,
    forcing serial execution across all segments.

  - The HashedOJ locus produced by a parallel full join now carries
    parallel_workers, so operators above the join (aggregates, further
    joins) can remain parallel.

  - A crash that could occur when a parallel LASJ_NOTIN (NOT IN) join
    encountered NULL inner keys is fixed.  The worker would exit early
    but the batch barrier, which was never attached to, would be touched
    on shutdown causing an assertion failure.

Example plans (3 segments, parallel_workers=2):

  -- FULL JOIN: result locus is HashedOJ with Parallel Workers: 2
  EXPLAIN(costs off, locus)
  SELECT count(*) FROM t1 FULL JOIN t2 USING (id);

   Finalize Aggregate
     Locus: Entry
     ->  Gather Motion 6:1  (slice1; segments: 6)
           ->  Partial Aggregate
                 Locus: HashedOJ
                 Parallel Workers: 2
                 ->  Parallel Hash Full Join
                       Locus: HashedOJ
                       Parallel Workers: 2
                       Hash Cond: (t1.id = t2.id)
                       ->  Parallel Seq Scan on t1
                             Locus: HashedWorkers
                       ->  Parallel Hash
                             ->  Parallel Seq Scan on t2
                                   Locus: HashedWorkers

  -- RIGHT JOIN: when t1 is larger the planner hashes the smaller t2
  --             and probes with t1; result locus HashedWorkers
  EXPLAIN(costs off, locus)
  SELECT count(*) FROM t1 RIGHT JOIN t2 USING (id);

   Finalize Aggregate
     Locus: Entry
     ->  Gather Motion 6:1  (slice1; segments: 6)
           ->  Partial Aggregate
                 Locus: HashedWorkers
                 Parallel Workers: 2
                 ->  Parallel Hash Right Join
                       Locus: HashedWorkers
                       Parallel Workers: 2
                       Hash Cond: (t1.id = t2.id)
                       ->  Parallel Seq Scan on t1
                             Locus: HashedWorkers
                       ->  Parallel Hash
                             ->  Parallel Seq Scan on t2
                                   Locus: HashedWorkers

Performance (3 segments x 2 parallel workers, 6M rows each, 50% overlap):

  FULL JOIN  parallel:  4040 ms   serial: 6347 ms   speedup: 1.57x
  RIGHT JOIN parallel:  3039 ms   serial: 5568 ms   speedup: 1.83x
cbdb_parallel.sql: add a new test block covering:
  - Parallel Hash Full Join (HashedWorkers FULL JOIN HashedWorkers
    produces HashedOJ with parallel_workers=2)
  - Parallel Hash Right Join (pj_t1 is 3x larger than pj_t2, so the
    planner hashes the smaller pj_t2 and probes with pj_t1; result
    locus HashedWorkers)
  - Correctness checks: count(*) matches serial execution
  - Locus propagation: HashedOJ(parallel) followed by INNER JOIN
    produces HashedOJ; followed by FULL JOIN produces HashedOJ

join_hash.sql/out: CBDB-specific adaptations for the upstream parallel
full join test -- disable parallel mode for tests that require serial
plans, fix SAVEPOINT inside a parallel worker context, and update
expected output to match CBDB plan shapes.