Parallel Full Join and Right Join#1635

Open
avamingli wants to merge 5 commits into apache:main from avamingli:p_full_join
Conversation

@avamingli
Contributor

Parallel Hash Join has been in PostgreSQL for a while, but FULL and RIGHT outer joins were
deliberately left out of the initial implementation: the per-batch barrier protocol had a
deadlock risk that nobody had solved yet. Upstream PostgreSQL finally cracked it by adding a
dedicated scan phase (PHJ_BATCH_SCAN): after all workers finish probing, one elected worker
walks the hash table looking for unmatched inner rows and emits them, while the rest move on
to other batches or finish early. It is a clean solution, and the upstream fix is compact.
We are in the process of cherry-picking upstream work into CBDB, and this PR brings in those
three upstream commits as a foundation.
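
The election above can be sketched with a toy simulation. This is not PostgreSQL code; `BatchBarrier`, `worker`, and the row representation are illustrative stand-ins for the real Barrier machinery, but they show the key property: every worker arrives-and-detaches, and only the last one to detach is elected to emit the unmatched inner rows.

```python
# Toy simulation of the PHJ_BATCH_SCAN election (illustrative names,
# not PostgreSQL's actual Barrier API).
import threading

class BatchBarrier:
    """Minimal stand-in for a batch barrier: arrive_and_detach()
    returns True only for the last attached participant."""
    def __init__(self, participants):
        self.lock = threading.Lock()
        self.attached = participants

    def arrive_and_detach(self):
        with self.lock:
            self.attached -= 1
            return self.attached == 0

def worker(barrier, hash_table, matched, out, out_lock):
    # ... probing phase would run here ...
    if barrier.arrive_and_detach():
        # Elected: emit NULL-extended rows for unmatched inner tuples.
        for row in hash_table:
            if row not in matched:
                with out_lock:
                    out.append((None, row))
    # Non-elected workers simply move on to other batches or finish early.

hash_table = [1, 2, 3, 4]   # inner-side rows in this batch
matched = {1, 3}            # rows that found a join partner while probing
out, out_lock = [], threading.Lock()
barrier = BatchBarrier(participants=3)
threads = [threading.Thread(target=worker,
                            args=(barrier, hash_table, matched, out, out_lock))
           for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(out))  # [(None, 2), (None, 4)]
```

Exactly one thread sees `arrive_and_detach()` return True, so the unmatched rows are emitted once, which is why the scan phase is serial but deadlock-free.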

Performance

Benchmark on a 3-segment cluster with parallel_workers = 2 per segment (6 workers total),
6 million rows per table, 50% row overlap between the two sides.

Query        Parallel   Serial    Speedup
FULL JOIN    4040 ms    6347 ms   1.57×
RIGHT JOIN   3039 ms    5568 ms   1.83×
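
The speedup column is serial time divided by parallel time; a quick sanity check of the arithmetic:

```python
# Verify the reported speedups: speedup = serial_ms / parallel_ms.
timings = {"FULL JOIN": (4040, 6347), "RIGHT JOIN": (3039, 5568)}
for query, (parallel_ms, serial_ms) in timings.items():
    print(f"{query}: {serial_ms / parallel_ms:.2f}x")
# FULL JOIN: 1.57x
# RIGHT JOIN: 1.83x
```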

Example plans (3 segments, parallel_workers=2):

PARALLEL FULL JOIN:

      -- FULL JOIN: result locus is HashedOJ with Parallel Workers: 2
      EXPLAIN(costs off, locus)
      SELECT count(*) FROM t1 FULL JOIN t2 USING (id);

       Finalize Aggregate
         Locus: Entry
         ->  Gather Motion 6:1  (slice1; segments: 6)
               ->  Partial Aggregate
                     Locus: HashedOJ
                     Parallel Workers: 2
                     ->  Parallel Hash Full Join
                           Locus: HashedOJ
                           Parallel Workers: 2
                           Hash Cond: (t1.id = t2.id)
                           ->  Parallel Seq Scan on t1
                                 Locus: HashedWorkers
                           ->  Parallel Hash
                                 ->  Parallel Seq Scan on t2
                                       Locus: HashedWorkers

PARALLEL RIGHT JOIN:

      -- RIGHT JOIN: when t1 is larger the planner hashes the smaller t2
      --             and probes with t1; result locus HashedWorkers
      EXPLAIN(costs off, locus)
      SELECT count(*) FROM t1 RIGHT JOIN t2 USING (id);

       Finalize Aggregate
         Locus: Entry
         ->  Gather Motion 6:1  (slice1; segments: 6)
               ->  Partial Aggregate
                     Locus: HashedWorkers
                     Parallel Workers: 2
                     ->  Parallel Hash Right Join
                           Locus: HashedWorkers
                           Parallel Workers: 2
                           Hash Cond: (t1.id = t2.id)
                           ->  Parallel Seq Scan on t1
                                 Locus: HashedWorkers
                           ->  Parallel Hash
                                 ->  Parallel Seq Scan on t2
                                       Locus: HashedWorkers

On top of the upstream work, CBDB's distributed execution model needed its own adaptations.
The tricky part is what happens to the result of a parallel full outer join: in a distributed
cluster the NULL-extended rows for unmatched tuples can land on any segment, so the result
cannot carry a plain Hashed locus — it needs HashedOJ. Before this PR the distributed
planner simply rejected FULL and RIGHT joins from the parallel path entirely, so every
such query fell back to serial execution on all segments. The planner now accepts both join
types, assigns the correct HashedOJ locus to a full join's output, and — importantly —
remembers to carry the parallel_workers count on that locus. Without that last detail,
any aggregate or further join sitting above the full join would see a locus with zero workers
and silently go serial, wiping out the benefit.
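
A deliberately simplified model can make the "carry the worker count" point concrete. The names below (`Locus`, `full_join_locus`) are illustrative, not CBDB's actual CdbPathLocus API; the sketch only shows why resetting `parallel_workers` to zero on the join's output locus would serialize everything above it.

```python
# Hypothetical, simplified locus model (illustrative names, not CBDB's
# real planner structures).
from dataclasses import dataclass

@dataclass
class Locus:
    kind: str              # e.g. "Hashed", "HashedOJ", "HashedWorkers"
    parallel_workers: int  # 0 means the node plans serially

def full_join_locus(outer: Locus, inner: Locus) -> Locus:
    # NULL-extended rows can land on any segment, so the result cannot
    # keep a plain "Hashed" locus; it must be "HashedOJ".  Crucially,
    # the worker count is carried over rather than reset to 0.
    return Locus("HashedOJ", parallel_workers=outer.parallel_workers)

join = full_join_locus(Locus("HashedWorkers", 2), Locus("HashedWorkers", 2))
print(join)                       # Locus(kind='HashedOJ', parallel_workers=2)
print(join.parallel_workers > 0)  # True: an aggregate above stays parallel
```

If `full_join_locus` returned `Locus("HashedOJ", 0)` instead, any node copying that locus would see zero workers and plan serially, which is exactly the silent fallback the PR description warns about.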

There is also a crash fix bundled here. A parallel NOT IN subquery join (LASJ_NOTIN) has
a special fast path: the moment any inner row with a NULL key is found, the whole join can
return empty, so the worker exits early. The problem is that this early exit happened before
the worker ever attached to the probing barrier for the current batch. Later, during shutdown,
the code still tried to arrive-and-detach from that barrier, which had zero participants and
asserted. The fix is straightforward: skip the barrier call when the early-exit condition was
triggered.
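
The shape of that fix can be sketched as a guard on shutdown. This is an illustrative Python model, not the executor code; `HashJoinBatchState` and `shutdown` are hypothetical names standing in for the worker's batch state and its cleanup path.

```python
# Illustrative sketch of the LASJ_NOTIN fix: only detach from the batch
# barrier if this worker actually attached to it.
class HashJoinBatchState:
    def __init__(self):
        self.attached_to_barrier = False
        self.participants = 0

    def attach(self):
        self.attached_to_barrier = True
        self.participants += 1

    def detach(self):
        # Detaching from a barrier with zero participants is the
        # assertion failure the PR fixes.
        assert self.participants > 0, "detach from empty barrier"
        self.participants -= 1
        self.attached_to_barrier = False

def shutdown(batch):
    # The fix: skip the arrive-and-detach when the NULL-key early exit
    # fired before the worker ever attached.
    if batch.attached_to_barrier:
        batch.detach()

batch = HashJoinBatchState()   # worker saw a NULL inner key, exited early
shutdown(batch)                # no assertion: the detach is skipped
print("clean shutdown")
```

Without the `if batch.attached_to_barrier` guard, the early-exit worker would call `detach()` on a barrier it never joined, tripping the assertion.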

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

@avamingli added the labels type: Performance, planner on Mar 24, 2026
@avamingli changed the title from "Parallel Hash Full Join and Right Join" to "Parallel Full Join and Right Join" on Mar 24, 2026
macdice and others added 5 commits March 25, 2026 20:30
Full and right outer joins were not supported in the initial
implementation of Parallel Hash Join because of deadlock hazards (see
discussion).  Therefore FULL JOIN inhibited parallelism, as the other
join strategies can't do that in parallel either.

Add a new PHJ phase PHJ_BATCH_SCAN that scans for unmatched tuples on
the inner side of one batch's hash table.  For now, sidestep the
deadlock problem by terminating parallelism there.  The last process to
arrive at that phase emits the unmatched tuples, while others detach and
are free to go and work on other batches, if there are any, but
otherwise they finish the join early.

That unfairness is considered acceptable for now, because it's better
than no parallelism at all.  The build and probe phases are run in
parallel, and the new scan-for-unmatched phase, while serial, is usually
applied to the smaller of the two relations and is either limited by
some multiple of work_mem, or it's too big and is partitioned into
batches and then the situation is improved by batch-level parallelism.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com
Hash join tuples reuse the HOT status bit to indicate match status
during hash join execution. Correct reuse requires clearing the bit in
all tuples. Serial hash join and parallel multi-batch hash join do so
upon inserting the tuple into the hashtable. Single batch parallel hash
join and batch 0 of unexpected multi-batch hash joins forgot to do this.

It hadn't come up before because hashtable tuple match bits are only
used for right and full outer joins and parallel ROJ and FOJ were
unsupported. 11c2d6f introduced support for parallel ROJ/FOJ but
neglected to ensure the match bits were reset.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/flat/CAMbWs48Nde1Mv%3DBJv6_vXmRKHMuHZm2Q_g4F6Z3_pn%2B3EV6BGQ%40mail.gmail.com
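
The bit-reuse described in this commit can be illustrated with a toy bitmask. The flag value and function names below are hypothetical, not PostgreSQL's actual HeapTupleHeader constants; the point is only that a bit reused as the match flag must be cleared on every insertion path.

```python
# Toy illustration of the match-bit reset (hypothetical flag value,
# not PostgreSQL's real tuple header bits).
HEAP_TUPLE_HAS_MATCH = 0x1

def insert_into_hashtable(table, tuple_flags):
    # The fix: always clear the reused bit on insertion.  Single-batch
    # parallel hash join (and batch 0 of unexpected multi-batch joins)
    # previously skipped this step.
    table.append(tuple_flags & ~HEAP_TUPLE_HAS_MATCH)

table = []
for flags in (0x0, 0x1, 0x3):   # some tuples arrive with the bit still set
    insert_into_hashtable(table, flags)
print(table)  # [0, 0, 2] -- match bit cleared on every tuple
```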
As reported by buildfarm member conchuela, one of the regression tests
added by 558c9d7 is having some ordering issues.  This commit adds an
ORDER BY clause to make the output more stable for the problematic
query.

Fix suggested by Tom Lane.  The plan of the query updated still uses a
parallel hash full join.

Author: Melanie Plageman
Discussion: https://postgr.es/m/623596.1684541098@sss.pgh.pa.us
PostgreSQL originally excluded FULL and RIGHT outer joins from parallel
hash join because of deadlock hazards in the per-batch barrier protocol.
PG 14 resolved this by introducing a dedicated PHJ_BATCH_SCAN phase: one
elected worker emits unmatched inner-side rows after probing, while the
others detach and move on.

In CBDB, distributed execution adds a second dimension: after a full
outer join the unmatched NULL-filled rows may come from any segment, so
the result carries a HashedOJ locus rather than a plain Hashed locus.
This change teaches the parallel planner about that:

  - FULL JOIN and RIGHT JOIN are now valid parallel join types in the
    distributed planner.  Previously they were unconditionally rejected,
    forcing serial execution across all segments.

  - The HashedOJ locus produced by a parallel full join now carries
    parallel_workers, so operators above the join (aggregates, further
    joins) can remain parallel.

  - A crash that could occur when a parallel LASJ_NOTIN (NOT IN) join
    encountered NULL inner keys is fixed.  The worker would exit early
    but the batch barrier, which was never attached to, would be touched
    on shutdown causing an assertion failure.

Example plans (3 segments, parallel_workers=2):

  -- FULL JOIN: result locus is HashedOJ with Parallel Workers: 2
  EXPLAIN(costs off, locus)
  SELECT count(*) FROM t1 FULL JOIN t2 USING (id);

   Finalize Aggregate
     Locus: Entry
     ->  Gather Motion 6:1  (slice1; segments: 6)
           ->  Partial Aggregate
                 Locus: HashedOJ
                 Parallel Workers: 2
                 ->  Parallel Hash Full Join
                       Locus: HashedOJ
                       Parallel Workers: 2
                       Hash Cond: (t1.id = t2.id)
                       ->  Parallel Seq Scan on t1
                             Locus: HashedWorkers
                       ->  Parallel Hash
                             ->  Parallel Seq Scan on t2
                                   Locus: HashedWorkers

  -- RIGHT JOIN: when t1 is larger the planner hashes the smaller t2
  --             and probes with t1; result locus HashedWorkers
  EXPLAIN(costs off, locus)
  SELECT count(*) FROM t1 RIGHT JOIN t2 USING (id);

   Finalize Aggregate
     Locus: Entry
     ->  Gather Motion 6:1  (slice1; segments: 6)
           ->  Partial Aggregate
                 Locus: HashedWorkers
                 Parallel Workers: 2
                 ->  Parallel Hash Right Join
                       Locus: HashedWorkers
                       Parallel Workers: 2
                       Hash Cond: (t1.id = t2.id)
                       ->  Parallel Seq Scan on t1
                             Locus: HashedWorkers
                       ->  Parallel Hash
                             ->  Parallel Seq Scan on t2
                                   Locus: HashedWorkers

Performance (3 segments x 2 parallel workers, 6M rows each, 50% overlap):

  FULL JOIN  parallel:  4040 ms   serial: 6347 ms   speedup: 1.57x
  RIGHT JOIN parallel:  3039 ms   serial: 5568 ms   speedup: 1.83x
cbdb_parallel.sql: add a new test block covering:
  - Parallel Hash Full Join (HashedWorkers FULL JOIN HashedWorkers
    produces HashedOJ with parallel_workers=2)
  - Parallel Hash Right Join (pj_t1 is 3x larger than pj_t2, so the
    planner hashes the smaller pj_t2 and probes with pj_t1; result
    locus HashedWorkers)
  - Correctness checks: count(*) matches serial execution
  - Locus propagation: HashedOJ(parallel) followed by INNER JOIN
    produces HashedOJ; followed by FULL JOIN produces HashedOJ

join_hash.sql/out: CBDB-specific adaptations for the upstream parallel
full join test -- disable parallel mode for tests that require serial
plans, fix SAVEPOINT inside a parallel worker context, and update
expected output to match CBDB plan shapes.