Full and right outer joins were not supported in the initial implementation of Parallel Hash Join because of deadlock hazards (see discussion). Therefore FULL JOIN inhibited parallelism, as the other join strategies can't do that in parallel either.

Add a new PHJ phase PHJ_BATCH_SCAN that scans for unmatched tuples on the inner side of one batch's hash table. For now, sidestep the deadlock problem by terminating parallelism there. The last process to arrive at that phase emits the unmatched tuples, while others detach and are free to go and work on other batches, if there are any, but otherwise they finish the join early. That unfairness is considered acceptable for now, because it's better than no parallelism at all.

The build and probe phases are run in parallel, and the new scan-for-unmatched phase, while serial, is usually applied to the smaller of the two relations and is either limited by some multiple of work_mem, or it's too big and is partitioned into batches and then the situation is improved by batch-level parallelism.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com
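The election described above (last worker to arrive scans, the rest detach) can be illustrated with a toy Python sketch. This is not PostgreSQL's C implementation; the data layout, names, and the lock standing in for the batch barrier are all illustrative.

```python
# Toy sketch of the PHJ_BATCH_SCAN idea: workers probe a shared hash table in
# parallel, marking matched inner tuples; the LAST worker to arrive at the
# scan phase serially emits unmatched inner tuples, the others just detach.
import threading

inner = {1: "a", 2: "b", 3: "c", 4: "d"}       # inner side, already hashed
matched = {k: False for k in inner}            # per-tuple match bits
outer_parts = [[1, 3], [2, 5]]                 # outer keys split across workers
unmatched_out = []                             # NULL-extended inner rows
attached = len(outer_parts)                    # workers attached to this batch
lock = threading.Lock()

def worker(keys):
    global attached
    for k in keys:                             # probe phase runs in parallel
        if k in inner:
            matched[k] = True
    with lock:                                 # "arrive" at the scan phase
        attached -= 1
        elected = (attached == 0)              # last to arrive does the scan
    if elected:
        for k, v in inner.items():             # serial scan for unmatched rows
            if not matched[k]:
                unmatched_out.append((None, k, v))
    # everyone else detaches and is free to work on another batch

threads = [threading.Thread(target=worker, args=(p,)) for p in outer_parts]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each worker only decrements the counter after finishing its own probes, the elected worker is guaranteed to see a fully populated set of match bits before it scans.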
Hash join tuples reuse the HOT status bit to indicate match status during hash join execution. Correct reuse requires clearing the bit in all tuples. Serial hash join and parallel multi-batch hash join do so upon inserting the tuple into the hashtable. Single-batch parallel hash join and batch 0 of unexpected multi-batch hash joins forgot to do this.

It hadn't come up before because hashtable tuple match bits are only used for right and full outer joins, and parallel ROJ and FOJ were unsupported. 11c2d6f introduced support for parallel ROJ/FOJ but neglected to ensure the match bits were reset.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/flat/CAMbWs48Nde1Mv%3DBJv6_vXmRKHMuHZm2Q_g4F6Z3_pn%2B3EV6BGQ%40mail.gmail.com
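The consequence of a stale match bit is easy to demonstrate with a toy model. The sketch below is illustrative Python, not executor code: a "dirty" bit left over from the tuple's previous life makes a never-matched inner tuple look matched, so a full join silently drops its NULL-extended row.

```python
# Toy illustration of why the reused match bit must be cleared at insert
# time. Each inner row carries a fake pre-existing bit (standing in for the
# recycled HOT status bit) that may arrive dirty.
def hash_join_full(inner_rows, outer_keys, clear_bit_on_insert):
    table = {}
    for key, payload, dirty_bit in inner_rows:
        bit = False if clear_bit_on_insert else dirty_bit
        table[key] = [payload, bit]              # [value, match flag]
    out = []
    for k in outer_keys:                         # probe: mark real matches
        if k in table:
            table[k][1] = True
            out.append((k, table[k][0]))
        else:
            out.append((k, None))                # outer row with no match
    for key, (payload, bit) in table.items():    # scan for unmatched inner rows
        if not bit:
            out.append((None, payload))
    return out

inner = [(1, "a", True), (2, "b", False)]        # tuple 1 has a stale bit
buggy = hash_join_full(inner, [2], clear_bit_on_insert=False)
fixed = hash_join_full(inner, [2], clear_bit_on_insert=True)
# buggy drops the NULL-extended row for inner tuple 1; fixed emits it
```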
As reported by buildfarm member conchuela, one of the regression tests added by 558c9d7 is having some ordering issues. This commit adds an ORDER BY clause to make the output more stable for the problematic query. Fix suggested by Tom Lane. The updated query's plan still uses a parallel hash full join.

Author: Melanie Plageman
Discussion: https://postgr.es/m/623596.1684541098@sss.pgh.pa.us
PostgreSQL originally excluded FULL and RIGHT outer joins from parallel
hash join because of deadlock hazards in the per-batch barrier protocol.
PG 14 resolved this by introducing a dedicated PHJ_BATCH_SCAN phase: one
elected worker emits unmatched inner-side rows after probing, while the
others detach and move on.
In CBDB, distributed execution adds a second dimension: after a full
outer join the unmatched NULL-filled rows may come from any segment, so
the result carries a HashedOJ locus rather than a plain Hashed locus.
This change teaches the parallel planner about that:
- FULL JOIN and RIGHT JOIN are now valid parallel join types in the
distributed planner. Previously they were unconditionally rejected,
forcing serial execution across all segments.
- The HashedOJ locus produced by a parallel full join now carries
parallel_workers, so operators above the join (aggregates, further
joins) can remain parallel.
- A crash that could occur when a parallel LASJ_NOTIN (NOT IN) join
  encountered NULL inner keys is fixed. The worker would exit early,
  but the batch barrier, which it had never attached to, was still
  touched on shutdown, causing an assertion failure.
Example plans (3 segments, parallel_workers=2):
-- FULL JOIN: result locus is HashedOJ with Parallel Workers: 2
EXPLAIN(costs off, locus)
SELECT count(*) FROM t1 FULL JOIN t2 USING (id);
Finalize Aggregate
Locus: Entry
-> Gather Motion 6:1 (slice1; segments: 6)
-> Partial Aggregate
Locus: HashedOJ
Parallel Workers: 2
-> Parallel Hash Full Join
Locus: HashedOJ
Parallel Workers: 2
Hash Cond: (t1.id = t2.id)
-> Parallel Seq Scan on t1
Locus: HashedWorkers
-> Parallel Hash
-> Parallel Seq Scan on t2
Locus: HashedWorkers
-- RIGHT JOIN: when t1 is larger the planner hashes the smaller t2
-- and probes with t1; result locus HashedWorkers
EXPLAIN(costs off, locus)
SELECT count(*) FROM t1 RIGHT JOIN t2 USING (id);
Finalize Aggregate
Locus: Entry
-> Gather Motion 6:1 (slice1; segments: 6)
-> Partial Aggregate
Locus: HashedWorkers
Parallel Workers: 2
-> Parallel Hash Right Join
Locus: HashedWorkers
Parallel Workers: 2
Hash Cond: (t1.id = t2.id)
-> Parallel Seq Scan on t1
Locus: HashedWorkers
-> Parallel Hash
-> Parallel Seq Scan on t2
Locus: HashedWorkers
Performance (3 segments x 2 parallel workers, 6M rows each, 50% overlap):
FULL JOIN parallel: 4040 ms serial: 6347 ms speedup: 1.57x
RIGHT JOIN parallel: 3039 ms serial: 5568 ms speedup: 1.83x
cbdb_parallel.sql: add a new test block covering:
- Parallel Hash Full Join (HashedWorkers FULL JOIN HashedWorkers
produces HashedOJ with parallel_workers=2)
- Parallel Hash Right Join (pj_t1 is 3x larger than pj_t2, so the
planner hashes the smaller pj_t2 and probes with pj_t1; result
locus HashedWorkers)
- Correctness checks: count(*) matches serial execution
- Locus propagation: HashedOJ(parallel) followed by INNER JOIN
produces HashedOJ; followed by FULL JOIN produces HashedOJ
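The locus-propagation behavior exercised by these tests can be summarized as a tiny rule table. This is a hedged sketch encoding only the cases stated in this PR, not CBDB's full motion-planning logic; the function name and string loci are illustrative.

```python
# Hedged sketch of the locus rules checked by the tests above: a FULL JOIN
# produces HashedOJ (NULL-filled rows may land on any segment), and once a
# HashedOJ locus appears it sticks through further joins.
def join_locus(outer_locus, inner_locus, jointype):
    if jointype == "FULL":
        return "HashedOJ"
    if "HashedOJ" in (outer_locus, inner_locus):
        return "HashedOJ"
    return outer_locus          # e.g. RIGHT JOIN keeps the probe side's locus

assert join_locus("HashedWorkers", "HashedWorkers", "FULL") == "HashedOJ"
assert join_locus("HashedOJ", "HashedWorkers", "INNER") == "HashedOJ"
assert join_locus("HashedOJ", "HashedWorkers", "FULL") == "HashedOJ"
assert join_locus("HashedWorkers", "HashedWorkers", "RIGHT") == "HashedWorkers"
```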
join_hash.sql/out: CBDB-specific adaptations for the upstream parallel
full join test -- disable parallel mode for tests that require serial
plans, fix SAVEPOINT inside a parallel worker context, and update
expected output to match CBDB plan shapes.
Parallel Hash Join has been in PostgreSQL for a while, but FULL and RIGHT outer joins were deliberately left out of the initial implementation: the per-batch barrier protocol had a deadlock risk that nobody had solved yet. Upstream finally cracked it by adding a dedicated scan phase (PHJ_BATCH_SCAN): after all workers finish probing, one elected worker walks the hash table looking for unmatched inner rows and emits them, while the rest move on to other batches or finish early. It is a clean solution and the upstream fix is compact. We are in the process of cherry-picking upstream work into CBDB, and this PR brings in those three upstream commits as a foundation.
Performance

Benchmark on a 3-segment cluster with parallel_workers = 2 per segment (6 workers total), 6 million rows per table, and 50% row overlap between the two sides.

FULL JOIN    parallel: 4040 ms   serial: 6347 ms   speedup: 1.57x
RIGHT JOIN   parallel: 3039 ms   serial: 5568 ms   speedup: 1.83x

Example plans for the parallel FULL JOIN and the parallel RIGHT JOIN (3 segments, parallel_workers=2) appear in the commit message above.
On top of the upstream work, CBDB's distributed execution model needed its own adaptations. The tricky part is what happens to the result of a parallel full outer join: in a distributed cluster the NULL-extended rows for unmatched tuples can land on any segment, so the result cannot carry a plain Hashed locus; it needs HashedOJ. Before this PR the distributed planner simply rejected FULL and RIGHT joins from the parallel path entirely, so every such query fell back to serial execution on all segments. The planner now accepts both join types, assigns the correct HashedOJ locus to a full join's output, and, importantly, remembers to carry the parallel_workers count on that locus. Without that last detail, any aggregate or further join sitting above the full join would see a locus with zero workers and silently go serial, wiping out the benefit.
There is also a crash fix bundled here. A parallel NOT IN subquery join (LASJ_NOTIN) has a special fast path: the moment any inner row with a NULL key is found, the whole join can return empty, so the worker exits early. The problem is that this early exit happened before the worker ever attached to the probing barrier for the current batch. Later, during shutdown, the code still tried to arrive-and-detach from that barrier, which had zero participants and tripped an assertion. The fix is straightforward: skip the barrier call when the early-exit condition was triggered.
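Why that early exit is legal at all comes down to SQL's three-valued NULL semantics: `x NOT IN (...)` can never be true when the inner set contains a NULL, because `x <> NULL` is unknown for every x, so the entire anti-join result is empty. A worker-local toy sketch (illustrative only, no barriers or batching):

```python
# Toy anti-join showing the LASJ_NOTIN fast path: a single NULL inner key
# makes the whole NOT IN result empty, so the build loop may stop immediately.
def lasj_notin(outer_keys, inner_keys):
    inner = set()
    for k in inner_keys:            # "build" the hash table
        if k is None:
            return []               # early-exit point: result is surely empty
        inner.add(k)
    return [k for k in outer_keys if k not in inner]

assert lasj_notin([1, 2, 3], [2, None]) == []    # NULL on inner side: empty
assert lasj_notin([1, 2, 3], [2]) == [1, 3]      # normal anti-join otherwise
```

The fix in this PR only has to make the shutdown path aware that a worker which took this return never attached to the batch barrier.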
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel

Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions