gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only) #146307
Conversation
When the tier 2 optimizer can prove that an operand to a float
operation is uniquely referenced (refcount 1), mutate it in place
instead of allocating a new PyFloatObject.
New tier 2 micro-ops:
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE (unique LHS)
- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE_RIGHT (unique RHS)
- _UNARY_NEGATIVE_FLOAT_INPLACE (unique operand)
Speeds up the pyperformance nbody benchmark by ~19%.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid compound assignment (+=, -=, *=) directly on ob_fval in inplace float ops. On 32-bit Windows, this generates JIT stencils with _xmm register references that MSVC cannot parse. Instead, read into a local double, compute, and write back.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add -mno-sse to clang args for i686-pc-windows-msvc target. The COFF32 stencil converter cannot handle _xmm register references that clang emits for inline float arithmetic. Using x87 FPU instructions avoids this. SSE is optional on 32-bit x86; x87 is the baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Selected pyperformance benchmarks:
Python/optimizer_bytecodes.c
r = right;
if (PyJitRef_IsUnique(left)) {
    ADD_OP(_BINARY_OP_SUBTRACT_FLOAT_INPLACE, 0, 0);
    l = PyJitRef_Borrow(left);
Isn't it more correct to say l = sym_new_null(ctx);? Same for below?
To make this work I had to change the definition _POP_TOP_FLOAT. I think the change is ok, but please double check.
It should be PyJitRef_Borrow(sym_new_null(ctx)), sorry for not being clear.
That way there's no need to change the case in POP_TOP_FLOAT.
Thanks for clarifying. Now _POP_TOP_FLOAT is unchanged.
markshannon
left a comment
This looks like a nice performance improvement for floating-point code.
I've a few comments.
// Note: read into a local double and write back to avoid compound
// assignment (+=) on ob_fval, which generates problematic JIT
// stencils on i686-pc-windows-msvc.
tier2 op(_BINARY_OP_ADD_FLOAT_INPLACE, (left, right -- res, l, r)) {
This op and its variants all share a lot of common code.
Could you factor out the code into a macro to perform the inplace operation?
Like:
tier2 op(_BINARY_OP_ADD_FLOAT_INPLACE, (left, right -- res, l, r)) {
    res = FLOAT_INPLACE_OP(left, +, right);
    l = PyStackRef_NULL;
    r = right;
    INPUTS_DEAD();
}

tier2 op(_BINARY_OP_MULTIPLY_FLOAT_INPLACE_RIGHT, (left, right -- res, l, r)) {
    res = FLOAT_INPLACE_OP(right, *, left);
    l = left;
    r = PyStackRef_NULL;
    INPUTS_DEAD();
}
Normal C macros are not allowed in the bytecodes.c opcodes (they are not expanded). I have not yet found a way to refactor this nicely.
We could add a new macro to the DSL (like INPUTS_DEAD) for this, but it feels a bit odd to add something to the DSL for this particular case.
Normal C macros are not allowed in the bytecodes.c opcodes
They are. You just need to define them in ceval_macros.h. Would you like me to do this, or would you like to do it?
Co-authored-by: Ken Jin <kenjin4096@gmail.com>
The inplace ops set l or r to PyStackRef_NULL at runtime, so the optimizer should model this as sym_new_null(ctx) rather than PyJitRef_Borrow(). Both produce _POP_TOP_NOP but sym_new_null correctly matches the runtime semantics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	Include/internal/pycore_uop_ids.h
Fidget-Spinner
left a comment
Awesome. Thanks for doing this!
In the future, this will be partially superseded by the partial evaluation pass, as any Py_REFCNT(f) == 1 op is also safe to unbox. However, for 3.15 this is a good win.
Good to know! Given that, which of the operations listed in the followup section of the first post do you think would be candidates to add in new PRs?
I think they might be worthwhile, hard to tell.
We can add the following tier 2 micro-ops that mutate the uniquely-referenced operand:
- `_BINARY_OP_ADD_FLOAT_INPLACE` / `_INPLACE_RIGHT` (unique LHS / RHS)
- `_BINARY_OP_SUBTRACT_FLOAT_INPLACE` / `_INPLACE_RIGHT` (unique LHS / RHS)
- `_BINARY_OP_MULTIPLY_FLOAT_INPLACE` / `_INPLACE_RIGHT` (unique LHS / RHS)
- `_UNARY_NEGATIVE_FLOAT_INPLACE` (unique operand)

The `_RIGHT` variants handle commutative ops (add, multiply) plus subtract when only the RHS is unique. The optimizer emits these in `optimizer_bytecodes.c` when `PyJitRef_IsUnique(left)` or `PyJitRef_IsUnique(right)` is true and the operand is a known float. The mutated operand is marked as borrowed so the following `_POP_TOP` becomes `_POP_TOP_NOP`.

Micro-benchmarks:
- `total += a*b + c`
- `total += a + b`
- `total += a*b + c*d`

pyperformance nbody (20k iterations):
Followup
Some operations that could be added in follow-up PRs:

- division of floats (same as this PR, but needs to handle division by zero)
- operations on a float and a (compact) int with a uniquely referenced float
- integer operations (but this is more involved because of small ints and the number of digits)
- operations with complex numbers (same as this PR, but the use case for complex is probably smaller)
Script