PERF: use Cython for SparseArray groupby operations #64758
Draft
jbrockmendel wants to merge 2 commits into pandas-dev:main
Implement SparseArray._groupby_op to route groupby reductions and transformations through the fast Cython path instead of falling back to slow Python aggregation. This converts to dense before calling the Cython op, trading a small memory cost for ~185x speedup on the benchmark from the issue. Also removes the SparseArray special-case for any/all in _cython_agg_general since the new _groupby_op handles them natively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On 32-bit platforms (Linux-32-bit, Pyodide/wasm32), SparseDtype(int, ...) resolves to int32 while DataFrame int columns are int64, causing dtype mismatches. Use np.int64 explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
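A minimal sketch of the dtype point in this commit, assuming the mismatch arises from the platform-dependent meaning of `int`. The variable names are illustrative and do not come from the PR's test file.

```python
import numpy as np
import pandas as pd

# SparseDtype(int, ...) follows NumPy's platform default integer,
# so its subtype can vary by platform.
platform_dtype = pd.SparseDtype(int, fill_value=0)
print(platform_dtype.subtype)

# Spelling out np.int64 pins the subtype regardless of platform.
explicit_dtype = pd.SparseDtype(np.int64, fill_value=0)
sparse_arr = pd.arrays.SparseArray([0, 0, 5, 0], dtype=explicit_dtype)

# A dense column created with an explicit int64 dtype for comparison.
dense_col = pd.Series([0, 0, 5, 0], dtype=np.int64)

subtypes_match = sparse_arr.dtype.subtype == dense_col.dtype
print(subtypes_match)
```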
Summary
- Implement SparseArray._groupby_op to route groupby reductions and transformations through the fast Cython path instead of falling back to slow Python aggregation
- Remove the SparseArray special-case for any/all in _cython_agg_general since the new _groupby_op handles them natively

On the benchmark from the issue (1000x1000 sparse int DataFrame, groupby mean), the change gives a ~185x speedup.
closes #36123
Test plan
- pandas/tests/extension/test_sparse.py
- pandas/tests/groupby/
- pandas/tests/groupby/test_sparse.py covers reductions (sum, mean, min, max, std, var, sem, prod, median), boolean ops (any, all), positional (first, last), transforms (cumsum, cummin, cummax, cumprod, rank), index-based (idxmin, idxmax), NaN fill_value handling, and Series groupby, all parametrized over fill_value in [0, NaN]

🤖 Generated with Claude Code
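The shape of such a fill_value-parametrized check might look like the following. This is a hypothetical sketch, not code from pandas/tests/groupby/test_sparse.py: it uses a plain loop instead of pytest parametrization, covers only four reductions, and compares each sparse result against the equivalent dense groupby.

```python
import numpy as np
import pandas as pd

keys = ["a", "a", "b", "b", "a", "b"]
base = np.array([0.0, 1.0, 0.0, 3.0, 2.0, 0.0])
checked = []  # records every (fill_value, op) pair that passed

for fill_value in [0.0, np.nan]:
    # Place the fill value wherever the base data is zero.
    data = np.where(base == 0.0, fill_value, base)

    dense = pd.DataFrame({"key": keys, "val": data})
    sparse = pd.DataFrame(
        {"key": keys, "val": pd.arrays.SparseArray(data, fill_value=fill_value)}
    )

    for op in ["sum", "mean", "min", "max"]:
        expected = getattr(dense.groupby("key")["val"], op)()
        result = getattr(sparse.groupby("key")["val"], op)()
        # Compare as plain floats so the check is independent of the result dtype.
        np.testing.assert_allclose(
            np.asarray(result, dtype=float), expected.to_numpy()
        )
        checked.append((fill_value, op))

print(len(checked))
```

Each group contains at least one non-missing value, so the reductions are well defined for both fill values.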