PERF: use Cython for SparseArray groupby operations #64758

Draft
jbrockmendel wants to merge 2 commits into pandas-dev:main from jbrockmendel:perf-36123
@jbrockmendel
Member

Summary

  • Implement SparseArray._groupby_op to route groupby reductions and transformations through the fast Cython path instead of falling back to slow Python aggregation
  • Remove the SparseArray special-case for any/all in _cython_agg_general since the new _groupby_op handles them natively
  • Add dedicated test coverage for sparse groupby operations

On the benchmark from the issue (1000x1000 sparse int DataFrame, groupby mean):

  • Before: ~3.9s (Python fallback)
  • After: ~21ms (~185x speedup)

closes #36123

Test plan

  • Existing sparse extension tests pass (pandas/tests/extension/test_sparse.py)
  • Full groupby test suite passes (pandas/tests/groupby/)
  • New pandas/tests/groupby/test_sparse.py covers reductions (sum, mean, min, max, std, var, sem, prod, median), boolean ops (any, all), positional (first, last), transforms (cumsum, cummin, cummax, cumprod, rank), index-based (idxmin, idxmax), NaN fill_value handling, and Series groupby — all parametrized over fill_value in [0, NaN]
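
For illustration, the "sparse result matches dense result, for each fill_value" pattern can be sketched without pytest as below. The helper name `check_sparse_matches_dense`, the sample data, and the choice of ops are hypothetical and are not taken from the new test file.

```python
import numpy as np
import pandas as pd


def check_sparse_matches_dense(op, fill_value):
    """Hypothetical helper: compare a sparse groupby op against its dense twin."""
    # fill_value appears in the data so that some entries are actually "sparse".
    data = np.array([1.0, fill_value, 2.0, fill_value, 3.0, 4.0])
    keys = np.array(["a", "a", "b", "b", "a", "b"])

    dense = pd.Series(data)
    sparse = pd.Series(pd.arrays.SparseArray(data, fill_value=fill_value))

    expected = getattr(dense.groupby(keys), op)()
    result = getattr(sparse.groupby(keys), op)()

    # Compare values only; result dtypes may differ between sparse and dense.
    np.testing.assert_allclose(
        np.asarray(result, dtype=float),
        np.asarray(expected, dtype=float),
        equal_nan=True,
    )
    return True


# Loop over the same fill_value choices the test plan mentions.
outcomes = [
    check_sparse_matches_dense(op, fill_value)
    for op in ["sum", "mean"]
    for fill_value in [0.0, np.nan]
]
print(all(outcomes))
```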

🤖 Generated with Claude Code

Implement SparseArray._groupby_op to route groupby reductions and
transformations through the fast Cython path instead of falling back
to slow Python aggregation. This converts to dense before calling the
Cython op, trading a small memory cost for ~185x speedup on the
benchmark from the issue.

Also removes the SparseArray special-case for any/all in
_cython_agg_general since the new _groupby_op handles them natively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
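
To make the "convert to dense, then reduce" idea concrete, here is a minimal standalone sketch. It is not the `_groupby_op` implementation from this PR: the function `grouped_sum_via_dense` is hypothetical, and `np.bincount` stands in for the Cython kernel. It assumes every group code is a non-negative integer.

```python
import numpy as np
import pandas as pd


def grouped_sum_via_dense(values, codes, ngroups):
    """Hypothetical illustration: densify a SparseArray, then sum per group."""
    # Materialize the full array; this is the memory cost being traded away.
    dense = np.asarray(values).astype(np.float64)

    # Per-group sum over the dense values (stand-in for the Cython op).
    # Assumes all codes are >= 0, i.e. no missing group keys.
    out = np.bincount(codes, weights=dense, minlength=ngroups)

    # Re-wrap as sparse, reusing the input's fill_value.
    return pd.arrays.SparseArray(out, fill_value=values.fill_value)


values = pd.arrays.SparseArray([0, 0, 3, 0, 5, 0], fill_value=0)
codes = np.array([0, 1, 0, 1, 0, 1])

result = grouped_sum_via_dense(values, codes, ngroups=2)
print(np.asarray(result))
```

The dense intermediate costs memory proportional to the full length of the array rather than to the number of stored values, which is the trade-off the commit message refers to.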
@jbrockmendel added the Performance (memory or execution speed performance), Groupby, and Sparse (sparse data type) labels on Mar 22, 2026
On 32-bit platforms (Linux-32-bit, Pyodide/wasm32), SparseDtype(int, ...)
resolves to int32 while DataFrame int columns are int64, causing dtype
mismatches. Use np.int64 explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
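
A small sketch of the dtype difference described in the commit message above. The variable names are illustrative; which width `int` resolves to depends on the platform and NumPy build, so only the explicit `np.int64` case is expected to be the same everywhere.

```python
import numpy as np
import pandas as pd

# Width follows NumPy's default integer for the platform.
platform_dependent = pd.SparseDtype(int, 0)

# Always 64-bit, regardless of platform.
explicit = pd.SparseDtype(np.int64, 0)

print(platform_dependent.subtype, explicit.subtype)
```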

Labels

Groupby, Performance (memory or execution speed performance), Sparse (sparse data type)

Development

Successfully merging this pull request may close these issues.

BUG: GroupBy.mean() is extremely slow with sparse arrays
