
Move enum explanations and health checks from cuda_core to cuda_bindings #1805

Draft
rwgk wants to merge 12 commits into NVIDIA:main from rwgk:move_enum_explanations

Conversation

@rwgk
Collaborator

@rwgk rwgk commented Mar 23, 2026

Closes #1712

The DRIVER_CU_RESULT_EXPLANATIONS and RUNTIME_CUDA_ERROR_EXPLANATIONS dicts are fundamentally tied to the cuda-bindings release (they must match the enums shipped in that release). Having them live exclusively in cuda_core meant the health-check tests failed whenever cuda_core was tested against a different version of cuda-bindings (nvbug 5932944).
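For readers unfamiliar with what such a health check verifies, here is a minimal sketch. It is not the test from this PR: `FakeResult`, `FAKE_EXPLANATIONS`, and `find_mismatches` are hypothetical stand-ins, and the sketch assumes the explanation dicts are keyed by integer error code. It shows why the check can only pass when the dict and the enum come from the same release.

```python
import enum


# Hypothetical stand-in for an error enum shipped by one bindings release.
class FakeResult(enum.IntEnum):
    SUCCESS = 0
    ERROR_INVALID_VALUE = 1
    ERROR_OUT_OF_MEMORY = 2


# Hypothetical stand-in for an explanations dict written against an older
# release that did not yet know about code 2.
FAKE_EXPLANATIONS = {
    0: "The API call returned with no errors.",
    1: "One or more parameters are outside the acceptable range.",
}


def find_mismatches(enum_cls, explanations):
    """Return (codes lacking an explanation, explanation keys lacking an enum member)."""
    enum_codes = {int(member) for member in enum_cls}
    explained = set(explanations)
    missing = sorted(enum_codes - explained)
    stale = sorted(explained - enum_codes)
    return missing, stale


missing, stale = find_mismatches(FakeResult, FAKE_EXPLANATIONS)
print("missing:", missing, "stale:", stale)
```

An exhaustive test would assert that both lists are empty, which fails as soon as the enum and the dict come from different releases.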

Changes

  • Move the dicts to cuda_bindings/cuda/bindings/_utils/ as the single authoritative source (renamed to _EXPLANATIONS with a _CTK_MAJOR_MINOR_PATCH version tag).
  • Delete the copies from cuda_core. cuda_utils.pyx now imports directly from cuda.bindings._utils, with a ModuleNotFoundError fallback to an empty dict.
  • Move the exhaustive health-check tests to cuda_bindings/tests/test_enum_explanations.py, where they belong alongside the dicts they verify.
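The import-with-fallback pattern described in the second bullet can be sketched as follows. This is an illustration under stated assumptions, not the code in cuda_utils.pyx: the helper `load_explanations` is hypothetical, and the demonstration uses the standard-library `errno` module and a deliberately nonexistent package as stand-ins so that it runs without cuda-bindings installed.

```python
import importlib


def load_explanations(module_name, dict_name):
    """Return the named dict from module_name, or {} if the module is absent."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # Install without the module: degrade to "no explanations available".
        return {}
    return getattr(module, dict_name)


# Stand-in for a module that exists: errno.errorcode maps int codes to names.
present = load_explanations("errno", "errorcode")

# Stand-in for an older install where the module is missing.
absent = load_explanations("no_such_package_for_this_sketch._utils", "ANY")

print(type(present).__name__, len(absent))
```

Catching only `ModuleNotFoundError` keeps the fallback narrow: a module that exists but fails for some other reason still surfaces its error rather than silently producing an empty dict.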

Impact on error messages for cuda-core users

When cuda-core raises a CUDAError, it tries to include a human-readable explanation of the error code (e.g. "This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values").

With this change:

  • cuda-bindings >= this PR: Error messages continue to include explanations, exactly as before.
  • cuda-bindings < this PR (older releases that don't ship _utils): Error messages fall back to the driver/runtime error name and description string obtained from cuGetErrorString / cudaGetErrorString. The explanations are a nice-to-have supplement, and the error name + description are still informative. Upgrading to a current cuda-bindings release restores the full explanations.
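The two cases above can be illustrated with a small sketch. The function `format_error` and the message layout are hypothetical and chosen only for illustration; the actual format used by cuda-core may differ. The explanation text is the example quoted earlier in this description.

```python
def format_error(name, description, code, explanations):
    """Build an error message, appending an explanation only when one exists."""
    message = f"{name}: {description}"
    explanation = explanations.get(code)
    if explanation:
        message += f"\n{explanation}"
    return message


explanations = {
    1: (
        "This indicates that one or more of the parameters passed to the API "
        "call is not within an acceptable range of values"
    ),
}

# Newer cuda-bindings: the dict is available, so the explanation is appended.
with_dict = format_error("CUDA_ERROR_INVALID_VALUE", "invalid argument", 1, explanations)

# Older cuda-bindings: the dict is empty, so only name and description remain.
without_dict = format_error("CUDA_ERROR_INVALID_VALUE", "invalid argument", 1, {})

print(with_dict)
print(without_dict)
```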

rwgk added 5 commits March 22, 2026 20:44
…VIDIA#1712)

The explanation dicts are fundamentally tied to the bindings version, so
they belong in cuda_bindings. This copies them (keeping the cuda_core
originals for backward compatibility) and adds the corresponding health
tests under cuda_bindings/tests/.

Made-with: Cursor
These tests now live in cuda_bindings/tests/test_enum_explanations.py,
where they belong alongside the explanation dicts they verify.

Made-with: Cursor
…llback (NVIDIA#1712)

Each explanation module now tries to import the authoritative dict from
cuda.bindings._utils (ModuleNotFoundError-guarded) and falls back to its
own copy for older cuda-bindings that don't ship it yet. Smoke tests
added for both dicts.

Made-with: Cursor
NVIDIA#1712)

Rename explanation dicts to _EXPLANATIONS / _FALLBACK_EXPLANATIONS,
add _CTK_MAJOR_MINOR_PATCH to each module, and enforce that the
cuda_core fallback copy is as new as (and in-sync with) cuda_bindings.
Parametrize the smoke and version-check tests to cover both driver and
runtime without duplication.

Made-with: Cursor
@rwgk rwgk self-assigned this Mar 23, 2026
@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 23, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk added the bug, P0, cuda.bindings, and cuda.core labels Mar 23, 2026
@rwgk
Collaborator Author

rwgk commented Mar 23, 2026

/ok to test

@rwgk
Collaborator Author

rwgk commented Mar 23, 2026

/ok to test

@rwgk rwgk marked this pull request as ready for review March 23, 2026 06:50
@rwgk rwgk requested a review from leofang March 23, 2026 18:43
@rwgk
Collaborator Author

rwgk commented Mar 24, 2026

For easy reference, the CI at commit fb12195 was successful:

(I'm about to push git merge master, which will hide it. Not rerunning the CI for now, waiting for a review.)

@cpcloud
Contributor

cpcloud commented Mar 25, 2026

What's stopping us from moving this codegen into the code generator and re-exporting it here to avoid breaking stuff?

We can't continue to live with steps like "copy x manually". Let's just do the work to move it to the generator. It doesn't really make sense that we've got tools parsing C headers in Python and producing code from that, and yet we're still copying dictionaries by hand.

- RUNTIME_CUDA_ERROR_EXPLANATIONS = {
+ _FALLBACK_EXPLANATIONS = {
      0: (
          "The API call returned with no errors. In the case of query calls, this"
Collaborator

Do we have to duplicate this error text list? Can we hoist it into a central location?

Collaborator Author

> Do we have to duplicate this error text list?

Originally my proposal was to avoid this copy (see the #1712 issue description, Backward compatibility section), but @leofang argued for vendoring (see issue comments).

> Can we hoist it into a central location?

Not if we want future cuda-core releases to produce the enhanced error messages even if used in combination with cuda-bindings releases made before this PR was merged.

On balance, I still feel the better compromise is to delete this copy, and to change cuda_core/cuda/core/_utils/cuda_utils.pyx to skip enhancing the error messages if the dict is not in cuda-bindings. It's really only a nice-to-have that will be easy to get back by using the latest cuda-bindings.

Collaborator

@rparolin rparolin left a comment

Please remove the duplicated error text array.

@rwgk
Collaborator Author

rwgk commented Mar 25, 2026

> What's stopping us from moving this codegen into the code generator and re-exporting it here to avoid breaking stuff?
>
> We can't continue to live with steps like "copy x manually". Let's just do the work to move it to the generator. It doesn't really make sense that we've got tools parsing C headers in Python and producing code from that, and yet we're still copying dictionaries by hand.

I totally agree, but this PR is about solving nvbug 5932944, which is related to but different from the code-gen question. I opened cuda-python-private issue 289 to track your suggestion.

rwgk added 2 commits March 25, 2026 16:25
…gs (NVIDIA#1712)

Remove the vendored explanation dicts from cuda_core. cuda_utils.pyx now
imports directly from cuda.bindings._utils with a ModuleNotFoundError
fallback to an empty dict, so error messages gracefully degrade when
paired with older cuda-bindings that don't ship the dicts.

Made-with: Cursor
@rwgk rwgk marked this pull request as draft March 26, 2026 00:10
@rwgk
Collaborator Author

rwgk commented Mar 26, 2026

> Please remove the duplicated error text array.

Done, Cursor said this:

  • Committed as 6fc77b7. Net -966 lines -- much cleaner.

I converted this PR back to Draft mode while retesting.

@rwgk
Collaborator Author

rwgk commented Mar 26, 2026

/ok to test

…#1712)

Restore DRIVER_CU_RESULT_EXPLANATIONS / RUNTIME_CUDA_ERROR_EXPLANATIONS
as the dict names in cuda_bindings and remove the _CTK_MAJOR_MINOR_PATCH
/ _EXPLANATIONS indirection that is no longer needed without the
cuda_core fallback copies.

Made-with: Cursor
@rwgk
Collaborator Author

rwgk commented Mar 26, 2026

/ok to test


Development

Successfully merging this pull request may close these issues.

[Bug]: Move enum explanations and health checks from cuda_core to cuda_bindings

3 participants