Move enum explanations and health checks from cuda_core to cuda_bindings#1805
Move enum explanations and health checks from cuda_core to cuda_bindings#1805rwgk wants to merge 20 commits intoNVIDIA:mainfrom
Conversation
…VIDIA#1712) The explanation dicts are fundamentally tied to the bindings version, so they belong in cuda_bindings. This copies them (keeping the cuda_core originals for backward compatibility) and adds the corresponding health tests under cuda_bindings/tests/. Made-with: Cursor
These tests now live in cuda_bindings/tests/test_enum_explanations.py, where they belong alongside the explanation dicts they verify. Made-with: Cursor
…llback (NVIDIA#1712) Each explanation module now tries to import the authoritative dict from cuda.bindings._utils (ModuleNotFoundError-guarded) and falls back to its own copy for older cuda-bindings that don't ship it yet. Smoke tests added for both dicts. Made-with: Cursor
NVIDIA#1712) Rename explanation dicts to _EXPLANATIONS / _FALLBACK_EXPLANATIONS, add _CTK_MAJOR_MINOR_PATCH to each module, and enforce that the cuda_core fallback copy is as new as (and in-sync with) cuda_bindings. Parametrize the smoke and version-check tests to cover both driver and runtime without duplication. Made-with: Cursor
…tring (NVIDIA#1712) Made-with: Cursor
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
|
|
/ok to test |
|
For easy reference, the CI at commit fb12195 was successful: (I'm about to push git merge master, which will hide it. Not rerunning the CI for now, waiting for a review.) |
|
What's stopping us from moving this codegen into the code generator and re-exporting it here to avoid breaking stuff? We can't continue to live with steps like "copy x manually". Let's just do the work to move it to the generator. It doesn't really make sense that we've got tools parsing C headers in Python and producing code from that, and yet we're still copying dictionaries by hand. |
| RUNTIME_CUDA_ERROR_EXPLANATIONS = { | ||
| _FALLBACK_EXPLANATIONS = { | ||
| 0: ( | ||
| "The API call returned with no errors. In the case of query calls, this" |
There was a problem hiding this comment.
Do we have to duplicate this error text list? Can we hoist it into a central location?
There was a problem hiding this comment.
Do we have to duplicate this error text list?
Originally my proposal was to avoid this copy (see the #1712 issue description, Backward compatibility section), but @leofang argued for vendoring (see issue comments).
Can we hoist it into a central location?
Not if we want future cuda-core releases to produce the enhanced error messages even if used in combination with cuda-binding releases made before this PR was merged.
On balance, I still feel the better compromise is to delete this copy, and to change cuda_core/cuda/core/_utils/cuda_utils.pyx to skip enhancing the error messages if the dict is not in cuda-bindings. It's really only a nice-to-have that will be easy to get back by using the latest cuda-bindings.
rparolin
left a comment
There was a problem hiding this comment.
Please remove the duplicated error text array.
I totally agree, but this PR is about solving nvbug 5932944, which is related to but different from the code-gen question. I opened cuda-python-private issue 289 to track your suggestion. |
…gs (NVIDIA#1712) Remove the vendored explanation dicts from cuda_core. cuda_utils.pyx now imports directly from cuda.bindings._utils with a ModuleNotFoundError fallback to an empty dict, so error messages gracefully degrade when paired with older cuda-bindings that don't ship the dicts. Made-with: Cursor
Done, Cursor said this:
I converted this PR back to Draft mode while retesting. |
|
/ok to test |
…#1712) Restore DRIVER_CU_RESULT_EXPLANATIONS / RUNTIME_CUDA_ERROR_EXPLANATIONS as the dict names in cuda_bindings and remove the _CTK_MAJOR_MINOR_PATCH / _EXPLANATIONS indirection that is no longer needed without the cuda_core fallback copies. Made-with: Cursor
|
/ok to test |
|
I am not sure I follow. So what is the behavior after merging this PR, when using cuda-core + non-latest cuda-bindings 12.x or 13.x? The expectation for the enum dict is the same as what a megaheader would deliver, so once @mdboom concludes the cython-gen merging we might have a path for this. Let's make sure there is no regression introduced by this PR, and then we can merge. |
See PR description, copy-pasting for convenience:
I.e. technically, maybe it could be viewed as a regression, but for all practical purposes
Before this PR: We've been using the explanations from the latest CTK release for all versions cuda_core. With this PR: The explanations will match the CTK release we use for generating the cuda-bindings code. — I believe strictly speaking that's more self-consistent, but in practice the differences are probably very minor; from what I've seen in passing, the explanations for released enums do not change in major ways. |
Add test_explanations_dict_matches_enum_member_docstrings, which checks that each hand-maintained DRIVER_CU_RESULT_EXPLANATIONS and RUNTIME_CUDA_ERROR_EXPLANATIONS entry matches the corresponding FastEnum member's __doc__ (cuda-bindings 13.2+ is said to attach the same narrative text via codegen). The comparison uses strict string equality. In current releases the dict text and __doc__ are not byte-identical: generated docstrings include Sphinx cross-references (:py:obj:...) and manual line breaks where the dicts use raw CUDA comment style (::symbol()) and single-line concatenation; some deprecated codes differ in length. So the test is marked xfail(strict=False) so CI stays green until dicts and generated docstrings share one source of truth; when they align, XPASS indicates the xfail can be removed. Skip the compare when cuda-bindings < 13.2 (major.minor). Skip members with no __doc__ (e.g. cudaErrorApiFailureBase). Helpers: _explanation_text_from_dict_value to flatten dict tuple fragments. To inspect all mismatches locally: pytest --runxfail on this test. Made-with: Cursor
…tion tests Introduce clean_enum_member_docstring() for turning FastEnum CUresult / cudaError_t __doc__ strings into plain text: collapse whitespace (newlines to spaces), strip ends, and best-effort strip common Sphinx inline roles (:py:obj:, :py:func:, :obj:, etc.) plus simple ** / * markup. Placed in test_enum_explanations.py for now pending reuse from cuda_core. Add parametrized examples and a None-input test. Made-with: Cursor
…ealth test Add _explanation_dict_text_for_cleaned_doc_compare to normalize dict strings for parity with clean_enum_member_docstring: strip Doxygen-style :: before name( and collapse whitespace. Rename test_explanations_dict_matches_enum_member_docstrings to test_explanations_dict_matches_cleaned_enum_docstrings; compare normalized dict text to clean_enum_member_docstring(__doc__) instead of raw __doc__. Update xfail reason and failure-report labels. Give explicit pytest.param ids on test_clean_enum_member_docstring_examples for readable node ids (ruff/pre-commit friendly). Made-with: Cursor
…precated skips)
Add _strip_doxygen_double_colon_prefixes to remove Doxygen :: before CUDA
identifiers in explanation dict text (not C++ Foo::Bar scope), and use it in
_explanation_dict_text_for_cleaned_doc_compare. Add small unit tests.
Refactor test_explanations_dict_matches_cleaned_enum_docstrings to parametrize
per enum member so pytest can report per-case skips and failures.
Skip comparison when __doc__ is missing, or when strip().endswith('[Deprecated]')
(stub-only or suffix deprecation notes). Drop unused textwrap import.
Made-with: Cursor
…cstring Add _fix_hyphenation_wordwrap_spacing to remove spurious spaces around hyphens from reflowed __doc__ text ([a-z]- [a-z] and [a-z] -[a-z]), applied until stable. Use it in clean_enum_member_docstring after whitespace collapse and in _explanation_dict_text_for_cleaned_doc_compare for symmetric comparison. Add examples to test_clean_enum_member_docstring_examples. Made-with: Cursor
…parity Replace malformed \n:py:obj: (non-raw string so \n matches a real newline) with quoted "Interactions " before generic Sphinx stripping. Strip CUDART_DRIVER from normalized dict text for compare-only parity with cleaned __doc__ (manpage token vs title-only). Add a clean_enum_member_docstring example for the broken cross-ref. Made-with: Cursor
… parity test Treat cleaned __doc__ as authoritative for cudaErrorLaunchTimeout; the explanation dict still uses a Doxygen-style cudaDeviceAttr:: fragment we do not normalize. Document in the test docstring. Made-with: Cursor
|
@mdboom for visibility — Thanks for pointing out that the enums have a Validation on this branch, to inform work on a new PR to replace this oneEnum explanations dict vs.
|
Closes #1712
The
DRIVER_CU_RESULT_EXPLANATIONSandRUNTIME_CUDA_ERROR_EXPLANATIONSdicts are fundamentally tied to the cuda-bindings release (they must match the enums shipped in that release). Having them live exclusively in cuda_core meant the health-check tests failed whenever cuda_core was tested against a different version of cuda-bindings (nvbug 5932944).Changes
cuda_bindings/cuda/bindings/_utils/as the single authoritative source (renamed to_EXPLANATIONSwith a_CTK_MAJOR_MINOR_PATCHversion tag).cuda_utils.pyxnow imports directly fromcuda.bindings._utils, with aModuleNotFoundErrorfallback to an empty dict.cuda_bindings/tests/test_enum_explanations.py, where they belong alongside the dicts they verify.Impact on error messages for cuda-core users
When cuda-core raises a
CUDAError, it tries to include a human-readable explanation of the error code (e.g. "This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values").With this change:
_utils): Error messages fall back to the driver/runtime error name and description string obtained fromcuGetErrorString/cudaGetErrorString. The explanations are a nice-to-have supplement, and the error name + description are still informative. Upgrading to a current cuda-bindings release restores the full explanations.