[PERF]: Epic for binding overhead improvements #1645

@mdboom

This issue tracks performance improvements and investigations into Python-to-C binding overhead, mostly driven by the benchmark of cuTensorMapEncodeTiled devised in #659. That benchmark is useful because the function takes an unusually high number of arguments and therefore has unusually high Python-to-C overhead.
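As a rough illustration of why argument count matters, the sketch below times a pure-Python stand-in that converts each argument, once with one argument and once with twelve. It is not the benchmark from #659 and does not touch cuda-bindings; `few_args`, `many_args`, and `per_call_ns` are hypothetical names invented for this example.

```python
import timeit


def few_args(a):
    """Hypothetical stand-in for a binding with a single parameter."""
    int(a)  # mimic one per-argument type conversion
    return 0


def many_args(a, b, c, d, e, f, g, h, i, j, k, l):
    """Hypothetical stand-in for a binding with many parameters."""
    for value in (a, b, c, d, e, f, g, h, i, j, k, l):
        int(value)  # mimic one type conversion per argument
    return 0


def per_call_ns(func, args, number=100_000, repeat=5):
    """Return the best observed per-call time in nanoseconds."""
    timer = timeit.Timer(lambda: func(*args))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


if __name__ == "__main__":
    print(f"1 argument:   {per_call_ns(few_args, (1,)):.0f} ns/call")
    print(f"12 arguments: {per_call_ns(many_args, tuple(range(12))):.0f} ns/call")
```

Taking the minimum over several repeats reduces noise from other processes. The absolute numbers depend on the machine and Python version, so only the relative difference is meaningful.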

Comparison to a more limited Cython binding

As an interesting experimental datapoint, a colleague provided a vibe-coded Cython binding for cuTensorMapEncodeTiled that runs about 4x faster than the official cuda-bindings one. It is useful for seeing where some overheads may be reduced, but its raw performance should be interpreted with care: this wrapper accepts far fewer input types than the CUDA bindings and doesn't include developer niceties such as enums.

Merged or in-progress fixes

Timings below are per iteration of the benchmark in #659. They include both the binding overhead and some fixed amount of time spent in the actual CUDA call.
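Because each timing mixes binding overhead with a fixed CUDA cost, a speedup in the binding alone shows up as a smaller speedup in the total. The sketch below works through that arithmetic. The numbers in it are invented for illustration and are not measurements from #659, and both function names are hypothetical.

```python
def binding_overhead_ns(total_ns, fixed_call_ns):
    """Estimate binding overhead by subtracting the fixed CUDA call time."""
    if fixed_call_ns > total_ns:
        raise ValueError("fixed cost cannot exceed the total time")
    return total_ns - fixed_call_ns


def overall_speedup(total_ns, fixed_call_ns, overhead_factor):
    """Total speedup when only the binding overhead shrinks by overhead_factor."""
    overhead = binding_overhead_ns(total_ns, fixed_call_ns)
    new_total = fixed_call_ns + overhead / overhead_factor
    return total_ns / new_total


if __name__ == "__main__":
    # Illustrative values only: 1000 ns per iteration, 400 ns of it fixed.
    total, fixed = 1000.0, 400.0
    print(f"overhead: {binding_overhead_ns(total, fixed):.0f} ns")
    print(f"overall speedup for 4x less overhead: {overall_speedup(total, fixed, 4.0):.2f}x")
```

The larger the fixed share of each iteration, the less a given reduction in binding overhead changes the per-iteration timing.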

Under investigation

Items in this category are theoretical ways to reduce the operations required for type conversion, but they have not necessarily been confirmed to have a measurable effect.

Deferred (effective, but high effort)

Rejected (ineffective)
