Skip to content

Conversation

@dzzz2001
Copy link
Collaborator

@dzzz2001 dzzz2001 commented Jan 30, 2026

Summary

This PR modernizes the device module architecture, introducing a centralized GPU initialization mechanism and unifying error handling across the codebase. The refactoring improves code maintainability, reduces code duplication, and establishes consistent patterns for GPU resource management.

Motivation

The previous device module had several issues:

  • Scattered cudaGetDeviceCount calls across multiple files
  • Inconsistent error checking macros (cudaErrcheck, CAL_CHECK, cufftErrcheck, etc.)
  • Tightly coupled code with mixed responsibilities in single files
  • Legacy compatibility code that was no longer needed

Key Changes

1. DeviceContext Singleton

Introduced a DeviceContext singleton class that provides:

  • Centralized GPU initialization and device selection
  • Unified access to GPU count and current device information
  • Single point of control for multi-GPU environments

2. Unified Error Handling (device_check.h)

Consolidated all CUDA/HIP/cuFFT error checking into a single header with consistent naming:

  • CHECK_CUDA() - CUDA runtime API calls
  • CHECK_CUBLAS() - cuBLAS library calls
  • CHECK_CUSOLVER() - cuSOLVER library calls
  • CHECK_CUFFT() - cuFFT library calls
  • Equivalent CHECK_HIP* macros for ROCm support

3. Modular Architecture

Reorganized device module with single responsibility per file:

  • device.h/cpp - Core device abstraction and DeviceContext
  • device_check.h - Error checking macros
  • device_helpers.h/cpp - Utility functions
  • kernel_compat.h - Kernel compatibility layer (separated from device.h)
  • gpu_runtime.h - GPU runtime abstractions

4. Code Cleanup

Removed deprecated and redundant code:

  • Removed helper_cuda.h and helper_string.h (NVIDIA samples legacy)
  • Removed CAL_CHECK compatibility alias
  • Removed cufftGetErrorStringCompat from cuda_compat
  • Removed redundant cudaGetDeviceCount calls in snap_psibeta_gpu

5. Logic Reorganization

  • Moved get_device_kpar logic to read_input_item_system for better cohesion
  • Renamed initialize_gpu_resources to init_snap_psibeta_gpu for clarity
  • Simplified parakSolve_cusolver using DeviceContext

Files Changed

Major changes in source/source_base/module_device/:

  • device.h/cpp - Refactored with DeviceContext singleton
  • device_check.h - New unified error checking header
  • device_helpers.h/cpp - New helper functions
  • kernel_compat.h - New separated kernel compatibility
  • gpu_runtime.h - New GPU runtime abstractions
  • cuda_compat.cpp/h - Removed deprecated functions
  • output_device.cpp - Simplified using new architecture

🤖 Generated with Claude Code

dzzz2001 and others added 13 commits January 28, 2026 19:47
…ation

- Add DeviceContext singleton class (device_context.h/cpp) to manage GPU
  device binding with thread-safe initialization using std::mutex
- Move GPU initialization from get_device_kpar() side-effect to explicit
  DeviceContext::init() call in read_input.cpp after INPUT parsing
- Use MPI_COMM_TYPE_SHARED for modern node-local rank detection
- Update callers to use DeviceContext::instance().get_device_id():
  - hsolver_lcao.cpp: parakSolve_cusolver()
  - diag_cusolvermp.cu: constructor
  - gint_gpu_vars.cpp: constructor
  - td_nonlocal_lcao.cpp: remove redundant set_device_by_rank() call

This is Phase 1 of GPU resource initialization refactoring, establishing
a single entry point for GPU device binding instead of scattered calls.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Phase 2 of GPU device module refactoring:
- Remove deprecated functions: stringCmp, get_node_rank(), set_device_by_rank()
- Merge device_context.h/cpp into device.h/cpp
- Update include statements in dependent files
- Remove device_context.cpp from CMakeLists.txt

This reduces ~70 lines of dead code and consolidates the device
module into fewer files with a unified interface.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Create kernel_compat.h: move atomicAdd polyfill for pre-Pascal GPUs
- Create device_helpers.h/cpp: extract get_device_type and get_current_precision templates
- Create gpu_runtime.h: unified CUDA/ROCm API macros for portable GPU code
- Refactor output_device.cpp: unify duplicated CUDA/ROCm implementations (~150 lines reduced)
- Update device.h: clean interface with only DeviceContext and information namespace
- Update CMakeLists.txt: add device_helpers.cpp to build

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Utilize DeviceContext for node-local rank and device count
- Remove redundant MPI_Comm_split_type and manual CUDA calls
- Cleanup unused variables and communicator management
- Remove manual device count check in initialize_gpu_resources
- Rely on DeviceContext to handle device availability and binding
- Simplify GPU initialization logic in module_rt
- Rename function to reflect its module-specific scope
- Update comments to clarify that general GPU setup is handled by DeviceContext
- Remove unused finalize_gpu_resources declaration
- Update caller in td_nonlocal_lcao.cpp
Remove unconditional include of kernel_compat.h from device.h to
properly separate Host/Device code. The kernel_compat.h header
contains CUDA-specific code (__device__ keyword) and should only
be included by .cu files that actually use atomicAdd.

Add explicit includes to the 4 CUDA files that need it:
- stress_op.cu
- force_op.cu
- exx_cal_energy_op.cu
- phi_operator_kernel.cu

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Move GPU kpar validation from device module to kpar's reset_value
in read_input_item_system.cpp. This keeps parameter validation logic
with its related input item rather than in the device module.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Update device_check.h with proper error handling (exit on error, stderr output)
- Add cuSOLVER error string function with full IRS status codes
- Add CHECK_LAST_CUDA_ERROR, CHECK_CUDA_SYNC, CHECK_CAL macros
- Add ROCm CHECK_CUSOLVER support with hipsolver
- Remove error checking code from module_container/base/macros/cuda.h
- Remove error checking macros from helper_cuda.h
- Delete helper_cusolver.h (functionality merged into device_check.h)
- Update diag_cusolver.cu and diag_cusolvermp.cu to use new macros

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Replace cudaErrcheck -> CHECK_CUDA
- Replace cublasErrcheck -> CHECK_CUBLAS
- Replace cusolverErrcheck -> CHECK_CUSOLVER
- Replace checkCudaErrors -> CHECK_CUDA
- Replace CUSOLVER_CHECK -> CHECK_CUSOLVER
- Replace CAL_CHECK -> CHECK_CAL
- Replace cudaCheckOnDebug -> CHECK_CUDA_SYNC
- Replace getLastCudaError -> CHECK_LAST_CUDA_ERROR
- Update gpuErrcheck alias in gpu_runtime.h
- Remove deprecated compatibility aliases from device_check.h

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add _cufftGetErrorString static function to device_check.h
- Remove dependency on cuda_compat.h for cufft error strings
- Update CHECK_CUFFT macro to use the local function

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The function is now unified into device_check.h as _cufftGetErrorString.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@dzzz2001 dzzz2001 closed this Jan 30, 2026
  - Replaced helper_cuda.h with device_check.h in hegvd_op.cu and diag_cusolvermp.cu

  - Removed helper_cuda.h and helper_string.h as they are no longer used

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@dzzz2001 dzzz2001 reopened this Jan 30, 2026
@dzzz2001 dzzz2001 changed the title device refactor Refactor: Device module modernization and GPU initialization consolidation Jan 30, 2026
dzzz2001 and others added 4 commits January 30, 2026 21:27
- Add __CUDA compile definition to MODULE_HSOLVER_LCAO_cusolver test target
  (fixes undefined CHECK_CUDA/CHECK_CUSOLVER macros after remove_definitions)
- Add device_helpers.cpp to pyabacus _base_pack module
  (fixes undefined symbol get_device_type in PyTest)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add device_helpers.cpp to hsolver and ModuleNAO pyabacus modules
  (fixes undefined symbol get_device_type in libdiagopack.so and libnaopack.so)
- Add diag(Hamilt*, Psi&, Real*) overload to DiagoCusolver
  (fixes test compilation error due to interface mismatch)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@dzzz2001 dzzz2001 marked this pull request as draft January 30, 2026 13:52
- Always provide both init() and init(MPI_Comm) overloads
- init() version for non-MPI builds (local_rank = 0)
- init(MPI_Comm) version for MPI builds (uses MPI_COMM_TYPE_SHARED)
- Fixes undefined reference error in MODULE_IO_read_item_serial test

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants