
Conversation

@Abdelrahman912 (Collaborator)

closes #230

@Abdelrahman912 (Collaborator, Author)

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            152ms /  99.9%           6.26MiB / 100.0%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    152ms  100.0%   152ms   6.26MiB  100.0%  6.26MiB
  kernel execution          1    100ms   65.5%   100ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   52.1ms   34.2%  52.1ms   5.22MiB   83.4%  5.22MiB
  isapprox(A, A')           1    464μs    0.3%   464μs   1.00MiB   16.0%  1.00MiB
  kernel setup              1   25.1μs    0.0%  25.1μs      128B    0.0%     128B
  allocate D_Dl1            1   24.6μs    0.0%  24.6μs   25.9KiB    0.4%  25.9KiB
  adapt(backend, _A)        1   40.0ns    0.0%  40.0ns     0.00B    0.0%    0.00B
  synchronize               1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 136ms /  83.0%           4.22MiB /  22.1%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         120    113ms  100.0%   942μs    955KiB  100.0%  7.96KiB
  _forward_sweep!                   120    113ms   99.7%   939μs    954KiB   99.9%  7.95KiB
    _forward_sweep! (internal)      120    113ms   99.6%   938μs    953KiB   99.8%  7.94KiB
      kernel call                   120    111ms   98.6%   929μs    936KiB   98.0%  7.80KiB
      kernel construction           120    722μs    0.6%  6.01μs   15.0KiB    1.6%     128B
      synchronize                   120   10.6μs    0.0%  87.9ns     0.00B    0.0%    0.00B
      convert size_A                120   4.75μs    0.0%  39.6ns     0.00B    0.0%    0.00B
      @unpack partitioning          120   4.34μs    0.0%  36.2ns     0.00B    0.0%    0.00B
      ndrange calc                  120   4.18μs    0.0%  34.9ns     0.00B    0.0%    0.00B
      @unpack P                     120   3.37μs    0.0%  28.0ns     0.00B    0.0%    0.00B
  y .= x                            120    208μs    0.2%  1.73μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────
============================================================
Unprec. no. iters: 981, time: 1.05442742
Prec. no. iters: 119, time: 0.136057039

============================================================
Preconditioner Construction Timings:
============================================================
─────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:            355ms /  99.8%           11.3MiB /  91.5%    

Section                ncalls     time    %tot     avg     alloc    %tot      avg
─────────────────────────────────────────────────────────────────────────────────
_precompute_blocks          1    355ms  100.0%   355ms   10.3MiB  100.0%  10.3MiB
  kernel execution          1    349ms   98.3%   349ms   8.20KiB    0.1%  8.20KiB
  allocate SLbuffer         1   4.19ms    1.2%  4.19ms   7.42MiB   72.0%  7.42MiB
  isapprox(A, A')           1   1.71ms    0.5%  1.71ms   2.85MiB   27.6%  2.85MiB
  kernel setup              1   53.5μs    0.0%  53.5μs      128B    0.0%     128B
  allocate D_Dl1            1   15.2μs    0.0%  15.2μs   30.9KiB    0.3%  30.9KiB
  synchronize               1   37.0ns    0.0%  37.0ns     0.00B    0.0%    0.00B
  adapt(backend, _A)        1   36.0ns    0.0%  36.0ns     0.00B    0.0%    0.00B
─────────────────────────────────────────────────────────────────────────────────
============================================================
ldiv! Timings during solve:
============================================================
───────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                    Allocations      
                                         ───────────────────────   ────────────────────────
            Tot / % measured:                 797ms /  62.8%           18.0MiB /  17.9%    

Section                          ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────────────
ldiv! (CPU)                         415    501ms  100.0%  1.21ms   3.21MiB  100.0%  7.93KiB
  _forward_sweep!                   415    499ms   99.8%  1.20ms   3.21MiB  100.0%  7.93KiB
    _forward_sweep! (internal)      415    499ms   99.7%  1.20ms   3.21MiB  100.0%  7.93KiB
      kernel call                   415    495ms   98.9%  1.19ms   3.16MiB   98.3%  7.80KiB
      kernel construction           415   2.63ms    0.5%  6.33μs   51.9KiB    1.6%     128B
      synchronize                   415   14.9μs    0.0%  36.0ns     0.00B    0.0%    0.00B
      convert size_A                415   12.3μs    0.0%  29.6ns     0.00B    0.0%    0.00B
      @unpack P                     415   12.1μs    0.0%  29.1ns     0.00B    0.0%    0.00B
      ndrange calc                  415   11.2μs    0.0%  27.0ns     0.00B    0.0%    0.00B
      @unpack partitioning          415   11.0μs    0.0%  26.6ns     0.00B    0.0%    0.00B
  y .= x                            415    796μs    0.2%  1.92μs     0.00B    0.0%    0.00B
───────────────────────────────────────────────────────────────────────────────────────────
============================================================
Unprec. no. iters: 3067, time: 17.799613799
Prec. no. iters: 414, time: 0.796364037

@termi-official these are the outputs from the two examples in test_preconditioners. There is the initial construction, which is done once and for all, and there is ldiv!, which is called many times. The huge chunk of allocations basically comes from the kernel constructions.
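
If the per-call kernel construction ends up dominating, one option might be to construct the kernel once at preconditioner build time and reuse it in every ldiv!. A minimal sketch of that KernelAbstractions pattern with a toy kernel (_scale_kernel!, SweepKernelCache, and apply! are hypothetical names, not the PR's):

using KernelAbstractions

@kernel function _scale_kernel!(y, @Const(d))
    i = @index(Global)
    @inbounds y[i] *= d[i]
end

# Instantiate the kernel for the backend once, at construction time ...
struct SweepKernelCache{K}
    kernel!::K
end
SweepKernelCache(backend) = SweepKernelCache(_scale_kernel!(backend))

# ... and reuse it on every application, paying only the launch cost.
function apply!(c::SweepKernelCache, y, d)
    c.kernel!(y, d; ndrange = length(y))
    KernelAbstractions.synchronize(get_backend(y))
    return y
end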

Also, one thing that might be optimized is isapprox(A, A'), but I believe we need prior information from the user for that.
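
A rough sketch of what that could look like, assuming a user-supplied flag such as the isSymA keyword that appears later in this thread (_issym is a hypothetical helper):

using LinearAlgebra

# Only fall back to the expensive comparison when neither the
# wrapper type nor the user asserts symmetry.
function _issym(A; isSymA::Bool = false)
    isSymA && return true          # trust the user's assertion
    A isa Symmetric && return true # symmetric by construction
    return isapprox(A, A')         # expensive fallback
end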

codecov bot commented Nov 29, 2025

Codecov Report

❌ Patch coverage is 56.73289% with 196 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.39%. Comparing base (a07d127) to head (0ec39a1).

Files with missing lines                                Patch %   Lines
...c/solver/linear/preconditioners/l1_gauss_seidel.jl   56.73%    196 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #238      +/-   ##
==========================================
- Coverage   71.71%   70.39%   -1.32%     
==========================================
  Files          78       78              
  Lines        6515     6864     +349     
==========================================
+ Hits         4672     4832     +160     
- Misses       1843     2032     +189     


@termi-official (Collaborator) left a comment

Thanks for investigating, Abdelrahman! I left some comments on things I caught on the fly.

Also take a look at Codecov; it looks like not all of your code is covered in the CI.

@termi-official (Collaborator) left a comment

Thanks for the changes, Abdelrahman. There are some minor things left, though.

@AbdAlazezAhmed can you confirm that this works now on the ECG?

Collaborator

I think we are missing tests for the symmetric case.

Collaborator Author

@testset "Symmetric A" begin
    md = mdopen("HB/bcsstk15")
    A = md.A # type here is Symmetric{Float64, SparseMatrixCSC{Float64, Int64}}
    b = ones(size(A, 1))
    test_l1gs_prec(A, b)
end

The matrix here is already of type Symmetric.
Also, for the algorithmic logic of the test, I added this:

function test_sym(testname, A, x, y_exp, D_Dl1_exp, SLbuffer_exp, partsize)
    @testset "$testname Symmetric" begin
        total_ncores = 8 # Assuming 8 cores for testing
        for ncores in 1:total_ncores # test with multiple core counts to check that the answer is independent of the number of cores
            builder = L1GSPrecBuilder(PolyesterDevice(ncores))
            P = A isa Symmetric ? builder(A, partsize) : builder(A, partsize; isSymA=true)
            @test P.D_Dl1 ≈ D_Dl1_exp
            @test P.SLbuffer ≈ SLbuffer_exp
            y = P \ x
            @test y ≈ y_exp
        end
    end
end

Collaborator

I see. Thanks!

However, we should use CG for the symmetric problems to ensure that the preconditioner actually conserves the symmetry.
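
For illustration, such a test could look roughly like this, reusing the builder API from this PR and assuming any CG implementation that accepts a left preconditioner (Krylov.jl here); CG requires a symmetric preconditioner, so a symmetry-breaking sweep shows up as stagnation that GMRES would hide:

using Test, SparseArrays, LinearAlgebra
using Krylov # assumption: Krylov.jl's cg with the M/ldiv keywords

B = sprand(1000, 1000, 0.01)
A = sparse(Symmetric(B + B' + 100I)) # SPD test matrix
b = ones(size(A, 1))
P = L1GSPrecBuilder(PolyesterDevice(8))(A, 125; isSymA = true)
x, stats = Krylov.cg(A, b; M = P, ldiv = true) # preconditioner applied via ldiv!
@test stats.solved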

@termi-official (Collaborator) left a comment

There are still some issues.

  1. ThreadedSparseMatrixCSC support is missing
  2. The preconditioner tanks the symmetry
  3. There is too much of memory usage

Let's try to make the following work:

using Thunderbolt
using SparseArrays,OrderedCollections,TimerOutputs,LinearAlgebra

# TimerOutputs.enable_debug_timings(Thunderbolt)
# TimerOutputs.enable_debug_timings(Thunderbolt.Preconditioners)
geo = Hexahedron
nel = 2

size = 2.
signal_strength = 0.04
nel_heart = (6, 6, 6) .* 1 .* nel
nel_torso = (8, 8, 8) .* Int(size) .* nel
ground_vertex = Vec(0.0, 0.0, 0.0)
electrodes = [
    ground_vertex,
    Vec(-size,  0.,  0.), 
    Vec( size,  0.,  0.),
    Vec( 0., -size,  0.),
    Vec( 0.,  size,  0.),
    Vec( 0.,  0., -size),
    Vec( 0.,  0.,  size),
]
electrode_pairs = [[i,1] for i in 2:length(electrodes)]

heart_grid = generate_mesh(geo, nel_heart)
Ferrite.transform_coordinates!(heart_grid, x->Vec{3}(sign.(x) .* x.^2))

κ  = ConstantCoefficient(SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)))
κᵢ = AnalyticalCoefficient(
    (x,t) -> norm(x,Inf) ≤ 1.0 ? SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)) : SymmetricTensor{2,3,Float64}((0.0, 0, 0, 0.0, 0, 0.0)),
    CartesianCoordinateSystem{3}()
)

heart_model = TransientDiffusionModel(
    κᵢ,
    NoStimulationProtocol(), # Poisoning to detect if we accidentally touch these
    :φₘ
)
heart_fun = semidiscretize(
    heart_model,
    FiniteElementDiscretization(
        Dict(:φₘ => LagrangeCollection{1}()),
        Dirichlet[]
    ),
    heart_grid
)

op = Thunderbolt.setup_assembled_operator(
    Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
    Thunderbolt.BilinearDiffusionIntegrator(
        κ,
        QuadratureRuleCollection(2),
        :φₘ,
    ),
    Thunderbolt.SparseMatrixCSC,
    heart_fun.dh,
)
Thunderbolt.update_operator!(op, 0.0) # trigger assembly


torso_grid_ = generate_grid(geo, nel_torso, Vec((-size,-size,-size)), Vec((size,size,size)))
addcellset!(torso_grid_, "heart", x->norm(x,Inf) ≤ 1.0)
# addcellset!(torso_grid_, "surrounding-tissue", x->norm(x,Inf) ≥ 1.0)
torso_grid_.cellsets["surrounding-tissue"] = OrderedSet([i for i in 1:getncells(torso_grid_) if i ∉ torso_grid_.cellsets["heart"]])
torso_grid = Thunderbolt.to_mesh(torso_grid_)
u = zeros(Thunderbolt.solution_size(heart_fun))
geselowitz_electrodes = [[electrodes[1], electrodes[i]] for i in 2:length(electrodes)]

TimerOutputs.reset_timer!()
@info prod(nel_torso)
geselowitz_ecg = Thunderbolt.Geselowitz1989ECGLeadCache(
    heart_fun,
    torso_grid,
    κᵢ, κ,
    geselowitz_electrodes; 
    ground = OrderedSet([Thunderbolt.get_closest_vertex(ground_vertex, torso_grid)]),
    torso_heart_domain=["heart"],
    ipc                  = LagrangeCollection{1}(),
    qrc                  = QuadratureRuleCollection(3),
    linear_solver        = Thunderbolt.LinearSolve.KrylovJL_CG(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, SparseArrays.size(A,1)/8); isSymA=true), LinearAlgebra.I)),
    # linear_solver        = Thunderbolt.LinearSolve.KrylovJL_GMRES(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, 24)), LinearAlgebra.I)),
    system_matrix_type   = Thunderbolt.SparseMatrixCSR{Float64,Int64}, #Thunderbolt.ThreadedSparseMatrixCSR{Float64,Int64},
    strategy = Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
)

TimerOutputs.print_timer()

You need to apply these patches.

diff --git a/src/solver/interface.jl b/src/solver/interface.jl
index 8234949c..4279e657 100644
--- a/src/solver/interface.jl
+++ b/src/solver/interface.jl
@@ -148,13 +148,17 @@ update_constraints_block!(f::NullFunction, i::Block, solver_cache::AbstractTimeS
 
 create_system_matrix(T::Type{<:AbstractMatrix}, f::AbstractSemidiscreteFunction) = create_system_matrix(T, f.dh)
 
-function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR{Tv,Ti}}, dh::AbstractDofHandler) where {Tv,Ti}
-    Acsct = transpose(convert(SparseMatrixCSC{Tv,Ti}, allocate_matrix(dh)))
+function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR}, dh::AbstractDofHandler)
+    Acsct = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return ThreadedSparseMatrixCSR(Acsct)
 end
 
 function create_system_matrix(SpMatType::Type{<:SparseMatrixCSC}, dh::AbstractDofHandler)
-    A = convert(SpMatType, allocate_matrix(dh))
+    A = allocate_matrix(SparseMatrixCSC{Float64,Int64}, dh)
+    return A
+end
+function create_system_matrix(SpMatType::Type{<:SparseMatrixCSR}, dh::AbstractDofHandler)
+    A = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
     return A
 end
diff --git a/src/modeling/electrophysiology/ecg.jl b/src/modeling/electrophysiology/ecg.jl
index b0f47170..9db426db 100644
--- a/src/modeling/electrophysiology/ecg.jl
+++ b/src/modeling/electrophysiology/ecg.jl
@@ -402,6 +402,7 @@ function Geselowitz1989ECGLeadCache(
     linear_solver        = LinearSolve.KrylovJL_CG(),
     solution_vector_type = Vector{Float64},
     system_matrix_type   = ThreadedSparseMatrixCSR{Float64,Int64},
+    strategy             = SequentialAssemblyStrategy(SequentialCPUDevice()),
 )
     return Geselowitz1989ECGLeadCache(
         heart_fun,
@@ -416,6 +417,7 @@ function Geselowitz1989ECGLeadCache(
         linear_solver       ,
         solution_vector_type,
         system_matrix_type  ,
+        strategy            ,
     )
 end


η = convert(Tf, η)
D_Dl1 = adapt(backend, zeros(Tf, N)) # D + Dˡ
last_partsize = N - (nparts - 1) * partsize # size of the last partition
SLbuffer_size = (partsize * (partsize - 1) * (nparts-1)) ÷ 2 + last_partsize * (last_partsize - 1) ÷ 2
Collaborator

We should store the lower/upper triangular parts in some sparse format. I do not have opinions about which one exactly.
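
For illustration, SparseArrays can give the strictly triangular parts directly, so the memory footprint scales with the number of stored nonzeros instead of with partsize² per partition; a minimal sketch:

using SparseArrays

L = tril(A, -1) # strictly lower triangular part, stays sparse (CSC)
U = triu(A, 1)  # strictly upper triangular part
# A sweep then only visits the stored nonzeros of each block,
# rather than a densely packed partsize*(partsize-1)/2 buffer per partition.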

(; backend) = partitioning
# The following code is required because there is no assumption on the compatibility of x with the backend.
@timeit_debug "adapt(backend, y)" _y = adapt(backend, y)
@timeit_debug "_forward_sweep!" _forward_sweep!(_y, P)
Collaborator

The type of sweep should ideally be a user option -- and it must be symmetric if the problem is symmetric, as we otherwise tank the symmetry needed for the solver.

Comment on lines 5 to 60
## Structs & Constructors ##
## User defined types for sweep direction and storage strategy ##
abstract type Sweep end
struct ForwardSweep <: Sweep end # forward sweep -> lower triangular
struct BackwardSweep <: Sweep end # backward sweep -> upper triangular
struct SymmetricSweep <: Sweep end # symmetric sweep -> both lower and upper triangular

abstract type StorageStrategy end
struct OriginalMatrix <: StorageStrategy end # store original matrix
struct PackedBuffer <: StorageStrategy end # store lower/upper triangular parts in a vector buffer -> efficient for small partitions
struct SparseTriangular <: StorageStrategy end # store sparse lower/upper triangular matrices


## Internal types to encapsulate sweep and storage ##
abstract type SweepStorage end

# Forward sweep variants
abstract type ForwardSweepStorage <: SweepStorage end
struct ForwardSweepOriginalMatrix{MatrixType} <: ForwardSweepStorage
    A::MatrixType # Original matrix
end
struct ForwardSweepPackedBuffer{VectorType} <: ForwardSweepStorage
    SLbuffer::VectorType # Packed strictly lower triangular elements
end
struct ForwardSweepSparseTriangular{MatrixType} <: ForwardSweepStorage
    L::MatrixType # Sparse strictly lower triangular matrix
end

# Backward sweep variants
abstract type BackwardSweepStorage <: SweepStorage end
struct BackwardSweepOriginalMatrix{MatrixType} <: BackwardSweepStorage
    A::MatrixType # Original matrix
end
struct BackwardSweepPackedBuffer{VectorType} <: BackwardSweepStorage
    SUbuffer::VectorType # Packed strictly upper triangular elements
end
struct BackwardSweepSparseTriangular{MatrixType} <: BackwardSweepStorage
    U::MatrixType # Sparse strictly upper triangular matrix
end

# Symmetric sweep variants
abstract type SymmetricSweepStorage <: SweepStorage end
struct SymmetricSweepOriginalMatrix{MatrixType} <: SymmetricSweepStorage
    A::MatrixType # Original symmetric matrix
end


# TODO: are these two buffers necessary? won't be implemented for now
struct SymmetricSweepPackedBuffer{VectorType} <: SymmetricSweepStorage
    SLbuffer::VectorType # Packed strictly lower triangular elements
    SUbuffer::VectorType # Packed strictly upper triangular elements
end
struct SymmetricSweepSparseTriangular{MatrixType} <: SymmetricSweepStorage
    L::MatrixType # Sparse strictly lower triangular matrix
    U::MatrixType # Sparse strictly upper triangular matrix
end

Collaborator Author

I need some opinions on the design here:
There are three user-facing options for the sweep (forward, backward, symmetric) and three user-facing options for the storage strategy: the original matrix A, a packed buffer (what we had previously, since it might be performant in a GPU context), and sparse strictly lower/upper triangular matrices. These options are the high-level API for the user; internally they boil down to the permutations of SweepStorage (e.g. ForwardSweepOriginalMatrix), which encapsulate the sweep type and the buffer and make it easy to dispatch to the correct kernels (see the sketch after the snippet below).

Questions here:

  1. Is this design appropriate?
  2. If so, do we need to implement these?
# TODO: are these two buffers necessary? won't be implemented for now
struct SymmetricSweepPackedBuffer{VectorType} <: SymmetricSweepStorage
    SLbuffer::VectorType  # Packed strictly lower triangular elements
    SUbuffer::VectorType  # Packed strictly upper triangular elements
end
struct SymmetricSweepSparseTriangular{MatrixType} <: SymmetricSweepStorage
    L::MatrixType  # Sparse strictly lower triangular matrix
    U::MatrixType  # Sparse strictly upper triangular matrix
end
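
For concreteness, a sketch of how the user-facing options could boil down to the internal permutations (_build_cache and _pack_strict_lower are hypothetical helper names):

_build_cache(::ForwardSweep,  ::OriginalMatrix,   A) = ForwardSweepOriginalMatrix(A)
_build_cache(::ForwardSweep,  ::PackedBuffer,     A) = ForwardSweepPackedBuffer(_pack_strict_lower(A))
_build_cache(::ForwardSweep,  ::SparseTriangular, A) = ForwardSweepSparseTriangular(tril(A, -1))
_build_cache(::BackwardSweep, ::SparseTriangular, A) = BackwardSweepSparseTriangular(triu(A, 1))
# ... remaining (Sweep, StorageStrategy) combinations analogously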

Collaborator

Why do we need different storage types for the different sweep types? To control the dispatch? These are orthogonal, right? Also, to have unified naming, I think the storage types should be called "caches"; naming the inner vectors buffers is fine.

So, just to understand: the use case for storing A directly would be that the buffer stores the indices of the triangular parts in the block?

Note that symmetric sweep typically refers to a forward sweep followed by a backward sweep.

A small hint here: it might be helpful to have a layer to dispatch on, i.e. apply_sweep! (or something better, I am out of steam for today already), and some internal function like apply_forward_sweep_internal! which gets called on the unpacked data. This way you can reuse the forward/backward sweep code for the symmetric sweep.
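
A rough sketch of that layering (apply_sweep! and the internal helpers are placeholder names from the hint above; P is the preconditioner object):

apply_sweep!(y, P, ::ForwardSweep)  = apply_forward_sweep_internal!(y, P)
apply_sweep!(y, P, ::BackwardSweep) = apply_backward_sweep_internal!(y, P)
function apply_sweep!(y, P, ::SymmetricSweep)
    # a symmetric sweep is a forward sweep followed by a backward sweep,
    # so both internal kernels are reused instead of writing a third one
    apply_forward_sweep_internal!(y, P)
    apply_backward_sweep_internal!(y, P)
    return y
end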

Collaborator Author

Why do we need different storage types for the different sweep types?

Yes, mainly to dispatch to the right kernel easily. An alternative is to use something like

struct ForwardSweepCache
    cache::Union{Vector,SparseMatrix} # this sparse matrix might be strictly lower or `A` itself, so this might get hairy
end

So, just to understand: the use case for storing A directly would be that the buffer stores the indices of the triangular parts in the block?

No, A literally means the original matrix A; no indices need to be stored, because later in the iterator we know the start and the end of each block.

For caches, no problem, I will change that right away. For a dispatch layer, sure, no problem; this implementation was just a quick prototype on top of the previous structs.

@Abdelrahman912 (Collaborator, Author)

# Fields
- `device::AbstractDevice`: The backend used for the preconditioner. More info [AbstractDevice](@ref).
"""

Also, I don't know whether this was done intentionally or simply overlooked, but the reference to AbstractDevice gives a 404 (i.e. there is no API reference for it), even though devices.jl already has docstrings.

@termi-official (Collaborator)

reference to AbstractDevice gives 404

Yeah, I am not sure how to fix this. Nothing worth spending time on right now, though.

@AbdAlazezAhmed (Collaborator) commented Dec 9, 2025

reference to AbstractDevice gives 404

Doesn't seem to be referenced in an md file
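
(For reference, Documenter.jl resolves [AbstractDevice](@ref) only once the docstring is spliced into some page, e.g. via a @docs block; a minimal sketch, assuming a hypothetical docs/src/api-reference page:

```@docs
Thunderbolt.AbstractDevice
```
)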

@Abdelrahman912 Abdelrahman912 marked this pull request as draft January 7, 2026 17:18


Development

Successfully merging this pull request may close these issues.

Big memory allocation in the L1GS
