Big memory allocations in L1GS #238
Conversation
@termi-official these are the outputs from the two examples in the … Also, one thing that might be optimized is …
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## main #238 +/- ##
==========================================
- Coverage 71.71% 70.39% -1.32%
==========================================
Files 78 78
Lines 6515 6864 +349
==========================================
+ Hits 4672 4832 +160
- Misses 1843 2032 +189
Thanks for investigating, Abdelrahman! I left some comments on things that I caught on the fly.
Also take a look at Codecov; it looks like not all of your code is covered in the CI.
termi-official left a comment
Thanks for the changes, Abdelrahman. There are some minor things left, though.
@AbdAlazezAhmed can you confirm that this works now on the ECG?
I think we are missing tests for the symmetric case.
Thunderbolt.jl/test/test_preconditioners.jl, lines 157 to 163 in 8f872e3:

    @testset "Symmetric A" begin
        md = mdopen("HB/bcsstk15")
        A = md.A # type here is Symmetric{Float64, SparseMatrixCSC{Float64, Int64}}
        b = ones(size(A, 1))
        test_l1gs_prec(A, b)
    end
end
The matrix here is already of type Symmetric.
Also, for the test's algorithmic logic, I added this:
Thunderbolt.jl/test/test_preconditioners.jl, lines 21 to 33 in 8f872e3:

function test_sym(testname, A, x, y_exp, D_Dl1_exp, SLbuffer_exp, partsize)
    @testset "$testname Symmetric" begin
        total_ncores = 8 # Assuming 8 cores for testing
        for ncores in 1:total_ncores # testing for multiple cores to check that the answer is independent of the number of cores
            builder = L1GSPrecBuilder(PolyesterDevice(ncores))
            P = A isa Symmetric ? builder(A, partsize) : builder(A, partsize; isSymA=true)
            @test P.D_Dl1 ≈ D_Dl1_exp
            @test P.SLbuffer ≈ SLbuffer_exp
            y = P \ x
            @test y ≈ y_exp
        end
    end
end
I see. Thanks!
However, we should use CG for the symmetric problems to ensure that the preconditioner actually conserves the symmetry.
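For intuition (a hedged NumPy sketch, not code from this PR): a single forward sweep corresponds to the preconditioner M = D + L, which is not symmetric even when A is, while a symmetric sweep corresponds to M = (D + L) D⁻¹ (D + U), which is symmetric and therefore safe for CG.

```python
import numpy as np

# Illustrative only: small SPD test matrix, not a Thunderbolt system.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)       # symmetric positive definite

D = np.diag(np.diag(A))
L = np.tril(A, k=-1)              # strictly lower part
U = np.triu(A, k=1)               # strictly upper part
Dinv = np.diag(1.0 / np.diag(A))

M_fwd = D + L                     # one forward Gauss-Seidel sweep
M_sym = (D + L) @ Dinv @ (D + U)  # forward sweep followed by backward sweep

print(np.allclose(M_fwd, M_fwd.T))  # False: violates what CG assumes
print(np.allclose(M_sym, M_sym.T))  # True: symmetry is preserved
```

This is why a GMRES run can mask the problem: GMRES does not care about symmetry of the preconditioned operator, CG does.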
There are still some issues.
- ThreadedSparseMatrixCSC support is missing
- The preconditioner tanks the symmetry
- Memory usage is too high
Let's try to make the following work:
using Thunderbolt
using SparseArrays,OrderedCollections,TimerOutputs,LinearAlgebra
# TimerOutputs.enable_debug_timings(Thunderbolt)
# TimerOutputs.enable_debug_timings(Thunderbolt.Preconditioners)
geo = Hexahedron
nel = 2
size = 2.
signal_strength = 0.04
nel_heart = (6, 6, 6) .* 1 .* nel
nel_torso = (8, 8, 8) .* Int(size) .* nel
ground_vertex = Vec(0.0, 0.0, 0.0)
electrodes = [
ground_vertex,
Vec(-size, 0., 0.),
Vec( size, 0., 0.),
Vec( 0., -size, 0.),
Vec( 0., size, 0.),
Vec( 0., 0., -size),
Vec( 0., 0., size),
]
electrode_pairs = [[i,1] for i in 2:length(electrodes)]
heart_grid = generate_mesh(geo, nel_heart)
Ferrite.transform_coordinates!(heart_grid, x->Vec{3}(sign.(x) .* x.^2))
κ = ConstantCoefficient(SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)))
κᵢ = AnalyticalCoefficient(
(x,t) -> norm(x,Inf) ≤ 1.0 ? SymmetricTensor{2,3,Float64}((1.0, 0, 0, 1.0, 0, 1.0)) : SymmetricTensor{2,3,Float64}((0.0, 0, 0, 0.0, 0, 0.0)),
CartesianCoordinateSystem{3}()
)
heart_model = TransientDiffusionModel(
κᵢ,
NoStimulationProtocol(), # Poisoning to detect if we accidentally touch these
:φₘ
)
heart_fun = semidiscretize(
heart_model,
FiniteElementDiscretization(
Dict(:φₘ => LagrangeCollection{1}()),
Dirichlet[]
),
heart_grid
)
op = Thunderbolt.setup_assembled_operator(
Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
Thunderbolt.BilinearDiffusionIntegrator(
κ,
QuadratureRuleCollection(2),
:φₘ,
),
Thunderbolt.SparseMatrixCSC,
heart_fun.dh,
)
Thunderbolt.update_operator!(op, 0.0) # trigger assembly
torso_grid_ = generate_grid(geo, nel_torso, Vec((-size,-size,-size)), Vec((size,size,size)))
addcellset!(torso_grid_, "heart", x->norm(x,Inf) ≤ 1.0)
# addcellset!(torso_grid_, "surrounding-tissue", x->norm(x,Inf) ≥ 1.0)
torso_grid_.cellsets["surrounding-tissue"] = OrderedSet([i for i in 1:getncells(torso_grid_) if i ∉ torso_grid_.cellsets["heart"]])
torso_grid = Thunderbolt.to_mesh(torso_grid_)
u = zeros(Thunderbolt.solution_size(heart_fun))
geselowitz_electrodes = [[electrodes[1], electrodes[i]] for i in 2:length(electrodes)]
TimerOutputs.reset_timer!()
@info prod(nel_torso)
geselowitz_ecg = Thunderbolt.Geselowitz1989ECGLeadCache(
heart_fun,
torso_grid,
κᵢ, κ,
geselowitz_electrodes;
ground = OrderedSet([Thunderbolt.get_closest_vertex(ground_vertex, torso_grid)]),
torso_heart_domain=["heart"],
ipc = LagrangeCollection{1}(),
qrc = QuadratureRuleCollection(3),
linear_solver = Thunderbolt.LinearSolve.KrylovJL_CG(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, SparseArrays.size(A,1)/8); isSymA=true), LinearAlgebra.I)),
# linear_solver = Thunderbolt.LinearSolve.KrylovJL_GMRES(precs = (A, p) -> (L1GSPrecBuilder(PolyesterDevice(8))(A, ceil(Int, 24)), LinearAlgebra.I)),
system_matrix_type = Thunderbolt.SparseMatrixCSR{Float64,Int64}, #Thunderbolt.ThreadedSparseMatrixCSR{Float64,Int64},
strategy = Thunderbolt.PerColorAssemblyStrategy(Thunderbolt.PolyesterDevice(8)),
)
TimerOutputs.print_timer()

You need to apply these patches:
diff --git a/src/solver/interface.jl b/src/solver/interface.jl
index 8234949c..4279e657 100644
--- a/src/solver/interface.jl
+++ b/src/solver/interface.jl
@@ -148,13 +148,17 @@ update_constraints_block!(f::NullFunction, i::Block, solver_cache::AbstractTimeS
create_system_matrix(T::Type{<:AbstractMatrix}, f::AbstractSemidiscreteFunction) = create_system_matrix(T, f.dh)
-function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR{Tv,Ti}}, dh::AbstractDofHandler) where {Tv,Ti}
- Acsct = transpose(convert(SparseMatrixCSC{Tv,Ti}, allocate_matrix(dh)))
+function create_system_matrix(::Type{<:ThreadedSparseMatrixCSR}, dh::AbstractDofHandler)
+ Acsct = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
return ThreadedSparseMatrixCSR(Acsct)
end
function create_system_matrix(SpMatType::Type{<:SparseMatrixCSC}, dh::AbstractDofHandler)
- A = convert(SpMatType, allocate_matrix(dh))
+ A = allocate_matrix(SparseMatrixCSC{Float64,Int64}, dh)
+ return A
+end
+function create_system_matrix(SpMatType::Type{<:SparseMatrixCSR}, dh::AbstractDofHandler)
+ A = allocate_matrix(SparseMatrixCSR{1,Float64,Int64}, dh)
return A
end

diff --git a/src/modeling/electrophysiology/ecg.jl b/src/modeling/electrophysiology/ecg.jl
index b0f47170..9db426db 100644
--- a/src/modeling/electrophysiology/ecg.jl
+++ b/src/modeling/electrophysiology/ecg.jl
@@ -402,6 +402,7 @@ function Geselowitz1989ECGLeadCache(
linear_solver = LinearSolve.KrylovJL_CG(),
solution_vector_type = Vector{Float64},
system_matrix_type = ThreadedSparseMatrixCSR{Float64,Int64},
+ strategy = SequentialAssemblyStrategy(SequentialCPUDevice()),
)
return Geselowitz1989ECGLeadCache(
heart_fun,
@@ -416,6 +417,7 @@ function Geselowitz1989ECGLeadCache(
linear_solver ,
solution_vector_type,
system_matrix_type ,
+ strategy ,
)
end
    η = convert(Tf, η)
    D_Dl1 = adapt(backend, zeros(Tf, N)) # D + Dˡ
    last_partsize = N - (nparts - 1) * partsize # size of the last partition
    SLbuffer_size = (partsize * (partsize - 1) * (nparts-1)) ÷ 2 + last_partsize * (last_partsize - 1) ÷ 2
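For reference, the closed-form buffer size above can be cross-checked against an explicit per-partition count: each of the nparts - 1 full partitions contributes partsize * (partsize - 1) / 2 strictly-lower entries, and the (possibly smaller) last partition contributes its own triangle. An illustrative Python sketch (the helper names are mine, not from the codebase):

```python
def slbuffer_size(N, partsize):
    """Closed-form size of the packed strictly-lower buffer (illustrative)."""
    nparts = -(-N // partsize)                 # ceil(N / partsize)
    last = N - (nparts - 1) * partsize         # size of the last partition
    return (partsize * (partsize - 1) * (nparts - 1)) // 2 + last * (last - 1) // 2

def slbuffer_size_explicit(N, partsize):
    # Count entries below the diagonal inside each diagonal block directly.
    total = 0
    for start in range(0, N, partsize):
        p = min(partsize, N - start)
        total += p * (p - 1) // 2
    return total

assert all(slbuffer_size(n, p) == slbuffer_size_explicit(n, p)
           for n in range(1, 60) for p in range(1, 20))
print("ok")
```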
We should store the lower/upper triangular parts in some sparse format. I do not have opinions about which one exactly.
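One possible direction (an illustrative sketch in Python, not a prescription for which Julia sparse type to use): keep only the strictly lower triangle in CSR-style index/value arrays, so a sweep iterates over stored nonzeros only instead of a dense packed triangle. The helper below is hypothetical:

```python
def strict_lower_csr(indptr, indices, data, n):
    """Extract the strictly lower triangle of a CSR matrix as new CSR arrays (illustrative)."""
    lptr, lind, ldat = [0], [], []
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:                 # keep only entries below the diagonal
                lind.append(j)
                ldat.append(data[k])
        lptr.append(len(lind))
    return lptr, lind, ldat

# 3x3 example: [[4, 1, 0], [2, 5, 0], [0, 3, 6]] in CSR form
indptr  = [0, 2, 4, 6]
indices = [0, 1, 0, 1, 1, 2]
data    = [4, 1, 2, 5, 3, 6]
lptr, lind, ldat = strict_lower_csr(indptr, indices, data, 3)
print(lptr, lind, ldat)  # strictly lower entries kept: A[1,0]=2, A[2,1]=3
```

The storage then scales with the number of below-diagonal nonzeros rather than with partsize², which is where the quadratic memory growth comes from in the packed-buffer approach.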
    (; backend) = partitioning
    # The following code is required because there is no assumption on the compatibility of x with the backend.
    @timeit_debug "adapt(backend, y)" _y = adapt(backend, y)
    @timeit_debug "_forward_sweep!" _forward_sweep!(_y, P)
The type of sweep should ideally be a user option, and it must be the symmetric sweep if the problem is symmetric, as we tank the symmetry needed for the solver otherwise.
## Structs & Constructors ##
## User defined types for sweep direction and storage strategy ##
abstract type Sweep end
struct ForwardSweep <: Sweep end # forward sweep -> lower triangular
struct BackwardSweep <: Sweep end # backward sweep -> upper triangular
struct SymmetricSweep <: Sweep end # symmetric sweep -> both lower and upper triangular

abstract type StorageStrategy end
struct OriginalMatrix <: StorageStrategy end # store original matrix
struct PackedBuffer <: StorageStrategy end # store lower/upper triangular parts in a vector buffer -> efficient for small partitions
struct SparseTriangular <: StorageStrategy end # store sparse lower/upper triangular matrices

## Internal types to encapsulate sweep and storage ##
abstract type SweepStorage end

# Forward sweep variants
abstract type ForwardSweepStorage <: SweepStorage end
struct ForwardSweepOriginalMatrix{MatrixType} <: ForwardSweepStorage
    A::MatrixType # Original matrix
end
struct ForwardSweepPackedBuffer{VectorType} <: ForwardSweepStorage
    SLbuffer::VectorType # Packed strictly lower triangular elements
end
struct ForwardSweepSparseTriangular{MatrixType} <: ForwardSweepStorage
    L::MatrixType # Sparse strictly lower triangular matrix
end

# Backward sweep variants
abstract type BackwardSweepStorage <: SweepStorage end
struct BackwardSweepOriginalMatrix{MatrixType} <: BackwardSweepStorage
    A::MatrixType # Original matrix
end
struct BackwardSweepPackedBuffer{VectorType} <: BackwardSweepStorage
    SUbuffer::VectorType # Packed strictly upper triangular elements
end
struct BackwardSweepSparseTriangular{MatrixType} <: BackwardSweepStorage
    U::MatrixType # Sparse strictly upper triangular matrix
end

# Symmetric sweep variants
abstract type SymmetricSweepStorage <: SweepStorage end
struct SymmetricSweepOriginalMatrix{MatrixType} <: SymmetricSweepStorage
    A::MatrixType # Original symmetric matrix
end

# TODO: are these two buffers necessary? won't be implemented for now
struct SymmetricSweepPackedBuffer{VectorType} <: SymmetricSweepStorage
    SLbuffer::VectorType # Packed strictly lower triangular elements
    SUbuffer::VectorType # Packed strictly upper triangular elements
end
struct SymmetricSweepSparseTriangular{MatrixType} <: SymmetricSweepStorage
    L::MatrixType # Sparse strictly lower triangular matrix
    U::MatrixType # Sparse strictly upper triangular matrix
end
I need some opinion here about the design:
There are three user-defined options for the sweep (forward, backward, symmetric) and three user-defined options for the storage strategy: the original matrix A; a packed buffer (what we had previously, which might be performant in a GPU context); and sparse strictly lower/upper triangular matrices. These options are the high-level API for the user, and they boil down internally to the permutations of SweepStorage (e.g. ForwardSweepOriginalMatrix), which encapsulate the sweep type and the buffer (and help dispatch the correct kernels).
Questions here:
- Is this design appropriate?
- If so, do we need to implement these?
# TODO: are these two buffers necessary? won't be implemented for now
struct SymmetricSweepPackedBuffer{VectorType} <: SymmetricSweepStorage
SLbuffer::VectorType # Packed strictly lower triangular elements
SUbuffer::VectorType # Packed strictly upper triangular elements
end
struct SymmetricSweepSparseTriangular{MatrixType} <: SymmetricSweepStorage
L::MatrixType # Sparse strictly lower triangular matrix
U::MatrixType # Sparse strictly upper triangular matrix
end
Why do we need different storage types for the different sweep types? To control the dispatch? These are orthogonal, right? Also, to have unified naming, I think the storage types should be called "caches". Naming the inner vectors "buffers" is fine.
So, just to understand: the use case for storing A directly would be that the buffer stores the indices for the triangular parts in the block?
Note that a symmetric sweep typically refers to a forward sweep followed by a backward sweep.
A small hint here: it might be helpful to have a layer to dispatch with, i.e. apply_sweep! (or something better, I am out of steam for today already), and some internal function like apply_forward_sweep_internal! which gets called on the unpacked data. This way you can reuse the code of the forward/backward sweep for the symmetric sweep.
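The suggested layering could look roughly like the following Python sketch (illustrative only; in Julia the entry point would be the proposed apply_sweep! dispatching on the cache type, with internals like apply_forward_sweep_internal!). The point is that the symmetric sweep is exactly the forward pass followed by the backward pass, so the two internal kernels are shared rather than duplicated:

```python
# Illustrative only: dense Gauss-Seidel-style sweeps on a tiny system.
def _forward(A, y, b):
    n = len(b)
    for i in range(n):                     # sweep rows top to bottom
        s = sum(A[i][j] * y[j] for j in range(n) if j != i)
        y[i] = (b[i] - s) / A[i][i]

def _backward(A, y, b):
    n = len(b)
    for i in reversed(range(n)):           # sweep rows bottom to top
        s = sum(A[i][j] * y[j] for j in range(n) if j != i)
        y[i] = (b[i] - s) / A[i][i]

def apply_sweep(kind, A, y, b):
    # Single dispatch point; the symmetric sweep reuses both internals.
    if kind == "forward":
        _forward(A, y, b)
    elif kind == "backward":
        _backward(A, y, b)
    elif kind == "symmetric":
        _forward(A, y, b)
        _backward(A, y, b)

# Tiny SPD system: 4x + y = 1, x + 3y = 2  ->  x = 1/11, y = 7/11
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
y = [0.0, 0.0]
for _ in range(50):
    apply_sweep("symmetric", A, y, b)
print(y)
```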
Why do we need different storage types for the different sweep types?

Yes, mainly for dispatching the right kernel easily. An alternative is to use something like

struct ForwardSweepCache
    cache::Union{Vector,SparseMatrix} # this sparse matrix might be strictly lower or A itself, so might be hairy
end

So, just to understand, the use-case to store A directly would be that the buffer will store the indices for the triangular parts in the block?

No, A literally means the original matrix A ... no indices need to be stored, because later in the iterator we know the start and the end of each block.
For caches, no problem, I will change it right away. For a dispatch layer, yes, sure, no problem ... this implementation was just a quick prototype of the previous structs.
Thunderbolt.jl/src/solver/linear/preconditioners/l1_gauss_seidel.jl, lines 144 to 146 in b34c57a

Also, I don't know whether this is done intentionally or simply overlooked, but the reference to …

Yea, I am not sure how to fix this. Nothing worth spending time on right now, though.

Doesn't seem to be referenced in an md file.
closes #230