insarlab · s-sasaki-earthsea-wizard · May 6, 2026
diff --git a/docs/README.md b/docs/README.md
@@ -88,6 +88,8 @@ Algorithms implemented in the software are described in details at [Yunjun et al
 + [Example data directory](./dir_structure.md)
 + [Example template files](./templates/README.md)
 + [Tutorials in Jupyter Notebook](https://github.com/insarlab/MintPy-tutorial)
++ [Parallel processing with Dask](./dask.md)
++ [GPU acceleration for the `invert_network` step (opt-in PyTorch CUDA solver, partial)](./gpu.md)
 
 ### 4. Contact us
 

diff --git a/docs/dask.md b/docs/dask.md
@@ -7,6 +7,8 @@ Most computations in MintPy are operated in either a pixel-by-pixel or a epoch-b
 
 [Here](https://github.com/2gotgrossman/dask-rsmas-presentation) is an entry-level presentation on parallel computing using Dask by David Grossman. Below we brief describe for each cluster/scheduler the required options and recommended best practices.
 
+For GPU acceleration of the `invert_network` step on a single CUDA device — orthogonal to the Dask paths described here — see [gpu.md](./gpu.md).
+
 ## 1. local cluster ##
 
 The parallel processing on a single machine is supported via [`Dask.distributed.LocalCluster`](https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster). This is recommended if you are running MintPy on a local machine with multiple available cores, or on an HPC but wish to allocate only a single node's worth of resources.

diff --git a/docs/gpu.md b/docs/gpu.md
@@ -0,0 +1,68 @@
+# Configure GPU acceleration for the network inversion #
+
+The `invert_network` step (in `ifgram_inversion.py`) ships an opt-in GPU solver that batches the per-pixel weighted least-squares (WLS) inversion: it forms the normal equations for each pixel and solves them with a batched Cholesky factorization on a CUDA device via PyTorch. GPU support is partial: only `invert_network` is offloaded, and every other step in `smallbaselineApp.py` continues to run on the CPU. The default `mintpy.networkInversion.solver = auto` resolves to `cpu`, so existing setups are unaffected.
+
+The `torch` solver is orthogonal to Dask parallel processing (see [dask.md](./dask.md)): the former replaces the per-pixel CPU loop with a single batched Cholesky on one CUDA device, the latter distributes that same per-pixel loop across multiple worker processes. The two paths are not currently combined; pick one.
+
+## 1. Setup ##
+
+See [installation.md](./installation.md) section 2.4 for installing the `[gpu]` extras with the matching CUDA wheel index. Selecting `solver = torch` on a host without a visible CUDA device is a hard error (no silent CPU fallback).
+
+## 2. Enable ##
+
+#### 2.1 via command line ####
+
+Run the following in the terminal:
+
+```bash
+ifgram_inversion.py inputs/ifgramStack.h5 --solver torch
+ifgram_inversion.py inputs/ifgramStack.h5 --solver torch --gpu-chunk-size 20000
+```
+
+`--gpu-chunk-size 0` (the default) auto-sizes the per-chunk pixel count from free VRAM; pass a positive integer to override.
+
+#### 2.2 via template file ####
+
+Adjust options in the template file:
+
+```cfg
+mintpy.networkInversion.solver       = torch  #[cpu / torch], auto for cpu
+mintpy.networkInversion.gpuChunkSize = auto   #[int >= 0], auto for 0 (auto-size from free VRAM)
+```
+
+and feed the template file to the script:
+
+```bash
+ifgram_inversion.py inputs/ifgramStack.h5 -t smallbaselineApp.cfg
+smallbaselineApp.py smallbaselineApp.cfg
+```
+
+#### 2.3 Testing using example data ####
+
+Download and process the FernandinaSenDT128 example dataset, then rerun the inversion with the CPU solver and with the GPU solver:
+
+```bash
+cd FernandinaSenDT128/mintpy
+ifgram_inversion.py inputs/ifgramStack.h5 -w no --solver cpu
+ifgram_inversion.py inputs/ifgramStack.h5 -w no --solver torch
+```
+
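+The two outputs should agree to float32 round-off (RMS on the order of 1e-5). Both runs write to the same default output files, so rename or copy the first result before the second run if you want to compare them.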
+
+## 3. Behavior notes ##
+
++ **VRAM auto-sizing.** `gpuChunkSize = 0` (auto) probes free VRAM at runtime and chooses a per-chunk pixel count with a fixed headroom factor. Set an explicit integer to override (e.g. for reproducible chunking across hosts with different VRAM).
+
++ **Rank-deficient pixels.** Detected via `torch.linalg.cholesky_ex` info codes; their solution is set to zero so NaN/Inf never propagate downstream. A warning line reports the count per chunk.
+
++ **Per-pixel NaN observations.** Handled by zeroing the corresponding row weight, which is mathematically equivalent to dropping that row from the WLS system.
+
++ **No silent CPU fallback.** Selecting `solver = torch` on a host without a visible CUDA device raises immediately rather than silently falling back to CPU; this keeps performance regressions visible.
+
+## 4. Performance ##
+
+Indicative numbers below were measured on an NVIDIA RTX 5080 (Blackwell sm_120, CUDA 12.8, PyTorch 2.11) at the time this feature was submitted. Speedup depends on scene size, GPU class, and chunk-size tuning, so reproduce on your own data and hardware before drawing conclusions.
+
++ **Tutorial-scale** (FernandinaSenDT128: 270k pixels, 288 ifgs) — `invert_network` runs roughly **16×** faster internally and **4.5×** faster end-to-end versus the CPU path.
++ **Large-scene** (GalapagosSenDT128: 3.4M pixels, 475 ifgs; ~12.6× pixels and 1.65× ifgs over Fernandina) — roughly **44×** internal and **36×** step-wall speedup on `invert_network` (CPU 6189 s → torch 170 s on the same machine), confirming the speedup grows at scale.
++ **Numerical equivalence** between the `cpu` and `torch` solvers holds to float32 round-off: RMS on the order of `1e-5` on the tutorial case, with absolute RMS at most ~16 µm on the large-scene case.
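
Reviewer note on `docs/gpu.md`: the batched "normal equations + Cholesky" formulation and the zero-weight treatment of NaN observations can be illustrated on the CPU with NumPy. This is a minimal sketch under stated assumptions, not the code in `mintpy.ifgram_inversion_gpu` (which this diff does not show). The function name `solve_wls_batched` and the use of plain weights rather than `weight_sqrt` are hypothetical choices made for the illustration.

```python
import numpy as np


def solve_wls_batched(A, y, w):
    """Per-pixel WLS via normal equations + Cholesky, batched over pixels.

    A : (num_obs, num_par) design matrix shared by all pixels
    y : (num_obs, num_pix) observations, may contain NaN
    w : (num_obs, num_pix) non-negative weights (assumed: weights, not sqrt-weights)
    Returns x : (num_par, num_pix)
    """
    # A NaN observation gets zero weight; its value then no longer matters.
    w = np.where(np.isnan(y), 0.0, w)
    y = np.nan_to_num(y, nan=0.0)

    # N[p] = A^T diag(w[:, p]) A   and   b[p] = A^T diag(w[:, p]) y[:, p]
    N = np.einsum('ij,ip,ik->pjk', A, w, A)
    b = np.einsum('ij,ip->pj', A, w * y)

    # Batched Cholesky N = L L^T, followed by two solves with L and L^T.
    # NumPy has no batched triangular solver, so the general solver is used.
    L = np.linalg.cholesky(N)
    z = np.linalg.solve(L, b[..., None])
    x = np.linalg.solve(np.swapaxes(L, -1, -2), z)
    return x[..., 0].T


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
y = rng.normal(size=(6, 4))
w = rng.uniform(0.5, 2.0, size=(6, 4))
y[2, 1] = np.nan  # one missing observation at pixel 1
x = solve_wls_batched(A, y, w)

# Reference for pixel 1: drop the NaN row and solve with sqrt-weights.
keep = ~np.isnan(y[:, 1])
sw = np.sqrt(w[keep, 1])
x_ref = np.linalg.lstsq(A[keep] * sw[:, None], y[keep, 1] * sw, rcond=None)[0]
print(np.allclose(x[:, 1], x_ref))
```

The comparison against `numpy.linalg.lstsq` on the reduced system is what "equivalent to dropping that row" means in practice. It holds only while the remaining rows keep the system full rank.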
diff --git a/docs/installation.md b/docs/installation.md
@@ -194,6 +194,64 @@ Same as the <a href="#21-install-on-linux">instruction for Linux</a>, except for
 </details>
 </p>
 
+### 2.4 Optional: GPU acceleration via PyTorch CUDA ###
+
+<p>
+<details>
+<p><summary>Click to expand for more details</summary></p>
+
+<p>The <code>invert_network</code> step ships an opt-in GPU solver that solves the per-pixel WLS inversion as batched normal equations + Cholesky on a CUDA device via PyTorch. The default <code>mintpy.networkInversion.solver = auto</code> resolves to <code>cpu</code>, so existing setups are unaffected. There is no silent CPU fallback: selecting <code>torch</code> without a visible CUDA device is a hard error.</p>
+
+<h4>a. Prerequisites</h4>
+
+<ul>
+<li>An NVIDIA GPU with a working CUDA driver</li>
+<li>MintPy installed from source in editable mode (Section 2 above)</li>
+</ul>
+
+<h4>b. Install the [gpu] extras</h4>
+
+<p>The <code>[gpu]</code> extras pull a CUDA-enabled PyTorch build. Pick the <code>cuXXX</code> wheel index that is compatible with your NVIDIA driver (the PyTorch wheels bundle their own CUDA runtime, so a separate CUDA toolkit install is not required; see <a href="https://pytorch.org/get-started/locally/">pytorch.org</a> for the current list); for example, <code>cu121</code>, <code>cu124</code>, or <code>cu128</code>:</p>
+
+```bash
+python -m pip install -e ".[gpu]" \
+    --extra-index-url https://download.pytorch.org/whl/cu128
+```
+
+<p>If you use <a href="https://docs.astral.sh/uv/">uv</a> instead of <code>pip</code>, add <code>--index-strategy unsafe-best-match</code> to work around a stale <code>setuptools</code> pin in the PyTorch wheel index:</p>
+
+```bash
+uv pip install -e ".[gpu]" \
+    --extra-index-url https://download.pytorch.org/whl/cu128 \
+    --index-strategy unsafe-best-match
+```
+
+<h4>c. Verify</h4>
+
+```bash
+python -c "import torch; print(torch.cuda.is_available())"
+# expected: True
+```
+
+<h4>d. Enable</h4>
+
+<p>Set the template flag:</p>
+
+```cfg
+mintpy.networkInversion.solver = torch
+```
+
+<p>or pass it on the command line:</p>
+
+```bash
+ifgram_inversion.py inputs/ifgramStack.h5 --solver torch
+```
+
+<p>See <a href="./gpu.md">gpu.md</a> for tuning, behavior notes, and benchmarks.</p>
+
+</details>
+</p>
+
 ## 3. Post-Installation Setup ##
 
 #### a. ERA5 for tropospheric correction ####
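
Reviewer note on the "Verify" step in `docs/installation.md`: the one-liner prints `False` when no device is visible, whereas the solver is documented to fail hard. The sketch below shows what an explicit pre-flight check with that behavior could look like. The helper `require_cuda` is hypothetical and is not part of this PR; only `torch.cuda.is_available()` and `torch.cuda.get_device_name()` are real PyTorch calls.

```python
import importlib.util


def require_cuda():
    """Return the name of CUDA device 0, or raise if the GPU path is unusable.

    Hypothetical helper for illustration only; MintPy's own check is not
    shown in this diff.
    """
    if importlib.util.find_spec("torch") is None:
        raise RuntimeError("PyTorch is not installed; install the [gpu] extras first.")
    import torch

    if not torch.cuda.is_available():
        # Mirror the documented policy: fail loudly instead of using the CPU.
        raise RuntimeError("No visible CUDA device; refusing to fall back to CPU.")
    return torch.cuda.get_device_name(0)


if __name__ == "__main__":
    try:
        print("CUDA device:", require_cuda())
    except RuntimeError as err:
        print("GPU solver unavailable:", err)
```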

diff --git a/pyproject.toml b/pyproject.toml
@@ -114,6 +114,9 @@ readme = { file = ["docs/README.md"], content-type = "text/markdown" }
 [tool.setuptools.dynamic.optional-dependencies.test]
 file = ["tests/requirements.txt"]
 
+[tool.setuptools.dynamic.optional-dependencies.gpu]
+file = ["requirements-gpu.txt"]
+
 [tool.setuptools.packages.find]
 where = ["src"]
 

diff --git a/requirements-gpu.txt b/requirements-gpu.txt
@@ -0,0 +1,4 @@
+# Optional GPU acceleration deps for the `[gpu]` extras.
+# Install with the PyTorch CUDA wheel index, e.g.:
+#   pip install -e ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
+torch>=2.11
diff --git a/src/mintpy/cli/ifgram_inversion.py b/src/mintpy/cli/ifgram_inversion.py
@@ -86,6 +86,16 @@ def create_parser(subparsers=None):
     solver.add_argument('--min-norm-phase', dest='minNormVelocity', action='store_false',
                         help=('Enable inversion with minimum-norm deformation phase,'
                               ' instead of the default minimum-norm deformation velocity.'))
+    solver.add_argument('--solver', dest='solver', default='cpu',
+                        choices=['cpu', 'torch'],
+                        help='WLS solver: cpu (scipy.linalg.lstsq, default) '
+                             'or torch (CUDA-batched normal-equation + Cholesky via '
+                             'PyTorch). torch requires the [gpu] extras and a visible '
+                             'CUDA device; absence is a hard error. '
+                             'See docs/installation.md.')
+    solver.add_argument('--gpu-chunk-size', dest='gpuChunkSize', type=int, default=0,
+                        help='pixels per GPU chunk for --solver=torch '
+                             '(0=auto-size from free VRAM; default).')
     #solver.add_argument('--norm', dest='residualNorm', default='L2', choices=['L1', 'L2'],
     #                    help='Optimization method, L1 or L2 norm. (default: %(default)s).')
 
@@ -234,8 +244,10 @@ def read_template2inps(template_file, inps):
         elif value:
             if key in ['maskThreshold', 'minRedundancy']:
                 iDict[key] = float(value)
-            elif key in ['residualNorm', 'waterMaskFile']:
+            elif key in ['residualNorm', 'waterMaskFile', 'solver']:
                 iDict[key] = value
+            elif key in ['gpuChunkSize']:
+                iDict[key] = int(value)
 
     # computing configurations
     dask_key_prefix = 'mintpy.compute.'
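
Reviewer note on the CLI and template changes: the interaction between the two new flags and the template keys can be sketched in isolation. This is a simplified stand-in, not MintPy's `create_parser` or `read_template2inps`. The functions `create_solver_parser` and `apply_template` and the `auto_values` dictionary are hypothetical, and the sketch assumes that template values take precedence over command-line values once `auto` has been resolved, as the assignment into `iDict` above suggests.

```python
import argparse

PREFIX = "mintpy.networkInversion."


def create_solver_parser():
    """Minimal parser carrying only the two options added in this PR."""
    parser = argparse.ArgumentParser(prog="ifgram_inversion_sketch")
    parser.add_argument("--solver", dest="solver", default="cpu",
                        choices=["cpu", "torch"])
    parser.add_argument("--gpu-chunk-size", dest="gpuChunkSize", type=int, default=0)
    return parser


def apply_template(inps, template, auto_values):
    """Overlay template options onto parsed args; 'auto' resolves via auto_values."""
    for key in ("solver", "gpuChunkSize"):
        value = template.get(PREFIX + key)
        if value is None:
            continue
        if value == "auto":
            value = auto_values[PREFIX + key]
        setattr(inps, key, int(value) if key == "gpuChunkSize" else value)
    return inps


inps = create_solver_parser().parse_args(["--solver", "torch"])
template = {PREFIX + "solver": "auto", PREFIX + "gpuChunkSize": "20000"}
auto_values = {PREFIX + "solver": "cpu", PREFIX + "gpuChunkSize": "0"}
inps = apply_template(inps, template, auto_values)
print(inps.solver, inps.gpuChunkSize)  # prints: cpu 20000
```

In this sketch a template that leaves `solver = auto` overrides `--solver torch` from the command line. If that is not the intended precedence in MintPy, it may be worth a line in the docs.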

diff --git a/src/mintpy/defaults/smallbaselineApp.cfg b/src/mintpy/defaults/smallbaselineApp.cfg
@@ -175,6 +175,15 @@ mintpy.unwrapError.bridgePtsRadius = auto  #[1-inf], auto for 50, half size of t
 mintpy.networkInversion.weightFunc      = auto #[var / fim / coh / no], auto for var
 mintpy.networkInversion.waterMaskFile   = auto #[filename / no], auto for waterMask.h5 or no [if not found]
 mintpy.networkInversion.minNormVelocity = auto #[yes / no], auto for yes, min-norm deformation velocity / phase
+## WLS solver for the per-pixel network inversion (GPU path is opt-in):
+## a. cpu   - scipy.linalg.lstsq, per-pixel (default, original behavior)
+## b. torch - batched normal-equation + Cholesky on CUDA via PyTorch.
+##            Requires the [gpu] extras (CUDA-enabled torch build) and a
+##            visible CUDA device; absence is a hard error (no silent CPU
+##            fallback). The default 'auto' resolves to 'cpu', so existing
+##            setups are unaffected. See docs/installation.md.
+mintpy.networkInversion.solver          = auto #[cpu / torch], auto for cpu
+mintpy.networkInversion.gpuChunkSize    = auto #[int >= 0], auto for 0 (auto-size from free VRAM)
 
 ## mask options for unwrapPhase of each interferogram before inversion (recommend if weightFunct=no):
 ## a. coherence              - mask out pixels with spatial coherence < maskThreshold
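
Reviewer note on `gpuChunkSize`: the docs describe auto-sizing as "free VRAM with a fixed headroom factor" without giving the formula. The sketch below shows one plausible shape for such a rule. The memory model, the `headroom` value of 0.5, and the function names are all assumptions made for illustration and are not taken from this PR.

```python
def auto_chunk_size(free_bytes, num_obs, num_par, headroom=0.5, bytes_per_value=4):
    """Estimate pixels per chunk from free device memory (hypothetical model)."""
    # Assumed per-pixel working set, counted in float32 values:
    #   observations and weights      -> 2 * num_obs
    #   normal matrix and its factor  -> 2 * num_par**2
    #   right-hand side and solution  -> 2 * num_par
    values_per_pixel = 2 * num_obs + 2 * num_par ** 2 + 2 * num_par
    budget = int(free_bytes * headroom)
    return max(1, budget // (values_per_pixel * bytes_per_value))


def resolve_chunk_size(requested, free_bytes, num_obs, num_par):
    """0 means auto-size; any positive integer is used as given."""
    if requested < 0:
        raise ValueError("gpuChunkSize must be >= 0")
    if requested > 0:
        return requested
    return auto_chunk_size(free_bytes, num_obs, num_par)


free = 8 * 1024 ** 3  # pretend 8 GiB are free; num_par=97 is also illustrative
print(resolve_chunk_size(20000, free, num_obs=288, num_par=97))  # explicit override
print(resolve_chunk_size(0, free, num_obs=288, num_par=97))      # auto-sized
```

On a real device the free-memory figure could come from `torch.cuda.mem_get_info()`, which returns a `(free, total)` pair in bytes.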

diff --git a/src/mintpy/defaults/smallbaselineApp_auto.cfg b/src/mintpy/defaults/smallbaselineApp_auto.cfg
@@ -70,6 +70,8 @@ mintpy.unwrapError.bridgePtsRadius   = 50
 mintpy.networkInversion.weightFunc       = var
 mintpy.networkInversion.waterMaskFile    = waterMask.h5
 mintpy.networkInversion.minNormVelocity  = yes
+mintpy.networkInversion.solver           = cpu
+mintpy.networkInversion.gpuChunkSize     = 0
 
 ## mask
 mintpy.networkInversion.maskDataset      = no

diff --git a/src/mintpy/ifgram_inversion.py b/src/mintpy/ifgram_inversion.py
@@ -595,7 +595,8 @@ def get_design_matrix4std(stack_obj):
 
 def run_ifgram_inversion_patch(ifgram_file, box=None, ref_phase=None, obs_ds_name='unwrapPhase',
                                weight_func='var', water_mask_file=None, min_norm_velocity=True,
-                               mask_ds_name=None, mask_threshold=0.4, min_redundancy=1.0, calc_cov=False):
+                               mask_ds_name=None, mask_threshold=0.4, min_redundancy=1.0, calc_cov=False,
+                               solver='cpu', gpu_chunk_size=0):
     """Invert one patch of an ifgram stack into timeseries.
 
     Parameters: ifgram_file       - str, interferograms stack HDF5 file, e.g. ./inputs/ifgramStack.h5
@@ -610,6 +611,15 @@ def run_ifgram_inversion_patch(ifgram_file, box=None, ref_phase=None, obs_ds_nam
                 mask_threshold    - float, min coherence of pixels if mask_dataset_name='coherence'
                 min_redundancy    - float, the min number of ifgrams for every acquisition.
                 calc_cov          - bool, calculate the time series covariance matrix.
+                solver            - str, WLS solver: 'cpu' (default,
+                                    scipy.linalg.lstsq per pixel) or 'torch'
+                                    (CUDA-batched normal-equation + Cholesky via
+                                    PyTorch). The 'torch' solver requires the
+                                    [gpu] extras and a visible CUDA device;
+                                    absence is a hard error (no silent CPU
+                                    fallback).
+                gpu_chunk_size    - int, pixels per GPU chunk for solver='torch'.
+                                    0 (default) auto-sizes from free VRAM.
     Returns:    ts                - 3D array in size of (num_date, num_row, num_col)
                 ts_cov            - 4D array in size of (num_date, num_date, num_row, num_col) or None
                 inv_quality       - 2D array in size of (num_row, num_col)
@@ -800,8 +810,34 @@ def run_ifgram_inversion_patch(ifgram_file, box=None, ref_phase=None, obs_ds_nam
         'inv_quality_name'  : inv_quality_name,
     }
 
+    # 2.x GPU batched path: handles weighted and unweighted in one call.
+    # Per-pixel NaN observations are masked via zero-weights inside the kernel,
+    # which is mathematically equivalent to dropping them from the LS system
+    # for the full-rank case. Rank-deficient pixels (rare on real SBAS networks)
+    # are detected inside the batched solver and set to zero; see docs/gpu.md.
+    if solver != 'cpu':
+        from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch
+        print(f'estimating time-series via {solver} solver (batched, GPU)')
+        ts_sub, q_sub, n_sub = estimate_timeseries_batch(
+            A=A, B=B,
+            y=stack_obs[:, idx_pixel2inv],
+            weight_sqrt=(weight_sqrt[:, idx_pixel2inv]
+                         if weight_sqrt is not None else None),
+            tbase_diff=tbase_diff,
+            min_norm_velocity=min_norm_velocity,
+            rcond=1e-5,
+            min_redundancy=min_redundancy,
+            inv_quality_name=inv_quality_name,
+            chunk_size=gpu_chunk_size,
+            solver=solver,
+        )
+        ts[:, idx_pixel2inv] = ts_sub
+        inv_quality[idx_pixel2inv] = q_sub
+        num_inv_obs[idx_pixel2inv] = n_sub
+        del mask
+
     # 2.2 un-weighted inversion (classic SBAS)
-    if weight_sqrt is None:
+    elif weight_sqrt is None:
         msg = f'estimating time-series for pixels with valid {obs_ds_name} in'
 
         # a. split mask into mask_all/part_net
@@ -1089,6 +1125,8 @@ def run_ifgram_inversion(inps):
         "mask_threshold"    : inps.maskThreshold,
         "min_redundancy"    : inps.minRedundancy,
         "calc_cov"          : inps.calcCov,
+        "solver"            : getattr(inps, 'solver', 'cpu'),
+        "gpu_chunk_size"    : int(getattr(inps, 'gpuChunkSize', 0)),
     }
 
     # 3.3 invert / write block-by-block
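
Reviewer note on rank-deficient pixels: `docs/gpu.md` in this PR states that such pixels are detected through `torch.linalg.cholesky_ex` info codes and that their solution is set to zero. The NumPy sketch below illustrates the same policy on the CPU. It swaps in an eigenvalue test for the detection step, because NumPy's `cholesky` raises instead of returning per-matrix info codes. The function name and the use of `rcond` as a relative eigenvalue threshold are assumptions, not the PR's implementation.

```python
import numpy as np


def solve_normal_equations_safe(N, b, rcond=1e-5):
    """Solve N[p] x = b[p] per pixel; rank-deficient pixels get a zero solution.

    N : (num_pix, num_par, num_par) symmetric normal matrices
    b : (num_pix, num_par) right-hand sides
    Returns x : (num_pix, num_par) and a boolean mask of the pixels solved.
    """
    eigvals = np.linalg.eigvalsh(N)               # ascending order per pixel
    ok = eigvals[:, 0] > rcond * eigvals[:, -1]   # smallest vs. largest eigenvalue
    x = np.zeros_like(b)
    if ok.any():
        L = np.linalg.cholesky(N[ok])
        z = np.linalg.solve(L, b[ok][..., None])
        x[ok] = np.linalg.solve(np.swapaxes(L, -1, -2), z)[..., 0]
    num_bad = int((~ok).sum())
    if num_bad:
        print(f'WARNING: {num_bad} rank-deficient pixel(s) set to zero')
    return x, ok


A_full = np.array([[1., 0.], [1., 1.], [0., 1.]])
A_def = np.array([[1., 1.], [2., 2.], [3., 3.]])  # identical columns -> rank 1
N = np.stack([A_full.T @ A_full, A_def.T @ A_def])
b = np.stack([A_full.T @ np.array([1., 3., 2.]),
              A_def.T @ np.array([1., 2., 3.])])
x, ok = solve_normal_equations_safe(N, b)
print(ok, x)
```

The first pixel is full rank and recovers the exact solution `[1, 2]`; the second is flagged and zeroed, so no NaN or Inf reaches the output arrays.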