Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,8 @@ Algorithms implemented in the software are described in details at [Yunjun et al
+ [Example data directory](./dir_structure.md)
+ [Example template files](./templates/README.md)
+ [Tutorials in Jupyter Notebook](https://github.com/insarlab/MintPy-tutorial)
+ [Parallel processing with Dask](./dask.md)
+ [GPU acceleration for the `invert_network` step (opt-in PyTorch CUDA solver, partial)](./gpu.md)

### 4. Contact us

Expand Down
2 changes: 2 additions & 0 deletions docs/dask.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@ Most computations in MintPy are operated in either a pixel-by-pixel or a epoch-b

[Here](https://github.com/2gotgrossman/dask-rsmas-presentation) is an entry-level presentation on parallel computing using Dask by David Grossman. Below we brief describe for each cluster/scheduler the required options and recommended best practices.

For GPU acceleration of the `invert_network` step on a single CUDA device — orthogonal to the Dask paths described here — see [gpu.md](./gpu.md).

## 1. local cluster ##

The parallel processing on a single machine is supported via [`Dask.distributed.LocalCluster`](https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster). This is recommended if you are running MintPy on a local machine with multiple available cores, or on an HPC but wish to allocate only a single node's worth of resources.
Expand Down
68 changes: 68 additions & 0 deletions docs/gpu.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# Configure GPU acceleration for the network inversion #

The `invert_network` step (in `ifgram_inversion.py`) ships an opt-in GPU solver that batches the per-pixel weighted least-squares inversion as normal-equations + Cholesky on a CUDA device via PyTorch. This is a partial GPU implementation: only `invert_network` is offloaded to the GPU; every other step in `smallbaselineApp.py` continues to run on the CPU. The solver is opt-in — the default `mintpy.networkInversion.solver = auto` resolves to `cpu`, so existing setups are unaffected.

The `torch` solver is orthogonal to Dask parallel processing (see [dask.md](./dask.md)): the former replaces the per-pixel CPU loop with a single batched Cholesky on one CUDA device, the latter distributes that same per-pixel loop across multiple worker processes. The two paths are not currently combined; pick one.

## 1. Setup ##

See [installation.md](./installation.md) section 2.4 for installing the `[gpu]` extras with the matching CUDA wheel index. Selecting `solver = torch` on a host without a visible CUDA device is a hard error (no silent CPU fallback).

## 2. Enable ##

#### 2.1 via command line ####

Run the following in the terminal:

```bash
ifgram_inversion.py inputs/ifgramStack.h5 --solver torch
ifgram_inversion.py inputs/ifgramStack.h5 --solver torch --gpu-chunk-size 20000
```

`--gpu-chunk-size 0` (the default) auto-sizes the per-chunk pixel count from free VRAM; pass a positive integer to override.

#### 2.2 via template file ####

Adjust options in the template file:

```cfg
mintpy.networkInversion.solver = torch #[cpu / torch], auto for cpu
mintpy.networkInversion.gpuChunkSize = auto #[int >= 0], auto for 0 (auto-size from free VRAM)
```

and feed the template file to the script:

```bash
ifgram_inversion.py inputs/ifgramStack.h5 -t smallbaselineApp.cfg
smallbaselineApp.py smallbaselineApp.cfg
```

#### 2.3 Testing using example data ####

Download and run the FernandinaSenDT128 example data; then run with and without the GPU solver:

```bash
cd FernandinaSenDT128/mintpy
ifgram_inversion.py inputs/ifgramStack.h5 -w no --solver cpu
ifgram_inversion.py inputs/ifgramStack.h5 -w no --solver torch
```

The two outputs should agree to float32 round-off (RMS on the order of 1e-5).

## 3. Behavior notes ##

+ **VRAM auto-sizing.** `gpuChunkSize = 0` (auto) probes free VRAM at runtime and chooses a per-chunk pixel count with a fixed headroom factor. Set an explicit integer to override (e.g. for reproducible chunking across hosts with different VRAM).

+ **Rank-deficient pixels.** Detected via `torch.linalg.cholesky_ex` info codes; their solution is set to zero so NaN/Inf never propagate downstream. A warning line reports the count per chunk.

+ **Per-pixel NaN observations.** Handled by zeroing the corresponding row weight, which is mathematically equivalent to dropping that row from the WLS system.

+ **No silent CPU fallback.** Selecting `solver = torch` on a host without a visible CUDA device raises immediately rather than silently falling back to CPU; this keeps performance regressions visible.

## 4. Performance ##

Indicative numbers below were measured on an NVIDIA RTX 5080 (Blackwell sm_120, CUDA 12.8, PyTorch 2.11) at the time this feature was submitted. Speedup depends on scene size, GPU class, and chunk-size tuning, so reproduce on your own data and hardware before drawing conclusions.

+ **Tutorial-scale** (FernandinaSenDT128: 270k pixels, 288 ifgs) — `invert_network` runs roughly **16×** faster internally and **4.5×** faster end-to-end versus the CPU path.
+ **Large-scene** (GalapagosSenDT128: 3.4M pixels, 475 ifgs; ~12.6× pixels and 1.65× ifgs over Fernandina) — roughly **44×** internal and **36×** step-wall speedup on `invert_network` (CPU 6189 s → torch 170 s on the same machine), confirming the speedup grows at scale.
+ **Numerical equivalence** between the `cpu` and `torch` solvers holds to float32 round-off: RMS on the order of `1e-5` on the tutorial case, with absolute RMS at most ~16 µm on the large-scene case.
58 changes: 58 additions & 0 deletions docs/installation.md
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,64 @@ Same as the <a href="#21-install-on-linux">instruction for Linux</a>, except for
</details>
</p>

### 2.4 Optional: GPU acceleration via PyTorch CUDA ###

<p>
<details>
<p><summary>Click to expand for more details</summary></p>

<p>The <code>invert_network</code> step ships an opt-in GPU solver that solves the per-pixel WLS inversion as a batched normal-equation + Cholesky on a CUDA device via PyTorch. It is opt-in: the default <code>mintpy.networkInversion.solver = auto</code> resolves to <code>cpu</code>, so existing setups are unaffected. There is no silent CPU fallback — selecting <code>torch</code> without a visible CUDA device is a hard error.</p>

<h4>a. Prerequisites</h4>

<ul>
<li>An NVIDIA GPU with a working CUDA driver</li>
<li>MintPy installed from source in editable mode (Section 2 above)</li>
</ul>

<h4>b. Install the [gpu] extras</h4>

<p>The <code>[gpu]</code> extras pull a CUDA-enabled PyTorch build. Pick the <code>cuXXX</code> wheel index that matches your CUDA toolkit version (see <a href="https://pytorch.org/get-started/locally/">pytorch.org</a> for the current list); for example, <code>cu121</code>, <code>cu124</code>, or <code>cu128</code>:</p>

```bash
python -m pip install -e ".[gpu]" \
--extra-index-url https://download.pytorch.org/whl/cu128
```

<p>If you use <a href="https://docs.astral.sh/uv/">uv</a> instead of <code>pip</code>, add <code>--index-strategy unsafe-best-match</code> to work around a stale <code>setuptools</code> pin in the PyTorch wheel index:</p>

```bash
uv pip install -e ".[gpu]" \
--extra-index-url https://download.pytorch.org/whl/cu128 \
--index-strategy unsafe-best-match
```

<h4>c. Verify</h4>

```bash
python -c "import torch; print(torch.cuda.is_available())"
# expected: True
```

<h4>d. Enable</h4>

<p>Set the template flag:</p>

```cfg
mintpy.networkInversion.solver = torch
```

<p>or pass it on the command line:</p>

```bash
ifgram_inversion.py inputs/ifgramStack.h5 --solver torch
```

<p>See <a href="./gpu.md">gpu.md</a> for tuning, behavior notes, and benchmarks.</p>

</details>
</p>

## 3. Post-Installation Setup ##

#### a. ERA5 for tropospheric correction ####
Expand Down
3 changes: 3 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,9 @@ readme = { file = ["docs/README.md"], content-type = "text/markdown" }
[tool.setuptools.dynamic.optional-dependencies.test]
file = ["tests/requirements.txt"]

[tool.setuptools.dynamic.optional-dependencies.gpu]
file = ["requirements-gpu.txt"]

[tool.setuptools.packages.find]
where = ["src"]

Expand Down
4 changes: 4 additions & 0 deletions requirements-gpu.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Optional GPU acceleration deps for the `[gpu]` extras.
# Install with the PyTorch CUDA wheel index, e.g.:
# pip install -e ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
torch>=2.11
14 changes: 13 additions & 1 deletion src/mintpy/cli/ifgram_inversion.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,16 @@ def create_parser(subparsers=None):
solver.add_argument('--min-norm-phase', dest='minNormVelocity', action='store_false',
help=('Enable inversion with minimum-norm deformation phase,'
' instead of the default minimum-norm deformation velocity.'))
solver.add_argument('--solver', dest='solver', default='cpu',
choices={'cpu', 'torch'},
help='WLS solver: cpu (scipy.linalg.lstsq, default) '
'or torch (CUDA-batched normal-equation + Cholesky via '
'PyTorch). torch requires the [gpu] extras and a visible '
'CUDA device; absence is a hard error. '
'See docs/installation.md.')
solver.add_argument('--gpu-chunk-size', dest='gpuChunkSize', type=int, default=0,
help='pixels per GPU chunk for --solver=torch '
'(0=auto-size from free VRAM; default).')
#solver.add_argument('--norm', dest='residualNorm', default='L2', choices=['L1', 'L2'],
# help='Optimization method, L1 or L2 norm. (default: %(default)s).')

Expand Down Expand Up @@ -234,8 +244,10 @@ def read_template2inps(template_file, inps):
elif value:
if key in ['maskThreshold', 'minRedundancy']:
iDict[key] = float(value)
elif key in ['residualNorm', 'waterMaskFile']:
elif key in ['residualNorm', 'waterMaskFile', 'solver']:
iDict[key] = value
elif key in ['gpuChunkSize']:
iDict[key] = int(value)

# computing configurations
dask_key_prefix = 'mintpy.compute.'
Expand Down
9 changes: 9 additions & 0 deletions src/mintpy/defaults/smallbaselineApp.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,15 @@ mintpy.unwrapError.bridgePtsRadius = auto #[1-inf], auto for 50, half size of t
mintpy.networkInversion.weightFunc = auto #[var / fim / coh / no], auto for var
mintpy.networkInversion.waterMaskFile = auto #[filename / no], auto for waterMask.h5 or no [if not found]
mintpy.networkInversion.minNormVelocity = auto #[yes / no], auto for yes, min-norm deformation velocity / phase
## WLS solver for the per-pixel network inversion (GPU path is opt-in):
## a. cpu - scipy.linalg.lstsq, per-pixel (default, original behavior)
## b. torch - batched normal-equation + Cholesky on CUDA via PyTorch.
## Requires the [gpu] extras (CUDA-enabled torch build) and a
## visible CUDA device; absence is a hard error (no silent CPU
## fallback). The default 'auto' resolves to 'cpu', so existing
## setups are unaffected. See docs/installation.md.
mintpy.networkInversion.solver = auto #[cpu / torch], auto for cpu
mintpy.networkInversion.gpuChunkSize = auto #[int >= 0], auto for 0 (auto-size from free VRAM)

## mask options for unwrapPhase of each interferogram before inversion (recommend if weightFunct=no):
## a. coherence - mask out pixels with spatial coherence < maskThreshold
Expand Down
2 changes: 2 additions & 0 deletions src/mintpy/defaults/smallbaselineApp_auto.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,8 @@ mintpy.unwrapError.bridgePtsRadius = 50
mintpy.networkInversion.weightFunc = var
mintpy.networkInversion.waterMaskFile = waterMask.h5
mintpy.networkInversion.minNormVelocity = yes
mintpy.networkInversion.solver = cpu
mintpy.networkInversion.gpuChunkSize = 0

## mask
mintpy.networkInversion.maskDataset = no
Expand Down
42 changes: 40 additions & 2 deletions src/mintpy/ifgram_inversion.py
Original file line number Diff line number Diff line change
Expand Up @@ -595,7 +595,8 @@ def get_design_matrix4std(stack_obj):

def run_ifgram_inversion_patch(ifgram_file, box=None, ref_phase=None, obs_ds_name='unwrapPhase',
weight_func='var', water_mask_file=None, min_norm_velocity=True,
mask_ds_name=None, mask_threshold=0.4, min_redundancy=1.0, calc_cov=False):
mask_ds_name=None, mask_threshold=0.4, min_redundancy=1.0, calc_cov=False,
solver='cpu', gpu_chunk_size=0):
"""Invert one patch of an ifgram stack into timeseries.

Parameters: ifgram_file - str, interferograms stack HDF5 file, e.g. ./inputs/ifgramStack.h5
Expand All @@ -610,6 +611,15 @@ def run_ifgram_inversion_patch(ifgram_file, box=None, ref_phase=None, obs_ds_nam
mask_threshold - float, min coherence of pixels if mask_dataset_name='coherence'
min_redundancy - float, the min number of ifgrams for every acquisition.
calc_cov - bool, calculate the time series covariance matrix.
solver - str, WLS solver: 'cpu' (default,
scipy.linalg.lstsq per pixel) or 'torch'
(CUDA-batched normal-equation + Cholesky via
PyTorch). The 'torch' solver requires the
[gpu] extras and a visible CUDA device;
absence is a hard error (no silent CPU
fallback).
gpu_chunk_size - int, pixels per GPU chunk for solver='torch'.
0 (default) auto-sizes from free VRAM.
Returns: ts - 3D array in size of (num_date, num_row, num_col)
ts_cov - 4D array in size of (num_date, num_date, num_row, num_col) or None
inv_quality - 2D array in size of (num_row, num_col)
Expand Down Expand Up @@ -800,8 +810,34 @@ def run_ifgram_inversion_patch(ifgram_file, box=None, ref_phase=None, obs_ds_nam
'inv_quality_name' : inv_quality_name,
}

# 2.x GPU batched path: handles weighted and unweighted in one call.
# Per-pixel NaN observations are masked via zero-weights inside the kernel,
# which is mathematically equivalent to dropping them from the LS system
# for the full-rank case. Rank-deficient pixels (rare on real SBAS networks)
# are not handled here; if encountered, NaN/Inf will propagate downstream.
if solver != 'cpu':
from mintpy.ifgram_inversion_gpu import estimate_timeseries_batch
print(f'estimating time-series via {solver} solver (batched, GPU)')
ts_sub, q_sub, n_sub = estimate_timeseries_batch(
A=A, B=B,
y=stack_obs[:, idx_pixel2inv],
weight_sqrt=(weight_sqrt[:, idx_pixel2inv]
if weight_sqrt is not None else None),
tbase_diff=tbase_diff,
min_norm_velocity=min_norm_velocity,
rcond=1e-5,
min_redundancy=min_redundancy,
inv_quality_name=inv_quality_name,
chunk_size=gpu_chunk_size,
solver=solver,
)
ts[:, idx_pixel2inv] = ts_sub
inv_quality[idx_pixel2inv] = q_sub
num_inv_obs[idx_pixel2inv] = n_sub
del mask

# 2.2 un-weighted inversion (classic SBAS)
if weight_sqrt is None:
elif weight_sqrt is None:
msg = f'estimating time-series for pixels with valid {obs_ds_name} in'

# a. split mask into mask_all/part_net
Expand Down Expand Up @@ -1089,6 +1125,8 @@ def run_ifgram_inversion(inps):
"mask_threshold" : inps.maskThreshold,
"min_redundancy" : inps.minRedundancy,
"calc_cov" : inps.calcCov,
"solver" : getattr(inps, 'solver', 'cpu'),
"gpu_chunk_size" : int(getattr(inps, 'gpuChunkSize', 0)),
}

# 3.3 invert / write block-by-block
Expand Down
Loading