Non-reproducibility when using icon + oasis and compiled with intel.psmpi #20

@AGonzalezNicolas

@DCaviedesV, @s-poll, @kvrigor, @jjokella

Testcase is EXP_ID="eur-12-iic" #"eur-11u"
working_dir:

/p/scratch/cslts/gonzalez5/TSMP2/JUBE_repo/benchmark/jube

While running this test case, I observed a lack of reproducibility between identical runs when using a binary built with intel.psmpi. Two runs performed using JUBE with the same configuration and model_id = icon-eclm produce different results. The same behaviour is observed for model_id = icon-eclm-parflow, where repeated runs with identical settings (using JUBE) also produce different outputs.

Below, I present the maximum absolute error for each variable, computed by comparing the outputs of two identical runs, for model_id=icon-eclm-parflow and model_id=icon-eclm:

| Component | Variable | max. abs. error (model_id=icon-eclm-parflow) | max. abs. error (model_id=icon-eclm) |
|---|---|---|---|
| ICON | w | 1.0482156277 | 1.1639280319 |
| ICON | theta_v | 5.6016540527 | 5.4000854492 |
| ICON | qv | 0.0038429732 | 0.0035402337 |
| ICON | shfl_s | 47.2150611877 | 131.6409301758 |
| ICON | lhfl_s | 60.2069587708 | 414.4877624512 |
| eCLM | TWS | 9.8613281250 | 10.4727783203 |
| eCLM | H2OSOI | 0.0715191215 | 0.1653197557 |
| eCLM | TSOI | 3.6534118652 | 5.7122192383 |
| eCLM | TG | 3.6873474121 | 6.1456604004 |
| eCLM | EFLX_LH_TOT | 192.7704467773 | 197.8242187500 |
| eCLM | FSH | 148.3424072266 | 218.8023071289 |
| eCLM | FSA | 271.3027954102 | 328.3160400391 |
| eCLM | FSR | 131.1349182129 | 141.3157348633 |
| eCLM | FIRA | 51.1596603394 | 75.3006668091 |
| eCLM | Rnet | 220.1431274414 | 254.9673156738 |
| eCLM | EFLX_SOIL_GRND | 108.2931365967 | 113.7866668701 |
| ParFlow | pressure | 3460.3571839130 | - |
| ParFlow | saturation | 0.2303589024 | - |
| ParFlow | evaptrans | 0.5114949301 | - |
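For reference, the metric reported above can be sketched as follows. This is a minimal illustration, not the script actually used for this issue: the function name `max_abs_error` is hypothetical, and it assumes both runs' fields have already been loaded as arrays of the same shape, with NaN marking missing cells.

```python
import numpy as np

def max_abs_error(run_a, run_b):
    """Largest absolute difference between two same-shaped fields."""
    a = np.asarray(run_a, dtype=np.float64)
    b = np.asarray(run_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    diff = np.abs(a - b)
    # Cells that are NaN in either run yield a NaN difference and are skipped.
    if np.all(np.isnan(diff)):
        return float("nan")
    return float(np.nanmax(diff))
```

A value of exactly 0.0 for every variable would indicate that the two runs agree at all compared cells; any non-zero value means the outputs diverged somewhere.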

We need some insight into whether the magnitude of these errors is acceptable and expected for this type of scenario.
ICON units: w = [m/s], theta_v = [K], qv = [kg/kg], shfl_s/lhfl_s = [W/m2]
eCLM variables units here

NOTE: Tests were carried out on JURECA-DC with Stages/2025, and binaries were built from the TSMP2 repo (master branch), e.g.:

build_tsmp2.sh --icon --eclm --env ./env/jsc.2025.gnu.openmpi
Setting model-id and component string...
Setting component source dir...
Submodule path 'models/icon': checked out 'ce5c8f8ba75d2e7db73e41cbb186d98ec34171c8'
Submodule path 'models/eCLM': checked out '4d567d2d68cac0fba977914b4a9c3ba199afd0ff'
Submodule path 'models/oasis3-mct': checked out '5253349d4ce15259fcd76e0443495c1ddb7788bb'
  • Binaries_directory:
/p/project1/cslts/shared_data/CI_TSMP2/bin/master
  • Working_dir:
/p/scratch/cslts/gonzalez5/TSMP2/JUBE_repo/benchmark/jube

NOTE1: This does not occur for model_id = eclm-parflow, or for runs with only one component (eclm, icon, or parflow).

NOTE2: In contrast, this non-reproducibility is not observed when using executables built with gnu.openmpi or gnu.psmpi. For these builds, repeated runs with identical configurations are reproducible for both model_ids.

NOTE3: Try another test case.
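When repeating the comparison on another test case or build, a strict pass/fail check may be easier to read than error magnitudes. The sketch below assumes that "reproducible" means bit-for-bit identical output, which the issue does not state explicitly; the function name `bitwise_identical` is hypothetical, and the fields are assumed to be already loaded as arrays.

```python
import numpy as np

def bitwise_identical(run_a, run_b):
    """True only if both fields have the same shape, dtype and raw bytes."""
    a = np.ascontiguousarray(run_a)
    b = np.ascontiguousarray(run_b)
    if a.shape != b.shape or a.dtype != b.dtype:
        return False
    # Comparing raw bytes also treats NaN == NaN and distinguishes 0.0 from -0.0,
    # which a plain element-wise == comparison would not.
    return a.tobytes() == b.tobytes()
```

Under this definition, a difference in the last bit of a single value already counts as non-reproducible, so it is stricter than a tolerance-based comparison.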
