Skip to content

[BUG] Progressive slowdown of SOMD2 calculations #124

@akalpokas

Description

@akalpokas

I've been running some long repex (25-50ns) FEP calculations and noticed that while they start off fast, they tend slowdown as the calculation progresses. For example, looking at a log file from one of the simulations involving running a molecule in vacuum leg:

2026-02-24 16:15:47.882 | INFO     | somd2.runner._repex:run:959 - Running dynamics for cycle 1 of 12500
2026-02-24 16:15:49.811 | INFO     | somd2.runner._repex:run:959 - Running dynamics for cycle 2 of 12500

We can see that it took approx 1.93s to complete this repex cycle. When the full calculation (25ns) is finishing up, we can see:

2026-02-25 03:22:08.605 | INFO     | somd2.runner._repex:run:959 - Running dynamics for cycle 12499 of 12500
2026-02-25 03:22:14.038 | INFO     | somd2.runner._repex:run:959 - Running dynamics for cycle 12500 of 12500

Here it took 5.43s to complete this cycle, almost 3x longer than the initial cycle. To investigate this further, I generated some code (provided below) to plot the the timings between each consecutive cycles, yielding the following plot:

Image

This essentially illustrates the observation above; as the SOMD2 calculation progresses, the timings between each cycle are increasing in a linear/O(n) fashion. We can additionally see the following:

  1. The orange line representing the mixing times (obtained from the log file too) stays consistent throughout the simulation - mixing times are constant and independent of the number of cycles carried out.
  2. There are blue dot outliers that are significantly higher than the average blue line - I believe these most likely represent checkpoint blocks that occur at the end of N number of cycles (in this simulation there are 250 repex cycles to each checkpoint cycle) - This tells us that the progressive slow down is not caused by I/O operations that happen much less frequently during the dynamics.

This progressive slowdown is also a behaviour that's not always been part of SOMD2, but was introduced as part of some kind of change in the last year or so. For example taking a log file from a simulation I ran in March 2025:

Image

We can see there isn't any increase in consecutive cycle timings here.

In other calculations, the increase isn't as dramatic, but is still very much there:

Image

These calculations follow a fairly standard production FEP setup with approx 1-2fs exchange frequency, frame frequency of around 100ps and checkpoint frequency of 500ps.

Reproducibility

Unfortunately, I can't easily provide the affected SOMD2 versions as these calculations were run from my own fork with additional commits by me, meaning that logged git hashes are not comparable to the OBS git repo. I can see that recent SOMD2 versions are affected by this, but the ones from last year are not. I thought about using git bisect to identify the commits that might have introduced this, but the standalone SOMD2 pixi build doesn't work for me, because of some kind of OpenCL driver package issue. I'm happy to provide further information if needed.

Plotting code:

import re
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

def plot_cumulative_split_timings(log_file_path):
    cycle_pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*\|.*?Running dynamics for cycle (\d+)")
    matrix_pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*\|.*?Assembling energy matrix")

    data = []
    
    # Trackers for cumulative cycle counting
    cycle_offset = 0
    last_logged_cycle = 0
    
    current_actual_cycle = None
    cycle_start_time = None
    matrix_time = None

    with open(log_file_path, "r") as f:
        for line in f:
            # 1. Look for the start of a cycle
            cycle_match = cycle_pattern.search(line)
            if cycle_match:
                dt_str, cycle_str = cycle_match.groups()
                dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M:%S.%f")
                logged_cycle = int(cycle_str)

                # Detect if a simulation extension restarted the counter
                if logged_cycle < last_logged_cycle:
                    cycle_offset += last_logged_cycle
                
                actual_cycle = logged_cycle + cycle_offset
                last_logged_cycle = logged_cycle

                # Close out the previous cycle if data exists
                if current_actual_cycle is not None and cycle_start_time and matrix_time:
                    mixing_duration = (dt - matrix_time).total_seconds()
                    if mixing_duration > 0:
                        data[-1]["mixing_duration"] = mixing_duration

                current_actual_cycle = actual_cycle
                cycle_start_time = dt
                matrix_time = None

                data.append({
                    "actual_cycle": current_actual_cycle,
                    "dynamics_duration": None,
                    "mixing_duration": None
                })

            # 2. Look for the start of the mixing phase
            matrix_match = matrix_pattern.search(line)
            if matrix_match and current_actual_cycle is not None:
                dt_str = matrix_match.group(1)
                dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M:%S.%f")
                matrix_time = dt
                
                dynamics_duration = (dt - cycle_start_time).total_seconds()
                data[-1]["dynamics_duration"] = dynamics_duration

    # Convert to DataFrame
    df = pd.DataFrame(data).dropna()

    # Filter out initial startup spikes (adjust threshold if necessary)
    df = df[(df["dynamics_duration"] < 20.0) & (df["mixing_duration"] < 20.0)].copy()

    # Calculate rolling averages
    window_size = 50
    df["dyn_rolling"] = df["dynamics_duration"].rolling(window=window_size).mean()
    df["mix_rolling"] = df["mixing_duration"].rolling(window=window_size).mean()

    # Plotting
    plt.figure(figsize=(12, 6))
    
    plt.scatter(df["actual_cycle"], df["dynamics_duration"], alpha=0.3, s=10, color='blue', label="Dynamics (Raw)")
    plt.plot(df["actual_cycle"], df["dyn_rolling"], color="darkblue", linewidth=2, label="Dynamics (50-avg)")

    plt.scatter(df["actual_cycle"], df["mixing_duration"], alpha=0.3, s=10, color='orange', label="Mixing (Raw)")
    plt.plot(df["actual_cycle"], df["mix_rolling"], color="darkorange", linewidth=2, label="Mixing (50-avg)")

    plt.xlabel("Cumulative Cycle Number")
    plt.ylabel("Duration (seconds)")
    plt.legend()
    plt.grid(True, linestyle="--", alpha=0.6)
    plt.tight_layout()
    plt.show()

# Run it on your log file
plot_cumulative_split_timings("log.txt")

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions