SCD_TYPE_2: option to preserve existing physical table rows on new snapshot identifier (is_first_insert truncation) #5744

@rm-minus-r-star

Description

Summary

SCD_TYPE_2_BY_COLUMN models that accumulate revision history over time lose all historical data when a new snapshot identifier is created (due to a code change or an indirect breaking change in a parent model), because the first interval run for the new identifier sets truncate=True via the is_first_insert path.

Current behaviour

In evaluator.py:

  is_first_insert = (
      not _intervals(snapshot, deployability_index) or not target_table_exists
  ) and batch_index == 0
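Extracted into a standalone function (names mirror the quoted snippet; the interval lookup and table check are simplified to booleans for illustration), the predicate behaves like this:

```python
def is_first_insert(has_intervals: bool, target_table_exists: bool, batch_index: int) -> bool:
    """Mirrors the evaluator.py condition: a new snapshot identifier has no
    recorded intervals, so the first batch is treated as a first insert even
    when the physical table already exists and holds rows."""
    return (not has_intervals or not target_table_exists) and batch_index == 0

# New identifier: no intervals recorded yet, but the physical table exists
# and is populated -- the run is still treated as a first insert.
print(is_first_insert(has_intervals=False, target_table_exists=True, batch_index=0))  # True
```

This is the crux of the issue: the existence (and contents) of the physical table cannot override the missing-intervals condition.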

When is_first_insert=True, truncate=True is passed to _scd_type_2(), which limits the existing-rows CTE to zero rows:

  if truncate:
      existing_rows_query = existing_rows_query.limit(0)

This ignores all rows already in the physical table and rebuilds entirely from the source query.
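A minimal stdlib-only sketch of what the limit(0) branch means for the merge (the row shapes and function name here are illustrative, not SQLMesh internals):

```python
def existing_rows(physical_table_rows: list[dict], truncate: bool) -> list[dict]:
    """Sketch of the truncate branch: with truncate=True the SCD merge sees an
    empty set of existing rows, so closed history (valid_to IS NOT NULL) that
    no longer appears in the source is simply gone after the rebuild."""
    return [] if truncate else physical_table_rows

history = [
    {"id": 1, "value": "old", "valid_to": "2021-06-01"},  # closed historical row
    {"id": 1, "value": "new", "valid_to": None},          # current row
]
print(existing_rows(history, truncate=True))  # []
```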

The problem

For SCD_TYPE_2 models used as accumulating history tables — where the source is a rolling window and the SCD table is the only long-term record of how data changed over time — this truncation destroys irreplaceable historical rows (valid_to IS NOT NULL) that cannot be reconstructed from the current source.

The physical table may contain years of accumulated revision history. The source may intentionally only contain the last N months of data (e.g. a vendor API with a rolling window). Truncating and rebuilding from source produces a correct current state but permanently destroys the revision record.

This is not a bug per se — truncation on first insert is correct behaviour for models where the source is the full source of truth. But it is a gap for the common pattern where the SCD table is the source of truth and the input is a rolling window.

Notably, invalidate_hard_deletes=False (the default) already signals that the user does not want source-absence to affect existing rows. It is inconsistent that this option protects against row closure on normal runs but offers no protection against wholesale truncation on identifier changes.

Proposed solution

Add a model-level option — e.g. forward_only_on_new_identifier or preserve_rows_on_first_insert — that, when set, skips the limit(0) branch when an existing physical table is present and already contains rows, even if the snapshot identifier is new.

Alternatively: when invalidate_hard_deletes=False, do not truncate on is_first_insert if the physical table already exists and is non-empty. The intent of invalidate_hard_deletes=False is "trust existing rows"; applying that intent to the first-insert case would be consistent and solve this problem without a new config option.
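One possible shape for that guard, sketched with hypothetical names (neither the function nor the parameters exist in SQLMesh today; this only encodes the rule proposed above):

```python
def effective_truncate(
    is_first_insert: bool,
    invalidate_hard_deletes: bool,
    target_table_has_rows: bool,
) -> bool:
    """Hypothetical guard: skip truncation on first insert when the user has
    opted out of hard-delete invalidation and there is existing history to lose."""
    if is_first_insert and not invalidate_hard_deletes and target_table_has_rows:
        return False  # preserve accumulated revision history
    return is_first_insert

# New snapshot identifier over a populated table, default settings: no truncation.
print(effective_truncate(True, invalidate_hard_deletes=False, target_table_has_rows=True))  # False
```

An empty or missing physical table still truncates, so the behaviour for genuinely fresh models is unchanged.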

Workaround

Manually insert interval records into sqlmesh._intervals for the new snapshot identifier before the first run, setting start_ts/end_ts to cover the full historical range. This makes is_first_insert evaluate to False and bypasses the truncation. It works, but it requires awareness of SQLMesh internals and is not a supported workflow.

Version

SQLMesh 0.228.1
PostgreSQL 18
