SCD_TYPE_2: option to preserve existing physical table rows on new snapshot identifier (is_first_insert truncation) #5744

@rm-minus-r-star

Description

Summary

SCD_TYPE_2_BY_COLUMN models that accumulate revision history over time lose all historical data when a new snapshot identifier is created (due to a code change or an indirect breaking change in a parent model), because the first interval run for the new identifier sets truncate=True via the is_first_insert path.

Current behaviour

In evaluator.py:

  is_first_insert = (
      not _intervals(snapshot, deployability_index) or not target_table_exists
  ) and batch_index == 0
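Extracted into a standalone function (names mirror the quoted snippet; the interval lookup and table check are simplified to booleans for illustration), the predicate behaves like this:

```python
def is_first_insert(has_intervals: bool, target_table_exists: bool, batch_index: int) -> bool:
    """Mirrors the evaluator.py condition: a new snapshot identifier has no
    recorded intervals, so the first batch is treated as a first insert even
    when the physical table already exists and holds rows."""
    return (not has_intervals or not target_table_exists) and batch_index == 0

# New identifier: no intervals recorded yet, but the physical table exists
# and is populated -- the run is still treated as a first insert.
print(is_first_insert(has_intervals=False, target_table_exists=True, batch_index=0))  # True
```

This is the crux of the issue: the existence (and contents) of the physical table cannot override the missing-intervals condition.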

When is_first_insert=True, truncate=True is passed to _scd_type_2(), which limits the existing-rows CTE to zero rows:

  if truncate:
      existing_rows_query = existing_rows_query.limit(0)

This ignores all rows already in the physical table and rebuilds entirely from the source query.
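A minimal stdlib-only sketch of what the limit(0) branch means for the merge (the row shapes and function name here are illustrative, not SQLMesh internals):

```python
def existing_rows(physical_table_rows: list[dict], truncate: bool) -> list[dict]:
    """Sketch of the truncate branch: with truncate=True the SCD merge sees an
    empty set of existing rows, so closed history (valid_to IS NOT NULL) that
    no longer appears in the source is simply gone after the rebuild."""
    return [] if truncate else physical_table_rows

history = [
    {"id": 1, "value": "old", "valid_to": "2021-06-01"},  # closed historical row
    {"id": 1, "value": "new", "valid_to": None},          # current row
]
print(existing_rows(history, truncate=True))  # []
```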

The problem

For SCD_TYPE_2 models used as accumulating history tables — where the source is a rolling window and the SCD table is the only long-term record of how data changed over time — this truncation destroys irreplaceable historical rows (valid_to IS NOT NULL) that cannot be reconstructed from the current source.

The physical table may contain years of accumulated revision history. The source may intentionally only contain the last N months of data (e.g. a vendor API with a rolling window). Truncating and rebuilding from source produces a correct current state but permanently destroys the revision record.

This is not a bug per se — truncation on first insert is correct behaviour for models where the source is the full source of truth. But it is a gap for the common pattern where the SCD table is the source of truth and the input is a rolling window.

Notably, invalidate_hard_deletes=False (the default) already signals that the user does not want source-absence to affect existing rows. It is inconsistent that this option protects against row closure on normal runs but offers no protection against wholesale truncation on identifier changes.

Proposed solution

Add a model-level option — e.g. forward_only_on_new_identifier or preserve_rows_on_first_insert — that, when set, skips the limit(0) branch when an existing physical table is present and already contains rows, even if the snapshot identifier is new.

Alternatively: when invalidate_hard_deletes=False, do not truncate on is_first_insert if the physical table already exists and is non-empty. The intent of invalidate_hard_deletes=False is "trust existing rows"; applying that intent to the first-insert case would be consistent and solve this problem without a new config option.
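One possible shape for that guard, sketched with hypothetical names (neither the function nor the parameters exist in SQLMesh today; this only encodes the rule proposed above):

```python
def effective_truncate(
    is_first_insert: bool,
    invalidate_hard_deletes: bool,
    target_table_has_rows: bool,
) -> bool:
    """Hypothetical guard: skip truncation on first insert when the user has
    opted out of hard-delete invalidation and there is existing history to lose."""
    if is_first_insert and not invalidate_hard_deletes and target_table_has_rows:
        return False  # preserve accumulated revision history
    return is_first_insert

# New snapshot identifier over a populated table, default settings: no truncation.
print(effective_truncate(True, invalidate_hard_deletes=False, target_table_has_rows=True))  # False
```

An empty or missing physical table still truncates, so the behaviour for genuinely fresh models is unchanged.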

Workaround

Manually insert interval records into sqlmesh._intervals for the new snapshot identifier before the first run, setting start_ts/end_ts to cover the full historical range. This makes is_first_insert evaluate to False and bypasses the truncation. It works, but it requires awareness of SQLMesh internals and is not a supported workflow.

Version

SQLMesh 0.228.1
PostgreSQL 18
