Skip to content

fix: delete TopologyServiceApplication links before TopologyService rows (fixes #5439)#6171

Open
chengyixu wants to merge 2 commits into
keephq:mainfrom
chengyixu:fix/topology-service-application-fk-deletion
Open

fix: delete TopologyServiceApplication links before TopologyService rows (fixes #5439)#6171
chengyixu wants to merge 2 commits into
keephq:mainfrom
chengyixu:fix/topology-service-application-fk-deletion

Conversation

@chengyixu
Copy link
Copy Markdown

@chengyixu chengyixu commented Mar 29, 2026

Summary

Fixes a ForeignKeyViolation that crashes process_topology whenever a provider's topology services belong to a TopologyApplication.

Root cause

process_topology refreshes topology data by deleting then re-inserting all TopologyService rows for a given provider. The deletion order was:

  1. Delete TopologyServiceDependency (correct — has CASCADE)
  2. Delete TopologyService (fails when a row is still referenced by TopologyServiceApplication.service_id)

The TopologyServiceApplication join table does not have ON DELETE CASCADE, so step 2 raised:

sqlalchemy.exc.IntegrityError: ForeignKeyViolation:
update or delete on table "topologyservice" violates
foreign key constraint "topologyserviceapplication_service_id_fkey"

This is the exact traceback reported in #5439.

Fix

Delete the TopologyServiceApplication rows for the affected services before deleting the TopologyService rows, using the same filter pattern already used for TopologyServiceDependency.

Changes

  • keep/api/tasks/process_topology_task.py — add TopologyServiceApplication deletion step
  • tests/test_topology.py — regression test that would have caught this (and confirms the fix)

Closes #5439

/claim #5439

When process_topology refreshes provider topology data it deletes all
existing TopologyService rows for that provider. If any of those services
were associated with a TopologyApplication (via the TopologyServiceApplication
join table), the delete violated the foreign-key constraint on
topologyserviceapplication.service_id and raised:

  sqlalchemy.exc.IntegrityError: ForeignKeyViolation on
  topologyserviceapplication — topologyservice.id still referenced

The fix deletes the TopologyServiceApplication join-table rows first,
in the same transaction as the dependency and service deletes, using the
same filter pattern already used for TopologyServiceDependency.

Closes keephq#5439

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 29, 2026

@chengyixu is attempting to deploy a commit to the KeepHQ Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 29, 2026
@dosubot dosubot Bot added the Bug Something isn't working label Mar 29, 2026
@chengyixu
Copy link
Copy Markdown
Author

Hi team! This fixes the ForeignKeyViolation crash in process_topology (#5439). All CI checks passing. Happy to address any feedback!

@chengyixu
Copy link
Copy Markdown
Author

Friendly ping — this fix addresses the ForeignKeyViolation crash in process_topology when deleting topology services (issue #5439). All CI checks pass. Happy to adjust anything if needed!

@chengyixu
Copy link
Copy Markdown
Author

Hi team — gentle ping. This PR fixes the ForeignKeyViolation crash in process_topology when deleting topology services (issue #5439). All CI checks passing. Happy to adjust anything if needed!

@chengyixu
Copy link
Copy Markdown
Author

Hi team — friendly ping on this fix for the ForeignKeyViolation crash in process_topology (issue #5439). Deletes TopologyServiceApplication links before TopologyService rows. All CI checks passing. Happy to address any feedback!

@chengyixu
Copy link
Copy Markdown
Author

Friendly ping @keephq/maintainers -- this PR has been open for 37 days. It fixes a foreign key constraint issue by deleting TopologyServiceApplication links before TopologyService rows (fixes #5439). Is there anything else needed before review? Thanks!

@chengyixu
Copy link
Copy Markdown
Author

👋 Friendly ping — this PR has been open for 37 days with all checks passing. Is there anything else needed for review? Thanks!

@chengyixu
Copy link
Copy Markdown
Author

Hi maintainers, just a friendly ping on this PR. It's been open for a while — would love a review when you have a moment. Happy to rebase or make any changes needed. Thanks!

@chengyixu
Copy link
Copy Markdown
Author

Hi! This PR is CI green and ready for merge when you have a moment. Thanks!

@ayoubil
Copy link
Copy Markdown

ayoubil commented May 5, 2026

Strong fix for the immediate symptom. One observation worth flagging for defense-in-depth, since this issue keeps coming up:

The root cause is that TopologyServiceApplication was defined in keep/api/models/db/topology.py without ondelete="CASCADE" on either FK:

class TopologyServiceApplication(SQLModel, table=True):
    service_id:     int  = Field(foreign_key="topologyservice.id",     primary_key=True)
    application_id: UUID = Field(foreign_key="topologyapplication.id", primary_key=True)

Compare with how alerttoincident was set up in migration 0832e0d9889a, where both columns get ondelete="CASCADE". That table never has this class of bug.

The app-layer ordering fix in this PR is correct and unblocks #5439, but any other code path that ever does session.delete(topology_service) or executes a bulk TopologyService delete query will reintroduce the same ForeignKeyViolation. A follow-up that adds ondelete="CASCADE" to both FKs (model + alembic migration to alter the existing constraint) would close the door on the entire bug class.

Happy to submit that schema-level follow-up if you'd like, once this lands — wanted to call it out here so it doesn't get lost.

@chengyixu
Copy link
Copy Markdown
Author

Thanks for the thorough review, @ayoubil! Great catch on the missing ondelete="CASCADE" -- that is the real root cause.

I agree the app-layer ordering fix in this PR is correct and unblocks #5439, but the schema-level fix would close the door permanently. I am happy to submit the CASCADE follow-up PR right after this lands:

  1. Add ondelete="CASCADE" to both FKs in TopologyServiceApplication
  2. Add a matching alembic migration to alter the existing constraints

Would you prefer I add the CASCADE changes to this PR now, or submit them as a separate follow-up once this one merges? Either way works for me.

Thanks again for the detailed review!

Adds CASCADE to both FK columns so that deleting a TopologyService or
TopologyApplication automatically cleans up the join-table rows.
Includes an alembic migration to alter the existing constraints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@chengyixu
Copy link
Copy Markdown
Author

Pushed the CASCADE follow-up as discussed @ayoubil:

  1. Added ondelete="CASCADE" to both FKs in TopologyServiceApplication (model)
  2. Added alembic migration to alter the existing FK constraints with ON DELETE CASCADE

This closes the door on the entire FK violation bug class. The app-layer ordering fix is still in place as added safety.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🙋 Bounty claim Bug Something isn't working size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[🐛 Bug]: argocd topology pull error

2 participants