Skip to content

fix(build-index): return 404 on backend unavailability to unblock docker pushes#631

Open
thijmv wants to merge 1 commit into
thijmv/oci-manifest-supportfrom
thijmv/oci-manifest-support_2
Open

fix(build-index): return 404 on backend unavailability to unblock docker pushes#631
thijmv wants to merge 1 commit into
thijmv/oci-manifest-supportfrom
thijmv/oci-manifest-support_2

Conversation

@thijmv
Copy link
Copy Markdown
Collaborator

@thijmv thijmv commented Jun 1, 2026

This PR (2/3) makes the following changes:

  • Tag store: Return 404 instead of 500 when the backend returns an error during tag resolution. This prevents a Docker push flow that issues a HEAD before PUT, e.g. the containerd snapshotter, from being aborted by a server error. On the subsequent PUT, the tag is written to disk and flushed to the backend once it becomes available.
  • Build-index nginx: Drop the 1s cache on tag lookups, which was causing poisoned tag resolutions for subsequent pulls after the previous HEAD returned 404.

proxy_cache tags;
proxy_cache_methods GET;
proxy_cache_valid 200 5m;
proxy_cache_valid any 1s;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we understand why this cache was here in the first place? Do we envision any negative impact by removing it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been there since the beginning of Kraken, so I assume it was a base configuration, or used to limit thundering herd. Since the 200 requests are still cached, I do not envisage noticeable negative impact under normal load. However, it is something we should monitor. If we experience a lot of server errors, those requests will not be cached anymore, and would all go to the build-index.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this cache seems like a risk. Can we avoid removing it? I'd rather not fiddle with configs that we don't fully understand.

Copy link
Copy Markdown
Collaborator Author

@thijmv thijmv Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 404 is cached because of the additional HEAD. When a GET for the tag is then made, it incorrectly returns the 404, even though the tag now exists. This means the subsequent pull following the push fails during the integration tests due to cache poisoning.

While we could add a small delay between the two in the integration tests, any real production scenario where the pull follows the push within a second would still fail.

We still have the proxy_cache_lock in place to avoid having to process identical requests arriving at the same time.

log.With("tag", tag).Debug("Tag not found in backend")
return core.Digest{}, ErrTagNotFound
} else {
// If we experience a backend error, propagating a 5xx would abort
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the client responsible for retrying upon a server error? What is the issue if the push is aborted by the containerd snapshotter? Isn't it hacky to return 404 when in actuality the blob may be available, but we saw a backend error?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The containerd snapshotter makes a HEAD request before trying to push an image. If that request fails with a server error, the push is aborted, and the user is unable to push said image.

We cannot return 500 to the snapshotter, as pushes will then fail when the backend is temporarily unavailable. I agree that this approach seems hacky, but I do not see a better alternative, given the snapshotter's behaviour.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it a problem if the snapshotter abandons the push upon a server error?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because when the backend is unavailable, we still expect the tag to be pushed to the build-index's disk, and flushed to the backend once it becomes available again. Without the snapshotter, this behaviour happened correctly. However, since the snapshotter precedes the tag push with an additional HEAD, continuing to return a server error when the backend is unavailable would mean no images or tags can be pushed if the backend is unavailable, which is the scenario we are actively trying to avoid.

For more context, see this integration test.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, makes sense. I still am not sure about the hacky approach. Is there a way to configure the containerd snapshotter to continue after HEAD failure or tweak it in a different way?

Alternatively, can we just avoid using the containerd snapshotter altogether and test OCI manifest parsing in a different way?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The containerd snapshotter cannot be configured to ignore HEAD server errors. Even if it could, environments with the default containerd snapshotter enabled would still encounter said errors.

Regarding your second point, the snapshotter is the only way to exercise the full OCI push and pull flow within the integration tests.

Additionally, given that the containerd snapshotter and OCI manifest themselves are becoming the default, I think it is worth testing against it directly rather than working around it.

@thijmv thijmv force-pushed the thijmv/oci-manifest-support branch from ac8edc6 to a6edeae Compare June 2, 2026 08:39
@thijmv thijmv force-pushed the thijmv/oci-manifest-support_2 branch from 76eeab8 to 343ca94 Compare June 2, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants