fix(build-index): return 404 on backend unavailability to unblock docker pushes#631
fix(build-index): return 404 on backend unavailability to unblock docker pushes#631thijmv wants to merge 1 commit into
Conversation
| proxy_cache tags; | ||
| proxy_cache_methods GET; | ||
| proxy_cache_valid 200 5m; | ||
| proxy_cache_valid any 1s; |
There was a problem hiding this comment.
Do we understand why this cache was here in the first place? Do we envision any negative impact by removing it?
There was a problem hiding this comment.
It has been there since the beginning of Kraken, so I assume it was a base configuration, or used to limit thundering herd. Since the 200 requests are still cached, I do not envisage noticeable negative impact under normal load. However, it is something we should monitor. If we experience a lot of server errors, those requests will not be cached anymore, and would all go to the build-index.
There was a problem hiding this comment.
Removing this cache seems like a risk. Can we avoid removing it? I'd rather not fiddle with configs that we don't fully understand.
There was a problem hiding this comment.
The 404 is cached because of the additional HEAD. When a GET for the tag is then made, it incorrectly returns the 404, even though the tag now exists. This means the subsequent pull following the push fails during the integration tests due to cache poisoning.
While we could add a small delay between the two in the integration tests, any real production scenario where the pull follows the push within a second would still fail.
We still have the proxy_cache_lock in place to avoid having to process identical requests arriving at the same time.
| log.With("tag", tag).Debug("Tag not found in backend") | ||
| return core.Digest{}, ErrTagNotFound | ||
| } else { | ||
| // If we experience a backend error, propagating a 5xx would abort |
There was a problem hiding this comment.
Isn't the client responsible for retrying upon a server error? What is the issue if the push is aborted by the containerd snapshotter? Isn't it hacky to return 404 when in actuality the blob may be available, but we saw a backend error?
There was a problem hiding this comment.
The containerd snapshotter makes a HEAD request before trying to push an image. If that request fails with a server error, the push is aborted, and the user is unable to push said image.
We cannot return 500 to the snapshotter, as pushes will then fail when the backend is temporarily unavailable. I agree that this approach seems hacky, but I do not see a better alternative, given the snapshotter's behaviour.
There was a problem hiding this comment.
Why is it a problem if the snapshotter abandons the push upon a server error?
There was a problem hiding this comment.
Because when the backend is unavailable, we still expect the tag to be pushed to the build-index's disk, and flushed to the backend once it becomes available again. Without the snapshotter, this behaviour happened correctly. However, since the snapshotter precedes the tag push with an additional HEAD, continuing to return a server error when the backend is unavailable would mean no images or tags can be pushed if the backend is unavailable, which is the scenario we are actively trying to avoid.
For more context, see this integration test.
There was a problem hiding this comment.
hmm, makes sense. I still am not sure about the hacky approach. Is there a way to configure the containerd snapshotter to continue after HEAD failure or tweak it in a different way?
Alternatively, can we just avoid using the containerd snapshotter altogether and test OCI manifest parsing in a different way?
There was a problem hiding this comment.
The containerd snapshotter cannot be configured to ignore HEAD server errors. Even if it could, environments with the default containerd snapshotter enabled would still encounter said errors.
Regarding your second point, the snapshotter is the only way to exercise the full OCI push and pull flow within the integration tests.
Additionally, given that the containerd snapshotter and OCI manifest themselves are becoming the default, I think it is worth testing against it directly rather than working around it.
ac8edc6 to
a6edeae
Compare
76eeab8 to
343ca94
Compare
This PR (2/3) makes the following changes: