Apply server timeout globally, remove hardcoded datastore timeout#1397
Apply server timeout globally, remove hardcoded datastore timeout#1397the-glu wants to merge 1 commit intointeruss:masterfrom
Conversation
a9b96cc to
c005582
Compare
mickmis
left a comment
There was a problem hiding this comment.
Agreed that a unified timeout on HTTP endpoints do make sense. Now I'm not sure it implies that timeouts on DB calls become irrelevant (however the middle ones on handlers are).
I'm notably thinking here about two things:
- operations initiated not through an HTTP call (e.g. cleanup job in routine or CLI invokation)
- that we might want to return a '409' if we have a timeout on DB calls (and also when the max no. of retries have been reached):
'409':
content:
application/json:
schema:
$ref: '#/components/schemas/ErrorResponse'
description: >-
...
* Despite repeated attempts, the DSS was unable to complete the
update because of other simultaneous changes.So I'd propose here to keep and uniformize timeouts on DB (i.e. ones with DefaultTimeout) in addition to HTTP handlers timeouts.
(and we definitely need to factorize store logic)
And for 2., I don't believe the return of a 409 for a retryable error is implemented at all? If you agree, could open an issue to track this? This is actually quite important as clients would rely on that to implement a retry mechanism.
Yes correct, however I would just add theses to the CLI/cleanup job no? I don't think that should be the responsibility of the datastore layer enforce a timeout, especially since they are doing shorts call only in those cases.
I'm not sure the case for a generic timeout on the DB layer exist, as by default the HTTP one is enough, and there won't be any retries for normal cases. Also for retries (in the case of a timeout, not a deadlock), does a case exists where just canceling a query & retrying make sense? Why not just wait longer?
Should we do that, I suggest we move it to another PR no? It will need "major" changes (like a new setting, setting validation (HTTP timeout will need to be greater than DBs timeout). We can keep the cleanup here or not, but right now with default values it's never raised (10s for both, but the DB one will start later anyways), so I would go for cleanup and start with a clean state in the next PR.
It seems to be used, but not in case of database timeout, but thinks like version mismatch, etc. However, does 409 really mean retryable? It mean conflict, meaning it may not be possible anymore to do a specific query. Edit: Looking at the code, I think that the transaction timeout is never used to perform retries, and I even more don't see the need for a different timeout on that side. Especially that a timeout on the DB doesn't always mean that it's dues to a conflict / other changes, network may be slow/broken. The "global" timeout (http + cli|job) is including what is wanted with the DB timeout (DB stop responding) no? |
c005582 to
17f08ae
Compare
This PR enforce the value of the timeout flag for all HTTP calls, as it was applied inconsistently through the codebase, especially in SCD where the value was not used at all.
It use an HTTP handle, so it's applied on every call.
There was also another timeout in the code on transactions, hardcoded to 10s. I removed it as it's doesn't seems to make sense anymore with the global timeout (and was also not always applied).
This is potentially disruptive as some slow calls that passed before are now going to be canceled.
This remove a TODO in the code and potentially fix #1393 issue.