Add retry/resilience for libcurl HTTP backend#1987
Add retry/resilience for libcurl HTTP backend#1987gileshall wants to merge 8 commits intosamtools:developfrom
Conversation
hfile_libcurl previously gave up immediately on any transient network failure during HTTP transfers — socket close, throttle, 503, 429, timeout. The existing restart_from_position() could reconnect at a byte offset but was only used for seeking, never for error recovery. Add retry with exponential backoff for both connection-time and mid-transfer failures: - is_retryable() classifies transient CURLcodes (COULDNT_CONNECT, SEND_ERROR, RECV_ERROR, PARTIAL_FILE, OPERATION_TIMEDOUT, GOT_NOTHING, SSL_CONNECT_ERROR) and HTTP statuses (429, 500, 502, 503, 504). Permanent errors (401, 403, 404, etc.) fail immediately. - Retry configuration via environment variables with sensible defaults: HTS_RETRY_MAX (3), HTS_RETRY_DELAY (500ms), HTS_RETRY_MAX_DELAY (60000ms), HTS_LOW_SPEED_LIMIT (1 byte/s), HTS_LOW_SPEED_TIME (60s). - libcurl_open() retries the initial connection on transient failures. - libcurl_read() returns partial data on mid-transfer errors and defers reconnection to the next call via a needs_reconnect flag. Zero-data errors retry inline with a goto. - Stall detection via CURLOPT_LOW_SPEED_LIMIT/TIME causes libcurl to abort transfers below the throughput threshold, producing a retryable CURLE_OPERATION_TIMEDOUT. - stream_pos tracks the current remote file position for reconnection. - Set HTS_RETRY_MAX=0 to disable retries and preserve old behavior. Also add a test suite (test/test_hfile_libcurl.c) that exercises retry behavior against a Python mock HTTP server (test/mock_http_server.py) with fault injection modes: normal transfer, 503/429 transient errors, mid-transfer connection drops, permanent 404, retry exhaustion, and retry-disabled.
Replace usleep() with hts_usleep() from hts_internal.h, which wraps nanosleep() and is portable across the project. Fixes build failure with -Werror on glibc where usleep() requires _DEFAULT_SOURCE.
Use hclose_abruptly() for error-path cleanups where the return value is irrelevant, and check hclose() return on the normal path. Matches the pattern used in test/hfile.c.
hfile.c sets errno to EPROTONOSUPPORT (not ENOTSUP) when a URL scheme has no registered handler. The probe was checking the wrong errno, causing all tests to fail on builds without libcurl instead of gracefully skipping.
|
Thank you for this. I will take a proper look at it later. I did notice that the copyright for the new test file is for Genome Research Ltd, it should probably be for yourself or your organisation. |
The test uses fork/pipe/kill/waitpid to manage a Python mock HTTP server, which are POSIX-only APIs. On Windows, main() prints a skip message and returns 0.
Thanks @whitwham. I have added a copyright notice for the Broad Institute. About the windows build. The retry test ( The test now compiles cleanly on Windows via an #ifdef _WIN32 guard that prints a skip message and returns 0 — same pattern as the runtime skips for missing python3 or missing libcurl support on other platforms. If python3 is added to the Windows CI in the future, this could be revisited with _popen/TerminateProcess equivalents. |
bcf_sr_destroy1() silently ignored the return value of hts_close(). If a final read or flush error occurs during close, it was swallowed. Add hts_log_error() so the failure is at least visible in logs.
Summary
CURLOPT_LOW_SPEED_LIMIT/CURLOPT_LOW_SPEED_TIMEto abort and retry transfers that stop making progressHTS_RETRY_MAX,HTS_RETRY_DELAY,HTS_RETRY_MAX_DELAY,HTS_LOW_SPEED_LIMIT,HTS_LOW_SPEED_TIME) with sensible defaultsMotivation
When streaming files over HTTP/HTTPS (including GCS
gs://URLs), htslib currently gives up immediately on any transient network failure. This is problematic for long-running genomics pipelines where occasional 503s, throttling (429), or brief network interruptions are expected. The existingrestart_from_position()function already supports reconnecting at a byte offset but was only used for seeking, never for error recovery.Implementation
Error classification (
is_retryable()): Classifies CURLcodes and HTTP status codes as retryable or permanent.Open retries:
libcurl_open()retries the initial connection with exponential backoff when the first attempt fails with a retryable error.Read retries:
libcurl_read()handles mid-transfer failures by returning partial data with aneeds_reconnectflag, then reconnecting on the next read call. Zero-data failures retry inline.Stream position tracking:
stream_posfield tracks the current byte offset for accurate reconnection viarestart_from_position().Configuration refresh:
refresh_retry_config()re-reads environment variables at eachlibcurl_open()call, allowing per-request tuning.Test plan
test/test_hfile_libcurlpasses all 7 test cases:HTS_RETRY_MAX=0)make checktests still pass