Skip to content

Add retry/resilience for libcurl HTTP backend#1987

Open
gileshall wants to merge 8 commits intosamtools:developfrom
gileshall:retry-resilience-libcurl
Open

Add retry/resilience for libcurl HTTP backend#1987
gileshall wants to merge 8 commits intosamtools:developfrom
gileshall:retry-resilience-libcurl

Conversation

@gileshall
Copy link

Summary

  • Add automatic retry with exponential backoff for transient HTTP errors (429, 500, 502, 503, 504) and network failures (connection drops, timeouts, partial transfers)
  • Add stall detection via CURLOPT_LOW_SPEED_LIMIT / CURLOPT_LOW_SPEED_TIME to abort and retry transfers that stop making progress
  • All retry behavior is configurable via environment variables (HTS_RETRY_MAX, HTS_RETRY_DELAY, HTS_RETRY_MAX_DELAY, HTS_LOW_SPEED_LIMIT, HTS_LOW_SPEED_TIME) with sensible defaults
  • Permanent errors (401, 403, 404) are never retried
  • Add comprehensive test suite with a Python mock HTTP server exercising all retry scenarios

Motivation

When streaming files over HTTP/HTTPS (including GCS gs:// URLs), htslib currently gives up immediately on any transient network failure. This is problematic for long-running genomics pipelines where occasional 503s, throttling (429), or brief network interruptions are expected. The existing restart_from_position() function already supports reconnecting at a byte offset but was only used for seeking, never for error recovery.

Implementation

Error classification (is_retryable()): Classifies CURLcodes and HTTP status codes as retryable or permanent.

Open retries: libcurl_open() retries the initial connection with exponential backoff when the first attempt fails with a retryable error.

Read retries: libcurl_read() handles mid-transfer failures by returning partial data with a needs_reconnect flag, then reconnecting on the next read call. Zero-data failures retry inline.

Stream position tracking: stream_pos field tracks the current byte offset for accurate reconnection via restart_from_position().

Configuration refresh: refresh_retry_config() re-reads environment variables at each libcurl_open() call, allowing per-request tuning.

Test plan

  • New test/test_hfile_libcurl passes all 7 test cases:
    • Normal transfer (baseline)
    • 503 retry succeeds
    • 429 retry succeeds
    • Connection drop mid-transfer retry
    • 404 not retried (permanent error)
    • Retry exhaustion (fails after max attempts)
    • Retry disabled (HTS_RETRY_MAX=0)
  • Existing make check tests still pass
  • Test gracefully skips when python3 or HTTP support unavailable

hfile_libcurl previously gave up immediately on any transient network
failure during HTTP transfers — socket close, throttle, 503, 429,
timeout. The existing restart_from_position() could reconnect at a byte
offset but was only used for seeking, never for error recovery.

Add retry with exponential backoff for both connection-time and
mid-transfer failures:

- is_retryable() classifies transient CURLcodes (COULDNT_CONNECT,
  SEND_ERROR, RECV_ERROR, PARTIAL_FILE, OPERATION_TIMEDOUT,
  GOT_NOTHING, SSL_CONNECT_ERROR) and HTTP statuses (429, 500, 502,
  503, 504). Permanent errors (401, 403, 404, etc.) fail immediately.

- Retry configuration via environment variables with sensible defaults:
  HTS_RETRY_MAX (3), HTS_RETRY_DELAY (500ms), HTS_RETRY_MAX_DELAY
  (60000ms), HTS_LOW_SPEED_LIMIT (1 byte/s), HTS_LOW_SPEED_TIME (60s).

- libcurl_open() retries the initial connection on transient failures.

- libcurl_read() returns partial data on mid-transfer errors and defers
  reconnection to the next call via a needs_reconnect flag. Zero-data
  errors retry inline with a goto.

- Stall detection via CURLOPT_LOW_SPEED_LIMIT/TIME causes libcurl to
  abort transfers below the throughput threshold, producing a retryable
  CURLE_OPERATION_TIMEDOUT.

- stream_pos tracks the current remote file position for reconnection.

- Set HTS_RETRY_MAX=0 to disable retries and preserve old behavior.

Also add a test suite (test/test_hfile_libcurl.c) that exercises retry
behavior against a Python mock HTTP server (test/mock_http_server.py)
with fault injection modes: normal transfer, 503/429 transient errors,
mid-transfer connection drops, permanent 404, retry exhaustion, and
retry-disabled.
Replace usleep() with hts_usleep() from hts_internal.h, which wraps
nanosleep() and is portable across the project. Fixes build failure
with -Werror on glibc where usleep() requires _DEFAULT_SOURCE.
Use hclose_abruptly() for error-path cleanups where the return value
is irrelevant, and check hclose() return on the normal path. Matches
the pattern used in test/hfile.c.
hfile.c sets errno to EPROTONOSUPPORT (not ENOTSUP) when a URL scheme
has no registered handler. The probe was checking the wrong errno,
causing all tests to fail on builds without libcurl instead of
gracefully skipping.
@whitwham
Copy link
Member

Thank you for this. I will take a proper look at it later.

I did notice that the copyright for the new test file is for Genome Research Ltd, it should probably be for yourself or your organisation.

The test uses fork/pipe/kill/waitpid to manage a Python mock HTTP
server, which are POSIX-only APIs. On Windows, main() prints a skip
message and returns 0.
@gileshall
Copy link
Author

Thank you for this. I will take a proper look at it later.

I did notice that the copyright for the new test file is for Genome Research Ltd, it should probably be for yourself or your organisation.

Thanks @whitwham. I have added a copyright notice for the Broad Institute.

About the windows build. The retry test (test_hfile_libcurl) uses fork/pipe/kill/waitpid to manage a Python mock HTTP server as a child process. These are POSIX-only APIs with no direct Windows equivalents. The Windows CI also doesn't have python3 in its MSYS2 environment, so even with a CreateProcess/TerminateProcess port the mock server couldn't run.

The test now compiles cleanly on Windows via an #ifdef _WIN32 guard that prints a skip message and returns 0 — same pattern as the runtime skips for missing python3 or missing libcurl support on other platforms. If python3 is added to the Windows CI in the future, this could be revisited with _popen/TerminateProcess equivalents.

bcf_sr_destroy1() silently ignored the return value of hts_close().
If a final read or flush error occurs during close, it was swallowed.
Add hts_log_error() so the failure is at least visible in logs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants