Skip to content

fix(fallback): treat 404/405/409/410 as retryable so a model-unavailable error falls through#173

Open
agpituk wants to merge 1 commit into
mainfrom
fix/fallback-404-410-retryable
Open

fix(fallback): treat 404/405/409/410 as retryable so a model-unavailable error falls through#173
agpituk wants to merge 1 commit into
mainfrom
fix/fallback-404-410-retryable

Conversation

@agpituk

@agpituk agpituk commented Jun 16, 2026

Copy link
Copy Markdown
Member

Problem

On a multi-attempt routing policy, the gateway stopped at the first attempt and returned 502 when the upstream responded with 404 (and 405/409/410), never trying the remaining entries. These codes typically mean the requested model or endpoint is unavailable at that specific provider — deprecated, renamed, retired, or never offered. Recovering from a retired/renamed model is one of the main reasons users configure fallback, so this defeated a core use case.

Root cause

_classify_upstream_error (src/gateway/api/routes/_platform.py) classified only {400, 422} as terminal and {401, 403, 408, 429, 5xx} as retryable; every other status fell through to a default return False (non-retryable). So 404/405/409/410 short-circuited the attempts loop. Note the prior inconsistency: 403 (no access to a model) fell through, but 404 (model not found) did not.

Fix

Add 404, 405, 409, 410 to _FALLBACK_RETRYABLE_STATUS_CODES. These mean "model/endpoint unavailable at this provider" → fall through to the next attempt. 400/422 stay terminal: a malformed request is rejected by every provider, so falling through would only waste attempts.

Tests

  • tests/unit/test_classify_upstream_error.py (new) — pins the classifier contract: which status codes fall through vs stop, plus timeout / network-error / no-status cases.
  • tests/integration/test_platform_mode_messages.py::test_platform_mode_falls_through_on_404_model_unavailable — primary returns 404, runner falls through, healthy secondary serves the response.

Targeted suites green (tests/unit/test_classify_upstream_error.py, tests/integration/test_platform_mode_messages.py, tests/integration/test_platform_mode_chat.py, tests/unit/test_run_platform_attempts.py).

🤖 Generated with Claude Code

Summary

This PR fixes a critical issue where the gateway would prematurely fail requests with a 502 error instead of trying alternative providers when dealing with unavailable models or endpoints.

What Changed

Code Changes:

  • Updated the fallback retry logic in the platform router to treat HTTP 404 (Not Found), 405 (Method Not Allowed), 409 (Conflict), and 410 (Gone) responses as temporary provider issues rather than permanent failures
  • Added comprehensive test coverage with both unit and integration tests to prevent regressions

Specific Updates:

  • Modified _FALLBACK_RETRYABLE_STATUS_CODES constant to include the four additional HTTP status codes
  • Added unit tests documenting which status codes should trigger fallback attempts vs. stop immediately
  • Added integration test verifying that when a primary provider returns 404, the gateway successfully tries the next provider in the fallback chain

Benefits

Users who configure multiple providers for redundancy will now experience:

  • Automatic failover when a model is deprecated, renamed, retired, or simply not offered by a specific provider
  • Proper utilization of backup providers instead of immediate failures
  • More reliable service availability in multi-provider setups

Technical Notes

The fix distinguishes between error categories:

  • Retryable codes (now 8 codes): Status codes indicating provider/model unavailability (401, 403, 404, 405, 408, 409, 410, 429, 500+) trigger fallback attempts
  • Terminal codes (2 codes): Status codes 400 and 422 remain non-retryable since a malformed request will fail at all providers

…e falls through

A multi-attempt routing policy stopped at the first attempt when the upstream
returned 404/405/409/410, returning 502 without trying the remaining entries.
These codes mean the requested model or endpoint is unavailable at THIS
provider (deprecated, renamed, retired, or never offered) — recovering from a
retired model is one of the main reasons users configure fallback, so they
should fall through to the next entry instead of failing the whole request.
400/422 stay terminal: a malformed request is rejected by every provider, so
falling through would only waste attempts.

Adds a unit test pinning the classifier contract (which codes fall through vs
stop, plus timeout/network/no-status cases) and an integration test covering
404-on-primary fall-through to a healthy secondary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@agpituk agpituk temporarily deployed to integration-tests June 16, 2026 10:53 — with GitHub Actions Inactive
@github-actions

Copy link
Copy Markdown

⚠️ PR Template Missing

This PR appears to be missing required sections from the PR template.

Please edit your PR description to include the PR Type, Checklist, and AI Usage sections. The template helps maintainers review your contribution.

This PR will be automatically closed in 24 hours if the template is not restored.

If you're using an AI coding tool, please ensure it preserves the PR template.

@github-actions github-actions Bot added the missing-template PR is missing required template sections label Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 39521fdf-aa43-4c44-bce9-1ed4ffd37c66

📥 Commits

Reviewing files that changed from the base of the PR and between c2d3eb2 and 99d18d4.

📒 Files selected for processing (3)
  • src/gateway/api/routes/_platform.py
  • tests/integration/test_platform_mode_messages.py
  • tests/unit/test_classify_upstream_error.py

Walkthrough

_FALLBACK_RETRYABLE_STATUS_CODES in the platform route is expanded to include HTTP 404, 405, 409, and 410. A new unit test file validates the full classification matrix for _classify_upstream_error, and a new integration test confirms the 404 fall-through to a second provider attempt.

Changes

Fallback retry policy expansion and test coverage

Layer / File(s) Summary
Expand _FALLBACK_RETRYABLE_STATUS_CODES constant
src/gateway/api/routes/_platform.py
Adds 404, 405, 409, and 410 to the fallback-retryable set with inline comments describing each as a provider/model unavailability signal that should fall through to the next attempt.
Unit tests for _classify_upstream_error
tests/unit/test_classify_upstream_error.py
New module with a _http_error helper and four parametrized tests covering: retryable HTTP codes, terminal HTTP codes (400/413/422/414/431), network/timeout errors, and unknown non-HTTP exceptions.
Integration test: 404 falls through to fallback
tests/integration/test_platform_mode_messages.py
Adds test_platform_mode_falls_through_on_404_model_unavailable — stubs two provider attempts, raises HTTP 404 on the first, and asserts the second attempt's response body and correlation ID are returned.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title uses the required 'fix:' prefix, follows imperative mood, and clearly summarizes the main change. However, it exceeds the ~70 character guideline at 92 characters.
Description check ✅ Passed The PR description comprehensively addresses all critical sections: it clearly explains the problem, root cause, and fix, includes relevant test coverage, and uses appropriate formatting.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/fallback-404-410-retryable
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch fix/fallback-404-410-retryable

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

missing-template PR is missing required template sections

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant