Skip to content

Cosmos: move periodic account-metadata refresh loop to runtime#4458

Open
tvaron3 wants to merge 7 commits into
Azure:mainfrom
tvaron3:cosmos/move-account-refresh-to-runtime
Open

Cosmos: move periodic account-metadata refresh loop to runtime#4458
tvaron3 wants to merge 7 commits into
Azure:mainfrom
tvaron3:cosmos/move-account-refresh-to-runtime

Conversation

@tvaron3

@tvaron3 tvaron3 commented May 22, 2026

Copy link
Copy Markdown
Member

Per Fabian's review on #4407: the periodic background account-metadata refresh loop now lives on CosmosDriverRuntime (one tokio task per runtime that iterates registered account endpoints on each tick) instead of on LocationStateStore (one task per CosmosDriver). A runtime hosting N drivers no longer fans out into N concurrent refresh tasks firing on the same cadence — it issues one fetch per registered account per tick.

The event-driven 5-second-rate-limited refresh triggered by LocationEffect::RefreshAccountProperties stays on LocationStateStore as refresh_account_properties_if_due. That path solves a different problem (burst throttling during real incidents — 403/3 WriteForbidden, 403/1008 DatabaseAccountNotFound, etc.) than the timer solves (idle clients not refreshing), and they have different cadences.

What changed

  • New sdk/cosmos/azure_data_cosmos_driver/src/driver/account_refresh.rs: per-endpoint refresh body (snapshot previous → fetch outside lock → atomic get_or_refresh_with → warn-on-failure → on_success callback), AccountRefreshFn / OnSuccessFn type aliases, AccountRefreshEntry with per-registration id token, AccountRefreshRegistry trait, and AccountRefreshRegistration RAII guard whose Drop deregisters by (endpoint, id) so re-registration races during driver recreation are safe.
  • CosmosDriverRuntime gains: an account-refresh registry (RwLock<HashMap<AccountEndpoint, Entry>>), an AtomicU64 id allocator, a Once-guarded lazy loop start on first registration, a BackgroundTaskManager that aborts the loop on drop, a pub(crate) fn register_for_account_refresh(...), an AccountRefreshRegistry impl, and a run_account_refresh_loop free fn that snapshots under a read lock, drops the strong ref, then iterates per-endpoint sequentially.
  • CosmosDriver holds an Option<AccountRefreshRegistration> for the driver's lifetime; the Drop deregisters. on_success_for_lss(weak_lss) upgrades a Weak<LocationStateStore> and calls sync_account_properties on each successful tick, and no-ops once the driver is dropped.
  • LocationStateStore loses start_account_refresh_loop, the in-module account_refresh_loop, force_refresh_account_properties, and the BACKGROUND_REFRESH_INTERVAL constant (moved to account_refresh.rs). It keeps refresh_account_properties_if_due (event-driven, 5 s rate-limited), the refresh_account_properties_inner shared body, and the failback-loop infrastructure.

Tests

  • 5 new unit tests in driver/account_refresh.rs covering: registration Drop deregisters with captured id+endpoint, Drop after runtime drop is a no-op, next_registration_id is monotonic, refresh_one_account success invokes on_success with fresh props, and refresh_one_account failure skips on_success and retains the cached value.
  • 2 new integration tests in tests/in_memory_emulator_tests/account_metadata_refresh.rs:
    • single_loop_serves_two_drivers_in_same_runtime — two drivers registered, a single Once-guarded loop spawned, both endpoints refresh independently each tick.
    • dropping_one_driver_stops_refresh_for_that_account_only — documents the registry-stays-populated reality given the existing Arc cycle between CosmosDriver and CosmosDriverRuntime; asserts the inverse property.
  • 1 ignored test runtime_drop_aborts_refresh_task — documents the Arc cycle limitation; preserved for re-enabling if/when the cycle is broken.
  • Existing background_loop_refetches_database_account_with_no_traffic and back_to_back_operations_do_not_trigger_per_request_account_reads pass verbatim.

Validation

  • cargo fmt -p azure_data_cosmos_driver -- --check
  • cargo clippy -p azure_data_cosmos_driver --all-features --all-targets ✅ (0 warnings)
  • cargo test -p azure_data_cosmos_driver --lib --all-features ✅ (1520 passed)
  • cargo test -p azure_data_cosmos_driver --features __internal_in_memory_emulator --test in_memory_emulator account_metadata ✅ (4 passed, 1 documented-ignored)
  • cargo build -p azure_data_cosmos --all-features
  • cargo doc -p azure_data_cosmos_driver --no-deps --all-features

tvaron3 and others added 3 commits May 20, 2026 01:40
Per Fabian's review comment on PR Azure#4407: the periodic background refresh
loop now lives on CosmosDriverRuntime (one tokio task per runtime that
iterates registered account endpoints on each tick) instead of on
LocationStateStore (one task per CosmosDriver). A runtime hosting N
drivers no longer fans out into N concurrent refresh tasks all firing on
the same cadence — it issues one fetch per registered account per tick.

The event-driven 5-second-rate-limited refresh triggered by
LocationEffect::RefreshAccountProperties stays on LocationStateStore as
refresh_account_properties_if_due. That path solves a different problem
(burst throttling during real incidents — 403/3 WriteForbidden,
403/1008 DatabaseAccountNotFound, etc.) than the timer solves (idle
clients not refreshing) and they have different cadences.

New file driver/account_refresh.rs holds the per-endpoint refresh body
(snapshot previous -> fetch outside lock -> atomic get_or_refresh_with
-> warn-on-failure -> on_success callback) extracted from
LocationStateStore::refresh_account_properties_inner, the
AccountRefreshFn / OnSuccessFn type aliases, the AccountRefreshEntry
struct with per-registration id token, the AccountRefreshRegistry
trait, and the AccountRefreshRegistration RAII guard with Drop impl
that deregisters by (endpoint, id) so re-registration races during
driver recreation are safe.

CosmosDriverRuntime gains:
* account_refresh_registry: RwLock<HashMap<AccountEndpoint, Entry>>
* account_refresh_next_id: AtomicU64
* account_refresh_loop_started: Once (lazy start on first registration)
* account_refresh_task_manager: BackgroundTaskManager (Drop -> abort)
* pub(crate) fn register_for_account_refresh(...) -> Registration
* impl AccountRefreshRegistry::deregister(endpoint, id)
* run_account_refresh_loop free fn (snapshot under read lock, drop
  the strong ref before iterating, sequential per-endpoint refresh)
* test-only pub fn account_refresh_registry_len() and
  account_refresh_loop_started() accessors

CosmosDriver gains:
* account_refresh_guard: Option<AccountRefreshRegistration> field
  (held for the driver's lifetime; Drop deregisters)
* on_success_for_lss(weak_lss) helper - builds the OnSuccessFn that
  upgrades Weak<LocationStateStore> and calls sync_account_properties
  on each successful tick, no-ops when the driver is dropped

LocationStateStore loses:
* start_account_refresh_loop, account_refresh_loop free fn,
  force_refresh_account_properties (no longer called - replaced by
  the runtime loop's call to refresh_one_account)
* BACKGROUND_REFRESH_INTERVAL constant (moved to account_refresh.rs)
* The local AccountRefreshFn type alias (re-exported from
  account_refresh as a type alias for backward compatibility)

LocationStateStore keeps:
* refresh_account_properties_if_due (event-driven, 5s rate-limited)
* refresh_account_properties_inner shared body (used only by the
  event-driven path now; the runtime loop has its own equivalent in
  account_refresh::refresh_one_account)
* failback loop infrastructure (unchanged)

Tests:
* 5 new unit tests in driver/account_refresh.rs covering: registration
  Drop deregisters with captured id+endpoint; Drop after runtime drop
  is a no-op; next_registration_id monotonic; refresh_one_account
  success invokes on_success with fresh props; refresh_one_account
  failure skips on_success and retains the cached value.
* 2 new integration tests in
  tests/in_memory_emulator_tests/account_metadata_refresh.rs:
  - single_loop_serves_two_drivers_in_same_runtime: two drivers
    registered, single Once-guarded loop spawned, both endpoints
    refresh independently each tick. Verifies the core
    consolidation invariant.
  - dropping_one_driver_stops_refresh_for_that_account_only:
    documents the registry-stays-populated reality given the
    existing Arc cycle between CosmosDriver and CosmosDriverRuntime;
    asserts the inverse property (both drivers' refreshes continue
    as long as the registry holds them).
* 1 ignored test runtime_drop_aborts_refresh_task: documents the
  Arc cycle limitation that prevents observable runtime drop from
  this test layer; preserves the test for re-enabling if/when the
  cycle is broken (e.g., by switching driver_registry to Weak).
* Existing background_loop_refetches_database_account_with_no_traffic
  and back_to_back_operations_do_not_trigger_per_request_account_reads
  pass verbatim - from the RequestObserver perspective the behavior
  is unchanged (still >=N GET / calls in N tick windows, still 0 per-
  operation hits).

CHANGELOG: additive entry under 0.3.0 Unreleased Bugs Fixed
documenting the move; no public API change.

Validation:
* cargo fmt -p azure_data_cosmos_driver -- --check PASS
* cargo clippy -p azure_data_cosmos_driver --all-features --all-targets PASS (0 warnings)
* cargo test -p azure_data_cosmos_driver --lib --all-features PASS (1520 passed)
* cargo test -p azure_data_cosmos_driver --features __internal_in_memory_emulator
  --test in_memory_emulator account_metadata PASS (4 passed, 1 documented-ignored)
* cargo build -p azure_data_cosmos --all-features PASS
* cargo doc -p azure_data_cosmos_driver --no-deps --all-features PASS

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 2 commits May 21, 2026 17:52
…fresh-to-runtime

# Conflicts:
#	sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs
- Drop the AccountRefreshFn type alias re-export in LocationStateStore;
  import the canonical type from driver::account_refresh directly.
- Trim the BACKGROUND_REFRESH_INTERVAL doc comment.
- Remove the Option wrapper on CosmosDriver::account_refresh_guard.
  CosmosDriver::new is the only construction path and always registers,
  so the Option was misleading about the actual invariant.
- Switch next_registration_id from Ordering::Relaxed to Ordering::SeqCst.
  The RwLock on the registry already provides the necessary barriers, but
  SeqCst makes the cross-thread coordination explicit at zero runtime cost
  (called once per driver construction).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3

tvaron3 commented May 22, 2026

Copy link
Copy Markdown
Member Author

/azp run rust - cosmos - weekly

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Two new integration tests in tests/in_memory_emulator_tests/account_metadata_refresh.rs:

* multiple_clients_same_account_same_runtime_share_one_driver — asserts
  that two get_or_create_driver calls for the same account return the
  same Arc<CosmosDriver> (so every client built on top shares the same
  LocationStateStore, PartitionKeyRangeCache, SessionManager, transport
  and refresh registration), verifies the runtime's account-refresh
  registry stays at size 1, and confirms the background loop still emits
  only one GET / per refresh interval.
* same_account_different_runtimes_do_not_share_driver — asserts that
  two separate CosmosDriverRuntime instances built for the same account
  produce two distinct Arc<CosmosDriver> values with independent
  registries and refresh loops, and that the per-interval GET / count
  grows at the expected ~2x rate. Guards against an accidental
  cross-runtime singleton being introduced later.

The dedup invariant on get_or_create_driver was previously only
covered by the failure-path test
get_or_create_driver_removes_failed_initialization_from_registry;
these tests lock in the success-path invariant that the runtime-owned
refresh loop depends on.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 marked this pull request as ready for review May 22, 2026 06:22
@tvaron3 tvaron3 requested a review from a team as a code owner May 22, 2026 06:22
Copilot AI review requested due to automatic review settings May 22, 2026 06:22

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Cosmos driver account-metadata background refresh so the periodic refresh task is owned by CosmosDriverRuntime (one loop per runtime) rather than by each CosmosDriver, reducing redundant concurrent refresh tasks when a single runtime hosts multiple drivers.

Changes:

  • Added a new driver::account_refresh module containing registry/RAII guard primitives and the per-endpoint refresh implementation.
  • Extended CosmosDriverRuntime with an endpoint refresh registry plus a lazily-started tokio loop that sequentially refreshes all registered endpoints each tick.
  • Updated CosmosDriver to self-register with the runtime loop for its account endpoint and updated/expanded in-memory-emulator integration tests to validate single-loop behavior and registration semantics.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
sdk/cosmos/azure_data_cosmos_driver/src/driver/account_refresh.rs New module implementing per-endpoint refresh logic plus registry/RAII deregistration plumbing and unit tests.
sdk/cosmos/azure_data_cosmos_driver/src/driver/runtime.rs Adds runtime-owned refresh registry, id allocator, lazy loop start, and loop implementation.
sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs Registers each driver with the runtime-owned refresh loop and wires an on_success callback to sync LocationStateStore.
sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/location_state_store.rs Removes per-driver periodic refresh loop and keeps the event-driven refresh path; retains a test-only force-refresh entrypoint.
sdk/cosmos/azure_data_cosmos_driver/src/driver/mod.rs Exposes the new account_refresh module within the driver.
sdk/cosmos/azure_data_cosmos_driver/tests/in_memory_emulator_tests/account_metadata_refresh.rs Expands integration coverage to assert one loop per runtime, per-endpoint refresh attribution, and runtime/driver sharing behaviors.
sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md Adds an Unreleased entry describing the new runtime-owned refresh behavior.

Comment thread sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs Outdated
1. on_success now bumps the event-driven rate-limit clock.
   The runtime's register_for_account_refresh docs (and the OnSuccessFn
   type doc) promise that on_success advances LocationStateStore's
   last_refresh_epoch_ms so the event-driven RefreshAccountProperties
   path will throttle for at least one refresh interval after a
   successful periodic tick — matching the bookkeeping the old per-
   driver loop did inside refresh_account_properties_inner. The
   previous on_success_for_lss closure only called
   sync_account_properties, regressing that behavior. Added
   LocationStateStore::note_periodic_refresh_succeeded which performs
   the clock bump and the snapshot sync as a single helper, and routed
   the runtime callback through it.

2. Renamed dropping_one_driver_stops_refresh_for_that_account_only ->
   registered_drivers_keep_refreshing_when_local_handle_dropped to
   match the property the test actually asserts (the test verifies the
   inverse of the original name, because the strong-ref cycle between
   CosmosDriver and CosmosDriverRuntime prevents observing the drop
   side of the contract from this integration-test layer). Updated the
   doc comment, the file-header regression-list bullet, and trimmed
   the verbose inline comment block above the drop call.

3. Consolidated the three stale 0.3.0 Unreleased CHANGELOG entries
   (which still referenced LocationStateStore::start_account_refresh_loop
   — an API removed in this PR) into a single accurate entry covering
   the periodic refresh, the warn-on-failure logging, and the move to
   the runtime, citing both Azure#4407 and Azure#4458.

Verified:
* cargo clippy -p azure_data_cosmos_driver --all-features --all-targets PASS
* cargo test -p azure_data_cosmos_driver --lib -- driver::account_refresh
  driver::routing::location_state_store PASS (10 passed)
* cargo test -p azure_data_cosmos_driver --features __internal_in_memory_emulator
  --test in_memory_emulator account_metadata PASS (6 passed, 1 ignored)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

2 participants