Cosmos: move periodic account-metadata refresh loop to runtime#4458
Open
tvaron3 wants to merge 7 commits into
Open
Cosmos: move periodic account-metadata refresh loop to runtime#4458tvaron3 wants to merge 7 commits into
tvaron3 wants to merge 7 commits into
Conversation
Per Fabian's review comment on PR Azure#4407: the periodic background refresh loop now lives on CosmosDriverRuntime (one tokio task per runtime that iterates registered account endpoints on each tick) instead of on LocationStateStore (one task per CosmosDriver). A runtime hosting N drivers no longer fans out into N concurrent refresh tasks all firing on the same cadence — it issues one fetch per registered account per tick. The event-driven 5-second-rate-limited refresh triggered by LocationEffect::RefreshAccountProperties stays on LocationStateStore as refresh_account_properties_if_due. That path solves a different problem (burst throttling during real incidents — 403/3 WriteForbidden, 403/1008 DatabaseAccountNotFound, etc.) than the timer solves (idle clients not refreshing) and they have different cadences. New file driver/account_refresh.rs holds the per-endpoint refresh body (snapshot previous -> fetch outside lock -> atomic get_or_refresh_with -> warn-on-failure -> on_success callback) extracted from LocationStateStore::refresh_account_properties_inner, the AccountRefreshFn / OnSuccessFn type aliases, the AccountRefreshEntry struct with per-registration id token, the AccountRefreshRegistry trait, and the AccountRefreshRegistration RAII guard with Drop impl that deregisters by (endpoint, id) so re-registration races during driver recreation are safe. CosmosDriverRuntime gains: * account_refresh_registry: RwLock<HashMap<AccountEndpoint, Entry>> * account_refresh_next_id: AtomicU64 * account_refresh_loop_started: Once (lazy start on first registration) * account_refresh_task_manager: BackgroundTaskManager (Drop -> abort) * pub(crate) fn register_for_account_refresh(...) -> Registration * impl AccountRefreshRegistry::deregister(endpoint, id) * run_account_refresh_loop free fn (snapshot under read lock, drop the strong ref before iterating, sequential per-endpoint refresh) * test-only pub fn account_refresh_registry_len() and account_refresh_loop_started() accessors CosmosDriver gains: * account_refresh_guard: Option<AccountRefreshRegistration> field (held for the driver's lifetime; Drop deregisters) * on_success_for_lss(weak_lss) helper - builds the OnSuccessFn that upgrades Weak<LocationStateStore> and calls sync_account_properties on each successful tick, no-ops when the driver is dropped LocationStateStore loses: * start_account_refresh_loop, account_refresh_loop free fn, force_refresh_account_properties (no longer called - replaced by the runtime loop's call to refresh_one_account) * BACKGROUND_REFRESH_INTERVAL constant (moved to account_refresh.rs) * The local AccountRefreshFn type alias (re-exported from account_refresh as a type alias for backward compatibility) LocationStateStore keeps: * refresh_account_properties_if_due (event-driven, 5s rate-limited) * refresh_account_properties_inner shared body (used only by the event-driven path now; the runtime loop has its own equivalent in account_refresh::refresh_one_account) * failback loop infrastructure (unchanged) Tests: * 5 new unit tests in driver/account_refresh.rs covering: registration Drop deregisters with captured id+endpoint; Drop after runtime drop is a no-op; next_registration_id monotonic; refresh_one_account success invokes on_success with fresh props; refresh_one_account failure skips on_success and retains the cached value. * 2 new integration tests in tests/in_memory_emulator_tests/account_metadata_refresh.rs: - single_loop_serves_two_drivers_in_same_runtime: two drivers registered, single Once-guarded loop spawned, both endpoints refresh independently each tick. Verifies the core consolidation invariant. - dropping_one_driver_stops_refresh_for_that_account_only: documents the registry-stays-populated reality given the existing Arc cycle between CosmosDriver and CosmosDriverRuntime; asserts the inverse property (both drivers' refreshes continue as long as the registry holds them). * 1 ignored test runtime_drop_aborts_refresh_task: documents the Arc cycle limitation that prevents observable runtime drop from this test layer; preserves the test for re-enabling if/when the cycle is broken (e.g., by switching driver_registry to Weak). * Existing background_loop_refetches_database_account_with_no_traffic and back_to_back_operations_do_not_trigger_per_request_account_reads pass verbatim - from the RequestObserver perspective the behavior is unchanged (still >=N GET / calls in N tick windows, still 0 per- operation hits). CHANGELOG: additive entry under 0.3.0 Unreleased Bugs Fixed documenting the move; no public API change. Validation: * cargo fmt -p azure_data_cosmos_driver -- --check PASS * cargo clippy -p azure_data_cosmos_driver --all-features --all-targets PASS (0 warnings) * cargo test -p azure_data_cosmos_driver --lib --all-features PASS (1520 passed) * cargo test -p azure_data_cosmos_driver --features __internal_in_memory_emulator --test in_memory_emulator account_metadata PASS (4 passed, 1 documented-ignored) * cargo build -p azure_data_cosmos --all-features PASS * cargo doc -p azure_data_cosmos_driver --no-deps --all-features PASS Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fresh-to-runtime # Conflicts: # sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs
- Drop the AccountRefreshFn type alias re-export in LocationStateStore; import the canonical type from driver::account_refresh directly. - Trim the BACKGROUND_REFRESH_INTERVAL doc comment. - Remove the Option wrapper on CosmosDriver::account_refresh_guard. CosmosDriver::new is the only construction path and always registers, so the Option was misleading about the actual invariant. - Switch next_registration_id from Ordering::Relaxed to Ordering::SeqCst. The RwLock on the registry already provides the necessary barriers, but SeqCst makes the cross-thread coordination explicit at zero runtime cost (called once per driver construction). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
/azp run rust - cosmos - weekly |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Two new integration tests in tests/in_memory_emulator_tests/account_metadata_refresh.rs: * multiple_clients_same_account_same_runtime_share_one_driver — asserts that two get_or_create_driver calls for the same account return the same Arc<CosmosDriver> (so every client built on top shares the same LocationStateStore, PartitionKeyRangeCache, SessionManager, transport and refresh registration), verifies the runtime's account-refresh registry stays at size 1, and confirms the background loop still emits only one GET / per refresh interval. * same_account_different_runtimes_do_not_share_driver — asserts that two separate CosmosDriverRuntime instances built for the same account produce two distinct Arc<CosmosDriver> values with independent registries and refresh loops, and that the per-interval GET / count grows at the expected ~2x rate. Guards against an accidental cross-runtime singleton being introduced later. The dedup invariant on get_or_create_driver was previously only covered by the failure-path test get_or_create_driver_removes_failed_initialization_from_registry; these tests lock in the success-path invariant that the runtime-owned refresh loop depends on. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refactors Cosmos driver account-metadata background refresh so the periodic refresh task is owned by CosmosDriverRuntime (one loop per runtime) rather than by each CosmosDriver, reducing redundant concurrent refresh tasks when a single runtime hosts multiple drivers.
Changes:
- Added a new
driver::account_refreshmodule containing registry/RAII guard primitives and the per-endpoint refresh implementation. - Extended
CosmosDriverRuntimewith an endpoint refresh registry plus a lazily-started tokio loop that sequentially refreshes all registered endpoints each tick. - Updated
CosmosDriverto self-register with the runtime loop for its account endpoint and updated/expanded in-memory-emulator integration tests to validate single-loop behavior and registration semantics.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos_driver/src/driver/account_refresh.rs | New module implementing per-endpoint refresh logic plus registry/RAII deregistration plumbing and unit tests. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/runtime.rs | Adds runtime-owned refresh registry, id allocator, lazy loop start, and loop implementation. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs | Registers each driver with the runtime-owned refresh loop and wires an on_success callback to sync LocationStateStore. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/location_state_store.rs | Removes per-driver periodic refresh loop and keeps the event-driven refresh path; retains a test-only force-refresh entrypoint. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/mod.rs | Exposes the new account_refresh module within the driver. |
| sdk/cosmos/azure_data_cosmos_driver/tests/in_memory_emulator_tests/account_metadata_refresh.rs | Expands integration coverage to assert one loop per runtime, per-endpoint refresh attribution, and runtime/driver sharing behaviors. |
| sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md | Adds an Unreleased entry describing the new runtime-owned refresh behavior. |
1. on_success now bumps the event-driven rate-limit clock. The runtime's register_for_account_refresh docs (and the OnSuccessFn type doc) promise that on_success advances LocationStateStore's last_refresh_epoch_ms so the event-driven RefreshAccountProperties path will throttle for at least one refresh interval after a successful periodic tick — matching the bookkeeping the old per- driver loop did inside refresh_account_properties_inner. The previous on_success_for_lss closure only called sync_account_properties, regressing that behavior. Added LocationStateStore::note_periodic_refresh_succeeded which performs the clock bump and the snapshot sync as a single helper, and routed the runtime callback through it. 2. Renamed dropping_one_driver_stops_refresh_for_that_account_only -> registered_drivers_keep_refreshing_when_local_handle_dropped to match the property the test actually asserts (the test verifies the inverse of the original name, because the strong-ref cycle between CosmosDriver and CosmosDriverRuntime prevents observing the drop side of the contract from this integration-test layer). Updated the doc comment, the file-header regression-list bullet, and trimmed the verbose inline comment block above the drop call. 3. Consolidated the three stale 0.3.0 Unreleased CHANGELOG entries (which still referenced LocationStateStore::start_account_refresh_loop — an API removed in this PR) into a single accurate entry covering the periodic refresh, the warn-on-failure logging, and the move to the runtime, citing both Azure#4407 and Azure#4458. Verified: * cargo clippy -p azure_data_cosmos_driver --all-features --all-targets PASS * cargo test -p azure_data_cosmos_driver --lib -- driver::account_refresh driver::routing::location_state_store PASS (10 passed) * cargo test -p azure_data_cosmos_driver --features __internal_in_memory_emulator --test in_memory_emulator account_metadata PASS (6 passed, 1 ignored) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Per Fabian's review on #4407: the periodic background account-metadata refresh loop now lives on
CosmosDriverRuntime(one tokio task per runtime that iterates registered account endpoints on each tick) instead of onLocationStateStore(one task perCosmosDriver). A runtime hosting N drivers no longer fans out into N concurrent refresh tasks firing on the same cadence — it issues one fetch per registered account per tick.The event-driven 5-second-rate-limited refresh triggered by
LocationEffect::RefreshAccountPropertiesstays onLocationStateStoreasrefresh_account_properties_if_due. That path solves a different problem (burst throttling during real incidents — 403/3 WriteForbidden, 403/1008 DatabaseAccountNotFound, etc.) than the timer solves (idle clients not refreshing), and they have different cadences.What changed
sdk/cosmos/azure_data_cosmos_driver/src/driver/account_refresh.rs: per-endpoint refresh body (snapshot previous → fetch outside lock → atomicget_or_refresh_with→ warn-on-failure →on_successcallback),AccountRefreshFn/OnSuccessFntype aliases,AccountRefreshEntrywith per-registration id token,AccountRefreshRegistrytrait, andAccountRefreshRegistrationRAII guard whoseDropderegisters by(endpoint, id)so re-registration races during driver recreation are safe.CosmosDriverRuntimegains: an account-refresh registry (RwLock<HashMap<AccountEndpoint, Entry>>), anAtomicU64id allocator, aOnce-guarded lazy loop start on first registration, aBackgroundTaskManagerthat aborts the loop on drop, apub(crate) fn register_for_account_refresh(...), anAccountRefreshRegistryimpl, and arun_account_refresh_loopfree fn that snapshots under a read lock, drops the strong ref, then iterates per-endpoint sequentially.CosmosDriverholds anOption<AccountRefreshRegistration>for the driver's lifetime; theDropderegisters.on_success_for_lss(weak_lss)upgrades aWeak<LocationStateStore>and callssync_account_propertieson each successful tick, and no-ops once the driver is dropped.LocationStateStorelosesstart_account_refresh_loop, the in-moduleaccount_refresh_loop,force_refresh_account_properties, and theBACKGROUND_REFRESH_INTERVALconstant (moved toaccount_refresh.rs). It keepsrefresh_account_properties_if_due(event-driven, 5 s rate-limited), therefresh_account_properties_innershared body, and the failback-loop infrastructure.Tests
driver/account_refresh.rscovering: registrationDropderegisters with captured id+endpoint,Dropafter runtime drop is a no-op,next_registration_idis monotonic,refresh_one_accountsuccess invokeson_successwith fresh props, andrefresh_one_accountfailure skipson_successand retains the cached value.tests/in_memory_emulator_tests/account_metadata_refresh.rs:single_loop_serves_two_drivers_in_same_runtime— two drivers registered, a singleOnce-guarded loop spawned, both endpoints refresh independently each tick.dropping_one_driver_stops_refresh_for_that_account_only— documents the registry-stays-populated reality given the existingArccycle betweenCosmosDriverandCosmosDriverRuntime; asserts the inverse property.runtime_drop_aborts_refresh_task— documents theArccycle limitation; preserved for re-enabling if/when the cycle is broken.background_loop_refetches_database_account_with_no_trafficandback_to_back_operations_do_not_trigger_per_request_account_readspass verbatim.Validation
cargo fmt -p azure_data_cosmos_driver -- --check✅cargo clippy -p azure_data_cosmos_driver --all-features --all-targets✅ (0 warnings)cargo test -p azure_data_cosmos_driver --lib --all-features✅ (1520 passed)cargo test -p azure_data_cosmos_driver --features __internal_in_memory_emulator --test in_memory_emulator account_metadata✅ (4 passed, 1 documented-ignored)cargo build -p azure_data_cosmos --all-features✅cargo doc -p azure_data_cosmos_driver --no-deps --all-features✅