feat: periodic DNS re-resolution for client-side service discovery#520
cgreeno wants to merge 2 commits into elixir-grpc:master
Conversation
@sleipnir How do you see the health checker working? Do you see health checking as a separate concern from resolution? Something like: `defstruct virtual_channel: nil,`
@sleipnir Everything is addressed except disconnection event forwarding, which I will follow up on after this PR. The plumbing (the `update/2` callback) is in place, but we haven't wired `:refresh` to call it on dead channels yet. Is it OK to address that after?
Hi @cgreeno, I will review it later today, but it needs formatting.
Ya, sorry, I was in a rush. Let me stop being lazy!
Adds periodic DNS re-resolution to the gRPC client, enabling automatic discovery of new backends and removal of stale ones. Essential for Kubernetes deployments where pods scale behind headless services.

Architecture follows @sleipnir's design: Connection remains the orchestrator, delegating to resolvers via behaviour callbacks. The Resolver behaviour gains optional `init/2`, `update/2`, `shutdown/1` callbacks. The DNS resolver implements these to start an internal worker process; Connection doesn't know the implementation details.

Key changes:
- `GRPC.Client.Resolver`: optional init/update/shutdown callbacks
- `GRPC.Client.Resolver.DNS`: implements the callbacks, starts a DNSResolver worker for periodic re-resolution with backoff and rate limiting
- `GRPC.Client.DNSResolver`: linked GenServer owning the resolve loop, sends `{:resolver_update, result}` back to Connection
- `GRPC.Client.Connection`: delegates to the resolver behaviour, handles channel reconciliation, re-inits the resolver on worker crash
- Channel state uses `{:connected, ch}` / `{:failed, reason}` tuples
- `reconcile_channels` extracted into focused helper functions
- `persistent_term` writes guarded to avoid redundant global GC
- Telemetry: `[:grpc, :client, :resolve, :stop]` and `[:error]` events

New `connect/2` options: `resolve_interval`, `max_resolve_interval`, `min_resolve_interval`. Public API: `resolve_now/1` for on-demand re-resolution.

Rebased onto master after the grpc_client/ → grpc/ directory rename. 30 tests covering scale-up/down, failure/recovery, backoff, rate limiting, telemetry, worker crash recovery, and edge cases.
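A hypothetical usage sketch of the new options. The option names come from the PR description, but the host, the values, and the module that exposes `resolve_now/1` are assumptions, not confirmed from the diff:

```elixir
# Hypothetical sketch; values and module paths are illustrative.
{:ok, channel} =
  GRPC.Stub.connect("dns:///my-svc.default.svc.cluster.local:50051",
    resolve_interval: 30_000,       # re-resolve every 30s
    max_resolve_interval: 120_000,  # backoff cap after DNS failures
    min_resolve_interval: 5_000     # rate limit for resolve_now/1
  )

# Force an immediate re-resolution on demand:
:ok = GRPC.Client.Connection.resolve_now(channel)
```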
Force-pushed from b0b34c9 to aaefd21
@sleipnir did you get a chance to review?
@sleipnir BTW, I really think we need to address the persistent_term global GC issue. If we start writing to persistent_term more often, it is going to be a problem. But I didn't want to tackle it until I had some alignment with you. Researching how other open-source libs do load balancing and round robin, most use `:ets.update_counter(table, key, {pos, increment, threshold, reset})`: it atomically increments, and when it hits the threshold it wraps to the reset value. One operation, no race conditions. Here is the research if you want to take a look.
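A minimal sketch of that primitive, assuming a hypothetical table name, key, and backend list (none of these are from the PR):

```elixir
# Hypothetical sketch: lock-free round-robin pick via :ets.update_counter/4.
backends = [:backend_a, :backend_b, :backend_c]
n = length(backends)

table = :ets.new(:rr_state, [:set, :public, write_concurrency: true])
:ets.insert(table, {:rr_index, 0})

# {pos, increment, threshold, reset}: atomically bump the counter in
# tuple position 2 by 1; if the result would exceed n - 1, wrap to 0.
# One operation, no races between concurrent callers.
pick = fn ->
  i = :ets.update_counter(table, :rr_index, {2, 1, n - 1, 0})
  Enum.at(backends, i)
end

picks = Enum.map(1..6, fn _ -> pick.() end)
```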
@sleipnir how is the review going? Any feedback or anything you want me to change?
Just remember that there is an existing load balancing strategy: |
We can replace them with ETS, but we can't forget that the responsibility for the pick lies with the load balancing module.
@sleipnir - I know you're probably busy, sorry to bug you again; just a gentle nudge to see if you had a chance to review. :)
Hi, yes, a little busy, but we'll get there. Did you take a look at my comments? |
@sleipnir - The comments about ETS, or others? On ETS, yes, 100%, but I planned on submitting another PR for that later. I'm working on it now, but it will take a couple more days. For this PR I addressed all the comments from #513 (which I closed and resubmitted), including the formatting message above.
@sleipnir - Any more comments I need to address? FYI, I am almost done with the ETS table migration but don't want to submit a PR until I get this approved. My end goal here is to get xDS implemented so we can move to weighted canary rollouts. 💪
Hi @cgreeno I realized I left a couple of questions open earlier.
My concern is that the PR seems to bypass the existing load balancing mechanism. As far as I understand, there's already a module responsible for handling the pick, but it doesn't look like it's being used here. Could you confirm if that's the case? If not, could you walk me through how your approach integrates with the current load balancing logic so I can better follow along in the review?
Thanks @sleipnir! LB isn't bypassed; there are three places where `lb_mod.pick(lb_state)` runs:

So this PR actually adds an LB integration point (#3), it doesn't remove one. The piece you might be looking at is the `persistent_term` fast-read path. That's actually the thread I was pulling on with the ETS comment above, and your point then ("the responsibility for the pick lies with the load balancing module") is exactly right. The follow-up would land the fast-read path inside the LB behaviour itself, so `pick_channel/2` goes through `lb_mod.pick` and the module owns its own storage. So let me take you through what I considered:

I have a follow-up PR that does option 4: it satisfies your "LB module owns the pick" point, removes the `persistent_term` write-frequency issue I flagged, and keeps the hot path lock-free. I'm keeping it out of this PR to keep the DNS re-resolution change focused. I created a draft PR of what I am thinking long term 👇; feedback more than welcome :)
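A minimal sketch of that follow-up direction, assuming a hypothetical LB module (the module name and API here are illustrative, not the repo's actual code):

```elixir
defmodule DemoLB.RoundRobin do
  # Hypothetical sketch: the LB module owns both its ETS storage and
  # the pick itself, so callers never touch persistent_term or the table.
  def init(channels) when channels != [] do
    table = :ets.new(:demo_lb, [:set, :public, write_concurrency: true])
    :ets.insert(table, {:idx, 0})
    %{table: table, channels: channels}
  end

  def pick(%{table: table, channels: channels}) do
    n = length(channels)
    # Atomic increment with wrap-around: one op, no races.
    i = :ets.update_counter(table, :idx, {2, 1, n - 1, 0})
    Enum.at(channels, i)
  end
end

state = DemoLB.RoundRobin.init([:ch_a, :ch_b])
first = DemoLB.RoundRobin.pick(state)
second = DemoLB.RoundRobin.pick(state)
```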
Summary
Rebased version of #513 onto current master (after the grpc_client/ → grpc/ rename in 1.0.0-rc.1).

Adds periodic DNS re-resolution to the gRPC client, enabling automatic discovery of new backends and removal of stale ones. Essential for Kubernetes deployments where pods scale up/down behind headless services, matching the behaviour of grpc-go and grpc-java.
Architecture
Follows @sleipnir's design from the #513 review: Connection remains the orchestrator, delegating to resolvers via behaviour callbacks.
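As a rough sketch of that callback surface (the module name and typespecs here are illustrative; the real module is `GRPC.Client.Resolver`):

```elixir
defmodule Resolver.Sketch do
  # Illustrative behaviour shape, not the repo's exact code.
  @callback resolve(target :: String.t(), opts :: keyword()) ::
              {:ok, [term()]} | {:error, term()}

  # Lifecycle callbacks added by this PR:
  @callback init(target :: String.t(), opts :: keyword()) :: {:ok, term()}
  @callback update(state :: term(), event :: term()) :: {:ok, term()}
  @callback shutdown(state :: term()) :: :ok

  # Static resolvers (IPv4, Unix) may omit these entirely.
  @optional_callbacks init: 2, update: 2, shutdown: 1
end
```

Because the lifecycle callbacks are optional, the orchestrator can guard each notification with `function_exported?(resolver, :update, 2)` before calling.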
The `GRPC.Client.Resolver` behaviour gains three optional callbacks:

- `init/2`: called by Connection during startup; lets the resolver initialize internal state (start workers, open streams, etc.)
- `update/2`: Connection notifies the resolver about events (`:resolve_now`; future: disconnections, rebalancing)
- `shutdown/1`: cleanup on disconnect

Static resolvers (IPv4, Unix) don't implement them and stay fully backwards compatible via `@optional_callbacks`.

The DNS resolver implements these callbacks to start an internal `DNSResolver` worker process. Connection doesn't know the implementation details; it just calls the behaviour.

What's included
Resolver behaviour expansion (`resolver.ex`): `init/2`, `update/2`, `shutdown/1` callbacks

DNS resolver (`resolver/dns.ex` + `dns_resolver.ex`):
- `DNS.init/2` starts a linked worker for periodic re-resolution
- the worker sends `{:resolver_update, result}` back to Connection

Connection changes (`connection.ex`):
- `resolve_now/1` public API via `GenServer.cast` + `resolver.update/2`
- channel state modeled as `{:connected, ch}` / `{:failed, reason}` for clarity
- `persistent_term` writes guarded to avoid redundant global GC passes

Opt-in via the `dns://` scheme: only DNS targets trigger re-resolution. No changes to static targets.

New connect/2 options

- `:resolve_interval` (ms)
- `:max_resolve_interval` (ms)
- `:min_resolve_interval` (ms): rate limit for `resolve_now/1`

Note on persistent_term
Updates to `:persistent_term` trigger a global GC pass across all BEAM processes. We guard writes to only happen when the picked channel actually changes. A future improvement would be migrating to ETS with `read_concurrency: true`.

Test coverage (30 tests)
Covers scale-up/down, complete replacement, DNS failure/recovery, empty addresses, pick_channel stability, repeated cycles, non-DNS targets, disconnect cleanup, LB errors, port changes, exponential backoff (doubles/resets/caps), rate limiting, telemetry events, stale persistent_term prevention, failed channel retry, slow DNS non-blocking, worker crash recovery, and resolver init with nil state.
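The backoff behaviour exercised by those tests (doubles on failure, caps at the max, resets on success) can be sketched as follows; the function and default values are illustrative, not the PR's actual code:

```elixir
defmodule BackoffSketch do
  # Illustrative: compute the next resolve interval from the last result.
  # On success, reset to the base interval; on DNS failure, double the
  # current interval but never exceed the cap (max_resolve_interval).
  def next(_current, :ok, base, _max), do: base
  def next(current, :error, _base, max), do: min(current * 2, max)
end

doubled = BackoffSketch.next(30_000, :error, 30_000, 120_000)
capped = BackoffSketch.next(90_000, :error, 30_000, 120_000)
reset = BackoffSketch.next(120_000, :ok, 30_000, 120_000)
```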
Test plan
Supersedes #513 (rebased after directory rename).