feat: periodic DNS re-resolution for client-side service discovery#520

Open
cgreeno wants to merge 2 commits into elixir-grpc:master from cgreeno:feature/periodic-dns-re-resolution-v2

Conversation

@cgreeno cgreeno commented Apr 13, 2026

Summary

Rebased version of #513 onto current master (after grpc_client/grpc/ rename in 1.0.0-rc.1).

Adds periodic DNS re-resolution to the gRPC client, enabling automatic discovery of new backends and removal of stale ones. Essential for Kubernetes deployments where pods scale up/down behind headless services, matching the behaviour of grpc-go and grpc-java.

Architecture

Follows @sleipnir's design from the #513 review: Connection remains the orchestrator, delegating to resolvers via behaviour callbacks.

The GRPC.Client.Resolver behaviour gains three optional callbacks:

  • init/2 — called by Connection during startup, lets the resolver initialize internal state (start workers, open streams, etc.)
  • update/2 — Connection notifies the resolver about events (:resolve_now, future: disconnections, rebalancing)
  • shutdown/1 — cleanup on disconnect

Static resolvers (IPv4, Unix) don't implement them — fully backwards compatible via @optional_callbacks.

The DNS resolver implements these callbacks to start an internal DNSResolver worker process. Connection doesn't know the implementation details — it just calls the behaviour.
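The expanded behaviour might look roughly like this — a sketch only; the callback names and arities come from the PR description, but the typespecs (and the existing resolve callback's exact signature in resolver.ex) are assumptions:

```elixir
# Sketch of the expanded behaviour. Callback names/arities are from the PR
# description; the typespecs and resolve/2 shape are illustrative assumptions.
defmodule GRPC.Client.Resolver do
  @doc "Resolve a target to a list of addresses (existing required callback)."
  @callback resolve(target :: String.t(), opts :: keyword()) ::
              {:ok, [addr :: term()]} | {:error, term()}

  @doc "Called by Connection during startup; may start workers, open streams, etc."
  @callback init(target :: String.t(), opts :: keyword()) ::
              {:ok, state :: term()} | {:error, term()}

  @doc "Connection notifies the resolver about events such as :resolve_now."
  @callback update(event :: term(), state :: term()) :: {:ok, term()}

  @doc "Cleanup on disconnect."
  @callback shutdown(state :: term()) :: :ok

  # Static resolvers (IPv4, Unix) can omit all three — backwards compatible.
  @optional_callbacks init: 2, update: 2, shutdown: 1
end
```

Because the three new callbacks are declared in `@optional_callbacks`, existing resolver modules compile and run unchanged.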

What's included

Resolver behaviour expansion (resolver.ex):

  • Optional init/2, update/2, shutdown/1 callbacks

DNS resolver (resolver/dns.ex + dns_resolver.ex):

  • DNS.init/2 starts a linked worker for periodic re-resolution
  • Worker owns the timer loop, exponential backoff, rate limiting, and telemetry
  • Sends {:resolver_update, result} back to Connection

Connection changes (connection.ex):

  • Delegates to resolver behaviour instead of managing processes directly
  • Channel reconciliation: new backends get channels, removed backends get disconnected
  • LB re-initialized with updated addresses after each reconciliation
  • Fallback to healthy channel when LB picks a failed one
  • Retry previously failed channels on subsequent cycles
  • Re-initializes resolver automatically on worker crash
  • resolve_now/1 public API via GenServer.cast + resolver.update/2
  • Channel state uses {:connected, ch} / {:failed, reason} for clarity
  • persistent_term writes guarded to avoid redundant global GC passes

Opt-in via dns:// scheme — only DNS targets trigger re-resolution. No changes to static targets.
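The channel-reconciliation step described above can be sketched as follows. This is an illustrative standalone version, not the code from connection.ex — `ReconcileSketch` and the connect/disconnect callbacks are hypothetical stand-ins for the adapter calls:

```elixir
# Sketch of reconciliation, assuming real_channels maps each address to
# {:connected, ch} | {:failed, reason}. Helper names are illustrative.
defmodule ReconcileSketch do
  def reconcile(real_channels, new_addrs, connect_fun, disconnect_fun) do
    current = real_channels |> Map.keys() |> MapSet.new()
    desired = MapSet.new(new_addrs)

    # New backends get channels...
    added =
      for addr <- MapSet.difference(desired, current), into: %{} do
        case connect_fun.(addr) do
          {:ok, ch} -> {addr, {:connected, ch}}
          {:error, reason} -> {addr, {:failed, reason}}
        end
      end

    # ...removed backends get disconnected.
    for addr <- MapSet.difference(current, desired) do
      with {:connected, ch} <- real_channels[addr], do: disconnect_fun.(ch)
    end

    real_channels |> Map.take(new_addrs) |> Map.merge(added)
  end
end
```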

New connect/2 options

Option                  Default   Description
:resolve_interval       30000     Base DNS re-resolution interval (ms)
:max_resolve_interval   300000    Backoff cap (ms)
:min_resolve_interval   5000      Rate-limit floor for resolve_now/1 (ms)
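Assuming the library's usual GRPC.Stub.connect/2 entry point, opting in might look like this (illustrative — the option names come from the table above, and the target hostname is made up):

```elixir
# Illustrative usage: only dns:// targets trigger periodic re-resolution.
{:ok, channel} =
  GRPC.Stub.connect("dns://my-svc.default.svc.cluster.local:50051",
    resolve_interval: 30_000,       # base re-resolution interval (ms)
    max_resolve_interval: 300_000,  # backoff cap (ms)
    min_resolve_interval: 5_000     # rate-limit floor for resolve_now/1 (ms)
  )
```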

Note on persistent_term

Updates to :persistent_term trigger a global GC pass across all BEAM processes. We guard writes to only happen when the picked channel actually changes. A future improvement would be migrating to ETS with read_concurrency: true.
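The guard amounts to a compare-before-put. A minimal sketch, with an illustrative key shape (not the PR's actual key):

```elixir
# Sketch of the write guard: only touch :persistent_term when the value
# actually changes, since every put triggers a global GC pass.
defmodule GuardedPut do
  def put_if_changed(key, new_value) do
    case :persistent_term.get(key, :undefined) do
      ^new_value ->
        :unchanged                            # skip the write — no global GC

      _ ->
        :persistent_term.put(key, new_value)  # pay the GC cost only on change
        :updated
    end
  end
end
```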

Test coverage (30 tests)

Covers scale-up/down, complete replacement, DNS failure/recovery, empty addresses, pick_channel stability, repeated cycles, non-DNS targets, disconnect cleanup, LB errors, port changes, exponential backoff (doubles/resets/caps), rate limiting, telemetry events, stale persistent_term prevention, failed channel retry, slow DNS non-blocking, worker crash recovery, and resolver init with nil state.

Test plan

  • 30 re-resolution tests pass (5 consecutive runs, 0 flakes)
  • Full test suite: 255 tests, 0 failures
  • End-to-end tested with Docker Compose (scale up/down/outage/recovery)
  • End-to-end tested on Kubernetes with minikube + headless service + CoreDNS

Supersedes #513 (rebased after directory rename).


cgreeno commented Apr 13, 2026

@sleipnir How do you see Health checker working? Do you see Health checking as a different concern from Resolution?

Something like:

defstruct virtual_channel: nil,
          real_channels: %{},
          lb_mod: nil,
          lb_state: nil,
          resolver: nil,
          adapter: GRPC.Client.Adapters.Gun,
          resolver_state: nil,       # opaque — DNS worker, xDS stream, whatever
          health_checker: nil,       # module
          health_checker_state: nil, # opaque — per-backend check processes
          connect_opts: []


cgreeno commented Apr 13, 2026

@sleipnir Everything is addressed except disconnection event forwarding, which I will follow up on after this PR... the plumbing (the update/2 callback) is in place, but we haven't wired :refresh to call it on dead channels yet. Assume it is OK to address after?

@sleipnir (Collaborator)

Hi @cgreeno, I will review it later today. But it needs formatting.


cgreeno commented Apr 13, 2026

Ya, sorry, I was in a rush. Let me stop being lazy!

Adds periodic DNS re-resolution to the gRPC client, enabling automatic
discovery of new backends and removal of stale ones. Essential for
Kubernetes deployments where pods scale behind headless services.

Architecture follows @sleipnir's design: Connection remains the
orchestrator, delegating to resolvers via behaviour callbacks. The
Resolver behaviour gains optional init/2, update/2, shutdown/1
callbacks. DNS resolver implements these to start an internal worker
process — Connection doesn't know the implementation details.

Key changes:

- GRPC.Client.Resolver: optional init/update/shutdown callbacks
- GRPC.Client.Resolver.DNS: implements callbacks, starts DNSResolver
  worker for periodic re-resolution with backoff and rate limiting
- GRPC.Client.DNSResolver: linked GenServer owning the resolve loop,
  sends {:resolver_update, result} back to Connection
- GRPC.Client.Connection: delegates to resolver behaviour, handles
  channel reconciliation, re-inits resolver on worker crash
- Channel state uses {:connected, ch}/{:failed, reason} tuples
- reconcile_channels extracted into focused helper functions
- persistent_term writes guarded to avoid redundant global GC
- Telemetry: [:grpc, :client, :resolve, :stop] and [:error] events

New connect/2 options: resolve_interval, max_resolve_interval,
min_resolve_interval. Public API: resolve_now/1 for on-demand
re-resolution.

Rebased onto master after grpc_client/ → grpc/ directory rename.

30 tests covering scale-up/down, failure/recovery, backoff, rate
limiting, telemetry, worker crash recovery, and edge cases.
cgreeno force-pushed the feature/periodic-dns-re-resolution-v2 branch from b0b34c9 to aaefd21 on April 13, 2026 at 19:29

cgreeno commented Apr 14, 2026

@sleipnir did you get a chance to review?


cgreeno commented Apr 15, 2026

@sleipnir BTW - I really think we need to address the persistent_term global GC issue. If we are starting to write to persistent_term more often, it is going to become a problem. BUT I didn't want to do it until I had some alignment with you.

Doing some research on how others do LB and round-robin, it seems most other open-source libs use :ets.update_counter/4, something like so:

# Setup:
:ets.new(:lb_state, [:set, :public, :named_table, read_concurrency: true])
:ets.insert(:lb_state, {{ref, :channels}, {ch1, ch2, ch3}})
:ets.insert(:lb_state, {{ref, :index}, 0})

# pick_channel (per request, no GenServer, no lock):
def pick_channel(%Channel{ref: ref}, _opts \\ []) do
  case :ets.lookup(:lb_state, {ref, :channels}) do
    [{{^ref, :channels}, channels}] when tuple_size(channels) > 0 ->
      idx = :ets.update_counter(:lb_state, {ref, :index}, {2, 1})
      {:ok, elem(channels, rem(idx, tuple_size(channels)))}

    _ ->
      {:error, :no_connection}
  end
end

:ets.update_counter(table, key, {pos, increment, threshold, set_value}) atomically increments, and when the result exceeds the threshold it wraps to the set value. One operation, no race conditions.
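A minimal standalone demonstration of that wrap-around form (table and key names here are throwaway, not tied to the sketch above):

```elixir
# {pos, incr, threshold, set_value}: increments atomically; when the
# incremented result exceeds threshold, the counter is set to set_value.
t = :ets.new(:demo_counter, [:set, :public])
:ets.insert(t, {:idx, 0})

values = for _ <- 1..5, do: :ets.update_counter(t, :idx, {2, 1, 3, 0})
# values == [1, 2, 3, 0, 1] — wraps back to 0 after exceeding 3
```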

Here is the research if you want to take a look


cgreeno commented Apr 15, 2026

@sleipnir how is the review? Any feedback or anything you want me to change?

@sleipnir (Collaborator)

> @sleipnir how is the review? Any feedback or anything you want me to change?

Hi @cgreeno I'm a bit busy with other work, but I haven't forgotten about you. I'll review and respond as soon as I can. Overall, I didn't see anything wrong, but I haven't looked at it carefully yet.

@sleipnir (Collaborator)

# Setup:                                                                                                   
  :ets.new(:lb_state, [:set, :public, :named_table, read_concurrency: true])
  :ets.insert(:lb_state, {{ref, :channels}, {ch1, ch2, ch3}})                                                                                                             
  :ets.insert(:lb_state, {{ref, :index}, 0})
                                                                                                                                                                          
  # pick_channel (per request, no GenServer, no lock):                                                       
  def pick_channel(%Channel{ref: ref}, _opts \\ []) do                                                                                                                    
    case :ets.lookup(:lb_state, {ref, :channels}) do                                                         
      [{{^ref, :channels}, channels}] when tuple_size(channels) > 0 ->                                                                                                    
        idx = :ets.update_counter(:lb_state, {ref, :index}, {2, 1})
        {:ok, elem(channels, rem(idx, tuple_size(channels)))}                                                                                                             
                                                                                                             
      _ ->                                                                                                                                                                
        {:error, :no_connection}                                                                             
    end
  end

Just remember that there is an existing load balancing strategy: lb_mod.pick(lb_state)

@sleipnir (Collaborator)

> Just remember that there is an existing load balancing strategy: lb_mod.pick(lb_state)

We can replace them with ETS, but we can't forget that the responsibility for the pick lies with the load balancing module.


cgreeno commented Apr 22, 2026

@sleipnir - I know you're probably busy, sorry to bug you again; just a gentle nudge to see if you had a chance to review. :)

@sleipnir (Collaborator)

> @sleipnir - I know you're probably busy, sorry to bug you again, just a gentle nudge to see if you had a chance to review. :)

Hi, yes, a little busy, but we'll get there.

Did you take a look at my comments?


cgreeno commented Apr 22, 2026

> Did you take a look at my comments?

@sleipnir - The comments about ETS or other?

On ETS yes 100% - but I planned on submitting another PR for that later. Working on it now but will take a couple more days

For this PR I addressed all the comments from #513 (which I closed and resubmitted), including the formatting message above.


cgreeno commented Apr 23, 2026

@sleipnir - Any more comments I need to address?

FYI - I am almost done with the ETS table migration but don't want to submit a PR until I get this one approved. My end goal here is to get xDS implemented so we can move to weighted canary deployment rollouts. 💪


sleipnir commented Apr 24, 2026

Hi @cgreeno

I realized I left a couple of questions open earlier.

> We can replace them with ETS, but we can't forget that the responsibility for the pick lies with the load balancing module.

My concern is that the PR seems to bypass the existing load balancing mechanism. As far as I understand, there’s already a module responsible for handling the pick, but it doesn’t look like it’s being used here.

Could you confirm if that’s the case? If not, could you walk me through how your approach integrates with the current load balancing logic so I can better follow along in the review?


cgreeno commented Apr 24, 2026

Thanks @sleipnir!

LB isn't bypassed; here are the three places lb_mod.pick(lb_state) runs:

  1. Initial connect --> build_balanced_state/6 calls lb_mod.init then lb_mod.pick to choose the first channel (connection.ex:642-644)
  2. Refresh tick (every 15s) --> handle_info(:refresh, …) calls lb_mod.pick(lb_state) and updates the cached channel if the pick changed (connection.ex:310).
  3. After DNS re-resolution (new in this PR) --> rebalance_after_reconcile/3 calls lb_mod.init on the updated address set, then lb_mod.pick to choose a channel from the new pool (connection.ex:461-463).

So this PR actually adds an LB integration point (#3), it doesn't remove one.

The piece you might be looking at is pick_channel/2: on the hot path it reads the last-chosen channel from :persistent_term rather than calling lb_mod.pick per RPC (connection.ex:246). That read path is pre-existing on master; this PR didn't change it.

That's actually the thread I was pulling on with the ETS comment above, and your point then ("the responsibility for the pick lies with the load balancing module") is exactly right. The follow-up would land the fast-read path inside the LB behaviour itself, so pick_channel/2 goes through lb_mod.pick and the module owns its own storage. I'm keeping that out of this PR to keep the DNS re-resolution change focused.

So let me take you through what I considered:

  1. GenServer.call(via(ref), :pick) per RPC. Simple, no new primitive. Downside: every RPC serializes through the Connection mailbox. Under load this becomes a bottleneck and competes with orchestration work (refresh, re-resolution, disconnect) running in the same process. 5s default call timeouts start mattering. This rules itself out at any real traffic level.

  2. Keep persistent_term, also store lb_state + real_channels there. Lock-free reads on the hot path. Downside: persistent_term.put triggers a global GC pass across every process on the node. With periodic re-resolution (30s default) and N connections, that's N GC storms per 30s. This is the reason I flagged persistent_term on April 15; I don't think it scales with the new write frequency this PR introduces.

  3. :counters / :atomics inside the LB module. Lock-free, atomic, built-in. Perfect for RoundRobin's index. Downside: narrow. Solves the index-rotation case and nothing else. real_channels still needs a lock-free concurrent map (counters can't do lookups), so you'd pair it with ETS anyway. And it doesn't generalize to the LB strategies coming next — weighted round-robin, least-requests, health-aware routing, xDS, all of which need richer per-backend state than a single counter

  4. ETS owned by the LB module + ETS for real_channels. LB module owns its table, pick/1 does atomic work via :ets.update_counter, the hot path reads lock-free. Downside: More moving parts than persistent_term. But it's the only option that handles all four cases: per-RPC LB decision, real_channels lookup, frequent writes without global GC, and generalizes to future LB strategies.
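A sketch of what option 4 could look like, under assumed names — `RoundRobinETS` and its functions are illustrative, not the follow-up PR's actual API. The point is that the LB module owns its ETS table, so pick stays inside the LB behaviour and the hot path is lock-free:

```elixir
# Sketch of option 4: the LB module owns its own ETS storage.
defmodule RoundRobinETS do
  def init(channels) when is_list(channels) do
    t = :ets.new(__MODULE__, [:set, :public, read_concurrency: true])
    :ets.insert(t, {:channels, List.to_tuple(channels)})
    :ets.insert(t, {:index, -1})
    %{table: t}
  end

  # Atomic rotate via {pos, incr, threshold, set_value} — one op, no races.
  def pick(%{table: t}) do
    [{:channels, channels}] = :ets.lookup(t, :channels)

    if tuple_size(channels) == 0 do
      {:error, :no_connection}
    else
      idx = :ets.update_counter(t, :index, {2, 1, tuple_size(channels) - 1, 0})
      {:ok, elem(channels, idx)}
    end
  end
end
```

With this shape, pick_channel/2 could simply call lb_mod.pick(lb_state) per RPC and reconciliation would rewrite the :channels tuple in place.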

I have a follow-up PR that does option 4. It satisfies your "LB module owns the pick" point, removes the persistent_term write-frequency issue I flagged, and keeps the hot path lock-free. I'm keeping it out of this PR to keep the DNS re-resolution change focused.

I created a draft PR of what I am thinking long term 👇 - Feedback more than welcome :)
