feat: periodic DNS re-resolution for client-side service discovery#520

Open
cgreeno wants to merge 2 commits into elixir-grpc:master from cgreeno:feature/periodic-dns-re-resolution-v2

Conversation

@cgreeno cgreeno commented Apr 13, 2026

Summary

Rebased version of #513 onto current master (after grpc_client/grpc/ rename in 1.0.0-rc.1).

Adds periodic DNS re-resolution to the gRPC client, enabling automatic discovery of new backends and removal of stale ones. Essential for Kubernetes deployments where pods scale up/down behind headless services, matching the behaviour of grpc-go and grpc-java.

Architecture

Follows @sleipnir's design from the #513 review: Connection remains the orchestrator, delegating to resolvers via behaviour callbacks.

The GRPC.Client.Resolver behaviour gains three optional callbacks:

  • init/2 — called by Connection during startup, lets the resolver initialize internal state (start workers, open streams, etc.)
  • update/2 — Connection notifies the resolver about events (:resolve_now, future: disconnections, rebalancing)
  • shutdown/1 — cleanup on disconnect

Static resolvers (IPv4, Unix) don't implement them — fully backwards compatible via @optional_callbacks.

The DNS resolver implements these callbacks to start an internal DNSResolver worker process. Connection doesn't know the implementation details — it just calls the behaviour.
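The expanded behaviour might look roughly like this — a sketch only; the callback names and arities come from the PR description, but the typespecs (and the existing resolve callback's exact signature in resolver.ex) are assumptions:

```elixir
# Sketch of the expanded behaviour. Callback names/arities are from the PR
# description; the typespecs and resolve/2 shape are illustrative assumptions.
defmodule GRPC.Client.Resolver do
  @doc "Resolve a target to a list of addresses (existing required callback)."
  @callback resolve(target :: String.t(), opts :: keyword()) ::
              {:ok, [addr :: term()]} | {:error, term()}

  @doc "Called by Connection during startup; may start workers, open streams, etc."
  @callback init(target :: String.t(), opts :: keyword()) ::
              {:ok, state :: term()} | {:error, term()}

  @doc "Connection notifies the resolver about events such as :resolve_now."
  @callback update(event :: term(), state :: term()) :: {:ok, term()}

  @doc "Cleanup on disconnect."
  @callback shutdown(state :: term()) :: :ok

  # Static resolvers (IPv4, Unix) can omit all three — backwards compatible.
  @optional_callbacks init: 2, update: 2, shutdown: 1
end
```

Because the three new callbacks are declared in `@optional_callbacks`, existing resolver modules compile and run unchanged.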

What's included

Resolver behaviour expansion (resolver.ex):

  • Optional init/2, update/2, shutdown/1 callbacks

DNS resolver (resolver/dns.ex + dns_resolver.ex):

  • DNS.init/2 starts a linked worker for periodic re-resolution
  • Worker owns the timer loop, exponential backoff, rate limiting, and telemetry
  • Sends {:resolver_update, result} back to Connection

Connection changes (connection.ex):

  • Delegates to resolver behaviour instead of managing processes directly
  • Channel reconciliation: new backends get channels, removed backends get disconnected
  • LB re-initialized with updated addresses after each reconciliation
  • Fallback to healthy channel when LB picks a failed one
  • Retry previously failed channels on subsequent cycles
  • Re-initializes resolver automatically on worker crash
  • resolve_now/1 public API via GenServer.cast + resolver.update/2
  • Channel state uses {:connected, ch} / {:failed, reason} for clarity
  • persistent_term writes guarded to avoid redundant global GC passes

Opt-in via dns:// scheme — only DNS targets trigger re-resolution. No changes to static targets.
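The channel-reconciliation step described above can be sketched as follows. This is an illustrative standalone version, not the code from connection.ex — `ReconcileSketch` and the connect/disconnect callbacks are hypothetical stand-ins for the adapter calls:

```elixir
# Sketch of reconciliation, assuming real_channels maps each address to
# {:connected, ch} | {:failed, reason}. Helper names are illustrative.
defmodule ReconcileSketch do
  def reconcile(real_channels, new_addrs, connect_fun, disconnect_fun) do
    current = real_channels |> Map.keys() |> MapSet.new()
    desired = MapSet.new(new_addrs)

    # New backends get channels...
    added =
      for addr <- MapSet.difference(desired, current), into: %{} do
        case connect_fun.(addr) do
          {:ok, ch} -> {addr, {:connected, ch}}
          {:error, reason} -> {addr, {:failed, reason}}
        end
      end

    # ...removed backends get disconnected.
    for addr <- MapSet.difference(current, desired) do
      with {:connected, ch} <- real_channels[addr], do: disconnect_fun.(ch)
    end

    real_channels |> Map.take(new_addrs) |> Map.merge(added)
  end
end
```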

New connect/2 options

Option                  Default   Description
:resolve_interval       30000     Base DNS re-resolution interval (ms)
:max_resolve_interval   300000    Backoff cap (ms)
:min_resolve_interval   5000      Rate-limit floor for resolve_now/1 (ms)
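Assuming the library's usual GRPC.Stub.connect/2 entry point, opting in might look like this (illustrative — the option names come from the table above, and the target hostname is made up):

```elixir
# Illustrative usage: only dns:// targets trigger periodic re-resolution.
{:ok, channel} =
  GRPC.Stub.connect("dns://my-svc.default.svc.cluster.local:50051",
    resolve_interval: 30_000,       # base re-resolution interval (ms)
    max_resolve_interval: 300_000,  # backoff cap (ms)
    min_resolve_interval: 5_000     # rate-limit floor for resolve_now/1 (ms)
  )
```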

Note on persistent_term

Updates to :persistent_term trigger a global GC pass across all BEAM processes. We guard writes to only happen when the picked channel actually changes. A future improvement would be migrating to ETS with read_concurrency: true.
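The guard amounts to a compare-before-put. A minimal sketch, with an illustrative key shape (not the PR's actual key):

```elixir
# Sketch of the write guard: only touch :persistent_term when the value
# actually changes, since every put triggers a global GC pass.
defmodule GuardedPut do
  def put_if_changed(key, new_value) do
    case :persistent_term.get(key, :undefined) do
      ^new_value ->
        :unchanged                            # skip the write — no global GC

      _ ->
        :persistent_term.put(key, new_value)  # pay the GC cost only on change
        :updated
    end
  end
end
```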

Test coverage (30 tests)

Covers scale-up/down, complete replacement, DNS failure/recovery, empty addresses, pick_channel stability, repeated cycles, non-DNS targets, disconnect cleanup, LB errors, port changes, exponential backoff (doubles/resets/caps), rate limiting, telemetry events, stale persistent_term prevention, failed channel retry, slow DNS non-blocking, worker crash recovery, and resolver init with nil state.

Test plan

  • 30 re-resolution tests pass (5 consecutive runs, 0 flakes)
  • Full test suite: 255 tests, 0 failures
  • End-to-end tested with Docker Compose (scale up/down/outage/recovery)
  • End-to-end tested on Kubernetes with minikube + headless service + CoreDNS

Supersedes #513 (rebased after directory rename).


cgreeno commented Apr 13, 2026

@sleipnir How do you see Health checker working? Do you see Health checking as a different concern from Resolution?

Something like:

defstruct virtual_channel: nil,
          real_channels: %{},
          lb_mod: nil,
          lb_state: nil,
          resolver: nil,
          adapter: GRPC.Client.Adapters.Gun,
          resolver_state: nil,       # opaque — DNS worker, xDS stream, whatever
          health_checker: nil,       # module
          health_checker_state: nil, # opaque — per-backend check processes
          connect_opts: []


cgreeno commented Apr 13, 2026

@sleipnir Everything is addressed except disconnection event forwarding, which I will follow up on after this PR... the plumbing (the update/2 callback) is in place, but we haven't wired :refresh to call it on dead channels yet. Assume it is OK to address after?

@sleipnir (Collaborator)

Hi @cgreeno, I will review it later today. But it needs formatting.


cgreeno commented Apr 13, 2026

Ya, sorry, I was in a rush. Let me stop being lazy!

Adds periodic DNS re-resolution to the gRPC client, enabling automatic
discovery of new backends and removal of stale ones. Essential for
Kubernetes deployments where pods scale behind headless services.

Architecture follows @sleipnir's design: Connection remains the
orchestrator, delegating to resolvers via behaviour callbacks. The
Resolver behaviour gains optional init/2, update/2, shutdown/1
callbacks. DNS resolver implements these to start an internal worker
process — Connection doesn't know the implementation details.

Key changes:

- GRPC.Client.Resolver: optional init/update/shutdown callbacks
- GRPC.Client.Resolver.DNS: implements callbacks, starts DNSResolver
  worker for periodic re-resolution with backoff and rate limiting
- GRPC.Client.DNSResolver: linked GenServer owning the resolve loop,
  sends {:resolver_update, result} back to Connection
- GRPC.Client.Connection: delegates to resolver behaviour, handles
  channel reconciliation, re-inits resolver on worker crash
- Channel state uses {:connected, ch}/{:failed, reason} tuples
- reconcile_channels extracted into focused helper functions
- persistent_term writes guarded to avoid redundant global GC
- Telemetry: [:grpc, :client, :resolve, :stop] and [:error] events

New connect/2 options: resolve_interval, max_resolve_interval,
min_resolve_interval. Public API: resolve_now/1 for on-demand
re-resolution.

Rebased onto master after grpc_client/ → grpc/ directory rename.

30 tests covering scale-up/down, failure/recovery, backoff, rate
limiting, telemetry, worker crash recovery, and edge cases.
cgreeno force-pushed the feature/periodic-dns-re-resolution-v2 branch from b0b34c9 to aaefd21 on April 13, 2026 at 19:29

cgreeno commented Apr 14, 2026

@sleipnir did you get a chance to review?


cgreeno commented Apr 15, 2026

@sleipnir BTW - I really think we need to address the persistent_term global GC issue. If we are starting to write to persistent_term more often, it is going to become a problem. BUT I didn't want to do it until I had some alignment with you.

Doing some research on how others do LB and round-robin, it seems most other open-source libs use :ets.update_counter/4, something like so:

# Setup:
:ets.new(:lb_state, [:set, :public, :named_table, read_concurrency: true])
:ets.insert(:lb_state, {{ref, :channels}, {ch1, ch2, ch3}})
:ets.insert(:lb_state, {{ref, :index}, 0})

# pick_channel (per request, no GenServer, no lock):
def pick_channel(%Channel{ref: ref}, _opts \\ []) do
  case :ets.lookup(:lb_state, {ref, :channels}) do
    [{{^ref, :channels}, channels}] when tuple_size(channels) > 0 ->
      idx = :ets.update_counter(:lb_state, {ref, :index}, {2, 1})
      {:ok, elem(channels, rem(idx, tuple_size(channels)))}

    _ ->
      {:error, :no_connection}
  end
end

:ets.update_counter(table, key, {pos, increment, threshold, set_value}) atomically increments, and when the result exceeds the threshold it wraps to the set value. One operation, no race conditions.
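A minimal standalone demonstration of that wrap-around form (table and key names here are throwaway, not tied to the sketch above):

```elixir
# {pos, incr, threshold, set_value}: increments atomically; when the
# incremented result exceeds threshold, the counter is set to set_value.
t = :ets.new(:demo_counter, [:set, :public])
:ets.insert(t, {:idx, 0})

values = for _ <- 1..5, do: :ets.update_counter(t, :idx, {2, 1, 3, 0})
# values == [1, 2, 3, 0, 1] — wraps back to 0 after exceeding 3
```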

Here is the research if you want to take a look


cgreeno commented Apr 15, 2026

@sleipnir how is the review? Any feedback or anything you want me to change?

@sleipnir (Collaborator)

> @sleipnir how is the review? Any feedback or anything you want me to change?

Hi @cgreeno I'm a bit busy with other work, but I haven't forgotten about you. I'll review and respond as soon as I can. Overall, I didn't see anything wrong, but I haven't looked at it carefully yet.

@sleipnir (Collaborator)

# Setup:                                                                                                   
  :ets.new(:lb_state, [:set, :public, :named_table, read_concurrency: true])
  :ets.insert(:lb_state, {{ref, :channels}, {ch1, ch2, ch3}})                                                                                                             
  :ets.insert(:lb_state, {{ref, :index}, 0})
                                                                                                                                                                          
  # pick_channel (per request, no GenServer, no lock):                                                       
  def pick_channel(%Channel{ref: ref}, _opts \\ []) do                                                                                                                    
    case :ets.lookup(:lb_state, {ref, :channels}) do                                                         
      [{{^ref, :channels}, channels}] when tuple_size(channels) > 0 ->                                                                                                    
        idx = :ets.update_counter(:lb_state, {ref, :index}, {2, 1})
        {:ok, elem(channels, rem(idx, tuple_size(channels)))}                                                                                                             
                                                                                                             
      _ ->                                                                                                                                                                
        {:error, :no_connection}                                                                             
    end
  end

Just remember that there is an existing load balancing strategy: lb_mod.pick(lb_state)

@sleipnir (Collaborator)

> Just remember that there is an existing load balancing strategy: lb_mod.pick(lb_state)

We can replace them with ETS, but we can't forget that the responsibility for the pick lies with the load balancing module.


cgreeno commented Apr 22, 2026

@sleipnir - I know you're probably busy, sorry to bug you again; just a gentle nudge to see if you had a chance to review. :)

@sleipnir (Collaborator)

> @sleipnir - I know you're probably busy, sorry to bug you again, just a gentle nudge to see if you had a chance to review. :)

Hi, yes, a little busy, but we'll get there.

Did you take a look at my comments?


cgreeno commented Apr 22, 2026

> Did you take a look at my comments?

@sleipnir - The comments about ETS or other?

On ETS yes 100% - but I planned on submitting another PR for that later. Working on it now but will take a couple more days

For this PR I addressed all the comments from #513 (which I closed and resubmitted), including the formatting message above.


cgreeno commented Apr 23, 2026

@sleipnir - Any more comments I need to address?

FYI - I am almost done with the ETS table migration but don't want to submit a PR until I get this one approved. My end goal here is to get xDS implemented so we can move to weighted canary deployment rollouts. 💪


sleipnir commented Apr 24, 2026

Hi @cgreeno

I realized I left a couple of questions open earlier.

> We can replace them with ETS, but we can't forget that the responsibility for the pick lies with the load balancing module.

My concern is that the PR seems to bypass the existing load balancing mechanism. As far as I understand, there’s already a module responsible for handling the pick, but it doesn’t look like it’s being used here.

Could you confirm if that’s the case? If not, could you walk me through how your approach integrates with the current load balancing logic so I can better follow along in the review?


cgreeno commented Apr 24, 2026

Thanks @sleipnir!

LB isn't bypassed; here are the three places lb_mod.pick(lb_state) runs:

  1. Initial connect --> build_balanced_state/6 calls lb_mod.init then lb_mod.pick to choose the first channel (connection.ex:642-644)
  2. Refresh tick (every 15s) --> handle_info(:refresh, …) calls lb_mod.pick(lb_state) and updates the cached channel if the pick changed (connection.ex:310).
  3. After DNS re-resolution (new in this PR) --> rebalance_after_reconcile/3 calls lb_mod.init on the updated address set, then lb_mod.pick to choose a channel from the new pool (connection.ex:461-463).

So this PR actually adds an LB integration point (#3), it doesn't remove one.

The piece you might be looking at is pick_channel/2: on the hot path it reads the last-chosen channel from :persistent_term rather than calling lb_mod.pick per RPC (connection.ex:246). That read path is pre-existing on master; this PR didn't change it.

That's actually the thread I was pulling on with the ETS comment above, and your point then ("the responsibility for the pick lies with the load balancing module") is exactly right. The follow-up would land the fast-read path inside the LB behaviour itself, so pick_channel/2 goes through lb_mod.pick and the module owns its own storage. I'm keeping that out of this PR to keep the DNS re-resolution change focused.

So let me take you through what I considered:

  1. GenServer.call(via(ref), :pick) per RPC. Simple, no new primitive. Downside: every RPC serializes through the Connection mailbox. Under load this becomes a bottleneck and competes with orchestration work (refresh, re-resolution, disconnect) running in the same process. 5s default call timeouts start mattering. This rules itself out at any real traffic level.

  2. Keep persistent_term, also store lb_state + real_channels there. Lock-free reads on the hot path. Downside: persistent_term.put triggers a global GC pass across every process on the node. With periodic re-resolution (30s default) and N connections, that's N GC storms per 30s. This is the reason I flagged persistent_term on April 15; I don't think it scales with the new write frequency this PR introduces.

  3. :counters / :atomics inside the LB module. Lock-free, atomic, built-in. Perfect for RoundRobin's index. Downside: narrow. Solves the index-rotation case and nothing else. real_channels still needs a lock-free concurrent map (counters can't do lookups), so you'd pair it with ETS anyway. And it doesn't generalize to the LB strategies coming next — weighted round-robin, least-requests, health-aware routing, xDS, all of which need richer per-backend state than a single counter

  4. ETS owned by the LB module + ETS for real_channels. LB module owns its table, pick/1 does atomic work via :ets.update_counter, the hot path reads lock-free. Downside: More moving parts than persistent_term. But it's the only option that handles all four cases: per-RPC LB decision, real_channels lookup, frequent writes without global GC, and generalizes to future LB strategies.
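A sketch of what option 4 could look like, under assumed names — `RoundRobinETS` and its functions are illustrative, not the follow-up PR's actual API. The point is that the LB module owns its ETS table, so pick stays inside the LB behaviour and the hot path is lock-free:

```elixir
# Sketch of option 4: the LB module owns its own ETS storage.
defmodule RoundRobinETS do
  def init(channels) when is_list(channels) do
    t = :ets.new(__MODULE__, [:set, :public, read_concurrency: true])
    :ets.insert(t, {:channels, List.to_tuple(channels)})
    :ets.insert(t, {:index, -1})
    %{table: t}
  end

  # Atomic rotate via {pos, incr, threshold, set_value} — one op, no races.
  def pick(%{table: t}) do
    [{:channels, channels}] = :ets.lookup(t, :channels)

    if tuple_size(channels) == 0 do
      {:error, :no_connection}
    else
      idx = :ets.update_counter(t, :index, {2, 1, tuple_size(channels) - 1, 0})
      {:ok, elem(channels, idx)}
    end
  end
end
```

With this shape, pick_channel/2 could simply call lb_mod.pick(lb_state) per RPC and reconciliation would rewrite the :channels tuple in place.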

I have a follow-up PR that does option 4. It satisfies your "LB module owns the pick" point, removes the persistent_term write-frequency issue I flagged, and keeps the hot path lock-free. I'm keeping it out of this PR to keep the DNS re-resolution change focused.

I created a draft PR of what I am thinking long term 👇 - Feedback more than welcome :)
