Skip to content

Make statistics a mandatory arg for statistics providers#1017

Closed
nirandaperera wants to merge 16 commits into
rapidsai:mainfrom
nirandaperera:stat-mandatory-arg-for-providers
Closed

Make statistics a mandatory arg for statistics providers#1017
nirandaperera wants to merge 16 commits into
rapidsai:mainfrom
nirandaperera:stat-mandatory-arg-for-providers

Conversation

@nirandaperera

@nirandaperera nirandaperera commented May 7, 2026

Copy link
Copy Markdown
Contributor

Make Statistics a mandatory constructor argument for providers

Follow-up to #1009. After the initial stats cleanup, BufferResource, ProgressThread, and streaming::Context still defaulted their statistics parameter to Statistics::disabled(), which made it easy to silently drop stats by forgetting to pass an instance and obscured ownership at the call site (e.g. communicator vs. progress thread vs. buffer resource each potentially holding a different Statistics).

In this PR:

  • BufferResource(...)/BufferResource::from_options(...) now require an explicit Statistics (passed as the first ctor arg); ProgressThread(...) does too. Communicator stores a shared_ptr<ProgressThread> and forwards statistics() to it, so Single/MPI/UCXX drop their own progress-thread members. Tests, examples, and Python bindings are updated; the Python Communicator gains a statistics property.
  • New MutableStatisticsProvider concept and ProgressThread::set_statistics() (wired through to Python) allow swapping the instance at runtime; the progress loop is paused around the swap.
  • MemoryRecorder and StreamOrderedTiming now hold Statistics weakly so a swap can destroy the original even with outstanding scopes; the destructors lock and skip publishing when the instance has expired.
  • Benchmarks default to stats enabled (previously Statistics::disabled() / from_options defaulting to off) and use clear() before the final run to keep per-run reporting.

Depends on #1009

Related to #1008

@nirandaperera nirandaperera requested review from a team as code owners May 7, 2026 21:20
@nirandaperera nirandaperera added breaking Introduces a breaking change improvement Improves an existing functionality labels May 7, 2026
@nirandaperera nirandaperera marked this pull request as draft May 7, 2026 21:20
@copy-pr-bot

copy-pr-bot Bot commented May 7, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@nirandaperera nirandaperera force-pushed the stat-mandatory-arg-for-providers branch 2 times, most recently from afd1218 to 3b2d2da Compare May 26, 2026 22:56
@nirandaperera

Copy link
Copy Markdown
Contributor Author

/ok to test e5214b4

@nirandaperera nirandaperera force-pushed the stat-mandatory-arg-for-providers branch from e5214b4 to 9445b50 Compare May 27, 2026 20:57
@nirandaperera nirandaperera marked this pull request as ready for review May 27, 2026 20:58
Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera nirandaperera changed the title [WIP] Make statistics a mandatory arg for statistics providers Make statistics a mandatory arg for statistics providers May 27, 2026
Signed-off-by: niranda perera <niranda.perera@gmail.com>
This reverts commit a6803e6.
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Comment thread cpp/benchmarks/bench_comm.cpp
Comment thread cpp/src/stream_ordered_timing.cpp Outdated
Comment thread cpp/benchmarks/streaming/bench_streaming_shuffle.cpp Outdated
Comment on lines +327 to +328
auto stats = rapidsmpf::Statistics::from_options(options);
auto progress_thread = std::make_shared<rapidsmpf::ProgressThread>(stats);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the existing behavior, I think unintentionally, from statistics being enabled by default to disabled by default. As is, users now require specifying RAPIDSMPF_STATISTICS=1, which I think is undesirable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all benchmarks should have enabled statistics. So yes, I think you should create() the stats. Rather than from_options()ing them

Comment thread python/rapidsmpf/rapidsmpf/memory/buffer_resource.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/statistics.pyx Outdated
Comment thread cpp/benchmarks/bench_comm.cpp
Comment thread cpp/include/rapidsmpf/statistics.hpp Outdated
Comment on lines +147 to +164
/**
* @brief Resets the current state of this instance based on @p new_options.
*
* @warning Live `MemoryRecorder` instances (constructed before this call
* but not yet destroyed) are *not* tracked. Their destructors will run
* after the reset and publish their scope into the freshly cleared
* `memory_records_` map, mixing pre-reset measurements into the new
* session. Likewise, a `StreamOrderedTiming` whose stop callback is
* already mid-execution at reset time may still record one stale stat
* (its global-map entry was extracted before `cancel_inflight_timings`
* could remove it). Callers that need clean boundaries must quiesce
* recorders and synchronise the relevant CUDA streams before calling.
*
* @param new_options Configuration options whose `"statistics"` field
* controls the new enabled state.
*/
void reset_from_options(config::Options new_options);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you need this for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- In cudf-polars engines, there is a _reset_worker(...options) method. There, we are only resetting buffer resource, context, etc with a new stats object. Then, after calling _reset_worker, we would have multiple instances of statistics in the engine ([commuicator, progress thread] vs [buffer resource, context])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why do we need to reset the statistics in this setup? When is _reset_worker called? Is it only for tests? For every query, something else?

What do we actually need to achieve here? It seems like we want to zero the statistics object that we have around no?

@nirandaperera nirandaperera May 28, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking into two options.

  1. Introduce set_stats for communicator and progress thread and reset the new stats object
  2. reset the stats object in-place (clear + enable/disable based on new options)

I think 1. is the safer option.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the mechanism for how to reset stats. I am asking a different question: why do we want to reset stats?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, _reset is only called in the tests (it is a "hack" to workaround that booting a dask or ray cluster is expensive), and I do not think that we have to tie ourselves in knots to publicly support that api also being able to reset statistics. Note that the docstring of _reset already says that it doesn't reset and reinitialize with all the options.

I think we have enough rope with .enable() and .clear() on the stats object, without introducing a new thing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. Let's do that.

@nirandaperera nirandaperera May 28, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt clear() was insufficient because it only clears the stats, but not the memory records and report entries. So, we can have a scenario where we create a new Buffer resource with new rmm resource adapter, but stats contains old memory records from a previous buffer resource.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, but that problem can be fixed (separately) by clearing memory records

Comment thread cpp/src/communicator/communicator.cpp
Comment on lines +327 to +328
auto stats = rapidsmpf::Statistics::from_options(options);
auto progress_thread = std::make_shared<rapidsmpf::ProgressThread>(stats);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all benchmarks should have enabled statistics. So yes, I think you should create() the stats. Rather than from_options()ing them

if (i == total_num_runs - 1) {
ctx->statistics()->clear();
stats->clear();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the ctx not advertise a statistics object any more? That seems wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, but its the same as stats in this scope

@wence-

wence- commented May 28, 2026

Copy link
Copy Markdown
Contributor

Some broader comments:

  1. I still think that statistics providers should return a shared_ptr<Statistics> const&. That way the caller can either upgrade to an owning shared_ptr or just keep a reference. The converse is not possible: if we return a shared_ptr<Statistics> there is no way to downgrade it to a reference.
  2. Everything that has a Statistics object "somewhere" should be a statistics provider.

@pentschev

Copy link
Copy Markdown
Member
  • I still think that statistics providers should return a shared_ptr<Statistics> const&. That way the caller can either upgrade to an owning shared_ptr or just keep a reference. The converse is not possible: if we return a shared_ptr<Statistics> there is no way to downgrade it to a reference.

What I don't like about returning a const-ref to a shared_ptr is that it's easy to overlook and assume you're getting a copy with potential for use-after-free. One can argue that's true for all const-ref, but a smart pointer falls under a "safe" category in my mental model at least.

@wence-

wence- commented May 28, 2026

Copy link
Copy Markdown
Contributor
  • I still think that statistics providers should return a shared_ptr<Statistics> const&. That way the caller can either upgrade to an owning shared_ptr or just keep a reference. The converse is not possible: if we return a shared_ptr<Statistics> there is no way to downgrade it to a reference.

What I don't like about returning a const-ref to a shared_ptr is that it's easy to overlook and assume you're getting a copy with potential for use-after-free. One can argue that's true for all const-ref, but a smart pointer falls under a "safe" category in my mental model at least.

So there are some usual options, I think:

struct Foo {
    std::shared_ptr<Statistics> const& statistics();
};

struct Owns {
   std::shared_ptr<Statistics> stats_;
};

Owns takes_ref(std::shared_ptr<Statistics> const& stat) {
   return Owns{stat};
}
...

auto stats = std::make_shared<Statistics>();
Foo foo{stats};

foo.statistics()->some_stats_method(); // This can never be use-after-free because foo is alive

auto owning = takes_ref(foo.statistics()); // transparently upgraded in Owns ctor

auto owning_stat = foo.statistics(); // owning

auto &ref_stat = foo.statistics(); // reference, but obvious

@pentschev

Copy link
Copy Markdown
Member

So there are some usual options, I think:

struct Foo {
    std::shared_ptr<Statistics> const& statistics();
};

struct Owns {
   std::shared_ptr<Statistics> stats_;
};

Owns takes_ref(std::shared_ptr<Statistics> const& stat) {
   return Owns{stat};
}
...

auto stats = std::make_shared<Statistics>();
Foo foo{stats};

foo.statistics()->some_stats_method(); // This can never be use-after-free because foo is alive

auto owning = takes_ref(foo.statistics()); // transparently upgraded in Owns ctor

auto owning_stat = foo.statistics(); // owning

auto &ref_stat = foo.statistics(); // reference, but obvious

That looks reasonable to me.

@madsbk

madsbk commented May 28, 2026

Copy link
Copy Markdown
Member

But you only get a reference if you explicitly assign to one, right?

auto& ref_stat = foo.statistics();  // reference, but explicit

Please, let's not introduce a special "reference semantics for smart pointers" concept like:

Owns takes_ref(std::shared_ptr<Statistics> const& stat) {
    return Owns{stat};
}

If you assign to a reference, you are explicitly choosing to take on the associated lifetime risk. That is already true for references in general, and arguably even more so for non-const references. So if someone is not confident about the lifetime guarantees, they simply should not bind to a reference.

@wence-

wence- commented May 28, 2026

Copy link
Copy Markdown
Contributor

But you only get a reference if you explicitly assign to one, right?

auto& ref_stat = foo.statistics();  // reference, but explicit

Yes. You have to request a reference explicitly to get a reference

@wence-

wence- commented May 28, 2026

Copy link
Copy Markdown
Contributor

Note, I too do not advocate the "takes a reference and upgrades it some time later" model.

@nirandaperera

Copy link
Copy Markdown
Contributor Author

@wence- @pentschev I am marking this as draft because, I want to revisit MutableStatisticsProvider, set_statistics() approach.
a6803e6

I think this is a more safer approach.

@wence- wence- marked this pull request as draft May 28, 2026 17:07
2ce411f
891af67

Signed-off-by: niranda perera <niranda.perera@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
…ory-arg-for-providers

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera nirandaperera marked this pull request as ready for review May 28, 2026 22:05
@nirandaperera nirandaperera requested a review from wence- May 28, 2026 22:05
nirandaperera and others added 3 commits May 28, 2026 15:17
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
…perera/rapidsmpf into stat-mandatory-arg-for-providers
@nirandaperera nirandaperera requested a review from pentschev May 28, 2026 22:20
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>

Warnings
--------
Concurrent calls to ``set_statistics`` will result in undefined behavior.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this warning not documented in the C++ docstring?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry about that, don't know how I missed it. 😅

Comment on lines +187 to +203
/**
* @brief Replace the statistics instance held by this progress thread.
*
* This method pauses the progress loop for the duration of the swap so the loop
* cannot dereference `statistics_` while it is being replaced, then restores the
* prior running state.
*
* @param statistics The new statistics instance. Pass `Statistics::disabled()` to opt
* out of statistics collection.
*
* @throws std::invalid_argument If @p statistics is null. Validated
* before the pause so a null argument does not leave the progress loop
* paused.
*
* @warning Concurrent calls to `set_statistics()` will result in undefined behavior.
*/
void set_statistics(std::shared_ptr<Statistics> statistics);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes ProgressThread a mutable provider, but existing Statistics handles are snapshots. A caller can do auto stats = progress->statistics(), then another path calls set_statistics(new_stats), and the caller continues recording/reporting against the old object while ProgressThread and Communicator report the new one. I think this reintroduces the "which stats object owns this?" ambiguity the PR is trying to remove. If runtime reconfiguration is needed, mutating the existing Statistics instance is safer than swapping identity. If swapping remains, docs/tests should explicitly state that previously returned handles are not retargeted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. This is how I looked at this.
Currently, Statistics only gets reset when we reset the engine. While running, we initiate async stat recording, like Memory records, stream ordered timings etc. Recording only happens at a later point, and there could be a scenario where the recording happens after a reset. If stat object is the same, we might end up with some stats from before.
This PR handles it by handing off a weak ptr to async stats recording, and if the ptr has expired, recording is skipped.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but what will happen to those objects that rely on the Statistics to be still alive? Do they have a mechanism to eventually get a new one, or will they just fall out of the Statistics loop forever?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pentschev If they rely on Statistics, it could be a problem. If they want to get the most current statistics, they would have to keep a strong reference to long-living stats provider like ProgressThread.

will they just fall out of the Statistics loop forever?

Yes. At least in current usages, IMO it will be safe to miss some recordings on the edge cases.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the edge cases?

Currently, Statistics only gets reset when we reset the engine.

Or more generically, when is the engine expected to reset? I assume be "engine reset" you're referring to cudf-polars' engine, is that right? One case where reset may occur is part of the testing framework, is that what you have in mind? What would other cases be?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, edge cases I was referring to was, in-flight stats recordings while engine is being reset.

During tests, I think this is not happening, because we have a long running communicator/ progress thread in the test environment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, can you please at a minimum document that previously returned handles are not retargeted?

Comment thread cpp/src/progress_thread.cpp Outdated
Comment thread cpp/benchmarks/streaming/bench_streaming_shuffle.cpp Outdated
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Comment on lines +683 to +685
* @warning Concurrent calls to `set_statistics()` are implementation-defined and
* likely result in undefined behavior unless the implementation explicitly
* documents otherwise. Callers should assume external synchronization is required.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's not leave this hanging on implementation details and just say this is not thread-safe/concurrent calls are UB.


Warnings
--------
Concurrent calls to ``set_statistics`` will result in undefined behavior.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry about that, don't know how I missed it. 😅

Comment on lines +187 to +203
/**
* @brief Replace the statistics instance held by this progress thread.
*
* This method pauses the progress loop for the duration of the swap so the loop
* cannot dereference `statistics_` while it is being replaced, then restores the
* prior running state.
*
* @param statistics The new statistics instance. Pass `Statistics::disabled()` to opt
* out of statistics collection.
*
* @throws std::invalid_argument If @p statistics is null. Validated
* before the pause so a null argument does not leave the progress loop
* paused.
*
* @warning Concurrent calls to `set_statistics()` will result in undefined behavior.
*/
void set_statistics(std::shared_ptr<Statistics> statistics);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but what will happen to those objects that rely on the Statistics to be still alive? Do they have a mechanism to eventually get a new one, or will they just fall out of the Statistics loop forever?

@nirandaperera nirandaperera requested review from a team as code owners June 1, 2026 20:37
Comment thread .github/workflows/pr.yaml
- '!.github/release.yml'
- '!.pre-commit-config.yaml'
- '!.shellcheckrc'
- '!SECURITY.md'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's surprising to see these changes from #1064 showing up in the diff again. Those are already out on main.

Same with the cupy changes throughout packaging files. I suspect something has gone wrong with resolving a merge conflict here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes! I think I've messed up bigtime during a merge..

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the merge mess @jameslamb

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great, thanks very much. Commit history looks much better now, and you no longer need ci-codeowners / packaging-codeowners so I'll unsubscribe from this.

@ me back if you need my help with anything.

…ory-arg-for-providers

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera nirandaperera force-pushed the stat-mandatory-arg-for-providers branch from e8d1b76 to 30df058 Compare June 2, 2026 18:33
@jameslamb jameslamb removed the request for review from KyleFromNVIDIA June 3, 2026 21:34
@nirandaperera

Copy link
Copy Markdown
Contributor Author

closing in favor of #1082

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Introduces a breaking change improvement Improves an existing functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants