Make statistics a mandatory arg for statistics providers#1017
Make statistics a mandatory arg for statistics providers#1017nirandaperera wants to merge 16 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
afd1218 to
3b2d2da
Compare
|
/ok to test e5214b4 |
e5214b4 to
9445b50
Compare
Signed-off-by: niranda perera <niranda.perera@gmail.com>
This reverts commit a6803e6.
Signed-off-by: niranda perera <niranda.perera@gmail.com>
| auto stats = rapidsmpf::Statistics::from_options(options); | ||
| auto progress_thread = std::make_shared<rapidsmpf::ProgressThread>(stats); |
There was a problem hiding this comment.
This changes the existing behavior, I think unintentionally, from statistics being enabled by default to disabled by default. As is, users now require specifying RAPIDSMPF_STATISTICS=1, which I think is undesirable.
There was a problem hiding this comment.
I think all benchmarks should have enabled statistics. So yes, I think you should create() the stats. Rather than from_options()ing them
| /** | ||
| * @brief Resets the current state of this instance based on @p new_options. | ||
| * | ||
| * @warning Live `MemoryRecorder` instances (constructed before this call | ||
| * but not yet destroyed) are *not* tracked. Their destructors will run | ||
| * after the reset and publish their scope into the freshly cleared | ||
| * `memory_records_` map, mixing pre-reset measurements into the new | ||
| * session. Likewise, a `StreamOrderedTiming` whose stop callback is | ||
| * already mid-execution at reset time may still record one stale stat | ||
| * (its global-map entry was extracted before `cancel_inflight_timings` | ||
| * could remove it). Callers that need clean boundaries must quiesce | ||
| * recorders and synchronise the relevant CUDA streams before calling. | ||
| * | ||
| * @param new_options Configuration options whose `"statistics"` field | ||
| * controls the new enabled state. | ||
| */ | ||
| void reset_from_options(config::Options new_options); | ||
|
|
There was a problem hiding this comment.
@wence- In cudf-polars engines, there is a _reset_worker(...options) method. There, we are only resetting buffer resource, context, etc with a new stats object. Then, after calling _reset_worker, we would have multiple instances of statistics in the engine ([commuicator, progress thread] vs [buffer resource, context])
There was a problem hiding this comment.
But why do we need to reset the statistics in this setup? When is _reset_worker called? Is it only for tests? For every query, something else?
What do we actually need to achieve here? It seems like we want to zero the statistics object that we have around no?
There was a problem hiding this comment.
I was looking into two options.
- Introduce
set_statsfor communicator and progress thread and reset the new stats object - reset the stats object in-place (clear + enable/disable based on new options)
I think 1. is the safer option.
There was a problem hiding this comment.
That is the mechanism for how to reset stats. I am asking a different question: why do we want to reset stats?
There was a problem hiding this comment.
I think this stems from StreamingEngine._reset method. https://github.com/rapidsai/cudf/blob/main/python/cudf_polars/cudf_polars/engine/core.py#L289
There was a problem hiding this comment.
OK, _reset is only called in the tests (it is a "hack" to workaround that booting a dask or ray cluster is expensive), and I do not think that we have to tie ourselves in knots to publicly support that api also being able to reset statistics. Note that the docstring of _reset already says that it doesn't reset and reinitialize with all the options.
I think we have enough rope with .enable() and .clear() on the stats object, without introducing a new thing.
There was a problem hiding this comment.
okay. Let's do that.
There was a problem hiding this comment.
I felt clear() was insufficient because it only clears the stats, but not the memory records and report entries. So, we can have a scenario where we create a new Buffer resource with new rmm resource adapter, but stats contains old memory records from a previous buffer resource.
There was a problem hiding this comment.
ah, but that problem can be fixed (separately) by clearing memory records
| auto stats = rapidsmpf::Statistics::from_options(options); | ||
| auto progress_thread = std::make_shared<rapidsmpf::ProgressThread>(stats); |
There was a problem hiding this comment.
I think all benchmarks should have enabled statistics. So yes, I think you should create() the stats. Rather than from_options()ing them
| if (i == total_num_runs - 1) { | ||
| ctx->statistics()->clear(); | ||
| stats->clear(); | ||
| } |
There was a problem hiding this comment.
Does the ctx not advertise a statistics object any more? That seems wrong.
There was a problem hiding this comment.
It does, but its the same as stats in this scope
|
Some broader comments:
|
What I don't like about returning a const-ref to a |
So there are some usual options, I think: struct Foo {
std::shared_ptr<Statistics> const& statistics();
};
struct Owns {
std::shared_ptr<Statistics> stats_;
};
Owns takes_ref(std::shared_ptr<Statistics> const& stat) {
return Owns{stat};
}
...
auto stats = std::make_shared<Statistics>();
Foo foo{stats};
foo.statistics()->some_stats_method(); // This can never be use-after-free because foo is alive
auto owning = takes_ref(foo.statistics()); // transparently upgraded in Owns ctor
auto owning_stat = foo.statistics(); // owning
auto &ref_stat = foo.statistics(); // reference, but obvious |
That looks reasonable to me. |
|
But you only get a reference if you explicitly assign to one, right? auto& ref_stat = foo.statistics(); // reference, but explicitPlease, let's not introduce a special "reference semantics for smart pointers" concept like: Owns takes_ref(std::shared_ptr<Statistics> const& stat) {
return Owns{stat};
}If you assign to a reference, you are explicitly choosing to take on the associated lifetime risk. That is already true for references in general, and arguably even more so for non-const references. So if someone is not confident about the lifetime guarantees, they simply should not bind to a reference. |
Yes. You have to request a reference explicitly to get a reference |
|
Note, I too do not advocate the "takes a reference and upgrades it some time later" model. |
|
@wence- @pentschev I am marking this as draft because, I want to revisit I think this is a more safer approach. |
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
…ory-arg-for-providers Signed-off-by: niranda perera <niranda.perera@gmail.com>
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
…perera/rapidsmpf into stat-mandatory-arg-for-providers
Signed-off-by: niranda perera <niranda.perera@gmail.com>
|
|
||
| Warnings | ||
| -------- | ||
| Concurrent calls to ``set_statistics`` will result in undefined behavior. |
There was a problem hiding this comment.
Why is this warning not documented in the C++ docstring?
There was a problem hiding this comment.
There was a problem hiding this comment.
Yeah, sorry about that, don't know how I missed it. 😅
| /** | ||
| * @brief Replace the statistics instance held by this progress thread. | ||
| * | ||
| * This method pauses the progress loop for the duration of the swap so the loop | ||
| * cannot dereference `statistics_` while it is being replaced, then restores the | ||
| * prior running state. | ||
| * | ||
| * @param statistics The new statistics instance. Pass `Statistics::disabled()` to opt | ||
| * out of statistics collection. | ||
| * | ||
| * @throws std::invalid_argument If @p statistics is null. Validated | ||
| * before the pause so a null argument does not leave the progress loop | ||
| * paused. | ||
| * | ||
| * @warning Concurrent calls to `set_statistics()` will result in undefined behavior. | ||
| */ | ||
| void set_statistics(std::shared_ptr<Statistics> statistics); |
There was a problem hiding this comment.
This makes ProgressThread a mutable provider, but existing Statistics handles are snapshots. A caller can do auto stats = progress->statistics(), then another path calls set_statistics(new_stats), and the caller continues recording/reporting against the old object while ProgressThread and Communicator report the new one. I think this reintroduces the "which stats object owns this?" ambiguity the PR is trying to remove. If runtime reconfiguration is needed, mutating the existing Statistics instance is safer than swapping identity. If swapping remains, docs/tests should explicitly state that previously returned handles are not retargeted.
There was a problem hiding this comment.
Fair point. This is how I looked at this.
Currently, Statistics only gets reset when we reset the engine. While running, we initiate async stat recording, like Memory records, stream ordered timings etc. Recording only happens at a later point, and there could be a scenario where the recording happens after a reset. If stat object is the same, we might end up with some stats from before.
This PR handles it by handing off a weak ptr to async stats recording, and if the ptr has expired, recording is skipped.
There was a problem hiding this comment.
Ok, but what will happen to those objects that rely on the Statistics to be still alive? Do they have a mechanism to eventually get a new one, or will they just fall out of the Statistics loop forever?
There was a problem hiding this comment.
@pentschev If they rely on Statistics, it could be a problem. If they want to get the most current statistics, they would have to keep a strong reference to long-living stats provider like ProgressThread.
will they just fall out of the Statistics loop forever?
Yes. At least in current usages, IMO it will be safe to miss some recordings on the edge cases.
There was a problem hiding this comment.
What are the edge cases?
Currently,
Statisticsonly gets reset when we reset the engine.
Or more generically, when is the engine expected to reset? I assume be "engine reset" you're referring to cudf-polars' engine, is that right? One case where reset may occur is part of the testing framework, is that what you have in mind? What would other cases be?
There was a problem hiding this comment.
Yes, edge cases I was referring to was, in-flight stats recordings while engine is being reset.
During tests, I think this is not happening, because we have a long running communicator/ progress thread in the test environment.
There was a problem hiding this comment.
Ok, can you please at a minimum document that previously returned handles are not retargeted?
Signed-off-by: niranda perera <niranda.perera@gmail.com>
| * @warning Concurrent calls to `set_statistics()` are implementation-defined and | ||
| * likely result in undefined behavior unless the implementation explicitly | ||
| * documents otherwise. Callers should assume external synchronization is required. |
There was a problem hiding this comment.
Nit: let's not leave this hanging on implementation details and just say this is not thread-safe/concurrent calls are UB.
|
|
||
| Warnings | ||
| -------- | ||
| Concurrent calls to ``set_statistics`` will result in undefined behavior. |
There was a problem hiding this comment.
Yeah, sorry about that, don't know how I missed it. 😅
| /** | ||
| * @brief Replace the statistics instance held by this progress thread. | ||
| * | ||
| * This method pauses the progress loop for the duration of the swap so the loop | ||
| * cannot dereference `statistics_` while it is being replaced, then restores the | ||
| * prior running state. | ||
| * | ||
| * @param statistics The new statistics instance. Pass `Statistics::disabled()` to opt | ||
| * out of statistics collection. | ||
| * | ||
| * @throws std::invalid_argument If @p statistics is null. Validated | ||
| * before the pause so a null argument does not leave the progress loop | ||
| * paused. | ||
| * | ||
| * @warning Concurrent calls to `set_statistics()` will result in undefined behavior. | ||
| */ | ||
| void set_statistics(std::shared_ptr<Statistics> statistics); |
There was a problem hiding this comment.
Ok, but what will happen to those objects that rely on the Statistics to be still alive? Do they have a mechanism to eventually get a new one, or will they just fall out of the Statistics loop forever?
| - '!.github/release.yml' | ||
| - '!.pre-commit-config.yaml' | ||
| - '!.shellcheckrc' | ||
| - '!SECURITY.md' |
There was a problem hiding this comment.
It's surprising to see these changes from #1064 showing up in the diff again. Those are already out on main.
Same with the cupy changes throughout packaging files. I suspect something has gone wrong with resolving a merge conflict here.
There was a problem hiding this comment.
Oh yes! I think I've messed up bigtime during a merge..
There was a problem hiding this comment.
Ah great, thanks very much. Commit history looks much better now, and you no longer need ci-codeowners / packaging-codeowners so I'll unsubscribe from this.
@ me back if you need my help with anything.
…ory-arg-for-providers Signed-off-by: niranda perera <niranda.perera@gmail.com>
e8d1b76 to
30df058
Compare
|
closing in favor of #1082 |
Make
Statisticsa mandatory constructor argument for providersFollow-up to #1009. After the initial stats cleanup,
BufferResource,ProgressThread, andstreaming::Contextstill defaulted theirstatisticsparameter toStatistics::disabled(), which made it easy to silently drop stats by forgetting to pass an instance and obscured ownership at the call site (e.g. communicator vs. progress thread vs. buffer resource each potentially holding a differentStatistics).In this PR:
BufferResource(...)/BufferResource::from_options(...)now require an explicitStatistics(passed as the first ctor arg);ProgressThread(...)does too.Communicatorstores ashared_ptr<ProgressThread>and forwardsstatistics()to it, soSingle/MPI/UCXXdrop their own progress-thread members. Tests, examples, and Python bindings are updated; the PythonCommunicatorgains astatisticsproperty.MutableStatisticsProviderconcept andProgressThread::set_statistics()(wired through to Python) allow swapping the instance at runtime; the progress loop is paused around the swap.MemoryRecorderandStreamOrderedTimingnow holdStatisticsweakly so a swap can destroy the original even with outstanding scopes; the destructors lock and skip publishing when the instance has expired.Statistics::disabled()/from_optionsdefaulting to off) and useclear()before the final run to keep per-run reporting.Depends on #1009
Related to #1008