fix: avoid inefficient collect/count at interrogation time#407
Merged
Conversation
Collapse the _count_validation_units() call onto a single line to satisfy the ruff-format pre-commit hook in CI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
In Pointblank
0.25.0, per-row validation checks on Polars LazyFrames became 8–11× slower than0.24.0, with the worst slowdown occurring exactly when checks pass:col_vals_not_nullcol_vals_leIn
0.24.0, each LazyFrame was collected once before per-row checks ran, so the results table was eager. In0.25.0that eager collect was removed to keep the pipeline lazy, but the code that tallies results still issued three (or four, forcol_vals_in_set) separate.collect()calls on the lazy results table: one for the passed count, one for the failed count, one for nulls, and one for the total row count. Each.collect()re-executed the entire lazy plan (including the per-row predicate) from scratch.To address this, I added the
_count_validation_units()util function. It's a single Narwhals aggregation that computes the row count, pass count, fail count, and null count in one pass (one.collect()). The result-tallying block invalidate.pynow calls it once instead of scanning the data three to four times. Booleans are cast toInt32before summing for PySpark compatibility, and the sums naturally exclude nulls so pass/fail semantics are unchanged.On the 2M-row reproduction from the issue:
col_vals_not_nullcol_vals_lePerformance is now back to
0.24.0levels.Fixes: #405