fix: avoid inefficient collect/count at interrogation time #407

Merged
rich-iannone merged 5 commits into main from fix-avoid-collect-counting-slowdown
Jun 30, 2026
@rich-iannone

Member

In Pointblank 0.25.0, per-row validation checks on Polars LazyFrames became 8–11× slower than 0.24.0, with the worst slowdown occurring exactly when checks pass:

| Check | 0.24.0 | 0.25.0 | Slowdown |
|---|---|---|---|
| col_vals_not_null | 0.007s | 0.077s | 11× |
| col_vals_le | 0.004s | 0.031s | 8× |

In 0.24.0, each LazyFrame was collected once before per-row checks ran, so the results table was eager. In 0.25.0 that eager collect was removed to keep the pipeline lazy, but the code that tallies results still issued three (or four, for col_vals_in_set) separate .collect() calls on the lazy results table: one for the passed count, one for the failed count, one for nulls, and one for the total row count. Each .collect() re-executed the entire lazy plan (including the per-row predicate) from scratch.
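The cost of re-collecting can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyLazyPlan` is a hypothetical stand-in written for this example, not Pointblank, Narwhals, or Polars code. It mimics a lazy results table whose `collect()` re-runs the per-row predicate every time it is called, and counts how often that happens.

```python
from typing import Callable, List, Optional, Sequence


class ToyLazyPlan:
    """Hypothetical stand-in for a lazy results table (illustration only)."""

    def __init__(
        self,
        values: Sequence[Optional[int]],
        predicate: Callable[[int], bool],
    ) -> None:
        self._values = list(values)
        self._predicate = predicate
        self.executions = 0  # how many times the full plan has been run

    def collect(self) -> List[Optional[bool]]:
        """Run the whole plan from scratch, including the per-row predicate."""
        self.executions += 1
        return [None if v is None else self._predicate(v) for v in self._values]


plan = ToyLazyPlan([1, 5, None, 12, 7], lambda v: v <= 10)

# Tallying with separate collects: each line re-executes the entire plan.
n_passed = sum(1 for r in plan.collect() if r is True)
n_failed = sum(1 for r in plan.collect() if r is False)
n_total = len(plan.collect())

print(plan.executions)  # 3
```

In this model the predicate is evaluated over every row three times to produce three numbers, which is the same shape of waste described above, just without a real query engine underneath.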

To address this, I added the _count_validation_units() util function. It's a single Narwhals aggregation that computes the row count, pass count, fail count, and null count in one pass (one .collect()). The result-tallying block in validate.py now calls it once instead of scanning the data three to four times. Booleans are cast to Int32 before summing for PySpark compatibility, and the sums naturally exclude nulls so pass/fail semantics are unchanged.
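The single-pass idea can be sketched in plain Python. This is not the actual `_count_validation_units()` implementation, which is a Narwhals aggregation. The function name `count_units_single_pass` and the `UnitCounts` type are hypothetical and exist only for this illustration. The sketch shows the same counting semantics: booleans are converted to integers before summing, and nulls contribute to neither the pass nor the fail count.

```python
from typing import Iterable, NamedTuple, Optional


class UnitCounts(NamedTuple):
    """Hypothetical result container for this sketch."""

    n: int
    n_passed: int
    n_failed: int
    n_null: int


def count_units_single_pass(results: Iterable[Optional[bool]]) -> UnitCounts:
    """Compute all four tallies while consuming the results exactly once."""
    n = n_passed = n_failed = n_null = 0
    for r in results:
        n += 1
        if r is None:
            # Nulls are counted separately and excluded from pass/fail sums.
            n_null += 1
        else:
            # Cast the boolean to an integer before adding, mirroring the
            # cast-then-sum approach described above.
            n_passed += int(r)
            n_failed += int(not r)
    return UnitCounts(n, n_passed, n_failed, n_null)


# A one-shot iterator can only be consumed once, so this call succeeding
# shows that the tally really needs just a single pass over the results.
counts = count_units_single_pass(iter([True, True, None, False, True]))
print(counts)
```

Returning all four counts from one call means the caller never needs to touch the results table again, which is the property that removes the repeated `.collect()` calls.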

On the 2M-row reproduction from the issue:

| Check | Before | After |
|---|---|---|
| col_vals_not_null | 0.077s | 0.008s |
| col_vals_le | 0.031s | 0.009s |

Performance is now back to 0.24.0 levels.

Fixes: #405

rich-iannone and others added 5 commits June 30, 2026 16:13
Collapse the _count_validation_units() call onto a single line to satisfy
the ruff-format pre-commit hook in CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rich-iannone rich-iannone merged commit e45f0c9 into main Jun 30, 2026
9 checks passed
@rich-iannone rich-iannone deleted the fix-avoid-collect-counting-slowdown branch June 30, 2026 21:00
Development

Successfully merging this pull request may close these issues.

Performance regression in 0.25.0: per-row checks on Polars LazyFrames are ~10× slower
