fix: avoid inefficient collect/count at interrogation time by rich-iannone · Pull Request #407 · posit-dev/pointblank

rich-iannone · 2026-06-30T20:17:26Z

In Pointblank 0.25.0, per-row validation checks on Polars LazyFrames became 8–11× slower than 0.24.0, with the worst slowdown occurring exactly when checks pass:

| Check | 0.24.0 | 0.25.0 | Slowdown |
|---|---|---|---|
| `col_vals_not_null` | 0.007s | 0.077s | 11× |
| `col_vals_le` | 0.004s | 0.031s | 8× |

In 0.24.0, each LazyFrame was collected once before per-row checks ran, so the results table was eager. In 0.25.0 that eager collect was removed to keep the pipeline lazy, but the code that tallies results still issued three (or four, for `col_vals_in_set`) separate `.collect()` calls on the lazy results table: one for the passed count, one for the failed count, one for nulls, and one for the total row count. Each `.collect()` re-executed the entire lazy plan (including the per-row predicate) from scratch.
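The cost of repeated collection can be illustrated without Polars. The sketch below is a plain-Python model, not Pointblank's code: `LazyResults` and its `predicate_calls` counter are hypothetical names, and `collect()` simply re-evaluates the predicate over every row on each call, the way re-collecting an uncached lazy plan would.

```python
from typing import Callable, List, Optional


class LazyResults:
    """Hypothetical stand-in for a lazy results table (not a Polars or Pointblank API)."""

    def __init__(self, values: List[Optional[int]], predicate: Callable[[int], bool]):
        self._values = values
        self._predicate = predicate
        self.predicate_calls = 0  # how many times the per-row check has run

    def collect(self) -> List[Optional[bool]]:
        # Every collect re-runs the whole plan: nothing is cached between calls.
        out: List[Optional[bool]] = []
        for v in self._values:
            if v is None:
                out.append(None)  # a null input yields a null result
            else:
                self.predicate_calls += 1
                out.append(self._predicate(v))
        return out


values = [1, 5, None, 12, 7]
lazy = LazyResults(values, lambda v: v <= 10)

# Tally with one collect per statistic, as in the slow path described above.
n_rows = len(lazy.collect())
n_passed = sum(1 for r in lazy.collect() if r is True)
n_failed = sum(1 for r in lazy.collect() if r is False)
n_null = sum(1 for r in lazy.collect() if r is None)

print(n_rows, n_passed, n_failed, n_null)
print(lazy.predicate_calls)  # 4 non-null rows evaluated on each of 4 collects
```

In this model the predicate runs four times per non-null row, so the tallying cost scales with the number of `collect()` calls rather than with the data alone.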

To address this, I added the `_count_validation_units()` utility function. It's a single Narwhals aggregation that computes the row count, pass count, fail count, and null count in one pass (one `.collect()`). The result-tallying block in `validate.py` now calls it once instead of scanning the data three to four times. Booleans are cast to `Int32` before summing for PySpark compatibility, and the sums naturally exclude nulls, so pass/fail semantics are unchanged.
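The counting semantics can be sketched in plain Python. This is an illustrative model only, not the Narwhals aggregation in the PR: `UnitCounts` and `count_units_single_pass` are hypothetical names, and the input is assumed to be a sequence of per-row results where `None` stands for a null.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class UnitCounts:
    n_rows: int
    n_passed: int
    n_failed: int
    n_null: int


def count_units_single_pass(results: Iterable[Optional[bool]]) -> UnitCounts:
    """Hypothetical helper: tally all four counts in one traversal."""
    n_rows = n_passed = n_failed = n_null = 0
    for r in results:
        n_rows += 1
        if r is None:
            n_null += 1  # nulls count toward neither passes nor failures
        else:
            # Treat the boolean as an integer before summing,
            # loosely mirroring the Int32 cast described above.
            n_passed += int(r)
            n_failed += int(not r)
    return UnitCounts(n_rows, n_passed, n_failed, n_null)


# An iterator can only be consumed once, so this call demonstrably makes one pass.
counts = count_units_single_pass(iter([True, True, None, False, True]))
print(counts)
```

Because nulls are skipped by both sums, the pass and fail counts add up to the row count minus the null count, which matches the unchanged pass/fail semantics noted above.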

On the 2M-row reproduction from the issue:

| Check | Before | After |
|---|---|---|
| `col_vals_not_null` | 0.077s | 0.008s |
| `col_vals_le` | 0.031s | 0.009s |

Performance is now back close to 0.24.0 levels.

Fixes: #405


rich-iannone and others added 5 commits June 30, 2026 16:13

Add _count_validation_units() util function

0e03e70

Use _count_validation_units() at interrogation time

8b43150

Update test__utils.py

eedc5e5

Apply ruff-format to test__utils.py

4b4945f

Collapse the _count_validation_units() call onto a single line to satisfy the ruff-format pre-commit hook in CI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Merge branch 'main' into fix-avoid-collect-counting-slowdown

c1a4e45

github-actions Bot temporarily deployed to pr-407 June 30, 2026 20:40 Destroyed

github-actions Bot temporarily deployed to pr-407 June 30, 2026 20:49 Destroyed

github-actions Bot deployed to pr-407 June 30, 2026 20:55 View deployment

rich-iannone merged commit e45f0c9 into main Jun 30, 2026
9 checks passed

rich-iannone deleted the fix-avoid-collect-counting-slowdown branch June 30, 2026 21:00
