Optimization: Replace slow DataFrame apply() iterations with vectorized operations for metadata loading #1155
Merged
jonbrenas merged 4 commits into malariagen:master from Mar 19, 2026
Conversation
Force-pushed from 7beae52 to 9bae0c0
Replaces row-wise pandas apply() usage in metadata/cohort preparation with vectorized numpy/pandas operations to reduce Python-level iteration. Made-with: Cursor
Force-pushed from 356f12b to 5d0e8b8
Force-pushed from 84634d2 to 3118c0a
jonbrenas approved these changes Mar 19, 2026
issue: #926
Summary
This PR addresses #926 by removing multiple pandas row-wise iteration patterns (`DataFrame.apply(..., axis="columns")` and `Series.apply(lambda ...)`) from metadata/cohort loading hot paths and replacing them with vectorized NumPy/Pandas operations. The main goal is to reduce Python-level per-row overhead when working with large cohort metadata tables, which improves end-to-end load times for downstream workflows (e.g. ML feature generation and interactive dashboards).
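To make the pattern concrete, here is a minimal before/after sketch of the kind of change applied throughout. The data and column names are invented for illustration and are not taken from the library; the example mirrors a row-wise label concatenation replaced by vectorized string operations.

```python
# Minimal sketch of the apply -> vectorized transformation (toy data only).
import pandas as pd

df = pd.DataFrame({"taxon": ["gambiae", "coluzzii"], "year": [2012, 2014]})

# Before: one Python callback per row.
labels_slow = df.apply(lambda row: f"{row['taxon']}_{row['year']}", axis="columns")

# After: vectorized string concatenation executed inside pandas.
labels_fast = df["taxon"] + "_" + df["year"].astype(str)

assert labels_slow.equals(labels_fast)
```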
Changes
- `malariagen_data/anoph/sample_metadata.py`: `quarter` derivation from `month` using `np.where(...)`; `month <= 0` (sentinel / missing) results in `quarter == -1` (see the sketch after this list).
- `malariagen_data/anoph/base.py`: `unrestricted_use` derivation from `terms_of_use_expiry_date` using boolean ops; a NaN expiry date maps to `True`, `date.today().isoformat()` is retained, and the result keeps `pd.BooleanDtype()` (also sketched below).
- `malariagen_data/anoph/genome_features.py`: vectorized derivation of (`gene_pointer`, `pointer_y`, `label_y`) using `np.where`; a missing `gene_label` yields an empty pointer (no glyph displayed).
- `malariagen_data/anoph/frq_base.py`: vectorized `PeriodIndex` construction for `year`, `quarter`, and `month` grouping; `.dt.start_time` / `.dt.end_time` when `period` is Period dtype; `Series.str` and concatenation instead of `apply(axis=1)` for heatmap multi-column index label concatenation.
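As a rough illustration of the first two items, here is a hedged sketch of the `quarter` and `unrestricted_use` derivations on toy data. Column names follow the description above, but the details (sentinel value, comparison direction, missing-data handling) are assumptions rather than the exact library code.

```python
# Sketch only: toy data and assumed semantics; not the malariagen_data source.
from datetime import date

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "month": [1, 6, 12, -1],  # -1 assumed as the missing/sentinel value
        "terms_of_use_expiry_date": ["2020-01-01", None, "2099-01-01", None],
    }
)

# quarter: month <= 0 (sentinel / missing) results in quarter == -1.
df["quarter"] = np.where(df["month"] <= 0, -1, (df["month"] - 1) // 3 + 1)

# unrestricted_use: True when there is no expiry date or (assumed) when the
# expiry date is already past; ISO date strings compare lexicographically.
today = date.today().isoformat()
expiry = df["terms_of_use_expiry_date"]
df["unrestricted_use"] = (expiry.isna() | (expiry.fillna(today) < today)).astype("boolean")
```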
Repo hygiene
- Added `.venv/` to `.gitignore` to prevent accidental commits of local virtualenvs.

Tests
- `tests/anoph/test_frq_base.py`
- `tests/anoph/test_sample_metadata.py`
- `tests/anoph/test_base.py`
- `tests/anoph/test_genome_features.py`

Performance note
These changes remove several Python-level per-row loops in hot paths. The biggest expected wins are:
- `unrestricted_use` computation on large metadata tables
- `PeriodIndex` construction for cohort grouping (removes `apply(axis=1)` overhead)

While the operations remain O(N), they now execute primarily in optimized Pandas/NumPy internals instead of repeated Python callbacks.
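A quick way to see that the win is interpreter overhead rather than algorithmic complexity is a micro-benchmark like the one below. This is illustrative only (not taken from the PR), and absolute timings will vary by machine and pandas version.

```python
# Both paths are O(N); only the first pays a Python function call per row.
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"month": np.random.randint(-1, 13, size=1_000_000)})

t0 = time.perf_counter()
q_apply = df["month"].apply(lambda m: -1 if m <= 0 else (m - 1) // 3 + 1)
t1 = time.perf_counter()
q_vec = np.where(df["month"] <= 0, -1, (df["month"] - 1) // 3 + 1)
t2 = time.perf_counter()

print(f"apply:      {t1 - t0:.3f}s")
print(f"vectorized: {t2 - t1:.3f}s")
assert (q_apply.to_numpy() == q_vec).all()
```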
Maintainer note
This optimization matters because metadata and cohort tables can be very large, and row-wise
`.apply()` scales poorly due to interpreter overhead. The approach here is intentionally conservative: keep output values/dtypes and sentinel handling identical, but replace row-wise lambdas with vectorized primitives (`np.where`, boolean masks, `PeriodIndex`, `.dt` accessors, and vectorized string ops).
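For reference, a minimal sketch of the period-grouping primitive mentioned above, assuming toy `year`/`month` columns; the actual `frq_base.py` code may assemble and name these columns differently.

```python
# Sketch: vectorized Period construction and .dt accessors (toy data only).
import pandas as pd

df = pd.DataFrame({"year": [2012, 2012, 2014], "month": [1, 7, 11]})

# Monthly and quarterly periods built in one vectorized pass, no apply(axis=1).
dates = pd.to_datetime(df.assign(day=1))
period_month = dates.dt.to_period("M")
period_quarter = dates.dt.to_period("Q")

# Period start/end obtained via the .dt accessor when the dtype is Period.
start = period_month.dt.start_time
end = period_month.dt.end_time

# Group cohorts by period directly.
counts = df.groupby(period_month).size()
```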