
Optimization: Replace slow DataFrame apply() iterations with vectorized operations for metadata loading #1155

Merged
jonbrenas merged 4 commits into malariagen:master from khushthecoder:optimize/issue-926-vectorize-apply
Mar 19, 2026

@khushthecoder commented Mar 19, 2026

issue: #926

Summary

This PR addresses #926 by removing multiple pandas row-wise iteration patterns (DataFrame.apply(..., axis="columns") and Series.apply(lambda ...)) from metadata/cohort loading hot paths and replacing them with vectorized NumPy/Pandas operations.

The main goal is to reduce Python-level per-row overhead when working with large cohort metadata tables, which improves end-to-end load times for downstream workflows (e.g. ML feature generation and interactive dashboards).

Changes

  • malariagen_data/anoph/sample_metadata.py

    • Vectorized quarter derivation from month using np.where(...).
    • Semantics preserved: month <= 0 (sentinel / missing) results in quarter == -1.
  • malariagen_data/anoph/base.py

    • Vectorized unrestricted_use derivation from terms_of_use_expiry_date using boolean ops.
    • Semantics preserved:
      • NaN expiry date → True
      • ISO-string compare against date.today().isoformat() retained
      • dtype preserved as pd.BooleanDtype().
  • malariagen_data/anoph/genome_features.py

    • Vectorized gene pointer glyphs and y-offset helper columns (gene_pointer, pointer_y, label_y) using np.where.
    • Semantics preserved: empty gene_label → empty pointer (no glyph displayed).
  • malariagen_data/anoph/frq_base.py

    • Replaced row-wise period creation with vectorized PeriodIndex construction for year, quarter, and month grouping.
    • Reduced Python-level iteration in cohort preparation:
      • Prefer .dt.start_time / .dt.end_time when period is Period dtype.
      • Cohort label construction is fully vectorized via Series.str and concatenation.
    • Removed remaining apply(axis=1) for heatmap multi-column index label concatenation.
  • Repo hygiene

    • Added .venv/ to .gitignore to prevent accidental commits of local virtualenvs.
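
As a rough illustration of the sample_metadata.py and base.py changes listed above, here is a minimal sketch (not the code from this PR). The table contents, the quarter formula, and the reading that an expiry date on or before today means unrestricted use are assumptions made for the example; only the column names month, quarter, terms_of_use_expiry_date and unrestricted_use come from the description.

```python
import datetime

import numpy as np
import pandas as pd

# Hypothetical metadata rows; month <= 0 marks a missing collection month.
df = pd.DataFrame(
    {
        "month": [1, 3, 4, 12, -1, 0],
        "terms_of_use_expiry_date": [
            "2000-01-01", "2999-12-31", np.nan,
            "2000-01-01", np.nan, "2999-12-31",
        ],
    }
)

# Row-wise form of the kind being replaced (kept here only for comparison).
quarter_rowwise = df["month"].apply(lambda m: (m - 1) // 3 + 1 if m > 0 else -1)

# Vectorized form: a single np.where over the whole column.
month = df["month"].to_numpy()
df["quarter"] = np.where(month > 0, (month - 1) // 3 + 1, -1)

# unrestricted_use via boolean masks. ISO date strings sort chronologically,
# so a plain string comparison against today's ISO date is enough.
today = datetime.date.today().isoformat()
expiry = df["terms_of_use_expiry_date"]
no_expiry = expiry.isna()
# Fill missing values before comparing so no NaN takes part in the comparison;
# those rows are already covered by the no_expiry mask.
expired = expiry.fillna("") <= today
df["unrestricted_use"] = (no_expiry | expired).astype(pd.BooleanDtype())
```

Both derived columns are computed in one pass over whole arrays, and the final astype keeps the nullable boolean dtype mentioned in the description.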

Tests

  • Ran targeted suites for the touched modules locally:
    • tests/anoph/test_frq_base.py
    • tests/anoph/test_sample_metadata.py
    • tests/anoph/test_base.py
    • tests/anoph/test_genome_features.py

Performance note

These changes remove several Python-level per-row loops in hot paths. The biggest expected wins are:

  • quarter derivation and unrestricted_use computation on large metadata tables
  • vectorized PeriodIndex construction for cohort grouping (removes apply(axis=1) overhead)

While the operations remain O(N), they now execute primarily in optimized Pandas/NumPy internals instead of repeated Python callbacks.

Maintainer note

This optimization matters because metadata and cohort tables can be very large, and row-wise .apply() is slow at that scale because every row incurs a Python-level function call. The approach here is intentionally conservative: keep output values/dtypes and sentinel handling identical, but replace row-wise lambdas with vectorized primitives (np.where, boolean masks, PeriodIndex, .dt accessors, and vectorized string ops).
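
The .dt accessors and vectorized string ops mentioned above could look roughly like the following sketch. The table, the column names taxon, area, period_start, period_end and label, and the label format are hypothetical and chosen only for illustration; the real cohort label format in frq_base.py may differ.

```python
import pandas as pd

# Hypothetical cohort table with a Period-dtype column.
cohorts = pd.DataFrame(
    {
        "taxon": ["gambiae", "coluzzii"],
        "area": ["BF-09", "ML-2"],
        "period": pd.PeriodIndex(["2019Q1", "2020Q3"], freq="Q"),
    }
)

# Period bounds via the .dt accessor, with no per-row lambda.
cohorts["period_start"] = cohorts["period"].dt.start_time
cohorts["period_end"] = cohorts["period"].dt.end_time

# Label built with Series.str slicing and element-wise string concatenation.
cohorts["label"] = (
    cohorts["area"]
    + "_"
    + cohorts["taxon"].str[:4]
    + "_"
    + cohorts["period"].astype(str)
)

# Equivalent row-wise form, shown only to compare results.
label_rowwise = cohorts.apply(
    lambda row: f"{row['area']}_{row['taxon'][:4]}_{row['period']}",
    axis="columns",
)
```

Each step operates on whole columns, so output values stay the same as in the row-wise form while the per-row callback disappears.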

@khushthecoder khushthecoder force-pushed the optimize/issue-926-vectorize-apply branch 2 times, most recently from 7beae52 to 9bae0c0 Compare March 19, 2026 08:12
Replaces row-wise pandas apply() usage in metadata/cohort preparation with vectorized numpy/pandas operations to reduce Python-level iteration.

Made-with: Cursor
@khushthecoder khushthecoder force-pushed the optimize/issue-926-vectorize-apply branch from 356f12b to 5d0e8b8 Compare March 19, 2026 08:28
@khushthecoder khushthecoder force-pushed the optimize/issue-926-vectorize-apply branch from 84634d2 to 3118c0a Compare March 19, 2026 08:33
@jonbrenas jonbrenas merged commit 9ef3865 into malariagen:master Mar 19, 2026
8 checks passed