Optimization: Replace slow DataFrame apply() iterations with vectorized operations for metadata loading #1155
Merged
jonbrenas merged 4 commits into malariagen:master from Mar 19, 2026
Conversation
Force-pushed from 7beae52 to 9bae0c0
Replaces row-wise pandas apply() usage in metadata/cohort preparation with vectorized numpy/pandas operations to reduce Python-level iteration. Made-with: Cursor
Force-pushed from 356f12b to 5d0e8b8
Force-pushed from 84634d2 to 3118c0a
jonbrenas approved these changes Mar 19, 2026
issue: #926
Summary
This PR addresses #926 by removing multiple pandas row-wise iteration patterns (`DataFrame.apply(..., axis="columns")` and `Series.apply(lambda ...)`) from metadata/cohort loading hot paths and replacing them with vectorized NumPy/Pandas operations. The main goal is to reduce Python-level per-row overhead when working with large cohort metadata tables, which improves end-to-end load times for downstream workflows (e.g. ML feature generation and interactive dashboards).
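To make the pattern concrete, here is a minimal before/after sketch of the kind of change applied throughout. The data and column names are invented for illustration and are not taken from the library; the example mirrors a row-wise label concatenation replaced by vectorized string operations.

```python
# Minimal sketch of the apply -> vectorized transformation (toy data only).
import pandas as pd

df = pd.DataFrame({"taxon": ["gambiae", "coluzzii"], "year": [2012, 2014]})

# Before: one Python callback per row.
labels_slow = df.apply(lambda row: f"{row['taxon']}_{row['year']}", axis="columns")

# After: vectorized string concatenation executed inside pandas.
labels_fast = df["taxon"] + "_" + df["year"].astype(str)

assert labels_slow.equals(labels_fast)
```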
Changes
- `malariagen_data/anoph/sample_metadata.py`: `quarter` derivation from `month` using `np.where(...)`; `month <= 0` (sentinel / missing) results in `quarter == -1` (see the sketch after this list).
- `malariagen_data/anoph/base.py`: `unrestricted_use` derivation from `terms_of_use_expiry_date` using boolean ops; a NaN expiry date maps to `True`, `date.today().isoformat()` is retained, and the result keeps `pd.BooleanDtype()` (also sketched below).
- `malariagen_data/anoph/genome_features.py`: vectorized derivation of (`gene_pointer`, `pointer_y`, `label_y`) using `np.where`; a missing `gene_label` yields an empty pointer (no glyph displayed).
- `malariagen_data/anoph/frq_base.py`: vectorized `PeriodIndex` construction for `year`, `quarter`, and `month` grouping; `.dt.start_time` / `.dt.end_time` when `period` is Period dtype; `Series.str` and concatenation instead of `apply(axis=1)` for heatmap multi-column index label concatenation.
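As a rough illustration of the first two items, here is a hedged sketch of the `quarter` and `unrestricted_use` derivations on toy data. Column names follow the description above, but the details (sentinel value, comparison direction, missing-data handling) are assumptions rather than the exact library code.

```python
# Sketch only: toy data and assumed semantics; not the malariagen_data source.
from datetime import date

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "month": [1, 6, 12, -1],  # -1 assumed as the missing/sentinel value
        "terms_of_use_expiry_date": ["2020-01-01", None, "2099-01-01", None],
    }
)

# quarter: month <= 0 (sentinel / missing) results in quarter == -1.
df["quarter"] = np.where(df["month"] <= 0, -1, (df["month"] - 1) // 3 + 1)

# unrestricted_use: True when there is no expiry date or (assumed) when the
# expiry date is already past; ISO date strings compare lexicographically.
today = date.today().isoformat()
expiry = df["terms_of_use_expiry_date"]
df["unrestricted_use"] = (expiry.isna() | (expiry.fillna(today) < today)).astype("boolean")
```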
Repo hygiene
- Added `.venv/` to `.gitignore` to prevent accidental commits of local virtualenvs.

Tests
- `tests/anoph/test_frq_base.py`
- `tests/anoph/test_sample_metadata.py`
- `tests/anoph/test_base.py`
- `tests/anoph/test_genome_features.py`

Performance note
These changes remove several Python-level per-row loops in hot paths. The biggest expected wins are:
- `unrestricted_use` computation on large metadata tables
- `PeriodIndex` construction for cohort grouping (removes `apply(axis=1)` overhead)

While the operations remain O(N), they now execute primarily in optimized Pandas/NumPy internals instead of repeated Python callbacks.
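A quick way to see that the win is interpreter overhead rather than algorithmic complexity is a micro-benchmark like the one below. This is illustrative only (not taken from the PR), and absolute timings will vary by machine and pandas version.

```python
# Both paths are O(N); only the first pays a Python function call per row.
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"month": np.random.randint(-1, 13, size=1_000_000)})

t0 = time.perf_counter()
q_apply = df["month"].apply(lambda m: -1 if m <= 0 else (m - 1) // 3 + 1)
t1 = time.perf_counter()
q_vec = np.where(df["month"] <= 0, -1, (df["month"] - 1) // 3 + 1)
t2 = time.perf_counter()

print(f"apply:      {t1 - t0:.3f}s")
print(f"vectorized: {t2 - t1:.3f}s")
assert (q_apply.to_numpy() == q_vec).all()
```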
Maintainer note
This optimization matters because metadata and cohort tables can be very large, and row-wise
`.apply()` scales poorly due to interpreter overhead. The approach here is intentionally conservative: keep output values/dtypes and sentinel handling identical, but replace row-wise lambdas with vectorized primitives (`np.where`, boolean masks, `PeriodIndex`, `.dt` accessors, and vectorized string ops).
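For reference, a minimal sketch of the period-grouping primitive mentioned above, assuming toy `year`/`month` columns; the actual `frq_base.py` code may assemble and name these columns differently.

```python
# Sketch: vectorized Period construction and .dt accessors (toy data only).
import pandas as pd

df = pd.DataFrame({"year": [2012, 2012, 2014], "month": [1, 7, 11]})

# Monthly and quarterly periods built in one vectorized pass, no apply(axis=1).
dates = pd.to_datetime(df.assign(day=1))
period_month = dates.dt.to_period("M")
period_quarter = dates.dt.to_period("Q")

# Period start/end obtained via the .dt accessor when the dtype is Period.
start = period_month.dt.start_time
end = period_month.dt.end_time

# Group cohorts by period directly.
counts = df.groupby(period_month).size()
```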