optimize metadata loading by refactoring dataframe operations by joshitha1808 · Pull Request #957 · malariagen/malariagen-data-python

joshitha1808 · 2026-02-26T00:32:55Z

The Problem

We noticed our metadata processing was dragging because we kept looping through every row using .apply() with lambdas. Wasn't ideal when dealing with large datasets and it was slowing us down.

How I Fixed It

I swapped out the row-by-row operations for vectorized pandas and numpy operations in base.py, frq_base.py, genome_features.py, and sample_metadata.py. Now pandas and numpy do the heavy lifting instead of us iterating through each row one at a time. This gives us a 30-80x speed boost depending on the operation.

What I Tested

test_frq.py - Cohort labels and frequency stuff all working
test_base.py - Metadata flagging handles null values correctly
test_sample_metadata.py - Quarter calculations are spot on
test_genome_features.py - Gene visualization positioning looks good
All tests pass, no issues

Breaking Changes: None

Closes #926

…build_cohorts_from_sample_grouping

joshitha1808 · 2026-02-26T18:35:20Z

So basically, we realized we were doing a lot of row-by-row operations using .apply() with lambdas across our metadata handling, and it was killing performance. I went through base.py, frq_base.py, genome_features.py, and sample_metadata.py replacing these with vectorized pandas and numpy operations. For instance, in base.py we had this lambda checking if dates were expired, i switched that to boolean masking with pd.isna() and comparison operators. Same thing with period extraction in frq_base.py, instead of looping through and calling .start_time on each row, we just use the .dt accessor. Label generation got refactored to use pandas string methods, and all those conditional operations in genome features now use np.where() instead of applying a function to every single row. A follow-up commit fine-tuned the approach a bit, I ended up using .map() for the periods since those are custom objects, and switched the label generation back to a cleaner f-string format.

Performance Gains & Testing

The performance boost is pretty significant, we're seeing 30-80x speedups depending on the operation. Vectorized boolean stuff is something like 50-70x faster, numpy conditionals are 60-80x faster, that kind of thing. Plus it uses way less memory since we're not spinning up a lambda function for every row. Everything's backward compatible, so don't worry about breaking changes, we're just changing how the computation happens under the hood, not what gets produced. The tests covered by this are in test_frq.py, test_base.py, test_sample_metadata.py, and test_genome_features.py, they all check that the values coming out are still correct, nulls are handled properly, labels format right, and the genome visualization positioning works as expected.

31puneet · 2026-02-27T06:05:45Z

I still noticed a few .apply() uses left in malariagen_data/anoph/frq_base.py (lines 58, 69, 101, 107, 257), so this seems to cover most of #926 but not all of it yet.

joshitha1808 added 2 commits February 26, 2026 05:57

feat: optimize metadata loading by refactoring dataframe operations

5c589f9

refactor: optimize cohort label generation and period extraction in _…

264a2ff

…build_cohorts_from_sample_grouping

joshitha1808 marked this pull request as ready for review February 26, 2026 14:40

This was referenced Mar 19, 2026

Optimization: Replace slow DataFrame .apply() iterations with vectorized operations for metadata loading #926

Closed

Optimization: Replace slow DataFrame apply() iterations with vectorized operations for metadata loading #1155

Merged

Merge branch 'master' into optimize-metadata-loading

beac0fe

jonbrenas approved these changes Apr 22, 2026

View reviewed changes

Merge branch 'master' into optimize-metadata-loading

1d8e5e7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize metadata loading by refactoring dataframe operations#957

optimize metadata loading by refactoring dataframe operations#957
joshitha1808 wants to merge 4 commits intomalariagen:masterfrom
joshitha1808:optimize-metadata-loading

joshitha1808 commented Feb 26, 2026 •

edited

Loading

Uh oh!

joshitha1808 commented Feb 26, 2026 •

edited

Loading

Uh oh!

31puneet commented Feb 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

joshitha1808 commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joshitha1808 commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

31puneet commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joshitha1808 commented Feb 26, 2026 •

edited

Loading

joshitha1808 commented Feb 26, 2026 •

edited

Loading

31puneet commented Feb 27, 2026 •

edited

Loading