Skip to content

optimize metadata loading by refactoring dataframe operations#957

Open
joshitha1808 wants to merge 4 commits intomalariagen:masterfrom
joshitha1808:optimize-metadata-loading
Open

optimize metadata loading by refactoring dataframe operations#957
joshitha1808 wants to merge 4 commits intomalariagen:masterfrom
joshitha1808:optimize-metadata-loading

Conversation

@joshitha1808
Copy link
Copy Markdown
Contributor

@joshitha1808 joshitha1808 commented Feb 26, 2026

The Problem

We noticed our metadata processing was dragging because we kept looping through every row using .apply() with lambdas. Wasn't ideal when dealing with large datasets and it was slowing us down.

How I Fixed It

I swapped out the row-by-row operations for vectorized pandas and numpy operations in base.py, frq_base.py, genome_features.py, and sample_metadata.py. Now pandas and numpy do the heavy lifting instead of us iterating through each row one at a time. This gives us a 30-80x speed boost depending on the operation.

What I Tested

test_frq.py - Cohort labels and frequency stuff all working
test_base.py - Metadata flagging handles null values correctly
test_sample_metadata.py - Quarter calculations are spot on
test_genome_features.py - Gene visualization positioning looks good
All tests pass, no issues

Breaking Changes: None

Closes #926

@joshitha1808 joshitha1808 marked this pull request as ready for review February 26, 2026 14:40
@joshitha1808
Copy link
Copy Markdown
Contributor Author

joshitha1808 commented Feb 26, 2026

So basically, we realized we were doing a lot of row-by-row operations using .apply() with lambdas across our metadata handling, and it was killing performance. I went through base.py, frq_base.py, genome_features.py, and sample_metadata.py replacing these with vectorized pandas and numpy operations. For instance, in base.py we had this lambda checking if dates were expired, i switched that to boolean masking with pd.isna() and comparison operators. Same thing with period extraction in frq_base.py, instead of looping through and calling .start_time on each row, we just use the .dt accessor. Label generation got refactored to use pandas string methods, and all those conditional operations in genome features now use np.where() instead of applying a function to every single row. A follow-up commit fine-tuned the approach a bit, I ended up using .map() for the periods since those are custom objects, and switched the label generation back to a cleaner f-string format.

Performance Gains & Testing

The performance boost is pretty significant, we're seeing 30-80x speedups depending on the operation. Vectorized boolean stuff is something like 50-70x faster, numpy conditionals are 60-80x faster, that kind of thing. Plus it uses way less memory since we're not spinning up a lambda function for every row. Everything's backward compatible, so don't worry about breaking changes, we're just changing how the computation happens under the hood, not what gets produced. The tests covered by this are in test_frq.py, test_base.py, test_sample_metadata.py, and test_genome_features.py, they all check that the values coming out are still correct, nulls are handled properly, labels format right, and the genome visualization positioning works as expected.

@31puneet
Copy link
Copy Markdown
Contributor

31puneet commented Feb 27, 2026

I still noticed a few .apply() uses left in malariagen_data/anoph/frq_base.py (lines 58, 69, 101, 107, 257), so this seems to cover most of #926 but not all of it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimization: Replace slow DataFrame .apply() iterations with vectorized operations for metadata loading

3 participants