optimize metadata loading by refactoring dataframe operations#957
optimize metadata loading by refactoring dataframe operations#957joshitha1808 wants to merge 4 commits intomalariagen:masterfrom
Conversation
…build_cohorts_from_sample_grouping
|
So basically, we realized we were doing a lot of row-by-row operations using Performance Gains & Testing The performance boost is pretty significant, we're seeing 30-80x speedups depending on the operation. Vectorized boolean stuff is something like 50-70x faster, numpy conditionals are 60-80x faster, that kind of thing. Plus it uses way less memory since we're not spinning up a lambda function for every row. Everything's backward compatible, so don't worry about breaking changes, we're just changing how the computation happens under the hood, not what gets produced. The tests covered by this are in |
|
I still noticed a few |
The Problem
We noticed our metadata processing was dragging because we kept looping through every row using
.apply()with lambdas. Wasn't ideal when dealing with large datasets and it was slowing us down.How I Fixed It
I swapped out the row-by-row operations for vectorized pandas and numpy operations in
base.py,frq_base.py,genome_features.py, andsample_metadata.py. Now pandas and numpy do the heavy lifting instead of us iterating through each row one at a time. This gives us a 30-80x speed boost depending on the operation.What I Tested
test_frq.py- Cohort labels and frequency stuff all workingtest_base.py- Metadata flagging handles null values correctlytest_sample_metadata.py- Quarter calculations are spot ontest_genome_features.py- Gene visualization positioning looks goodAll tests pass, no issues
Breaking Changes: None
Closes #926