feat: add GO enrichment analysis page for ProteomicsLFQ results #8
📝 Walkthrough
A new GO enrichment workflow is added to the ProteomicsLFQ results interface. The feature lets users adjust p-value and log2FC thresholds, fetches GO terms for UniProt accessions via MyGene.info, performs Fisher's exact test enrichment analysis, and displays results across three GO term categories (BP, CC, MF). A new dependency on mygene is introduced.
Sequence Diagram
sequenceDiagram
actor User
participant Streamlit as Streamlit UI
participant DataHandler as Data Handler
participant MyGeneAPI as MyGene.info API
participant Stats as Statistical Engine
participant Viz as Visualization
User->>Streamlit: Open ProteomicsLFQ Results
Streamlit->>DataHandler: Retrieve abundance data
DataHandler-->>Streamlit: Protein abundance table
Streamlit->>User: Display table & GO Enrichment form
User->>Streamlit: Adjust p-value/log2FC thresholds
User->>Streamlit: Run GO Enrichment
Streamlit->>DataHandler: Filter proteins by thresholds
DataHandler-->>Streamlit: Filtered foreground & background sets
Streamlit->>MyGeneAPI: Fetch GO terms for proteins
MyGeneAPI-->>Streamlit: GO annotations (BP/CC/MF)
Streamlit->>Stats: Perform Fisher's exact test per GO type
Stats-->>Streamlit: Enrichment p-values & statistics
Streamlit->>Viz: Generate results (bars & tables)
Viz-->>Streamlit: Three tabs with visualizations
Streamlit->>User: Display enrichment results
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@content/results_proteomicslfq.py`:
- Around line 68-117: The foreground/background counts are using all proteins
(bg_ids/fg_ids) even if MyGene returned no annotation, so update
run_go_enrichment to first compute annotated_ids = set(res["query"].astype(str))
(or otherwise derive the set of IDs present in the filtered res) and then
replace bg_set and fg_set with their intersections with annotated_ids before
computing N_bg/N_fg and running the Fisher tests; keep building go2bg/go2fg from
res rows as-is so counts and p-values reflect only annotated proteins.
🧹 Nitpick comments (4)
requirements.txt (1)
152-152: Consider pinning mygene for deterministic builds.
requirements.txt is generated by pip-compile, but mygene is unpinned. Align it with the rest of the lockfile by re-running pip-compile or pinning a version to avoid non-reproducible installs.
content/results_proteomicslfq.py (3)
45-50: Wrap the GO enrichment UI in @st.fragment to avoid full reruns.
This keeps slider/button interactions from re-running the entire page. As per coding guidelines, **/*.py: Use @st.fragment decorator for interactive UI updates without full page reloads.
Suggested refactor (skeleton)
+@st.fragment
+def go_enrichment_panel(pivot_df):
     st.subheader("🧬 GO Enrichment Analysis")
     p_cutoff = st.slider(...)
     fc_cutoff = st.slider(...)
     if st.button("Run GO Enrichment"):
         ...
+
+go_enrichment_panel(pivot_df)
55-65: Avoid blind Exception catches to improve debuggability.
Catching broad exceptions hides unexpected failures. Consider narrowing to the likely exceptions (e.g., AttributeError/IndexError in parsing and request-related exceptions around the API call) or re-raise after logging.
Also applies to: 140-141
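A minimal sketch of the narrowing suggested above. The code at lines 55-65 is not shown in this conversation, so the helper name `parse_uniprot_accession` and the `sp|ACCESSION|NAME` input format are assumptions made purely for illustration; only the idea of catching `AttributeError`/`IndexError` instead of a bare `Exception` comes from the comment.

```python
import logging

logger = logging.getLogger(__name__)


def parse_uniprot_accession(protein_field):
    """Return the accession from a 'sp|P12345|NAME' style string, or None.

    Hypothetical helper: the real parsing code in the PR may differ.
    """
    try:
        # AttributeError: the field is not a string (e.g. None or NaN).
        # IndexError: the field has no '|' separated accession part.
        return protein_field.split("|")[1]
    except (AttributeError, IndexError):
        logger.warning("Could not parse UniProt accession from %r", protein_field)
        return None


parsed = [
    parse_uniprot_accession("sp|P12345|ALBU_HUMAN"),
    parse_uniprot_accession("P12345"),
    parse_uniprot_accession(None),
]
```

Any other exception type (for example a `TypeError` from an unexpected call signature) would now propagate instead of being silently swallowed, which is the debuggability gain the comment is after.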
134-137: Use streamlit_plotly_events for interactive Plotly charts.
Right now the chart is displayed but you aren’t capturing interactions. Consider using plotly_events to support click/selection actions. As per coding guidelines, **/*.py: Use Plotly and streamlit_plotly_events for interactive visualizations.
Example integration
+from streamlit_plotly_events import plotly_events
 ...
- st.plotly_chart(fig, use_container_width=True)
+ selected = plotly_events(fig, click_event=True, select_event=True)
+ st.plotly_chart(fig, use_container_width=True)
bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
fg_ids = analysis_df[
    (analysis_df["p-value"] < p_cutoff) &
    (analysis_df["log2FC"].abs() >= fc_cutoff)
]["UniProt"].dropna().unique().tolist()

if len(fg_ids) < 3:
    st.warning(f"Not enough significant proteins (p < {p_cutoff}, |log2FC| ≥ {fc_cutoff}). Found: {len(fg_ids)}")
else:
    res_list = mg.querymany(bg_ids, scopes="uniprot", fields="go", as_dataframe=False)
    res = pd.DataFrame(res_list)
    if "notfound" in res.columns:
        res = res[res["notfound"] != True]

    def extract_go_terms(go_data, go_type):
        if not isinstance(go_data, dict) or go_type not in go_data:
            return []
        terms = go_data[go_type]
        if isinstance(terms, dict):
            terms = [terms]
        return list({t.get("term") for t in terms if "term" in t})

    for go_type in ["BP", "CC", "MF"]:
        res[f"{go_type}_terms"] = res["go"].apply(lambda x: extract_go_terms(x, go_type))

    fg_set = set(fg_ids)
    bg_set = set(bg_ids)

    def run_go_enrichment(go_type):
        go2fg = defaultdict(set)
        go2bg = defaultdict(set)
        for _, row in res.iterrows():
            uid = str(row["query"])
            for term in row[f"{go_type}_terms"]:
                go2bg[term].add(uid)
                if uid in fg_set:
                    go2fg[term].add(uid)

        records = []
        N_fg = len(fg_set)
        N_bg = len(bg_set)
        for term, fg_genes in go2fg.items():
            a = len(fg_genes)
            if a == 0:
                continue
            b = N_fg - a
            c = len(go2bg[term]) - a
            d = N_bg - (a + b + c)
            _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
            records.append({"GO_Term": term, "Count": a, "GeneRatio": f"{a}/{N_fg}", "p_value": p})
Foreground/background counts include unannotated proteins, biasing Fisher p-values.
N_bg/N_fg are computed from all proteins, even those without GO annotations. This inflates the background and can understate enrichment. Restrict both sets to annotated proteins returned by MyGene before computing Fisher’s exact test.
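To make the effect concrete, here is a small standard-library sketch with toy data. It is not the PR's code: `fisher_greater` is a stand-in for `fisher_exact(..., alternative="greater")` computed from the hypergeometric tail, and `term_p_value`, the protein IDs, and the term membership are all hypothetical. The 2x2 table uses the same a/b/c/d layout as the reviewed code.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided (greater) Fisher p-value for the table [[a, b], [c, d]]."""
    n_total = a + b + c + d
    n_term = a + c  # proteins carrying the GO term
    n_fg = a + b    # foreground size
    upper = min(n_term, n_fg)
    # P(X >= a) for X ~ Hypergeometric(n_total, n_term, n_fg)
    tail = sum(
        comb(n_term, k) * comb(n_total - n_term, n_fg - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_fg)


def term_p_value(term_ids, fg_ids, bg_ids, annotated_ids=None):
    """Enrichment p-value for one term, optionally restricted to annotated IDs."""
    bg = set(bg_ids)
    if annotated_ids is not None:
        bg &= set(annotated_ids)
    fg = set(fg_ids) & bg
    term = set(term_ids) & bg
    a = len(term & fg)
    b = len(fg) - a
    c = len(term) - a
    d = len(bg) - (a + b + c)
    return fisher_greater(a, b, c, d)


# Toy data (hypothetical): 10 annotated and 10 unannotated proteins.
annotated = [f"P{i}" for i in range(1, 11)]
unannotated = [f"U{i}" for i in range(1, 11)]
foreground = ["P1", "P2", "P3", "P4"]
term_members = ["P1", "P2", "P3"]

p_inflated = term_p_value(term_members, foreground, annotated + unannotated)
p_restricted = term_p_value(
    term_members, foreground, annotated + unannotated, annotated_ids=annotated
)
```

In this toy case the unannotated proteins only enlarge the background, so the unrestricted p-value comes out smaller than the restricted one. If unannotated proteins also sit in the foreground the shift can go the other way, which is why restricting both sets to the annotated universe gives the more defensible table.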
Proposed fix
- bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
+ bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
fg_ids = analysis_df[
(analysis_df["p-value"] < p_cutoff) &
(analysis_df["log2FC"].abs() >= fc_cutoff)
]["UniProt"].dropna().unique().tolist()
...
- fg_set = set(fg_ids)
- bg_set = set(bg_ids)
+ annotated_ids = set(res["query"].astype(str))
+ bg_set = annotated_ids
+ fg_set = annotated_ids.intersection(map(str, fg_ids))
🧰 Tools
🪛 Ruff (0.14.14)
[error] 80-80: Avoid inequality comparisons to True; use not res["notfound"]: for false checks
Replace with not res["notfound"]
(E712)
[warning] 91-91: Function definition does not bind loop variable go_type
(B023)
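The B023 warning refers to Python's late binding of closures: a lambda created in a loop looks up the loop variable when it is called, not when it is defined. The sketch below uses hypothetical helper names and sample data to show the pitfall and the common default-argument fix. In the reviewed loop the lambda appears to be consumed by `apply` within the same iteration, so the warning may be harmless there, but binding the variable explicitly removes the ambiguity.

```python
def build_extractors_late_binding(go_types):
    # Every lambda reads go_type at call time, after the loop has finished,
    # so all of them see the last value.
    return [lambda go: go.get(go_type, []) for go_type in go_types]


def build_extractors_bound(go_types):
    # The default argument captures the current go_type at definition time.
    return [lambda go, go_type=go_type: go.get(go_type, []) for go_type in go_types]


sample_go = {"BP": ["transport"], "CC": ["membrane"], "MF": ["binding"]}
go_types = ["BP", "CC", "MF"]

late = [extract(sample_go) for extract in build_extractors_late_binding(go_types)]
bound = [extract(sample_go) for extract in build_extractors_bound(go_types)]
```

Here `late` holds the MF terms three times, while `bound` holds the BP, CC, and MF terms in order.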
This PR adds a new GO Enrichment Analysis page for ProteomicsLFQ results.
The page allows users to perform GO term enrichment (BP, CC, MF) based on protein-level differential abundance results.