@hjn0415a hjn0415a commented Feb 4, 2026

This PR adds a new GO Enrichment Analysis page for ProteomicsLFQ results.
The page allows users to perform GO term enrichment (BP, CC, MF) based on protein-level differential abundance results.

  • Added a new Streamlit results page: results_proteomicslfq.py
  • Integrated GO enrichment analysis using MyGene.info for GO annotation
  • Foreground proteins are selected based on configurable p-value and |log2FC| thresholds
  • Enrichment is computed using Fisher’s exact test
  • Results are visualized as bar plots and tables, separated by GO category (BP / CC / MF)
  • Added mygene as a new dependency
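
As a rough illustration of the foreground selection described above, here is a minimal standard-library sketch. The column names `UniProt`, `p-value`, and `log2FC` come from the code quoted later in this review. The list-of-dicts layout, the `select_foreground` helper, and the default cutoffs are assumptions made for the example; the PR itself works on a pandas DataFrame with slider-driven thresholds.

```python
import math

def select_foreground(rows, p_cutoff=0.05, fc_cutoff=1.0):
    """Return unique UniProt IDs with p < p_cutoff and |log2FC| >= fc_cutoff.

    `rows` is a hypothetical list of dicts standing in for the results table.
    """
    selected = []
    seen = set()
    for row in rows:
        uid = row.get("UniProt")
        p = row.get("p-value")
        fc = row.get("log2FC")
        # Skip rows with missing identifiers or statistics.
        if uid is None or p is None or fc is None:
            continue
        if math.isnan(p) or math.isnan(fc):
            continue
        if p < p_cutoff and abs(fc) >= fc_cutoff and uid not in seen:
            seen.add(uid)
            selected.append(uid)
    return selected

rows = [
    {"UniProt": "P1", "p-value": 0.001, "log2FC": 2.3},
    {"UniProt": "P2", "p-value": 0.20, "log2FC": 3.0},   # fails the p-value cutoff
    {"UniProt": "P3", "p-value": 0.01, "log2FC": -1.5},  # passes: absolute fold change is used
    {"UniProt": "P4", "p-value": 0.01, "log2FC": 0.4},   # fails the fold-change cutoff
    {"UniProt": None, "p-value": 0.001, "log2FC": 2.0},  # missing ID is dropped
]
foreground = select_foreground(rows)
print(foreground)
```

Down-regulated proteins are kept because the filter uses the absolute value of log2FC, which matches the `|log2FC|` wording in the description.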

Summary by CodeRabbit

  • New Features
    • Added "GO Terms" page to the Results section for protein analysis.
    • Implemented Gene Ontology enrichment analysis with adjustable p-value and log2FC thresholds.
    • Results displayed across three tabs with visualizations and detailed data tables.

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

A new GO enrichment workflow is added to the ProteomicsLFQ results interface. The feature allows users to adjust p-value and log2FC thresholds, fetches UniProt GO terms via MyGene.info, performs Fisher's exact test enrichment analysis, and displays results across three GO term categories. A new dependency on mygene is introduced.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| UI Routing: app.py | Added new "GO Terms" page (🧪 icon) to Results section, referencing the new results_proteomicslfq module. |
| GO Enrichment Workflow: content/results_proteomicslfq.py | New module implementing Streamlit-based ProteomicsLFQ results interface with abundance data display, GO enrichment pipeline including threshold adjustment, MyGene.info API integration, protein set construction, Fisher's exact test analysis, and results visualization across BP/CC/MF tabs. |
| Dependencies: requirements.txt | Added mygene package dependency for UniProt GO term retrieval. |

Sequence Diagram

sequenceDiagram
    actor User
    participant Streamlit as Streamlit UI
    participant DataHandler as Data Handler
    participant MyGeneAPI as MyGene.info API
    participant Stats as Statistical Engine
    participant Viz as Visualization

    User->>Streamlit: Open ProteomicsLFQ Results
    Streamlit->>DataHandler: Retrieve abundance data
    DataHandler-->>Streamlit: Protein abundance table
    Streamlit->>User: Display table & GO Enrichment form

    User->>Streamlit: Adjust p-value/log2FC thresholds
    User->>Streamlit: Run GO Enrichment
    Streamlit->>DataHandler: Filter proteins by thresholds
    DataHandler-->>Streamlit: Filtered foreground & background sets

    Streamlit->>MyGeneAPI: Fetch GO terms for proteins
    MyGeneAPI-->>Streamlit: GO annotations (BP/CC/MF)
    
    Streamlit->>Stats: Perform Fisher's exact test per GO type
    Stats-->>Streamlit: Enrichment p-values & statistics
    
    Streamlit->>Viz: Generate results (bars & tables)
    Viz-->>Streamlit: Three tabs with visualizations
    Streamlit->>User: Display enrichment results

Poem

🐰 Hops of joy through GO's grand garden,
Proteomics bloom with enrichment's pardon,
Fisher's test dances, MyGene API sings,
Three tabs of terms on biology's wings!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically summarizes the main change: adding a GO enrichment analysis page for ProteomicsLFQ results, which aligns with all the modifications in the changeset. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@content/results_proteomicslfq.py`:
- Around line 68-117: The foreground/background counts are using all proteins
(bg_ids/fg_ids) even if MyGene returned no annotation, so update
run_go_enrichment to first compute annotated_ids = set(res["query"].astype(str))
(or otherwise derive the set of IDs present in the filtered res) and then
replace bg_set and fg_set with their intersections with annotated_ids before
computing N_bg/N_fg and running the Fisher tests; keep building go2bg/go2fg from
res rows as-is so counts and p-values reflect only annotated proteins.
🧹 Nitpick comments (4)
requirements.txt (1)

152-152: Consider pinning mygene for deterministic builds.

requirements.txt is generated by pip-compile, but mygene is unpinned. Align it with the rest of the lockfile by re-running pip-compile or pinning a version to avoid non-reproducible installs.

content/results_proteomicslfq.py (3)

45-50: Wrap the GO enrichment UI in @st.fragment to avoid full reruns.

This keeps slider/button interactions from re-running the entire page. As per coding guidelines, **/*.py: Use @st.fragment decorator for interactive UI updates without full page reloads.

Suggested refactor (skeleton)
+@st.fragment
+def go_enrichment_panel(pivot_df):
     st.subheader("🧬 GO Enrichment Analysis")
     p_cutoff = st.slider(...)
     fc_cutoff = st.slider(...)
     if st.button("Run GO Enrichment"):
         ...
+
+go_enrichment_panel(pivot_df)

55-65: Avoid blind Exception catches to improve debuggability.

Catching broad exceptions hides unexpected failures. Consider narrowing to the likely exceptions (e.g., AttributeError/IndexError in parsing and request-related exceptions around the API call) or re-raise after logging.

Also applies to: 140-141
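
To make the suggestion concrete, here is a minimal sketch of a narrowed handler. The code at lines 55-65 is not shown in this review, so the `parse_accession` helper and the `db|ACCESSION|NAME` identifier format are assumptions chosen only to illustrate the pattern.

```python
import logging

logger = logging.getLogger(__name__)

def parse_accession(protein_id):
    """Extract the accession from a 'db|ACCESSION|NAME' style identifier.

    Hypothetical helper: the identifier format is assumed for illustration.
    """
    try:
        return protein_id.split("|")[1]
    except (AttributeError, IndexError) as exc:
        # AttributeError: the value is not a string (for example None or a NaN float).
        # IndexError: the string has no '|'-separated accession field.
        logger.warning("Could not parse %r: %s", protein_id, exc)
        return None

print(parse_accession("sp|P02768|ALBU_HUMAN"))
print(parse_accession("P02768"))
print(parse_accession(None))
```

Any other exception type still propagates, so an unexpected failure surfaces with a traceback instead of being silently swallowed.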


134-137: Use streamlit_plotly_events for interactive Plotly charts.

Right now the chart is displayed but you aren’t capturing interactions. Consider using plotly_events to support click/selection actions. As per coding guidelines, **/*.py: Use Plotly and streamlit_plotly_events for interactive visualizations.

Example integration
+from streamlit_plotly_events import plotly_events
 ...
-                                st.plotly_chart(fig, use_container_width=True)
+                                selected = plotly_events(fig, click_event=True, select_event=True)

Comment on lines +68 to +117
bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
fg_ids = analysis_df[
    (analysis_df["p-value"] < p_cutoff) &
    (analysis_df["log2FC"].abs() >= fc_cutoff)
]["UniProt"].dropna().unique().tolist()

if len(fg_ids) < 3:
    st.warning(f"Not enough significant proteins (p < {p_cutoff}, |log2FC| ≥ {fc_cutoff}). Found: {len(fg_ids)}")
else:
    res_list = mg.querymany(bg_ids, scopes="uniprot", fields="go", as_dataframe=False)
    res = pd.DataFrame(res_list)
    if "notfound" in res.columns:
        res = res[res["notfound"] != True]

    def extract_go_terms(go_data, go_type):
        if not isinstance(go_data, dict) or go_type not in go_data:
            return []
        terms = go_data[go_type]
        if isinstance(terms, dict):
            terms = [terms]
        return list({t.get("term") for t in terms if "term" in t})

    for go_type in ["BP", "CC", "MF"]:
        res[f"{go_type}_terms"] = res["go"].apply(lambda x: extract_go_terms(x, go_type))

    fg_set = set(fg_ids)
    bg_set = set(bg_ids)

    def run_go_enrichment(go_type):
        go2fg = defaultdict(set)
        go2bg = defaultdict(set)
        for _, row in res.iterrows():
            uid = str(row["query"])
            for term in row[f"{go_type}_terms"]:
                go2bg[term].add(uid)
                if uid in fg_set:
                    go2fg[term].add(uid)

        records = []
        N_fg = len(fg_set)
        N_bg = len(bg_set)
        for term, fg_genes in go2fg.items():
            a = len(fg_genes)
            if a == 0:
                continue
            b = N_fg - a
            c = len(go2bg[term]) - a
            d = N_bg - (a + b + c)
            _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
            records.append({"GO_Term": term, "Count": a, "GeneRatio": f"{a}/{N_fg}", "p_value": p})

⚠️ Potential issue | 🟠 Major

Foreground/background counts include unannotated proteins, biasing Fisher p-values.

N_bg/N_fg are computed from all proteins, even those without GO annotations. This inflates the background and can understate enrichment. Restrict both sets to annotated proteins returned by MyGene before computing Fisher’s exact test.

Proposed fix
                     bg_ids = analysis_df["UniProt"].dropna().unique().tolist()
                     fg_ids = analysis_df[
                         (analysis_df["p-value"] < p_cutoff) &
                         (analysis_df["log2FC"].abs() >= fc_cutoff)
                     ]["UniProt"].dropna().unique().tolist()
 ...
-                        fg_set = set(fg_ids)
-                        bg_set = set(bg_ids)
+                        annotated_ids = set(res["query"].astype(str))
+                        bg_set = annotated_ids
+                        fg_set = annotated_ids.intersection(map(str, fg_ids))
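
To show how the choice of universe changes the result, here is a small worked sketch with made-up counts. It does not call SciPy: `fisher_greater` is a hypothetical standard-library helper that computes the one-sided ("greater") Fisher p-value as a hypergeometric upper tail, which is the same quantity the quoted code requests from `fisher_exact(..., alternative="greater")`. The table layout `[[a, b], [c, d]]` follows the quoted code.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided (greater) Fisher's exact p-value for the table [[a, b], [c, d]].

    Computed as the hypergeometric upper tail P(X >= a), where X is the
    number of term-carrying proteins that land in the foreground.
    """
    n_total = a + b + c + d
    row1 = a + b   # foreground size
    col1 = a + c   # proteins carrying the term
    max_a = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n_total - col1, row1 - k)
        for k in range(a, max_a + 1)
    )
    return numerator / comb(n_total, row1)

# Made-up counts: 20 proteins carry the term, 8 of them are in the foreground.
term_total, a = 20, 8

# Universe = all measured proteins (50 foreground out of 1000).
n_fg_all, n_bg_all = 50, 1000
p_all = fisher_greater(a, n_fg_all - a, term_total - a, n_bg_all - n_fg_all - (term_total - a))

# Universe = annotated proteins only (40 foreground out of 600).
n_fg_ann, n_bg_ann = 40, 600
p_annotated = fisher_greater(a, n_fg_ann - a, term_total - a, n_bg_ann - n_fg_ann - (term_total - a))

print(f"all proteins:   p = {p_all:.3g}")
print(f"annotated only: p = {p_annotated:.3g}")
```

With these particular numbers the all-protein universe makes the term look more significant than the annotated-only universe does. A different split of unannotated proteins between foreground and background can shift the p-value the other way, which is why both sets should be restricted to annotated proteins before testing.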
🧰 Tools
🪛 Ruff (0.14.14)

[error] 80-80: Avoid inequality comparisons to True; use not res["notfound"]: for false checks

Replace with not res["notfound"]

(E712)


[warning] 91-91: Function definition does not bind loop variable go_type

(B023)
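
For context on B023, here is a minimal sketch of the late-binding behaviour the rule guards against, together with the usual default-argument fix. The lists and lambdas are illustrative only. In the quoted code the lambda is consumed by `.apply` within the same loop iteration, so the warning is probably harmless there, but binding the variable explicitly silences the check and protects against later refactoring.

```python
# Late binding: each lambda looks up go_type when it is called,
# which happens after the loop has finished.
late = []
for go_type in ["BP", "CC", "MF"]:
    late.append(lambda term: f"{go_type}:{term}")

# Binding the loop variable as a default argument captures its
# value at definition time.
bound = []
for go_type in ["BP", "CC", "MF"]:
    bound.append(lambda term, go_type=go_type: f"{go_type}:{term}")

print([f("x") for f in late])   # ['MF:x', 'MF:x', 'MF:x']
print([f("x") for f in bound])  # ['BP:x', 'CC:x', 'MF:x']
```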

