
docs: Add real-world examples of Curator for Nemotron datasets and SE…#1847

Open
Pritiks23 wants to merge 10 commits into NVIDIA-NeMo:main from Pritiks23:main

@Pritiks23

Description

Adds real-world examples to the documentation showing how NeMo Curator is used to build Nemotron datasets, including LANL/NVIDIA collaboration and SES AI Chemistry LLM use cases.
Closes #1548.

Usage

See docs/about/index.md for real-world usage scenarios and references.

## Checklist
<!--
Note: All commits need to be signed and signed off. This can be done via the `-sS` flags while committing:
`git commit -sS -m "...."`
-->
- [Y ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA-NeMo/Curator/blob/main/CONTRIBUTING.md).
- [ ] New or Existing tests cover these changes.
- [Y ] The documentation is up to date with these changes.

Pritiks23 requested a review from a team as a code owner on April 21, 2026 22:46

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR adds a "Curator in Action" section to docs/about/index.md showcasing real-world NeMo Curator use cases (LANL/NVIDIA ICF collaboration and SES AI Chemistry LLM), along with minor copy fixes in the How It Works section.

  • The new content block was inserted at line 57 inside an unclosed :::{grid-item-card} directive, and the original Image/Video/Audio card content (:link: options, bodies, closers) was left duplicated outside the grid from line 98 onward — both issues have been flagged in prior review rounds and remain unresolved in the current head commit.

Confidence Score: 3/5

Not safe to merge — the directive structure in the Concepts grid is broken and orphaned card content remains in the file.

Two P1 structural issues flagged in prior review rounds remain unresolved: a duplicate Image card opener (lines 57-58) and orphaned :link: options plus duplicate card directives outside the grid (lines 98-118). These will cause Sphinx-Design rendering failures for the Concepts section. The new prose content itself is correct and the referenced Nemotron-CC link resolves to a valid file.

docs/about/index.md — lines 57-58 (duplicate opener) and lines 98-118 (orphaned directives) require cleanup before the page renders correctly.
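The structural problems described above (an opener that is never closed, `:link:` options outside any card, and a stray closing fence) can be caught before review with a small script. The following is a minimal sketch, not Sphinx's or MyST's actual parser: the function name `check_colon_fences` is hypothetical, and it assumes a simplified rule in which a closing fence must have exactly as many colons as its opener and nested directives use fewer colons than their parent.

```python
import re

OPENER = re.compile(r"^(:{3,})\{([\w-]+)\}")  # e.g. :::{grid-item-card} Title
CLOSER = re.compile(r"^(:{3,})\s*$")           # e.g. ::: or ::::
OPTION = re.compile(r"^:[\w-]+:")              # e.g. :link: about-concepts-image


def check_colon_fences(text):
    """Return sorted (line_number, message) pairs for suspicious fence structure.

    Simplified model (an assumption, not the MyST specification): closers must
    match their opener's colon count exactly, and children use fewer colons.
    """
    problems = []
    stack = []  # open directives as (colon_count, line_number, name)

    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        opener = OPENER.match(line)
        closer = CLOSER.match(line)

        if opener:
            colons = len(opener.group(1))
            # A sibling or parent-level opener means the open ones were never closed.
            while stack and colons >= stack[-1][0]:
                _, open_line, name = stack.pop()
                problems.append((open_line, f"{name} opened here is never closed"))
            stack.append((colons, lineno, opener.group(2)))
        elif closer:
            colons = len(closer.group(1))
            # A wider closer skips past any narrower directives still open.
            while stack and stack[-1][0] < colons:
                _, open_line, name = stack.pop()
                problems.append((open_line, f"{name} opened here is never closed"))
            if stack and stack[-1][0] == colons:
                stack.pop()
            else:
                problems.append((lineno, "closing fence has no matching opener"))
        elif OPTION.match(line) and not stack:
            problems.append((lineno, "directive option outside any directive"))

    # Anything still open at end of file was never closed.
    for _, open_line, name in stack:
        problems.append((open_line, f"{name} opened here is never closed"))
    return sorted(problems)
```

Running it over the contents of `docs/about/index.md` and printing each pair would give a quick list of line numbers to inspect. It does not replace a real Sphinx build, which remains the authoritative check.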

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/about/index.md | New "Curator in Action" section added with valid prose and link, but the insertion left a duplicate Image card opener (lines 57-58) and orphaned :link:/card directives (lines 98-118) outside any grid container, breaking the Sphinx-Design directive structure. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["docs/about/index.md"] --> B["## Concepts grid block\n(lines 47-79)"]
    B --> C["Text card ✅"]
    B --> D["Image card opener ×2\n(lines 57-58) ⚠️"]
    D --> E["New: Image card with options ✅"]
    D --> F["New: Video card ✅"]
    D --> G["New: Audio card ✅"]
    D --> H["Grid closer ::::"]
    A --> I["## Curator in Action section\n(lines 81-97) ✅ new content"]
    A --> J["Orphaned Image :link: options\n(lines 98-99) ⚠️"]
    A --> K["Orphaned Image/Video/Audio cards\n(lines 101-118) ⚠️"]

Reviews (5): Last reviewed commit: "Update docs/about/index.md"

Comment thread docs/about/index.md

jgerh left a comment

Completed tech pubs review and provided a few copyedits

Comment thread docs/about/index.md Outdated (5 threads)
Comment thread docs/about/index.md
Comment on lines +97 to +117
:link: about-concepts-image
:link-type: ref

Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering, deduplication), and dataset export.
:::

:::{grid-item-card} {octicon}`video;1.5em;sd-mr-1` Video Curation Concepts
:link: about-concepts-video
:link-type: ref

Discover video data curation concepts, such as distributed processing, pipeline stages, execution modes, and efficient data flow.
:::

:::{grid-item-card} {octicon}`unmute;1.5em;sd-mr-1` Audio Curation Concepts
:link: about-concepts-audio
:link-type: ref

Learn about speech data curation, ASR inference, quality assessment, and audio-text integration workflows.
:::

::::

Verify the orphaned, duplicated grid-card markup: the cards from lines 58-78 appear to have been left appended after the SES AI paragraph, at lines 97-117.
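One way to do that verification is to list every card opener that appears more than once in the file. This is a minimal sketch under the assumption that duplicated cards repeat their opener line verbatim. The helper name `duplicated_card_openers` and its `marker` parameter are hypothetical, not part of any Curator or Sphinx tooling.

```python
from collections import defaultdict


def duplicated_card_openers(text, marker=":::{grid-item-card}"):
    """Map each repeated card opener line to the line numbers where it occurs."""
    seen = defaultdict(list)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if line.startswith(marker):
            seen[line].append(lineno)
    # Keep only openers that occur more than once.
    return {line: nums for line, nums in seen.items() if len(nums) > 1}
```

An empty result means no opener is repeated verbatim. It does not rule out orphaned options or stray closing fences, which need a separate check.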

svcnvidia-nemo-ci added the waiting-on-customer label (Waiting on the original author to respond) on Apr 22, 2026
Pritiks23 and others added 6 commits April 22, 2026 14:46
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>

Labels

community-request, waiting-on-customer (Waiting on the original author to respond)

Development

Successfully merging this pull request may close these issues.

Docs - Add how Curator is used to build Nemotron datasets

3 participants