Skip to content

Add translation#1844

Merged
ayushdg merged 21 commits into
NVIDIA-NeMo:mainfrom
rkalaniNV:rkalani/translate
May 6, 2026
Merged

Add translation#1844
ayushdg merged 21 commits into
NVIDIA-NeMo:mainfrom
rkalaniNV:rkalani/translate

Conversation

@rkalaniNV
Copy link
Copy Markdown
Contributor

Description

This PR adds and stabilizes the text translation package in Curator compared to main.

It introduces a reusable translation workflow centered on TranslationPipeline, with explicit stages for segmentation, translation, reassembly, formatting, and
evaluation. The package now supports structured and nested inputs, including wildcard field paths such as messages.*.content, and can emit translated output in
replaced, raw, or both modes.

The branch also adds reusable translation backends and helpers:

  • backend adapters for LLM, Google, AWS, and NMT-based translation
  • prompt-loading, async, metadata, and field-path utilities
  • FAITH-based translation quality evaluation
  • generic text quality metrics via TextQualityMetricStage / compute_text_quality_metric

In addition to adding the package, this branch simplifies and cleans up the implementation:

  • reorganizes the package into translation/stages, translation/evaluation, and translation/utils
  • reduces pipeline.py to pipeline composition only
  • removes redundant round-trip/backtranslation orchestration from the public API
  • keeps backtranslation as normal translation with reversed source/target languages, while exposing metrics as reusable primitives
  • improves structured reassembly and Speaker-style translation metadata for multi-field and wildcard cases
  • preserves valid JSON objects/arrays and tool payloads as non-translatable content during segmentation/translation
  • makes short-text passthrough an explicit min_segment_chars setting instead of hidden cutoff logic
  • improves non-LLM backend behavior by attempting batch translation first and falling back per-segment only on failure

This branch also updates the translation test suite to cover the new module layout and behavior, including segmentation, reassembly, metadata emission, prompt
loading, and backend translation paths.

Usage

from nemo_curator.stages.text.translation import TranslationPipeline

pipeline = TranslationPipeline(
source_lang="en",
target_lang="hi",
text_field="messages.*.content",
output_mode="both",
preserve_segment_pairs=True,
enable_faith_eval=True,
model_name="openai/openai/gpt-5.1",
)

Checklist

@rkalaniNV rkalaniNV requested review from a team as code owners April 21, 2026 16:10
@rkalaniNV rkalaniNV requested review from praateekmahajan and removed request for a team April 21, 2026 16:10
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR introduces a complete experimental translation package for NeMo Curator, organized into segmentation, translation, reassembly, formatting, and FAITH-based evaluation stages, with backends for LLM, Google, AWS, and NMT translation. Several previously-flagged issues (shared mutable stage state, stale aiohttp session, missing fallback passthrough, asyncio.run nesting) have been addressed in this revision.

  • FormatTranslationOutputStage drops output_field before _build_translated_messages reads it, producing empty translated content in raw+reconstruct_messages mode.
  • FaithEvalFilter._score_batch uses or in nonempty_mask, scoring rows where translation silently failed against an empty string and discarding them for the wrong reason.
  • ReassemblyStage._collect_multi_field_outputs keys translation_map by the leaf path component only, causing silent data loss when multiple field paths share the same leaf name (e.g. user.content + assistant.content).

Confidence Score: 3/5

Three independently-triggered data-correctness bugs are present in the core output path; merging without fixing them would produce silently wrong translated content in specific but realistic configurations.

The output_field column is dropped before _build_translated_messages reads it, so the raw+reconstruct_messages combination always produces empty translated messages. The FAITH or mask causes failed translations to be scored against empty strings and filtered out for the wrong reason. The leaf-key collision in translation_map silently discards all but the last same-leaf path's translation.

stages/reassembly.py, evaluation/faith.py, and stages/format_translation_output.py each contain one of the three data-correctness issues and warrant careful attention before merge.

Important Files Changed

Filename Overview
nemo_curator/stages/text/experimental/translation/pipeline.py Composite translation stage with proper validation guards; generation_config is not forwarded to FaithEvalFilter.
nemo_curator/stages/text/experimental/translation/stages/reassembly.py Reassembles translated segments; _leaf_field_key returns only the last path component, causing silent key collision when multiple non-wildcard paths share the same leaf name.
nemo_curator/stages/text/experimental/translation/evaluation/faith.py FAITH scoring stage; nonempty_mask uses or so rows with a non-empty source but empty translation are still scored against an empty string.
nemo_curator/stages/text/experimental/translation/stages/format_translation_output.py Output formatting stage; drops output_field column before _build_translated_messages reads it, yielding empty translated content in raw+reconstruct_messages mode.
nemo_curator/stages/text/experimental/translation/stages/translate.py Segment translation stage with LLM and backend paths; correctly handles passthrough and fallback translation.
nemo_curator/stages/text/experimental/translation/stages/segmentation.py Segmentation stage with coarse and fine modes; spaCy cache correctly keyed by (model_name, max_length) to avoid cross-caller mutation.
nemo_curator/stages/text/experimental/translation/backends/nmt.py NMT backend; stale-session-across-loops issue addressed via _session_loop tracking.
nemo_curator/stages/text/experimental/translation/utils/field_paths.py Wildcard dot-path utilities; terminal-wildcard write-back now correctly handled via _set_list_values.

Sequence Diagram

sequenceDiagram
    participant Input as DocumentBatch
    participant Skip as SkipExistingTranslationsStage
    participant Seg as SegmentationStage
    participant Trans as SegmentTranslationStage
    participant Faith as FaithEvalFilter
    participant Reassembly as ReassemblyStage
    participant FaithFilter as FaithThresholdFilterStage
    participant Restore as RestoreSkippedRowsStage
    participant Format as FormatTranslationOutputStage
    participant Merge as MergeFaithScoresStage

    Input->>Skip: "batch (skip_translated=True)"
    Skip-->>Skip: stash already-translated rows in _metadata
    Skip->>Seg: untranslated rows
    Seg->>Seg: explode rows into segments
    Seg->>Trans: exploded batch
    Trans->>Trans: translate each segment
    Trans->>Faith: segments + translations
    Faith->>Faith: score each segment
    Faith->>Reassembly: scored segments
    Reassembly->>Reassembly: group by _seg_doc_id, reassemble
    Reassembly->>FaithFilter: reassembled docs
    FaithFilter->>Restore: filtered docs
    Restore->>Restore: merge stashed rows back
    Restore->>Format: merged batch
    Format->>Format: build metadata / reconstruct messages
    Format->>Merge: formatted batch
    Merge->>Merge: merge faith scores into metadata
    Merge-->>Input: final DocumentBatch
Loading

Reviews (15): Last reviewed commit: "Merge branch 'main' into rkalani/transla..." | Re-trigger Greptile

Comment on lines +71 to +78
if key == "*":
if isinstance(obj, list):
for item in obj:
if not rest:
pass
else:
_set(item, rest)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 set_nested_fields silently drops values at terminal-wildcard positions

When a path ends with * (i.e. rest is empty after consuming the wildcard key), the code hits pass and never writes the translated value back to the list item. Meanwhile, extract_nested_fields does collect these string items, so the read/write contract is broken: extraction finds the texts, translation runs, but reassembly silently discards the result. Any field path of the form "messages.*" (where messages is a list of strings) would produce an empty translation output.

        if key == "*":
            if isinstance(obj, list):
                for i, item in enumerate(obj):
                    if not rest:
                        # Write back translated values to the list in-place
                        if isinstance(item, str) and value_index[0] < len(values):
                            obj[i] = values[value_index[0]]
                            value_index[0] += 1
                    else:
                        _set(item, rest)
            return

Comment thread nemo_curator/stages/text/translation/stages/formatting.py Outdated
model_name = SPACY_FALLBACK_MODEL
nlp = spacy.load(model_name)

nlp.max_length = 10_000_000 # Handle very long texts (10M chars)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Global spaCy model cache mutated at load time

nlp.max_length = 10_000_000 permanently modifies the model stored in _nlp_cache. Because the cache is a module-level dict shared for the lifetime of the process, the first call to _get_spacy_nlp silently changes the default max_length for every subsequent user of that model in the same process — including any code that imported spaCy independently and expected the default 1 000 000 limit. If a very large text is passed, the model processes it without raising, which may cause excessive memory consumption in unexpected places.

Consider applying max_length only at the call site or documenting this as a known process-wide side effect.

Comment on lines +144 to +146
return asyncio.run(
self.translate_batch_async(texts, source_lang, target_lang)
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 translate_batch calls asyncio.run() which fails inside a running event loop

translate_batch delegates directly to asyncio.run(self.translate_batch_async(...)). asyncio.run() raises RuntimeError: This event loop is already running when called from within an existing event loop (e.g., a Jupyter notebook, a Ray async actor, or anywhere the pipeline's run_async_safe helper is in effect). The same pattern appears in google.py and nmt.py.

For built-in backends the pipeline always calls translate_batch_async via run_async_safe, so the sync path isn't exercised internally. But the translate_batch method is part of the public interface, and any direct caller in an async context will hit this error. Using run_async_safe from async_utils.py here would make the sync wrapper safe in all contexts.

Copy link
Copy Markdown
Contributor

@jgerh jgerh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completed a tech pubs review of .md files and provided a few copyedits.

Comment on lines +81 to +83
- valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim
- when `output_mode="replaced"`, translated values are written back into the original field path
- when `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim
- when `output_mode="replaced"`, translated values are written back into the original field path
- when `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings
- Valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim
- When `output_mode="replaced"`, translated values are written back into the original field path
- When `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings


- `segmentation_mode="coarse"` keeps line-level splitting with code-block awareness
- `segmentation_mode="fine"` uses sentence-level segmentation with structure preservation
- `min_segment_chars` lets you explicitly bypass segmentation for short text instead of relying on a hidden cutoff
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `min_segment_chars` lets you explicitly bypass segmentation for short text instead of relying on a hidden cutoff
- `min_segment_chars` lets you explicitly bypass segmentation for short text rather than relying on an implicit cutoff


### Round-Trip Metrics

Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`:
Backtranslation uses a second translation pass with reversed languages, followed by `TextQualityMetricStage`:


## Notes

- The translation package is designed for pipeline execution. Avoid converting large datasets to Pandas on the driver just to orchestrate translation.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The translation package is designed for pipeline execution. Avoid converting large datasets to Pandas on the driver just to orchestrate translation.
- The translation package is designed for pipeline execution. Avoid converting large datasets to pandas on the driver just to orchestrate translation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## How It Works

Comment thread docs/curate-text/process-data/index.md Outdated
@@ -18,7 +18,7 @@ NeMo Curator provides a comprehensive suite of tools for processing text data as
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## How It Works

Comment thread docs/get-started/text.md Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**Set your Hugging Face token** to avoid rate limiting when downloading models or datasets:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## How It Works

Comment thread docs/curate-text/process-data/index.md Outdated
@@ -18,7 +18,7 @@ NeMo Curator provides a comprehensive suite of tools for processing text data as
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## How It Works

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Apr 21, 2026
@rkalaniNV rkalaniNV requested a review from a team as a code owner April 23, 2026 11:05
Comment on lines +86 to +104
raise ValueError(
f"Invalid output_mode '{self.output_mode}'. "
f"Must be one of: {sorted(_VALID_OUTPUT_MODES)}"
)

if self.merge_scores and self.output_mode == "replaced":
raise ValueError(
"merge_scores=True requires output_mode in {'raw','both'}. "
"Got output_mode='replaced'. Set output_mode='both' explicitly."
)

if self.merge_scores and not self.enable_faith_eval:
logger.warning(
"merge_scores=True but enable_faith_eval=False; "
"score merging will be skipped"
)

super().__init__()
self.stages = self._build_stages()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ValueError when FAITH eval is requested with a non-LLM backend

When backend_type is "google", "aws", or "nmt", client is typically None. If enable_faith_eval=True is also set, _build_stages() passes client=self.client (i.e. None) to FaithEvalFilter, which immediately raises ValueError("FaithEvalFilter requires a non-None 'client'") at construction time. The warning guard on line 100 only fires when backend_type == "llm" and client is None for segment-level FAITH — non-LLM backends + FAITH produce no helpful warning before the crash.

A preflight check in __post_init__ would give a clearer message, e.g.:

if self.enable_faith_eval and self.client is None:
    raise ValueError(
        "enable_faith_eval=True requires an AsyncLLMClient for scoring. "
        "Provide 'client=' even when using a non-LLM translation backend."
    )

Copy link
Copy Markdown
Contributor

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a lot of new logic to support translation features. I've left a few comments

from nemo_curator.stages.text.translation.backends._retry import MAX_RETRIES
from nemo_curator.stages.text.translation.backends.base import TranslationBackend

logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate on why this is needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, not needed. Removed

_MAX_BACKOFF_SECONDS = 60.0


async def retry_with_backoff(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be generalized?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being used by both aws and google backends. Do we want this for AsyncLLMClient as well?

from ._retry import retry_with_backoff


class TranslationBackend(ABC):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO for later: See if some of this can be extended as an adapter to asynchronous llm client or Async TranslateClient

Comment on lines +93 to +94
source_lang: str = "en"
target_lang: str = "zh"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you prefer these defaults, or enforce the user to set these?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Updated code to get users to input these.


name: str = "FaithEvalFilter"
client: AsyncLLMClient | None = None
model_name: str = ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be required when the class is created rather than an empty string.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Added explicit checks in post init. It now raises clear error message.

except ImportError as exc: # pragma: no cover - optional dependency
raise ImportError(
"sacrebleu is required for round-trip quality metrics. "
"Install with: pip install sacrebleu"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uv pip install sacrebleu.

Due to dependency conflicts in the project pip might fail to resolve in many cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, updated the files



@dataclass(kw_only=True)
class TranslationStage(CompositeStage[DocumentBatch, DocumentBatch]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename this class?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already renamed from TranslationPipeline to TranslationStage. Any other suggestion?

min_segment_chars: int = 0

client: AsyncLLMClient | None = None
model_name: str = ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow an empty string here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Added explicit checks

Comment thread pyproject.toml Outdated
# Translation
"aiohttp>=3.13.3",
"boto3>=1.35",
"google-cloud-translate",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this optional since only one codepath depends on this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, made these changes

Comment thread pyproject.toml Outdated
"iso639-lang>=2.6.0",
"requests",
"sacrebleu>=2.6.0",
"spacy>=3.8,<3.9",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is space <3.9 a hard requirement? In the past space has caused a few conflicts w.r.t cuda versions etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Havent tested with later spacy, so let's keep this for now. Will do some more tests to update this.

Copy link
Copy Markdown
Contributor Author

@rkalaniNV rkalaniNV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed new commit

from nemo_curator.stages.text.translation.backends._retry import MAX_RETRIES
from nemo_curator.stages.text.translation.backends.base import TranslationBackend

logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, not needed. Removed

_MAX_BACKOFF_SECONDS = 60.0


async def retry_with_backoff(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being used by both aws and google backends. Do we want this for AsyncLLMClient as well?


name: str = "FaithEvalFilter"
client: AsyncLLMClient | None = None
model_name: str = ""
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Added explicit checks in post init. It now raises clear error message.

Comment on lines +93 to +94
source_lang: str = "en"
target_lang: str = "zh"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Updated code to get users to input these.


def process(self, batch: DocumentBatch) -> DocumentBatch:
"""Score each translation row and filter rows below threshold."""
df = batch.to_pandas().copy()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not by pyarrow but just to be safe for pandas dataframe.



@dataclass
class TextQualityMetricStage(ProcessingStage[DocumentBatch, DocumentBatch]):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specific to translation. Given the OriginalText and BackTranslatedText we compute these metrics.



@dataclass(kw_only=True)
class TranslationStage(CompositeStage[DocumentBatch, DocumentBatch]):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already renamed from TranslationPipeline to TranslationStage. Any other suggestion?

min_segment_chars: int = 0

client: AsyncLLMClient | None = None
model_name: str = ""
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Added explicit checks

Comment thread pyproject.toml Outdated
# Translation
"aiohttp>=3.13.3",
"boto3>=1.35",
"google-cloud-translate",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, made these changes

Comment thread pyproject.toml Outdated
"iso639-lang>=2.6.0",
"requests",
"sacrebleu>=2.6.0",
"spacy>=3.8,<3.9",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Havent tested with later spacy, so let's keep this for now. Will do some more tests to update this.

Comment on lines +333 to +338
for idx, seg in enumerate(segments):
if not seg or not seg.strip():
continue
if not is_line_translatable_content(seg):
continue
start = time.time()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Fallback loop silently drops non-translatable content

When the bulk translation call fails and the code falls to the per-segment fallback, continue is reached for non-translatable segments (JSON blobs, XML tags, etc.) without writing the original text back to translated[idx]. Because translated was initialized to [""] * len(segments) and the passthrough assignment translated[idx] = seg only happens in the batch-path loop (lines 299–301), non-translatable content is silently replaced with "" in the fallback path.

    if not is_line_translatable_content(seg):
        translated[idx] = seg  # preserve passthrough in fallback too
        continue

Comment on lines +196 to +200
nonempty_mask = df.apply(
lambda row: bool(str(row.get(self.source_text_field, "")).strip())
or bool(str(row.get(self.translated_text_field, "")).strip()),
axis=1,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 or mask scores rows with failed/empty translations

nonempty_mask uses or, so any row where the source text is non-empty (but translation is "" due to a failure) is still sent to the LLM for FAITH scoring. The LLM evaluates a real source against an empty string, returns a near-zero score, and the row is filtered by the threshold — silently discarding it for the wrong reason (FAITH failure rather than translation error). The fix is to require both fields to be non-empty:

Suggested change
nonempty_mask = df.apply(
lambda row: bool(str(row.get(self.source_text_field, "")).strip())
or bool(str(row.get(self.translated_text_field, "")).strip()),
axis=1,
)
nonempty_mask = df.apply(
lambda row: bool(str(row.get(self.source_text_field, "")).strip())
and bool(str(row.get(self.translated_text_field, "")).strip()),
axis=1,
)

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels Apr 25, 2026
@mikemckiernan mikemckiernan mentioned this pull request Apr 30, 2026
3 tasks
mikemckiernan added a commit to mikemckiernan/Curator that referenced this pull request Apr 30, 2026
Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
Copy link
Copy Markdown

@mikemckiernan mikemckiernan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raunak, actually, this repo covers the docs in the /fern directory. I made a proposed PR with these updates in #1903. You are welcome to make similar changes and we can close 1903, or we can use 1903 (likely with changes that I don't know to make).

Whatever works best for everyone.

pipeline.add_stage(
TranslationStage(
client=client,
model_name="openai/openai/gpt-5.1",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it OK to change this to openai/gpt-oss-120b? I can't find gpt-5.1.

mikemckiernan added a commit to mikemckiernan/Curator that referenced this pull request Apr 30, 2026
Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 1, 2026
rkalaniNV added 6 commits May 1, 2026 15:29
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 5, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Signed-off-by: rkalani <rkalani@nvidia.com>
Comment on lines +249 to +263
def _write_multi_field_payload(
self,
out_row: dict[str, Any],
reassembled_by_path: dict[str, list[str]],
translation_map: dict[str, Any],
) -> None:
"""Write reassembled multi-field output payload into ``out_row``."""
output_payload: Any = None
for field_path, texts in reassembled_by_path.items():
output_payload = self._write_one_field_payload(out_row, field_path, texts)

if len(reassembled_by_path) == 1 and output_payload is not None:
out_row[self.output_field] = output_payload
else:
out_row[self.output_field] = translation_map
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Empty output field set to {} instead of ""

When a document has no extractable text in any configured field path (all source fields are empty strings or missing), _collect_multi_field_outputs returns an empty reassembled_by_path = {} and an empty translation_map = {}. In _write_multi_field_payload, because len(reassembled_by_path) != 1, it falls into the else branch and sets out_row[self.output_field] = {} (an empty dict). Downstream stages like FormatTranslationOutputStage._build_translated_messages receive this dict as translated_text, call str({}) to get "{}", and inject the string literal "{}" into every message's content field rather than an empty string.

Suggested change
def _write_multi_field_payload(
self,
out_row: dict[str, Any],
reassembled_by_path: dict[str, list[str]],
translation_map: dict[str, Any],
) -> None:
"""Write reassembled multi-field output payload into ``out_row``."""
output_payload: Any = None
for field_path, texts in reassembled_by_path.items():
output_payload = self._write_one_field_payload(out_row, field_path, texts)
if len(reassembled_by_path) == 1 and output_payload is not None:
out_row[self.output_field] = output_payload
else:
out_row[self.output_field] = translation_map
if not reassembled_by_path:
out_row[self.output_field] = ""
elif len(reassembled_by_path) == 1 and output_payload is not None:
out_row[self.output_field] = output_payload
else:
out_row[self.output_field] = translation_map

Signed-off-by: rkalani <rkalani@nvidia.com>
Comment on lines +1 to +10
---
description: "Translate flat and structured text fields with Curator's experimental translation pipeline, quality scoring, and backend integrations"
categories: ["how-to-guides"]
tags: ["translation", "experimental", "multilingual", "faith", "text-quality", "llm", "nmt"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 PR description usage example has wrong import path, wrong class name, and a non-existent parameter

The PR description shows from nemo_curator.stages.text.translation import TranslationPipeline with preserve_segment_pairs=True. Three problems: (1) the module is nemo_curator.stages.text.experimental.translation, not nemo_curator.stages.text.translation; (2) the public class is TranslationStage, not TranslationPipeline; (3) preserve_segment_pairs is not a valid parameter on TranslationStage. The documentation files themselves are correct — this discrepancy is in the PR description only.

Signed-off-by: rkalani <rkalani@nvidia.com>
@ayushdg
Copy link
Copy Markdown
Contributor

ayushdg commented May 6, 2026

/ok to test 6e32a16

@ayushdg
Copy link
Copy Markdown
Contributor

ayushdg commented May 6, 2026

/ok to test 823c707

Comment on lines +231 to +236
if wildcard_path:
translation_map.setdefault(field_key, []).append(reassembled)
segmented_map.setdefault(field_key, []).extend(current_pairs)
else:
translation_map[field_key] = reassembled
segmented_map[field_key] = current_pairs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Leaf-key collision silently drops translations for same-leaf multi-path configs

When text_field contains two non-wildcard paths that share the same trailing component (e.g. ["user.content", "assistant.content"]), _leaf_field_key returns "content" for both. The second path's translation_map[field_key] = reassembled call overwrites the first, and segmented_map[field_key] = current_pairs does the same. When there are multiple field paths in reassembled_by_path, out_row[self.output_field] is set to this incomplete translation_map, silently discarding every path except the last one that shares a leaf name. The _translation_map metadata column is equally affected.

Suggested change
if wildcard_path:
translation_map.setdefault(field_key, []).append(reassembled)
segmented_map.setdefault(field_key, []).extend(current_pairs)
else:
translation_map[field_key] = reassembled
segmented_map[field_key] = current_pairs
if wildcard_path:
translation_map.setdefault(field_key, []).append(reassembled)
segmented_map.setdefault(field_key, []).extend(current_pairs)
else:
# Use the full field_path as key to avoid collisions when
# multiple paths share the same leaf name (e.g. user.content
# and assistant.content both have leaf "content").
translation_map[field_path] = reassembled
segmented_map[field_path] = current_pairs

Comment on lines +244 to +248
nonempty_mask = df.apply(
lambda row: bool(str(row.get(self.source_text_field, "")).strip())
or bool(str(row.get(self.translated_text_field, "")).strip()),
axis=1,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 or mask scores rows where translation silently failed

nonempty_mask uses or, so any row where the source text is non-empty but _translated is "" (a failed translation) is still submitted to the LLM for FAITH scoring. The model evaluates a real source against an empty string and typically returns a near-zero score, causing the row to be dropped by the threshold filter for the wrong reason — appearing as a quality failure rather than a translation error. Using and ensures only rows with both a source and a translation are scored.

Suggested change
nonempty_mask = df.apply(
lambda row: bool(str(row.get(self.source_text_field, "")).strip())
or bool(str(row.get(self.translated_text_field, "")).strip()),
axis=1,
)
nonempty_mask = df.apply(
lambda row: bool(str(row.get(self.source_text_field, "")).strip())
and bool(str(row.get(self.translated_text_field, "")).strip()),
axis=1,
)

Comment on lines +71 to +78
if self.output_mode in ("raw", "both"):
self._build_metadata_column(df)

if self.output_mode == "raw" and self.output_field in df.columns:
df = df.drop(columns=[self.output_field])

if self.reconstruct_messages and self.messages_field in df.columns:
self._build_translated_messages(df)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 output_field column dropped before _build_translated_messages reads it

When output_mode="raw" and reconstruct_messages=True, the output_field column is removed on line 75 before _build_translated_messages is called on line 78. Inside that method, df.iloc[idx].get(self.output_field, "") returns "" for every row because the column no longer exists, so reconstruct_messages_with_translation is called with an empty string and every message ends up with empty translated content. The _build_translated_messages call should be moved before the column drop.

Suggested change
if self.output_mode in ("raw", "both"):
self._build_metadata_column(df)
if self.output_mode == "raw" and self.output_field in df.columns:
df = df.drop(columns=[self.output_field])
if self.reconstruct_messages and self.messages_field in df.columns:
self._build_translated_messages(df)
if self.output_mode in ("raw", "both"):
self._build_metadata_column(df)
if self.reconstruct_messages and self.messages_field in df.columns:
self._build_translated_messages(df)
if self.output_mode == "raw" and self.output_field in df.columns:
df = df.drop(columns=[self.output_field])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants