Add translation by rkalaniNV · Pull Request #1844 · NVIDIA-NeMo/Curator

rkalaniNV · 2026-04-21T16:10:11Z

Description

This PR adds and stabilizes the text translation package in Curator compared to main.

It introduces a reusable translation workflow centered on TranslationPipeline, with explicit stages for segmentation, translation, reassembly, formatting, and
evaluation. The package now supports structured and nested inputs, including wildcard field paths such as messages.*.content, and can emit translated output in
replaced, raw, or both modes.

The branch also adds reusable translation backends and helpers:

backend adapters for LLM, Google, AWS, and NMT-based translation
prompt-loading, async, metadata, and field-path utilities
FAITH-based translation quality evaluation
generic text quality metrics via TextQualityMetricStage / compute_text_quality_metric

In addition to adding the package, this branch simplifies and cleans up the implementation:

reorganizes the package into translation/stages, translation/evaluation, and translation/utils
reduces pipeline.py to pipeline composition only
removes redundant round-trip/backtranslation orchestration from the public API
keeps backtranslation as normal translation with reversed source/target languages, while exposing metrics as reusable primitives
improves structured reassembly and Speaker-style translation metadata for multi-field and wildcard cases
preserves valid JSON objects/arrays and tool payloads as non-translatable content during segmentation/translation
makes short-text passthrough an explicit min_segment_chars setting instead of hidden cutoff logic
improves non-LLM backend behavior by attempting batch translation first and falling back per-segment only on failure

This branch also updates the translation test suite to cover the new module layout and behavior, including segmentation, reassembly, metadata emission, prompt
loading, and backend translation paths.

Usage

from nemo_curator.stages.text.translation import TranslationPipeline

pipeline = TranslationPipeline(
source_lang="en",
target_lang="hi",
text_field="messages.*.content",
output_mode="both",
preserve_segment_pairs=True,
enable_faith_eval=True,
model_name="openai/openai/gpt-5.1",
)

Checklist

I am familiar with the Contributing Guide (https://github.com/NVIDIA-NeMo/Curator/blob/main/CONTRIBUTING.md).
New or Existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-04-21T16:10:15Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-04-21T16:14:51Z

Greptile Summary

This PR introduces a complete experimental translation package for NeMo Curator, organized into segmentation, translation, reassembly, formatting, and FAITH-based evaluation stages, with backends for LLM, Google, AWS, and NMT translation. Several previously-flagged issues (shared mutable stage state, stale aiohttp session, missing fallback passthrough, asyncio.run nesting) have been addressed in this revision.

FormatTranslationOutputStage drops output_field before _build_translated_messages reads it, producing empty translated content in raw+reconstruct_messages mode.
FaithEvalFilter._score_batch uses or in nonempty_mask, scoring rows where translation silently failed against an empty string and discarding them for the wrong reason.
ReassemblyStage._collect_multi_field_outputs keys translation_map by the leaf path component only, causing silent data loss when multiple field paths share the same leaf name (e.g. user.content + assistant.content).

Confidence Score: 3/5

Three independently-triggered data-correctness bugs are present in the core output path; merging without fixing them would produce silently wrong translated content in specific but realistic configurations.

The output_field column is dropped before _build_translated_messages reads it, so the raw+reconstruct_messages combination always produces empty translated messages. The FAITH or mask causes failed translations to be scored against empty strings and filtered out for the wrong reason. The leaf-key collision in translation_map silently discards all but the last same-leaf path's translation.

stages/reassembly.py, evaluation/faith.py, and stages/format_translation_output.py each contain one of the three data-correctness issues and warrant careful attention before merge.

Important Files Changed

Filename	Overview
nemo_curator/stages/text/experimental/translation/pipeline.py	Composite translation stage with proper validation guards; `generation_config` is not forwarded to `FaithEvalFilter`.
nemo_curator/stages/text/experimental/translation/stages/reassembly.py	Reassembles translated segments; `_leaf_field_key` returns only the last path component, causing silent key collision when multiple non-wildcard paths share the same leaf name.
nemo_curator/stages/text/experimental/translation/evaluation/faith.py	FAITH scoring stage; `nonempty_mask` uses `or` so rows with a non-empty source but empty translation are still scored against an empty string.
nemo_curator/stages/text/experimental/translation/stages/format_translation_output.py	Output formatting stage; drops `output_field` column before `_build_translated_messages` reads it, yielding empty translated content in `raw`+`reconstruct_messages` mode.
nemo_curator/stages/text/experimental/translation/stages/translate.py	Segment translation stage with LLM and backend paths; correctly handles passthrough and fallback translation.
nemo_curator/stages/text/experimental/translation/stages/segmentation.py	Segmentation stage with coarse and fine modes; spaCy cache correctly keyed by (model_name, max_length) to avoid cross-caller mutation.
nemo_curator/stages/text/experimental/translation/backends/nmt.py	NMT backend; stale-session-across-loops issue addressed via `_session_loop` tracking.
nemo_curator/stages/text/experimental/translation/utils/field_paths.py	Wildcard dot-path utilities; terminal-wildcard write-back now correctly handled via `_set_list_values`.

Sequence Diagram

sequenceDiagram
    participant Input as DocumentBatch
    participant Skip as SkipExistingTranslationsStage
    participant Seg as SegmentationStage
    participant Trans as SegmentTranslationStage
    participant Faith as FaithEvalFilter
    participant Reassembly as ReassemblyStage
    participant FaithFilter as FaithThresholdFilterStage
    participant Restore as RestoreSkippedRowsStage
    participant Format as FormatTranslationOutputStage
    participant Merge as MergeFaithScoresStage

    Input->>Skip: "batch (skip_translated=True)"
    Skip-->>Skip: stash already-translated rows in _metadata
    Skip->>Seg: untranslated rows
    Seg->>Seg: explode rows into segments
    Seg->>Trans: exploded batch
    Trans->>Trans: translate each segment
    Trans->>Faith: segments + translations
    Faith->>Faith: score each segment
    Faith->>Reassembly: scored segments
    Reassembly->>Reassembly: group by _seg_doc_id, reassemble
    Reassembly->>FaithFilter: reassembled docs
    FaithFilter->>Restore: filtered docs
    Restore->>Restore: merge stashed rows back
    Restore->>Format: merged batch
    Format->>Format: build metadata / reconstruct messages
    Format->>Merge: formatted batch
    Merge->>Merge: merge faith scores into metadata
    Merge-->>Input: final DocumentBatch

_{Reviews (15): Last reviewed commit: "Merge branch 'main' into rkalani/transla..." | Re-trigger Greptile}

greptile-apps · 2026-04-21T16:14:55Z

+        if key == "*":
+            if isinstance(obj, list):
+                for item in obj:
+                    if not rest:
+                        pass
+                    else:
+                        _set(item, rest)
+            return


set_nested_fields silently drops values at terminal-wildcard positions

When a path ends with * (i.e. rest is empty after consuming the wildcard key), the code hits pass and never writes the translated value back to the list item. Meanwhile, extract_nested_fields does collect these string items, so the read/write contract is broken: extraction finds the texts, translation runs, but reassembly silently discards the result. Any field path of the form "messages.*" (where messages is a list of strings) would produce an empty translation output.

if key == "*": if isinstance(obj, list): for i, item in enumerate(obj): if not rest: # Write back translated values to the list in-place if isinstance(item, str) and value_index[0] < len(values): obj[i] = values[value_index[0]] value_index[0] += 1 else: _set(item, rest) return

greptile-apps · 2026-04-21T16:14:57Z

+            model_name = SPACY_FALLBACK_MODEL
+            nlp = spacy.load(model_name)
+
+        nlp.max_length = 10_000_000  # Handle very long texts (10M chars)


Global spaCy model cache mutated at load time

nlp.max_length = 10_000_000 permanently modifies the model stored in _nlp_cache. Because the cache is a module-level dict shared for the lifetime of the process, the first call to _get_spacy_nlp silently changes the default max_length for every subsequent user of that model in the same process — including any code that imported spaCy independently and expected the default 1 000 000 limit. If a very large text is passed, the model processes it without raising, which may cause excessive memory consumption in unexpected places.

Consider applying max_length only at the call site or documenting this as a known process-wide side effect.

greptile-apps · 2026-04-21T16:14:58Z

+        return asyncio.run(
+            self.translate_batch_async(texts, source_lang, target_lang)
+        )


translate_batch calls asyncio.run() which fails inside a running event loop

translate_batch delegates directly to asyncio.run(self.translate_batch_async(...)). asyncio.run() raises RuntimeError: This event loop is already running when called from within an existing event loop (e.g., a Jupyter notebook, a Ray async actor, or anywhere the pipeline's run_async_safe helper is in effect). The same pattern appears in google.py and nmt.py.

For built-in backends the pipeline always calls translate_batch_async via run_async_safe, so the sync path isn't exercised internally. But the translate_batch method is part of the public interface, and any direct caller in an async context will hit this error. Using run_async_safe from async_utils.py here would make the sync wrapper safe in all contexts.

jgerh

Completed a tech pubs review of .md files and provided a few copyedits.

jgerh · 2026-04-21T17:01:19Z

+- valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim
+- when `output_mode="replaced"`, translated values are written back into the original field path
+- when `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings


Suggested change

- valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim

- when `output_mode="replaced"`, translated values are written back into the original field path

- when `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings

- Valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim

- When `output_mode="replaced"`, translated values are written back into the original field path

- When `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings

jgerh · 2026-04-21T17:02:32Z

+
+- `segmentation_mode="coarse"` keeps line-level splitting with code-block awareness
+- `segmentation_mode="fine"` uses sentence-level segmentation with structure preservation
+- `min_segment_chars` lets you explicitly bypass segmentation for short text instead of relying on a hidden cutoff


Suggested change

- `min_segment_chars` lets you explicitly bypass segmentation for short text instead of relying on a hidden cutoff

- `min_segment_chars` lets you explicitly bypass segmentation for short text rather than relying on an implicit cutoff

jgerh · 2026-04-21T17:02:59Z

+
+### Round-Trip Metrics
+
+Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`:


Suggested change

Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`:

Backtranslation uses a second translation pass with reversed languages, followed by `TextQualityMetricStage`:

jgerh · 2026-04-21T17:04:23Z

+
+## Notes
+
+- The translation package is designed for pipeline execution. Avoid converting large datasets to Pandas on the driver just to orchestrate translation.


Suggested change

- The translation package is designed for pipeline execution. Avoid converting large datasets to Pandas on the driver just to orchestrate translation.

- The translation package is designed for pipeline execution. Avoid converting large datasets to pandas on the driver just to orchestrate translation.

jgerh · 2026-04-21T17:06:19Z

Suggested change

## How It Works

jgerh · 2026-04-21T17:13:28Z

Suggested change

## How It Works

jgerh · 2026-04-21T17:26:07Z

Suggested change

**Set your Hugging Face token** to avoid rate limiting when downloading models or datasets:

jgerh · 2026-04-21T17:29:21Z

Suggested change

## How It Works

jgerh · 2026-04-21T17:29:43Z

Suggested change

## How It Works

greptile-apps · 2026-04-23T11:12:18Z

+            raise ValueError(
+                f"Invalid output_mode '{self.output_mode}'. "
+                f"Must be one of: {sorted(_VALID_OUTPUT_MODES)}"
+            )
+
+        if self.merge_scores and self.output_mode == "replaced":
+            raise ValueError(
+                "merge_scores=True requires output_mode in {'raw','both'}. "
+                "Got output_mode='replaced'. Set output_mode='both' explicitly."
+            )
+
+        if self.merge_scores and not self.enable_faith_eval:
+            logger.warning(
+                "merge_scores=True but enable_faith_eval=False; "
+                "score merging will be skipped"
+            )
+
+        super().__init__()
+        self.stages = self._build_stages()


ValueError when FAITH eval is requested with a non-LLM backend

When backend_type is "google", "aws", or "nmt", client is typically None. If enable_faith_eval=True is also set, _build_stages() passes client=self.client (i.e. None) to FaithEvalFilter, which immediately raises ValueError("FaithEvalFilter requires a non-None 'client'") at construction time. The warning guard on line 100 only fires when backend_type == "llm" and client is None for segment-level FAITH — non-LLM backends + FAITH produce no helpful warning before the crash.

A preflight check in __post_init__ would give a clearer message, e.g.:

if self.enable_faith_eval and self.client is None: raise ValueError( "enable_faith_eval=True requires an AsyncLLMClient for scoring. " "Provide 'client=' even when using a non-LLM translation backend." )

ayushdg

Thanks for adding a lot of new logic to support translation features. I've left a few comments

ayushdg · 2026-04-24T15:22:12Z

+from nemo_curator.stages.text.translation.backends._retry import MAX_RETRIES
+from nemo_curator.stages.text.translation.backends.base import TranslationBackend
+
+logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)


can you elaborate on why this is needed?

Yup, not needed. Removed

ayushdg · 2026-04-24T15:35:11Z

+_MAX_BACKOFF_SECONDS = 60.0
+
+
+async def retry_with_backoff(


can this be generalized?

This is being used by both aws and google backends. Do we want this for AsyncLLMClient as well?

ayushdg · 2026-04-24T15:47:28Z

+from ._retry import retry_with_backoff
+
+
+class TranslationBackend(ABC):


TODO for later: See if some of this can be extended as an adapter to asynchronous llm client or Async TranslateClient

ayushdg · 2026-04-24T15:50:23Z

+    source_lang: str = "en"
+    target_lang: str = "zh"


Do you prefer these defaults, or enforce the user to set these?

Yes. Updated code to get users to input these.

ayushdg · 2026-04-24T15:50:46Z

+
+    name: str = "FaithEvalFilter"
+    client: AsyncLLMClient | None = None
+    model_name: str = ""


This should be required when the class is created rather than an empty string.

Yup. Added explicit checks in post init. It now raises clear error message.

ayushdg · 2026-04-24T15:57:40Z

+    except ImportError as exc:  # pragma: no cover - optional dependency
+        raise ImportError(
+            "sacrebleu is required for round-trip quality metrics. "
+            "Install with: pip install sacrebleu"


uv pip install sacrebleu.

Due to dependency conflicts in the project pip might fail to resolve in many cases.

Yup, updated the files

ayushdg · 2026-04-24T16:12:29Z

+
+
+@dataclass(kw_only=True)
+class TranslationStage(CompositeStage[DocumentBatch, DocumentBatch]):


Should we rename this class?

Already renamed from TranslationPipeline to TranslationStage. Any other suggestion?

ayushdg · 2026-04-24T16:12:51Z

+    min_segment_chars: int = 0
+
+    client: AsyncLLMClient | None = None
+    model_name: str = ""


Should we allow an empty string here?

Nope. Added explicit checks

ayushdg · 2026-04-24T16:19:24Z

+    # Translation
+    "aiohttp>=3.13.3",
+    "boto3>=1.35",
+    "google-cloud-translate",


can we make this optional since only one codepath depends on this?

Yes, made these changes

ayushdg · 2026-04-24T16:19:46Z

+    "iso639-lang>=2.6.0",
+    "requests",
+    "sacrebleu>=2.6.0",
+    "spacy>=3.8,<3.9",


Is space <3.9 a hard requirement? In the past space has caused a few conflicts w.r.t cuda versions etc.

Havent tested with later spacy, so let's keep this for now. Will do some more tests to update this.

rkalaniNV

Pushed new commit

rkalaniNV · 2026-04-25T05:19:18Z

+from nemo_curator.stages.text.translation.backends._retry import MAX_RETRIES
+from nemo_curator.stages.text.translation.backends.base import TranslationBackend
+
+logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)


Yup, not needed. Removed

rkalaniNV · 2026-04-25T05:20:03Z

+_MAX_BACKOFF_SECONDS = 60.0
+
+
+async def retry_with_backoff(


This is being used by both aws and google backends. Do we want this for AsyncLLMClient as well?

rkalaniNV · 2026-04-25T05:20:57Z

+
+    name: str = "FaithEvalFilter"
+    client: AsyncLLMClient | None = None
+    model_name: str = ""


Yup. Added explicit checks in post init. It now raises clear error message.

rkalaniNV · 2026-04-25T05:21:29Z

+    source_lang: str = "en"
+    target_lang: str = "zh"


Yes. Updated code to get users to input these.

rkalaniNV · 2026-04-25T05:22:04Z

+
+    def process(self, batch: DocumentBatch) -> DocumentBatch:
+        """Score each translation row and filter rows below threshold."""
+        df = batch.to_pandas().copy()


Not by pyarrow but just to be safe for pandas dataframe.

rkalaniNV · 2026-04-25T05:23:49Z

+
+
+@dataclass
+class TextQualityMetricStage(ProcessingStage[DocumentBatch, DocumentBatch]):


Specific to translation. Given the OriginalText and BackTranslatedText we compute these metrics.

rkalaniNV · 2026-04-25T05:24:31Z

+
+
+@dataclass(kw_only=True)
+class TranslationStage(CompositeStage[DocumentBatch, DocumentBatch]):


Already renamed from TranslationPipeline to TranslationStage. Any other suggestion?

rkalaniNV · 2026-04-25T05:24:52Z

+    min_segment_chars: int = 0
+
+    client: AsyncLLMClient | None = None
+    model_name: str = ""


Nope. Added explicit checks

rkalaniNV · 2026-04-25T05:25:10Z

+    # Translation
+    "aiohttp>=3.13.3",
+    "boto3>=1.35",
+    "google-cloud-translate",


Yes, made these changes

rkalaniNV · 2026-04-25T05:25:54Z

+    "iso639-lang>=2.6.0",
+    "requests",
+    "sacrebleu>=2.6.0",
+    "spacy>=3.8,<3.9",


Havent tested with later spacy, so let's keep this for now. Will do some more tests to update this.

greptile-apps · 2026-04-25T06:00:58Z

+        for idx, seg in enumerate(segments):
+            if not seg or not seg.strip():
+                continue
+            if not is_line_translatable_content(seg):
+                continue
+            start = time.time()


Fallback loop silently drops non-translatable content

When the bulk translation call fails and the code falls to the per-segment fallback, continue is reached for non-translatable segments (JSON blobs, XML tags, etc.) without writing the original text back to translated[idx]. Because translated was initialized to [""] * len(segments) and the passthrough assignment translated[idx] = seg only happens in the batch-path loop (lines 299–301), non-translatable content is silently replaced with "" in the fallback path.

if not is_line_translatable_content(seg): translated[idx] = seg # preserve passthrough in fallback too continue

greptile-apps · 2026-04-25T06:27:32Z

+        nonempty_mask = df.apply(
+            lambda row: bool(str(row.get(self.source_text_field, "")).strip())
+            or bool(str(row.get(self.translated_text_field, "")).strip()),
+            axis=1,
+        )


or mask scores rows with failed/empty translations

nonempty_mask uses or, so any row where the source text is non-empty (but translation is "" due to a failure) is still sent to the LLM for FAITH scoring. The LLM evaluates a real source against an empty string, returns a near-zero score, and the row is filtered by the threshold — silently discarding it for the wrong reason (FAITH failure rather than translation error). The fix is to require both fields to be non-empty:

Suggested change

nonempty_mask = df.apply(

lambda row: bool(str(row.get(self.source_text_field, "")).strip())

or bool(str(row.get(self.translated_text_field, "")).strip()),

axis=1,

)

nonempty_mask = df.apply(

lambda row: bool(str(row.get(self.source_text_field, "")).strip())

and bool(str(row.get(self.translated_text_field, "")).strip()),

axis=1,

)

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

mikemckiernan

Raunak, actually, this repo covers the docs in the /fern directory. I made a proposed PR with these updates in #1903. You are welcome to make similar changes and we can close 1903, or we can use 1903 (likely with changes that I don't know to make).

Whatever works best for everyone.

mikemckiernan · 2026-04-30T20:04:08Z

+pipeline.add_stage(
+    TranslationStage(
+        client=client,
+        model_name="openai/openai/gpt-5.1",


is it OK to change this to openai/gpt-oss-120b? I can't find gpt-5.1.

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Signed-off-by: rkalani <rkalani@nvidia.com>

greptile-apps · 2026-05-05T19:03:14Z

Want your agent to iterate on Greptile's feedback? Try greploops.

Signed-off-by: rkalani <rkalani@nvidia.com>

greptile-apps · 2026-05-06T09:46:02Z

+    def _write_multi_field_payload(
+        self,
+        out_row: dict[str, Any],
+        reassembled_by_path: dict[str, list[str]],
+        translation_map: dict[str, Any],
+    ) -> None:
+        """Write reassembled multi-field output payload into ``out_row``."""
+        output_payload: Any = None
+        for field_path, texts in reassembled_by_path.items():
+            output_payload = self._write_one_field_payload(out_row, field_path, texts)
+
+        if len(reassembled_by_path) == 1 and output_payload is not None:
+            out_row[self.output_field] = output_payload
+        else:
+            out_row[self.output_field] = translation_map


Empty output field set to {} instead of ""

When a document has no extractable text in any configured field path (all source fields are empty strings or missing), _collect_multi_field_outputs returns an empty reassembled_by_path = {} and an empty translation_map = {}. In _write_multi_field_payload, because len(reassembled_by_path) != 1, it falls into the else branch and sets out_row[self.output_field] = {} (an empty dict). Downstream stages like FormatTranslationOutputStage._build_translated_messages receive this dict as translated_text, call str({}) to get "{}", and inject the string literal "{}" into every message's content field rather than an empty string.

Suggested change

def _write_multi_field_payload(

self,

out_row: dict[str, Any],

reassembled_by_path: dict[str, list[str]],

translation_map: dict[str, Any],

) -> None:

"""Write reassembled multi-field output payload into ``out_row``."""

output_payload: Any = None

for field_path, texts in reassembled_by_path.items():

output_payload = self._write_one_field_payload(out_row, field_path, texts)

if len(reassembled_by_path) == 1 and output_payload is not None:

out_row[self.output_field] = output_payload

else:

out_row[self.output_field] = translation_map

if not reassembled_by_path:

out_row[self.output_field] = ""

elif len(reassembled_by_path) == 1 and output_payload is not None:

out_row[self.output_field] = output_payload

else:

out_row[self.output_field] = translation_map

Signed-off-by: rkalani <rkalani@nvidia.com>

greptile-apps · 2026-05-06T10:40:18Z

+---
+description: "Translate flat and structured text fields with Curator's experimental translation pipeline, quality scoring, and backend integrations"
+categories: ["how-to-guides"]
+tags: ["translation", "experimental", "multilingual", "faith", "text-quality", "llm", "nmt"]
+personas: ["data-scientist-focused", "mle-focused"]
+difficulty: "intermediate"
+content_type: "how-to"
+modality: "text-only"
+---
+


PR description usage example has wrong import path, wrong class name, and a non-existent parameter

The PR description shows from nemo_curator.stages.text.translation import TranslationPipeline with preserve_segment_pairs=True. Three problems: (1) the module is nemo_curator.stages.text.experimental.translation, not nemo_curator.stages.text.translation; (2) the public class is TranslationStage, not TranslationPipeline; (3) preserve_segment_pairs is not a valid parameter on TranslationStage. The documentation files themselves are correct — this discrepancy is in the PR description only.

Signed-off-by: rkalani <rkalani@nvidia.com>

ayushdg · 2026-05-06T13:13:47Z

/ok to test 6e32a16

ayushdg · 2026-05-06T21:04:40Z

/ok to test 823c707

greptile-apps · 2026-05-06T21:16:31Z

+            if wildcard_path:
+                translation_map.setdefault(field_key, []).append(reassembled)
+                segmented_map.setdefault(field_key, []).extend(current_pairs)
+            else:
+                translation_map[field_key] = reassembled
+                segmented_map[field_key] = current_pairs


Leaf-key collision silently drops translations for same-leaf multi-path configs

When text_field contains two non-wildcard paths that share the same trailing component (e.g. ["user.content", "assistant.content"]), _leaf_field_key returns "content" for both. The second path's translation_map[field_key] = reassembled call overwrites the first, and segmented_map[field_key] = current_pairs does the same. When there are multiple field paths in reassembled_by_path, out_row[self.output_field] is set to this incomplete translation_map, silently discarding every path except the last one that shares a leaf name. The _translation_map metadata column is equally affected.

Suggested change

if wildcard_path:

translation_map.setdefault(field_key, []).append(reassembled)

segmented_map.setdefault(field_key, []).extend(current_pairs)

else:

translation_map[field_key] = reassembled

segmented_map[field_key] = current_pairs

if wildcard_path:

translation_map.setdefault(field_key, []).append(reassembled)

segmented_map.setdefault(field_key, []).extend(current_pairs)

else:

# Use the full field_path as key to avoid collisions when

# multiple paths share the same leaf name (e.g. user.content

# and assistant.content both have leaf "content").

translation_map[field_path] = reassembled

segmented_map[field_path] = current_pairs

greptile-apps · 2026-05-06T21:16:33Z

+        nonempty_mask = df.apply(
+            lambda row: bool(str(row.get(self.source_text_field, "")).strip())
+            or bool(str(row.get(self.translated_text_field, "")).strip()),
+            axis=1,
+        )


or mask scores rows where translation silently failed

nonempty_mask uses or, so any row where the source text is non-empty but _translated is "" (a failed translation) is still submitted to the LLM for FAITH scoring. The model evaluates a real source against an empty string and typically returns a near-zero score, causing the row to be dropped by the threshold filter for the wrong reason — appearing as a quality failure rather than a translation error. Using and ensures only rows with both a source and a translation are scored.

Suggested change

nonempty_mask = df.apply(

lambda row: bool(str(row.get(self.source_text_field, "")).strip())

or bool(str(row.get(self.translated_text_field, "")).strip()),

axis=1,

)

nonempty_mask = df.apply(

lambda row: bool(str(row.get(self.source_text_field, "")).strip())

and bool(str(row.get(self.translated_text_field, "")).strip()),

axis=1,

)

greptile-apps · 2026-05-06T21:16:33Z

+        if self.output_mode in ("raw", "both"):
+            self._build_metadata_column(df)
+
+        if self.output_mode == "raw" and self.output_field in df.columns:
+            df = df.drop(columns=[self.output_field])
+
+        if self.reconstruct_messages and self.messages_field in df.columns:
+            self._build_translated_messages(df)


output_field column dropped before _build_translated_messages reads it

When output_mode="raw" and reconstruct_messages=True, the output_field column is removed on line 75 before _build_translated_messages is called on line 78. Inside that method, df.iloc[idx].get(self.output_field, "") returns "" for every row because the column no longer exists, so reconstruct_messages_with_translation is called with an empty string and every message ends up with empty translated content. The _build_translated_messages call should be moved before the column drop.

Suggested change

if self.output_mode in ("raw", "both"):

self._build_metadata_column(df)

if self.output_mode == "raw" and self.output_field in df.columns:

df = df.drop(columns=[self.output_field])

if self.reconstruct_messages and self.messages_field in df.columns:

self._build_translated_messages(df)

if self.output_mode in ("raw", "both"):

self._build_metadata_column(df)

if self.reconstruct_messages and self.messages_field in df.columns:

self._build_translated_messages(df)

if self.output_mode == "raw" and self.output_field in df.columns:

df = df.drop(columns=[self.output_field])

rkalaniNV requested review from a team as code owners April 21, 2026 16:10

rkalaniNV requested review from praateekmahajan and removed request for a team April 21, 2026 16:10

github-actions Bot added the community-request label Apr 21, 2026

greptile-apps Bot reviewed Apr 21, 2026

View reviewed changes

jgerh reviewed Apr 21, 2026

View reviewed changes

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Apr 21, 2026

rkalaniNV requested a review from a team as a code owner April 23, 2026 11:05

greptile-apps Bot reviewed Apr 23, 2026

View reviewed changes

ayushdg requested changes Apr 24, 2026

View reviewed changes

rkalaniNV commented Apr 25, 2026

View reviewed changes

greptile-apps Bot reviewed Apr 25, 2026

View reviewed changes

svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels Apr 25, 2026

mikemckiernan mentioned this pull request Apr 30, 2026

docs: Fern updates for #1844 #1903

Closed

3 tasks

mikemckiernan added a commit to mikemckiernan/Curator that referenced this pull request Apr 30, 2026

docs: Fern updates for NVIDIA-NeMo#1844

fc64ad7

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

mikemckiernan reviewed Apr 30, 2026

View reviewed changes

mikemckiernan added a commit to mikemckiernan/Curator that referenced this pull request Apr 30, 2026

docs: Fern updates for NVIDIA-NeMo#1844

73621d2

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 1, 2026

rkalaniNV added 6 commits May 1, 2026 15:29

Add translation

0507f5c

Signed-off-by: rkalani <rkalani@nvidia.com>

Bug Fix

4d6eb61

Signed-off-by: rkalani <rkalani@nvidia.com>

Bug Fixes

686d564

Signed-off-by: rkalani <rkalani@nvidia.com>

Simplify translation pipeline and evaluation helpers

ab60fd7

Signed-off-by: rkalani <rkalani@nvidia.com>

Refactor translation package layout

e8f8591

Signed-off-by: rkalani <rkalani@nvidia.com>

Document translation workflows

50617fd

Signed-off-by: rkalani <rkalani@nvidia.com>

copy-pr-bot Bot temporarily deployed to nemo-ci May 5, 2026 18:51 Inactive

Fix translation Ruff issues

65b7df4

Signed-off-by: rkalani <rkalani@nvidia.com>

greptile-apps Bot reviewed May 6, 2026

View reviewed changes

Remove spaCy upper bound

1370421

Signed-off-by: rkalani <rkalani@nvidia.com>

greptile-apps Bot reviewed May 6, 2026

View reviewed changes

docs: add Fern translation guide

6e32a16

Signed-off-by: rkalani <rkalani@nvidia.com>

copy-pr-bot Bot temporarily deployed to test May 6, 2026 13:14 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 6, 2026 13:14 Error

copy-pr-bot Bot temporarily deployed to nemo-ci May 6, 2026 13:14 Inactive

Merge branch 'main' into rkalani/translate

823c707

greptile-apps Bot reviewed May 6, 2026

View reviewed changes

lbliii mentioned this pull request May 11, 2026

docs: add rolling main version, remove translation from v26.04 #1969

Merged

5 tasks

	- `min_segment_chars` lets you explicitly bypass segmentation for short text instead of relying on a hidden cutoff
	- `min_segment_chars` lets you explicitly bypass segmentation for short text rather than relying on an implicit cutoff


		### Round-Trip Metrics

		Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`:

	Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`:
	Backtranslation uses a second translation pass with reversed languages, followed by `TextQualityMetricStage`:


		## Notes

		- The translation package is designed for pipeline execution. Avoid converting large datasets to Pandas on the driver just to orchestrate translation.


	Set your Hugging Face token to avoid rate limiting when downloading models or datasets:

		from ._retry import retry_with_backoff


		class TranslationBackend(ABC):



		@dataclass(kw_only=True)
		class TranslationStage(CompositeStage[DocumentBatch, DocumentBatch]):



		@dataclass
		class TextQualityMetricStage(ProcessingStage[DocumentBatch, DocumentBatch]):

Conversation

rkalaniNV commented Apr 21, 2026

Description

Usage

Checklist

Uh oh!

copy-pr-bot Bot commented Apr 21, 2026

Uh oh!

greptile-apps Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

jgerh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ayushdg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Apr 21, 2026 •

edited

Loading