Add translation#1844
Conversation
Greptile SummaryThis PR introduces a complete experimental translation package for NeMo Curator, organized into segmentation, translation, reassembly, formatting, and FAITH-based evaluation stages, with backends for LLM, Google, AWS, and NMT translation. Several previously-flagged issues (shared mutable stage state, stale aiohttp session, missing fallback passthrough, asyncio.run nesting) have been addressed in this revision.
Confidence Score: 3/5Three independently-triggered data-correctness bugs are present in the core output path; merging without fixing them would produce silently wrong translated content in specific but realistic configurations. The
Important Files Changed
Sequence DiagramsequenceDiagram
participant Input as DocumentBatch
participant Skip as SkipExistingTranslationsStage
participant Seg as SegmentationStage
participant Trans as SegmentTranslationStage
participant Faith as FaithEvalFilter
participant Reassembly as ReassemblyStage
participant FaithFilter as FaithThresholdFilterStage
participant Restore as RestoreSkippedRowsStage
participant Format as FormatTranslationOutputStage
participant Merge as MergeFaithScoresStage
Input->>Skip: "batch (skip_translated=True)"
Skip-->>Skip: stash already-translated rows in _metadata
Skip->>Seg: untranslated rows
Seg->>Seg: explode rows into segments
Seg->>Trans: exploded batch
Trans->>Trans: translate each segment
Trans->>Faith: segments + translations
Faith->>Faith: score each segment
Faith->>Reassembly: scored segments
Reassembly->>Reassembly: group by _seg_doc_id, reassemble
Reassembly->>FaithFilter: reassembled docs
FaithFilter->>Restore: filtered docs
Restore->>Restore: merge stashed rows back
Restore->>Format: merged batch
Format->>Format: build metadata / reconstruct messages
Format->>Merge: formatted batch
Merge->>Merge: merge faith scores into metadata
Merge-->>Input: final DocumentBatch
Reviews (15): Last reviewed commit: "Merge branch 'main' into rkalani/transla..." | Re-trigger Greptile |
| if key == "*": | ||
| if isinstance(obj, list): | ||
| for item in obj: | ||
| if not rest: | ||
| pass | ||
| else: | ||
| _set(item, rest) | ||
| return |
There was a problem hiding this comment.
set_nested_fields silently drops values at terminal-wildcard positions
When a path ends with * (i.e. rest is empty after consuming the wildcard key), the code hits pass and never writes the translated value back to the list item. Meanwhile, extract_nested_fields does collect these string items, so the read/write contract is broken: extraction finds the texts, translation runs, but reassembly silently discards the result. Any field path of the form "messages.*" (where messages is a list of strings) would produce an empty translation output.
if key == "*":
if isinstance(obj, list):
for i, item in enumerate(obj):
if not rest:
# Write back translated values to the list in-place
if isinstance(item, str) and value_index[0] < len(values):
obj[i] = values[value_index[0]]
value_index[0] += 1
else:
_set(item, rest)
return| model_name = SPACY_FALLBACK_MODEL | ||
| nlp = spacy.load(model_name) | ||
|
|
||
| nlp.max_length = 10_000_000 # Handle very long texts (10M chars) |
There was a problem hiding this comment.
Global spaCy model cache mutated at load time
nlp.max_length = 10_000_000 permanently modifies the model stored in _nlp_cache. Because the cache is a module-level dict shared for the lifetime of the process, the first call to _get_spacy_nlp silently changes the default max_length for every subsequent user of that model in the same process — including any code that imported spaCy independently and expected the default 1 000 000 limit. If a very large text is passed, the model processes it without raising, which may cause excessive memory consumption in unexpected places.
Consider applying max_length only at the call site or documenting this as a known process-wide side effect.
| return asyncio.run( | ||
| self.translate_batch_async(texts, source_lang, target_lang) | ||
| ) |
There was a problem hiding this comment.
translate_batch calls asyncio.run() which fails inside a running event loop
translate_batch delegates directly to asyncio.run(self.translate_batch_async(...)). asyncio.run() raises RuntimeError: This event loop is already running when called from within an existing event loop (e.g., a Jupyter notebook, a Ray async actor, or anywhere the pipeline's run_async_safe helper is in effect). The same pattern appears in google.py and nmt.py.
For built-in backends the pipeline always calls translate_batch_async via run_async_safe, so the sync path isn't exercised internally. But the translate_batch method is part of the public interface, and any direct caller in an async context will hit this error. Using run_async_safe from async_utils.py here would make the sync wrapper safe in all contexts.
jgerh
left a comment
There was a problem hiding this comment.
Completed a tech pubs review of .md files and provided a few copyedits.
| - valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim | ||
| - when `output_mode="replaced"`, translated values are written back into the original field path | ||
| - when `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings |
There was a problem hiding this comment.
| - valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim | |
| - when `output_mode="replaced"`, translated values are written back into the original field path | |
| - when `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings | |
| - Valid JSON objects and arrays are treated as non-translatable content and are preserved verbatim | |
| - When `output_mode="replaced"`, translated values are written back into the original field path | |
| - When `output_mode="raw"` or `output_mode="both"`, Curator also emits translation metadata with whole-text and segmented mappings |
|
|
||
| - `segmentation_mode="coarse"` keeps line-level splitting with code-block awareness | ||
| - `segmentation_mode="fine"` uses sentence-level segmentation with structure preservation | ||
| - `min_segment_chars` lets you explicitly bypass segmentation for short text instead of relying on a hidden cutoff |
There was a problem hiding this comment.
| - `min_segment_chars` lets you explicitly bypass segmentation for short text instead of relying on a hidden cutoff | |
| - `min_segment_chars` lets you explicitly bypass segmentation for short text rather than relying on an implicit cutoff |
|
|
||
| ### Round-Trip Metrics | ||
|
|
||
| Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`: |
There was a problem hiding this comment.
| Backtranslation is composed from a second translation pass with reversed languages, followed by `TextQualityMetricStage`: | |
| Backtranslation uses a second translation pass with reversed languages, followed by `TextQualityMetricStage`: |
|
|
||
| ## Notes | ||
|
|
||
| - The translation package is designed for pipeline execution. Avoid converting large datasets to Pandas on the driver just to orchestrate translation. |
There was a problem hiding this comment.
| - The translation package is designed for pipeline execution. Avoid converting large datasets to Pandas on the driver just to orchestrate translation. | |
| - The translation package is designed for pipeline execution. Avoid converting large datasets to pandas on the driver just to orchestrate translation. |
There was a problem hiding this comment.
| ## How It Works |
| @@ -18,7 +18,7 @@ NeMo Curator provides a comprehensive suite of tools for processing text data as | |||
There was a problem hiding this comment.
| ## How It Works |
There was a problem hiding this comment.
| **Set your Hugging Face token** to avoid rate limiting when downloading models or datasets: |
There was a problem hiding this comment.
| ## How It Works |
| @@ -18,7 +18,7 @@ NeMo Curator provides a comprehensive suite of tools for processing text data as | |||
There was a problem hiding this comment.
| ## How It Works |
| raise ValueError( | ||
| f"Invalid output_mode '{self.output_mode}'. " | ||
| f"Must be one of: {sorted(_VALID_OUTPUT_MODES)}" | ||
| ) | ||
|
|
||
| if self.merge_scores and self.output_mode == "replaced": | ||
| raise ValueError( | ||
| "merge_scores=True requires output_mode in {'raw','both'}. " | ||
| "Got output_mode='replaced'. Set output_mode='both' explicitly." | ||
| ) | ||
|
|
||
| if self.merge_scores and not self.enable_faith_eval: | ||
| logger.warning( | ||
| "merge_scores=True but enable_faith_eval=False; " | ||
| "score merging will be skipped" | ||
| ) | ||
|
|
||
| super().__init__() | ||
| self.stages = self._build_stages() |
There was a problem hiding this comment.
ValueError when FAITH eval is requested with a non-LLM backend
When backend_type is "google", "aws", or "nmt", client is typically None. If enable_faith_eval=True is also set, _build_stages() passes client=self.client (i.e. None) to FaithEvalFilter, which immediately raises ValueError("FaithEvalFilter requires a non-None 'client'") at construction time. The warning guard on line 100 only fires when backend_type == "llm" and client is None for segment-level FAITH — non-LLM backends + FAITH produce no helpful warning before the crash.
A preflight check in __post_init__ would give a clearer message, e.g.:
if self.enable_faith_eval and self.client is None:
raise ValueError(
"enable_faith_eval=True requires an AsyncLLMClient for scoring. "
"Provide 'client=' even when using a non-LLM translation backend."
)
ayushdg
left a comment
There was a problem hiding this comment.
Thanks for adding a lot of new logic to support translation features. I've left a few comments
| from nemo_curator.stages.text.translation.backends._retry import MAX_RETRIES | ||
| from nemo_curator.stages.text.translation.backends.base import TranslationBackend | ||
|
|
||
| logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR) |
There was a problem hiding this comment.
can you elaborate on why this is needed?
There was a problem hiding this comment.
Yup, not needed. Removed
| _MAX_BACKOFF_SECONDS = 60.0 | ||
|
|
||
|
|
||
| async def retry_with_backoff( |
There was a problem hiding this comment.
This is being used by both aws and google backends. Do we want this for AsyncLLMClient as well?
| from ._retry import retry_with_backoff | ||
|
|
||
|
|
||
| class TranslationBackend(ABC): |
There was a problem hiding this comment.
TODO for later: See if some of this can be extended as an adapter to asynchronous llm client or Async TranslateClient
| source_lang: str = "en" | ||
| target_lang: str = "zh" |
There was a problem hiding this comment.
Do you prefer these defaults, or enforce the user to set these?
There was a problem hiding this comment.
Yes. Updated code to get users to input these.
|
|
||
| name: str = "FaithEvalFilter" | ||
| client: AsyncLLMClient | None = None | ||
| model_name: str = "" |
There was a problem hiding this comment.
This should be required when the class is created rather than an empty string.
There was a problem hiding this comment.
Yup. Added explicit checks in post init. It now raises clear error message.
| except ImportError as exc: # pragma: no cover - optional dependency | ||
| raise ImportError( | ||
| "sacrebleu is required for round-trip quality metrics. " | ||
| "Install with: pip install sacrebleu" |
There was a problem hiding this comment.
uv pip install sacrebleu.
Due to dependency conflicts in the project pip might fail to resolve in many cases.
There was a problem hiding this comment.
Yup, updated the files
|
|
||
|
|
||
| @dataclass(kw_only=True) | ||
| class TranslationStage(CompositeStage[DocumentBatch, DocumentBatch]): |
There was a problem hiding this comment.
Should we rename this class?
There was a problem hiding this comment.
Already renamed from TranslationPipeline to TranslationStage. Any other suggestion?
| min_segment_chars: int = 0 | ||
|
|
||
| client: AsyncLLMClient | None = None | ||
| model_name: str = "" |
There was a problem hiding this comment.
Should we allow an empty string here?
There was a problem hiding this comment.
Nope. Added explicit checks
| # Translation | ||
| "aiohttp>=3.13.3", | ||
| "boto3>=1.35", | ||
| "google-cloud-translate", |
There was a problem hiding this comment.
can we make this optional since only one codepath depends on this?
There was a problem hiding this comment.
Yes, made these changes
| "iso639-lang>=2.6.0", | ||
| "requests", | ||
| "sacrebleu>=2.6.0", | ||
| "spacy>=3.8,<3.9", |
There was a problem hiding this comment.
Is space <3.9 a hard requirement? In the past space has caused a few conflicts w.r.t cuda versions etc.
There was a problem hiding this comment.
Havent tested with later spacy, so let's keep this for now. Will do some more tests to update this.
rkalaniNV
left a comment
There was a problem hiding this comment.
Pushed new commit
| from nemo_curator.stages.text.translation.backends._retry import MAX_RETRIES | ||
| from nemo_curator.stages.text.translation.backends.base import TranslationBackend | ||
|
|
||
| logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR) |
There was a problem hiding this comment.
Yup, not needed. Removed
| _MAX_BACKOFF_SECONDS = 60.0 | ||
|
|
||
|
|
||
| async def retry_with_backoff( |
There was a problem hiding this comment.
This is being used by both aws and google backends. Do we want this for AsyncLLMClient as well?
|
|
||
| name: str = "FaithEvalFilter" | ||
| client: AsyncLLMClient | None = None | ||
| model_name: str = "" |
There was a problem hiding this comment.
Yup. Added explicit checks in post init. It now raises clear error message.
| source_lang: str = "en" | ||
| target_lang: str = "zh" |
There was a problem hiding this comment.
Yes. Updated code to get users to input these.
|
|
||
| def process(self, batch: DocumentBatch) -> DocumentBatch: | ||
| """Score each translation row and filter rows below threshold.""" | ||
| df = batch.to_pandas().copy() |
There was a problem hiding this comment.
Not by pyarrow but just to be safe for pandas dataframe.
|
|
||
|
|
||
| @dataclass | ||
| class TextQualityMetricStage(ProcessingStage[DocumentBatch, DocumentBatch]): |
There was a problem hiding this comment.
Specific to translation. Given the OriginalText and BackTranslatedText we compute these metrics.
|
|
||
|
|
||
| @dataclass(kw_only=True) | ||
| class TranslationStage(CompositeStage[DocumentBatch, DocumentBatch]): |
There was a problem hiding this comment.
Already renamed from TranslationPipeline to TranslationStage. Any other suggestion?
| min_segment_chars: int = 0 | ||
|
|
||
| client: AsyncLLMClient | None = None | ||
| model_name: str = "" |
There was a problem hiding this comment.
Nope. Added explicit checks
| # Translation | ||
| "aiohttp>=3.13.3", | ||
| "boto3>=1.35", | ||
| "google-cloud-translate", |
There was a problem hiding this comment.
Yes, made these changes
| "iso639-lang>=2.6.0", | ||
| "requests", | ||
| "sacrebleu>=2.6.0", | ||
| "spacy>=3.8,<3.9", |
There was a problem hiding this comment.
Havent tested with later spacy, so let's keep this for now. Will do some more tests to update this.
| for idx, seg in enumerate(segments): | ||
| if not seg or not seg.strip(): | ||
| continue | ||
| if not is_line_translatable_content(seg): | ||
| continue | ||
| start = time.time() |
There was a problem hiding this comment.
Fallback loop silently drops non-translatable content
When the bulk translation call fails and the code falls to the per-segment fallback, continue is reached for non-translatable segments (JSON blobs, XML tags, etc.) without writing the original text back to translated[idx]. Because translated was initialized to [""] * len(segments) and the passthrough assignment translated[idx] = seg only happens in the batch-path loop (lines 299–301), non-translatable content is silently replaced with "" in the fallback path.
if not is_line_translatable_content(seg):
translated[idx] = seg # preserve passthrough in fallback too
continue| nonempty_mask = df.apply( | ||
| lambda row: bool(str(row.get(self.source_text_field, "")).strip()) | ||
| or bool(str(row.get(self.translated_text_field, "")).strip()), | ||
| axis=1, | ||
| ) |
There was a problem hiding this comment.
or mask scores rows with failed/empty translations
nonempty_mask uses or, so any row where the source text is non-empty (but translation is "" due to a failure) is still sent to the LLM for FAITH scoring. The LLM evaluates a real source against an empty string, returns a near-zero score, and the row is filtered by the threshold — silently discarding it for the wrong reason (FAITH failure rather than translation error). The fix is to require both fields to be non-empty:
| nonempty_mask = df.apply( | |
| lambda row: bool(str(row.get(self.source_text_field, "")).strip()) | |
| or bool(str(row.get(self.translated_text_field, "")).strip()), | |
| axis=1, | |
| ) | |
| nonempty_mask = df.apply( | |
| lambda row: bool(str(row.get(self.source_text_field, "")).strip()) | |
| and bool(str(row.get(self.translated_text_field, "")).strip()), | |
| axis=1, | |
| ) |
Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
mikemckiernan
left a comment
There was a problem hiding this comment.
Raunak, actually, this repo covers the docs in the /fern directory. I made a proposed PR with these updates in #1903. You are welcome to make similar changes and we can close 1903, or we can use 1903 (likely with changes that I don't know to make).
Whatever works best for everyone.
| pipeline.add_stage( | ||
| TranslationStage( | ||
| client=client, | ||
| model_name="openai/openai/gpt-5.1", |
There was a problem hiding this comment.
is it OK to change this to openai/gpt-oss-120b? I can't find gpt-5.1.
Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
Signed-off-by: rkalani <rkalani@nvidia.com>
| def _write_multi_field_payload( | ||
| self, | ||
| out_row: dict[str, Any], | ||
| reassembled_by_path: dict[str, list[str]], | ||
| translation_map: dict[str, Any], | ||
| ) -> None: | ||
| """Write reassembled multi-field output payload into ``out_row``.""" | ||
| output_payload: Any = None | ||
| for field_path, texts in reassembled_by_path.items(): | ||
| output_payload = self._write_one_field_payload(out_row, field_path, texts) | ||
|
|
||
| if len(reassembled_by_path) == 1 and output_payload is not None: | ||
| out_row[self.output_field] = output_payload | ||
| else: | ||
| out_row[self.output_field] = translation_map |
There was a problem hiding this comment.
Empty output field set to
{} instead of ""
When a document has no extractable text in any configured field path (all source fields are empty strings or missing), _collect_multi_field_outputs returns an empty reassembled_by_path = {} and an empty translation_map = {}. In _write_multi_field_payload, because len(reassembled_by_path) != 1, it falls into the else branch and sets out_row[self.output_field] = {} (an empty dict). Downstream stages like FormatTranslationOutputStage._build_translated_messages receive this dict as translated_text, call str({}) to get "{}", and inject the string literal "{}" into every message's content field rather than an empty string.
| def _write_multi_field_payload( | |
| self, | |
| out_row: dict[str, Any], | |
| reassembled_by_path: dict[str, list[str]], | |
| translation_map: dict[str, Any], | |
| ) -> None: | |
| """Write reassembled multi-field output payload into ``out_row``.""" | |
| output_payload: Any = None | |
| for field_path, texts in reassembled_by_path.items(): | |
| output_payload = self._write_one_field_payload(out_row, field_path, texts) | |
| if len(reassembled_by_path) == 1 and output_payload is not None: | |
| out_row[self.output_field] = output_payload | |
| else: | |
| out_row[self.output_field] = translation_map | |
| if not reassembled_by_path: | |
| out_row[self.output_field] = "" | |
| elif len(reassembled_by_path) == 1 and output_payload is not None: | |
| out_row[self.output_field] = output_payload | |
| else: | |
| out_row[self.output_field] = translation_map |
Signed-off-by: rkalani <rkalani@nvidia.com>
| --- | ||
| description: "Translate flat and structured text fields with Curator's experimental translation pipeline, quality scoring, and backend integrations" | ||
| categories: ["how-to-guides"] | ||
| tags: ["translation", "experimental", "multilingual", "faith", "text-quality", "llm", "nmt"] | ||
| personas: ["data-scientist-focused", "mle-focused"] | ||
| difficulty: "intermediate" | ||
| content_type: "how-to" | ||
| modality: "text-only" | ||
| --- | ||
|
|
There was a problem hiding this comment.
PR description usage example has wrong import path, wrong class name, and a non-existent parameter
The PR description shows from nemo_curator.stages.text.translation import TranslationPipeline with preserve_segment_pairs=True. Three problems: (1) the module is nemo_curator.stages.text.experimental.translation, not nemo_curator.stages.text.translation; (2) the public class is TranslationStage, not TranslationPipeline; (3) preserve_segment_pairs is not a valid parameter on TranslationStage. The documentation files themselves are correct — this discrepancy is in the PR description only.
Signed-off-by: rkalani <rkalani@nvidia.com>
|
/ok to test 6e32a16 |
|
/ok to test 823c707 |
| if wildcard_path: | ||
| translation_map.setdefault(field_key, []).append(reassembled) | ||
| segmented_map.setdefault(field_key, []).extend(current_pairs) | ||
| else: | ||
| translation_map[field_key] = reassembled | ||
| segmented_map[field_key] = current_pairs |
There was a problem hiding this comment.
Leaf-key collision silently drops translations for same-leaf multi-path configs
When text_field contains two non-wildcard paths that share the same trailing component (e.g. ["user.content", "assistant.content"]), _leaf_field_key returns "content" for both. The second path's translation_map[field_key] = reassembled call overwrites the first, and segmented_map[field_key] = current_pairs does the same. When there are multiple field paths in reassembled_by_path, out_row[self.output_field] is set to this incomplete translation_map, silently discarding every path except the last one that shares a leaf name. The _translation_map metadata column is equally affected.
| if wildcard_path: | |
| translation_map.setdefault(field_key, []).append(reassembled) | |
| segmented_map.setdefault(field_key, []).extend(current_pairs) | |
| else: | |
| translation_map[field_key] = reassembled | |
| segmented_map[field_key] = current_pairs | |
| if wildcard_path: | |
| translation_map.setdefault(field_key, []).append(reassembled) | |
| segmented_map.setdefault(field_key, []).extend(current_pairs) | |
| else: | |
| # Use the full field_path as key to avoid collisions when | |
| # multiple paths share the same leaf name (e.g. user.content | |
| # and assistant.content both have leaf "content"). | |
| translation_map[field_path] = reassembled | |
| segmented_map[field_path] = current_pairs |
| nonempty_mask = df.apply( | ||
| lambda row: bool(str(row.get(self.source_text_field, "")).strip()) | ||
| or bool(str(row.get(self.translated_text_field, "")).strip()), | ||
| axis=1, | ||
| ) |
There was a problem hiding this comment.
or mask scores rows where translation silently failed
nonempty_mask uses or, so any row where the source text is non-empty but _translated is "" (a failed translation) is still submitted to the LLM for FAITH scoring. The model evaluates a real source against an empty string and typically returns a near-zero score, causing the row to be dropped by the threshold filter for the wrong reason — appearing as a quality failure rather than a translation error. Using and ensures only rows with both a source and a translation are scored.
| nonempty_mask = df.apply( | |
| lambda row: bool(str(row.get(self.source_text_field, "")).strip()) | |
| or bool(str(row.get(self.translated_text_field, "")).strip()), | |
| axis=1, | |
| ) | |
| nonempty_mask = df.apply( | |
| lambda row: bool(str(row.get(self.source_text_field, "")).strip()) | |
| and bool(str(row.get(self.translated_text_field, "")).strip()), | |
| axis=1, | |
| ) |
| if self.output_mode in ("raw", "both"): | ||
| self._build_metadata_column(df) | ||
|
|
||
| if self.output_mode == "raw" and self.output_field in df.columns: | ||
| df = df.drop(columns=[self.output_field]) | ||
|
|
||
| if self.reconstruct_messages and self.messages_field in df.columns: | ||
| self._build_translated_messages(df) |
There was a problem hiding this comment.
output_field column dropped before _build_translated_messages reads it
When output_mode="raw" and reconstruct_messages=True, the output_field column is removed on line 75 before _build_translated_messages is called on line 78. Inside that method, df.iloc[idx].get(self.output_field, "") returns "" for every row because the column no longer exists, so reconstruct_messages_with_translation is called with an empty string and every message ends up with empty translated content. The _build_translated_messages call should be moved before the column drop.
| if self.output_mode in ("raw", "both"): | |
| self._build_metadata_column(df) | |
| if self.output_mode == "raw" and self.output_field in df.columns: | |
| df = df.drop(columns=[self.output_field]) | |
| if self.reconstruct_messages and self.messages_field in df.columns: | |
| self._build_translated_messages(df) | |
| if self.output_mode in ("raw", "both"): | |
| self._build_metadata_column(df) | |
| if self.reconstruct_messages and self.messages_field in df.columns: | |
| self._build_translated_messages(df) | |
| if self.output_mode == "raw" and self.output_field in df.columns: | |
| df = df.drop(columns=[self.output_field]) |
Description
This PR adds and stabilizes the text translation package in Curator compared to main.
It introduces a reusable translation workflow centered on TranslationPipeline, with explicit stages for segmentation, translation, reassembly, formatting, and
evaluation. The package now supports structured and nested inputs, including wildcard field paths such as messages.*.content, and can emit translated output in
replaced, raw, or both modes.
The branch also adds reusable translation backends and helpers:
In addition to adding the package, this branch simplifies and cleans up the implementation:
This branch also updates the translation test suite to cover the new module layout and behavior, including segmentation, reassembly, metadata emission, prompt
loading, and backend translation paths.
Usage
from nemo_curator.stages.text.translation import TranslationPipeline
pipeline = TranslationPipeline(
source_lang="en",
target_lang="hi",
text_field="messages.*.content",
output_mode="both",
preserve_segment_pairs=True,
enable_faith_eval=True,
model_name="openai/openai/gpt-5.1",
)
Checklist