Fix recurring AI translation output bugs (fences, dup keys, truncation) by beagandica · Pull Request #7 · NuevoFoundation/WorkshopsAutoTranslation

beagandica · 2026-05-29T02:58:47Z

What

Fixes the two recurring AI translation output bugs we kept manually cleaning up before merge in NuevoFoundation/workshops PRs (most recently #606 and #658):

Code-fence wrapping (~30% of files): the LLM was wrapping the entire translated file in ```markdown ... ``` or ```yaml ... ```, which breaks Hugo's frontmatter parser. PR #658 had 31/78 files affected.
Duplicate YAML keys: some translated answer-key files contained two hidden: true lines, which the YAML parser rejects. PR #606 had 3 such files.

Both bugs required hand-cleanup on every translation PR. After this change, the tool produces clean output, and if the LLM still does something invalid the file is skipped with a clear error instead of silently writing broken markdown.

Changes

Stronger prompt (`Program.cs`)

Explicitly forbids wrapping output in any kind of code fence (markdown/md/yaml/yml).
Forbids preamble or commentary.
Requires translated frontmatter to have the same key set as the source (no additions, no removals, no duplicates).
Spells out which frontmatter values may be translated (title, description, summary) vs which must be preserved byte-for-byte.

Inference settings

Temperature: 1.0 → 0.2 (translation should be close to deterministic; the high temperature was driving the structural defects).
MaxTokens: 1500 → 8192 (multiple workshop pages exceed 1500 tokens and were being silently truncated).

Defense-in-depth post-processing

Strips a single layer of outer code-fence wrapping if the LLM still adds one. Conservative: only strips when the first and last non-empty lines are a matching fence pair, so legitimate code blocks inside the body are preserved.
Also handles the rarer "fence wraps frontmatter only" pattern.
Strips a leading BOM.
Logs every cleanup it performs.
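
The conservative stripping rule above can be sketched roughly as follows. This is an illustrative Python sketch, not the tool's actual C# code in `Program.cs`: the function name, the log messages, and the extra "inner fences must be balanced" guard are assumptions added for the example, and the rarer "fence wraps frontmatter only" case is not covered.

```python
from __future__ import annotations

BOM = "\ufeff"
# Fence languages the PR says the LLM used when wrapping whole files.
WRAPPER_LANGS = {"", "markdown", "md", "yaml", "yml"}


def strip_outer_fence(text: str) -> tuple[str, list[str]]:
    """Remove a leading BOM and at most one outer fence layer.

    Returns the cleaned text plus a log of every cleanup performed.
    """
    log: list[str] = []
    if text.startswith(BOM):
        text = text[len(BOM):]
        log.append("stripped leading BOM")

    lines = text.splitlines()
    non_empty = [i for i, line in enumerate(lines) if line.strip()]
    if len(non_empty) < 2:
        return text, log

    first, last = non_empty[0], non_empty[-1]
    opener, closer = lines[first].strip(), lines[last].strip()
    if not (opener.startswith("```") and closer == "```"):
        return text, log  # not wrapped: pass through untouched
    if opener[3:].strip().lower() not in WRAPPER_LANGS:
        return text, log  # e.g. a body that legitimately starts with ```python

    inner = lines[first + 1:last]
    # Assumed extra guard: fences inside the wrapper must pair up, otherwise
    # the last line is probably closing a real code block in the body.
    inner_fences = sum(1 for line in inner if line.strip().startswith("```"))
    if inner_fences % 2 != 0:
        return text, log

    log.append(f"stripped outer {opener} fence")
    return "\n".join(inner) + "\n", log
```

Only the first and last non-empty lines are ever removed, so code blocks inside the body survive unchanged.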

Validation (fail fast, never write broken output)

Checks FinishReason: if the model hit the token limit, the file is skipped and the process exits non-zero.
Parses translated frontmatter with YamlDotNet (WithDuplicateKeyChecking()), which throws on the exact dup-key pattern from #606.
Verifies translated frontmatter has the same key set as source.
Rejects translations whose body is empty when source body was not.
Rejects translations that still start with a code fence.
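
As a rough illustration of the frontmatter and body checks listed above, here is a standard-library Python sketch. It is not the PR's implementation, which parses frontmatter with YamlDotNet. This version only scans flat, unindented `key:` lines, and all function and exception names are hypothetical. The FinishReason check is omitted because it depends on the LLM client.

```python
from __future__ import annotations


class ValidationError(Exception):
    """Raised when a translated file must not be written."""


def split_frontmatter(text: str) -> tuple[list[str], str]:
    """Split '---' delimited frontmatter from the body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValidationError("missing frontmatter opener")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    raise ValidationError("unterminated frontmatter")


def top_level_keys(frontmatter: list[str]) -> list[str]:
    """Collect unindented 'key:' names, raising on duplicates."""
    keys: list[str] = []
    for line in frontmatter:
        # Skip blanks, nested/indented lines, comments, and list items.
        if not line.strip() or line[0] in " \t#-" or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key in keys:
            raise ValidationError(f"duplicate key: {key}")
        keys.append(key)
    return keys


def validate(source: str, translated: str) -> None:
    """Raise ValidationError if the translation breaks any structural rule."""
    if translated.lstrip().startswith("```"):
        raise ValidationError("output still starts with a code fence")
    src_fm, src_body = split_frontmatter(source)
    out_fm, out_body = split_frontmatter(translated)
    src_keys, out_keys = set(top_level_keys(src_fm)), set(top_level_keys(out_fm))
    if src_keys != out_keys:
        raise ValidationError(f"key mismatch: {sorted(src_keys ^ out_keys)}")
    if src_body.strip() and not out_body.strip():
        raise ValidationError("translated body is empty")


def first_error(source: str, translated: str) -> str | None:
    """Convenience wrapper: return the failure message, or None if valid."""
    try:
        validate(source, translated)
    except ValidationError as err:
        return str(err)
    return None
```

A real YAML parser with duplicate-key checking is the safer choice for nested or quoted keys; the line scan here only shows the shape of the checks.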

Path bug

Replaced filePath.Replace("english", newFolder) with a segment-aware path replace. The old version would corrupt any path whose filename contained the substring "english" (e.g. a workshop named "my-english-club").
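
To show the difference between substring and segment-aware replacement, here is a small Python analogue. The actual fix is in C#. The function name and the `content/english/...` layout are assumptions made for this example.

```python
from pathlib import PurePosixPath


def replace_language_segment(path: str, old: str, new: str) -> str:
    """Replace only whole path segments equal to `old`, never substrings."""
    parts = PurePosixPath(path).parts
    return str(PurePosixPath(*[new if part == old else part for part in parts]))


path = "content/english/my-english-club/index.md"

# Naive substring replace also rewrites the workshop folder name.
naive = path.replace("english", "spanish")

# Segment-aware replace touches only the language folder.
fixed = replace_language_segment(path, "english", "spanish")
```

Here `naive` becomes `content/spanish/my-spanish-club/index.md`, while `fixed` is `content/spanish/my-english-club/index.md`.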

Exit behavior

Files that fail validation are skipped (no broken output written).
The process exits with code 2 so the workflow surfaces the problem instead of silently producing bad PRs.
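
A minimal sketch of this skip-and-report flow, in Python rather than the tool's C#. The `run` function, the `translate` callback, and the use of `ValueError` to signal a validation failure are all hypothetical. Only the exit code 2 comes from the description above.

```python
from __future__ import annotations

import sys
from typing import Callable

EXIT_OK = 0
EXIT_VALIDATION_FAILED = 2  # exit code named in the PR description


def run(files: dict[str, str], translate: Callable[[str], str],
        written: dict[str, str]) -> int:
    """Translate each file; skip failures and report them via the exit code."""
    failed: list[str] = []
    for name, source in files.items():
        try:
            # Nothing is stored for this file unless translate() succeeds.
            written[name] = translate(source)
        except ValueError as err:
            print(f"SKIPPED {name}: {err}", file=sys.stderr)
            failed.append(name)
    return EXIT_VALIDATION_FAILED if failed else EXIT_OK


def demo_translate(text: str) -> str:
    """Stand-in for translate + validate; rejects fenced output."""
    if text.startswith("```"):
        raise ValueError("output starts with a code fence")
    return text.upper()
```

A caller would end with `sys.exit(run(files, demo_translate, written))`, so a CI workflow fails visibly whenever a file was skipped.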

Dependency

Added YamlDotNet 16.2.1 for safe frontmatter parsing.

Validation

Locally exercised the post-processor and validator against the real bug patterns we hit in PRs #606 and #658:

PASS  strip-markdown-fence-wrap            (PR #658 pattern)
PASS  strip-yaml-fence-wrap                (PR #658 pattern)
PASS  strip-partial-yaml-fence-wrap        (PR #658 pattern, rare)
PASS  passthrough-clean                    (no false positives on good output)
PASS  strip-bom
PASS  preserve-internal-code-fences        (does not strip legitimate body code blocks)
PASS  empty-response-throws
PASS  detect-key-mismatch
PASS  detect-duplicate-key                 (PR #606 pattern)
PASS  reject-residual-fence
PASS  path-replace-segment-only
PASS  reject-empty-body

Follow-up

Once this is merged, the submodule pointer in NuevoFoundation/workshops needs to be bumped via a separate PR. That PR will also need to bump the workflow's .NET SDK from 8.0.x to 10.0.x (the repo upgraded the project to .NET 10 in #6 but never bumped the submodule pointer or workflow).

Tracking issue: NuevoFoundation/workshops#662 (assigned to @ozhang22).

Addresses the two failure modes seen repeatedly in auto-translation PRs (observed in workshops PRs #606 and #658):

1. ~30% of translated files were wrapped in a code-fence block (```markdown ... ``` or ```yaml ... ```) around the whole file, breaking Hugo's frontmatter parser.
2. Some translated answer-key files contained duplicate YAML keys (e.g. two `hidden: true` lines), which YAML parsers reject.

Both bugs required manual cleanup before each merge. This change makes the tool produce clean output and refuses to write broken output.

Changes
-------

Prompt

* Rewritten to explicitly forbid wrapping the response in any kind of code fence (markdown/md/yaml/yml), forbid extra commentary, and require the translated frontmatter to have the exact same key set as the source (no additions, no removals, no duplicates).
* Spells out which frontmatter values may be translated (title, description, summary) and which must be preserved byte-for-byte.

Inference settings

* Temperature lowered from 1.0 to 0.2 (deterministic task).
* MaxTokens raised from 1500 to 8192 (several pages exceed 1500).

Defense-in-depth post-processing

* Strips a single layer of outer code-fence wrapping if the LLM still adds one. Conservative: only when the first and last non-empty lines are a matching fence pair; never touches code blocks inside the body.
* Strips a leading BOM.
* Logs every cleanup it performs.

Validation (fail fast, never write broken output)

* Checks the LLM's FinishReason. If truncated, skip and exit non-zero.
* Parses translated frontmatter with YamlDotNet using WithDuplicateKeyChecking(), which throws on the exact dup-key pattern that caused PR #606's manual fixes.
* Verifies the translated frontmatter has the same key set as the source.
* Rejects translations whose body is empty when source body was not.
* Rejects translations that still start with a code fence.

Path bug

* Replaced filePath.Replace("english", newFolder) with a segment-aware replace. The old version would corrupt any path whose filename contained the substring "english" (e.g. a workshop named "my-english-club").

Behavior on validation failure

* The offending file is skipped (no broken output written).
* The process exits with code 2 so the workflow surfaces the problem instead of silently producing bad PRs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

beagandica mentioned this pull request May 29, 2026

Auto-translation tool produces malformed output (code-fence wraps, duplicate YAML keys) NuevoFoundation/workshops#662

Open

beagandica wants to merge 1 commit into main from fix/translation-quality-and-validation
