Fix recurring AI translation output bugs (fences, dup keys, truncation)#7

Open
beagandica wants to merge 1 commit into main from fix/translation-quality-and-validation


beagandica commented May 29, 2026


What

Fixes the two recurring AI translation output bugs we kept manually cleaning up before merge in NuevoFoundation/workshops PRs (most recently #606 and #658):

  1. Code-fence wrapping (~30% of files): the LLM was wrapping the entire translated file in ```markdown ... ``` or ```yaml ... ```, which breaks Hugo's frontmatter parser. PR #658 had 31/78 files affected.
  2. Duplicate YAML keys: some translated answer-key files contained two hidden: true lines, which the YAML parser rejects. PR #606 had 3 such files.

Both bugs required hand-cleanup on every translation PR. After this change, the tool produces clean output, and if the LLM still does something invalid the file is skipped with a clear error instead of silently writing broken markdown.

Changes

Stronger prompt (Program.cs)

  • Explicitly forbids wrapping output in any kind of code fence (markdown/md/yaml/yml).
  • Forbids preamble or commentary.
  • Requires translated frontmatter to have the same key set as the source (no additions, no removals, no duplicates).
  • Spells out which frontmatter values may be translated (title, description, summary) vs which must be preserved byte-for-byte.

Inference settings

  • Temperature: 1.0 → 0.2 (translation is a deterministic task; the high temperature was driving the structural defects).
  • MaxTokens: 1500 → 8192 (multiple workshop pages exceed 1500 tokens and were being silently truncated).

Defense-in-depth post-processing

  • Strips a single layer of outer code-fence wrapping if the LLM still adds one. Conservative: only strips when the first and last non-empty lines are a matching fence pair, so legitimate code blocks inside the body are preserved.
  • Also handles the rarer "fence wraps frontmatter only" pattern.
  • Strips a leading BOM.
  • Logs every cleanup it performs.
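
For reviewers who want the cleanup rule spelled out, here is a minimal sketch of the conservative fence strip described above. It is written in stdlib Python for brevity and is not the code in Program.cs; the function name `strip_outer_fence`, the log messages, and the exact set of accepted fence labels are assumptions made for illustration. It does not cover the "fence wraps frontmatter only" variant.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this sketch easy to embed

# Assumed set of wrapper labels: a bare fence or markdown/md/yaml/yml.
OPEN_RE = re.compile("^" + FENCE + r"(markdown|md|yaml|yml)?\s*$", re.IGNORECASE)


def strip_outer_fence(text: str) -> tuple[str, list[str]]:
    """Return (cleaned_text, log) after removing a BOM and one outer fence layer."""
    log = []
    if text.startswith("\ufeff"):
        text = text[1:]
        log.append("stripped leading BOM")

    lines = text.split("\n")
    # Only the first and last non-empty lines are inspected, so code blocks
    # inside the body are never touched.
    nonempty = [i for i, line in enumerate(lines) if line.strip()]
    if len(nonempty) >= 2:
        first, last = nonempty[0], nonempty[-1]
        if OPEN_RE.match(lines[first].strip()) and lines[last].strip() == FENCE:
            lines = lines[first + 1:last]
            log.append("stripped outer code fence")
    return "\n".join(lines), log
```

Because the check requires a matching pair on the outermost non-empty lines, a page that merely contains a fenced block in its body passes through unchanged.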

Validation (fail fast, never write broken output)

  • Checks FinishReason — if the model hit the token limit, skip the file and exit non-zero.
  • Parses translated frontmatter with YamlDotNet (WithDuplicateKeyChecking()), which throws on the exact dup-key pattern from #606.
  • Verifies translated frontmatter has the same key set as source.
  • Rejects translations whose body is empty when source body was not.
  • Rejects translations that still start with a code fence.
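
The frontmatter checks above can be illustrated roughly as follows. This is a stdlib Python sketch, not the PR's implementation: the real tool parses with YamlDotNet, whereas this version uses a naive line scanner that only looks at top-level `key: value` lines. All function names and error messages here are hypothetical.

```python
def split_frontmatter(text):
    """Split '---'-delimited frontmatter from the body."""
    lines = text.split("\n")
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    raise ValueError("unterminated frontmatter")


def top_level_keys(fm_lines):
    """Collect top-level keys, raising on the first duplicate."""
    keys = []
    for line in fm_lines:
        # Skip blank lines, comments, list items, and indented (nested) entries.
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key in keys:
            raise ValueError(f"duplicate key: {key}")
        keys.append(key)
    return keys


def validate(source, translated):
    """Raise ValueError if the translation breaks any structural rule."""
    if translated.lstrip().startswith("`" * 3):
        raise ValueError("residual code fence")
    src_fm, src_body = split_frontmatter(source)
    dst_fm, dst_body = split_frontmatter(translated)
    src_keys = set(top_level_keys(src_fm))
    dst_keys = set(top_level_keys(dst_fm))
    if src_keys != dst_keys:
        raise ValueError(f"key set mismatch: {sorted(src_keys ^ dst_keys)}")
    if src_body.strip() and not dst_body.strip():
        raise ValueError("empty body")
```

A translation with two `hidden: true` lines fails at `top_level_keys`, which mirrors the #606 pattern; a real YAML parser is still the safer choice because it also handles quoting, multi-line values, and nested maps.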

Path bug

  • Replaced filePath.Replace("english", newFolder) with a segment-aware path replace. The old version would corrupt any path whose filename contained the substring "english" (e.g. a workshop named "my-english-club").
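
To make the difference concrete, here is a small sketch of a segment-aware replace next to the substring version. It is stdlib Python rather than the tool's C#, and the helper name `replace_language_segment`, the `content/...` layout, and the `spanish` target folder are hypothetical examples, not paths taken from the repo.

```python
from pathlib import PurePosixPath


def replace_language_segment(file_path, new_folder, old_folder="english"):
    """Swap only path segments that equal old_folder exactly."""
    parts = PurePosixPath(file_path).parts
    swapped = tuple(new_folder if part == old_folder else part for part in parts)
    return str(PurePosixPath(*swapped))


path = "content/english/my-english-club/_index.md"

# Substring replace rewrites the workshop name as well as the language folder.
naive = path.replace("english", "spanish")

# Segment-aware replace touches only the whole "english" directory segment.
safe = replace_language_segment(path, "spanish")
```

Here `naive` becomes `content/spanish/my-spanish-club/_index.md`, while `safe` is `content/spanish/my-english-club/_index.md`.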

Exit behavior

  • Files that fail validation are skipped (no broken output written).
  • The process exits with code 2 so the workflow surfaces the problem instead of silently producing bad PRs.
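
The skip-and-exit flow might look roughly like this. It is a stdlib Python sketch under stated assumptions, not the code in Program.cs: `translate`, `validate`, and `write` are hypothetical callables supplied by the caller, and the `"length"` finish reason is an assumed stand-in for whatever value the inference SDK reports when the token limit is hit.

```python
import sys


def run(files, translate, validate, write):
    """Translate each (path, source) pair; return the process exit code."""
    failures = []
    for path, source in files:
        try:
            translated, finish_reason = translate(source)
            # Assumed convention: "length" means the model hit the token limit.
            if finish_reason == "length":
                raise ValueError("output truncated at token limit")
            validate(source, translated)
        except ValueError as err:
            # Skip the file entirely so no broken output reaches disk.
            print(f"SKIP {path}: {err}", file=sys.stderr)
            failures.append(path)
            continue
        write(path, translated)
    # Exit code 2 signals that at least one file was skipped.
    return 2 if failures else 0
```

A caller would end with something like `sys.exit(run(files, translate, validate, write))`, so good files are still written while the workflow run is marked as failed.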

Dependency

  • Added YamlDotNet 16.2.1 for safe frontmatter parsing.

Validation

Locally exercised the post-processor and validator against the real bug patterns we hit in PRs #606 and #658:

PASS  strip-markdown-fence-wrap            (PR #658 pattern)
PASS  strip-yaml-fence-wrap                (PR #658 pattern)
PASS  strip-partial-yaml-fence-wrap        (PR #658 pattern, rare)
PASS  passthrough-clean                    (no false positives on good output)
PASS  strip-bom
PASS  preserve-internal-code-fences        (does not strip legitimate body code blocks)
PASS  empty-response-throws
PASS  detect-key-mismatch
PASS  detect-duplicate-key                 (PR #606 pattern)
PASS  reject-residual-fence
PASS  path-replace-segment-only
PASS  reject-empty-body

Follow-up

Once this is merged, the submodule pointer in NuevoFoundation/workshops needs to be bumped via a separate PR. That PR will also need to bump the workflow's .NET SDK from 8.0.x to 10.0.x (the repo upgraded the project to .NET 10 in #6 but never bumped the submodule pointer or workflow).


Tracking issue: NuevoFoundation/workshops#662 (assigned to @ozhang22).

Addresses the two failure modes seen repeatedly in auto-translation PRs
(observed in workshops PRs #606 and #658):

1. ~30% of translated files were wrapped in a code-fence block
   (```markdown ... ``` or ```yaml ... ```) around the whole file,
   breaking Hugo's frontmatter parser.
2. Some translated answer-key files contained duplicate YAML keys
   (e.g. two `hidden: true` lines), which YAML parsers reject.

Both bugs required manual cleanup before each merge. This change makes
the tool produce clean output and refuses to write broken output.

Changes
-------

Prompt
* Rewritten to explicitly forbid wrapping the response in any kind of
  code fence (markdown/md/yaml/yml), forbid extra commentary, and
  require the translated frontmatter to have the exact same key set
  as the source (no additions, no removals, no duplicates).
* Spells out which frontmatter values may be translated (title,
  description, summary) and which must be preserved byte-for-byte.

Inference settings
* Temperature lowered from 1.0 to 0.2 (deterministic task).
* MaxTokens raised from 1500 to 8192 (several pages exceed 1500).

Defense-in-depth post-processing
* Strips a single layer of outer code-fence wrapping if the LLM
  still adds one. Conservative: only when the first and last
  non-empty lines are a matching fence pair; never touches code
  blocks inside the body.
* Strips a leading BOM.
* Logs every cleanup it performs.

Validation (fail fast, never write broken output)
* Checks the LLM's FinishReason. If truncated, skip and exit non-zero.
* Parses translated frontmatter with YamlDotNet using
  WithDuplicateKeyChecking(), which throws on the exact dup-key
  pattern that caused PR #606's manual fixes.
* Verifies the translated frontmatter has the same key set as the
  source.
* Rejects translations whose body is empty when source body was not.
* Rejects translations that still start with a code fence.

Path bug
* Replaced filePath.Replace("english", newFolder) with a
  segment-aware replace. The old version would corrupt any path
  whose filename contained the substring "english"
  (e.g. a workshop named "my-english-club").

Behavior on validation failure
* The offending file is skipped (no broken output written).
* The process exits with code 2 so the workflow surfaces the
  problem instead of silently producing bad PRs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>