Fix recurring AI translation output bugs (fences, dup keys, truncation) #7
Open
beagandica wants to merge 1 commit into
Addresses the two failure modes seen repeatedly in auto-translation PRs
(observed in workshops PRs #606 and #658):
1. ~30% of translated files were wrapped in a code-fence block
(```markdown ... ``` or ```yaml ... ```) around the whole file,
breaking Hugo's frontmatter parser.
2. Some translated answer-key files contained duplicate YAML keys
(e.g. two `hidden: true` lines), which YAML parsers reject.
Both bugs required manual cleanup before each merge. With this change,
the tool produces clean output and refuses to write broken output.
Changes
-------
Prompt
* Rewritten to explicitly forbid wrapping the response in any kind of
code fence (markdown/md/yaml/yml), forbid extra commentary, and
require the translated frontmatter to have the exact same key set
as the source (no additions, no removals, no duplicates).
* Spells out which frontmatter values may be translated (title,
description, summary) and which must be preserved byte-for-byte.
Inference settings
* Temperature lowered from 1.0 to 0.2 (deterministic task).
* MaxTokens raised from 1500 to 8192 (several pages exceed 1500).
Defense-in-depth post-processing
* Strips a single layer of outer code-fence wrapping if the LLM
still adds one. Conservative: only when the first and last
non-empty lines are a matching fence pair; never touches code
blocks inside the body.
* Strips a leading BOM.
* Logs every cleanup it performs.
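A minimal sketch of the fence-stripping rule described above, not the code in this PR. The class and method names (`FenceCleaner`, `StripOuterFence`) are hypothetical, and the accepted fence labels are taken from the list in the prompt section. It only acts when the first and last non-empty lines form a fence pair, so fenced blocks inside the body are left alone.

```csharp
using System;
using System.Linq;

public static class FenceCleaner
{
    // Fence labels named in the commit message; any other label is left untouched.
    private static readonly string[] FenceLangs = { "", "markdown", "md", "yaml", "yml" };

    public static string StripOuterFence(string text)
    {
        // Drop a leading byte-order mark, if present.
        if (text.Length > 0 && text[0] == '\uFEFF') text = text.Substring(1);

        var lines = text.Replace("\r\n", "\n").Split('\n').ToList();
        int first = lines.FindIndex(l => l.Trim().Length > 0);
        int last = lines.FindLastIndex(l => l.Trim().Length > 0);
        if (first < 0 || first == last) return text;

        string open = lines[first].Trim();
        string close = lines[last].Trim();

        // Only act when the outermost non-empty lines are an opening and a closing fence.
        if (!open.StartsWith("```") || close != "```") return text;

        string lang = open.Substring(3).Trim().ToLowerInvariant();
        if (!FenceLangs.Contains(lang)) return text;

        // Log the cleanup so it is visible in the workflow output.
        Console.Error.WriteLine("cleanup: stripped one outer code fence");
        return string.Join("\n", lines.Skip(first + 1).Take(last - first - 1)) + "\n";
    }
}
```

Because a Hugo content file normally opens with `---`, a file that opens with a fence is almost certainly wrapped output, which is why checking only the outermost pair is a reasonably safe heuristic here.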
Validation (fail fast, never write broken output)
* Checks the LLM's FinishReason. If truncated, skip and exit non-zero.
* Parses translated frontmatter with YamlDotNet using
WithDuplicateKeyChecking(), which throws on the exact dup-key
pattern that caused PR #606's manual fixes.
* Verifies the translated frontmatter has the same key set as the
source.
* Rejects translations whose body is empty when source body was not.
* Rejects translations that still start with a code fence.
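The checks above can be sketched roughly as follows. This is an illustration under stated assumptions: the PR parses frontmatter with YamlDotNet and `WithDuplicateKeyChecking()`, whereas this dependency-free sketch substitutes a naive line scan that only sees unindented `key: value` lines. `FrontMatterCheck` and its members are hypothetical names, and the `FinishReason` check is omitted because it depends on the LLM client in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FrontMatterCheck
{
    // Splits "---\n<frontmatter>\n---\n<body>". Frontmatter is "" when the text has none.
    public static (string FrontMatter, string Body) Split(string text)
    {
        var lines = text.Replace("\r\n", "\n").Split('\n');
        if (lines[0].Trim() != "---") return ("", text);
        int end = Array.FindIndex(lines, 1, l => l.Trim() == "---");
        if (end < 0) return ("", text);
        return (string.Join("\n", lines.Skip(1).Take(end - 1)),
                string.Join("\n", lines.Skip(end + 1)));
    }

    // Naive stand-in for a YAML parser: collects unindented keys, repeats included.
    public static List<string> TopLevelKeys(string frontMatter)
    {
        var keys = new List<string>();
        foreach (var line in frontMatter.Split('\n'))
        {
            if (line.Length == 0 || char.IsWhiteSpace(line[0]) || line[0] == '#' || line[0] == '-') continue;
            int colon = line.IndexOf(':');
            if (colon > 0) keys.Add(line.Substring(0, colon).Trim());
        }
        return keys;
    }

    // Returns an empty list when the translation passes every check.
    public static List<string> Validate(string source, string translated)
    {
        var errors = new List<string>();
        if (translated.TrimStart().StartsWith("```")) errors.Add("starts with a code fence");

        var (srcFm, srcBody) = Split(source);
        var (dstFm, dstBody) = Split(translated);
        var srcKeys = TopLevelKeys(srcFm);
        var dstKeys = TopLevelKeys(dstFm);

        foreach (var g in dstKeys.GroupBy(k => k).Where(g => g.Count() > 1))
            errors.Add("duplicate key: " + g.Key);
        foreach (var k in srcKeys.Except(dstKeys)) errors.Add("missing key: " + k);
        foreach (var k in dstKeys.Except(srcKeys)) errors.Add("unexpected key: " + k);

        if (srcBody.Trim().Length > 0 && dstBody.Trim().Length == 0) errors.Add("body is empty");
        return errors;
    }
}
```

Returning a list of problems rather than throwing on the first one lets the caller log every defect in a file before skipping it.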
Path bug
* Replaced filePath.Replace("english", newFolder) with a
segment-aware replace. The old version would corrupt any path
whose filename contained the substring "english"
(e.g. a workshop named "my-english-club").
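One way a segment-aware replace could look, as a sketch only. The PR does not show its implementation; `PathRewriter.ReplaceSegment` and the example path below are hypothetical. The lookarounds require the match to be bounded by a path separator or by the start or end of the string, so a substring inside a longer segment is never touched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathRewriter
{
    // Replaces only whole path segments equal to oldSegment.
    // Handles both '/' and '\' separators and leaves them as they were.
    public static string ReplaceSegment(string path, string oldSegment, string newSegment)
    {
        string pattern = @"(?<=^|[\\/])" + Regex.Escape(oldSegment) + @"(?=$|[\\/])";
        // A MatchEvaluator avoids '$' in newSegment being read as a substitution token.
        return Regex.Replace(path, pattern, _ => newSegment);
    }
}
```

For a hypothetical path `content/english/my-english-club/index.md`, this rewrites only the folder segment, while a plain `string.Replace` would also rewrite the workshop name.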
Behavior on validation failure
* The offending file is skipped (no broken output written).
* The process exits with code 2 so the workflow surfaces the
problem instead of silently producing bad PRs.
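The skip-and-exit behavior could be wired up along these lines. This is a sketch with hypothetical names (`TranslationRun`, and the `translate`, `validate` and `write` callbacks); only the exit code 2 comes from the description above.

```csharp
using System;
using System.Collections.Generic;

public static class TranslationRun
{
    public const int ExitOk = 0;
    public const int ExitValidationFailed = 2; // exit code named in the commit message

    // translate: returns the translated text for a file (hypothetical callback).
    // validate: returns a list of problems; an empty list means safe to write.
    public static int Run(
        IEnumerable<string> files,
        Func<string, string> translate,
        Func<string, IReadOnlyList<string>> validate,
        Action<string, string> write)
    {
        bool anyFailed = false;
        foreach (var file in files)
        {
            string output = translate(file);
            var problems = validate(output);
            if (problems.Count > 0)
            {
                anyFailed = true;
                string detail = string.Join("; ", problems);
                Console.Error.WriteLine("skipped " + file + ": " + detail);
                continue; // never write broken output
            }
            write(file, output);
        }
        // Keep processing the remaining files, but fail the run at the end.
        return anyFailed ? ExitValidationFailed : ExitOk;
    }
}
```

A caller would pass the result to `Environment.Exit` so that the workflow step fails whenever at least one file was skipped.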
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What
Fixes the two recurring AI translation output bugs we kept cleaning up by hand before merge in NuevoFoundation/workshops PRs (most recently #606 and #658):
1. Whole-file code fences: the output is wrapped in ```markdown ... ``` or ```yaml ... ```, which breaks Hugo's frontmatter parser. PR #658 had 31/78 files affected.
2. Duplicate frontmatter keys, e.g. two `hidden: true` lines, which the YAML parser rejects. PR #606 had 3 such files.
Both bugs required hand-cleanup on every translation PR. After this change, the tool produces clean output, and if the LLM still returns something invalid, the file is skipped with a clear error instead of broken markdown being written silently.
Changes
Stronger prompt (Program.cs)
* Forbids wrapping the response in any code fence and forbids extra commentary.
* Requires the translated frontmatter to keep exactly the source key set, and states which values may be translated (title, description, summary).
Inference settings
* Temperature: 1.0 → 0.2 (translation is a deterministic task; the high temperature was driving the structural defects).
* MaxTokens: 1500 → 8192 (multiple workshop pages exceed 1500 tokens and were being silently truncated).
Defense-in-depth post-processing
* Strips a single outer code-fence layer when the first and last non-empty lines are a matching fence pair; code blocks inside the body are never touched.
* Strips a leading BOM and logs every cleanup it performs.
Validation (fail fast, never write broken output)
* Checks FinishReason: if the model hit the token limit, skip the file and exit non-zero.
* Parses the translated frontmatter with YamlDotNet (WithDuplicateKeyChecking()), which throws on the exact dup-key pattern from #606.
Path bug
* Replaced filePath.Replace("english", newFolder) with a segment-aware path replace. The old version would corrupt any path whose filename contained the substring "english" (e.g. a workshop named "my-english-club").
Exit behavior
* On validation failure, the process exits with code 2 so the workflow surfaces the problem instead of silently producing bad PRs.
Dependency
* Adds YamlDotNet 16.2.1 for safe frontmatter parsing.
Validation
Locally exercised the post-processor and validator against the real bug patterns we hit in PRs #606 and #658.
Follow-up
Once this is merged, the submodule pointer in NuevoFoundation/workshops needs to be bumped via a separate PR. That PR will also need to bump the workflow's .NET SDK from 8.0.x to 10.0.x (the repo upgraded the project to .NET 10 in #6 but never bumped the submodule pointer or workflow).
Tracking issue: NuevoFoundation/workshops#662 (assigned to @ozhang22).