Closed
HuggingFace datasets may serialize heterogeneous content-part dicts (e.g. text + image_url) to JSON strings when storing in Arrow tables. Recover these by attempting json.loads on string content parts during message normalization.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
    try:
        part = json.loads(part)
    except (json.JSONDecodeError, TypeError):
        pass
json.loads silently converts strings to non-dict types
Medium Severity
json.loads on a string content part can return non-dict types (e.g., "null" → None, "123" → 123, "true" → True, "[1,2]" → list). These silently replace the original string in part, bypass the isinstance(part, dict) guard, and get appended as invalid ContentPart types. The parsed result is only useful when it's a dict, so the assignment to part needs to be guarded — e.g., only replacing part when isinstance(parsed, dict).
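The guard suggested above could look roughly like the following. This is a minimal sketch, not the code in the PR: the helper name `decode_part_if_dict` is hypothetical, and it assumes that a string which does not decode to a JSON object should be left exactly as it was.

```python
import json
from typing import Any


def decode_part_if_dict(part: Any) -> Any:
    """Hypothetical helper: decode `part` only when it is a JSON-encoded object."""
    if not isinstance(part, str):
        # Already a dict (or some other type); nothing to recover.
        return part
    try:
        parsed = json.loads(part)
    except json.JSONDecodeError:
        # Ordinary text that is not valid JSON stays untouched.
        return part
    # "null", "123", "true" and "[1,2]" all parse successfully but are not
    # dicts, so the original string is kept in those cases.
    return parsed if isinstance(parsed, dict) else part
```

With this shape, the result of parsing is held in a separate variable and `part` is only replaced once the `isinstance(parsed, dict)` check passes, so a string such as "123" reaches the rest of the normalization code unchanged.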


Description
Starting with v4.7.0, HuggingFace datasets serialize heterogeneous content-part dicts (e.g. text + image_url) to JSON strings when storing in Arrow tables. Recover these by attempting json.loads on string content parts during message normalization.
This is a quick and dirty fix for now. #1002 could be an opportunity for a more general fix.
Type of Change
Testing
uv run pytest locally.
Checklist
Note
Low Risk
Small, localized normalization change with best-effort JSON parsing; main risk is subtle behavior changes if arbitrary string content happens to be valid JSON.
Overview
Fixes message normalization for HuggingFace datasets v4.7+ where heterogeneous content part dicts may be persisted as JSON strings.
During _normalize_raw_message_content, string parts are now best-effort json.loads-decoded and only dict parts with a type field are converted via from_raw_content_part, improving robustness when reading dataset-backed messages.
Written by Cursor Bugbot for commit 5817f8c.
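To make the overview concrete, here is a rough sketch of the normalization flow it describes. It is an illustration under assumptions, not the project's implementation: `normalize_content`, `convert_part`, and the `ContentPart` dataclass are hypothetical stand-ins for `_normalize_raw_message_content`, `from_raw_content_part`, and the real content-part type, and passing unrecognized parts through unchanged is also an assumption.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ContentPart:
    """Hypothetical stand-in for the project's content-part type."""
    type: str
    payload: dict


def convert_part(raw: dict) -> ContentPart:
    """Hypothetical stand-in for from_raw_content_part."""
    payload = {k: v for k, v in raw.items() if k != "type"}
    return ContentPart(type=raw["type"], payload=payload)


def normalize_content(content: Any) -> Any:
    """Sketch: decode JSON-string parts, then convert dicts that carry a type field."""
    if not isinstance(content, list):
        return content
    normalized = []
    for part in content:
        if isinstance(part, str):
            try:
                parsed = json.loads(part)
            except json.JSONDecodeError:
                parsed = None
            # Only adopt the decoded value when it is a dict.
            if isinstance(parsed, dict):
                part = parsed
        if isinstance(part, dict) and "type" in part:
            normalized.append(convert_part(part))
        else:
            # Assumption: anything else is passed through unchanged.
            normalized.append(part)
    return normalized
```

One consequence, matching the risk noted above: a message whose text is itself a JSON object with a `type` key will be treated as a serialized content part, since the sketch has no way to tell the two cases apart.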