fix CI for datasets v4.7.0 #1004

Closed
snimu wants to merge 2 commits into main from sebastian/fix-datasets-nonsense-2026-03-10

Conversation

@snimu (Contributor) commented Mar 10, 2026

Description

Starting with v4.7.0, the HuggingFace `datasets` library serializes heterogeneous content-part dicts (e.g. text + image_url) to JSON strings when storing them in Arrow tables. Recover these by attempting `json.loads` on string content parts during message normalization.

This is a quick and dirty fix for now. #1002 could be an opportunity for a more general fix.
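The recovery step described above can be sketched as follows. This is a hypothetical standalone helper for illustration, not the PR's actual code (the real change lives inside `_normalize_raw_message_content`):

```python
import json

def recover_content_part(part):
    # HuggingFace datasets v4.7+ may have stored a heterogeneous
    # content-part dict (e.g. text + image_url) as a JSON string in an
    # Arrow table, so try to decode string parts back into dicts.
    if isinstance(part, str):
        try:
            part = json.loads(part)
        except (json.JSONDecodeError, TypeError):
            pass  # plain text content stays a plain string
    return part

# A dict round-tripped through an Arrow table as a JSON string is recovered:
recover_content_part('{"type": "image_url", "image_url": {"url": "x.png"}}')
# → {'type': 'image_url', 'image_url': {'url': 'x.png'}}
```

Ordinary text content is not valid JSON, so the decode fails and the string passes through unchanged.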

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Low Risk
Small, localized normalization change with best-effort JSON parsing; main risk is subtle behavior changes if arbitrary string content happens to be valid JSON.

Overview
Fixes message normalization for HuggingFace datasets v4.7+ where heterogeneous content part dicts may be persisted as JSON strings.

During `_normalize_raw_message_content`, string parts are now decoded with a best-effort `json.loads`, and only dict parts with a `type` field are converted via `from_raw_content_part`, improving robustness when reading dataset-backed messages.
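The full flow can be sketched as below. The `ContentPart` class and the signature of `from_raw_content_part` are stand-ins assumed for illustration; only the decode-then-guard shape mirrors the PR:

```python
import json
from dataclasses import dataclass

@dataclass
class ContentPart:
    # Minimal stand-in for the project's content-part type.
    type: str
    raw: dict

def from_raw_content_part(raw: dict) -> ContentPart:
    # Stand-in for the project's converter; its real signature is unknown.
    return ContentPart(type=raw["type"], raw=raw)

def normalize_raw_message_content(parts):
    normalized = []
    for part in parts:
        if isinstance(part, str):
            try:
                part = json.loads(part)  # best-effort decode of serialized dicts
            except (json.JSONDecodeError, TypeError):
                pass
        if isinstance(part, dict) and "type" in part:
            normalized.append(from_raw_content_part(part))
        else:
            normalized.append(part)  # leave anything else untouched
    return normalized
```

With this shape, a JSON-serialized image or text part is converted to a typed part, while plain strings pass through as-is.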

Written by Cursor Bugbot for commit 5817f8c.

snimu and others added 2 commits March 10, 2026 14:52
HuggingFace datasets may serialize heterogeneous content-part dicts
(e.g. text + image_url) to JSON strings when storing in Arrow tables.
Recover these by attempting json.loads on string content parts during
message normalization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@snimu changed the title from "Sebastian/fix datasets nonsense 2026 03 10" to "fix CI for datasets v4.7.0" on Mar 10, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


```python
try:
    part = json.loads(part)
except (json.JSONDecodeError, TypeError):
    pass
```

json.loads silently converts strings to non-dict types

Medium Severity

`json.loads` on a string content part can return non-dict types (e.g., "null" → None, "123" → 123, "true" → True, "[1,2]" → list). These silently replace the original string in `part`, bypass the `isinstance(part, dict)` guard, and get appended as invalid `ContentPart` types. The parsed result is only useful when it is a dict, so the assignment to `part` needs to be guarded, e.g. by only replacing `part` when `isinstance(parsed, dict)`.
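The guard Bugbot suggests can be sketched as follows (a hypothetical standalone helper, not the project's code):

```python
import json

def parse_content_part(part):
    # Only adopt the parsed value when it is a dict. Strings that happen
    # to be valid JSON scalars or arrays ("123", "null", "[1,2]") keep
    # their original string form instead of leaking through the
    # isinstance(part, dict) guard as invalid content parts.
    if isinstance(part, str):
        try:
            parsed = json.loads(part)
        except (json.JSONDecodeError, TypeError):
            pass
        else:
            if isinstance(parsed, dict):
                part = parsed
    return part

parse_content_part("123")   # stays '123', not int 123
parse_content_part("null")  # stays 'null', not None
```

Serialized dicts are still recovered, but incidental JSON scalars no longer mutate the message content.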


@snimu snimu closed this Mar 10, 2026
