
fix: handle encoding errors in sitemap URL seeding gracefully #1784

Open
Br1an67 wants to merge 1 commit into unclecode:main from Br1an67:fix/issue-1542-sitemap-error-handling

Conversation

Br1an67 commented Mar 1, 2026

Summary

When a sitemap contains URLs with invisible Unicode characters such as the zero-width space (U+200B), the producer in async_url_seeder.py raises a UnicodeEncodeError on Windows (from the charmap codec). The exception terminates the producer, so URL seeding ends prematurely.
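The failure mode can be reproduced in isolation. The sketch below is illustrative only: it assumes cp1252 as a stand-in for the Windows code page (the actual code page varies by system), and `try_encode` is a hypothetical helper, not part of crawl4ai. The PR does not state where in the producer the encoding happens.

```python
# cp1252 is an assumed stand-in for the legacy Windows code page.
url = "https://example.com/page\u200b"  # trailing zero-width space


def try_encode(text: str, encoding: str = "cp1252") -> bool:
    """Return True if text can be encoded, False on UnicodeEncodeError."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(try_encode(url))           # U+200B has no mapping in cp1252
print(try_encode(url, "utf-8"))  # UTF-8 can encode any code point
```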

This PR:

  1. Adds per-URL error handling in the producer loop so encoding errors skip the problematic URL instead of killing the entire producer
  2. Strips zero-width space (U+200B) and BOM (U+FEFF) characters from sitemap URLs during normalization in both _iter_sitemap_content() and _iter_sitemap()
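The first change can be sketched roughly as follows. This is not the code from the PR: `produce`, `handle`, and `cp1252_only` are hypothetical names, and the real producer in async_url_seeder.py likely has a different shape. The point is only that the `try`/`except` sits inside the loop, so one bad URL is skipped rather than ending the loop.

```python
import asyncio


async def produce(urls, queue, handle):
    """Feed URLs to the queue, skipping any that raise UnicodeEncodeError."""
    skipped = []
    for url in urls:
        try:
            handle(url)          # hypothetical per-URL step that may raise
            await queue.put(url)
        except UnicodeEncodeError:
            skipped.append(url)  # skip this URL, keep the producer alive
    return skipped


def cp1252_only(url):
    # Assumed stand-in for whatever encodes the URL on Windows.
    url.encode("cp1252")


async def main():
    queue = asyncio.Queue()
    urls = [
        "https://example.com/a",
        "https://example.com/b\u200b",  # contains a zero-width space
        "https://example.com/c",
    ]
    skipped = await produce(urls, queue, cp1252_only)
    delivered = [queue.get_nowait() for _ in range(queue.qsize())]
    return delivered, skipped


delivered, skipped = asyncio.run(main())
print(delivered, skipped)
```

With the `try`/`except` outside the loop instead, the third URL would never reach the queue.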

Fixes #1542

List of files changed and why

  • crawl4ai/async_url_seeder.py: catch UnicodeEncodeError in the producer loop; strip invisible Unicode characters in _normalize_loc()
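A minimal sketch of the stripping step, assuming the normalizer only needs to drop U+200B and U+FEFF and trim surrounding whitespace. `normalize_loc` here is a hypothetical stand-in; the actual `_normalize_loc()` in the PR may do more.

```python
# Translation table that deletes zero-width spaces and BOMs.
_INVISIBLE = str.maketrans("", "", "\u200b\ufeff")


def normalize_loc(raw: str) -> str:
    """Remove U+200B and U+FEFF anywhere in raw, then trim whitespace."""
    # translate() runs first: str.strip() does not treat these
    # characters as whitespace, so it would leave them in place.
    return raw.translate(_INVISIBLE).strip()


print(normalize_loc("\ufeffhttps://example.com/a\u200b\n"))
```

Using `str.translate` removes the characters wherever they occur in the URL, not just at the ends.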

How Has This Been Tested?

Verified that the producer continues processing after encountering URLs with zero-width spaces, and that _normalize_loc correctly strips invisible characters.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Add per-URL error handling in the producer loop so a single URL with
problematic characters (e.g. zero-width spaces) doesn't crash the
entire sitemap processing. Also strip zero-width space (U+200B) and
BOM (U+FEFF) characters from sitemap URLs during normalization.

Fixes unclecode#1542


Development

Successfully merging this pull request may close these issues.

[Bug]: Sitemap error causes url seeding to end prematurely
