fix: strip port from URL domain in is_external_url comparison #1783

Open
Br1an67 wants to merge 1 commit into unclecode:main from
Br1an67:fix/issue-1503-external-url-port

Br1an67 commented Mar 1, 2026

Summary

is_external_url() compared parsed.netloc (which includes port, e.g. localhost:8000) against base_domain (which has port stripped by get_base_domain()). This caused URLs like http://localhost:8000/page to be wrongly classified as external when the base domain is localhost, making the crawler stop after the first page when exclude_external_links=True.

The fix strips the port from parsed.netloc before comparison, consistent with how get_base_domain() handles ports.
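The mismatch described above can be reproduced with the standard library alone. The snippet below is a minimal illustration, not the code from crawl4ai/utils.py: the variable names are hypothetical, and `base_domain` is hard-coded to the port-free value that get_base_domain() is described as returning.

```python
from urllib.parse import urlparse

parsed = urlparse("http://localhost:8000/page")

# Assumption: this is what get_base_domain() yields for the crawl root.
base_domain = "localhost"

# netloc keeps the port, so a direct comparison never matches.
netloc_with_port = parsed.netloc                    # "localhost:8000"
buggy_is_external = netloc_with_port != base_domain

# Dropping everything after the first ":" aligns the two values.
netloc_without_port = parsed.netloc.split(":")[0]   # "localhost"
fixed_is_external = netloc_without_port != base_domain

print(buggy_is_external, fixed_is_external)
```

Splitting on ":" is enough for hosts such as localhost:8000 but would mangle IPv6 literals or URLs carrying userinfo; `parsed.hostname` from `urllib.parse` covers those cases if they matter.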

Fixes #1503

List of files changed and why

  • crawl4ai/utils.py — Strip port from URL netloc in is_external_url() before domain comparison

How Has This Been Tested?

Verified locally with test cases:

  • http://localhost:8000/page with base localhost → correctly returns internal
  • http://localhost:8000/page with base localhost:8000 → correctly returns internal
  • Standard external/internal URL comparisons continue to work
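
The cases above could be captured as assertions roughly as follows. This is a sketch under stated assumptions: `is_external_sketch` and `strip_port` are hypothetical stand-ins written for this example, not crawl4ai's is_external_url(), and the stand-in uses exact host matching only, so it ignores subdomain handling or anything else the real function may do. The example.com and other.org URLs are illustrative placeholders for the "standard" comparisons.

```python
from urllib.parse import urlparse


def strip_port(host: str) -> str:
    """Drop a trailing :port from a host string (no IPv6 handling)."""
    return host.split(":")[0].lower()


def is_external_sketch(url: str, base: str) -> bool:
    """Hypothetical stand-in: external means the port-free hosts differ."""
    parsed = urlparse(url)
    if not parsed.netloc:
        # Assumption: relative URLs are treated as internal.
        return False
    return strip_port(parsed.netloc) != strip_port(base)


cases = [
    # (url, base, expected_is_external)
    ("http://localhost:8000/page", "localhost", False),
    ("http://localhost:8000/page", "localhost:8000", False),
    ("https://example.com/a", "example.com", False),
    ("https://other.org/a", "example.com", True),
]

for url, base, expected in cases:
    assert is_external_sketch(url, base) is expected, (url, base)
```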

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

The is_external_url function compared the full netloc (including port)
against base_domain (which has port stripped by get_base_domain).
This caused URLs like http://localhost:8000/page to be wrongly
classified as external when base_domain is 'localhost'.

Strip the port from parsed.netloc before comparison.

Fixes unclecode#1503

Successfully merging this pull request may close these issues.

[Bug]: When crawling site with port specified, is_external_url returns True wrongly
