add support for base64 embedded images #56

Open
blamster19 wants to merge 1 commit into bauripalash:main from blamster19:patch-base64

blamster19 commented Sep 30, 2025

The HTML <img> tag supports images embedded directly in its src attribute using base64 encoding; see, for example, this question on SO. I've added that trivial functionality to the parser since I needed it in my own project using tkhtmlview.

Summary by Sourcery

New Features:

  • Support decoding and rendering of base64-encoded JPEG, PNG, and GIF images in <img> tags

The HTML <img> tag supports images embedded directly into the HTML body using base64 encoding. This commit adds support for the MIME types image/jpeg, image/png and image/gif.
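
For readers unfamiliar with the format, here is a minimal sketch of how such an embedded src value could be built. It is not part of this PR: `make_data_uri` is a hypothetical helper name, and the 8-byte PNG signature is placeholder data rather than a complete image.

```python
import base64


def make_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw image bytes in a base64 data URI usable as an <img> src."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


# Placeholder data: only the PNG file signature, not a renderable image.
uri = make_data_uri(b"\x89PNG\r\n\x1a\n")
html = f'<img src="{uri}">'
```

In real use the bytes would come from an actual image file, and the resulting HTML string would be handed to the widget as usual.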
sourcery-ai bot commented Sep 30, 2025

Reviewer's Guide

Adds support for base64-encoded images (JPEG, PNG, GIF) in the HTML parser by detecting, decoding, loading, and caching embedded image data before existing cache and file lookups.
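
The lookup order described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in tkhtmlview/html_parser.py: `resolve_image` and `open_image` are hypothetical names, with `open_image` standing in for a loader such as PIL's Image.open so that the sketch runs with the standard library alone.

```python
import base64
from io import BytesIO

# Prefixes named in this PR; anything else falls through to cache/file lookup.
DATA_PREFIXES = (
    "data:image/jpeg;base64,",
    "data:image/png;base64,",
    "data:image/gif;base64,",
)


def resolve_image(src, cache, open_image):
    """Return a loaded image for src, checking embedded data first."""
    if src.startswith(DATA_PREFIXES):
        # 1. Embedded image: the payload is everything after the first comma.
        payload = src.split(",", 1)[1]
        image = open_image(BytesIO(base64.b64decode(payload)))
        cache[src] = image
    elif src in cache:
        # 2. Already loaded earlier under the same src key.
        image = cache[src]
    else:
        # 3. Otherwise treat src as a path on disk.
        with open(src, "rb") as fh:
            image = open_image(BytesIO(fh.read()))
        cache[src] = image
    return image
```

Error handling is left out here on purpose so the ordering stays visible.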

Sequence diagram for handling base64-encoded images in HTML parser

sequenceDiagram
    participant Parser
    participant ImageLib as "Image.open()"
    participant Base64 as "base64.b64decode()"
    participant Cache as "cached_images"
    Parser->>Parser: handle_starttag(<img>, attrs)
    alt src starts with data:image/jpeg/png/gif;base64,
        Parser->>Base64: decode base64 from src
        Base64-->>Parser: decoded bytes
        Parser->>ImageLib: open(BytesIO(decoded bytes))
        ImageLib-->>Parser: image object
        Parser->>Cache: cache image with src as key
    else src in cached_images
        Parser->>Cache: retrieve cached image
    else src is file path
        Parser->>ImageLib: open(src file)
        ImageLib-->>Parser: image object
        Parser->>Cache: cache image
    end

Class diagram for updated image handling in HTML parser

classDiagram
    class HTMLParser {
        +cached_images: dict
        handle_starttag(tag, attrs)
    }
    class Image
    class base64
    class BytesIO
    HTMLParser --> Image : uses
    HTMLParser --> base64 : uses
    HTMLParser --> BytesIO : uses
    HTMLParser : +handle_starttag() now supports base64 src
    HTMLParser : +caches images from base64 src

File-Level Changes

Change: Support base64-encoded images in src attributes
Details:
  • Detect data:image/jpeg;base64 prefix, decode payload, open with PIL, and cache image
  • Detect data:image/png;base64 and data:image/gif;base64 prefixes, decode payload, open with PIL, and cache image
  • Insert new handling logic before existing cache/file checks
  • Wrap decoding and image loading in try/except to avoid failures
Files: tkhtmlview/html_parser.py

sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Consider unifying the duplicate base64‐decode blocks by dynamically splitting on the first comma rather than hardcoding offsets for each mime type.
  • Avoid using bare excepts—catch specific exceptions (e.g. binascii.Error, PIL.UnidentifiedImageError) and consider logging a warning when decoding fails.
  • To make this more extensible, parse the MIME type from the data URL (everything before ‘;base64’) instead of hardcoding jpeg/png/gif support.
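
One way the three points above might fit together is sketched below. It is only an illustration: `parse_image_data_url` is a hypothetical name, it returns raw bytes rather than a PIL image so it runs without Pillow, and it assumes the simple `data:<mime>;base64,<payload>` shape.

```python
import base64
from typing import Optional, Tuple


def parse_image_data_url(src: str) -> Optional[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for a base64 image data URL, else None."""
    # Split on the first comma instead of slicing at a fixed offset.
    header, sep, payload = src.partition(",")
    if not sep or not header.startswith("data:"):
        return None
    # Header looks like "data:image/png;base64"; parameters are ";"-separated.
    mime_type, *params = header[len("data:"):].split(";")
    if "base64" not in params or not mime_type.startswith("image/"):
        return None
    try:
        raw = base64.b64decode(payload, validate=True)
    except ValueError:
        # binascii.Error is a subclass of ValueError, so malformed base64
        # and non-ASCII input both land here.
        return None
    return mime_type, raw
```

The caller could then pass the returned bytes to Image.open(BytesIO(...)) and decide per MIME type whether to accept the image.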
Comment thread tkhtmlview/html_parser.py
Comment on lines +547 to +556
    if attrs[HTML.Attrs.SRC].startswith(("data:image/jpeg;base64,")):
        try:
            image = Image.open(
                BytesIO(base64.b64decode(attrs[HTML.Attrs.SRC][23:].encode("utf-8")))
            )
            self.cached_images[attrs[HTML.Attrs.SRC]] = deepcopy(image)
        except:
            pass

    if attrs[HTML.Attrs.SRC].startswith(("data:image/png;base64,", "data:image/gif;base64,")):

suggestion: The code for handling base64 image decoding is duplicated for each image type.

Refactor the base64 decoding logic into a shared helper to simplify maintenance and future extensions.

Suggested implementation:

from collections import OrderedDict
import requests
from io import BytesIO
import base64
from PIL import Image
from copy import deepcopy

def decode_base64_image(src: str) -> "Image.Image|None":
    """
    Decodes a base64-encoded image from a data URI.
    Supports JPEG, PNG, and GIF formats.
    Returns a PIL Image or None if decoding fails.
    """
    prefixes = {
        "data:image/jpeg;base64,": 23,
        "data:image/png;base64,": 22,
        "data:image/gif;base64,": 22,
    }
    for prefix, offset in prefixes.items():
        if src.startswith(prefix):
            try:
                image_data = base64.b64decode(src[offset:].encode("utf-8"))
                return Image.open(BytesIO(image_data))
            except Exception:
                return None
    return None

            image = decode_base64_image(attrs[HTML.Attrs.SRC])
            if image is not None:
                self.cached_images[attrs[HTML.Attrs.SRC]] = deepcopy(image)

Comment thread tkhtmlview/html_parser.py
@@ -543,6 +544,24 @@ def handle_starttag(self, tag, attrs):
except:

issue (bug_risk): Bare except statements are used, which can hide unexpected errors.

Catching all exceptions makes it harder to identify and address real issues. Please catch only the relevant exceptions to improve error handling.
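
A sketch of what narrower handling could look like is shown below. The names `load_embedded_image` and `open_image` are hypothetical; `open_image` stands in for Image.open so the example runs without Pillow. Pillow's UnidentifiedImageError derives from OSError, which is why OSError is the second case.

```python
import base64
import binascii
import logging
from io import BytesIO

logger = logging.getLogger(__name__)


def load_embedded_image(payload, open_image):
    """Decode a base64 payload and open it; return None on expected failures."""
    try:
        raw = base64.b64decode(payload, validate=True)
        return open_image(BytesIO(raw))
    except binascii.Error as exc:
        # Malformed base64 text (bad characters or padding).
        logger.warning("could not decode embedded image: %s", exc)
    except OSError as exc:
        # Covers loader failures such as Pillow's UnidentifiedImageError.
        logger.warning("could not open embedded image: %s", exc)
    return None
```

Anything outside these two cases, such as a programming error inside the loader, still propagates instead of being silently swallowed.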

Comment thread tkhtmlview/html_parser.py
    if attrs[HTML.Attrs.SRC].startswith(("data:image/jpeg;base64,")):
        try:
            image = Image.open(
                BytesIO(base64.b64decode(attrs[HTML.Attrs.SRC][23:].encode("utf-8")))

suggestion: The code uses hardcoded offsets for slicing the base64 string.

Hardcoded offsets are brittle and may fail if the prefix changes or new formats are introduced. Use string splitting to extract the base64 data for better reliability.
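
To make the brittleness concrete, here is a small comparison, offered as an illustration only. The helper names are hypothetical and the payload "aGk=" is placeholder base64 text, not an image.

```python
def payload_by_offset(src, offset=22):
    """Slice at a fixed position; correct only for 22-character prefixes."""
    return src[offset:]


def payload_by_split(src):
    """Take everything after the first comma, whatever the prefix length."""
    return src.partition(",")[2]


png_src = "data:image/png;base64,aGk="    # prefix is 22 characters
jpeg_src = "data:image/jpeg;base64,aGk="  # prefix is 23 characters

offset_png = payload_by_offset(png_src)    # payload is intact
offset_jpeg = payload_by_offset(jpeg_src)  # keeps a stray leading comma
split_png = payload_by_split(png_src)
split_jpeg = payload_by_split(jpeg_src)
```

Splitting on the first comma is safe here because the standard base64 alphabet does not contain a comma.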
