add support for base64 embedded images #56
blamster19 wants to merge 1 commit into bauripalash:main from
The HTML <img> tag supports images embedded directly in the HTML body using base64 encoding. This commit adds support for the MIME types image/jpeg, image/png and image/gif.
Reviewer's Guide
Adds support for base64-encoded images (JPEG, PNG, GIF) in the HTML parser by detecting, decoding, loading, and caching embedded image data before existing cache and file lookups.
Sequence diagram for handling base64-encoded images in HTML parser
sequenceDiagram
participant Parser
participant ImageLib as "Image.open()"
participant Base64 as "base64.b64decode()"
participant Cache as "cached_images"
Parser->>Parser: handle_starttag(<img>, attrs)
alt src starts with data:image/jpeg/png/gif;base64,
Parser->>Base64: decode base64 from src
Base64-->>Parser: decoded bytes
Parser->>ImageLib: open(BytesIO(decoded bytes))
ImageLib-->>Parser: image object
Parser->>Cache: cache image with src as key
else src in cached_images
Parser->>Cache: retrieve cached image
else src is file path
Parser->>ImageLib: open(src file)
ImageLib-->>Parser: image object
Parser->>Cache: cache image
end
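The lookup order in the diagram above can be sketched roughly as follows. This is only an illustration, not the PR's code: `resolve_image`, `DATA_PREFIXES`, and the `open_image` callback are hypothetical names, and `open_image` stands in for `Image.open()` so the sketch runs without Pillow.

```python
import base64
import binascii
from io import BytesIO
from typing import Any, BinaryIO, Callable, Dict, Optional

# Prefixes named in the PR description (JPEG, PNG, GIF).
DATA_PREFIXES = (
    "data:image/jpeg;base64,",
    "data:image/png;base64,",
    "data:image/gif;base64,",
)


def resolve_image(
    src: str,
    cache: Dict[str, Any],
    open_image: Callable[[BinaryIO], Any],  # hypothetical stand-in for Image.open
) -> Optional[Any]:
    """Return an image for src, following the order shown in the diagram."""
    # 1. Embedded base64 data: decode, open, and cache under the full src.
    if src.startswith(DATA_PREFIXES):
        payload = src.split(",", 1)[1]
        try:
            image = open_image(BytesIO(base64.b64decode(payload)))
        except (binascii.Error, OSError):
            return None
        cache[src] = image
        return image
    # 2. Previously seen src: reuse the cached image.
    if src in cache:
        return cache[src]
    # 3. Otherwise treat src as a file path.
    try:
        with open(src, "rb") as fh:
            image = open_image(BytesIO(fh.read()))
    except OSError:
        return None
    cache[src] = image
    return image
```

With Pillow installed, passing `Image.open` as `open_image` should behave similarly, since `PIL.UnidentifiedImageError` is a subclass of `OSError`.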
Class diagram for updated image handling in HTML parser
classDiagram
class HTMLParser {
+cached_images: dict
handle_starttag(tag, attrs)
}
class Image
class base64
class BytesIO
HTMLParser --> Image : uses
HTMLParser --> base64 : uses
HTMLParser --> BytesIO : uses
HTMLParser : +handle_starttag() now supports base64 src
HTMLParser : +caches images from base64 src
Hey there - I've reviewed your changes - here's some feedback:
- Consider unifying the duplicate base64-decode blocks by dynamically splitting on the first comma rather than hardcoding offsets for each MIME type.
- Avoid using bare excepts: catch specific exceptions (e.g. binascii.Error, PIL.UnidentifiedImageError) and consider logging a warning when decoding fails.
- To make this more extensible, parse the MIME type from the data URL (everything before ';base64') instead of hardcoding jpeg/png/gif support.
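As a rough sketch of the first and third points, the data URL can be split on the first comma and the MIME type read from the header, so no per-type offsets are needed. `split_data_url` is a hypothetical helper name, not something in tkhtmlview, and restricting the result to `image/*` types is an assumption made here.

```python
import base64
import binascii
from typing import Optional, Tuple


def split_data_url(src: str) -> Optional[Tuple[str, bytes]]:
    """Return (mime_type, decoded_bytes) for a base64 image data URL, else None."""
    if not src.startswith("data:"):
        return None
    # Everything before the first comma is the header, the rest is the payload.
    header, sep, payload = src[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        return None
    # Drop the ";base64" marker and any extra parameters such as ";charset=...".
    mime_type = header[: -len(";base64")].split(";", 1)[0].strip().lower()
    if not mime_type.startswith("image/"):
        return None
    try:
        data = base64.b64decode(payload)
    except binascii.Error:
        # Raised for malformed input, e.g. incorrect padding.
        return None
    return mime_type, data
```

The caller can then decide which MIME types to accept and hand the decoded bytes to `Image.open(BytesIO(data))`.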
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tkhtmlview/html_parser.py:547-556` </location>
<code_context>
     except:
         pass
+ if attrs[HTML.Attrs.SRC].startswith(("data:image/jpeg;base64,")):
+     try:
+         image = Image.open(
+             BytesIO(base64.b64decode(attrs[HTML.Attrs.SRC][23:].encode("utf-8")))
+         )
+         self.cached_images[attrs[HTML.Attrs.SRC]] = deepcopy(image)
+     except:
+         pass
+
+ if attrs[HTML.Attrs.SRC].startswith(("data:image/png;base64,", "data:image/gif;base64,")):
+     try:
+         image = Image.open(
</code_context>
<issue_to_address>
**suggestion:** The code for handling base64 image decoding is duplicated for each image type.
Refactor the base64 decoding logic into a shared helper to simplify maintenance and future extensions.
Suggested implementation:
```python
from collections import OrderedDict
import requests
from io import BytesIO
import base64
from PIL import Image
from copy import deepcopy


def decode_base64_image(src: str) -> "Image.Image|None":
    """
    Decodes a base64-encoded image from a data URI.
    Supports JPEG, PNG, and GIF formats.
    Returns a PIL Image or None if decoding fails.
    """
    prefixes = {
        "data:image/jpeg;base64,": 23,
        "data:image/png;base64,": 22,
        "data:image/gif;base64,": 22,
    }
    for prefix, offset in prefixes.items():
        if src.startswith(prefix):
            try:
                image_data = base64.b64decode(src[offset:].encode("utf-8"))
                return Image.open(BytesIO(image_data))
            except Exception:
                return None
    return None
```
```python
image = decode_base64_image(attrs[HTML.Attrs.SRC])
if image is not None:
    self.cached_images[attrs[HTML.Attrs.SRC]] = deepcopy(image)
```
</issue_to_address>
### Comment 2
<location> `tkhtmlview/html_parser.py:544` </location>
<code_context>
except:
    pass
</code_context>
<issue_to_address>
**issue (bug_risk):** Bare except statements are used, which can hide unexpected errors.
Catching all exceptions makes it harder to identify and address real issues. Please catch only the relevant exceptions to improve error handling.
</issue_to_address>
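One possible shape for narrower handling of the decode step, as a sketch only: `decode_payload` and the logger setup are hypothetical, and `validate=True` is an assumption that makes `b64decode` reject stray characters (including whitespace) instead of silently discarding them.

```python
import base64
import binascii
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def decode_payload(payload: str) -> Optional[bytes]:
    """Decode a base64 payload, logging a warning instead of failing silently."""
    try:
        # validate=True raises binascii.Error on non-alphabet characters.
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        # Only the expected decoding failure is caught; anything else propagates.
        logger.warning("Skipping embedded image: invalid base64 (%s)", exc)
        return None
```

The `Image.open()` call could be wrapped the same way, catching `PIL.UnidentifiedImageError` as the review suggests.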
### Comment 3
<location> `tkhtmlview/html_parser.py:550` </location>
<code_context>
+ if attrs[HTML.Attrs.SRC].startswith(("data:image/jpeg;base64,")):
+     try:
+         image = Image.open(
+             BytesIO(base64.b64decode(attrs[HTML.Attrs.SRC][23:].encode("utf-8")))
+         )
+         self.cached_images[attrs[HTML.Attrs.SRC]] = deepcopy(image)
</code_context>
<issue_to_address>
**suggestion:** The code uses hardcoded offsets for slicing the base64 string.
Hardcoded offsets are brittle and may fail if the prefix changes or new formats are introduced. Use string splitting to extract the base64 data for better reliability.
</issue_to_address>
The HTML <img> tag supports images embedded directly into the src attribute using base64 encoding; see for example this question on SO. I've added that trivial functionality to the parser since I needed it in my own project using tkhtmlview.
Summary by Sourcery
New Features: