Skip to content

get_page_images(): only return images actually referenced by the given page#4961

Closed
andreasntr wants to merge 7 commits intopymupdf:mainfrom
andreasntr:fix-page-images-duplication
Closed

get_page_images(): only return images actually referenced by the given page#4961
andreasntr wants to merge 7 commits intopymupdf:mainfrom
andreasntr:fix-page-images-duplication

Conversation

@andreasntr
Copy link
Copy Markdown

@andreasntr andreasntr commented Apr 3, 2026

As per Document.get_page_images() docs:

this is not the list of images that are actually displayed.

This fix allows getting only the images actually referenced by the given page by comparing xrefs returned by Page.get_image_info(xrefs=True).

The list of images xrefs for each page is saved at Document creation time, filtering is performed only when invoking get_page_images on the document.

### Request

I was able to write this code because i used this kind of workaround in a project of mine based on pymudf, however I'm not able to fully test it because the process of creating the test environment is not clear (pyproject.toml is empty for example). I'll be happy to test it when instructions are provided

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 3, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@andreasntr
Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

github-actions Bot added a commit that referenced this pull request Apr 3, 2026
@julian-smith-artifex-com
Copy link
Copy Markdown
Collaborator

[There are currently 5 failing tests with this PR.]

@andreasntr andreasntr force-pushed the fix-page-images-duplication branch from 5691f11 to ba2e3be Compare April 14, 2026 17:16
@julian-smith-artifex-com
Copy link
Copy Markdown
Collaborator

I might be missing something here, but could you explain why this new code is necessary?

  • It looks like it does quite a lot of up-front processing in the pymupdf.Document constructor which might cause an unnecessary slowdown in many cases.

  • We already have recommendation to use Document.get_image_info() if one wants to find only those images that are actually displayed.

Thanks.

@andreasntr
Copy link
Copy Markdown
Author

  • It looks like it does quite a lot of up-front processing in the pymupdf.Document constructor which might cause an unnecessary slowdown in many cases.

Yes, it adds latency but only at load time. Once the pdf is loaded, there is no additional latency.

  • We already have recommendation to use Document.get_image_info() if one wants to find only those images that are actually displayed.

I saw your suggestions and I tried to implement this as close as possible (bugs aside).

I think many people may be actually interested in the images included in the page rather than all the references in the whole document, because that's what the name of the method get_page_images suggests. But I get your point, there could be people who want the current behavior, and maybe we can make this additional pre-processing tax optional by adding a flag in the Document class?

@julian-smith-artifex-com
Copy link
Copy Markdown
Collaborator

julian-smith-artifex-com commented Apr 15, 2026

  • It looks like it does quite a lot of up-front processing in the pymupdf.Document constructor which might cause an unnecessary slowdown in many cases.

Yes, it adds latency but only at load time. Once the pdf is loaded, there is no additional latency.

It's exactly this load-time latency that i'm concerned about. If i open a 100 page document and only want to look at one page, the load-time delay could be very significant.

  • We already have recommendation to use Document.get_image_info() if one wants to find only those images that are actually displayed.

I saw your suggestions and I tried to implement this as close as possible (bugs aside).

I think many people may be actually interested in the images included in the page rather than all the references in the whole document, because that's what the name of the method get_page_images suggests. But I get your point, there could be people who want the current behavior, and maybe we can make this additional pre-processing tax optional by adding a flag in the Document class?

There are a lot of things that could potentially be cached in the Document constructor, but we generally do not do such processing up front - if complicates the API and runs the risk of breaking if/when the document is modified.

Separate from that, i'm still not understanding the motivation here. Perhaps it would help if you could explain to me what is wrong with simply using ~~~Document.get_image_info()~~~ Page.get_image_info()?

@andreasntr
Copy link
Copy Markdown
Author

Separate from that, i'm still not understanding the motivation here. Perhaps it would help if you could explain to me what is wrong with simply using Document.get_image_info()?

Nothing wrong, it's just easier to deal with for people seeking to iterate over images in a page

@julian-smith-artifex-com
Copy link
Copy Markdown
Collaborator

Separate from that, i'm still not understanding the motivation here. Perhaps it would help if you could explain to me what is wrong with simply using Document.get_image_info()?

Nothing wrong, it's just easier to deal with for people seeking to iterate over images in a page

I'm grateful for your submission of this PR, but i'm afraid we're going to reject it.

I don't think there's a clear enough motivation for the changes, especially in view of the extra overhead when opening a document, and the potential for the new cached information to break if the document is modified.

Thanks again,

  • Julian

@github-actions github-actions Bot locked and limited conversation to collaborators Apr 16, 2026
@andreasntr andreasntr deleted the fix-page-images-duplication branch April 16, 2026 10:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants