get_page_images(): only return images actually referenced by the given page by andreasntr · Pull Request #4961 · pymupdf/PyMuPDF

andreasntr · 2026-04-03T17:02:37Z

As per Document.get_page_images() docs:

> this is not the list of images that are actually displayed.

This fix allows getting only the images actually referenced by the given page by comparing xrefs returned by Page.get_image_info(xrefs=True).

The list of image xrefs for each page is saved at Document creation time; filtering is performed only when get_page_images is invoked on the document.
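The comparison described above can be sketched as a standalone helper. This is only an illustration of the workaround, not the PR's actual code: the function name `displayed_page_images` is hypothetical, and the sketch assumes that the first item of each `Document.get_page_images()` entry is the image xref and that `Page.get_image_info(xrefs=True)` reports an `"xref"` key (0 for inline images).

```python
def displayed_page_images(doc, pno):
    """Return the get_page_images() entries whose xref is shown on page `pno`.

    `doc` is expected to behave like a pymupdf.Document opened on a PDF
    (assumption); the helper name itself is hypothetical.
    """
    page = doc[pno]
    # xrefs of the images actually displayed on this page.
    shown = {info["xref"] for info in page.get_image_info(xrefs=True)}
    # Inline images have no xref of their own and are reported as 0.
    shown.discard(0)
    # Item 0 of each get_page_images() entry is the image xref.
    return [item for item in doc.get_page_images(pno) if item[0] in shown]
```

With PyMuPDF installed, `displayed_page_images(pymupdf.open("input.pdf"), 0)` would filter the first page's list, where `input.pdf` is a placeholder file name. Unlike the PR, this sketch does the work at call time rather than in the Document constructor.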

~~### Request~~

I was able to write this code because I used this kind of workaround in a project of mine based on pymupdf. However, I'm not able to fully test it because the process for creating the test environment is not clear (pyproject.toml is empty, for example). I'll be happy to test it when instructions are provided.

github-actions · 2026-04-03T17:02:47Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

andreasntr · 2026-04-03T17:03:14Z

I have read the CLA Document and I hereby sign the CLA

julian-smith-artifex-com · 2026-04-14T10:39:59Z

[There are currently 5 failing tests with this PR.]

julian-smith-artifex-com · 2026-04-15T11:33:51Z

I might be missing something here, but could you explain why this new code is necessary?

It looks like it does quite a lot of up-front processing in the pymupdf.Document constructor which might cause an unnecessary slowdown in many cases.
We already have a recommendation to use Document.get_image_info() if one wants to find only those images that are actually displayed.

Thanks.

andreasntr · 2026-04-15T11:49:15Z

> It looks like it does quite a lot of up-front processing in the pymupdf.Document constructor which might cause an unnecessary slowdown in many cases.

Yes, it adds latency but only at load time. Once the pdf is loaded, there is no additional latency.

> We already have a recommendation to use Document.get_image_info() if one wants to find only those images that are actually displayed.

I saw your suggestions and I tried to implement this as close as possible (bugs aside).

I think many people may be actually interested in the images included in the page rather than all the references in the whole document, because that's what the name of the method get_page_images suggests. But I get your point, there could be people who want the current behavior, and maybe we can make this additional pre-processing tax optional by adding a flag in the Document class?

julian-smith-artifex-com · 2026-04-15T13:06:19Z

> > It looks like it does quite a lot of up-front processing in the pymupdf.Document constructor which might cause an unnecessary slowdown in many cases.
>
> Yes, it adds latency but only at load time. Once the pdf is loaded, there is no additional latency.

It's exactly this load-time latency that I'm concerned about. If I open a 100 page document and only want to look at one page, the load-time delay could be very significant.

> > We already have a recommendation to use Document.get_image_info() if one wants to find only those images that are actually displayed.
>
> I saw your suggestions and I tried to implement this as close as possible (bugs aside).
>
> I think many people may be actually interested in the images included in the page rather than all the references in the whole document, because that's what the name of the method get_page_images suggests. But I get your point, there could be people who want the current behavior, and maybe we can make this additional pre-processing tax optional by adding a flag in the Document class?

There are a lot of things that could potentially be cached in the Document constructor, but we generally do not do such processing up front: it complicates the API and runs the risk of breaking if/when the document is modified.

Separate from that, I'm still not understanding the motivation here. Perhaps it would help if you could explain to me what is wrong with simply using ~~Document.get_image_info()~~ Page.get_image_info()?
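For context, iterating over the images displayed on a single page with `Page.get_image_info()` might look like the sketch below. The helper name `iter_displayed_images` is hypothetical, and the sketch assumes each returned dictionary carries `"xref"` (when `xrefs=True` is passed) and `"bbox"` keys. Only the requested page is processed, so nothing is computed when the document is opened.

```python
def iter_displayed_images(page):
    """Yield (xref, bbox) for each image displayed on one page.

    `page` is expected to behave like a pymupdf.Page (assumption);
    the work happens on demand, for this page only.
    """
    for info in page.get_image_info(xrefs=True):
        # "xref" is 0 for inline images; "bbox" is the image's position on the page.
        yield info["xref"], info["bbox"]
```

A caller interested in one page of a large document would pay only for that page, for example `list(iter_displayed_images(doc[42]))` with `doc` an already opened document.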

andreasntr · 2026-04-15T15:08:48Z

> Separate from that, I'm still not understanding the motivation here. Perhaps it would help if you could explain to me what is wrong with simply using Document.get_image_info()?

Nothing wrong, it's just easier to deal with for people seeking to iterate over images in a page.

julian-smith-artifex-com · 2026-04-16T09:46:15Z

> > Separate from that, I'm still not understanding the motivation here. Perhaps it would help if you could explain to me what is wrong with simply using Document.get_image_info()?
>
> Nothing wrong, it's just easier to deal with for people seeking to iterate over images in a page

I'm grateful for your submission of this PR, but I'm afraid we're going to reject it.

I don't think there's a clear enough motivation for the changes, especially in view of the extra overhead when opening a document, and the potential for the new cached information to break if the document is modified.

Thanks again,

Julian
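The concern about cached information breaking when the document is modified can be illustrated with a toy model. This is not PyMuPDF code: `EagerIndex` and its plain lists of xrefs are hypothetical stand-ins, used only to show how a snapshot taken at construction time can drift from the live state.

```python
class EagerIndex:
    """Toy stand-in for a document that snapshots per-page xrefs up front."""

    def __init__(self, pages):
        self._pages = pages
        # Snapshot taken once, at "open" time.
        self._cache = [list(xrefs) for xrefs in pages]

    def cached(self, pno):
        """Return the xrefs recorded at construction time."""
        return self._cache[pno]

    def live(self, pno):
        """Return the xrefs as they are right now."""
        return list(self._pages[pno])


pages = [[5], [7]]       # page 0 shows xref 5, page 1 shows xref 7
index = EagerIndex(pages)
pages[0].append(9)       # the "document" is modified after opening

print(index.cached(0))   # [5]     (stale)
print(index.live(0))     # [5, 9]  (current)
```

Any cache of this kind needs an invalidation step on every modification, which is part of the API complication mentioned in the review.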

github-actions Bot added a commit that referenced this pull request Apr 3, 2026

@andreasntr has signed the CLA in #4961

a771f23

andreasntr added 3 commits April 14, 2026 19:16

add duplicate images handling

fd48309

resolve circular dependency on first images duplication check

b218bbf

fix length check over page_count

ba2e3be

andreasntr force-pushed the fix-page-images-duplication branch from 5691f11 to ba2e3be April 14, 2026 17:16

andreasntr added 4 commits April 14, 2026 22:19

perform images duplication check only for pdfs

cd3bf19

revert uv push

fdd21bf

revert tests gitignore addition

ceb4a8a

revert uv push

3b29476

julian-smith-artifex-com closed this Apr 16, 2026

github-actions Bot locked and limited conversation to collaborators Apr 16, 2026

andreasntr deleted the fix-page-images-duplication branch April 16, 2026 10:10

andreasntr wants to merge 7 commits into pymupdf:main from andreasntr:fix-page-images-duplication


2 participants
