get_page_images(): only return images actually referenced by the given page #4961
andreasntr wants to merge 7 commits into pymupdf:main
[There are currently 5 failing tests with this PR.] |
Force-pushed from 5691f11 to ba2e3be.
|
I might be missing something here, but could you explain why this new code is necessary?
Thanks. |
Yes, it adds latency but only at load time. Once the pdf is loaded, there is no additional latency.
I saw your suggestions and I tried to implement this as closely as possible (bugs aside). I think many people may actually be interested in the images included in the page rather than all the references in the whole document, because that's what the name of the method |
It's exactly this load-time latency that i'm concerned about. If i open a 100 page document and only want to look at one page, the load-time delay could be very significant.
There are a lot of things that could potentially be cached in the Document constructor, but we generally do not do such processing up front - it complicates the API and runs the risk of breaking if/when the document is modified. Separate from that, i'm still not understanding the motivation here. Perhaps it would help if you could explain to me what is wrong with simply using ~~~ |
Nothing wrong, it's just easier to deal with for people seeking to iterate over images in a page |
I'm grateful for your submission of this PR, but i'm afraid we're going to reject it. I don't think there's a clear enough motivation for the changes, especially in view of the extra overhead when opening a document, and the potential for the new cached information to break if the document is modified. Thanks again,
|
As per `Document.get_page_images()` docs:

This fix allows getting only the images actually referenced by the given page by comparing xrefs returned by `Page.get_image_info(xrefs=True)`.

The list of image xrefs for each page is saved at `Document` creation time; filtering is performed only when invoking `get_page_images` on the document.

### Request

I was able to write this code because I used this kind of workaround in a project of mine based on pymupdf. However, I'm not able to fully test it because the process of creating the test environment is not clear (`pyproject.toml` is empty, for example). I'll be happy to test it when instructions are provided.
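
For readers following along, here is a minimal sketch of the xref comparison described above. It is not the code from this PR: the names `filter_images_on_page` and `images_shown_on_page` are hypothetical, and the filtering is done on demand rather than cached at `Document` creation time. It assumes that each item returned by `Document.get_page_images()` starts with the image xref, and that each dict returned by `Page.get_image_info(xrefs=True)` carries an `"xref"` key (0 for inline images).

```python
def filter_images_on_page(doc_images, image_info):
    """Keep only the document-level image entries that the page displays.

    doc_images: items shaped like Document.get_page_images(pno) output,
                where item[0] is the image xref.
    image_info: dicts shaped like Page.get_image_info(xrefs=True) output,
                each with an "xref" key.
    """
    # xref 0 marks inline images, which have no document-level entry.
    used_xrefs = {info["xref"] for info in image_info if info.get("xref", 0) > 0}
    return [item for item in doc_images if item[0] in used_xrefs]


def images_shown_on_page(doc, pno):
    """Hypothetical helper; `doc` is assumed to be an open pymupdf Document."""
    page = doc[pno]
    return filter_images_on_page(
        doc.get_page_images(pno),
        page.get_image_info(xrefs=True),
    )
```

With a real document this would be called as `images_shown_on_page(doc, 0)` after `doc = pymupdf.open(...)`. Because nothing is stored on the document, there is no extra work at load time and nothing to go stale if the document is modified, at the cost of recomputing the page's image info on every call.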