GNOME Bugzilla – Bug 720742
Use libarchive to unpack comics
Last modified: 2017-03-27 16:06:47 UTC
Instead of shelling out to unzip. RAR support isn't quite there yet: http://code.google.com/p/libarchive/issues/detail?id=40 but better support for it would mean using a single code path for both cbz and cbr file types.
Yes, I know. I am working on it :-)
*** Bug 729797 has been marked as a duplicate of this bug. ***
RAR files are supported in libarchive 3.0 onwards: https://github.com/libarchive/libarchive/wiki/ReleaseNotes#libarchive-302
(In reply to comment #1)
> Yes, I know. I am working on it :-)

Has any progress been made here? Thumbnailing 4GB of comics over Wi-Fi will take hours right now, as the backend seems to unpack the whole archive before finding that it only needs one of the pages from it. 4GB isn't that much either, as it's about the amount of comics you get in a Humble Bundle.

Note that it might be something that's impossible to fix without changing the Evince plugins, as I don't know the semantics of the ->load vfunc.
Created attachment 295770 [details] [review]
comics: Port to using libarchive for unarchiving

Includes a couple of FIXMEs, including one in libarchive.

To fix:
- https://code.google.com/p/libarchive/issues/detail?id=262
- Use render_pixbuf_size_prepared_cb()
- Make evince stop downloading the whole file when opening something remote
Note that, for the review, it might be easier to read through the 500 lines of code instead of the horrible diff. The performance isn't quite as interesting as I thought it might be; surely there's some caching to be done somewhere...
Bastien, thanks for this. The EvView widget does its own caching, but by default it only caches one or two pages (depending on some memory limitations).
For me it is ok to push this patch in order to have the libarchive support ready. I'm still working on my patch, which takes another approach to this problem: I try to stream from memory using a custom read function instead of using archive_read_open_filename. After finishing my patch I will submit it if I think it improves Bastien's code.
Review of attachment 295770 [details] [review]:

Thanks for the patch. I understand that keeping both command line tools and libarchive support would be a pain, but as long as there are regressions (like not supporting rar) I think we have no choice.

::: backend/comics/comics-document.c
@@ +53,3 @@
 	EvDocument parent_instance;
+	struct archive *a;

Can we use a more descriptive name than 'a', like 'archive'? Or does 'archive' clash with anything in libarchive?

@@ +75,3 @@
+static void
+comics_document_reset_archive (ComicsDocument *comics_document)

reset_archive_reader?

@@ +80,3 @@
+	archive_read_free (comics_document->a);
+	comics_document->a = archive_read_new ();
+	g_assert ((comics_document->archive_format_func) (comics_document->a) == ARCHIVE_OK);

Is archive_format_func called the first time? The archive reader is created in the init, but archive_format_func is only called when it's reset.

@@ +92,3 @@
+	r = archive_read_open_filename (comics_document->a, comics_document->archive_path, BLOCK_SIZE);
+	if (r != ARCHIVE_OK)
+		goto out;

I'm not a fan of gotos. Do we really need to reset the reader? Why don't we free it? If we need to create/free the reader every time we need to read, why don't we do that instead of keeping an always-open archive reader?

@@ +105,3 @@
+	if (r != ARCHIVE_EOF && r == ARCHIVE_FATAL)
+		g_warning ("Fatal error handling archive: %s", archive_error_string (comics_document->a));
+	break;

Maybe we could ignore invalid pages instead of making it fatal, or isn't it possible to continue reading in this case?

@@ +110,2 @@
+	name = archive_entry_pathname (entry);
+	g_ptr_array_add (array, g_strdup (name));

We could use archive_entry_pathname() directly; we don't need the local variable.

@@ +194,3 @@
+	file = g_file_new_for_uri (uri);
+	comics_document->archive_path = g_file_get_path (file);
+	g_object_unref (file);

Do we really need to create a GFile just to get its path? Why not keep the g_filename_from_uri()?
@@ +199,1 @@
 	return FALSE;

This is a problem of not using g_filename_from_uri(): we are returning FALSE here without filling the GError.

@@ +199,3 @@
 	return FALSE;
+	comics_document->archive_uri = g_strdup (uri);

Why do we need to keep both the URI and the path? We know the URI is a local path, and we already save the path. If we need this anyway, it would only be useful if the load succeeds, so this could be initialized later.

@@ +214,3 @@
 	/* Get list of files in archive */
+	cb_files = comics_document_list (comics_document);
+	if (!cb_files) {

We need a way to differentiate whether cb_files is NULL because of an error reading the archive or because it has no (valid) pages. We could either pass the GError to comics_document_list() to be filled when returning NULL, or return the GArray from comics_document_list(): if it's NULL it's a failure, otherwise we check here whether the array is empty.

@@ +224,1 @@
 	if (!cb_files) {

We have already returned when cb_files is NULL, assuming it was a failure reading the archive.

@@ +234,3 @@
 	supported_extensions = get_supported_image_extensions ();
 	for (i = 0; cb_files[i] != NULL; i++) {
+		cb_file = cb_files[i];

Why don't we check whether the page is supported or not in comics_document_list(), instead of iterating twice?

@@ +272,3 @@
 	ComicsDocument *comics_document = COMICS_DOCUMENT (document);
+	return ev_xfer_uri_simple (comics_document->archive_uri, uri, error);

I see now: ev_xfer_uri_simple() expects a URI, not a local path, but it seems g_file_new_for_uri() does the right thing when receiving a local path anyway.

@@ +302,3 @@
+	if (r != ARCHIVE_OK) {
+		g_warning ("Fatal error opening archive: %s", archive_error_string (comics_document->a));
+		goto out;

Is this really possible? Can't we assume that if it didn't fail in document_load, it shouldn't fail here? Maybe we could use an assert here instead.
@@ +320,3 @@
+	if (r != ARCHIVE_OK) {
+		if (r != ARCHIVE_EOF && r == ARCHIVE_FATAL)
+			g_warning ("Fatal error handling archive: %s", archive_error_string (comics_document->a));

Ditto.

@@ +408,3 @@
+
+	buf = g_malloc (size);
+	read = archive_read_data (comics_document->a, buf, size);

So, does archive_read_data() ensure that the buffer is filled with the whole requested size?

@@ +416,3 @@
+		g_warning ("Read an empty file from the archive");
+	} else {
+		gdk_pixbuf_loader_write (loader, (guchar *) buf, size, NULL);

The buffer is leaked in this case.

@@ +435,1 @@
+	/* FIXME: Use render_pixbuf_size_prepared_cb() */

This should be easy to do, no?
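On the archive_read_data() question in the review above: archive_read_data() returns the number of bytes actually read, which can be less than requested, so a read-fully loop is the defensive pattern. A self-contained sketch of that loop, with a stub (chunky_read, hypothetical) standing in for archive_read_data() so the pattern can be exercised without an archive:

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* A reader that, like archive_read_data(), may return short reads:
 * 0 on end of entry, negative on error. */
typedef ssize_t (*read_fn_t) (void *ctx, void *buf, size_t count);

/* Keep calling read_fn until the buffer is full, EOF, or an error.
 * Returns total bytes read, or the negative error code. */
static ssize_t
read_fully (read_fn_t read_fn, void *ctx, void *buf, size_t size)
{
	size_t total = 0;

	while (total < size) {
		ssize_t n = read_fn (ctx, (char *) buf + total, size - total);
		if (n < 0)
			return n;	/* propagate the error */
		if (n == 0)
			break;		/* end of entry */
		total += (size_t) n;
	}
	return (ssize_t) total;
}

/* Stub standing in for archive_read_data(): serves a 10-byte entry
 * in short 3-byte chunks, like a decompressor crossing block
 * boundaries might. ctx points at a size_t counting bytes served. */
static ssize_t
chunky_read (void *ctx, void *buf, size_t count)
{
	size_t *served = ctx;
	size_t left = 10 - *served;
	size_t n = left < 3 ? left : 3;

	if (n > count)
		n = count;
	memset (buf, 'x', n);
	*served += n;
	return (ssize_t) n;
}
```

In the patch, wrapping the archive_read_data() call in a loop like this would also make the "is the buffer filled?" question moot.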
(In reply to Carlos Garcia Campos from comment #9)
> Review of attachment 295770 [details] [review] [review]:
>
> Thanks for the patch. I understand that keeping both command line tools and
> libarchive support would be a pain, but as long as there are regressions
> (like not supporting rar) I think we have no choice.

If somebody wants to do that, be my guest, but I'm not interested in doing that. I'd rather have libarchive fixed :)

Apart from all the completely correct comments about my patch, I'll note:

<snip>
> @@ +194,3 @@
> +	file = g_file_new_for_uri (uri);
> +	comics_document->archive_path = g_file_get_path (file);
> +	g_object_unref (file);
>
> Do we really need to create a GFile just to get its path? Why not keeping
> the g_filename_from_uri()?

g_file_get_path() on a local file will give the same result as g_filename_from_uri(). Except that we want to handle smb://, nfs://, mtp://, afc://, etc. g_file_get_path() will give us the fuse path so that libarchive can poke directly at it.

<snip>
> @@ +435,1 @@
> +	/* FIXME: Use render_pixbuf_size_prepared_cb() */
>
> This should be easy to do no?

But I could not be bothered just yet ;)

So in addition to fixing all the bugs you mentioned we should also find why it's not any faster than launching unzip by hand. And the items noted in comment 5.
(In reply to Bastien Nocera from comment #10)
> (In reply to Carlos Garcia Campos from comment #9)
> > Review of attachment 295770 [details] [review] [review] [review]:
> >
> > Thanks for the patch. I understand that keeping both command line tools and
> > libarchive support would be a pain, but as long as there are regressions
> > (like not supporting rar) I think we have no choice.
>
> If somebody wants to do that, be my guest, but I'm not interested in doing
> that. I'd rather have libarchive fixed :)

Yes, having libarchive fixed would be the best solution, of course.

> Apart from all the completely correct comments about my patch, I'll note:
>
> <snip>
> > @@ +194,3 @@
> > +	file = g_file_new_for_uri (uri);
> > +	comics_document->archive_path = g_file_get_path (file);
> > +	g_object_unref (file);
> >
> > Do we really need to create a GFile just to get its path? Why not keeping
> > the g_filename_from_uri()?
>
> g_file_get_path() on a local file will give the same result as
> g_filename_from_uri(). Except that we want to handle smb://, nfs://, mtp://,
> afc://, etc. g_file_get_path() will give us the fuse path so that libarchive
> can poke directly at it.

Ok, then we just need to fill the GError in case g_file_get_path() returns NULL.

> <snip>
> > @@ +435,1 @@
> > +	/* FIXME: Use render_pixbuf_size_prepared_cb() */
> >
> > This should be easy to do no?
>
> But I could not be bothered just yet ;)
>
> So in addition to fixing all the bugs you mentioned we should also find why
> it's not any faster than launching unzip by hand. And the items noted in
> comment 5.

Could it be that we need to iterate the archive to access every page individually? I don't know what tools do when you ask them to extract only one file, but maybe it's more efficient.
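The review's single-pass suggestion (filter unsupported entries while listing, instead of iterating the file list twice) could be sketched like this; is_supported_page() is a hypothetical helper, not code from the patch, and in the real backend the extension list would come from get_supported_image_extensions():

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: decide at listing time whether an archive
 * entry looks like a supported page, so comics_document_list()'s
 * caller doesn't have to walk the list a second time. Matches the
 * file extension case-insensitively against a NULL-terminated list. */
static int
is_supported_page (const char *name, const char * const *extensions)
{
	const char *dot = strrchr (name, '.');
	size_t i;

	if (dot == NULL || dot[1] == '\0')
		return 0;
	for (i = 0; extensions[i] != NULL; i++) {
		if (strcasecmp (dot + 1, extensions[i]) == 0)
			return 1;
	}
	return 0;
}
```

Inside the listing loop, only entries passing this check would be added to the GPtrArray, which also resolves the "empty array vs. read error" ambiguity discussed above: a non-NULL but empty array means no valid pages.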
(In reply to Bastien Nocera from comment #5)
> Created attachment 295770 [details] [review] [review]
> comics: Port to using libarchive for unarchiving
>
> Includes a couple of FIXMEs, including one in libarchive.
>
> To fix:
> - https://code.google.com/p/libarchive/issues/detail?id=262

Now at: https://github.com/libarchive/libarchive/issues/373
Created attachment 316109 [details] [review]
comics: Port to using libarchive for unarchiving

Includes a couple of FIXMEs, including one in libarchive.

To fix:
- https://github.com/libarchive/libarchive/issues/373
- Use render_pixbuf_size_prepared_cb()
- Make evince stop downloading the whole file when opening something remote

v2:
- Rebase against latest evince
- Update libarchive bug URL
Created attachment 346979 [details] [review]
comics: Port to using libarchive for unarchiving

Includes a couple of FIXMEs.

To fix:
- Use render_pixbuf_size_prepared_cb()
- Make evince stop downloading the whole file when opening something remote

v3:
- Rebase against latest evince, making sure to bring back:
  - Use Unicode in translatable strings
  - Sort pages in natural order
  - Fix mime-type comparisons
- https://github.com/libarchive/libarchive/issues/373 looks like it's fixed!

v2:
- Rebase against latest evince
- Update libarchive bug URL
(In reply to Bastien Nocera from comment #14)
> Created attachment 346979 [details] [review] [review]
> comics: Port to using libarchive for unarchiving
>
> Includes a couple of FIXMEs.
>
> To fix:
> - Use render_pixbuf_size_prepared_cb()
> - Make evince stop downloading the whole file when opening
>   something remote
>
> v3:
> - Rebase against latest evince, making sure to bring back:
>   - Use Unicode in translatable strings
>   - Sort pages in natural order
>   - Fix mime-type comparisons
> - https://github.com/libarchive/libarchive/issues/373 looks
>   like it's fixed!
>
> v2:
> - Rebase against latest evince
> - Update libarchive bug URL

Is this really the latest version of the patch? The libarchive bug URL is still the old one. Also, I guess we will remove that FIXME now that the bug is fixed, no? Or are we waiting to be able to bump the libarchive requirement? I think the bug is fixed in 3.3.0, but I'm afraid most distros don't ship it yet.
I'm not sure whether the bug is fixed; I might have been too quick on this. I'll investigate.

The other option is using https://github.com/zeniko/unarr, which is an LGPLv3+ copy/paste library that does support the RARv5 format.

If you have any ideas about:
> - Make evince stop downloading the whole file when opening
>   something remote
please let me know; that's another blocker for this to be merged.
Created attachment 347522 [details] [review]
comics: Port to using libarchive for unarchiving

Includes a couple of FIXMEs.

To fix:
- Use render_pixbuf_size_prepared_cb()
- Make evince stop downloading the whole file when opening something remote

v4:
- Fix crash caused by a bug in comics_document_list() (the array was not NULL terminated)
- Remove duplicate "!cb_files" check
- Use "size-prepared" instead of "area-prepared" to get the doc size
- Fix link to libarchive bug in code, not working yet :/

v3:
- Rebase against latest evince, making sure to bring back:
  - Use Unicode in translatable strings
  - Sort pages in natural order
  - Fix mime-type comparisons
- https://github.com/libarchive/libarchive/issues/373 looks like it's fixed!

v2:
- Rebase against latest evince
- Update libarchive bug URL
Created attachment 348016 [details] [review]
comics: Port to using libarchive for unarchiving

Includes a couple of FIXMEs.

To fix:
- Use render_pixbuf_size_prepared_cb()
- Make evince stop downloading the whole file when opening something remote

v5:
- Remove unused members of ComicsDocument struct
- Split archive handling into an EvArchive object
- Fix copy/paste error in configure.ac

v4:
- Fix crash caused by a bug in comics_document_list() (the array was not NULL terminated)
- Remove duplicate "!cb_files" check
- Use "size-prepared" instead of "area-prepared" to get the doc size
- Fix link to libarchive bug in code, not working yet :/

v3:
- Rebase against latest evince, making sure to bring back:
  - Use Unicode in translatable strings
  - Sort pages in natural order
  - Fix mime-type comparisons
- https://github.com/libarchive/libarchive/issues/373 looks like it's fixed!

v2:
- Rebase against latest evince
- Update libarchive bug URL
Created attachment 348023 [details] [review]
comics: Port to using libarchive for unarchiving

Includes a couple of FIXMEs.

To fix:
- Use render_pixbuf_size_prepared_cb()
- Make evince stop downloading the whole file when opening something remote

v5:
- Remove unused members of ComicsDocument struct
- Split archive handling into an EvArchive object
- Fix copy/paste error in configure.ac

v4:
- Fix crash caused by a bug in comics_document_list() (the array was not NULL terminated)
- Remove duplicate "!cb_files" check
- Use "size-prepared" instead of "area-prepared" to get the doc size
- Fix link to libarchive bug in code, not working yet :/

v3:
- Rebase against latest evince, making sure to bring back:
  - Use Unicode in translatable strings
  - Sort pages in natural order
  - Fix mime-type comparisons
- https://github.com/libarchive/libarchive/issues/373 looks like it's fixed!

v2:
- Rebase against latest evince
- Update libarchive bug URL
Created attachment 348024 [details] [review]
comics: Add unarr copy/paste

To allow us to decompress CBR comics compressed with recent versions of RAR, from https://github.com/sumatrapdfreader/sumatrapdf/tree/master/ext/unarr
Created attachment 348025 [details] [review]
comics: Fix "function declaration isn’t a prototype" errors

Need to use "void" as the argument list when declaring.

FIXME: Should be upstreamed!
Created attachment 348026 [details] [review]
comics: Fix "no previous prototype for..." errors

Functions that aren't used outside the C file should be marked as static.
Created attachment 348027 [details] [review]
comics: Add unarr support

We support all the types of RAR archives now!
Please review so we can land this ASAP for the next devel release, and so we can fix everything up.
Created attachment 348200 [details] [review]
comics: Port to using libarchive for unarchiving

v6:
- Fix 2 pretty big memory leaks
- Remove unneeded archive_read_data_skip() calls
- Optimise the "no rotation" case (could also be done for gnome-3-24)
- Use render_pixbuf_size_prepared_cb()
- Add debug to "next header" archive calls

v5:
- Remove unused members of ComicsDocument struct
- Split archive handling into an EvArchive object
- Fix copy/paste error in configure.ac

v4:
- Fix crash caused by a bug in comics_document_list() (the array was not NULL terminated)
- Remove duplicate "!cb_files" check
- Use "size-prepared" instead of "area-prepared" to get the doc size
- Fix link to libarchive bug in code, not working yet :/

v3:
- Rebase against latest evince, making sure to bring back:
  - Use Unicode in translatable strings
  - Sort pages in natural order
  - Fix mime-type comparisons
- https://github.com/libarchive/libarchive/issues/373 looks like it's fixed!

v2:
- Rebase against latest evince
- Update libarchive bug URL
Created attachment 348333 [details] [review]
comics: Add unarr copy/paste

To allow us to decompress CBR comics compressed with recent versions of RAR, from https://github.com/sumatrapdfreader/sumatrapdf/tree/master/ext/unarr
Created attachment 348391 [details] [review]
comics: Port to using libarchive for unarchiving

v7:
- Bump buffer size in ev-archive, good performance increase for local files

v6:
- Fix 2 pretty big memory leaks
- Remove unneeded archive_read_data_skip() calls
- Optimise the "no rotation" case (could also be done for gnome-3-24)
- Use render_pixbuf_size_prepared_cb()
- Add debug to "next header" archive calls

v5:
- Remove unused members of ComicsDocument struct
- Split archive handling into an EvArchive object
- Fix copy/paste error in configure.ac

v4:
- Fix crash caused by a bug in comics_document_list() (the array was not NULL terminated)
- Remove duplicate "!cb_files" check
- Use "size-prepared" instead of "area-prepared" to get the doc size
- Fix link to libarchive bug in code, not working yet :/

v3:
- Rebase against latest evince, making sure to bring back:
  - Use Unicode in translatable strings
  - Sort pages in natural order
  - Fix mime-type comparisons
- https://github.com/libarchive/libarchive/issues/373 looks like it's fixed!

v2:
- Rebase against latest evince
- Update libarchive bug URL
FWIW, the "list all the files in the archive" part of the work is much faster with unzip than with libarchive: https://github.com/libarchive/libarchive/issues/889

But the cleanliness of the code, and the fact that we don't need to depend on external tools, mean that I'd rather live with the slight regression than have us keep plodding along calling out to tools with weird quirks.
Created attachment 348410 [details] [review]
comics: Port to using libarchive for unarchiving

v8:
- Fix double-free in error path when decompressing images

v7:
- Bump buffer size in ev-archive, good performance increase for local files

v6:
- Fix 2 pretty big memory leaks
- Remove unneeded archive_read_data_skip() calls
- Optimise the "no rotation" case (could also be done for gnome-3-24)
- Use render_pixbuf_size_prepared_cb()
- Add debug to "next header" archive calls

v5:
- Remove unused members of ComicsDocument struct
- Split archive handling into an EvArchive object
- Fix copy/paste error in configure.ac

v4:
- Fix crash caused by a bug in comics_document_list() (the array was not NULL terminated)
- Remove duplicate "!cb_files" check
- Use "size-prepared" instead of "area-prepared" to get the doc size
- Fix link to libarchive bug in code, not working yet :/

v3:
- Rebase against latest evince, making sure to bring back:
  - Use Unicode in translatable strings
  - Sort pages in natural order
  - Fix mime-type comparisons
- https://github.com/libarchive/libarchive/issues/373 looks like it's fixed!

v2:
- Rebase against latest evince
- Update libarchive bug URL
Comment on attachment 348410 [details] [review] comics: Port to using libarchive for unarchiving Pushed and added some cosmetic changes in a follow up.
Comment on attachment 348333 [details] [review] comics: Add unarr copy/paste Pushed, but moved the third party sources to our cut-n-paste dir
Comment on attachment 348025 [details] [review] comics: Fix "function declaration isn’t a prototype" errors Pushed
Comment on attachment 348026 [details] [review] comics: Fix "no previous prototype for..." errors Pushed, this one should be upstreamed too, it's a build failure.
Comment on attachment 348027 [details] [review]
comics: Add unarr support

Pushed, and also added a couple of follow-ups. I'm seeing an error with a cbr I have here:

! rar.c:169: Requesting too much data (10083 < 10240)
** (evince:3561): WARNING **: Fatal error reading 'z_blamf2.jpg' in archive: Failed to decompress RAR data

We can fix that in a separate bug.
This is fixed now. Thanks!
(In reply to Carlos Garcia Campos from comment #35)
> This is fixed now. Thanks!

What commit was it fixed by?

(In reply to Bastien Nocera from comment #29)
> Created attachment 348410 [details] [review] [review]
> comics: Port to using libarchive for unarchiving
>
> v8:
> - Fix double-free in error path when decompressing images
<snip>

I wish you'd remove the intra-patch version changelog, as it doesn't really apply to what's committed, but no big deal I guess.

(In reply to Carlos Garcia Campos from comment #31)
> Comment on attachment 348333 [details] [review] [review]
> comics: Add unarr copy/paste
>
> Pushed, but moved the third party sources to our cut-n-paste dir

The dvi backend had "mdvi-lib" directly in the backend, which I think made more sense in terms of linking. Again, no big deal.
(In reply to Bastien Nocera from comment #36)
> (In reply to Carlos Garcia Campos from comment #35)
> > This is fixed now. Thanks!
>
> What commit was it fixed by?

What do you mean? Do you want me to list sha1 commits here?

> (In reply to Bastien Nocera from comment #29)
> > Created attachment 348410 [details] [review] [review] [review]
> > comics: Port to using libarchive for unarchiving
> >
> > v8:
> > - Fix double-free in error path when decompressing images
> <snip>
>
> I wish you'd remove the intra-patch version changelog, as it doesn't really
> apply to what's committed, but no big deal I guess.

Ah, right, I just forgot about it. It doesn't hurt in any case.

> (In reply to Carlos Garcia Campos from comment #31)
> > Comment on attachment 348333 [details] [review] [review] [review]
> > comics: Add unarr copy/paste
> >
> > Pushed, but moved the third party sources to our cut-n-paste dir
>
> The dvi backend had "mdvi-lib" directly in the backend, which I think made
> more sense in terms of linking. Again, no big deal.

I considered unarr more generic; it's currently used only by the comics backend, but it could be used by any other part of the code that might need to uncompress rar, like we do in ev-document-factory for compressed documents.
(In reply to Carlos Garcia Campos from comment #35) > This is fixed now. Thanks! Yes, thanks a lot!
(In reply to Carlos Garcia Campos from comment #37)
> (In reply to Bastien Nocera from comment #36)
> > (In reply to Carlos Garcia Campos from comment #35)
> > > This is fixed now. Thanks!
> >
> > What commit was it fixed by?
>
> What do you mean? Do you want me to list sha1 commits here?

The sha1 (which bugzilla would automatically transform into a link), or the subject of the commit, would have been nice. It wasn't super obvious from reading the logs.

> > (In reply to Carlos Garcia Campos from comment #31)
> > > Comment on attachment 348333 [details] [review] [review] [review] [review]
> > > comics: Add unarr copy/paste
> > >
> > > Pushed, but moved the third party sources to our cut-n-paste dir
> >
> > The dvi backend had "mdvi-lib" directly in the backend, which I think made
> > more sense in terms of linking. Again, no big deal.
>
> I considered unarr more generic, it's currently used only by the comics
> backend, but it could be used by any other part of the code that might need
> to uncompress rar, like we do in ev-document-factory for compressed
> documents.

I don't think it would be needed for any of the compression formats listed there; we only support rar in our copy of unarr. For the rest, libarchive would do just as well.