GNOME Bugzilla – Bug 742670
100% CPU usage during photo import in large collection (no indexes in DB (!))
Last modified: 2016-11-10 16:46:24 UTC
I ran strace on Shotwell and saw a very large number of reads on the SQLite DB file. To find out what causes them, I attached GDB and saw that Shotwell spends most of its time in the function that detects duplicates. It executes a query like

SELECT id FROM PhotoTable WHERE filename=? OR ((thumbnail_md5=? OR md5=?) AND file_format=?)

(not all conditions are always used). I then looked at the indexes on that table and found only ONE (!):

CREATE INDEX PhotoEventIDIndex ON PhotoTable (event_id);

Next, I added indexes by hand:

sqlite> CREATE unique INDEX mmarkk1 on PhotoTable (filename);
sqlite> CREATE unique INDEX mmarkk2 on PhotoTable (thumbnail_md5, file_format);
Error: UNIQUE constraint failed: PhotoTable.thumbnail_md5, PhotoTable.file_format; <--- WTF?, but that is another story...
sqlite> CREATE INDEX mmarkk2 on PhotoTable (thumbnail_md5, file_format);
sqlite> CREATE unique INDEX mmarkk3 on PhotoTable (md5, file_format);
sqlite> CREATE unique INDEX mmarkk4 on PhotoTable (md5, thumbnail_md5, file_format);

These are more indexes than strictly necessary, but they are guaranteed to help in my case. After that, Shotwell uses 100% of my SD card's I/O (instead of 100% CPU, as before adding the indexes).

My DB was 13 MB. After adding the indexes it became 22 MB :). That is acceptable for my 200 GB collection of photos.

sqlite> select count(*) from PhotoTable;
36403

Also, I have used Shotwell for 5+ years, so maybe there is a bug in the DB upgrade procedures that failed to add the indexes...
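The effect described above can be checked with EXPLAIN QUERY PLAN. The following is only a minimal sketch using Python's sqlite3 module: the table is a simplified stand-in for Shotwell's real PhotoTable, and the index names, file paths and hash values are made up for illustration. The exact plan text depends on the SQLite version, so the script prints it rather than assuming it.

```python
import sqlite3

# Simplified, hypothetical stand-in for Shotwell's PhotoTable
# (the real table has many more columns).
SCHEMA = """
CREATE TABLE PhotoTable (
    id INTEGER PRIMARY KEY,
    filename TEXT,
    md5 TEXT,
    thumbnail_md5 TEXT,
    file_format INTEGER,
    event_id INTEGER
);
CREATE INDEX PhotoEventIDIndex ON PhotoTable (event_id);
"""

# The duplicate-detection query quoted in the report.
DUPLICATE_QUERY = (
    "SELECT id FROM PhotoTable "
    "WHERE filename = ? OR ((thumbnail_md5 = ? OR md5 = ?) AND file_format = ?)"
)


def make_db(rows=1000):
    """Create an in-memory database filled with synthetic rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO PhotoTable (filename, md5, thumbnail_md5, file_format, event_id) "
        "VALUES (?, ?, ?, ?, ?)",
        ((f"/photos/img{i}.jpg", f"md5-{i}", f"thumb-{i}", 0, i % 10)
         for i in range(rows)),
    )
    conn.commit()
    return conn


def add_indexes(conn):
    """Add non-unique indexes on the columns used by the query (names are made up)."""
    conn.executescript("""
        CREATE INDEX DemoFilenameIdx ON PhotoTable (filename);
        CREATE INDEX DemoThumbFormatIdx ON PhotoTable (thumbnail_md5, file_format);
        CREATE INDEX DemoMd5FormatIdx ON PhotoTable (md5, file_format);
    """)


def query_plan(conn, sql, params):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


if __name__ == "__main__":
    conn = make_db()
    params = ("/photos/img5.jpg", "thumb-5", "md5-5", 0)
    print("before:", query_plan(conn, DUPLICATE_QUERY, params))
    add_indexes(conn)
    print("after: ", query_plan(conn, DUPLICATE_QUERY, params))
```

Without the extra indexes the plan is a full scan of PhotoTable for every imported photo, which would explain the CPU load on a table of 36403 rows.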
Hi, the same bug on Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777499. Please set this bug to confirmed. Thanks CU Jörg
Are there any plans to resolve this issue (for example, an estimated time frame or the major obstacles)? I already have 50k+ photos, so importing is very painful because of the constant waiting.
Created attachment 332033
Add indexs to PhotoTable

To speed up duplicate searches. This is the first try on my rather limited set of images; if these don't provide a decent speedup, we might need to create some covering indexes.

Signed-off-by: Jens Georg <mail@jensge.org>
Attachment 332033 pushed as 767950c - Add indexs to PhotoTable
Hey! You have a bug in that patch! The comment for index 2) says it is on thumbnail_md5, file_format, but the statement creates it on PhotoTable(md5, file_format). Feel the difference!
Also, will these indexes be added when upgrading Shotwell to the new version?
There is also the question of uniqueness. What about duplicate images in the library that differ by MD5 but have exactly the same thumbnail? For example, an entirely white 1000x1000 JPEG and an entirely white 300x300 JPEG may have exactly the same thumbnail.
(In reply to Коренберг Марк from comment #5)
> Hey! You have a bug in that patch!
>
> 2) index on thumbnail_md5,file_format
> ..... on PhotoTable(md5, file_format) ...
>
> Feel the difference!

Whoops, damn you copy & paste! Thanks.

(In reply to Коренберг Марк from comment #6)
> Also, will it add these indexes after upgrading Shotwell to new version?

Yes.
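For illustration only, one way an upgrade step can add indexes to an existing database without failing when they are already present is CREATE INDEX IF NOT EXISTS. This is a sketch in Python's sqlite3, not Shotwell's actual upgrade code; the index names are hypothetical and the table is a simplified stand-in.

```python
import sqlite3

# Hypothetical index names; the names used by the actual patch may differ.
PHOTO_TABLE_INDEXES = {
    "DemoMd5FormatIdx": "PhotoTable (md5, file_format)",
    "DemoThumbnailMd5FormatIdx": "PhotoTable (thumbnail_md5, file_format)",
}


def ensure_indexes(conn):
    """Create the indexes if they are missing; safe to run on every startup."""
    for name, target in PHOTO_TABLE_INDEXES.items():
        conn.execute(f"CREATE INDEX IF NOT EXISTS {name} ON {target}")
    conn.commit()


def index_names(conn, table):
    """List the names of all indexes defined on a table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    ).fetchall()
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for an existing database created by an older version.
    conn.execute(
        "CREATE TABLE PhotoTable (id INTEGER PRIMARY KEY, filename TEXT, "
        "md5 TEXT, thumbnail_md5 TEXT, file_format INTEGER)"
    )
    ensure_indexes(conn)
    ensure_indexes(conn)  # running it again is a no-op
    print(index_names(conn, "PhotoTable"))
```

Note that the second index is on thumbnail_md5, not md5, which is exactly the copy & paste slip pointed out above.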
What about uniqueness?
(In reply to Коренберг Марк from comment #9)
> What about uniqueness?

Yeah, I have to think about that. Good point.
You can only reasonably expect duplicate detection to work when comparing the full image (or a hash thereof); anything based on a thumbnail (i.e. a lossy representation of the original) will be imperfect no matter what.

If the operation is faster on thumbs, then (assuming there are more non-duplicates than duplicates) it's a good optimization to first test for thumbnail inequality* and, if the thumbnails turn out to be equal, compare the full image.
(*Maybe checking the thumbnail size would be even faster?)

The current duplicate check is a bit optimistic but probably works rather well, except for what are presumably edge cases where the thumbnails match up despite the photos being different. Is someone having real trouble with false duplicates, e.g. fast shutters with very low noise pictures?
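The two-stage check proposed here could look roughly like the following sketch. It is not Shotwell's implementation: the Photo class and function names are invented for illustration, and it assumes, as the comment does, that photos with different thumbnails are different photos.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    """Hypothetical in-memory photo: full image bytes plus thumbnail bytes."""
    data: bytes
    thumbnail: bytes


def md5_hex(payload: bytes) -> str:
    return hashlib.md5(payload).hexdigest()


def is_duplicate(a: Photo, b: Photo) -> bool:
    # Stage 1: cheap rejection. Thumbnails are small, so hashing them is fast;
    # differing thumbnails are taken to mean differing photos (an assumption).
    if md5_hex(a.thumbnail) != md5_hex(b.thumbnail):
        return False
    # Stage 2: the thumbnails match, which is not conclusive, so confirm
    # on the full image before declaring a duplicate.
    return md5_hex(a.data) == md5_hex(b.data)


if __name__ == "__main__":
    # Two different "all white" images that happen to share a thumbnail.
    big_white = Photo(data=b"\xff" * 1000, thumbnail=b"\xff" * 16)
    small_white = Photo(data=b"\xff" * 300, thumbnail=b"\xff" * 16)
    print(is_duplicate(big_white, small_white))
    print(is_duplicate(big_white, big_white))
```

Under the stated assumption that non-duplicates dominate, most comparisons stop after the first stage.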
There are no problems. I have just checked with:

select filename from PhotoTable where thumbnail_md5 in (select thumbnail_md5 from PhotoTable group by thumbnail_md5 having count(*) > 1) order by 1;

This gives me a list of filenames that share a thumbnail_md5. In all cases these are real duplicates (i.e. almost the same photo in the same dir).

So, adding a UNIQUE index will FAIL, at least on my collection of photos. It seems some (unit) test should be added to check that.
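A check along the lines suggested here might look like the sketch below, written with Python's sqlite3 against a simplified stand-in table. The helper names and the index name are invented for illustration and this is not an existing Shotwell test. It relies on the sqlite3 module raising IntegrityError when CREATE UNIQUE INDEX hits existing duplicate rows.

```python
import sqlite3


def make_db(rows):
    """Build an in-memory stand-in for PhotoTable from (filename, thumbnail_md5, file_format) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PhotoTable (id INTEGER PRIMARY KEY, filename TEXT, "
        "thumbnail_md5 TEXT, file_format INTEGER)"
    )
    conn.executemany(
        "INSERT INTO PhotoTable (filename, thumbnail_md5, file_format) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def duplicate_thumbnail_groups(conn):
    """Return (thumbnail_md5, file_format, count) for every pair that occurs more than once."""
    return conn.execute(
        "SELECT thumbnail_md5, file_format, COUNT(*) FROM PhotoTable "
        "GROUP BY thumbnail_md5, file_format HAVING COUNT(*) > 1 ORDER BY 1, 2"
    ).fetchall()


def try_unique_index(conn):
    """Attempt to create a UNIQUE index; report whether the existing data allows it."""
    try:
        conn.execute(
            "CREATE UNIQUE INDEX DemoThumbUniqueIdx "
            "ON PhotoTable (thumbnail_md5, file_format)"
        )
        return True
    except sqlite3.IntegrityError:
        # Existing rows already violate the constraint, so the index is not created.
        return False


if __name__ == "__main__":
    conn = make_db([("a.jpg", "t1", 0), ("a_copy.jpg", "t1", 0), ("b.jpg", "t2", 0)])
    print(duplicate_thumbnail_groups(conn))
    print(try_unique_index(conn))
```

Running the grouping query before an upgrade would show whether a UNIQUE index can be created at all on a given library.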
(In reply to Коренберг Марк from comment #12)

Yeah, duplicate thumbnails can currently happen easily if you imported into your library and then lost your DB. The extracted/sidecar JPEGs will of course have identical thumbs.

> So, adding UNIQUE index will FAIL at least on my collection of photos. It
> seems, some (unit-)test should be added to check that.

Patches welcome.
Created attachment 332319
Fix issue with indexes on PhotoTable

- thumbnail_md5 might actually not be unique for various reasons
- Second index was a duplicate of the first instead of using thumbnail_md5

Signed-off-by: Jens Georg <mail@jensge.org>
What about removing uniqueness on md5,file_format?
(In reply to Коренберг Марк from comment #15)
> What about removing uniqueness on md5,file_format?

Having this pair non-unique would be a failure of the dupe detection, no?
Comment on attachment 332319
Fix issue with indexes on PhotoTable

Attachment 332319 pushed as 71ec94a - Fix issue with indexes on PhotoTable
(In reply to Jens Georg from comment #16)
> (In reply to Коренберг Марк from comment #15)
> > What about removing uniqueness on md5,file_format?
>
> Having this pair non-unique would be a failure of the dupe detection, no?

Meh. Of course that happens (Bug 772223)
Index made non-unique on master and 0.24 branch