Bug 742670 – 100% CPU usage during photo import in large collection (no indexes in DB (!))



Summary:	100% CPU usage during photo import in large collection (no indexes in DB (!))


Status:	RESOLVED FIXED

Product:	shotwell
Classification:	Other
Component:	import
Version:	0.20.x
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	0.26
Assigned To:	Shotwell Maintainers
QA Contact:	Shotwell Maintainers

URL:
Whiteboard:	performance

Depends on:
Blocks:	749187 772223

Reported:	2015-01-09 21:00 UTC by Коренберг Марк
Modified:	2016-11-10 16:46 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Add indexs to PhotoTable (2.15 KB, patch) 2016-07-24 06:46 UTC, Jens Georg, committed
Fix issue with indexes on PhotoTable (1.65 KB, patch) 2016-07-28 22:23 UTC, Jens Georg, committed

Description Коренберг Марк 2015-01-09 21:00:10 UTC

I ran strace on Shotwell and saw a very large number of reads on the SQLite DB file. To find out what caused them, I attached GDB and saw that Shotwell spends much of its time in the function that detects duplicates.

It executes a query like:

SELECT id FROM PhotoTable WHERE filename=? OR ((thumbnail_md5=? or md5=?) and file_format=?)

(not all conditions are always used). Next I looked at the indexes on that table and found only ONE (!):

CREATE INDEX PhotoEventIDIndex ON PhotoTable (event_id);
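The check described above can be reproduced outside Shotwell. The following is a minimal sketch, assuming a heavily simplified `PhotoTable` (the real schema has many more columns): it lists the indexes on the table and asks SQLite how it would execute the duplicate-detection query. The bound parameter values are placeholders.

```python
import sqlite3

# Hypothetical, simplified PhotoTable; the real Shotwell schema is larger.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PhotoTable ("
    "id INTEGER PRIMARY KEY, filename TEXT, md5 TEXT, "
    "thumbnail_md5 TEXT, file_format INTEGER, event_id INTEGER)"
)
conn.execute("CREATE INDEX PhotoEventIDIndex ON PhotoTable (event_id)")

# List the explicitly created indexes on the table.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'PhotoTable'"
    )
]
print(index_names)

# Ask SQLite how it would run the duplicate-detection query.
query = (
    "SELECT id FROM PhotoTable "
    "WHERE filename = ? OR ((thumbnail_md5 = ? OR md5 = ?) AND file_format = ?)"
)
# The fourth column of each EXPLAIN QUERY PLAN row is the readable description.
plan = [
    row[3]
    for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("a.jpg", "t", "m", 0))
]
print(plan)
```

With only the `event_id` index available, the plan should report a scan of the whole table, which is consistent with the CPU load described in this report. The exact wording of the plan text differs between SQLite versions.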

Next, I added indexes by hand:


sqlite> CREATE unique INDEX mmarkk1 on PhotoTable (filename);

sqlite> CREATE unique INDEX mmarkk2 on PhotoTable (thumbnail_md5, file_format);
Error: UNIQUE constraint failed: PhotoTable.thumbnail_md5, PhotoTable.file_format;  <--- WTF? But that is another story...

sqlite> CREATE INDEX mmarkk2 on PhotoTable (thumbnail_md5, file_format);

sqlite> CREATE unique INDEX mmarkk3 on PhotoTable (md5, file_format);

sqlite> CREATE unique INDEX mmarkk4 on PhotoTable (md5, thumbnail_md5, file_format);


Some of these indexes are redundant, but they definitely help in my case. After adding them, Shotwell uses 100% of my SD card's I/O (instead of 100% CPU, as it did before the indexes were added).
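The effect of the added indexes on a single lookup can be illustrated with a small sketch. It assumes a simplified table and uses hypothetical index names (not the ones from the session above or from the later patch), and it checks one of the simpler query variants, since not all conditions are always used.

```python
import sqlite3

def query_plan(conn, sql, params):
    """Return the readable steps SQLite reports for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real PhotoTable.
conn.execute(
    "CREATE TABLE PhotoTable ("
    "id INTEGER PRIMARY KEY, filename TEXT, md5 TEXT, "
    "thumbnail_md5 TEXT, file_format INTEGER)"
)

by_md5 = "SELECT id FROM PhotoTable WHERE md5 = ? AND file_format = ?"
before = query_plan(conn, by_md5, ("m", 0))

# Hypothetical index names; deliberately non-unique.
conn.execute("CREATE INDEX idx_photo_filename ON PhotoTable (filename)")
conn.execute("CREATE INDEX idx_photo_thumb_md5 ON PhotoTable (thumbnail_md5, file_format)")
conn.execute("CREATE INDEX idx_photo_md5 ON PhotoTable (md5, file_format)")
after = query_plan(conn, by_md5, ("m", 0))

print(before)  # a full table scan
print(after)   # a search that names idx_photo_md5
```

Each duplicate check then becomes an index lookup instead of a pass over every row, which matters when the check runs once per imported photo.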

My DB was 13 MB. After adding the indexes it grew to 22 MB :). That is acceptable for my 200 GB photo collection.

sqlite> select count(*) from PhotoTable;
36403


Also, I have used Shotwell for 5+ years, so maybe there is a bug in the DB upgrade procedures that failed to add the indexes...

Comment 1 Joerg C. Frings-Fuerst 2015-02-24 08:55:51 UTC

Hi,

the same bug on Debian: 

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777499.


Please set this bug to confirmed.

Thanks


CU
Jörg

Comment 2 Daniel Koć 2015-08-31 12:50:14 UTC

Are there any plans to resolve this issue (for example, an estimated time frame or major obstacles)? I already have 50k+ photos, so importing is very painful because of the constant waiting.

Comment 3 Jens Georg 2016-07-24 06:46:05 UTC

Created attachment 332033
Add indexs to PhotoTable

To speed up duplicate searches. This is the first try on my rather limited set
of images; if these don't provide a decent speedup, we might need to create
some covering indexes.

Signed-off-by: Jens Georg <mail@jensge.org>

Comment 4 Jens Georg 2016-07-24 06:49:42 UTC

Attachment 332033 pushed as 767950c - Add indexs to PhotoTable

Comment 5 Коренберг Марк 2016-07-24 22:18:49 UTC

Hey! You have a bug in that patch!

2) index on thumbnail_md5,file_format
..... on PhotoTable(md5, file_format) ...

Feel the difference!

Comment 6 Коренберг Марк 2016-07-24 22:21:21 UTC

Also, will these indexes be added when upgrading Shotwell to a new version?

Comment 7 Коренберг Марк 2016-07-24 22:23:58 UTC

Also, there is a question about uniqueness. What about duplicate images in the library that differ by MD5 but have exactly the same thumbnail? For example, an absolutely white 1000x1000 JPEG and an absolutely white 300x300 JPEG may have exactly the same thumbnail.
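The white-image case can be simulated without any imaging library. This is only a sketch: images are modelled as lists of grayscale rows, `thumbnail` is a hypothetical nearest-neighbour downscaler standing in for a real thumbnailer, and `md5_of` hashes the dimensions plus pixel data rather than an encoded JPEG file.

```python
import hashlib

def solid_image(width, height, value=255):
    """Return a grayscale image as a list of rows, every pixel set to `value`."""
    return [[value] * width for _ in range(height)]

def thumbnail(image, size=8):
    """Nearest-neighbour downscale to size x size (stand-in for a real thumbnailer)."""
    height, width = len(image), len(image[0])
    return [
        [image[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]

def md5_of(image):
    """MD5 over the dimensions plus the raw pixel data."""
    digest = hashlib.md5()
    digest.update(f"{len(image[0])}x{len(image)}".encode())
    for row in image:
        digest.update(bytes(row))
    return digest.hexdigest()

big = solid_image(1000, 1000)
small = solid_image(300, 300)

same_full = md5_of(big) == md5_of(small)
same_thumb = md5_of(thumbnail(big)) == md5_of(thumbnail(small))
print(same_full, same_thumb)  # False True
```

Two rows with different `md5` values but the same `thumbnail_md5` are therefore plausible, which is why a UNIQUE constraint on the thumbnail hash is questionable.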

Comment 8 Jens Georg 2016-07-25 07:50:53 UTC

(In reply to Коренберг Марк from comment #5)
> Hey! You have a bug in that patch!
> 
> 2) index on thumbnail_md5,file_format
> ..... on PhotoTable(md5, file_format) ...
> 
> Feel the difference!

Whoops, damn you copy & paste! Thanks.


(In reply to Коренберг Марк from comment #6)
> Also, will these indexes be added when upgrading Shotwell to a new version?

Yes.
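For illustration only, an upgrade step that adds indexes to an existing database can be made idempotent with `CREATE INDEX IF NOT EXISTS`. The sketch below is not Shotwell's actual upgrade code; the table is simplified and the index names and the `ensure_indexes` helper are hypothetical.

```python
import sqlite3

# Hypothetical index names mapped to the columns they cover.
INDEXES = {
    "idx_photo_filename": "filename",
    "idx_photo_thumb_md5": "thumbnail_md5, file_format",
    "idx_photo_md5": "md5, file_format",
}

def ensure_indexes(conn):
    """Create any missing index; safe to call on every start-up."""
    for name, columns in INDEXES.items():
        conn.execute(f"CREATE INDEX IF NOT EXISTS {name} ON PhotoTable ({columns})")
    conn.commit()

def index_count(conn):
    """Count the explicitly created indexes on PhotoTable."""
    row = conn.execute(
        "SELECT count(*) FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'PhotoTable'"
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
# Simplified stand-in for a database created by an older version.
conn.execute(
    "CREATE TABLE PhotoTable ("
    "id INTEGER PRIMARY KEY, filename TEXT, md5 TEXT, "
    "thumbnail_md5 TEXT, file_format INTEGER)"
)

ensure_indexes(conn)
ensure_indexes(conn)  # a second run changes nothing
print(index_count(conn))  # 3
```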

Comment 9 Коренберг Марк 2016-07-25 10:46:18 UTC

What about uniqueness?

Comment 10 Jens Georg 2016-07-25 12:12:31 UTC

(In reply to Коренберг Марк from comment #9)
> What about uniqueness?

Yeah, I have to think about that. Good point.

Comment 11 Andreas Brauchli 2016-07-25 19:53:17 UTC

You can only reasonably expect duplicate detection to work when comparing the full image (or a hash thereof); anything based on a thumbnail (i.e. a lossy representation of the original) will be imperfect no matter what.

If the operation is faster on thumbs, then (assuming there are more non-duplicates than duplicates) it's a good optimization to first test for thumbnail inequality* and if that fails, compare the full image. (*Maybe thumbnail size checking might be even faster?)
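The two-stage check suggested here could look roughly like the following sketch. It is not Shotwell's implementation: the dict layout with `thumb` and `full` byte strings and the `is_duplicate` helper are assumptions made for illustration, and it relies on the premise that images with different thumbnails are never duplicates.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def is_duplicate(candidate, library):
    """Cheap thumbnail comparison first, full-image comparison only on a match.

    `candidate` and each entry of `library` are dicts with 'thumb' and 'full'
    byte strings (a hypothetical layout).
    """
    cand_thumb = md5_hex(candidate["thumb"])
    cand_full = None  # computed lazily, only if a thumbnail matches
    for entry in library:
        if md5_hex(entry["thumb"]) != cand_thumb:
            continue  # thumbnails differ, so treat as different photos
        if cand_full is None:
            cand_full = md5_hex(candidate["full"])
        if md5_hex(entry["full"]) == cand_full:
            return True
    return False

library = [
    {"thumb": b"white", "full": b"white-1000x1000"},
    {"thumb": b"sky", "full": b"sky-photo"},
]

# Same thumbnail as an existing entry, but a different full image.
print(is_duplicate({"thumb": b"white", "full": b"white-300x300"}, library))  # False
# Thumbnail and full image both match an existing entry.
print(is_duplicate({"thumb": b"sky", "full": b"sky-photo"}, library))  # True
```

The full hash is only computed when a thumbnail matches, so the common case of a non-duplicate stays cheap.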

The current duplicate check is a bit optimistic but probably works rather well, except in what are presumably edge cases where the thumbnails match even though the photos are different.

Is someone having real trouble with false duplicates? e.g. fast shutters with very low noise pictures?

Comment 12 Коренберг Марк 2016-07-25 20:23:47 UTC

There are no problems. I just have checked:

select filename from PhotoTable where thumbnail_md5 in (select thumbnail_md5 from PhotoTable group by thumbnail_md5 having count(*) > 1) order by 1;

This gives me a list of filenames with the same thumbnail_md5.

In all cases these are real duplicates (i.e. almost the same photo in the same dir).

So, adding a UNIQUE index will FAIL, at least on my collection of photos. It seems some (unit) test should be added to check that.
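A small reproduction of this failure, which could serve as the starting point for such a test, is sketched below. The table is a simplified stand-in, the sample rows are invented, and the index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real PhotoTable.
conn.execute(
    "CREATE TABLE PhotoTable ("
    "id INTEGER PRIMARY KEY, filename TEXT, "
    "thumbnail_md5 TEXT, file_format INTEGER)"
)
conn.executemany(
    "INSERT INTO PhotoTable (filename, thumbnail_md5, file_format) VALUES (?, ?, ?)",
    [
        ("a/IMG_1.jpg", "aaa", 0),
        ("a/IMG_1_copy.jpg", "aaa", 0),  # same thumbnail hash as IMG_1
        ("b/IMG_2.jpg", "bbb", 0),
    ],
)
conn.commit()

# The query from this comment: filenames that share a thumbnail_md5.
dupes = [
    row[0]
    for row in conn.execute(
        "SELECT filename FROM PhotoTable WHERE thumbnail_md5 IN "
        "(SELECT thumbnail_md5 FROM PhotoTable "
        "GROUP BY thumbnail_md5 HAVING count(*) > 1) ORDER BY 1"
    )
]
print(dupes)

# A UNIQUE index cannot be built over rows that already collide.
try:
    conn.execute(
        "CREATE UNIQUE INDEX idx_thumb_unique "
        "ON PhotoTable (thumbnail_md5, file_format)"
    )
    unique_ok = True
except sqlite3.IntegrityError:
    unique_ok = False
print(unique_ok)

# A plain (non-unique) index on the same columns is created without error.
conn.execute("CREATE INDEX idx_thumb ON PhotoTable (thumbnail_md5, file_format)")
```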

Comment 13 Jens Georg 2016-07-28 22:20:53 UTC

(In reply to Коренберг Марк from comment #12)

Yeah, duplicate thumbnails can currently happen easily if you imported into your library and lost your DB. The extracted/sidecar JPEGs will of course have identical thumbs.

> So, adding a UNIQUE index will FAIL, at least on my collection of photos. It
> seems some (unit) test should be added to check that.

Patches welcome.

Comment 14 Jens Georg 2016-07-28 22:23:34 UTC

Created attachment 332319
Fix issue with indexes on PhotoTable

 - thumbnail_md5 might actually not be unique for various reasons
 - Second index was a duplicate of the first instead of using thumbnail_md5

Signed-off-by: Jens Georg <mail@jensge.org>

Comment 15 Коренберг Марк 2016-07-29 18:24:25 UTC

What about removing uniqueness on md5,file_format?

Comment 16 Jens Georg 2016-08-05 21:46:25 UTC

(In reply to Коренберг Марк from comment #15)
> What about removing uniqueness on md5,file_format?

Having this pair non-unique would be a failure of the dupe detection, no?

Comment 17 Jens Georg 2016-08-05 22:13:50 UTC

Comment on attachment 332319
Fix issue with indexes on PhotoTable

Attachment 332319 pushed as 71ec94a - Fix issue with indexes on PhotoTable

Comment 18 Jens Georg 2016-09-30 06:05:28 UTC

(In reply to Jens Georg from comment #16)
> (In reply to Коренберг Марк from comment #15)
> > What about removing uniqueness on md5,file_format?
> 
> Having this pair non-unique would be a failure of the dupe detection, no?

Meh. Of course that happens (Bug 772223)

Comment 19 Jens Georg 2016-11-10 16:46:24 UTC

Index made non-unique on master and 0.24 branch