Bug 169646 - need to handle duplicates
Status: RESOLVED FIXED
Product: f-spot
Classification: Other
Component: Import
Version: 0.3.0
OS/Hardware: Other / All
Priority/Severity: Normal / normal
Target Milestone: ---
Assigned To: F-spot maintainers
Duplicates: 305734 308796 352300 365573 382843 408541 448519 508011 520725 526274
Depends on:
Blocks:
Reported: 2005-03-08 21:12 UTC by Peter A. Goodall
Modified: 2009-10-22 15:26 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
- Duplicates patch (27.86 KB, patch); 2006-03-28 19:16 UTC, Thomas Van Machelen; status: none
- updated version (28.13 KB, patch); 2006-06-10 19:41 UTC, Thomas Van Machelen; status: none
- patch for current subversion (23.57 KB, patch); 2007-02-22 00:50 UTC, Nuno Ferreira; status: none
- Small improvements (24.73 KB, patch); 2007-02-23 07:17 UTC, Thomas Van Machelen; status: none
- Updated for current SVN head (24.94 KB, patch); 2007-03-07 10:40 UTC, Jamie Wilkinson; status: none
- minimal patch for database to have a md5sum field (1.43 KB, patch); 2007-08-08 20:26 UTC, Pierre Sanon; status: none
- New duplicates detection patch (23.42 KB, patch); 2008-05-26 06:25 UTC, Thomas Van Machelen; status: none
- Slightly improved duplicates patch (24.26 KB, patch); 2008-05-29 06:29 UTC, Thomas Van Machelen; status: none
- Duplicates (23.74 KB, patch); 2008-06-08 19:08 UTC, Thomas Van Machelen; status: none
- Duplicates patch, with smarter comparison (23.92 KB, patch); 2008-06-09 06:18 UTC, Thomas Van Machelen; status: none
- Duplicates patch update N (34.12 KB, patch); 2008-06-27 17:34 UTC, Thomas Van Machelen; status: none
- Almost there duplicates detection patch (35.95 KB, patch); 2008-06-28 20:46 UTC, Thomas Van Machelen; status: none
- A bit closer again (36.14 KB, patch); 2008-07-04 06:44 UTC, Thomas Van Machelen; status: none
- Duplicates patch, this time without memory increasing (37.13 KB, patch); 2008-07-10 15:44 UTC, Thomas Van Machelen; status: committed
- Fix for files remaining on disk (318 bytes, patch); 2008-09-09 06:48 UTC, Thomas Van Machelen; status: committed
- Debug output from Nils - painfully slow import + error at duplicate (6.98 KB, text/plain); 2008-09-10 17:35 UTC, Nils Pickert
- Add duplicate detection support to camera import (4.79 KB, patch); 2008-09-15 20:36 UTC, Thomas Van Machelen; status: none
- Slightly better version of the camera import (4.12 KB, patch); 2008-09-16 06:00 UTC, Thomas Van Machelen; status: committed

Description Peter A. Goodall 2005-03-08 21:12:21 UTC
Version details: 0.0.10
Distribution/Version: SUSE LINUX 9.3 Beta 2

I somehow duplicated a whole directory of images, and now I have to go through
and manually remove each duplicate.  In gthumb there is an option to
automatically remove dups.  If the functionality is there in f-spot, I couldn't
find it.
Comment 1 Larry Ewing 2005-03-08 21:29:41 UTC
Hmmm, it shouldn't be possible to create duplicates.  Do you know how you did
it?  It might be possible via drag and drop; does that sound likely?
Comment 2 Peter A. Goodall 2005-03-08 21:43:02 UTC
No, it doesn't.  I was doing a lot of testing with 0.0.9-3 and 0.0.10.  When I
first started testing I imported individual directories, then I figured out how
to import all the directories under /common/Picutures/.  I assumed that was how
it happened.  I could try to reproduce it for you.
Comment 3 Peter A. Goodall 2005-03-08 21:43:34 UTC
s/could try/will try/
Comment 4 Peter A. Goodall 2005-03-08 21:53:41 UTC
Ok.  It is because in that directory I have all the original pictures, a scaled
version of all the pictures, and a subdirectory called "original" that _also_
contains all the original pictures.  Not sure why :-/  However, given this
situation, should there be a way to detect that all those files are duplicates of
the others?  Not necessarily during import (though that would be nice...), but
afterward, based on EXIF data or something.
Comment 5 Gabriel Burt 2005-09-08 23:13:41 UTC
*** Bug 305734 has been marked as a duplicate of this bug. ***
Comment 6 Gabriel Burt 2005-09-08 23:16:15 UTC
*** Bug 308796 has been marked as a duplicate of this bug. ***
Comment 7 Jonas Bergler 2006-01-11 11:36:03 UTC
F-Spot should have duplicate detection support, whether by checksumming all images on import or in the background, raising an alert if duplicates are found. Due to the CPU-intensive nature of checksumming, it should however be an option that can be turned off.
Comment 8 Bengt Thuree 2006-02-19 05:15:18 UTC
Just adding Gabriel's comment from bug 308796#c2:
-------------
Duplicate detection that would solve bug 169646 would solve this too. Instead of
using EXIF matching or MD5 hash matching to detect duplicates, it would be faster to, as
Haran said, check whether it's a link and whether the path it points to is already in the db.
Comment 9 Bengt Thuree 2006-02-19 05:36:08 UTC
Did we not have a patch for this one? I seem to remember that someone wrote a patch for detecting duplicates and verified it did not slow down the loading of pictures much. Perhaps it could be extended to check for existing links first?
Comment 10 Larry Ewing 2006-02-19 19:56:23 UTC
Yeah, most of this bug is solved in a patch in acs' repo, but it requires some changes to the db schema and a bit more polish.  Now that we have the db update code it should be possible to integrate the patch with some work.  It is a high priority, but I might do it in stages.
Comment 11 Larry Ewing 2006-03-19 23:58:32 UTC
massis is cooking this patch now, and progress seems excellent.
Comment 12 Thomas Van Machelen 2006-03-28 19:16:44 UTC
Created attachment 62237 [details] [review]
Duplicates patch

Attached you will find the duplicates patch acs once created, updated to the current CVS head and at the same time fixing a lot of the originally raised issues.
You can find the original discussion here:
 http://mail.gnome.org/archives/f-spot-list/2005-October/msg00044.html

and the comments on the first version of the patch here:
 http://mail.gnome.org/archives/f-spot-list/2006-March/msg00018.html

Mind that your database will be upgraded to a new version, so please back up before you test this.
Comment 13 Alvaro del Castillo 2006-04-03 21:59:26 UTC
Thomas, I plan to invest some time in F-Spot in the next weeks, so I can help you test the final version of the duplicates patch I sent to the list some months ago. I will try your patches as a first step and then do some testing in order to finally include it in F-Spot.

sid@delito:~/devel/f-spot-devel$ patch -p0 < DuplicatesPatch.patch
patching file src/ImportCommand.cs
patching file src/MainWindow.cs
patching file src/PhotoQuery.cs
patching file src/PhotoStore.cs
Hunk #3 succeeded at 564 (offset 7 lines).
....

Ok, the patch seems to work with current CVS; it compiles and installs fine. I can see the Find Duplicates entry in the Search menu, so it's time to try importing some photos with duplicates.

In the import dialog you can choose with the checkbox whether or not to include duplicates, and it works as expected. If I choose to import duplicates, I get some photo duplicates in my collection. Then I can use the search for duplicates in the Search menu and it works nicely. I can remove the duplicates, and afterwards a search for duplicates comes up empty.

Ok, it seems that everything is working as expected.

Cheers

Comment 14 Jaakko R 2006-04-25 19:04:57 UTC
How about:
- first find candidates for duplicates based on the file sizes of the images
- then, for colliding file sizes, calculate md5sums

From my set of about 2000 photos in JPG format, only two had the same file size. I therefore conclude that for the typical case, where photos are stored in a compressed format rather than raw, there would be no performance penalty.

Comment 15 Bengt Thuree 2006-04-26 01:12:34 UTC
Then we have the question of what a picture is:
the JPG file, or only the image embedded in the JPG file (that is, excluding the tags)?

You could modify one of the tags, and the file size would then differ...
Comment 16 Thomas Van Machelen 2006-04-26 11:37:22 UTC
Jaakko,

Comparing the file sizes _could_ speed up the duplicate detection, but I don't think it is really necessary because the MD5 impact is taken upfront:
* the sums for already existing photos are calculated in a database upgrade
* the sums for new photos are calculated at the time they are created

This means that after creation the md5sums just have to be read from the database (unless the photo changes), and also that the detection of duplicate photos is quite speedy because we maintain a hashtable that contains all the md5sums (and this allows us to do fast lookups).  If you want to see if comparing file sizes might speed up duplicate detection, please don't hesitate to create your own patch...

Bengt,

For the moment the md5sums are calculated against the complete image file, which indeed ignores the fact that two photos might be the same with different exif/xmp data.  This point was already raised by Ruben Vermeersch, and Larry commented that he "should probably add some methods to the image classes that allow them to refine the checksum so that it has a better chance of matching.  It's important to also do a full checksum to point out that the file is not an exact duplicate though."

Regards,
Thomas
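
The lookup scheme Thomas describes (sums computed once up front, then kept in a hashtable for fast membership checks) can be sketched roughly as follows. This is an illustrative stdlib-only sketch; the class and method names are hypothetical, not f-spot's actual API.

```python
import hashlib


class Md5Cache:
    """In-memory md5 -> photo id index, in the spirit of the patch's hashtable.

    Sums are computed once (at import time, or during the db upgrade) and kept
    in a dict, so a duplicate check is an O(1) lookup instead of rehashing
    every stored photo.
    """

    def __init__(self):
        self._id_by_md5 = {}

    def add(self, photo_id, image_bytes):
        # Record the sum for a newly created photo.
        self._id_by_md5[hashlib.md5(image_bytes).hexdigest()] = photo_id

    def duplicate_of(self, image_bytes):
        """Return the id of an identical already-imported photo, or None."""
        return self._id_by_md5.get(hashlib.md5(image_bytes).hexdigest())
```

The dict only needs rebuilding when a photo changes on disk, which matches the "unless the photo changes" caveat above.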

Comment 17 Tim Nicholas 2006-06-09 09:43:25 UTC
Hello! 

What is the current state of this bug/patch? This seems like a fantastic feature that could be added and then optimised for speed. I've only just started using f-spot seriously, but this feature would make my life significantly easier. 

I'm loving finally having a decent way of organising my photos BTW. Nice work. 

Cheers, 
Tim
Comment 18 Thomas Van Machelen 2006-06-10 19:41:18 UTC
Created attachment 67096 [details] [review]
updated version

Updated the duplicates patch.  The previous version crashed f-spot when the updater encountered photos in the database that were no longer present on disk; it also fixes the "Find Duplicates" query by omitting files that are not on disk.
Comment 19 Michael Monreal 2006-06-16 00:03:38 UTC
This is what I was looking for! I often use F-Spot to get new photos from my mobile phone. Naturally I don't delete all the photos... so I often end up importing photos more than once, which results in those ...-1.jpg files and so on.

Tested the patch from comment 18; it seems to work really well for me! It would be great to see this upstream. The only thing I would change is the current menu entry, from "Find->Find Duplicates" to "Find->Duplicate Photos", to better match the rest of the menu entries.
Comment 20 Brian Geppert 2006-06-23 03:05:27 UTC
Using MD5 hashes to calculate dupes sounds like a pretty horrible idea.  Why don't we compare histograms?

In case you're thinking "This guy sounds like he has no idea what he's talking about," you are right.  But I was looking into seeing how other open-source picture programs do dupe detection, and gallery2 is having such a feature implemented (for Google Summer of Code 2006).  I'm going to look into what kind of algorithm they're using, and I'll see about trying to put a patch up here.
Comment 21 Bengt Thuree 2006-08-22 00:25:40 UTC
*** Bug 352300 has been marked as a duplicate of this bug. ***
Comment 22 Michael Monreal 2006-09-18 14:11:14 UTC
After using this patch for a short while and then using plain HEAD again, I found f-spot was not able to write to the db again. I had to manually remove the new "md5sum" column added by this patch. As a reference, here's the sqlite session I used to drop the column (I wish sqlite had proper alter table support...)

$ sqlite photos.db
sqlite> .tables
sqlite> .schema photos
sqlite> BEGIN TRANSACTION;
sqlite> CREATE TEMPORARY TABLE photos_backup (id INTEGER PRIMARY KEY NOT NULL, time INTEGER NOT NULL, directory_path STRING NOT NULL, name STRING NOT NULL, description TEXT NOT NULL, default_version_id INTEGER NOT NULL);
sqlite> INSERT INTO photos_backup SELECT id, time, directory_path, name, description, default_version_id FROM photos;
sqlite> DROP TABLE photos;
sqlite> CREATE TABLE photos (id INTEGER PRIMARY KEY NOT NULL, time INTEGER NOT NULL, directory_path STRING NOT NULL, name STRING NOT NULL, description TEXT NOT NULL, default_version_id INTEGER NOT NULL);
sqlite> INSERT INTO photos SELECT id, time, directory_path, name, description, default_version_id FROM photos_backup;
sqlite> DROP TABLE photos_backup;
sqlite> COMMIT;
sqlite> .exit
Comment 23 Bengt Thuree 2006-10-27 05:30:41 UTC
*** Bug 365573 has been marked as a duplicate of this bug. ***
Comment 24 Thomas Van Machelen 2006-12-06 06:26:00 UTC
*** Bug 382843 has been marked as a duplicate of this bug. ***
Comment 25 Morten Welinder 2006-12-06 14:11:00 UTC
Why all the md5sum stuff?  It seems to me that duplicates can be found
efficiently without adding fields to the database:

1. File size must match.
2. Photo metadata (camera, shutter speed, date?, ...) must match.
3. File contents must match.

Most non-dupes should be identified at steps 1 or 2 here.
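
The cheap-to-expensive filter above can be sketched like this (a stdlib-only illustration with hypothetical names; step 2, the metadata comparison, is omitted because it would need an EXIF reader):

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(paths):
    """Group byte-identical files, comparing file sizes before contents.

    Step 1: bucket by file size; a unique size cannot be a duplicate.
    Step 3: confirm the surviving candidates with a full-content md5.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # unique size: skip the expensive hashing
        for path in candidates:
            with open(path, "rb") as f:
                groups[hashlib.md5(f.read()).hexdigest()].append(path)
    return [g for g in groups.values() if len(g) > 1]
```

Since most non-dupes fall out at the size check, full file reads only happen for the rare size collisions.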
Comment 26 stvn 2006-12-10 00:08:10 UTC
Extra use case to support the need for a 'find duplicates' tool. I had to recover my f-spot photo collection. I recovered the Photos dir and the f-spot gconf settings, but was not entirely sure that all my photos were in there. So I added all my photos from a separate backup and gave them the tag 'new'. I figured I could quickly work out which photos hadn't survived, remove their 'new' tag, and then delete all the 'new' photos from my HDD.

This is seriously a bad idea!

I got loads of duplicates in f-spot, since most of the photos had survived. A bunch didn't survive the recovery, so they were unique. I removed the 'new' tag from those that didn't survive and deleted all the 'new' photos. But when I wanted to look at my pictures again, I noticed that all the duplicates were empty. They were listed in f-spot (name, date, file location etc.) but the files were gone.

This is seriously distressing: f-spot shows two photos, but deleting one deletes the underlying file and leaves the other entry broken.

btw; using a fresh ubuntu edgy with f-spot 0.2.1 on a normal x86 system
Comment 27 Luis Villa 2006-12-19 19:55:11 UTC
Fixing the metadata a bit, and noting that this is still really irritating and the biggest thing preventing me from importing the rest of my pictures into f-spot.
Comment 28 Thomas Van Machelen 2007-02-16 15:20:34 UTC
*** Bug 408541 has been marked as a duplicate of this bug. ***
Comment 29 Nuno Ferreira 2007-02-22 00:50:05 UTC
Created attachment 83079 [details] [review]
patch for current subversion

I reworked the patch against current subversion, fixing a small bug in the process (I don't even remember which one right now).

Is this being considered for a future version? If not, what's needed to get similar functionality accepted in f-spot?
Comment 30 Thomas Van Machelen 2007-02-23 07:17:40 UTC
Created attachment 83151 [details] [review]
Small improvements

Small update:
* close filestream after creating md5sum
* make addtocache in photostore also add to md5cache
* updated ChangeLog
Comment 31 Thomas Van Machelen 2007-02-23 07:18:29 UTC
Oh yeah, and it still seems to work properly; 

!! mind that your db will be upgraded though !!
Comment 32 Jamie Wilkinson 2007-03-07 10:40:29 UTC
Created attachment 84149 [details] [review]
Updated for current SVN head

I've just applied this patch almost cleanly against HEAD as of 10 minutes ago.  This patch rules.  I seriously hope you apply it straight away.
Comment 33 Jean-François Fortin Tam 2007-06-10 20:40:12 UTC
Hello devs, what's up with this? This is a pretty infuriating problem (coupled with the fact that f-spot starts importing before the user clicks the button).

There is a bunch of patches ready, what is holding them back?
Comment 34 Ruben Vermeersch 2007-06-17 17:06:46 UTC
*** Bug 448519 has been marked as a duplicate of this bug. ***
Comment 35 Pierre Sanon 2007-08-08 20:24:53 UTC
I don't see why this very basic feature is still not in.
Could a reason be given, so that we can have a dialogue about what needs to be done to address this situation?

Would it be at least possible to add this code to Updater.cs, so that those using the duplicates patches do not end up with a database incompatible with the other versions.

Comment 36 Pierre Sanon 2007-08-08 20:26:09 UTC
Created attachment 93309 [details] [review]
minimal patch for database to have a md5sum field.
Comment 37 Oliver Gerlich 2007-08-18 09:52:52 UTC
Hello, has this feature slipped under the radar? IMHO it's one of the most important features for photo management software (right after the timeline, and very clearly more important than "arty" retouching tools). Is there still anyone working on this? Any schedule estimates?
Comment 38 barthelemy 2007-09-23 18:48:38 UTC
Hello, I'm also eager to see this feature upstream.

As was pointed out before (http://bugzilla.gnome.org/show_bug.cgi?id=169646#c16), I also think an md5sum of the full file won't meet the needs. For instance, in the following scenario:
1. I import some picture
2. I tag it, with "write metadata to file" enabled
3. I re-import the initial picture by mistake.

After step 2 the file on disk has changed, so comparing the
checksums at step 3 won't detect the duplicate.

To circumvent this, a checksum of the image part itself should be stored, or we could decide to always keep the original picture unmodified (any modification would end up in a new version or be stored in the db).

Other tools to detect duplicates could be useful too:
 - using metadata (two pictures taken at the exact same time with the same camera are good duplicate candidates)
 - an image similarity measure, independent of the file (using histograms, wavelets or...); GQview has such a feature.

These two features could be useful to detect non-strict duplicates such as
different resolutions or rotated versions of a picture.
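
The "checksum of the image part itself" idea can be illustrated by hashing a JPEG with its metadata segments (APPn, COM) skipped, so that editing EXIF/XMP tags doesn't change the sum. This is a simplified sketch, not f-spot's implementation; real files need more careful marker handling (padding bytes, markers without payloads, restart markers).

```python
import hashlib


def image_data_md5(jpeg_bytes):
    """md5 of a JPEG with APPn/COM segments (EXIF, XMP, JFIF, comments) skipped."""
    assert jpeg_bytes[:2] == b"\xff\xd8", "missing SOI marker"
    kept = [b"\xff\xd8"]
    i = 2
    while i + 4 <= len(jpeg_bytes):
        marker = jpeg_bytes[i + 1]
        if marker == 0xDA:  # SOS: entropy-coded image data follows, keep it all
            kept.append(jpeg_bytes[i:])
            break
        # Segment length is big-endian and includes the two length bytes.
        length = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        if not (0xE0 <= marker <= 0xEF or marker == 0xFE):
            kept.append(jpeg_bytes[i:i + 2 + length])  # keep non-metadata segment
        i += 2 + length
    return hashlib.md5(b"".join(kept)).hexdigest()
```

Two files differing only in their APP1 (EXIF/XMP) payload would hash identically under this scheme, while a change to the actual image data would still be detected.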
Comment 39 Thomas M. Hinkle 2007-11-24 02:43:07 UTC
Maybe this is related, maybe it isn't. I keep all my folders in a directory of my own choosing organized in my own way. Ideally, I'd like f-spot to just "watch" that directory and have it update when I drag new pictures in. Failing that, I'd like to be able to click "import" and import new pictures into f-spot -- either from the main directory or from a subdirectory. The problem is, I often end up wanting to add pictures to a directory which I've already imported and then import them -- f-spot creates duplicates of all the pictures.

For the very simple case of importing the *exact same filename* into f-spot, and *not* copying it to the Photos directory, it seems like a no-brainer to simply not create the duplicate. A patch that made that simple fix would make f-spot much more usable to me!
Comment 40 Maxxer 2007-11-24 17:15:31 UTC
(In reply to comment #39)
> Maybe this is related, maybe it isn't. I keep all my folders in a directory of
> my own choosing organized in my own way. Ideally, I'd like f-spot to just
> "watch" that directory and have it update when I drag new pictures in. 

then you might be interested in inotify support (bug #312613)
Comment 41 Maxxer 2008-01-08 09:13:49 UTC
*** Bug 508011 has been marked as a duplicate of this bug. ***
Comment 42 Maxxer 2008-02-18 15:22:14 UTC
*** Bug 517234 has been marked as a duplicate of this bug. ***
Comment 43 Alexander Skwar 2008-02-18 18:56:45 UTC
(In reply to comment #39)
> Maybe this is related, maybe it isn't. I keep all my folders in a directory of
> my own choosing organized in my own way. Ideally, I'd like f-spot to just
> "watch" that directory and have it update when I drag new pictures in. Failing
> that, I'd like to be able to click "import" and import new pictures into f-spot
> -- either from the main directory or from a subdirectory. The problem is, I
> often end up wanting to add pictures to a directory which I've already imported
> and then import them -- f-spot creates duplicates of all the pictures.

FWIW, that's also my "usecase" for using f-spot.

Adding this comment so that devs can see that Thomas (c#39) is not alone.

> For the very simple case of importing the *exact same filename* into f-spot,
> and *not* copying it to the Photos directory, it seems like a no brainer to
> simply not create the duplicate. A patch that made that simple fix would make
> f-spot much more usable to me!

Metoo!


Comment 44 Maxxer 2008-03-06 14:22:25 UTC
*** Bug 520725 has been marked as a duplicate of this bug. ***
Comment 45 Morten Welinder 2008-03-06 14:36:47 UTC
That makes ten duplicates for a bug that has had at least a tentative
patch for 2 years.

This is a highly irritating issue.  It clearly hits lots of people.
Note that bug-buddy is not involved in the reports; 11 reports
have got to be something of a record.

Yet there hasn't been a peep from the f-spot team about this
for years.
Comment 46 Oliver Gerlich 2008-03-06 14:56:16 UTC
Indeed... I remember that when installing Ubuntu 5.10 at a friend's, I wondered why this feature wasn't in, and thought that surely it would be available by the next major Ubuntu release (due July 2006) :-/

By now the friend's photo collection has ~1500 unique photos (amazing how many pictures she got together once using a digicam), but there are around 2500 photos in F-Spot, from the constant import-but-don't-delete-on-camera procedure, and manually weeding out the duplicates has become hopeless :-(
Comment 47 John Yarger 2008-03-13 23:55:36 UTC
I have been evaluating F-Spot version 0.4.0 under Ubuntu 7.10 for 3 months now and have been very IMPRESSED by the overall design and implementation of F-Spot.  The sole remaining concern that holds me back from using F-Spot to manage my 6 years worth of digital photos (which appears to be growing exponentially) is this duplication issue.  

Although I have not examined the C# code, it appears that the central issue revolves around importing a file with the same name and date.  Currently, when that condition exists, the code appears to identify the error condition and follows the procedure of creating a unique filename (-1.jpg, -2.jpg, etc).  

I would suggest that at this point, the code should not automatically produce a unique filename.  Instead, present the user with a dialogue window.  The dialogue window could simply tell the user that the photo appears to be a duplicate and ask if the user would like to import this photo anyway.  Perhaps a check box could be included in the dialogue window to indicate the same answer should be used during this import session when this condition occurs (i.e. when the filename and EXIF photo date appears to indicate a duplicate).  

Another approach that might work even better would be to include within the "Preference" configuration an option for not importing any new photos when an identical filename and date are already contained with the current photo collection.  

I suspect that providing users such an option would be VERY welcomed and would cover 99% of the most common use case conditions.  And hopefully this approach would not require significant new code or database changes.  If any of the F-Spot maintainers feel that this approach might hold promise, I would be happy to spend some more time documenting the suggestion more fully.  

Thanks again for the donation of your time and energy producing such a great application for the world to use!  F-Spot sets a new standard in photo management.
Comment 48 vetsel.patrice 2008-04-03 08:24:53 UTC
Confirmed on Hardy (f-spot 0.4.2-1ubuntu1).
	
This is typically the kind of bug that makes me sad. And so I absolutely don't use f-spot.
Comment 49 Maxxer 2008-04-06 18:31:29 UTC
*** Bug 526274 has been marked as a duplicate of this bug. ***
Comment 50 Martin Harvan 2008-05-03 10:33:12 UTC
It has been a while now and this bug is still present (0.4.2). It is a showstopper for many people, I believe. For instance, I like to keep photos on the memory card, but I also like to download them to the computer just in case. 
With F-Spot not detecting duplicates, I have to be very careful while importing photos, and even now I have 4 copies of the same photo, which wastes space and my time when I have to hunt for duplicates...

Maybe it would be enough to add some sort of basic duplicate detection based on file name; that would be enough for me. Maybe when importing, if an image with the same name and date already exists in the destination folder, ask what to do.
Comment 51 Martin Harvan 2008-05-03 14:02:51 UTC
I have just done some tests and found out one thing. I don't know what the md5 is calculated from, but I would bet it's calculated from the imported file. If so, the md5 sums of the imported image and the one on the camera will never match, because f-spot adds a "tag" recording when the picture was imported, and that changes the image.

I did some tests to confirm this: I imported a photo from the camera twice and got a duplicate. Then I took the imported image (from where f-spot keeps images), copied it to a different folder and tried to import it; this time f-spot detected that the same picture was already in the database (it was essentially the same file, including the "time of import" tag, so even the md5 matched).

So, in order to successfully detect duplicates, f-spot needs to either:
A) not store the import date/time inside the file. This is probably not preferred.
B) calculate the md5 before it writes the import time/date and store it somewhere. This seems like the way to go, so now all that's needed is somebody to put it into code. I would do it, but I really suck at C# :(
Comment 52 Adam Theo 2008-05-26 03:12:42 UTC
This feature *still* isn't in F-Spot yet? WTF?

I just made the switch from Picasa for Linux (proprietary, running under WINE) to F-Spot, because I like open source and wanted a native GNOME app. But then I discovered there's no duplicate detection in this application, which I find absolutely amazing, especially since this topic has been discussed... and discussed... and discussed for *years* now, and there is still no peep from the developers.

I, for one, sadly declare proprietary software the winner in this round. Open Source F-Spot is not a viable application. Back to Picasa, I guess. *sigh*
Comment 53 Thomas Van Machelen 2008-05-26 06:25:40 UTC
Created attachment 111542 [details] [review]
New duplicates detection patch

To finally put an end to all the whining without code, I decided to give this one another shot.  It's a more or less from-scratch implementation that detects duplicates at import time.

WARNING: it modifies your db schema, so use with the needed care

Things it does:
1. adds an md5 field to the photos table to store each photo's sum
2. in the import-from-folder dialog you can now toggle whether to include duplicates or not (the other import paths default to no duplicates, but I didn't test this)
3. duplicate detection occurs ad hoc with db queries, which makes it rather fast; the previous patches kept an md5 dict, but there's no need for that, querying is fast
4. the md5 is created against a smaller version of the image, which should be enough and makes things fast.

So if you're eager to get this into f-spot, show some guts and try it out.
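
Point 3 (checking for duplicates with an ad-hoc query against the new md5 column, instead of an in-memory dict) might look roughly like this. The table below is a stripped-down stand-in for f-spot's real photos schema, and the function name is hypothetical:

```python
import hashlib
import sqlite3

# Minimal stand-in for the photos table; the real schema has many more columns.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, uri TEXT, md5_sum TEXT)")


def import_photo(uri, image_bytes, include_duplicates=False):
    """Insert a photo unless an identical md5_sum is already stored.

    Returns the new row id, or None when the photo was skipped as a dup
    (mirroring the import dialog's include-duplicates toggle).
    """
    md5 = hashlib.md5(image_bytes).hexdigest()
    if not include_duplicates:
        hit = db.execute(
            "SELECT id FROM photos WHERE md5_sum = ?", (md5,)).fetchone()
        if hit is not None:
            return None  # duplicate found by the db query: skip
    cur = db.execute(
        "INSERT INTO photos (uri, md5_sum) VALUES (?, ?)", (uri, md5))
    return cur.lastrowid
```

An index on md5_sum would keep the lookup fast even for large collections.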
Comment 54 Thomas Van Machelen 2008-05-29 06:29:02 UTC
Created attachment 111705 [details] [review]
Slightly improved duplicates patch
Comment 55 Maxxer 2008-06-07 09:36:11 UTC
A quick test raised a problem: it prevents importing jpg/raw!
When importing both RAW and JPEG versions of a picture, only one of the two is actually added to the f-spot database.
Comment 56 Thomas Van Machelen 2008-06-08 17:22:49 UTC
Thanks for testing, 

As I don't have any raw pictures, I didn't have any problem.  But sure, the way duplicate detection works right now (calculating against the thumbnail), that scenario won't work.  The CheckForDuplicate method in the PhotoStore is where we can go wild on how to detect duplicates.

Expect an update soon.
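
Hashing against a reduced rendering of the photo, rather than the raw file bytes, is what makes the check cheap; a rough stdlib-only sketch of the idea on a grayscale pixel buffer, with entirely hypothetical names:

```python
import hashlib


def reduced_md5(pixels, width, height, target=8):
    """md5 over a coarse target x target block-average of a grayscale image.

    pixels: flat row-major list of 0-255 values.  Hashing a reduced version
    keeps the comparison fast and independent of how the file encodes the
    same picture.  width and height are assumed >= target here.
    """
    cells = bytearray()
    for ty in range(target):
        for tx in range(target):
            # Average the source block that maps onto this target cell.
            x0, x1 = tx * width // target, (tx + 1) * width // target
            y0, y1 = ty * height // target, (ty + 1) * height // target
            block = [pixels[y * width + x]
                     for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) // len(block))
    return hashlib.md5(bytes(cells)).hexdigest()
```

It also shows why the jpg/raw pair collided: two renderings of the same shot reduce to (nearly) the same small image, so they hash the same.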
Comment 57 Thomas Van Machelen 2008-06-08 19:08:08 UTC
Created attachment 112371 [details] [review]
Duplicates

Patch that should make import of jpg/raw combinations work
Comment 58 Thomas Van Machelen 2008-06-09 06:18:58 UTC
Created attachment 112392 [details] [review]
Duplicates patch, with smarter comparison

Ignore the previous patch; it completely ignores the embedding of tags in the xmp data.  This one should work better
Comment 59 Maxxer 2008-06-09 18:33:05 UTC
Thanks for the update, this patch handles raw+jpeg correctly! Great!
I did just some simple tests:
* importing the same picture
* importing the same pic with a different filename
* importing the same pic without EXIF data (jhead -purejpg)
* importing some pics, a few dupes and a few not

All these tests went fine.
The only _trivial_ bug is that the last roll is always shown, even if no pic was imported (all being dups). You probably forgot to subtract the skipped pics from the import count.
Comment 60 Maxxer 2008-06-10 06:29:18 UTC
I also tried to import a picture, make a copy, apply a small modification to it (say luminosity or such), and import that. The pic was correctly imported as new.
Comment 61 Thomas Van Machelen 2008-06-10 19:55:17 UTC
(In reply to comment #59)
> all these tests went fine.
> The only _trivial_ bug is that the last roll is always show, even if no pic was
> imported (being dup). you probably miss to remove the skipped pics from the
> import count.
> 

Hmmm, what does f-spot do when you import from an empty folder?  Does it create a roll as well?

Comment 62 Thomas Van Machelen 2008-06-27 17:34:23 UTC
Created attachment 113543 [details] [review]
Duplicates patch update N

This patch fixes some issues with the previous version:
1. calculation of md5s in the updater happens asynchronously
2. no dummy roll is created when all photos in an import are dups
3. versions also have md5 sums and are checked during duplicate detection
4. no crash on file not found

Remaining: after modification of a picture (b&w, red-eye) the md5 sum should be updated as well
Comment 63 Thomas Van Machelen 2008-06-28 20:46:26 UTC
Created attachment 113581 [details] [review]
Almost there duplicates detection patch

Same as previous patch but with improved version handling and md5 updating on photo edits
Comment 64 Steph Meslin-Weber 2008-07-02 19:50:34 UTC
Just applied this against head and it applies with a few offsets :-)

Would it be possible to add detection and filtering of duplicates in the existing database, as opposed to just on import?

Thanks for the work!
Comment 65 Thomas Van Machelen 2008-07-03 07:38:53 UTC
(In reply to comment #64)
> Just applied this against head and it applies with a few offsets :-)
> 
> Would it be possible to add detection and filtering of duplicates in the
> existing database, as opposed to just on import?
> 

I discussed this with Stephane, and we both agree that this kind of functionality can be added as an extension.  Once the datamodel changes are in place, it could be easy to build an extension on top of it that does this exactly.
Comment 66 Steph Meslin-Weber 2008-07-03 10:56:08 UTC
I'll look forward to that then :-) I have 45k photos of which I'm fairly certain 5k are duplicates; reimporting everything just to benefit from the dedupe is not a process I particularly look forward to, retagging would take weeks.

Two issues that came up while trying out the patch:

1) The initial DB upgrade and initialisation took 18 minutes with the UI frozen and no progress indication (100% CPU + 90% disk)
2) f-spot consumed 3GB of RAM during the background md5 hashing, then failed due to insufficient memory (I have 3GB RAM + 8GB swap, but the swap was ignored). Restarting f-spot allowed the process to continue where it left off, finishing again at 2.5GB of used RAM.

Not related to this patch, but the lack of progress indication for background tasks is a bit problematic for large photo collections :-)
Comment 67 Thomas Van Machelen 2008-07-04 06:44:15 UTC
Created attachment 113959 [details] [review]
A bit closer again

Same as previous, but the 18 minutes of waiting at job creation time shouldn't happen anymore, as it now goes in one insert.  I also added some explicit GC; could you check whether it makes any difference?
Comment 68 Steph Meslin-Weber 2008-07-08 15:17:07 UTC
That's infinitely better on the startup time:

[Info  15:52:18.685] Starting new FSpot server
Updating F-Spot Database
photos_temp - photo_versions_temp
Updated database from version 14 to 15
Database updates completed successfully.
[Debug 15:52:24.384] Db Initialization took 5.477514s
[Debug 15:52:24.867] Query: SELECT photos.id, photos.time, photos.uri, photos.description, photos.roll_id, photos.default_version_id, photos.rating, photos.md5_sum FROM photos  WHERE photos.id NOT IN (SELECT photo_id FROM photo_tags WHERE tag_id = 2) ORDER BY photos.time
[Debug 15:52:26.371] Query took 1.50316s
[Debug 15:52:26.410] Query: SELECT photos.id, photos.time, photos.uri, photos.description, photos.roll_id, photos.default_version_id, photos.rating, photos.md5_sum FROM photos  WHERE photos.id NOT IN (SELECT photo_id FROM photo_tags WHERE tag_id = 2) ORDER BY photos.time
[Debug 15:52:27.355] Query took 0.945111s
[Info  15:52:28.516] Starting BeagleService
[Debug 15:52:28.517] BeagleService startup took 2.4E-05s
[Debug 15:52:28.716] Calculating Hash 1...

Memory usage doesn't seem to have been affected by the latest patch:

1500 hashes, 197MB
2000 hashes, 239MB
2500 hashes, 304MB
3000 hashes, 368MB
3500 hashes, 435MB

that's roughly 238MB for 2000 hashes, ~119KB per hash. Is there a pixbuf dispose missing somewhere?
Comment 69 Thomas Van Machelen 2008-07-10 15:44:20 UTC
Created attachment 114320 [details] [review]
Duplicates patch, this time without memory increasing

Same as before, only this time it should no longer increase memory.  It turned out the Pixdata in the PixbufSerializer wasn't being freed (the API is a bit strange).  On my machine, memory stays under 40MB with 500+ photos and doesn't seem to increase over time.
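The real fix was freeing the unmanaged Pixdata buffer, but the general principle behind flat memory usage during hashing can be illustrated by streaming each file through the digest in fixed-size chunks instead of holding whole images in memory. This is a simplified Python sketch; f-spot itself hashes serialized pixbuf data, not raw file bytes:

```python
import hashlib
import os
import tempfile

# Illustrative sketch only: hash a file in fixed-size chunks so memory
# usage stays constant regardless of file size.
def md5_of_file(path, chunk_size=64 * 1024):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Tiny demo with a temporary file
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'hello world')
tmp.close()
digest = md5_of_file(tmp.name)
os.unlink(tmp.name)
print(digest)  # 5eb63bbbe01eeed093cb22bb8f5acdc3
```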
Comment 70 Steph Meslin-Weber 2008-07-13 22:19:18 UTC
Works as advertised! Total memory usage only went up 20MB (153MB->173MB) after 45k files.

Thanks again for the work :)
Comment 71 Johan Walles 2008-08-27 16:28:04 UTC
Stephane, do you have any objections to committing this patch?

  Regards //Johan
Comment 72 Steph Meslin-Weber 2008-08-27 16:32:47 UTC
No objections at all - can't wait for the plugin version of it either :)
Comment 73 Thomas Van Machelen 2008-09-06 19:35:58 UTC
Committed a slightly modified version of the patch in r4313.  Keeping the bug open until some more testing has happened and next version of f-spot is released.
Comment 74 Nils Pickert 2008-09-08 20:54:59 UTC
Importing 14 CRW files with this patch now took more than an hour, as mentioned on IRC on 2008/09/08

I am reading from a CF card in a USB card reader; my photo dir resides on a NAS mounted via NFS over a Gigabit link. It is definitely not network trouble or a slow USB connection, as both worked fast before. Editing pictures after the import is also as fast as before. On the first try some other processes were also occupying the CPU, but I re-tested with an idle CPU. F-Spot takes 100% CPU for a long time while calculating the md5 sums...

I did run f-spot (without importing) for a couple of hours before, so all the background md5 hashing for the existing database should be done. As f-spot only uses very little CPU when not importing, I assume the background hashing is already finished. Running sqlite3 ~/.gnome2/f-spot/photos.db "select count(*) from jobs", as suggested by sde on IRC, returns 0.

Running import tasks with the svn version from before the md5 patch also worked as usual and did not take that long.
Comment 75 Nils Pickert 2008-09-08 21:35:29 UTC
And... it does not work. I do get duplicates: in addition to the file CRW_XXXX.CRW I now also get CRW_XXXX-1.CRW in my photo dir. It's not showing up in f-spot, but it's on disk...

I have now deleted all but one picture on my CF card. Selecting import, it immediately shows "1 of 1" on the progress bar and the file is copied over, but it then takes a long time (~10 mins) to do anything. It does not show a preview of the file (as it is a dupe) and does not include it in the DB, but it stays on disk.

Comment 76 Thomas Van Machelen 2008-09-09 06:48:07 UTC
Created attachment 118346 [details] [review]
Fix for files remaining on disk

This patch should fix the issue of files being copied but not removed when they are detected as dupes.
Comment 77 Maxxer 2008-09-09 07:14:40 UTC
I'm trying to upgrade my main db (~34k photos) and it's been running non-stop for 30 minutes, having done ~4000 jobs so far.
Given that, shouldn't f-spot justify using the CPU for such a long time? I'm already guessing the first bug for the next release: "f-spot taking 100% cpu". Maybe just a dialog saying that the db needs an upgrade and may take CPU time for a while, depending on the db size. What do you think?

now two bugs:
1. i ran f-spot while my photo archive was disconnected (it's on usb). the jobs executed between f-spot startup and usb disk connect obviously went wrong, nothing bad so far. the bad thing is that those pictures will never get their md5 hash. so, shouldn't a query be run at startup resubmitting jobs for the photos without md5?
2. i think you lose photo version infos in src/PhotoStore:631. you use "version" but don't update that back to "photo" before commit. in fact i have no picture in photo_versions with an md5.
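The startup check suggested in point 1 could look roughly like the following Python/sqlite3 sketch. The photos and jobs table layouts and the 'CalculateHashJob' name are assumptions, not the actual f-spot schema:

```python
import sqlite3

# Hypothetical sketch: at startup, re-enqueue a hash job for every photo
# that still has no md5 sum. Table/column/job names are assumptions.
def resubmit_md5_jobs(conn):
    missing = conn.execute(
        "SELECT id FROM photos WHERE md5_sum IS NULL OR md5_sum = ''"
    ).fetchall()
    conn.executemany(
        "INSERT INTO jobs (job_type, job_options) VALUES ('CalculateHashJob', ?)",
        [(str(photo_id),) for (photo_id,) in missing],
    )
    return len(missing)

# Tiny demo: two of three photos are missing their hash
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, md5_sum TEXT)")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, job_type TEXT, job_options TEXT)")
conn.executemany("INSERT INTO photos (md5_sum) VALUES (?)", [("abc",), (None,), ("",)])
count = resubmit_md5_jobs(conn)
print(count)  # 2
```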
Comment 78 Thomas Van Machelen 2008-09-10 16:11:39 UTC
(In reply to comment #75)
> And... it does not work. I do get duplicates: additional to the file
> CRW_XXXX.CRW I also get now CRW_XXXX-1.CRW in my photo dir. It's not showing up
> in f-spot but it's on disk...
> 
I committed a fix for this in svn; can you please test?

Also, I added some debug output that should give some more info on the md5 summing of the pictures; could you please run the svn version with the --debug flag and paste the output here?

Comment 79 Nils Pickert 2008-09-10 17:34:06 UTC
> I committed a fix for this in svn; can you please test?
> 
> Also i added some debug that should give some more info on the md5 summing of
> the pictures; could please run the svn version with the --debug flag and paste
> the output here?

I will attach the debug output. As your patch seemed to work, I ran f-spot without --debug and everything was OK (besides the import being painfully slow): no duplicate on disk. When I ran f-spot with --debug, I got an error message "System.NullReferenceException: Object reference not set to an instance of an object" after importing a duplicate.

Comment 80 Nils Pickert 2008-09-10 17:35:59 UTC
Created attachment 118444 [details]
Debug output from Nils - painfully slow import + error at duplicate
Comment 81 Thomas Van Machelen 2008-09-11 11:55:54 UTC
Nils, could you send me and Stephane your db?  We promise we won't do anything with it except fix your problem.  :-)

thomas dot vanmachelen at gmail dot com
Comment 82 Nick Brown 2008-09-11 13:20:44 UTC
Will this find existing duplicates in your db and allow their removal?
Comment 83 Thomas Van Machelen 2008-09-14 08:49:48 UTC
(In reply to comment #77)

> now two bugs:
> 1. i ran f-spot while my photo archive was disconnected (it's on usb). the jobs
> executed between f-spot startup and usb disk connect obviously went wrong,
> nothing bad so far. the bad thing is that those pictures will never get their
> md5 hash. so, shouldn't a query be run at startup resubmitting jobs for the
> photos without md5?

That is indeed a problem, but it's hard to tell whether the picture was really deleted (in which case the md5 hashing would keep going on forever) or whether the photo archive was just disconnected.  Need to think about it...

> 2. i think you lose photo version infos in src/PhotoStore:631. you use
> "version" but don't update that back to "photo" before commit. in fact i have
> no picture in photo_versions with an md5.
> 

I've got a fix ready for that; it should land in svn soon.
Comment 84 Stephane Delcroix 2008-09-15 08:33:55 UTC
the slow import issue is fixed
Comment 85 Stephane Delcroix 2008-09-15 08:35:18 UTC
thomas, it doesn't detect dupes on import from card or camera (the CameraFileSelectionDialog)
Comment 86 Thomas Van Machelen 2008-09-15 20:36:43 UTC
Created attachment 118788 [details] [review]
Add duplicate detection support to camera import

Stephane, here is a version that allows you to skip duplicates when importing from camera.  Can you test and report?  It should not copy over any duplicate files from the camera to the photos directory; and properly detect photos that were imported before.

The checking happens explicitly in the CameraFileSelectionDialog class, as otherwise the photos would first be copied and only then detected as duplicates, causing the files to remain on disk.
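The check-before-copy idea can be sketched like this (Python, illustrative only; import_if_new and the in-memory hash set are hypothetical names, and the real code hashes serialized pixbuf data rather than raw file bytes):

```python
import hashlib
import os
import shutil
import tempfile

# Hypothetical sketch of check-before-copy: hash the source file first
# and skip the copy entirely when the hash is already known, so no
# duplicate file ever lands in the photo directory.
def import_if_new(src, dest_dir, known_hashes):
    h = hashlib.md5()
    with open(src, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    digest = h.hexdigest()
    if digest in known_hashes:
        return None  # duplicate: never copied, nothing left to clean up
    dest = os.path.join(dest_dir, os.path.basename(src))
    shutil.copy2(src, dest)
    known_hashes.add(digest)
    return dest

# Tiny demo: importing the same file twice copies it only once
src_dir, dest_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
src = os.path.join(src_dir, 'IMG_0001.JPG')
with open(src, 'wb') as f:
    f.write(b'raw image bytes')
known = set()
first = import_if_new(src, dest_dir, known)
second = import_if_new(src, dest_dir, known)
print(first is not None, second)  # True None
```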
Comment 87 Thomas Van Machelen 2008-09-16 06:00:53 UTC
Created attachment 118811 [details] [review]
Slightly better version of the camera import
Comment 88 Stephane Delcroix 2008-09-16 07:38:18 UTC
works, commit
Comment 89 Michael Van Dorpe 2009-01-14 16:01:52 UTC
Could someone please confirm whether comment 88 from Stephane means that this issue has been fixed and will be in the next f-spot release? If so, should we open new bugs for any issues that are not fixed by this patch?
Comment 90 Maxxer 2009-01-14 16:37:14 UTC
(In reply to comment #89)
> Could someone please confirm whether comment 88 from Stephane means that this
> issue has been fixed and will be in the next f-spot release? 

committed in r4385

> If so, should we open new bugs for any issues that are not fixed by this patch?

Before filing new bugs, please test with SVN or STABLE SVN (which is 0.5.0.3 with a few other patches). Some fixes have been committed there, but there has been no release yet.

http://svn.gnome.org/viewvc/f-spot/branches/FSPOT_0_5_0_STABLE/
Comment 91 Robert Pollak 2009-10-22 15:23:58 UTC
Shouldn't this bug be set to FIXED?
Comment 92 Ruben Vermeersch 2009-10-22 15:26:53 UTC
Done.