GNOME Bugzilla – Bug 336673
allow users to add metadata from outside
Last modified: 2007-12-02 02:55:30 UTC
It's a commonly asked question: how do you add extra metadata to already indexed items? The metadata in question is not present in the file; it is obtained from outside sources such as an sqlite db (the f-spot/digikam store), from information obtained later (e.g. scripts which fetch lyrics/movie info from the web), or from some random comment someone thinks should be stored with the file.
Currently the filters pick up metadata based on what is available during indexing. The problem is that if the extra metadata comes later, beagle doesn't see it. E.g. if digikam adds a tag to an image, only the sqlite db is changed and the image is untouched, so the newly added tag won't be seen by beagle. Or, if a user decides to add a personal comment to an arbitrary file, there is no sane way to do this.
Using .filename.xmp files for metadata would be nice, except that beagle needs to be told to pick up the metadata from .filename.xmp. A hacky way to do that is to _touch_ the relevant file when the metadata file is created so that beagle picks it up, at which point the relevant filters (with appropriate modifications to read the new metadata) will index the new metadata.
All it requires is a sane way for outside apps to tell the daemon about extra properties it needs to add to the index for a particular object. One way to do it would be to create an Indexable with type PropertyChange and submit it to the scheduler.
Issues with this solution:
1) What happens when the main object is moved/renamed? Should the extra metadata be carried along forever, i.e. should the metadata be stored in the primary or the secondary index? I prefer the primary index, but there might be cases where moving the file to a different directory invalidates the metadata.
2) Currently the only way for other apps to send an indexable to beagle is through the indexingservice. We need to find a way for apps to specify which backend should handle it. Or, since such metadata only makes sense for the Files backend, enable the Files backend to receive such requests, or allow forwarding specified indexingservice messages to the Files backend.
Some more thoughts. This metadata index could also be used to store information about which results a user selected (and then boost the rank of previously selected results in subsequent searches). All in all, this might be easy to do with a third index, UserDataIndex. Some of the questions that need to be answered:
1) Should there be a way to change/remove existing properties in UserDataIndex?
2) Should the data move with the file's location or stay persistent across file renames and moves?
In my mind, the best way to do this would be integration with nautilus' existing metadata (notes) system.
We need that as well, but it shouldn't be the only way to do it, for several reasons:
1) Photoshop (and presumably other CS apps) create XMP sidecar files, so supporting them would be great
2) not everyone uses nautilus
The reason I'm interested is that I've created XMP sidecar files for my movie collection that contain all of the imdb info for each movie (director, cast, plot summary, etc.). This would be awkward to do using the nautilus method because the movies are on a server and accessed from several machines over NFS; I wouldn't want to have to re-tag the movies in nautilus on each machine.
It looks pretty cool in action: http://cs.nott.ac.uk/~ajm/pimped-beagle-search.png
Nautilus isn't the right way, for the reasons Alex mentioned and many more. Let me try to describe the situation in detail. Currently beagle has two types of indexes:
1) Primary -> data obtained from the file that can change if the file moves
2) Secondary -> data obtained from the file that may not change if the file moves
There is another kind of metadata:
* data obtained from external sources (which may or may not move with the file). This includes
- tags (from an external application database or user-specified tags)
- search feedback results (whether this result was picked by the user in some search)
- other metadata supplied by the user (like Alex's patch)
and more-crazy-ideas
- could dynamically query other tagging/metadata systems for metadata
I don't think implementing this ExternalIndex as part of the Primary and Secondary indexes is a good idea, but I don't have strong reasons to back that up.
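To make the open question above concrete (does externally supplied metadata follow a file when it moves?), here is a minimal Python sketch of a separate external store. It is a toy model only, not Beagle code: the `ExternalIndex` class, its method names and the `carry` flag are all hypothetical names invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Property = Tuple[str, str]  # (key, value)


class ExternalIndex:
    """Toy store for externally supplied properties, keyed by URI (hypothetical)."""

    def __init__(self) -> None:
        self._props: Dict[str, Set[Property]] = defaultdict(set)

    def add(self, uri: str, key: str, value: str) -> None:
        self._props[uri].add((key, value))

    def remove(self, uri: str, key: str) -> None:
        # Drop every value stored under this key for the given URI.
        self._props[uri] = {(k, v) for (k, v) in self._props[uri] if k != key}

    def get(self, uri: str) -> Set[Property]:
        return set(self._props.get(uri, ()))

    def on_move(self, old_uri: str, new_uri: str, carry: bool = True) -> None:
        # carry=True: metadata follows the file; carry=False: it is discarded.
        props = self._props.pop(old_uri, set())
        if carry and props:
            self._props[new_uri] |= props


def merged_properties(file_props: Iterable[Property],
                      external: ExternalIndex, uri: str) -> Set[Property]:
    """Combine properties extracted from the file with external ones."""
    return set(file_props) | external.get(uri)
```

Keeping the store separate means a rename only touches one dictionary entry, and the carry-or-drop policy can be decided per source (tags might move with the file, directory-specific notes might not).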
Are people still thinking about implementing some of this? Getting XMP sidecar files working would make lots of cool things possible!
There is an idea of a separate metadata store floating in the air (and in the minds and TODO list of Joe and Lukas). It will be easier to handle external metadata after that is implemented.
Joe finally added the infrastructure to add metadata from outside. http://svn.gnome.org/viewcvs/beagle?rev=3536&view=rev I haven't seen the code, so I don't yet know how to use the API. NautilusMetadataQueryable might give some hints. The missing bits of API, examples and tools will appear soon.
Yeah, this is just infrastructure in the core. You can write backends which generate only metadata for items in other backends now. There's not yet any way to do this through any sort of external API -- that's a logical next step, though -- so external applications could add metadata. (This may be what we want to do with F-Spot integration, for instance, rather than poking at its database.)
Cool, so how do I go about getting XMP sidecar files working? Either FilterXmp needs to be able to inject properties from its XMP file into the results for the real file, or the filter base class needs to handle the XMP file, in which case FilterXmp only needs to tell the backend to rescan the real file any time the XMP file changes. Or the file system backend needs to handle XMP files itself. Is any of this close to being the "right way" to do it, or does all this new code provide a better way?
Hmm, I hadn't thought about doing this from the filter side of things, but this might be possible. We'd have to play around with it a little, and maybe tweak some things to get it working. How does an XMP sidecar file reference the original? Would we ever deal with XMP sidecar files outside of the file system backend? We might be able to do this with child indexables. Let's say you have foo.txt and foo.txt.xmp. When processing the XMP sidecar, the filter could create a property change indexable, which set the Uri to "foo.txt", and add it as a child to the XMP file itself. I'm not totally sure if that would work -- it might ultimately make foo.txt a child of foo.txt.xmp, which would be wrong -- but it's worth testing.
Is XMP a standard way of doing such things ? If not, I am a bit worried that we might be mis-interpreting people's data. OTOH, with the infrastructure code and probably a little more API support, it will be possible for users to keep their own databases, e.g. sqlite or one big xml file and whatever program they use to get metadata can just add/update the information in beagle ?
(In reply to comment #11)
> Is XMP a standard way of doing such things ? If not, I am a bit worried that we
> might be mis-interpreting people's data. OTOH, with the infrastructure code and
XMP is used by all the good raw photo apps (photoshop, lightroom, aperture) so it's something beagle needs to support
> probably a little more API support, it will be possible for users to keep their
> own databases, e.g. sqlite or one big xml file and whatever program they use to
> get metadata can just add/update the information in beagle ?
True, but that doesn't mean beagle can't support sidecar files as well. And having sidecar files means I don't have to tag the same file twice if it exists on an NFS server and I want to search for it from two PCs.
(In reply to comment #10)
> Hmm, I hadn't thought about doing this from the filter side of things, but this
> might be possible. We'd have to play around with it a little, and maybe tweak
> some things to get it working.
>
> How does an XMP sidecar file reference the original? Would we ever deal with
> XMP sidecar files outside of the file system backend?
it's just named the same as the file that it references and has a .xmp ending (although the xmp files photoshop generates do also include the filename in one of the nodes)
> We might be able to do this with child indexables. Let's say you have foo.txt
> and foo.txt.xmp. When processing the XMP sidecar, the filter could create a
> property change indexable, which set the Uri to "foo.txt", and add it as a
> child to the XMP file itself. I'm not totally sure if that would work -- it
> might ultimately make foo.txt a child of foo.txt.xmp, which would be wrong --
> but it's worth testing.
it could be done rather simply (if a bit hacky) if a filter for one file (the xmp) could ask beagle to re-index another file (the file the xmp references)
or you could make the file system monitor detect .xmp files and put the real file into the queue instead of the .xmp and then have the filter base class try to load properties from an xmp file
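The second suggestion above (have the monitor queue the real file instead of the .xmp) depends only on the naming convention described in this comment: the sidecar is the original name plus a `.xmp` ending, as in foo.txt / foo.txt.xmp. A minimal Python sketch of that mapping follows. The function names are hypothetical, nothing here is Beagle API, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath
from typing import Optional

SIDECAR_SUFFIX = ".xmp"


def target_for_sidecar(path: str) -> Optional[str]:
    """Return the file a sidecar describes, or None if path is not a sidecar.

    Assumes the convention from the thread: "foo.txt.xmp" describes "foo.txt".
    """
    p = PurePosixPath(path)
    if p.suffix.lower() != SIDECAR_SUFFIX:
        return None
    # Strip only the final ".xmp" so "foo.txt.xmp" becomes "foo.txt".
    return str(p.with_suffix(""))


def path_to_enqueue(changed_path: str) -> str:
    """Pick the path to queue for indexing when changed_path changes."""
    target = target_for_sidecar(changed_path)
    return target if target is not None else changed_path
```

A real implementation would also need to check that the target file exists, and to handle deletion of the sidecar, which is exactly the part the later comments describe as hard.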
> XMP is used by all the good raw photo apps (photoshop, lightroom, aperture) so
> it's something beagle needs to support
And F-Spot! We actually already support XMP inside image files. We can base sidecar support on this code.
> it's just named the same as the file that it references and has a .xmp ending
> (although the xmp files photoshop generates do also include the filename in one
> of the nodes)
Ok, that helps.
> it could be done rather simply (if a bit hacky) if a filter for one file (the
> xmp) could ask beagle to re-index another file (the file the xmp references)
>
> or you could make the file system monitor detect .xmp files and put the real
> file into the queue instead of the .xmp and then have the filter base class try
> to load properties from an xmp file
These are both possibilities too. Care to hack them up? :)
Let me just tell you, it ain't easy. The difficulty arises because it's easy to generate one or multiple searchable objects from one physical file but not generally possible to do the opposite. You also need to make sure that when you delete the xmp file or modify it, the index is updated correctly. And you would not like to always re-index the original file because you changed only the xmp file. Anyway, I spent a good evening trying to straighten this out and I only got "half" of it working :P. I checked in the working part of the code if anyone is interested.
Alex, I need an xmp parser at this point (not a FilterXMP). Most likely some of the already existing code from beagle/Util/F-Spot could be re-used but frankly I don't care. All I need is a class
    class XmpParser {
        // path_to_xmp_file: path of the .xmp sidecar to parse
        public XmpParser (string path_to_xmp_file) { ... }
        public ArrayList GetProperties () { ... }
    }
where GetProperties() will return a list of Beagle.Property (ideally, I would like ArrayList to be replaced with IEnumerable and using "yield return" enumerators to incrementally return Properties, but that can wait).
The XmpParser should _only_ parse the xmp file passed to its constructor and not try to read information from the corresponding jpeg/image file.
I will work on this again on next weekend, so it will be wonderful if you can give me something before that.
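The interface requested above (parse only the sidecar, hand back properties incrementally) can be illustrated with a small Python generator, which plays the role of the C# "yield return" enumerator. This is a sketch under assumptions, not the parser attached to this bug: `iter_xmp_properties` is a hypothetical name, it takes the XMP text rather than a path, it returns plain (name, value) tuples instead of Beagle.Property objects, and it only handles simple attribute, text and rdf:li values.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def iter_xmp_properties(xmp_text: str) -> Iterator[Tuple[str, str]]:
    """Lazily yield (qualified_name, value) pairs from an XMP packet.

    Names are in ElementTree's "{namespace}local" form. Only the text that
    is passed in is parsed; no image or other file is ever opened.
    """
    root = ET.fromstring(xmp_text)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # Properties written as attributes, e.g. dc:format="image/jpeg".
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF_NS}}}"):  # skip rdf:about etc.
                yield name, value
        # Properties written as child elements.
        for child in desc:
            items = child.findall(f".//{{{RDF_NS}}}li")
            if items:
                # rdf:Bag / rdf:Seq / rdf:Alt: one property per list item.
                for li in items:
                    if li.text and li.text.strip():
                        yield child.tag, li.text.strip()
            elif child.text and child.text.strip():
                yield child.tag, child.text.strip()
```

Because it is a generator, the caller can stop early or filter properties as they arrive without building the whole list first.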
Created attachment 86771 [details] XMP Parser Simple XMP Parser
Created attachment 86772 [details] Sample XMP File Sample XMP file taken from http://www.figuiere.net/hub/blog/?Exempi
Created attachment 86773 [details] Another Sample XMP File Another XMP file, this is one of mine.
(In reply to comment #14)
> where GetProperties() will return a list of Beagle.Property (ideally, I would
> like ArrayList to be replaced with IEnumerable and using "yield return"
> enumerators to incrementally return Properties, but that can wait).
>
> The XmpParser should _only_ parse the xmp file passed to its constructor and
> not try to read information from the corresponding jpeg/image file.
>
> I will work on this again on next weekend, so it will be wonderful if you can
> give me something before that.
>
I haven't had time to see how the yield keyword works yet so for now the parser just returns a List<Beagle.Property>, enjoy!
Phew! r3677 - XMP support is more or less there. From the commit message:
"Wrapping up XMP patch. Several features are missing like renaming of files (xmp or renamed to/from some file with matching xmp) and deletion of only xmp files.
See the comments for details of the patch. Also changed the bad design, so now whenever a matching is found (e.g. a file created for a matching xmp or an xmp created for a matching file), the xmp is properly indexed.
Based upon limited testing results, the implementation is a correct one but not the best even a fast one. Please test."
Alex, can you give it some testing ? Not stress testing, but with a few files only :). Try the usual operations, creating xmp after original file, creating original after xmp file, updating either original or xmp file. Deletes/renames don't work yet.
Btw, the xmp extractor was extracting whole lot of properties from the xmp files. I trimmed the list quite a bit and retained only the ones which look like they can be queried or reported. The file is checked in as beagled/FileSystemQueryable/XmpFile.cs
If this approach turns out to be buggy, there is another approach that I have in mind. I will get this feature done, come what may :)
(In reply to comment #19)
> Phew! r3677 - XMP support is more or less there. From the commit message:
>
> "Wrapping up XMP patch. Several features are missing like renaming of files
> (xmp or renamed to/from some file with matching xmp) and deletion of only xmp
> files.
just to be clear, you mean If I delete the xmp file myself the properties will still be searchable on the main file? Or are you saying that beagle might delete my xmp files!!
> See the comments for details of the patch. Also changed the bad design, so now
> whenever a matching is found (e.g. a file created for a matching xmp or an xmp
> created for a matching file), the xmp is properly indexed.
> Based upon limited testing results, the implementation is a correct one but not
> the best even a fast one. Please test."
>
> Alex, can you give it some testing ? Not stress testing, but with a few files
> only :). Try the usual operations, creating xmp after original file, creating
> original after xmp file, updating either original or xmp file. Deletes/renames
> don't work yet.
I'm busy writing a paper at the moment but I'll try and give it a workout later in the week.
> Btw, the xmp extractor was extracting whole lot of properties from the xmp
> files. I trimmed the list quite a bit and retained only the ones which look
> like they can be queried or reported. The file is checked in as
> beagled/FileSystemQueryable/XmpFile.cs
Personally I think that absolutely all properties should be indexed for all files, unless there is some performance issue in beagle that makes this impractical.
But if you are going to trim the amount of properties down then we should do it with a blacklist rather than a whitelist, that way it is easy for people (like me) who want to put some arbitrary metadata on their files.
> If this approach turns out to be buggy, there is another approach that I have
> in mind. I will get this feature done, come what may :)
>
I hope so, I can't wait to get my imdb-xmp beagle search working properly: http://cs.nott.ac.uk/~ajm/pimped-beagle-search.png
> just to be clear, you mean If I delete the xmp file myself the properties will
> still be searchable on the main file? Or are you saying that beagle might
> delete my xmp files!!
heh :-), good concern. No, beagle won't pro-actively delete any xmp file; the properties will be searchable on the main file.
> I'm busy writing a paper at the moment but I'll try and give it a workout later
> in the week.
Sure, whenever. This will take a few iterations to get it correct.
> > Btw, the xmp extractor was extracting whole lot of properties from the xmp
> > files. I trimmed the list quite a bit and retained only the ones which look
> > like they can be queried or reported. The file is checked in as
> > beagled/FileSystemQueryable/XmpFile.cs
>
> Personally I think that absolutely all properties should be indexed for all
> files, unless there is some performance issue in beagle that makes this
> impractical.
>
> But if you are going to trim the amount of properties down then we should do it
> with a blacklist rather than a whitelist, that way it is easy for people (like
> me) who want to put some arbitrary metadata on their files.
I had that feeling. The code to remove some properties is marked with a "FIXME" and I added it there mainly for testing purposes. I left a lot of debug output turned on in the commit and the huge list was causing a bit of trouble in reading the log file. That was the main reason. The other being, I saw one property which looked like the base64 string encoding of something. So I temporarily left the exif:, tiff: etc. properties and removed the others. Once everything is good and working, we can decide what to do with the other properties. I am not biased towards any side.
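The whitelist versus blacklist trade-off discussed above can be shown in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the property names in the usage note are made up, and the actual filtering code in XmpFile.cs is not reproduced here.

```python
from typing import Iterable, List, Tuple

Property = Tuple[str, str]  # (prefixed name, value), e.g. ("exif:FNumber", "2.8")


def keep_whitelisted(props: Iterable[Property],
                     allowed_prefixes: Iterable[str]) -> List[Property]:
    """Whitelist: keep only properties whose name starts with an allowed prefix."""
    allowed = tuple(allowed_prefixes)
    return [(k, v) for (k, v) in props if k.startswith(allowed)]


def drop_blacklisted(props: Iterable[Property],
                     blocked_prefixes: Iterable[str]) -> List[Property]:
    """Blacklist: keep everything except properties with a blocked prefix."""
    blocked = tuple(blocked_prefixes)
    return [(k, v) for (k, v) in props if not k.startswith(blocked)]
```

With a whitelist of `exif:` and `tiff:`, a user-defined property such as a hypothetical `imdb:director` is silently dropped; with a blacklist that blocks only a known-noisy prefix, the same property survives without any change to the indexer. That is the argument for the blacklist made in the quoted comment.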
Ok. As far as this bug is concerned, it is FIXED. There is now infrastructure to add external metadata and that will be all since beagle is not and does not want to be a metadata store. Extracting metadata from XMP is now partly possible and anyway, there should be a different bug for that.