GNOME Bugzilla – Bug 406461
Correctly recognize mimetypes
Last modified: 2007-02-25 21:03:45 UTC
Beagle uses xdgmime to recognize mimetypes and filter documents accordingly. The freedesktop.org shared-mime-info spec specifies a particular order in which implementations should try to match mime-magic and globs. However, the reference implementation, xdgmime, does not appear to follow that order: it matches against the glob patterns first and, only when that returns multiple candidates, tries to pick the best one by matching magic. There are probably performance reasons why this makes sense for most implementations, since most mime sniffing then does not require reading the file at all. The cost is accuracy, because sniffing by extension is error-prone. That is not a problem for most implementations, but it is for beagle, since beagle actively parses the file based on its mimetype. Also, after figuring out the mimetype beagle reads the file data anyway, so nothing is gained by passive mime sniffing. So it makes sense for beagle to try matching by magic first and, if that fails, to fall back to globs.
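To make the proposed order concrete, here is a minimal sketch of "magic first, glob as fallback". It is not the xdgmime code: the three-entry signature table and the helper names `sniff_magic` and `detect_mime` are assumptions made for illustration, and Python's `mimetypes` module stands in for the glob lookup.

```python
import mimetypes

# Illustrative subset of magic signatures: (offset, bytes, MIME type).
# The real shared-mime-info database holds many more rules, with priorities.
MAGIC_RULES = [
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"GIF87a", "image/gif"),
    (0, b"GIF89a", "image/gif"),
    (0, b"%PDF-", "application/pdf"),
]


def sniff_magic(data):
    """Return the MIME type of the first matching signature, or None."""
    for offset, signature, mime in MAGIC_RULES:
        if data[offset:offset + len(signature)] == signature:
            return mime
    return None


def detect_mime(filename, data):
    """Try the content (magic) first; fall back to the file name (glob)."""
    by_magic = sniff_magic(data)
    if by_magic is not None:
        return by_magic
    by_name, _encoding = mimetypes.guess_type(filename)
    return by_name or "application/octet-stream"
```

With this order, JPEG data in a file named foo.gif is reported as image/jpeg, and the name is only consulted when no signature matches.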
Created attachment 82288: changes to beagle's copy of xdgmime to match first by magic and then by glob
Created attachment 82291: changes to beagle's copy of xdgmime to match first by magic and then by glob
The earlier patch won't work if HAVE_MMAP is defined. This one works.
Confirming!
While I share the frustration of files not being detected by their actual type, this change makes me a little nervous since it'll be different from most other desktop apps. Maybe it would be good to create a test harness for this, so we can see what files change, if there are any errors, and what the impact is on performance.
(In reply to comment #4)
> While I share the frustration of files not being detected by their actual type,
> this change makes me a little nervous since it'll be different from most other
> desktop apps. Maybe it would be good to create a test harness for this, so we
Ah, I meant to comment on this but forgot. This approach won't work reliably. The reason is that the current xdgmime and the current shared-mime-info work hand in hand. Changing xdgmime to _always_ use magic before extension would require significant changes in shared-mime-info, which we would have to make ourselves (since upstream won't), and doing that ourselves is not very safe. For example, with these changes magic is always preferred over extension, so I can rename a jpeg file to foo.gif and still get it correctly recognized as jpeg. But I soon found that tar.gz archives were being recognized as gzip files, whereas the original xdgmime recognizes them as tar-gz files. Several such discrepancies will keep showing up and they will be a pain to fix. It's much better (read: easier) to use a standard implementation and get upstream to fix the problem. Strangely, the Override.xml mechanism mentioned in the shared-mime-info spec is not quite working on my computer. I was thinking of boosting the priority of magic matching for image files.
I would like to point out where using the file extension first will fail. There is a group of file types that are really "container" types. These are files with extensions like ogg, mp4, xml, gz. They all need another step to determine the data inside the container. For the example of a gzipped tar archive with the extension tar.gz, magic will need to be applied twice: once to detect the container, and again to determine what is in the container (a tar archive). If this were implemented it would solve the problem for all container types. This is my vote for magic before extension. I even volunteer to help test it.
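As a rough illustration of applying magic twice to a gzip container, the sketch below checks the gzip signature, decompresses only the first 512 bytes of the payload, and then looks for the "ustar" marker of a tar header. The function name `detect_gzip_container` is hypothetical, the MIME type names follow shared-mime-info naming as I understand it, and other containers (ogg, mp4, xml) would each need their own second step.

```python
import gzip
import io
import zlib

GZIP_MAGIC = b"\x1f\x8b"
TAR_MAGIC_OFFSET = 257   # "ustar" sits at byte 257 of a tar header block
TAR_MAGIC = b"ustar"


def detect_gzip_container(data):
    """Apply magic twice: once to the container, once to its payload."""
    if not data.startswith(GZIP_MAGIC):
        return None  # not a gzip container at all
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as payload:
            head = payload.read(512)  # one tar header block is enough
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupt stream: all we know is the outer type.
        return "application/gzip"
    end = TAR_MAGIC_OFFSET + len(TAR_MAGIC)
    if head[TAR_MAGIC_OFFSET:end] == TAR_MAGIC:
        return "application/x-compressed-tar"
    return "application/gzip"
```

The second pass costs a small decompression, which is the price of not trusting the .tar.gz extension.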
The 'file' command from the file package uses magic extensively and gets it right almost every time (in my experience, a 100% success rate). It might not be a bad idea to pursue that direction. But what I would like to happen is to convince the xdgmime people to follow the spec more closely, and the shared-mime-info people to increase the priority for some of the binary formats like jpg, gif, pdf. Currently, even if the priority is increased and update-mime-database is rerun, mimetypes are still incorrect. See the patch attached to the bug for the details.
(In reply to comment #6)
> I would like to point out where using file extension first will fail. There is
> a group of file types that are really "container" types. These are files with
> extensions like ogg, mp4, xml, gz. They all need another step to determine the
There is some discussion about this in http://lists.freedesktop.org/archives/xdg/2005-November/007537.html
I have a feeling we are not using the correct xdgmime API for detecting the mimetype. I posted a question on the xdg mailing list: http://lists.freedesktop.org/archives/xdg/2007-February/009262.html
I think we should read the first 1K of the file and then use the other xdgmime method, the one that works on data. That will be independent of the file name. It will slow down mimetype detection but will probably be more accurate. 1K should be enough, but I cannot guarantee it.
Found this gem: http://lists.freedesktop.org/archives/xdg/2005-November/007537.html
It describes the way Nautilus detects mimetypes. It is not as simple as what we do in beagle.
I looked at the gnome-vfs implementation and it does seem to read a chunk of data when doing the "slow" MIME check. We should probably try this and see how it does. I would probably do this entirely in unmanaged code: write some glue to mmap() in a single page of data and pass it into the appropriate xdgmime method.
I don't think there is a need to mmap. The xdgmime method expects data in a char* buffer, and this data will not be useful afterwards. Is there any reason why we should mmap? Will it save any IO, for example? The code would probably look like:
- get the buffer size with xdg_mime_get_max_buffer_extents
- get the mimetype with xdg_mime_get_mime_type_for_data
- if it is octet-stream, try our XdgMime method to see if it is text/plain
- if it is one of the container mimetypes, call xdg_mime_get_mime_type_from_file_name
- if the result of *_from_file_name is related to the result of *_for_data, return the mimetype from *_from_file_name
- else, return the mimetype from *_for_data
This can be implemented standalone and tested outside of beagle. The concept should be tested first, so this can be written as a C program; the C# wrappers can be added later. Does anyone want to take this up, in C/C++/C#?
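The steps listed above can be prototyped without xdgmime at all. The following is only a sketch of the decision logic: `mime_for_data`, `mime_from_file_name`, `looks_like_text`, the tiny rule tables, the fixed `MAX_EXTENT` of 4096 and the `CONTAINER_OF` "related to" table are all hypothetical stand-ins for the real xdgmime calls and the shared-mime-info database.

```python
import codecs

MAX_EXTENT = 4096  # stand-in for xdg_mime_get_max_buffer_extents()
UNKNOWN = "application/octet-stream"

# Stand-in for the magic database (prefix, MIME type).
MAGIC_RULES = [
    (b"\x1f\x8b", "application/gzip"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF8", "image/gif"),
    (b"<?xml", "application/xml"),
]

# Stand-in for the glob database; longer suffixes come first.
NAME_RULES = [
    (".tar.gz", "application/x-compressed-tar"),
    (".gz", "application/gzip"),
    (".svg", "image/svg+xml"),
    (".gif", "image/gif"),
    (".jpg", "image/jpeg"),
]

# Hypothetical "is related to" table: name-based type -> its container type.
CONTAINER_OF = {
    "application/x-compressed-tar": "application/gzip",
    "image/svg+xml": "application/xml",
}
CONTAINERS = set(CONTAINER_OF.values())


def mime_for_data(data):
    """Stand-in for xdg_mime_get_mime_type_for_data."""
    for prefix, mime in MAGIC_RULES:
        if data.startswith(prefix):
            return mime
    return UNKNOWN


def mime_from_file_name(name):
    """Stand-in for xdg_mime_get_mime_type_from_file_name."""
    lowered = name.lower()
    for suffix, mime in NAME_RULES:
        if lowered.endswith(suffix):
            return mime
    return UNKNOWN


def looks_like_text(data):
    """Crude text check: no NUL bytes and valid (possibly truncated) UTF-8."""
    if not data or b"\x00" in data:
        return False
    try:
        codecs.getincrementaldecoder("utf-8")().decode(data, final=False)
    except UnicodeDecodeError:
        return False
    return True


def detect(filename, data):
    data = data[:MAX_EXTENT]
    by_data = mime_for_data(data)
    if by_data == UNKNOWN:
        return "text/plain" if looks_like_text(data) else UNKNOWN
    if by_data in CONTAINERS:
        by_name = mime_from_file_name(filename)
        if CONTAINER_OF.get(by_name) == by_data:
            return by_name  # the name refines the container type
    return by_data
```

Under these assumptions a file named backup.tar.gz with gzip content comes out as the tar-gz type again, while a JPEG renamed to foo.gif is still reported as image/jpeg, which covers both cases raised earlier in this bug.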
The memory address returned by mmap() can then be passed directly into xdg_mime_get_mime_type_for_data(). It removes the need to do IO and store things in our own buffer, and if we do it all in unmanaged code we avoid a lot of temporary and pointless memory copy operations.
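For comparison, here is a small sketch of the mmap variant: map at most one page of the file read-only and hand the mapping straight to the sniffer. It is written in Python rather than as unmanaged glue, so it only illustrates the shape of the approach; `sniff`, `sniff_file` and the two-entry signature table are hypothetical, and `sniff` stands in for xdg_mime_get_mime_type_for_data.

```python
import mmap
import os

# Illustrative signatures only; a real sniffer would use the magic database.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff(buf):
    """Match signatures against any sliceable buffer (bytes or mmap)."""
    for signature, mime in SIGNATURES:
        if buf[:len(signature)] == signature:
            return mime
    return None


def sniff_file(path):
    """Map at most one page of the file and sniff it in place."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return None  # an empty file cannot be mapped
        length = min(size, mmap.PAGESIZE)
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ) as page:
            return sniff(page)
```

Only the few bytes compared against each signature are copied out of the mapping; the page itself is never read into a separate buffer.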
I just checked in code which scans the first 4k of the file for this. http://svn.gnome.org/viewcvs/beagle?rev=3495&view=rev
Following fd.o recommendations, the mimetype detector now checks for the user.mime_type extended attribute and uses it if present (r3507). You can use "setfattr -n user.mime_type -v foo/bar <file>" to set the xattr on any file.
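The xattr check can be sketched as follows. This is not the beagle code from r3507: the helper names `mime_from_xattr` and `detect_mime` are hypothetical, and `os.getxattr` is only available on Linux, so the sketch quietly falls back to content sniffing anywhere the attribute cannot be read.

```python
import os

XATTR_NAME = "user.mime_type"


def mime_from_xattr(path):
    """Return the user.mime_type xattr as text, or None if unavailable."""
    if not hasattr(os, "getxattr"):  # platform without xattr support
        return None
    try:
        value = os.getxattr(path, XATTR_NAME)
    except OSError:
        # Attribute not set, or the filesystem does not support xattrs.
        return None
    return value.decode("utf-8", "replace").strip() or None


def detect_mime(path, sniff):
    """Prefer the user-supplied xattr; otherwise call the given sniffer."""
    return mime_from_xattr(path) or sniff(path)
```

Here `sniff` is whatever content-based detector is already in use, passed in as a callable that takes the path.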