Bug 406461 – Correctly recognize mimetypes




Summary:	Correctly recognize mimetypes


Status:	RESOLVED FIXED

Product:	beagle
Classification:	Other
Component:	General
Version:	0.2.16
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	Beagle Bugs
QA Contact:	Beagle Bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-02-10 18:10 UTC by Debajyoti Bera
Modified:	2007-02-25 21:03 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
changes to beagle's copy of xdgmime to match first by magic and then by glob (2.74 KB, patch) 2007-02-10 18:12 UTC, Debajyoti Bera	none
changes to beagle's copy of xdgmime to match first by magic and then by glob (5.27 KB, patch) 2007-02-10 19:16 UTC, Debajyoti Bera	rejected

Description Debajyoti Bera 2007-02-10 18:10:44 UTC

Beagle uses xdgmime to recognize mimetypes and filter documents accordingly. The freedesktop.org shared-mime-info spec specifies a particular order in which implementations should try to match mime-magic and globs. However, the reference implementation, xdgmime, does not appear to follow that order: it matches against the glob patterns first and, only when several types match, uses magic to pick the best one.

There are probably performance reasons why this makes sense for most implementations: most mime-sniffing then does not need to read the file at all. The cost is accuracy, since sniffing by extension is error-prone. That is not a problem for most implementations, but it is for beagle, because beagle actively parses the file based on its mimetype. Also, once it has figured out the mimetype, beagle reads the file data anyway, so nothing is gained by passive mime sniffing.

So, it makes sense for beagle to first try matching by magic and if that fails, then try globs.

Comment 1 Debajyoti Bera 2007-02-10 18:12:20 UTC

Created attachment 82288
changes to beagle's copy of xdgmime to match first by magic and then by glob

Comment 2 Debajyoti Bera 2007-02-10 19:16:07 UTC

Created attachment 82291
changes to beagle's copy of xdgmime to match first by magic and then by glob

The earlier patch won't work if HAVE_MMAP is defined. This one works.

Comment 3 Kevin Kubasik 2007-02-12 13:28:28 UTC

Confirming!

Comment 4 Joe Shaw 2007-02-12 20:56:34 UTC

While I share the frustration of files not being detected by their actual type, this change makes me a little nervous since it'll be different from most other desktop apps.  Maybe it would be good to create a test harness for this, so we can see what files change, if there are any errors, and what the impact is on performance.

Comment 5 Debajyoti Bera 2007-02-12 21:10:45 UTC

(In reply to comment #4)
> While I share the frustration of files not being detected by their actual type,
> this change makes me a little nervous since it'll be different from most other
> desktop apps.  Maybe it would be good to create a test harness for this, so we

Ah, I meant to comment on this but forgot. This approach won't work reliably. The reason is that the current xdgmime and the current shared-mime-info work hand in hand. Changing xdgmime to _always_ use magic before extension would require significant changes to shared-mime-info, which we would have to make ourselves (since upstream won't), and doing that ourselves is not very safe.

For example, with these changes magic is always preferred over extension, so I can rename a jpeg file to foo.gif and still get it correctly recognized as jpeg. But I soon found that tar.gz archives were being recognized as gzip files, whereas the original xdgmime recognizes them as tar-gz files. Several such discrepancies will keep showing up, and it will be a pain to fix them.
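
The trade-off described above can be sketched in a few lines of Python. This is only an illustration: the MAGIC and GLOBS tables are toy stand-ins for the shared-mime-info database, the function names are made up for the example, and glob_first is a simplification of what xdgmime actually does (it only consults magic to break ties between several glob matches).

```python
import fnmatch

# Toy stand-ins for the shared-mime-info database (assumed, not the real tables).
MAGIC = [
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"GIF8", "image/gif"),
    (0, b"\x1f\x8b", "application/x-gzip"),
]
GLOBS = [
    ("*.tar.gz", "application/x-compressed-tar"),
    ("*.gif", "image/gif"),
    ("*.jpg", "image/jpeg"),
    ("*.gz", "application/x-gzip"),
]
OCTET = "application/octet-stream"


def by_magic(data):
    """Return the first type whose byte pattern matches at its offset."""
    for offset, pattern, mime in MAGIC:
        if data[offset:offset + len(pattern)] == pattern:
            return mime
    return None


def by_glob(name):
    """Return the type of the longest matching glob, so *.tar.gz beats *.gz."""
    hits = [(len(p), m) for p, m in GLOBS if fnmatch.fnmatch(name, p)]
    return max(hits)[1] if hits else None


def magic_first(name, data):
    return by_magic(data) or by_glob(name) or OCTET


def glob_first(name, data):
    return by_glob(name) or by_magic(data) or OCTET


jpeg_head = b"\xff\xd8\xff\xe0"
gzip_head = b"\x1f\x8b\x08"
print(magic_first("foo.gif", jpeg_head), glob_first("foo.gif", jpeg_head))
print(magic_first("a.tar.gz", gzip_head), glob_first("a.tar.gz", gzip_head))
```

With these toy tables, magic-first fixes the renamed jpeg but loses the tar.gz distinction, while glob-first does the opposite.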

It's much better (read: easier) to use a standard implementation and get them to fix the problem. Strangely, the Override.xml mechanism mentioned in the shared-mime-info spec is not quite working on my computer. I was thinking of boosting the priority of magic matching for image files.

Comment 6 Cameron Meadors 2007-02-13 16:47:28 UTC

I would like to point out where using the file extension first will fail. There is a group of file types that are really "container" types: files with extensions like ogg, mp4, xml, gz. They all need another step to determine the data inside the container. For example, for a gzipped tar archive with the extension tar.gz, magic needs to be applied twice: once to detect the container, and again to determine what is in the container (a tar archive). If this were implemented, it would solve the problem for all container types.
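
A rough sketch of the "apply magic twice" idea for the gzip case, in Python. The two magic rules (gzip's 1f 8b header and the "ustar" marker at byte 257 of a tar header) are well known; the function names and the 512-byte head size are assumptions made for this example, and other containers (ogg, mp4, xml) would each need their own unwrapping step.

```python
import gzip
import io

TAR_MAGIC_OFFSET = 257  # "ustar" sits at byte 257 of a POSIX tar header
OCTET = "application/octet-stream"


def sniff(data):
    """Single magic pass over a byte buffer (two toy rules only)."""
    if data[:2] == b"\x1f\x8b":
        return "application/x-gzip"
    if data[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + 5] == b"ustar":
        return "application/x-tar"
    return OCTET


def sniff_container(data, head_size=512):
    """Return (outer, inner): the container type and what it holds, if any."""
    outer = sniff(data)
    if outer != "application/x-gzip":
        return outer, None
    # Second pass: decompress only the first block and sniff that.
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
        inner = sniff(gz.read(head_size))
    return outer, inner
```

A truncated or corrupt gzip stream would make the second pass raise, which a real detector would have to catch and treat as "inner type unknown".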

This is my vote for magic before extension.  I even volunteer to help test it.

Comment 7 Debajyoti Bera 2007-02-13 17:43:45 UTC

The 'file' command from the file package uses magic extensively and gets it right almost every time (in my experience, a 100% success rate). It might not be a bad idea to pursue that direction.

But what I would like to happen is to convince the xdgmime people to follow the spec more closely and the shared-mime-info people to increase the priority of some of the binary formats like jpg, gif, pdf. Currently, even if the priority is increased and update-mime-database is rerun, mimetypes are still incorrect. See the patch attached to this bug for the details.

Comment 8 Debajyoti Bera 2007-02-14 17:21:32 UTC

(In reply to comment #6)
> I would like to point out where using file extension first will fail.  There is
> a group of file types that are really "container" types.  These are files with
> extensions like ogg, mp4, xml, gz.  They all need another step to determine the

There is some discussion about this in http://lists.freedesktop.org/archives/xdg/2005-November/007537.html

I have a feeling we are not using the correct xdgmime API for detecting mimetype. I posted a question in the xdg mailing list, http://lists.freedesktop.org/archives/xdg/2007-February/009262.html

I think we should extract the first 1K of the file and then use the other xdgmime method on that data. That will be independent of the file name. This will slow down mimetype detection but will probably be more accurate. 1K should be enough, but I cannot guarantee it.

Comment 9 Debajyoti Bera 2007-02-14 17:30:51 UTC

Found this gem http://lists.freedesktop.org/archives/xdg/2005-November/007537.html

It specifies the way Nautilus detects mimetype. Not as simple as we do in beagle.

Comment 10 Joe Shaw 2007-02-15 16:22:24 UTC

I looked at the gnome-vfs implementation and it does seem to read a chunk of data when doing the "slow" MIME check.  We should probably try this and see how it does.

I would probably do this entirely in unmanaged code: write some glue to mmap() in a single page of data and pass it into the appropriate xdgmime method.

Comment 11 Debajyoti Bera 2007-02-15 16:36:24 UTC

I don't think there is a need to mmap. The xdgmime method expects data in a char* buffer, and this data will not be useful afterwards. Is there any reason why we should mmap? Will that save any IO, for example?

The code would probably look like:
- get the buffer size by xdg_mime_get_max_buffer_extents
- get the mimetype by xdg_mime_get_mime_type_for_data
- if octet-stream, try to use our XdgMime method to see if it is text/plain
- if one of the container mimetypes, call xdg_mime_get_mime_type_from_file_name
  - if *_from_file_name is related to _for_data, return the mime type from *_file_name
  - else, return the mimetype from _for_data
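
As a concept test, the steps above might look roughly like this in Python. Everything here is a stand-in: the tables are toy data, SUBCLASS_OF and CONTAINERS are assumptions about what "related" and "container mimetypes" mean, looks_like_text is a crude placeholder for the XdgMime text/plain check, and the helper names only echo the xdgmime functions they imitate.

```python
import fnmatch

# Toy tables standing in for the shared-mime-info database (assumed data).
MAGIC = [
    (0, b"\x1f\x8b", "application/x-gzip"),
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"GIF8", "image/gif"),
]
GLOBS = [
    ("*.tar.gz", "application/x-compressed-tar"),
    ("*.gz", "application/x-gzip"),
    ("*.gif", "image/gif"),
]
SUBCLASS_OF = {"application/x-compressed-tar": "application/x-gzip"}  # child -> parent
CONTAINERS = {"application/x-gzip"}
OCTET = "application/octet-stream"


def max_buffer_extents():
    """Bytes needed to evaluate every magic rule."""
    return max(off + len(pat) for off, pat, _ in MAGIC)


def mime_for_data(data):
    for off, pat, mime in MAGIC:
        if data[off:off + len(pat)] == pat:
            return mime
    return OCTET


def mime_from_file_name(name):
    hits = [(len(p), m) for p, m in GLOBS if fnmatch.fnmatch(name, p)]
    return max(hits)[1] if hits else OCTET


def looks_like_text(data):
    """Crude check: non-empty and no control bytes besides tab/newline/CR."""
    sample = data[:256]
    return bool(sample) and all(b >= 0x20 or b in b"\t\n\r" for b in sample)


def detect(name, data):
    by_data = mime_for_data(data[:max_buffer_extents()])
    if by_data == OCTET:
        return "text/plain" if looks_like_text(data) else OCTET
    if by_data in CONTAINERS:
        by_name = mime_from_file_name(name)
        if SUBCLASS_OF.get(by_name) == by_data:
            return by_name  # the name refines the container type
    return by_data
```

With these tables, detect("a.tar.gz", ...) on gzip data yields the tar-gz type, while a jpeg renamed to foo.gif is still reported as jpeg.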

This can be implemented standalone and tested outside of beagle. The concept should be tested first, so this can be written as a C program; the C# wrappers can be added later. Does anyone want to take this up in C/C++/C#?

Comment 12 Joe Shaw 2007-02-15 17:01:17 UTC

The memory address returned by mmap() can then be passed directly into xdg_mime_get_mime_type_for_data().  It removes the need to do IO and store things in our own buffer, and if we do it all in unmanaged code we avoid a lot of temporary and pointless memory copy operations.
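
The shape of that approach, sketched in Python rather than unmanaged C. The sniff callback is a hypothetical stand-in for xdg_mime_get_mime_type_for_data(), and the function name is invented for this example. In C the mapped address would be passed straight through; in Python, slicing the map still copies the slice, so this only illustrates the control flow, not the copy savings.

```python
import mmap
import os


def sniff_first_page(path, sniff):
    """Map at most one page of the file read-only and pass it to sniff()."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return sniff(b"")  # an empty file cannot be mapped
        length = min(size, mmap.PAGESIZE)
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ) as page:
            # sniff() must not keep a reference to the map after returning.
            return sniff(page)
```

For example, sniff_first_page(path, lambda buf: bytes(buf[:2]) == b"\x1f\x8b") checks for a gzip header without reading the whole file.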

Comment 13 Joe Shaw 2007-02-23 17:37:39 UTC

I just checked in code which scans the first 4k of the file for this.

http://svn.gnome.org/viewcvs/beagle?rev=3495&view=rev

Comment 14 Debajyoti Bera 2007-02-25 21:03:45 UTC

Following fd.o recommendations, the mimetype detector now checks for the user.mime_type extended attribute and uses it if present (r3507). You can use "setfattr -n user.mime_type -v foo/bar <file>" to set the xattr on any file.
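
A minimal sketch of that lookup order in Python, not the code from r3507. os.getxattr is a real call but exists only on Linux, so it is looked up defensively; the function names and the injectable getxattr parameter are assumptions made so the logic can be tried on filesystems without xattr support.

```python
import os

_GETXATTR = getattr(os, "getxattr", None)  # None on platforms without xattrs


def mime_from_xattr(path, getxattr=_GETXATTR):
    """Return the user.mime_type xattr of path, or None if it is unavailable."""
    if getxattr is None:
        return None
    try:
        value = getxattr(path, "user.mime_type")
    except OSError:
        # Attribute not set, or the filesystem does not support xattrs.
        return None
    return value.decode("utf-8", "replace").strip() or None


def detect_mime(path, fallback, getxattr=_GETXATTR):
    """Prefer the explicit xattr; otherwise fall back to content sniffing."""
    return mime_from_xattr(path, getxattr) or fallback(path)
```

Here fallback would be whatever data-based detector is in use; the xattr simply overrides it when present.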