GNOME Bugzilla – Bug 541236
not detecting exact content type
Last modified: 2018-05-24 11:28:41 UTC
My app has data files in XML format, which I can detect based on the doctype or namespace. This is the snippet from the XML that I install:
    <mime-type type="audio/x-bzt-xml">
      <glob pattern="*.xml" weigth="80"/>
      <magic priority="80">
        <match type="string" value="<buzztard" offset="0:100"/>
        <match type="string" value="http://www.buzztard.org/" offset="0:100"/>
      </magic>
      <sub-class-of type="application/xml" />
      <comment>buzztard song (xml)</comment>
    </mime-type>
The problem is that e.g. the file filters in the GTK file chooser or in RecentChooser don't show any of those (they still see them as application/xml). Also, Nautilus in GNOME 2.20 shows them as the right type after a click, but in GNOME 2.22 it does not do that anymore. Also:
    > gvfs-info -a "standard" melo1.xml
    ...
    attributes:
      standard::name: melo3.xml
      ...
      standard::content-type: application/xml
      standard::fast-content-type: application/xml
But:
    > gnomevfs-info ./melo1.xml | head -n3
    Name : melo1.xml
    Type : Regular
    MIME type : application/xml
    > gnomevfs-info -s ./melo1.xml | head -n3
    Name : melo1.xml
    Type : Regular
    MIME type : audio/x-bzt-xml
I can't say that I think installing an overriding glob for *.xml is a good idea, but it works fine here:
    [mclasen@localhost ~]$ gvfs-info -a standard melo1.xml
    display name: melo1.xml
    edit name: melo1.xml
    name: melo1.xml
    type: regular
    size: 0
    attributes:
      standard::name: melo1.xml
      standard::type: 1
      standard::size: 0
      standard::display-name: melo1.xml
      standard::edit-name: melo1.xml
      standard::copy-name: melo1.xml
      standard::content-type: audio/x-bzt-xml
      standard::icon: GThemedIcon:0x9637978
      standard::fast-content-type: audio/x-bzt-xml
Did you run update-mime-database after installing your mime type?
Yes, my installer runs update-mime-database. I have glib-2.16.3 (Ubuntu 8.04 Hardy Heron).
I don't think I can help more here. Your data works for me.
I have now upgraded to openSUSE 11 and it's still the case. I will look into the code, although any pointers on where to look would be appreciated.
It's definitely broken in glib 2.16.3. It seems to be fixed in HEAD (2.17.5). This is on both Ubuntu Hardy and openSUSE 11.0. It was a bit hard to track down as the code uses no logging at all. Just for the record, this is what does the detection: gio/gcontenttype.c:g_content_type_guess(). I'll do some hack in my app for now.
> It seems to be fixed in HEAD (2.17.5). I'm going to assume this is not an issue in 2.18 anymore, then.
I now have 2.18.2 and it still happens :/ I was not seeing it because of my hack:
    gvfs-info -a standard melo1.xml | grep stan
    standard::name: melo1.xml
    standard::type: 1
    standard::size: 3071
    standard::allocated-size: 4096
    standard::display-name: melo1.xml
    standard::edit-name: melo1.xml
    standard::copy-name: melo1.xml
    standard::content-type: application/xml
    standard::icon: GThemedIcon:0x806e178
    standard::fast-content-type: application/xml
It's definitely also considering my mime-cache:
    $ grep mime/mime.cache strace.log
    stat64("/home/ensonic/.local/share//mime/mime.cache", {st_mode=S_IFREG|0644, st_size=280, ...}) = 0
    stat64("/home/ensonic/.local/share//mime/mime.cache", {st_mode=S_IFREG|0644, st_size=280, ...}) = 0
    open("/home/ensonic/.local/share//mime/mime.cache", O_RDONLY|O_LARGEFILE) = 3
    stat64("/usr/local/share/mime/mime.cache", 0xbfd7636c) = -1 ENOENT (No such file or directory)
    stat64("/usr/share/mime/mime.cache", {st_mode=S_IFREG|0644, st_size=102304, ...}) = 0
    open("/usr/share/mime/mime.cache", O_RDONLY|O_LARGEFILE) = 3
    stat64("/usr/share/gdm/mime/mime.cache", 0xbfd7636c) = -1 ENOENT (No such file or directory)
    stat64("/home/ensonic/buzztard/share/mime/mime.cache", {st_mode=S_IFREG|0644, st_size=744, ...}) = 0
    open("/home/ensonic/buzztard/share/mime/mime.cache", O_RDONLY|O_LARGEFILE) = 3
and it's still only gnomevfs-info -s that gets it right:
    $ gnomevfs-info -s ./melo1.xml | head -n3
    Name : melo1.xml
    Type : Regular
    MIME type : audio/x-bzt-xml
Any idea how I can track it down? I mean, overloading xml types is pretty normal, as there can't be one default application for xml.
Created attachment 134319: add some printf logging
Apparently the file-chooser is doing only shallow matches:
    checking file_name=badwav.bzt, data=(nil)
    return exact match audio/x-bzt
    checking file_name=live.xml, data=(nil)
    return exact match application/xml
    checking file_name=live.xml, data=(nil)
    return exact match application/xml
Also gvfs-info seems buggy, it does a shallow match twice!
    gvfs-info -a "standard" ~/buzztard/share/buzztard/songs/melo1.xml
    checking file_name=melo1.xml, data=(nil)
    return exact match application/xml
    checking file_name=melo1.xml, data=(nil)
    return exact match application/xml
I have reopened the bug, but it might need to be reassigned or split. The mime-type detection might actually work, but gvfs-info and the file-chooser don't use it correctly. I will look at gvfs-info next and see if I can patch that. For the file-chooser it's more complicated: it would need to know from the backend whether, given the set of filters, shallow matching is enough or whether a filter requires data probes for sniffing. I am not sure where that needs to be addressed.
Not sure what you mean by shallow matching. But if you mean that gio prefers filename matches, then that is intentional.
Yes, shallow match = filename match. As far as I understand, the data probe is done if the caller passes some data. Now I need to go up step by step to see where it is decided whether gio/gcontenttype.c:g_content_type_guess() gets only a filename or data too. Doing only filename matching is a bit too naive. Windows does that and it's quite a failure. It's not limited to xml. Other use cases: *.ogg can be video or music (the same goes for other formats like .mp4). This matters for a file filter in a music player.
The behaviour is slightly more complex than only using the filename, but the basic approach is that extension matching is trusted more than sniffing, for two basic reasons: sniffing is *extremely* slow (lots of seeks), and any failure in sniffing is unfixable by the user. However, we do use sniffing if there is uncertainty, for instance if there is no extension match, or if multiple extensions match. The exact behaviour here has been discussed on xdg-list and agreed upon by both the KDE and the GNOME side. However, the freedesktop spec also allows XML namespace sniffing. I don't think we're actually using that in xdgmime/gio atm, but we probably should, and we should use it when sniffing *.xml files.
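The policy described above (trust a unique filename match, sniff only when the filename is ambiguous or unmatched) can be sketched roughly as follows. This is a simplified model in plain C, not the actual gio/xdgmime code; the `Guess` struct and `guess_type()` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch; not a GIO type. */
typedef struct {
    const char *type;  /* chosen MIME type, or NULL if nothing matched */
    int uncertain;     /* non-zero when the result is only a best guess */
} Guess;

/*
 * glob_types / n_globs: MIME types whose glob matched the file name.
 * sniffed: type found by content sniffing, or NULL when no data was
 *          supplied or no magic rule matched.
 */
Guess guess_type(const char *const *glob_types, size_t n_globs,
                 const char *sniffed)
{
    Guess g = { NULL, 1 };

    if (n_globs == 1) {
        /* A unique file name match is trusted; sniffing is skipped. */
        g.type = glob_types[0];
        g.uncertain = 0;
        return g;
    }
    if (n_globs == 0) {
        /* No extension match: fall back to the sniffed type, if any. */
        g.type = sniffed;
        g.uncertain = (sniffed == NULL);
        return g;
    }
    /* Several globs matched: let sniffing pick among the candidates. */
    if (sniffed != NULL) {
        for (size_t i = 0; i < n_globs; i++) {
            if (strcmp(glob_types[i], sniffed) == 0) {
                g.type = sniffed;
                g.uncertain = 0;
                return g;
            }
        }
    }
    /* Sniffing did not settle it: first candidate, flagged uncertain. */
    g.type = glob_types[0];
    return g;
}
```

In this model a caller that passes only a file name gets an uncertain answer whenever two globs match, which is the signal to read some data and ask again.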
Alex, thanks for commenting. I was about to make a test and see if sniffing is done for ogg/mp4. I'd like to get the situation improved for xml files; any pointers on what I should look at and what's needed? It basically breaks file filters and the recent files list for me right now: either I see all xml files there or none, but I can't get just mine to show.
For instance, this: http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html#id2553996 is what the spec says should be used for xml detection, and the freedesktop.org mime file has root-XML elements specified. However, the xdgmime code doesn't use these. Also, after this is implemented we need to switch to namespace sniffing for "application/xml" types (but not subtypes). We should probably bring this up on xdg-list so that we can decide exactly how to handle this for all desktops.
Sorry, but I don't see the relation between my problem reported here and the xml-namespace. In my shared-mime file I introduce a mime-type and state that it needs a deep scan (looking at the content). The xml type can be recognized by normal content matching. When I use GtkRecentManager or a GtkFileFilter with gtk_recent_filter_add_mime_type/gtk_file_filter_add_mime_type, the code in gtk/glib needs to switch to content-based filtering if it otherwise can't ensure that the right files are matched. I mean, if there are 3 file types using *.xml and I'd like to filter for a specific type, why can't glib do something like this, sniffing only when it's needed:
    foreach (file in list) {
        if (!has_ext(file, "*.xml")) continue;
        if (!has_magic(file, magic)) continue;
        add_to_list(file);
    }
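A rough, self-contained C version of that idea is sketched below. It is not glib code: `has_ext()`, `has_magic()` and `is_song()` are illustrative helpers, the suffix check is case-sensitive for brevity, and the magic strings and the 0:100 offset range are taken from the mime-type definition quoted in this report. The two `<match>` rules are treated as alternatives, which is how sibling matches are combined in the shared-mime-info format.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` ends with `suffix` (for example ".xml"). */
int has_ext(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Returns 1 if `magic` occurs in `data` starting at some offset in
 * [first, last] (inclusive), like a <match type="string"
 * offset="first:last"/> rule.
 */
int has_magic(const unsigned char *data, size_t len,
              const char *magic, size_t first, size_t last)
{
    size_t m = strlen(magic);
    if (m == 0 || m > len)
        return 0;
    for (size_t off = first; off <= last && off + m <= len; off++) {
        if (memcmp(data + off, magic, m) == 0)
            return 1;
    }
    return 0;
}

/* Cheap name check first; content is inspected only if it passes. */
int is_song(const char *name, const unsigned char *data, size_t len)
{
    if (!has_ext(name, ".xml"))
        return 0;
    return has_magic(data, len, "<buzztard", 0, 100)
        || has_magic(data, len, "http://www.buzztard.org/", 0, 100);
}
```

A filter built this way only pays the cost of reading file content for files that already carry the ambiguous extension.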
I tried this with the latest glib (2.22) and it seems to work (output cut down):
    $ gvfs-info -a "standard::*" /tmp/melo1.xml /tmp/structure.xml
    display name: melo1.xml
    attributes:
      standard::name: melo1.xml
      standard::content-type: audio/x-bzt-xml
      standard::fast-content-type: audio/x-bzt-xml
    display name: structure.xml
    attributes:
      standard::content-type: application/docbook+xml
      standard::fast-content-type: audio/x-bzt-xml
However, I wouldn't recommend registering a glob for *.xml with a higher weight than the application/xml one (50), as that will make e.g. gedit and other apps that pick mimetypes by filename only (e.g. for source colorization in gedit on a new file) always think *.xml files are audio/x-bzt-xml.
However, I noticed that we pick the wrong order for glob weight, which I fixed in glib master.
I just retried with glib git head (2.22.3):
    $ gvfs-info -a "standard::*" ../share/buzztard/songs/melo1.xml /home/ensonic/temp/ensonic_advogato_rss.xml
    display name: melo1.xml
    attributes:
      standard::name: melo1.xml
      standard::content-type: application/xml
    display name: ensonic_advogato_rss.xml
    attributes:
      standard::name: ensonic_advogato_rss.xml
      standard::content-type: application/xml
      standard::fast-content-type: application/xml
Also, I have removed the weight on the glob (that was for testing only).
I have now also tried with gvfs from git head in addition. No change. I will re-add the logging and make a new patch. Maybe Alex can try this some time later, so that we can see what's causing the difference.
Tried the same on Ubuntu 9.04 (before that it was openSUSE 11.1). I always get "application/xml".
Created attachment 145842: aggregate results from all caches
Finally I found the problem. The cache_glob_lookup_suffix() function was exiting after it found something. Unfortunately we cannot leave on n>0, as that includes n=1 and that would assume that the glob match was unique (thus not running a deep match). Now in my case the 2nd cache is never probed and thus we don't notice that the glob match is not unique.
before:
    $ gvfs-info -a "standard::*" /home/ensonic/buzztard/share/buzztard/songs/melo1.xml
    DEBUG: checking file_name=melo1.xml, data=(nil)
    DEBUG: read cache (nil) from '/home/ensonic/.local/share//mime/mime.cache'
    DEBUG: read cache 0x9b925a0 from '/usr/share//mime/mime.cache'
    DEBUG: read cache 0x9b92770 from '/home/ensonic/buzztard/share/mime/mime.cache'
    DEBUG: cached lookup: 'melo1.xml',10
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=43
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=17
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=10
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_lookup_suffix (lower_case, len, FALSE, mimes, n_mimes) = 1
    DEBUG: return exact match application/xml
    DEBUG: checking file_name=melo1.xml, data=(nil)
    DEBUG: cached lookup: 'melo1.xml',10
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=43
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=17
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=10
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x9b925a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_lookup_suffix (lower_case, len, FALSE, mimes, n_mimes) = 1
    DEBUG: return exact match application/xml
    display name: melo1.xml
    edit name: melo1.xml
    name: melo1.xml
    type: regular
    size: 3071
    attributes:
      standard::type: 1
      standard::name: melo1.xml
      standard::display-name: melo1.xml
      standard::edit-name: melo1.xml
      standard::copy-name: melo1.xml
      standard::icon: application-xml, gnome-mime-application-xml, text-html, application-x-generic
      standard::content-type: application/xml
      standard::fast-content-type: application/xml
      standard::size: 3071
      standard::allocated-size: 4096
after:
    $ gvfs-info -a "standard::*" /home/ensonic/buzztard/share/buzztard/songs/melo1.xml
    DEBUG: checking file_name=melo1.xml, data=(nil)
    DEBUG: read cache (nil) from '/home/ensonic/.local/share//mime/mime.cache'
    DEBUG: read cache 0x90ee5a0 from '/usr/share//mime/mime.cache'
    DEBUG: read cache 0x90ee770 from '/home/ensonic/buzztard/share/mime/mime.cache'
    DEBUG: cached lookup: 'melo1.xml',10
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=43
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=17
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=10
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=4
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_lookup_suffix (lower_case, len, FALSE, mimes, n_mimes) = 2
    DEBUG: return uncertain mimetype audio/x-bzt-xml
    DEBUG: checking file_name=melo1.xml, data=0xbf843748
    DEBUG: cached lookup: 'melo1.xml',10
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=43
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=17
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=10
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=4
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_lookup_suffix (lower_case, len, FALSE, mimes, n_mimes) = 2
    DEBUG: sniffed mimetype audio/x-bzt-xml
    DEBUG: return certain mimetype audio/x-bzt-xml
    DEBUG: checking file_name=melo1.xml, data=(nil)
    DEBUG: cached lookup: 'melo1.xml',10
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=43
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=17
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=10
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee5a0,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=4
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_node_lookup_suffix(0x90ee770,melo1.xml), n_entries=1
    DEBUG: cache_glob_lookup_suffix (lower_case, len, FALSE, mimes, n_mimes) = 2
    DEBUG: return uncertain mimetype audio/x-bzt-xml
    display name: melo1.xml
    edit name: melo1.xml
    name: melo1.xml
    type: regular
    size: 3071
    attributes:
      standard::type: 1
      standard::name: melo1.xml
      standard::display-name: melo1.xml
      standard::edit-name: melo1.xml
      standard::copy-name: melo1.xml
      standard::icon: audio-x-bzt-xml, gnome-mime-audio-x-bzt-xml, audio-x-generic
      standard::content-type: audio/x-bzt-xml
      standard::fast-content-type: audio/x-bzt-xml
      standard::size: 3071
      standard::allocated-size: 4096
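The difference between the two runs can be modelled with a small standalone sketch. This is not the xdgmime source: `Cache`, `GlobEntry`, `lookup_early_exit()` and `lookup_all()` are hypothetical names, each cache is reduced to a flat list of suffix/type pairs, and duplicate removal is left out. It only illustrates why returning from the first cache that yields a match hides the fact that the match is not unique.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for a mime.cache; not the real data layout. */
typedef struct { const char *suffix; const char *mime; } GlobEntry;
typedef struct { const GlobEntry *entries; size_t n_entries; } Cache;

static int ends_with(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Append matches from one cache to out[n..]; returns the new count. */
static size_t lookup_one(const Cache *c, const char *name,
                         const char **out, size_t n, size_t max)
{
    for (size_t i = 0; i < c->n_entries && n < max; i++) {
        if (ends_with(name, c->entries[i].suffix))
            out[n++] = c->entries[i].mime;
    }
    return n;
}

/* Behaviour before the patch: stop at the first cache with any match. */
size_t lookup_early_exit(const Cache *caches, size_t n_caches,
                         const char *name, const char **out, size_t max)
{
    for (size_t i = 0; i < n_caches; i++) {
        size_t n = lookup_one(&caches[i], name, out, 0, max);
        if (n > 0)
            return n;  /* n == 1 looks "unique" although it may not be */
    }
    return 0;
}

/* Behaviour after the patch: collect matches from every cache. */
size_t lookup_all(const Cache *caches, size_t n_caches,
                  const char *name, const char **out, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < n_caches; i++)
        n = lookup_one(&caches[i], name, out, n, max);
    return n;
}
```

With one cache mapping `.xml` to application/xml and another mapping it to audio/x-bzt-xml, the early-exit variant reports a single match, while the aggregating variant reports two, which is what makes the caller treat the result as uncertain and sniff.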
Created attachment 145843: add some printf logging
Created attachment 145844: aggregate results from all caches + early exit
Not sure if that is safe. It still fixes the problem, but preserves a little optimization.
Before I forget: there is also cache_glob_lookup_fnmatch(), where I am not sure whether it should get a similar fix as well.
Hey, thanks for tracking this down. I'll have to contemplate a bit about this.
Another use case for overloading the *.xml glob, which is affected by the bugs described here, is my rather whimsical request for a mime type for mime-info files themselves: https://bugs.freedesktop.org/show_bug.cgi?id=24669
It looks like this was addressed in: http://git.gnome.org/cgit/glib/commit/?id=e63262d49d40a36060613fb1d0ed468ca5dddc19
The aggregation is not quite right for suffixes. The non-cache code takes the longest matching suffix before considering weights, but the cache code takes the longest matching suffix separately for each cache, which means that a high-weight short suffix in one cache could incorrectly beat a lower-weight longer suffix in another cache.
In addition, the "n < 2" tests look wrong to me. The entire list of mime types returned by glob matching is important because g_content_type_guess checks the sniffed type against each of them. With the previous "n == 0" test, there was a clear semantic: the first kind of glob (literal, case-insensitive suffix, case-sensitive suffix, arbitrary) to provide /any/ matches wins. Now there's no useful semantic that I can discern.
Created attachment 146620: Demo of misaggregation of shorter / longer suffix
Demo kit for the first issue in comment #27. With the caches, gvfs-info uses the shorter, higher-weight "*.b" glob; without them, it uses the longer, lower-weight "*.a.b" glob.
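The misaggregation can be reproduced in a toy model. The sketch below is not the xdgmime code and makes one loud assumption: that the cached path picks the longest suffix inside each cache and then compares the per-cache winners by weight alone. `Glob`, `GlobCache`, `pick_per_cache()` and `pick_global()` are hypothetical names, and the types and weights are made up for the example.

```c
#include <stddef.h>
#include <string.h>

typedef struct { const char *suffix; int weight; const char *mime; } Glob;
typedef struct { const Glob *globs; size_t n; } GlobCache;

static int ends_with(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Non-zero if cand beats best: longer suffix first, then weight. */
static int longer_then_heavier(const Glob *cand, const Glob *best)
{
    size_t c = strlen(cand->suffix), b = strlen(best->suffix);
    return c > b || (c == b && cand->weight > best->weight);
}

/* Best match inside a single cache. */
static const Glob *best_in_cache(const GlobCache *c, const char *name)
{
    const Glob *best = NULL;
    for (size_t i = 0; i < c->n; i++) {
        const Glob *g = &c->globs[i];
        if (ends_with(name, g->suffix)
            && (best == NULL || longer_then_heavier(g, best)))
            best = g;
    }
    return best;
}

/* Assumed cached behaviour: per-cache winners compared by weight only. */
const char *pick_per_cache(const GlobCache *caches, size_t n,
                           const char *name)
{
    const Glob *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const Glob *g = best_in_cache(&caches[i], name);
        if (g != NULL && (best == NULL || g->weight > best->weight))
            best = g;
    }
    return best != NULL ? best->mime : NULL;
}

/* Non-cached behaviour: longest suffix across all data, then weight. */
const char *pick_global(const GlobCache *caches, size_t n,
                        const char *name)
{
    const Glob *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const Glob *g = best_in_cache(&caches[i], name);
        if (g != NULL && (best == NULL || longer_then_heavier(g, best)))
            best = g;
    }
    return best != NULL ? best->mime : NULL;
}
```

For a file named `file.a.b`, with `*.b` at weight 80 in one cache and `*.a.b` at weight 50 in another, the per-cache variant picks the `*.b` type and the global variant picks the `*.a.b` type, mirroring the demo described above.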
Well, all of this is only heuristics anyway, so I'm not sure that one can really speak of 'correct' or 'incorrect' here. If you want to be pedantic, then any of the early returns are 'incorrect' - after all, even if there is a literal match, you might still have globs that match as well.
Still, when I want to code shared-mime-info files so that the heuristic gives certain results in certain cases, and particularly if I want that to hold when other mime info is added, it really helps to have clear semantics. And the cache feature is the sort of thing that one expects to be transparent.
Matt, my patch was only addressing the issue that there is an early-return code path, but later on there is code that is written as if there were no early return. I see your point though. So maybe it's better to remove the early returns in these cases. It would be nice if someone would at least move the bug to NEW. I think there is enough proof that things can indeed go wrong :) If there are problems with the approach I took, I am more than happy to rework things, test things and so on.
This is not a new problem with your patch.
Matthias, how should we proceed? Can the current patch go in?
As mentioned in comment 27, I committed something close to your patch already. There are some remaining problems that are outlined in comments 27 and 30.
Original problem persists :/
Sorry for the noise, it is fixed from my POV (since glib 2.22.2). There is still a catch, though:
* I need to use the glob (but now with an extra low weight).
    <mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
      ...
      <mime-type type="audio/x-bzt-xml">
        <sub-class-of type="application/xml"/>
        <glob pattern="*.xml" weight="5"/>
        <magic priority="100">
          <match type="string" value="<buzztard" offset="0:100"/>
          <match type="string" value="http://www.buzztard.org/" offset="0:100"/>
        </magic>
        <comment>buzztard song (xml)</comment>
      </mime-type>
    </mime-info>
* I need to manually delete old $HOME/.recently-used.xbel entries, as the wrong mime-type is stored there.
Created attachment 190539: add some printf logging
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/148.