GNOME Bugzilla – Bug 784747
GContentType doesn't keep track of file extensions
Last modified: 2018-05-24 19:41:56 UTC
GContentType is a string, and glib is canonically written with the assumption that this string is a MIME/type. Some time ago there was a discussion about changing this in a somewhat compatible way to ensure that other metadata (other than MIME/type) can be preserved in the content type string, which would help portability. These patches are the result of that. More specifically, the discussion mostly happened in bug 735696 (particularly - bug 735696, comment 45). However, it's been hard to get someone to review this code due to the fact that it mixed W32, gcontenttype and xdgmime changes. Therefore i picked my Debian laptop and re-worked the code to make changes to the "unix" contenttype code and xdgmime the first step, completely excising any mentions of W32 and doubling down on testsuite additions. The code still has some xdgmime behavior changes thrown in (globs2 check, empty cache check). Hopefully, that won't be a problem.
Created attachment 355259 [details] [review] xdgmime - check for globs2 This adds a globs2 file check to xdg_check_dir(). This way its behaviour matches the xdg_mime_init_from_directory(), which first tries to load globs2, then falls back to globs.
Created attachment 355260 [details] [review] xdgmime - add mime/type->extension lookup Adds xdg_mime_get_file_exts_from_mime_type(), which takes a mime/type and looks up possible file extensions that files matching this type could have. This initial implementation uses glob hash.
Created attachment 355261 [details] [review] xdgmime - add a function for mime/type comparison This function compares "MIME/type; foo=bar" types correctly by ignoring everything after ';'.
Created attachment 355262 [details] [review] xdgmime Add a function for parsing MIME/type parameters Right now it only gets the ext=... parameter, but can be modified to get other parameters as well.
Created attachment 355263 [details] [review] xdgmime - return synthetic types for filenames with no matches When nothing matches foo.bar, return its MIME/type as application/x-extension-bar
Created attachment 355264 [details] [review] xdgmime - parse application/x-extension-foo to get .foo
Created attachment 355265 [details] [review] xdgmime - ignore extended parameters in supertype checks
Created attachment 355266 [details] [review] GContentType - correctly support extended MIME/types
Created attachment 355267 [details] [review] xdgmime - don't use empty caches The code will try to use the cache, and only the cache, if it's available, even if it's empty. Don't let it do that.
Created attachment 355268 [details] [review] GContentType - integrate filename and data sniffing tighter Instead of doing both types of checks separately, use the results of data sniffing to filter out filename sniffing results that can't apply: if we have a magic signature for MIME/type foo/bar, and our data does not match that magic signature, it means that this file is not foo/bar, even though its filename looks like it could be foo/bar.
Created attachment 355269 [details] [review] xdgmime - implement cache-based extensions lookup Note that this implementation makes it obvious just how inefficient it is to make extensions lookups in suffix cache. If this ever becomes a bottleneck, consider creating a new kind of cache, one optimized for doing mime->ext lookups.
Created attachment 355270 [details] [review] Testsuite: fix gcontenttype test to pass, with the new logic * A bogus text-ish file named "*.pot" won't be detected as gettext template, because we have a magic for gettext templates, and it doesn't match that file, even though it has the right extension. Powerpoint is more plausible here (although also wrong). * Use actual *.pot file contents to correctly detect it as a gettext template. Again, feeding it something vaguely text-ish will not cut it anymore.
Created attachment 355271 [details] [review] GContentType - add a function to get extension Adds g_content_type_get_extension(), which returns the extension associated with a given content type. Also add a testcase for it.
Created attachment 355272 [details] [review] xdgmime - make extension lookup function walk the tree If it need to find more extensions than it already found, it gets the list of parents and calls itself recursively on them. Also, unalias the type before looking it up. Also change it into a pair of functions (internal and external). External function calls ..._init() and calls the internal function.
Created attachment 355273 [details] [review] GContentType - add a function for getting multiple extensions Also extend the testsuite to test it. Unfortunately, it has to accept a number-of-extensions-to-look-up parameter, as the xdgmime API isn't designed to deal with an arbitrary number of things to find.
Something just occurred to me - it's probably not very nice for gcontenttype.c to use _xdg_mime_mime_type_cmp_ext(), as it's a private xdgmime function (although, unlike other _-prefixed functions, it doesn't need xdgmime to be initialized, so calling it is just ugly, not wrong). Rename it (remove the _-prefix), duplicate it in gcontenttype, or just leave the situation as-is?
Portability note: Cursory examination suggests that MacOS will be able to use this. It would use UTTypeCreatePreferredIdentifierForTag() with kUTTagClassFilenameExtension to create UTI (Apple's hierarchical type-string thingy) from a file extension (it doesn't seem possible to get UTI without a file extension), and use UTTypeCopyPreferredTagWithClass() with kUTTagClassFilenameExtension to get file extension from a UTI. UTI can also be obtained from 3rd-party applications or just created by code that actually *knows* the type of the data. Thus, while shared-mime-info doesn't know anything about UTI, it does know about file extensions, and UTI can be determined from file extensions trivially. And vice-versa - UTI can be converted to file extension, and file extension can be converted to MIME/type via shared-mime-info (that is, if you don't have a file to sniff). UTI is a string and can be trivially stored in MIME/type after ';', the same way file extension will be. Thus if UTI is known in advance, it will survive a full dive into g_content_guess(), and can be used later on.
Hi, I opened Bug #788401. In the discussion there was also a reference to this proposal here. Looking at the history of gcontenttype-win32.c the split from gcontenttype.c was done in 2012 - minimum introducing the g_content_type_guess result to be win32 specific at that point of time. Whatever the reason was to break the API at that point of time - your first statement in this proposal is that major parts of glib is still written based on the wrong API assumption that this call yields mime types. At least in gdk-pixbuf I could see that this is correct. They use hard coded "text/plain" comparison for comparing content type. However, since 2012 the poor ".tiff" windows type definitions seem to work. Maybe the decision was wrong to change the API back in the old ages. But just the fact to change API again to fix a problem that seems not to be critical would be a reason for me to be skeptic.
The API doesn't change. GContentType was a string, is a string, and will remain a string. Your skepticism, if i understand you correctly, is based on the assumption that glib API users running on Windows will see types like "foo/bar; ext=.baz" instead of the expected ".baz". That is not the case. Windows port of glib will continue to return a file extension as the content type (and MacOS port will probably continue to return the UTI string as the content type). The difference lies in the way content type is processed and/or guessed internally. P.S. I *would* like add an optional code path triggered by a special environment variable, which *will* enable Windows port of glib to return content type as-is, plus the g_content_type_get_ext() function to conveniently convert it to a file extension. That way individual applications will be able to decide for themselves whether they want the old file extensions or the new MIME/types.
Review of attachment 355266 [details] [review]: Thanks for the clarification that the API is not supposed to be changed. Maybe I do not understand the code correctly but when I look at the changes in g_content_type_guess, then I had the impression that the return value would change from "text/plain" to "text/plain; .txt" or similar. I just noticed that for example in gdk-pixbuf they do hard string comparisons to "text/plain". ::: gio/gcontenttype.c @@ +816,3 @@ * Guess on the first one */ + mimetype = maybe_add_ext (name_mimetypes[0], actual_extension); I interpret this as adding the extension to the old mimetype, i.e. instead of returning "text/plain", the return value would be "text/plain; .txt"
H-m-m...i'll have to look carefully at the source code to be sure. That said, we could just chalk it up as a bug in the user code. g_content_type_equals() will correctly compare two content types and pronounce them equal or unequal by ignoring the extra parts. Same goes for g_content_type_is_a() (instead of doing strncmp (type, "image/", ...), for example). At least the code that expects MIME/types have no excuses for not using these functions. The code that expects something else should also be using them (on Windows g_content_type_equals() does g_ascii_strcasecmp() first, then does strcmp on corresponding class key default values).
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1279.