After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 784747 - GContentType doesn't keep track of file extensions
GContentType doesn't keep track of file extensions
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: gio
unspecified
Other All
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2017-07-10 12:35 UTC by LRN
Modified: 2018-05-24 19:41 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
xdgmime - check for globs2 (1.88 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime - add mime/type->extension lookup (9.30 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime - add a function for mime/type comparison (7.75 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime Add a function for parsing MIME/type parameters (9.14 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime - return synthetic types for filenames with no matches (3.63 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime - parse application/x-extension-foo to get .foo (3.65 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime - ignore extended parameters in supertype checks (1.44 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
GContentType - correctly support extended MIME/types (7.49 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime - don't use empty caches (5.11 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
GContentType - integrate filename and data sniffing tighter (10.09 KB, patch)
2017-07-10 12:35 UTC, LRN
none Details | Review
xdgmime - implement cache-based extensions lookup (11.82 KB, patch)
2017-07-10 12:36 UTC, LRN
none Details | Review
Testsuite: fix gcontenttype test to pass, with the new logic (2.18 KB, patch)
2017-07-10 12:36 UTC, LRN
none Details | Review
GContentType - add a function to get extension (3.88 KB, patch)
2017-07-10 12:36 UTC, LRN
none Details | Review
xdgmime - make extension lookup function walk the tree (5.75 KB, patch)
2017-07-10 12:36 UTC, LRN
none Details | Review
GContentType - add a function for getting multiple extensions (4.25 KB, patch)
2017-07-10 12:36 UTC, LRN
none Details | Review

Description LRN 2017-07-10 12:35:06 UTC
GContentType is a string, and glib is canonically written with the assumption that this string is a MIME/type.
Some time ago there was a discussion about changing this in a somewhat compatible way to ensure that other metadata (other than MIME/type) can be preserved in the content type string, which would help portability.

These patches are the result of that. More specifically, the discussion mostly happened in bug 735696 (particularly - bug 735696, comment 45).
However, it's been hard to get someone to review this code due to the fact that it mixed W32, gcontenttype and xdgmime changes. Therefore i picked my Debian laptop and re-worked the code to make changes to the "unix" contenttype code and xdgmime the first step, completely excising any mentions of W32 and doubling down on testsuite additions.

The code still has some xdgmime behavior changes thrown in (globs2 check, empty cache check). Hopefully, that won't be a problem.
Comment 1 LRN 2017-07-10 12:35:12 UTC
Created attachment 355259 [details] [review]
xdgmime - check for globs2

This adds a globs2 file check to xdg_check_dir(). This way its behaviour
matches the xdg_mime_init_from_directory(), which first tries to load
globs2, then falls back to globs.
Comment 2 LRN 2017-07-10 12:35:17 UTC
Created attachment 355260 [details] [review]
xdgmime - add mime/type->extension lookup

Adds xdg_mime_get_file_exts_from_mime_type(), which takes
a mime/type and looks up possible file extensions that files
matching this type could have.

This initial implementation uses glob hash.
Comment 3 LRN 2017-07-10 12:35:23 UTC
Created attachment 355261 [details] [review]
xdgmime - add a function for mime/type comparison

This function compares "MIME/type; foo=bar" types correctly
by ignoring everything after ';'.
Comment 4 LRN 2017-07-10 12:35:27 UTC
Created attachment 355262 [details] [review]
xdgmime Add a function for parsing MIME/type parameters

Right now it only gets the ext=... parameter, but can be modified
to get other parameters as well.
Comment 5 LRN 2017-07-10 12:35:32 UTC
Created attachment 355263 [details] [review]
xdgmime - return synthetic types for filenames with no matches

When nothing matches foo.bar, return its MIME/type as application/x-extension-bar
Comment 6 LRN 2017-07-10 12:35:38 UTC
Created attachment 355264 [details] [review]
xdgmime - parse application/x-extension-foo to get .foo
Comment 7 LRN 2017-07-10 12:35:43 UTC
Created attachment 355265 [details] [review]
xdgmime - ignore extended parameters in supertype checks
Comment 8 LRN 2017-07-10 12:35:48 UTC
Created attachment 355266 [details] [review]
GContentType - correctly support extended MIME/types
Comment 9 LRN 2017-07-10 12:35:54 UTC
Created attachment 355267 [details] [review]
xdgmime - don't use empty caches

The code will try to use the cache, and only the cache, if it's
available, even if it's empty. Don't let it do that.
Comment 10 LRN 2017-07-10 12:35:59 UTC
Created attachment 355268 [details] [review]
GContentType - integrate filename and data sniffing tighter

Instead of doing both types of checks separately, use the results of
data sniffing to filter out filename sniffing results that can't
apply: if we have a magic signature for MIME/type foo/bar, and our data
does not match that magic signature, it means that this file is not foo/bar,
even though its filename looks like it could be foo/bar.
Comment 11 LRN 2017-07-10 12:36:04 UTC
Created attachment 355269 [details] [review]
xdgmime - implement cache-based extensions lookup

Note that this implementation makes it obvious just how inefficient it is
to make extensions lookups in suffix cache. If this ever becomes a bottleneck,
consider creating a new kind of cache, one optimized for doing mime->ext lookups.
Comment 12 LRN 2017-07-10 12:36:09 UTC
Created attachment 355270 [details] [review]
Testsuite: fix gcontenttype test to pass, with the new logic

* A bogus text-ish file named "*.pot" won't be detected as gettext
  template, because we have a magic for gettext templates, and it
  doesn't match that file, even though it has the right extension.
  Powerpoint is more plausible here (although also wrong).
* Use actual *.pot file contents to correctly detect it as a
  gettext template. Again, feeding it something vaguely text-ish
  will not cut it anymore.
Comment 13 LRN 2017-07-10 12:36:15 UTC
Created attachment 355271 [details] [review]
GContentType - add a function to get extension

Adds g_content_type_get_extension(), which returns the extension
associated with a given content type.
Also add a testcase for it.
Comment 14 LRN 2017-07-10 12:36:20 UTC
Created attachment 355272 [details] [review]
xdgmime - make extension lookup function walk the tree

If it need to find more extensions than it already found,
it gets the list of parents and calls itself recursively on them.

Also, unalias the type before looking it up.

Also change it into a pair of functions (internal and external).
External function calls ..._init() and calls the internal function.
Comment 15 LRN 2017-07-10 12:36:26 UTC
Created attachment 355273 [details] [review]
GContentType - add a function for getting multiple extensions

Also extend the testsuite to test it.

Unfortunately, it has to accept a number-of-extensions-to-look-up parameter,
as the xdgmime API isn't designed to deal with an arbitrary number of things to find.
Comment 16 LRN 2017-07-10 13:46:42 UTC
Something just occurred to me - it's probably not very nice for gcontenttype.c to use _xdg_mime_mime_type_cmp_ext(), as it's a private xdgmime function (although, unlike other _-prefixed functions, it doesn't need xdgmime to be initialized, so calling it is just ugly, not wrong). Rename it (remove the _-prefix), duplicate it in gcontenttype, or just leave the situation as-is?
Comment 17 LRN 2017-07-29 11:40:41 UTC
Portability note:

Cursory examination suggests that MacOS will be able to use this.

It would use UTTypeCreatePreferredIdentifierForTag() with kUTTagClassFilenameExtension to create UTI (Apple's hierarchical type-string thingy) from a file extension (it doesn't seem possible to get UTI without a file extension), and use UTTypeCopyPreferredTagWithClass() with kUTTagClassFilenameExtension to get file extension from a UTI. UTI can also be obtained from 3rd-party applications or just created by code that actually *knows* the type of the data.

Thus, while shared-mime-info doesn't know anything about UTI, it does know about file extensions, and UTI can be determined from file extensions trivially. And vice-versa - UTI can be converted to file extension, and file extension can be converted to MIME/type via shared-mime-info (that is, if you don't have a file to sniff).

UTI is a string and can be trivially stored in MIME/type after ';', the same way file extension will be. Thus if UTI is known in advance, it will survive a full dive into g_content_guess(), and can be used later on.
Comment 18 Friedrich Beckmann 2017-10-14 05:01:52 UTC
Hi, I opened Bug #788401. In the discussion there was also a reference to this proposal here. Looking at the history of gcontenttype-win32.c the split from gcontenttype.c was done in 2012 - minimum introducing the g_content_type_guess result to be win32 specific at that point of time. Whatever the reason was to break the API at that point of time - your first statement in this proposal is that major parts of glib is still written based on the wrong API assumption that this call yields mime types. At least in gdk-pixbuf I could see that this is correct. They use hard coded "text/plain" comparison for comparing content type.

However, since 2012 the poor ".tiff" windows type definitions seem to work. Maybe the decision was wrong to change the API back in the old ages. But just the fact to change API again to fix a problem that seems not to be critical would be a reason for me to be skeptic.
Comment 19 LRN 2017-10-14 07:17:57 UTC
The API doesn't change. GContentType was a string, is a string, and will remain a string.

Your skepticism, if i understand you correctly, is based on the assumption that glib API users running on Windows will see types like "foo/bar; ext=.baz" instead of the expected ".baz". That is not the case. Windows port of glib will continue to return a file extension as the content type (and MacOS port will probably continue to return the UTI string as the content type). The difference lies in the way content type is processed and/or guessed internally.

P.S. I *would* like add an optional code path triggered by a special environment variable, which *will* enable Windows port of glib to return content type as-is, plus the g_content_type_get_ext() function to conveniently convert it to a file extension. That way individual applications will be able to decide for themselves whether they want the old file extensions or the new MIME/types.
Comment 20 Friedrich Beckmann 2017-10-14 07:55:16 UTC
Review of attachment 355266 [details] [review]:

Thanks for the clarification that the API is not supposed to be changed. Maybe I do not understand the code correctly but when I look at the changes in g_content_type_guess, then I had the impression that the return value would change from "text/plain" to "text/plain; .txt" or similar. I just noticed that for example in gdk-pixbuf they do hard string comparisons to "text/plain".

::: gio/gcontenttype.c
@@ +816,3 @@
            * Guess on the first one
            */
+          mimetype = maybe_add_ext (name_mimetypes[0], actual_extension);

I interpret this as adding the extension to the old mimetype, i.e. instead of returning "text/plain", the return value would be "text/plain; .txt"
Comment 21 LRN 2017-10-14 08:24:25 UTC
H-m-m...i'll have to look carefully at the source code to be sure.

That said, we could just chalk it up as a bug in the user code. g_content_type_equals() will correctly compare two content types and pronounce them equal or unequal by ignoring the extra parts. Same goes for g_content_type_is_a() (instead of doing strncmp (type, "image/", ...), for example). At least the code that expects MIME/types have no excuses for not using these functions. The code that expects something else should also be using them (on Windows g_content_type_equals() does g_ascii_strcasecmp() first, then does strcmp on corresponding class key default values).
Comment 22 GNOME Infrastructure Team 2018-05-24 19:41:56 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1279.