Bug 735696 – W32 GContentType support is inadequate

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 735696 - W32 GContentType support is inadequate


Summary:	W32 GContentType support is inadequate


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	win32
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	gtk-win32 maintainers

URL:
Whiteboard:

Depends on:	735689
Blocks:

Reported:	2014-08-29 21:14 UTC by LRN
Modified:	2018-05-24 16:59 UTC

See Also:	781020
GNOME target:	---
GNOME version:	---

Attachments
Make xdgmime more portable (6.18 KB, patch) 2014-08-29 21:15 UTC, LRN	none	Details \| Review
Make sure xdgmime reads files in binary mode (3.03 KB, patch) 2014-08-29 21:15 UTC, LRN	none	Details \| Review
W32 GContentType rewrite (57.31 KB, patch) 2014-08-29 21:15 UTC, LRN	none	Details \| Review
Add W32 support to xdgmimecache (5.50 KB, patch) 2014-08-29 21:15 UTC, LRN	none	Details \| Review
Disable xdgcache on W32 for now (1007 bytes, patch) 2014-08-29 21:15 UTC, LRN	none	Details \| Review
Enable contenttype test on W32, tweak it to pass (2.03 KB, patch) 2014-08-29 21:15 UTC, LRN	none	Details \| Review
Expand copyright notice in gcontenttype-win32.c (1.21 KB, patch) 2014-08-29 21:20 UTC, LRN	none	Details \| Review
Make xdgmime more portable (6.91 KB, patch) 2014-09-09 06:53 UTC, LRN	none	Details \| Review
W32 GContentType rewrite (57.32 KB, patch) 2014-09-09 06:54 UTC, LRN	needs-work	Details \| Review
W32 GContentType rewrite (57.22 KB, patch) 2014-09-10 06:19 UTC, LRN	none	Details \| Review
Make xdgmime more portable (8.05 KB, patch) 2014-09-10 09:47 UTC, LRN	none	Details \| Review
W32 GContentType rewrite (47.66 KB, patch) 2014-09-10 09:50 UTC, LRN	none	Details \| Review
Make xdgmime more portable (17.02 KB, patch) 2015-05-02 23:38 UTC, LRN	needs-work	Details \| Review
W32 GContentType rewrite (46.27 KB, patch) 2015-05-02 23:40 UTC, LRN	none	Details \| Review
xdgmime: Finer handling for cases where mmap() is not available (6.52 KB, patch) 2015-05-02 23:41 UTC, LRN	committed	Details \| Review
Enable contenttype test on W32, tweak it to pass (mostly) (2.47 KB, patch) 2015-05-02 23:50 UTC, LRN	committed	Details \| Review
W32 GContentType rewrite (46.31 KB, patch) 2015-05-02 23:57 UTC, LRN	none	Details \| Review
Make xdgmime and glocalfileinfo more compatible with MSVC (5.81 KB, patch) 2015-05-05 01:50 UTC, LRN	none	Details \| Review
W32 GContentType rewrite (54.81 KB, patch) 2015-05-05 02:17 UTC, LRN	needs-work	Details \| Review
Fix nitpicks for attachment 302905 (4.16 KB, patch) 2015-05-18 11:52 UTC, LRN	none	Details \| Review

Description LRN 2014-08-29 21:14:58 UTC

For two reasons:
- content type on W32 is ".ext" (file extension), not MIME/type.
  As such, MIME/type matching and subclassig can't be applied to it.
- implementation relies only on the information found in Windows registry.
  That information is optional in many cases, which means that the
  implementation is often incapable of making the deductions that the
  clients are expecting of it.

My solution is to keep using file extensions for content types (for backward
compatibility with existing glib W32 code; i would have preferred to ditch
them completely, but that seems very unlikely), but convert them to MIME/types
under the hood, then re-use the same xdgmime-based gcontenttype implementation
that the other OSes use (with some extra W32 wrapping to fall back to reading
the Windows registry in case xdgmime does not deliver; also, with some W32-only
heuristics).

That requires xdgmime to be ported to W32, which turned out to be easy enough
(as long as xdgmime is allowed to link to glib, which is not a problem by
itself, although it does make it more difficult to sync xdgmime with upstream).

Comment 1 LRN 2014-08-29 21:15:12 UTC

Created attachment 284847 [details] [review]
Make xdgmime more portable

Comment 2 LRN 2014-08-29 21:15:17 UTC

Created attachment 284848 [details] [review]
Make sure xdgmime reads files in binary mode

This fixes the error when getc() returns EOF before reaching the end of
file on W32.

Comment 3 LRN 2014-08-29 21:15:22 UTC

Created attachment 284849 [details] [review]
W32 GContentType rewrite

- Add extension lookup function to xdgmimeglob
- Rename all g_content_type* function to g_mime_content_type*
- Add gcontenttyle-unix.c, which implements g_content_type* for unix by
  simply calling g_mime_content_type*
- Add mime support gcontenttype-win32.c, which implements g_content_type*
  for W32.
- Rename g_unix_content_type* to g_mime_content_type* and enable their
  use on W32.

Comment 4 LRN 2014-08-29 21:15:27 UTC

Created attachment 284850 [details] [review]
Add W32 support to xdgmimecache

Comment 5 LRN 2014-08-29 21:15:32 UTC

Created attachment 284851 [details] [review]
Disable xdgcache on W32 for now

Comment 6 LRN 2014-08-29 21:15:36 UTC

Created attachment 284852 [details] [review]
Enable contenttype test on W32, tweak it to pass

Comment 7 LRN 2014-08-29 21:20:51 UTC

Created attachment 284855 [details] [review]
Expand copyright notice in gcontenttype-win32.c

Comment 8 LRN 2014-09-09 06:53:01 UTC

Created attachment 285714 [details] [review]
Make xdgmime more portable

v2:
 - Updated to match the new version of fnmatch function (g_utf8_fnmatch, etc)

Comment 9 LRN 2014-09-09 06:54:27 UTC

Created attachment 285715 [details] [review]
W32 GContentType rewrite

- Add extension lookup function to xdgmimeglob
- Rename all g_content_type* function to g_mime_content_type*
- Add gcontenttyle-unix.c, which implements g_content_type* for unix by
  simply calling g_mime_content_type*
- Add mime support gcontenttype-win32.c, which implements g_content_type*
  for W32.
- Rename g_unix_content_type* to g_mime_content_type* and enable their
  use on W32.

v2:
 - Rebased on top of current master (there was a gdesktopappinfo fix that
   needed merging.

Comment 10 Ignacio Casal Quinteiro (nacho) 2014-09-09 07:11:44 UTC

Review of attachment 285715 [details] [review]:

Some nitpicks.

::: gio/gcontenttype-unix.c
@@ +1,1 @@
+/* -*- mode: C; c-file-style: "gnu"; indent-tabs-mode: nil; -*- */

about the name of the files what do we do for other files like this? do we prefix (gunixcontenttype) or suffix like you did?

@@ +105,3 @@
+}
+
+

remove extra line

@@ +266,3 @@
+  return g_mime_content_types_get_registered ();
+}
+

extra line

::: gio/gcontenttype-win32.c
@@ +52,3 @@
+   */
+  gchar **exts;
+} ext_list;

I don't like this name...

@@ +77,3 @@
+
+  if (!GetModuleHandleExA (GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT |
+				GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,

wrong align

@@ +106,3 @@
+    retval = g_build_filename (filename, subdir, NULL);
+  else
+    retval = g_build_filename (filename, NULL);

use g_strdup in this case, and even better do like this:
retval = subdir ? g_build_filename (filename, subdir, NULL) : g_strdup (filename);

though now I see that you free the filename right down. You could do then:
if (subdir)
  {
    retval = g_build_filename (filename, subdir, NULL);
    g_free (filename);
  }
else
  retval = filename;

@@ +114,3 @@
+
+static void
+on_etc_changed (

wrong alignment

@@ +139,3 @@
+  g_file_info_get_modification_time (fileinfo, &tv);
+
+  g_clear_object (&fileinfo);

do just g_object_unref here

@@ +165,3 @@
+        break;
+
+      etc_mime_types_filename =

assign in same line please

@@ +177,3 @@
+        break;
+
+      etc_mime_types = g_file_new_for_path (etc_mime_types_filename);

is this leaked?

@@ +182,3 @@
+        break;
+
+      etc_monitor =

same here for align

@@ +201,3 @@
+  while (FALSE);
+
+  g_clear_pointer (&etc_dirname, g_free);

use g_free directly

@@ +206,3 @@
+  if (!result)
+    {
+      g_clear_pointer (&etc_mime_types_filename, g_free);

just g_free

Comment 11 LRN 2014-09-10 06:15:02 UTC

(In reply to comment #10)
> Review of attachment 285715 [details] [review]:
> 
> ::: gio/gcontenttype-unix.c
> @@ +1,1 @@
> +/* -*- mode: C; c-file-style: "gnu"; indent-tabs-mode: nil; -*- */
> 
> about the name of the files what do we do for other files like this? do we
> prefix (gunixcontenttype) or suffix like you did?

I didn't do nothing, it was like this when i started.
That said, suffixing is more convenient for me, personally.

> ::: gio/gcontenttype-win32.c
> @@ +52,3 @@
> +   */
> +  gchar **exts;
> +} ext_list;
> 
> I don't like this name...

Alternatives?

> @@ +177,3 @@
> +        break;
> +
> +      etc_mime_types = g_file_new_for_path (etc_mime_types_filename);
> 
> is this leaked?
> 

set_up_mime_types_watch() is only called once, so this object is only allocated once. It's never unreffed though, so in that sense - yes, it's leaked.

Comment 12 LRN 2014-09-10 06:19:07 UTC

Created attachment 285792 [details] [review]
W32 GContentType rewrite

- Add extension lookup function to xdgmimeglob
- Rename all g_content_type* function to g_mime_content_type*
- Add gcontenttyle-unix.c, which implements g_content_type* for unix by
  simply calling g_mime_content_type*
- Add mime support gcontenttype-win32.c, which implements g_content_type*
  for W32.
- Rename g_unix_content_type* to g_mime_content_type* and enable their
  use on W32.

v2:
 - Rebased on top of current master (there was a gdesktopappinfo fix that
   needed merging.
v3:
 - Fixed most nitpicks

Comment 13 LRN 2014-09-10 09:47:40 UTC

Created attachment 285807 [details] [review]
Make xdgmime more portable

v2:
 - Updated to match the new version of fnmatch function (g_utf8_fnmatch, etc)
v3:
 - Ensure that xdgmime only works with UTF-8 filenames

Comment 14 LRN 2014-09-10 09:50:18 UTC

Created attachment 285808 [details] [review]
W32 GContentType rewrite

- Add extension lookup function to xdgmimeglob
- Rename all g_content_type* function to g_mime_content_type*
- Add gcontenttyle-unix.c, which implements g_content_type* for unix by
  simply calling g_mime_content_type*
- Add mime support gcontenttype-win32.c, which implements g_content_type*
  for W32.
- Rename g_unix_content_type* to g_mime_content_type* and enable their
  use on W32.

v2:
 - Rebased on top of current master (there was a gdesktopappinfo fix that
   needed merging.
v3:
 - Fixed most nitpicks
v4:
 - Removed the code that reads /etc/mime.types and uses it to convert
   extensions to MIME/types. Now xdgmime is used for this instead.
   This cuts the amount of new code by half, but does make this implementation
   less threadsafe (xdgmime is not thread-safe, /etc/mime.types-based hash
   table was thread-safe).

Comment 15 Alexander Larsson 2014-09-11 12:45:22 UTC

Review of attachment 285807 [details] [review]:

::: gio/gcontenttype.c
@@ +650,3 @@
+
+          if (basename)
+            basename_utf8 = g_filename_to_utf8 (basename, -1, NULL, NULL, NULL);

This isn't really right.
Even if the filename doesn't propely convert to utf8 we should be able to look up its type.
I know the xdgmime code is a bit boneheaded about utf8, but in general we should not need to assume an encoding, as all globs in our databases are pure 7bit ascii.

We might need to convert to utf8 on win32 though? Or are not all filenames utf8 there?

::: gio/xdgmime/xdgmimecache.c
@@ +447,3 @@
 
+	  /* file_name should have been converted into UTF-8 somewhere up the stack. */
+	  if (g_utf8_fnmatch (ptr, file_name, flags) == G_FNM_MATCH)

There is no need to do utf8 contortions here, pretending everything is 8bit chars is fine. We will never mismatch anything anyway, as all our globs are ascii. Especially since our code has to work on "invalid" filenames anyway, since they can happen.

Comment 16 LRN 2014-09-11 13:07:34 UTC

(In reply to comment #15)
> Review of attachment 285807 [details] [review]:
> 
> ::: gio/gcontenttype.c
> @@ +650,3 @@
> +
> +          if (basename)
> +            basename_utf8 = g_filename_to_utf8 (basename, -1, NULL, NULL,
> NULL);
> 
> This isn't really right.
> Even if the filename doesn't propely convert to utf8 we should be able to look
> up its type.
> I know the xdgmime code is a bit boneheaded about utf8, but in general we
> should not need to assume an encoding, as all globs in our databases are pure
> 7bit ascii.
> 
> We might need to convert to utf8 on win32 though? Or are not all filenames utf8
> there?
> 
> ::: gio/xdgmime/xdgmimecache.c
> @@ +447,3 @@
> 
> +      /* file_name should have been converted into UTF-8 somewhere up the
> stack. */
> +      if (g_utf8_fnmatch (ptr, file_name, flags) == G_FNM_MATCH)
> 
> There is no need to do utf8 contortions here, pretending everything is 8bit
> chars is fine. We will never mismatch anything anyway, as all our globs are
> ascii. Especially since our code has to work on "invalid" filenames anyway,
> since they can happen.

Disregard the g_utf8_fnmatch()-related stuff, that patch was refused (see bug 735689) (the expectation is that if you agree to let gcontenttype patch in, a simplified non-fnmatch-based globbing function will be written and used in xdgmime).

Comment 17 Alexander Larsson 2014-09-11 13:37:07 UTC

Content types are primarlily gotten from using g_file_query_info() and then g_file_info_get_content_type(). The main use of them in the api is:
1) As a way to get an icon for the file (internally)
2) To connect "installed apps" represented by GAppInfo to different kinds of files. i.e. can this app open such a file, what is the default app that opens this kind of files, etc.

On windows, these two are handled (mainly) by using extensions. So, two .txt files would always have the same icon and default app. If we start having gio use the full xdg sniffing machinery it will claim that "CMakeLists.txt" is a "text/x-cmake" and "readme.txt" is a "text/plain", then what will we return from g_app_info_get_recommended_for_type()? Do we suddenly want to be able to open different apps for these? Thats not how win32 typically works. Same for icons, do we want to not show the "standard" registred file icon for a file if we sniff it to be of a more specific type? That is also nontypical for windows apps.

Not to mention that i'm not sure how the implementation would work. The current patch doesn't change gwin32appinfo.c at all, so it still takes the content type and looks for it in the registry. Now that the content type is a mimetype this fails and we will not be able to open any files. Same with icon, query_info will return a mimetype and the win32 g_content_type_get_icon implementation will fail to look up the registred icon for the extension of the file.

So, the change will break application associations and registred file-type-icons on win32. The advantage is that you get more "detailed" icons and content types.

So, what exactly is the goal here? Who is the target for this change? Do you want more kinds of icons in the file selector? Do you want more details on the files so you can pick e.g. highlighting mode for your editor?

I can understand there being situations where having the full power of xdgmime in things like sniffing file types, but I don't think we want to lose the platform integration that content types gives us. I think it might make more sense to add a new api that exposes the full mimetype machinery, and a new mime-type property for GFileInfo. Then applications that need the more finegrained, but not-platform-integrated type can use that.

Comment 18 LRN 2014-09-11 14:14:21 UTC

(In reply to comment #17)
> Content types are primarlily gotten from using g_file_query_info() and then
> g_file_info_get_content_type(). The main use of them in the api is:
> 1) As a way to get an icon for the file (internally)
> 2) To connect "installed apps" represented by GAppInfo to different kinds of
> files. i.e. can this app open such a file, what is the default app that opens
> this kind of files, etc.
> 
> On windows, these two are handled (mainly) by using extensions. So, two .txt
> files would always have the same icon and default app. If we start having gio
> use the full xdg sniffing machinery it will claim that "CMakeLists.txt" is a
> "text/x-cmake" and "readme.txt" is a "text/plain", then what will we return
> from g_app_info_get_recommended_for_type()?  Do we suddenly want to be able to
> open different apps for these? Thats not how win32 typically works. Same for
> icons, do we want to not show the "standard" registred file icon for a file if
> we sniff it to be of a more specific type? That is also nontypical for windows
> apps.

It's not typical, it's more advanced, i'd say. The flip side of the unusual behaviour is that it will be able to correctly open files that have wrong extensions. Also, it will be able to show different icons for files that don't have extensions - something that W32 is normally incapable of doing. I'd call it a feature, not a bug.

> 
> Not to mention that i'm not sure how the implementation would work. The current
> patch doesn't change gwin32appinfo.c at all, so it still takes the content type
> and looks for it in the registry.
GWin32AppInfo will be changed in bug 666831.

> Now that the content type is a mimetype
Content type is still an extension, glib is just better able to reason about it, and does not need to rely exclusively on W32 registry for info about files.

> this
> fails and we will not be able to open any files.
That is not true. It will open files. The .ext -> mimetype -> .ext conversion involved in the process *may* not work, but in that case there's still the registry to look up info in.

> Same with icon, query_info
> will return a mimetype and the win32 g_content_type_get_icon implementation
> will fail to look up the registred icon for the extension of the file.
That is not trye. query_info() will return a file extension, which it will obtain by converting mimetype, which it will obtain by sniffing the file contents and/or file name. If that fails, previous implementation will kick in.

> 
> So, the change will break application associations and registred
> file-type-icons on win32. The advantage is that you get more "detailed" icons
> and content types.
Your statement about the advantage is correct, your statement about the disadvantage is not correct. At most the behaviour would be somewhat different (i.e., for example, opening files without extensions using different programs, showing different icons for them), but not broken.
> 
> So, what exactly is the goal here? Who is the target for this change? Do you
> want more kinds of icons in the file selector? Do you want more details on the
> files so you can pick e.g. highlighting mode for your editor?

To be honest, this patch is a followup for bug 666831. At some point i needed to implement the "give me all apps that handle this content type" function, and i was somewhat disappointed that "content type" is just an extension and that more advanced functionality (like "give me fallback application for that content type") is highly unlikely to ever work.

> 
> I can understand there being situations where having the full power of xdgmime
> in things like sniffing file types, but I don't think we want to lose the
> platform integration that content types gives us. I think it might make more
> sense to add a new api that exposes the full mimetype machinery, and a new
> mime-type property for GFileInfo. Then applications that need the more
> finegrained, but not-platform-integrated type can use that.
If you have any specific cases of breakable contenttype-related behaviour in mind, i can test glib/gedit/evince/gtk-widget-factory/gtk-demo with this patch applied and ensure that results are the same or better (or not, if your assertions are correct and mine were not). I will be able then to explain what exactly happens under the hood in each case, and how is it that it works (or doesn't work). Would that be more productive? Because right now it seems to me that you have not judged correctly the functionality this patch provides.

There are some tests i can do easily. For example, by applying the patch from bug 666831 i can easily get my hands on a hash table that contains all file extensions that Windows registry knows *anything* about. I can then run various gcontenttype routines on them (because, as i have said, contenttype string on W32 is still an ".ext", not a mime type), dump the results, then do the same without this patch applied, then presents the two dumps for comparison. I did do something like this once already (this is how i've decided that reading <root>/etc/mime.types is a bad idea).

Comment 19 Alexander Larsson 2014-09-11 16:41:08 UTC

I think there will be a lot of corner cases where the mapping between
.ext and mimetype will not work, because multiple mimetypes can have
the same glob, and each mimetype can have multiple globs.

For instance, say we have a file called foo.dot, and we try to
calculate its content type. First we guess it, and there are two *.dot
globs, application/msword-template and text/vnd.graphviz, so we will
return the first one and set uncertain to TRUE. Say we happened to
return text/vnd.graphviz (not sure what will happen right now, but one
could tweak the glob weight to make this happen). Now comes the first
problem, that mimetype has two globs. We pick the first, which is
.gv. This means we return ".gv" for "foo.dot" on the first round.

This is the first problem: If we later want to look up the registred
app for this in the registry we will look at .gv, but the file was
.dot. These may have different apps registred.

However, since we returned uncertain=true (means there were multiple
alternatives) from guess() this triggers a content sniff in
query_info() and a new guess based on file data, which might instead
decide on the mimetype "application/msword-template" due to the extra
info, and we return the (only) extension for that, i.e. ".dot".

The second problem is when we later feed this return value to
e.g. g_content_type_get_mime_type, or g_content_type_get_description.
It will try to convert the .ext form to a mimetype, but at this time
we have no sniffing data, so the ".dot" will be converted to graphviz
again, rather than the actual value we decided on (msword).

Also, problems arise if we have a file with an extension unknown to
xdg, we then return NULL when guessing the content type, rather than
something based on the extension, so we can't later open files for this
type.

Basically, since the system application mapping is based on extensions
and there is no simple one-to-one mapping between these and xdg
mimetypes I think the applcation mapping will always fail in some weird
way. So, that means I think we need content-type to continue be purely
.ext based. That doesn't mean we can't *also* calculate a mime type
property and use this for non-application mapping features. For instance
it can be used to pick up better icons, or apps can use it if it needs
a more detailed file type info (say for syntax highlighting).

Such a solution would not be able to handle open a file that sniffed
to the correct mimetype but has the wrong extension, but that is not what
is expected on win32 anyway, and a solution that tried to implement this would
run afoul of the problems above, which i think is worse...

Comment 20 LRN 2014-09-11 19:01:25 UTC

So, we have two possible solutions. Well, three, actually.

1) Don't do anything. Simple.

2) Do as you say - use mime type machinery only for some cases, such as picking better icons or supplying mimetype data for apps that know what they want.

3) Move the .ext/mimetype conversion boundary closer to the client: make mimetype *the* content type on W32. APIs that are aware of the fact that apps are associated with *extensions*, not mimetypes, will be able to do the conversion , the same way as it's done now. This way mimetype will be used in more places and kept around in many cases, preventing one of the situations you've described (where mimetype was lost after converting it back to extension, which negatively affected further results). I'm not entirely sure how this would work, that's just an idea.
This will break user code that uses gcontenttype, since it expects to work with extensions, but will get mimetypes instead. But then i've been hearing a lot lately that gcontenttype is not used much by anything other than GAppInfo, and appinfo is up for modification anyway, so that wouldn't be a problem.
If you are willing to consider such a move, i'm willing to prototype it and see how it can be implemented, which APIs will be affected, what the results will be, etc.

Ultimately, it's up to you. Desrt wants the solution #1, but is willing to listen to your arguments. I'd like to at least see what #3 would look like before abandoning it.

Comment 21 LRN 2015-04-30 23:15:35 UTC

GContentType strikes again: https://bugzilla.gnome.org/show_bug.cgi?id=748727

Comment 22 LRN 2015-04-30 23:23:09 UTC

After rereading last comment from alex, i came up with another possibility.

4) Change the definition of content type for W32: instead of ".ext" make it something like ".ext\\mime/type" (that is, file extension, followed by a backslash, followed by the mimetype). '\\' is not mandatory, just the first thing that i came up with; could also use ';' or anything else that can't be in a file name (and this in an extension) and in mimetype.
Modify code that works with content type to be aware of this format. If it needs extension, take the left part. If it needs mime type, take the right part. This way mimetype won't be lost when content type is passed around, but extensions can still be used as mime types (no '\\' and empty right part).

Comment 23 LRN 2015-05-01 00:09:34 UTC

Dug this a bit.
IETF has at least 2 definitions for content type. This one:

       type-name = reg-name
       subtype-name = reg-name

       reg-name = 1*127reg-name-chars
       reg-name-chars = ALPHA / DIGIT / "!" /
                       "#" / "$" / "&" / "." /
                       "+" / "-" / "^" / "_"

and this one


     type-name = restricted-name
     subtype-name = restricted-name

     restricted-name = restricted-name-first *126restricted-name-chars
     restricted-name-first  = ALPHA / DIGIT
     restricted-name-chars  = ALPHA / DIGIT / "!" / "#" /
                              "$" / "&" / "-" / "^" / "_"
     restricted-name-chars =/ "." ; Characters before first dot always
                                  ; specify a facet name
     restricted-name-chars =/ "+" ; Characters after last plus always
                                  ; specify a structured syntax suffix

There's another one in Section 5.1 of [RFC2045], but it's way too old, too lax and seems to be specifically for e-mail.

Reserved filename characters according to MSDN[1]:
-----------
The following reserved characters:

    < (less than)
    > (greater than)
    : (colon)
    " (double quote)
    / (forward slash)
    \ (backslash)
    | (vertical bar or pipe)
    ? (question mark)
    * (asterisk)

Integer value zero, sometimes referred to as the ASCII NUL character.
Characters whose integer representations are in the range from 1 through 31, except for alternate data streams where these characters are allowed. For more information about file streams, see File Streams.
Any other character that the target file system does not allow.
-----------

The MSDN list appears to be correct. A KB[2] also notes that most of these restrictions are API-based, and that only ':' is truly incompatible with NTFS itself.
This character is not listed as valid for content type in newer RFCs either (not sure what's the status for the old RFC).

Therefore the proposal is to re-define content type on W32 to be:
.ext           (no mime/type is assigned)
.ext:mime/type (mime/type is assigned)
.ext:          (mime/type determination failed, i.e. what would be NULL content-type on *nix; could also tell the API to do not try to detect the type again?)
:mime/type     (no extension, mime/type is assigned; in particular, directories will be ":inode/directory" as they have no extension)
NULL           (no extension, mime/type determination failed)

This would not break API/ABI (since content type remains a string), will break users that assume that content type is an extension.

5) An alternative like (4), but it involves defining an actual content type type (i.e. GContentType) that is not an alias for "char*", but is a struct that contains both extension and mime/type. This will break API and ABI (the upshot is that this will cue users that use the gcontenttype API about the changes; In (4) and (3) they will be screwed up quietly).

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx#naming_conventions
[2] https://kb.acronis.com/content/39790

Comment 24 LRN 2015-05-01 19:51:01 UTC

Dug up my months-old patches and adapted to the new concept. Works, more or less, although i haven't been able to make some functionality work correctly (probably because i've rebased the patches to exclude fnmatch, which obviously hampered some of the functionality xdgmime has; i haven't replaced it with anything, though i probably should do that eventually, as no one else is likely to help me with this).

The real problem, something that my solution (3) wouldn't have had, is that this not only breaks the code that assumed content types to be extensions (W32 code), but also breaks the code that assumed content types to be mime/types (everything else). That includes gtk appchoser, recent manager and filesystem filters, inspector. On glib side - glocalfileinfo at the least. And gwinhttpsfile. Believe it or not, it calls g_file_info_set_content_type() and gives it a mime/type!

The adjustment that they'll have to make is to create a contenttype object (a string) from mime/type and use that for the operations they perform, instead of using a static string. Quarks could be used to avoid storing these pretty much constant strings in local variables and keeping track of them. An optimization in which an ifdef G_OS_WIN32 macro will just prepend ":" to string constants is also an option.

Another way is to add a check for '/' to w32 contenttype code, and assume that a string that does not contain ':', but does contain '/' is, in fact, a pure mime/type. That will probably be preferred by everyone, and i currently can't think of any downside of using this (other than complicating the w32 code a bit more, not that it's simple as it is).

Comment 25 LRN 2015-05-02 23:38:12 UTC

Created attachment 302778 [details] [review]
Make xdgmime more portable

v4:
- Added a simple fnmatch() implementation, made xdgmime use it
- Merged binary mode patch into this patch
- Since cache use is eventually disabled, obsoleted both xdgcache patches

Comment 26 LRN 2015-05-02 23:40:01 UTC

Created attachment 302779 [details] [review]
W32 GContentType rewrite

v5:
- Uses the new ".ext:mime/type" content type
- Thread-safe (probably)
- Merged the copyright notice patch into this patch

Comment 27 LRN 2015-05-02 23:41:14 UTC

Created attachment 302780 [details] [review]
xdgmime: Finer handling for cases where mmap() is not available

Alternative cache patch aimed to make cache-less xdgmime behive nicely.

Comment 28 LRN 2015-05-02 23:45:20 UTC

I'm not very happy with sfnmatch, as i can't prove its correctness, and it is very lax (for example, it matches '?' to itself and doesn't even warn about it), but it works for me.
I thought of making a glob->regex conversion and using GRegex, but wasn't even sure it's the right thing to do.

Win32AppInfo will need patching on top of my appinfo rewrite to support new content type (right now it thinks that content type is ".ext"). And that's probably it, everything else should work even if it passes "mime/type" as content type.

Comment 29 LRN 2015-05-02 23:50:52 UTC

Created attachment 302781 [details] [review]
Enable contenttype test on W32, tweak it to pass (mostly)

This version is somewhat cleaner, as it doesn't need to play with the
types to make the tests pass, just provide the right conditions (a real
directory). Not sure what to do with the PowerPoint/Gettext test. Maybe
replace qsort() with something else?

Comment 30 LRN 2015-05-02 23:57:51 UTC

Created attachment 302782 [details] [review]
W32 GContentType rewrite

v6
- Fixed a bug (->mime not initialized to NULL)
  This is triggered by gedit (which has to be fixed separately, by the way).
  With this change code shouldn't crash at least

Comment 31 Fan, Chun-wei 2015-05-04 07:31:56 UTC

Review of attachment 302778 [details] [review]:

Hi LRN,

Thanks for this, however some comments to follow here:

With blessings, thank you!

::: gio/xdgmime/sfnmatch.c
@@ +1,1 @@
+/*

I hope it is okay for the items here (i.e. in gio/xdgmime/) to depend on GLib items, as it seems that before this patch there is intentionally no dependence on GLib items... Might need someone to answer this question though.

::: gio/xdgmime/xdgmime.c
@@ -42,2 @@
 #include <sys/time.h>
 #include <unistd.h>

For unistd.h and sys/time.h, we don't have them everywhere, so we need to include <windows.h> (for struct timeval) on Windows instead (or at least for Visual Studio builds).

Since we are trying to use gettimeofday(), which is not available everywhere, we can, for now at least, do gettimeofday() on non-Windows and do what is done in g_get_current_time() (in gmain.c), or just use g_get_current_time() if using GLib items are okay here, to get the same behavior on Windows, as this is what that GLib API is meant to do, being a portable wrapper for gettimeofday().

@@ -46,2 @@
 typedef struct XdgDirTimeList XdgDirTimeList;
 typedef struct XdgCallbackList XdgCallbackList;

After the typedef's, #define's and #include's, we need to have extern extern _xdg_glob_hash_lookup_mime_type (...) to avoid implicit declaration of (i.e. C4013) warnings/errors.

::: gio/xdgmime/xdgmimecache.c
@@ -33,2 @@
 #include <fcntl.h>
 #include <unistd.h>

Use io.h on Windows builds.  unistd.h is a wrapper for io.h on Windows anyways.

@@ -41,2 @@
 #ifdef HAVE_MMAP
 #include <sys/mman.h>

The #warning after this line does not play well on some compilers.  Please use #pragma message (...) for Visual Studio (_MSC_VER) builds.

::: gio/xdgmime/xdgmimemagic.c
@@ -801,1 @@
   FILE *magic_file;

For this file, include io.h instead of unistd.h on Windows.

Comment 32 Fan, Chun-wei 2015-05-04 07:35:03 UTC

Review of attachment 302782 [details] [review]:

::: gio/glocalfileinfo.c
@@ +1243,1 @@
       if (!fast && result_uncertain && path != NULL)

There is a ssize_t that is enabled by this patch.  Please change it to gssize so that all compilers wouldn't get upset.

Comment 33 LRN 2015-05-04 07:53:55 UTC

(In reply to Fan, Chun-wei from comment #31)
> Review of attachment 302778 [details] [review] [review]:
> 
> ::: gio/xdgmime/sfnmatch.c
> @@ +1,1 @@
> +/*
> 
> I hope it is okay for the items here (i.e. in gio/xdgmime/) to depend on
> GLib items, as it seems that before this patch there is intentionally no
> dependence on GLib items... Might need someone to answer this question
> though.

Yes, this was somewhat concerning, but eventually i've decided to introduce a glib dependency anyway. It's just not productive to not use glib when it's right here ...

> 
> Since we are trying to use gettimeofday(), which is not available
> everywhere, we can, for now at least, do gettimeofday() on non-Windows and
> do what is done in g_get_current_time() (in gmain.c), or just use
> g_get_current_time() if using GLib items are okay here, to get the same
> behavior on Windows, as this is what that GLib API is meant to do, being a
> portable wrapper for gettimeofday().

g_get_current_time() is probably the right thing to use.

Comment 34 LRN 2015-05-05 01:50:32 UTC

Created attachment 302904 [details] [review]
Make xdgmime and glocalfileinfo more compatible with MSVC

* Use #include <io.h> and <time.h> on W32, not <unistd.h> and <sys/time.h>
* Use g_get_real_time() and gint64, not gettimeofday() and time_t
* Add _xdg_glob_hash_lookup_mime_type() prototype to the header
  to avoid implicit declarations
* Only use #warning when compiling with GCC
* Use gssize instead of ssize_t in glocalfileinfo

Comment 35 LRN 2015-05-05 02:17:06 UTC

Created attachment 302905 [details] [review]
W32 GContentType rewrite

v7:
* Forgot to `git add gcontenttyle-unix.c' :-(

Comment 36 Ignacio Casal Quinteiro (nacho) 2015-05-17 13:59:19 UTC

Review of attachment 302905 [details] [review]:

See initial nitpicks

::: gio/gcontenttype-win32.c
@@ +4,3 @@
  *
  * Copyright (C) 2006-2007 Red Hat, Inc.
+ * Copyright (C) 2014 LRN

put your name

@@ +20,3 @@
  *
+ * Authors: Alexander Larsson <alexl@redhat.com>
+ *          LRN <lrn1986@gmail.com>

also here

@@ +126,3 @@
+  else
+    {
+      if (colon == type)

just do ext = (colon == type) ? ... : ...

@@ +186,3 @@
+{
+  if (t->mime)
+    {

you could remove the braces in single statements

::: gio/xdgmime/xdgmime.h
@@ +93,3 @@
+/* Note the gchar. Strings are allocated by glib and must be freed with g_free() */
+int          xdg_mime_get_file_exts_from_mime_type (const char *mime_type,
+                                                    gchar      *file_exts[],

const gchar *?

::: gio/xdgmime/xdgmimeglob.c
@@ +581,3 @@
+          (*buffer_len) += 1;
+        }
+      n += _xdg_glob_hash_node_lookup_mime_type (glob_hash_node->child, mime_type, buffer, buffer_len, buffer_size, &file_exts[n], n_file_exts - n);

split this line in the parameters

@@ +588,3 @@
+
+  if (glob_hash_node->next)
+    n +=_xdg_glob_hash_node_lookup_mime_type (glob_hash_node->next, mime_type, buffer, buffer_len, buffer_size, &file_exts[n], n_file_exts - n);

also split here

@@ +603,3 @@
+
+static int
+ucs4cmp (xdg_unichar_t *a, xdg_unichar_t *b)

split params

@@ +624,3 @@
+
+static int
+filter_out_ext_dupes (ExtWeight exts[], int n_exts)

split params

Comment 37 LRN 2015-05-18 11:52:45 UTC

Created attachment 303515 [details] [review]
Fix nitpicks for attachment 302905 [details] [review]

Comment 38 Alexander Larsson 2015-05-27 14:42:29 UTC

Review of attachment 302778 [details] [review]:

::: gio/gcontenttype.c
@@ +650,3 @@
+
+          if (basename)
+            basename_utf8 = g_filename_to_utf8 (basename, -1, NULL, NULL, NULL);

I worry a bit about this breaking sniffing on invalid utf8 filenames that do have a utf8 extension. On the other hand, the xdg code (imho wrongly) assumes paths are utf8, so its hard to avoid.

::: gio/xdgmime/sfnmatch.c
@@ +20,3 @@
+
+#include <glib.h>
+#include <string.h>

This implementation seems rather inefficient, doing allocations and whatnot.
Why not just steal some existing implementation, like e.g. the freebsd one:
  http://fxr.watson.org/fxr/source/libkern/fnmatch.c

Comment 39 Alexander Larsson 2015-05-27 14:44:34 UTC

Review of attachment 302780 [details] [review]:

looks ok

Comment 40 Alexander Larsson 2015-05-27 14:45:17 UTC

Review of attachment 302781 [details] [review]:

ok

Comment 41 Alexander Larsson 2015-05-27 15:07:05 UTC

Review of attachment 302905 [details] [review]:

::: gio/xdgmime/xdgmimeglob.c
@@ +629,3 @@
+  int i, j;
+
+  last = n_exts;

As you're sorting the list anyway, why not do that first, making the filtering much easier and more efficient.

Comment 42 Alexander Larsson 2015-05-27 15:07:07 UTC

Review of attachment 302905 [details] [review]:

::: gio/xdgmime/xdgmimeglob.c
@@ +629,3 @@
+  int i, j;
+
+  last = n_exts;

As you're sorting the list anyway, why not do that first, making the filtering much easier and more efficient.

Comment 43 LRN 2015-05-27 16:12:58 UTC

(In reply to Alexander Larsson from comment #38)
> Review of attachment 302778 [details] [review] [review]:
> 
> ::: gio/gcontenttype.c
> @@ +650,3 @@
> +
> +          if (basename)
> +            basename_utf8 = g_filename_to_utf8 (basename, -1, NULL, NULL,
> NULL);
> 
> I worry a bit about this breaking sniffing on invalid utf8 filenames that do
> have a utf8 extension. On the other hand, the xdg code (imho wrongly)
> assumes paths are utf8, so its hard to avoid.

Yes, this is a bit difficult to do right - one has to ensure that all call either use filesystem encoding or utf8 and convert back and forth as needed. The problem is that on W32 filesystem encoding *is* utf8, so i can't just test it on non-latin1 filenames and see what happens.

> 
> ::: gio/xdgmime/sfnmatch.c
> @@ +20,3 @@
> +
> +#include <glib.h>
> +#include <string.h>
> 
> This implementation seems rather inefficient, doing allocations and whatnot.
> Why not just steal some existing implementation, like e.g. the freebsd one:
>   http://fxr.watson.org/fxr/source/libkern/fnmatch.c

Yes, i don't like it too much. If you remember, i've filed a bug for adding fnmatch implementation to glib, and it was refused. The consensus was that string matching needed for xdg is so rudimentary that we could get away with simpler code that supports only the feature needed by xdg. Though my code is probably not a lot simpler than, for example, the one you linked.

Comment 44 Alexander Larsson 2015-05-28 07:05:39 UTC

(In reply to LRN from comment #43)
> 
> Yes, i don't like it too much. If you remember, i've filed a bug for adding
> fnmatch implementation to glib, and it was refused. The consensus was that
> string matching needed for xdg is so rudimentary that we could get away with
> simpler code that supports only the feature needed by xdg. Though my code is
> probably not a lot simpler than, for example, the one you linked.

Well, i think it would be fair to have a configure check for fnmatch and if its not there use an in-tree implementation like the bsd one above, thats not really the same as exporting it as public api.

Comment 45 Alexander Larsson 2015-05-28 08:51:18 UTC

So, I'm not sure i like the ".ext:x/mimetype" thing. Its quite complex, and it differs wildly from what we do on unix, so little existing unix code will get this right.

Lets take a step back and re-think this. 

First of all, we're on windows, and the native way to talk about file types are extensions. For instance, you can't realistically have two apps installed that both handle "*.foo", and we resolve which one to launch based on content sniffing. So, anything that goes from filename => ... => launch app must preserve the actual extension somehow in the content type.

However, it would be nice to be able to use the shared mimetype info to show more detailed information about a file. For instance, we may want to show a better icon in the file selector for say a makefile or a shell script, and gedit may want to select the highlighting mode based on more context. This is fine as long as we don't lose the extension along the way.

This is where things get complicated. With the shared mime type system, each mimetype can map to multiple (or no) extensions (of different priorities), and each extension could be one of multiple mimetypes. In the unix backend we generate "application/x-extension-XXX" mimetypes if you set a custom app for a filetype that does not exist in the mime info db, so conceptually we could say that all extensions map to at least *some* mimetype (even if synthetic). Additionally, mimetypes are in a hierarchical structure in the database, and can have aliases.

According to the mime specs (http://tools.ietf.org/html/rfc2045 and http://tools.ietf.org/html/rfc2046) a valid mimetype can have a parameter, such as "text/plain; charset=iso-8859-1". We don't currently handle this, but we could easily support that in general (mostly by ignoring whats after the semicolon).

First of all, lets define a single (optional) default extension for each mimetype. This will be something like the first glob of the highest weight registered for the unaliased mimetype, or if none, recursively for the parents. There might be none, or it is one more or less randomly picked if there are many, but at least it is consistent amongs all users of the same mime db.

So, lets say we're sniffing a file called "filename.foo". There are various cases that we can end up with:

1) We know nothing about the type. So we assign it an application/x-extension-foo mimetype, which lets us properly decode the extension when launching an app on the file.
2) We sniff (via name and contents) a mimetype app/foobar-file from the database, which has a default extension of ".foo". Then we return app/foobar-file as the mimetype, and we can look up the extenstion from the database to properly launch the app, and we can also do other kinds of mime operations like is_a, equals, get_icon, etc.
3) We sniff a mimetype "app/foobar-file", but it has no default extension, or it is not ".foo".  In this case we return "app/foobar-file; extension=foo". If we teach the content type functions to ignore anything after the semicolor this would still allow you to get icon/description and hiearchy information. We can also teach the win32 implementation of g_app_info_get_recommended_for_type(), etc to look at the extension parameter when looking in the registry.

There is a further complication here which is that we may get a mimetype from somewhere else (say sniffing with no filename) which has no extension parameter. In that case we will in some cases not be able to find a app handler in the registry since we don't know the real extension. I think in this case we could unalias and walk the mimetype hierarchy looking for extensions, trying them until there is none more to try. Probably as good as any.


So, to do here:
* Make sniffing with no match return application/x-extension-XXX
* Make content type function ignore anything after ';'
* Create get_default_extension_for(mimetype) helper which decodes application/x-extension-foo and extension parameters first, and if that fails, looks in the database.
* On win32, set the extension parameter when sniffing if it is not the default extension for the type.
* On win32, Use get_default_extension_for() to look up the extension when launching an app on a type.

Comment 46 LRN 2015-05-28 10:10:09 UTC

(In reply to Alexander Larsson from comment #45)
> So, to do here:
> * Make sniffing with no match return application/x-extension-XXX
> * Make content type function ignore anything after ';'
> * Create get_default_extension_for(mimetype) helper which decodes
> application/x-extension-foo and extension parameters first, and if that
> fails, looks in the database.
> * On win32, set the extension parameter when sniffing if it is not the
> default extension for the type.
> * On win32, Use get_default_extension_for() to look up the extension when
> launching an app on a type.

Sounds reasonable.

One wrinkle (that isn't new, just something i didn't focus on) is that the mime lookup function that i wrote (_xdg_glob_hash_lookup_mime_type) doesn't work with mime cache. If it is to be working on non-W32 as well as W32 (which is the impression i've got), this should be fixed. I could try doing that, but maybe there's someone else who actually understands mime cache?

Comment 47 LRN 2016-05-17 07:02:29 UTC

Decided to give it a go. Current results are available at git://lrn.no-ip.info/git/glib , branch lrn/xdgmime3 (i will rewrite its history anytime i want, so keep that in mind)

Changes are more fine-grained (currently totaling 16 patches), hence the necessity of a public branch instead of attaching my patches here.

Some highlights:
1) Portability patch is mostly the same (i left utf-8 contortions in place; later on someone will have to test this on a system where utf-8 is not the filesystem encoding).
2) Fixed a bug in sfnmatch. By the way, how about using a slice allocator there?
3) Wrote a function that is able to compare content types "foo/bar; ext=.baz" correctly (ignores everything after ';')
4) Wrote a function that is able to parse "foo/bar; ext=.baz" to get the extension (parser is a state machine; it might not handle some corner-cases related to valid/invalid characters, especially in quoted strings)
5) Still no unescaper (quoted strings can have escaped characters, as per RFC)
6) Functions that look for file extension know to check for the ext=... parameter, and also understand synthetic "application/x-extension-ext" types
7) Functions that look up mime/types return synthetic types when file extension is known, but mime/type isn't detected
8) W32 implementation has a special case for content type that starts with '.' - it considers such types to be extensions and uses old fallback functionality for them. At the moment i can't say how correct this falling-back procedure is, as contenttype test does not check this. Otherwise W32 implementation is similar to my previous attempts (i.e. if mime/type-based result is uncertain, try the fallback).
9) Tweaked the handling of empty caches. Again. Now xdgmime checks that cache is actually non-empty before using it. I think this is it. The only thing left is to verify that xdgmime *still* re-loads the cache when it's changed.
10) Changed the logic used to guess content type.
Old logic:
* Find possible content types based on filename alone
* If only one match is found, return that
* Sniff content type based on file contents alone
* If no filename-based matches are found, use type sniffed from file contents
* Otherwise loop through the filename-based matches and check their relation to the content-based match. If content-based match is a supertype for (or is equal to) a filename-based match, use this match. If none are, use the first filename-based match from the list.

New logic:
* Find possible content types based on filename alone
* If only one match is found, return that
* Sniff content type based on file contents, providing it with a list of filename-based matches to eliminate
  * The sniffer loops through all magics and checks the data against them. If a magic doesn't match the data, and the corresponding mime/type is in the list of filename-based matches, remove that entry from the list
* If no filename-based matches (the ones that survived the elimination) are found, use type sniffed from file contents
* Otherwise loop through the remaining filename-based matches and check their relation to the content-based match. If content-based match is a supertype for (or is equal to) a filename-based match, use this match. If none are, use the first filename-based match from the list.

The new logic ensures that the code will not misclassify well-known file types.
For example, it will not guess that file foobar.pot consisting of 7 bytes "Abc Def" is a gettext template file, because we do have a magic for matching gettext template files, and "Abc Def" is not a match.
Old logic would have found it to be a gettext template, because it has .pot extension and its contents are vaguely text-ish (text/plain is a supertype for gettext template -> match).
It's still free to misclassify unknown types (the foobar.pot above will be classified as Powerpoint presentation based on its extension and the fact that the only other .pot candidate (gettext template) was eliminated due to magic not matching it to file contents).

Note that "looping through all magics" is slower than "looping through all magics until one matches". Dunno if it's going to be a problem.

11) Jury-rigged the testuite to pass.
12) Implemented cache-based mime/type->extensions lookup (the one wrinkle i mention in previous comment). It's not very efficient, but it will do until shared-mime-info is updated to produce a new kind of cache.
13) W32 cache support is mostly the same. The difference is that it's now useful, as there's cache-based mime/type->extensions lookup.

Comment 48 LRN 2016-05-27 12:59:45 UTC

Fixed a bug in cache-based mime/type->extension lookup (didn't work with multiple caches correctly; i haven't yet investigated *why* i have multiple caches, i suspect that it uses the same cache twice).

Comment 49 GNOME Infrastructure Team 2018-05-24 16:59:20 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/922.