GNOME Bugzilla – Bug 111848
function to canonicalize file names
Last modified: 2018-06-08 17:33:17 UTC
glibc contains canonicalize_file_name() and realpath() The trivial port to glib is attached. there may be issues of win32, etc. to consider if it was added to glib.
Created attachment 16110 [details] port of canonicalize_file_name to glib
Thoughts: - I think textual canonicalization is frequently what you want instead of following symlinks. Symlinks in Unix are more frequently used for "make strange file system setup logical" rather than for shortcuts. If I go to /web/gtk on www.gtk.org, I don't want to see that as /gimp00/web-gtk/. And in bash cd /web/gtk. cd .. gives you /web not /gimp00. Perhaps a flags field with FOLLOW_SYMLINKS? (I wrote a textual canonicalizer for GtkFileSystemUnix recently) - Windows does need to be dealt with. Some of it is just changing '/' to G_DIR_SEPARATOR, but there are other things as well: - Conditionalizing the symlink stuff - Dealing with drive letters at the start of the string - Not sure what the situation is with maximum path lengths under Windows. It probably doesn't matter since the result can't get longer without symlinks. - Adding any other "windows appropriate canonicalization"
What I'm using the symlink canonicalization for is a cache of file contents (indexed by canonical name because I don't want to deal with having two cache entries for the same file, e.g. if changing that file; though of course hard links are also a problem).
In using this, it turned out to be useful to be able to canonicalize the filename even if the basename doesn't exist. i.e. if you have /foo/bar/baz still canonicalize that even with baz not existing. Perhaps all this is a bit specific to my weird code. ;-) Of course textual canonicalization is independent of file existence.
*** Bug 152751 has been marked as a duplicate of this bug. ***
For the record, this came up in libgsf where in order to get atomic updates we save to a temporary and a rename on top of the original. That requires expansion of the final path component because if it is a symlink it might cross disks.
Hmmm, I just wrote glade_util_canonical_path() in glade3, I could easily supply the code if desired (I'd just have refit the coding style)... My implementation basicly does: g_chdir ( g_path_get_dirname (path) ); g_get_current_dir (); g_chdir ( original dir); return g_build_filename (dest_dir, basename (maybe null), NULL); I decided it was a PITA to do any kind of algorythmic regexp on the path since I'd have to handle path representations on different platforms; and since libc's implementation of realpath does the same "pushd; pwd; popd" thing; I thought it was an acceptable approach. I guess ideally this should work with unavailable paths and provide an option to perform readlink() on systems bearing symlinks...
*** Bug 322822 has been marked as a duplicate of this bug. ***
Created attachment 156296 [details] [review] Implement g_canonicalize_filename I used the above patch and updated it. I removed the internal g_realpath, I removed the buffer argument, I removed the use of PATH_MAX in favour of pathconfig, I partly removed the requirement for the path to exist and I replaced / with G_DIR_SEPARATOR. Left to do is support for resolving non-existing relative paths and handling C:\ and SERVER:\\ on Windows.
If I read the code right, this actually changes "//foo" to "/foo". Those two refer to different files. (But "/foo//bar" is "/foo/bar".)
(In reply to comment #10) > If I read the code right, this actually changes "//foo" to "/foo". > Those two refer to different files. (But "/foo//bar" is "/foo/bar".) The forward slash is no valid character in a unix filename and at best "//foo" is invalid if you don't strip the additional slash.
welinder@sequoia:~> cd /tmp welinder@sequoia:/tmp> cd //tmp welinder@sequoia://tmp> cd ///tmp welinder@sequoia:/tmp> Three or more initial slashes is the same as one. Two is special: http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_11
"A pathname that begins with two successive slashes may be interpreted in an implementation-defined manner" I'm not really sure that clarifies why //tmp shouldn't be resolved to /tmp. For what I want, dash and mksh resolve //tmp to /tmp, unlike bash like in your example.
Created attachment 157503 [details] [review] Implement g_canonicalize_filename #2 This update adds support for DOS and UNC paths, ie C:\Foo, C:Foo and \\HOST\Share. The resulting filename uses the native dir separator and \ for UNC paths. Separators except the first of an absolute path can be / or \. Additional unit tests cover that. Right now the code is system agnostic. We could #ifdef G_OS_WIN32 or leave it. Left to do: the resolve_links argument.
To be consistent with other file name related GLib API, it presumably should take the name in the "GLib file name encoding" which is UTF-8 on Windows, and in that case it is wrong to concatenate what getcwd() fills in which is in "system codepage" with a relative path in UTF-8. (And if it really would be intended to on Windows take the argument in the "system codepage", then it would be wrong to step through the file name one char at a time, as the East Asian codepages are double-byte ones, i.e. most characters take two bytes, and the trailing byte of some of these characters is a slash or backslash!) For Windows, one probably often would wants this function to convert 8.3 names to long names if they exist. The second parameter could be a flag mask that directs whether to do that, or maybe the other way, in addition to resolving symlinks. Symlinks also exist on Windows, and also something similar called junctions. Even shortcuts can be considered similar in functionality, and it might be nice to have a flag to expand shortcuts, too. Also, on Windows and other systems with case-insensitive but case-preseving file systems (at least Mac OS X I think), maybe convert the case of characters in names of existing files to that actually stored in the file system? In general, complications like the above probably mean it's not worth to try to combine the same code for both Windows and Unix here.
*** Bug 615899 has been marked as a duplicate of this bug. ***
See https://bugzilla.gnome.org/attachment.cgi?id=158853 from bug 615899.
I agree with Tor, for Win32 we can use GetFullPathNameW as in the patch of comment 17.
*** Bug 629171 has been marked as a duplicate of this bug. ***
ping for patch of comment 17?
I think we should instead of just boolean resolve_symlinks have an argument of flag bits, or which G_WHATEVER_EXPAND_SYMLINKS would be an obvious one, and I as outlined in comment #15, we could come up with more, especially for Windows.
Another use case for this is in Camel, where we use realpath() to canonicalize the file path to a local mailbox, which may involve symlinks. Portability issues with realpath() and PATH_MAX were cited recently in bug #642783.
Created attachment 183124 [details] [review] updated patch of comment #17, using flags instead of boolean
Ping?
Notice that g_file_get_path () does (almost?) the same job already.
(In reply to comment #25) > Notice that g_file_get_path () does (almost?) the same job already. It doesn't, and if you mean g_file_resolve_relative_path(), it isn't quite it, as you still need to use g_file_read_link() to read the link destination.
*** Bug 394577 has been marked as a duplicate of this bug. ***
Created attachment 371460 [details] [review] fileutils: Add g_canonicalize_filename Getting the canonical filename is a relatively common operation when dealing with symbolic links. This commit exposes GLocalFile's implementation of a filename canonicalizer function, with a few additions to make it more useful for consumers of it. Instead of always assuming g_get_current_dir(), the exposed function allows passing it as an additional parameter. This will be used to fix the GTimeZone code to retrieve the local timezone from a zoneinfo symlink.
Created attachment 371461 [details] [review] glocalfile: Use g_canonicalize_filename Drop the local function in favor of the exposed one.
Created attachment 371462 [details] [review] timezone: Correctly resolve symlink from /etc/localtime The return value of g_file_read_link ("/etc/localtime") can be a relative path in the form of "../usr/share/zoneinfo". This breaks the prefix check that is performed, and makes the timezone identifier be "../usr/share/zoneinfo/America/Sao_Paulo", for example, which breaks other parts of the system. Fix that by canonicalizing the symlink path if we detect is it a relative path.
Created attachment 371474 [details] [review] timezone: Correctly resolve symlink from /etc/localtime Added better tests, added the new function to the doc reference and renamed a variable.
Created attachment 371475 [details] [review] fileutils: Add g_canonicalize_filename I obviously meant to upload this one.
Created attachment 371476 [details] [review] glocalfile: Use g_canonicalize_filename Reuploading to maintain Bugzilla patch order.
Created attachment 371477 [details] [review] timezone: Correctly resolve symlink from /etc/localtime Reuploading to maintain Bugzilla patch order.
(In reply to Georges Basile Stavracas Neto from comment #32) > Created attachment 371475 [details] [review] [review] > fileutils: Add g_canonicalize_filename > > I obviously meant to upload this one. You probably want to respin this one without using g_auto*, since glib is intended to be portable to compilers beyond GCC/Clang.
Created attachment 371493 [details] [review] fileutils: Add g_canonicalize_filename Drop g_autofree usage.
Review of attachment 371476 [details] [review]: Looks good pending review of the other patches.
Review of attachment 371477 [details] [review]: ::: glib/gtimezone.c @@ +448,3 @@ + /* Resolve relative path */ + if (!g_path_is_absolute (resolved_identifier)) Any reason to make this conditional? What if the symlink contains a non-canonical path? (Is that possible?)
Review of attachment 371493 [details] [review]: I was going to say that I think this should have a flags argument (as with previous approaches in thus bug, for example comment #23). However, I think it would be better to handle resolving symlinks (i.e. realpath() behaviour) in a separate function: • Resolving symlinks can fail, so such a function would need error handling. If you just want to canonicalise a path without doing I/O, you’d have to deal with the error handling for no reason. • There are various options about resolving symlinks, shortcuts, and case normalisation wrt the case actually stored in the file system (see comment #15). Those deserve more thought, and make things quite a bit more complex. • Keeping functions which touch the file system, and functions which don’t, separate seems like a good property to have (for much the same reasoning as with the error handling, above). • This patch is available now, uses code which has been tested in GIO for quite a while, and doesn’t need significant changes or in-depth review. So modulo the small changes suggested below, I think this looks good to get into GLib ASAP. ::: glib/gfileutils.c @@ +2480,3 @@ +/** + * g_canonicalize_filename: + * @file_name: (type filename): the name of the file The parameter is actually called `filename`. @@ +2481,3 @@ + * g_canonicalize_filename: + * @file_name: (type filename): the name of the file + * @relative_to: (type filename)(nullable): the relative directory Nitpick: The coding style is generally to put a space between annotations (`@relative_to: (type filename) (nullable): …`) Same for the `Returns` line below. Also, change the description to ‘the relative directory, or %NULL to use the current working directory’ to explain the nullability a bit. @@ +2484,3 @@ + * + * Gets the canonical file name from @filename, using @relative_to as the + * relative path if necessary. It would be good to clarify what ‘canonical’ means. i.e. All triple slashes turned into single slashes, all `..` and `.`s resolved, returned path is guaranteed to be absolute, symlinks are not followed, no file system I/O is done, etc. @@ +2490,3 @@ + * + * If @filename is an absolute path, @relative_to is ignored. Otherwise, + * @relative_to will be used as the relative to path. ‘used as the relative to path’ doesn’t convey much information. How about ‘prepended to @file_name to make it absolute’? @@ +2493,3 @@ + * + * Returns: (type filename)(nullable)(transfer full): a newly allocated + * string with the canonical file path, or %NULL When can you end up returning `NULL`? I can only see `NULL` being returned from the g_return_val_if_fail(), which we don’t document because it’s catching a programmer error. So I don’t think this should be NULLable. In that case, you can document the function as never failing, which is a nice property to have. You should probably also clarify that it will canonicalise paths even if they’re nonexistent. @@ +2500,3 @@ +{ + gchar *canon, *start, *p, *q; + gint i; guint @@ +2508,3 @@ + if (relative_to != NULL) + { + g_return_val_if_fail (g_path_is_absolute (relative_to), NULL); I’d move this to the top of the function as: g_return_val_if_fail (relative_to == NULL || g_path_is_absolute (relative_to), NULL); Allowing it to be non-absolute if @filename is absolute is a bit too relaxed. @@ +2509,3 @@ + { + g_return_val_if_fail (g_path_is_absolute (relative_to), NULL); + cwd = g_strdup (relative_to); You could avoid the allocation here by having separate `cwd` and `cwd_allocated` variables. gchar *cwd_allocated = NULL; const gchar *cwd; if (relative_to != NULL) cwd = relative_to; else cwd = cwd_allocated = g_get_current_dir (); ::: tests/testglib.c @@ +850,3 @@ + { "/double//dash", "../../foo/bar", "/foo/bar" }, + { "/usr/share/foo", ".././././bar", "/usr/share/bar" }, + { "/foo/bar", "../bar/./.././bar", "/foo/bar" }, Would be good to have a test of the triple-slash and double-slash behaviour from comment #12. e.g. { "/etc", "///triple/slash", "/triple/slash" }, { "/etc", "//double/slash", "//double/slash" }, { "///triple/slash", ".", "/triple/slash" }, { "//double/slash", ".", "//double/slash" }, @@ +921,3 @@ g_printerr ("ok\n"); + + if (g_test_verbose()) Nitpick: Missing space before `(` (but this might be copypasta from elsewhere in the file; this test file is pretty crufty). Same below. (Don’t worry about cleaning anything else in the file up. I need to go through and port it all to GTest at some point.)
Created attachment 371560 [details] [review] fileutils: Add g_canonicalize_filename Applied review, thanks Philip.
Comment on attachment 371560 [details] [review] fileutils: Add g_canonicalize_filename The patch is missing one change.
Created attachment 371561 [details] [review] fileutils: Add g_canonicalize_filename This is the correct patch, with more tests, the documentation improvements and et cetera.
Review of attachment 371561 [details] [review]: One more comment to address to make sure we get full coverage of the documented behaviour of the function, then this is good to commit. Thanks :-) ::: tests/testglib.c @@ +856,3 @@ + { "/etc", "//double/slash", "//double/slash" }, + { "///triple/slash", ".", "/triple/slash" }, + { "//double/slash", ".", "//double/slash" }, Could do with a test where @relative_path is %NULL, to verify that the g_get_current_dir() control path is working correctly. Probably best to do as a separate test (rather than putting it in canonicalize_filename_checks) so you can compare the output against g_get_current_dir().
Review of attachment 371561 [details] [review]: I’ll fix these issues and push. ::: tests/testglib.c @@ +937,3 @@ + for (i = 0; i < n_canonicalize_filename_checks; i++) + { + g_autofree gchar *canonical_path = g_canonicalize_filename (canonicalize_filename_checks[i].relative_path, And the g_autofree has to go from here. :-(
(In reply to Philip Withnall from comment #38) > Review of attachment 371477 [details] [review] [review]: > > ::: glib/gtimezone.c > @@ +448,3 @@ > > + /* Resolve relative path */ > + if (!g_path_is_absolute (resolved_identifier)) > > Any reason to make this conditional? What if the symlink contains a > non-canonical path? (Is that possible?) We discussed this in private messages and the only reason to have the conditional in there was to avoid an allocation; I’ve removed the conditional.
All pushed to master, with some tweaks as per my previous comments. Attachment 371476 [details] pushed as 0f37af7 - glocalfile: Use g_canonicalize_filename Attachment 371477 [details] pushed as cb32614 - timezone: Correctly resolve symlink from /etc/localtime Attachment 371561 [details] pushed as b9b642d - fileutils: Add g_canonicalize_filename
And a day-late happy birthday, bug 111848! Sad to say goodbye o/
Can support for ~ be added? Most operating systems resolve it to the home directory.