GNOME Bugzilla – Bug 489862
Basic URI operations
Last modified: 2018-05-24 11:08:41 UTC
URIs are very frequently manipulated these days. It would be convenient, and would prevent a lot of hacky, not-quite-right code, if GLib had functions to do things like:

* Check if an URI is absolute
* Resolve an URI relative to a base URI
* Check if an URI has a particular scheme
* Parse an URI into components
Don't forget escaping.
Escaping is at least partially covered by Alex's gurifuncs.h now; the rest could conceivably be added there.
> * Check if an URI has a particular scheme

Works with g_uri_parse_scheme () [since glib 2.16]

> * Parse an URI into components

I'd like to request this as well. Nautilus-Open-Terminal currently uses GnomeVFS [1] just for decomposing an URI into host name, port, user name and path. Additional helpers named like

  g_uri_parse_user_info ()
  g_uri_parse_host_name ()
  g_uri_parse_host_port ()
  g_uri_parse_path ()
  g_uri_parse_query ()
  g_uri_parse_fragment ()

would be nice. Is anyone interested in writing those? If not, I'd volunteer for doing so.

[1] http://git.gnome.org/cgit/nautilus-open-terminal/tree/src/nautilus-open-terminal.c?id=a883bd21b62065c54e22bb5e400b2aa01306a68f#n133
(In reply to comment #3)
> > * Check if an URI has a particular scheme
>
> Works with g_uri_parse_scheme () [since glib 2.16]

If you are not planning to parse the URI into components, it would be more useful to have

  gboolean g_uri_has_scheme (const char *uri, const char *scheme);

instead, because (a) it lets you save a malloc/free, (b) it doesn't require the caller to remember to use g_ascii_strcasecmp(), and (c) it gives us a little room to fudge around future URI syntax modifications. (Eg, http://tools.ietf.org/html/draft-wood-tae-specifying-uri-transports suggests "http++sctp://example.org/" for HTTP-over-SCTP. g_uri_has_scheme() could potentially recognize this as matching "http", while g_uri_parse_scheme() would require the app itself to gain new smarts. Of course, it's possible that this new syntax idea will be rejected.)

> > * Parse an URI into components
...
> Is anyone interested in writing those?

I'd suggest SoupURI (and its regression test) as a starting point. It also does the "Resolve an URI relative to a base URI" part. It is a tiny bit specialized for http URIs. To make it fully generic you'd want to:

1. Possibly not split the "userinfo" into username and password. (An older version of the RFC had username and password, but this is now deprecated, and at the generic syntax level you're not supposed to assume that it's split into those two subfields.) OTOH, doing this would mean you couldn't hide passwords when converting back to string form...

2. Remove the default port stuff and soup_uri_equal(). (Both are scheme-specific, and glib isn't going to know about every scheme, and it would be confusing IMHO to have them work correctly for some schemes but not others.)

3. Remove soup_uri_set_query_from_form(), soup_uri_set_query_from_fields(), and the just_path_and_query argument to soup_uri_to_string(), which are all about doing HTTP, not about parsing URIs.
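To make points (a) and (b) concrete, here is a minimal sketch of what such a check could look like in plain C. The name uri_has_scheme() is hypothetical, this is not an existing GLib function, and it deliberately ignores point (c), so "http++sctp:" would not match "http" here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only lowercase, so the result does not depend on the locale. */
static unsigned char
ascii_lower (unsigned char c)
{
  return (c >= 'A' && c <= 'Z') ? (unsigned char) (c + ('a' - 'A')) : c;
}

/* Hypothetical helper (not GLib API): returns true if `uri` starts with
 * exactly `scheme`, compared ASCII case-insensitively, followed by ':'.
 * Nothing is allocated. */
bool
uri_has_scheme (const char *uri, const char *scheme)
{
  size_t len, i;

  assert (uri != NULL && scheme != NULL);

  len = strlen (scheme);
  if (len == 0)
    return false;

  for (i = 0; i < len; i++)
    {
      /* A '\0' in `uri` never equals a scheme character, so this also
       * stops safely at the end of a short string. */
      if (ascii_lower ((unsigned char) uri[i]) !=
          ascii_lower ((unsigned char) scheme[i]))
        return false;
    }

  /* The scheme has to end right here: "http" must not match "https:". */
  return uri[len] == ':';
}
```

With this, uri_has_scheme ("HTTP://example.org/", "http") is true and uri_has_scheme ("https://example.org/", "http") is false. The function does not check that the rest of the string is a valid URI.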
>> Works with g_uri_parse_scheme () [since glib 2.16]

> If you are not planning to parse the URI into components, it would be more
> useful to have
>   gboolean g_uri_has_scheme (const char *uri, const char *scheme);
> instead, because (a) it lets you save a malloc/free, (b) it doesn't require the
> caller to remember to use g_ascii_strcasecmp(), and (c) it gives us a little
> room to fudge around future URI syntax modifications.

Thanks for your feedback, you really seem to be into URI handling.

When we discuss "future URI syntax modifications", we should discuss strictness of parsing. For instance, the current g_uri_parse_scheme() implementation just demands that the very beginning of the passed-in string is a valid scheme specifier, not that the whole string is a valid URI as such. However, if we actually demand that the passed-in URI is RFC 3986-compliant, we'd have to parse it as a whole, and not just its beginning.

My actual idea was to add an internally used function

  G_GNUC_WARN_UNUSED_RESULT gboolean
  g_uri_parse (const char *uri,
               char **scheme,
               char **user_info,
               char **host,
               guint *port,
               char **path,
               char **query,
               char **fragment);

which would parse the entire URI, optionally decompose it (of course only the passed-in valid pointers would be malloced) and have a wrapper for the _parse_foo() variants.

Talking about g_uri_has_scheme (), wouldn't

  #define g_uri_is_valid(uri) g_uri_parse(uri, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

and the user-written code

  g_uri_is_valid (uri) && (strncmp (uri, "scheme", strlen(scheme)) == 0)

be equivalent?

However, what about GVFS URIs? Are they all RFC 3986-compliant (i.e. a syntactical subset)? I remember Alex saying that they are not really comparable to classical URIs.

Best regards,
Christian Neumair
(In reply to comment #5)
> My actual idea was to add an internally used function
>
> G_GNUC_WARN_UNUSED_RESULT gboolean
> g_uri_parse (const char *uri,
>              char **scheme,
>              char **user_info,
>              char **host,
>              guint *port,
>              char **path,
>              char **query,
>              char **fragment);

Yeah, you definitely want to parse the URI fully. Not sure if the right way to do that is to parse it into multiple variables like that, or to have a struct for the decomposed form like SoupURI (and gvfs's internal GDecodedUri).

> Talking about g_uri_has_scheme (), wouldn't
>
> #define g_uri_is_valid(uri) g_uri_parse(uri, NULL, NULL, NULL, NULL, NULL,
> NULL, NULL)?
>
> and a the user-written code
>
> g_uri_is_valid (uri) && (strncmp (uri, "scheme", strlen(scheme)) == 0)
>
> be equivalent?

No, because you forgot to use g_ascii_strncasecmp() :), and because the scheme name might have "scheme" as a prefix but have additional letters after that (if you ask if the URI has scheme "http", you don't want "https" to match).

> However, what about GVFS URIs? Are they all RFC 3986-compliant (i.e. a
> syntactical subset)? I remember Alex saying that they are not really comparable
> to classical URIs.

IIRC, the problem is primarily with semantics, not syntax. Eg, with a "real" ftp URI, ftp://foo.com/bar.txt means "bar.txt in whatever the current directory is after you connect to the server", whereas in gvfs, it means "bar.txt in the root directory of the ftp server". (IIRC)

Also, I forget how gvfs deals with character encoding. I think it assumes/requires that everything is UTF-8. But that also reminds me that SoupURI doesn't deal with "IRI"s (Internationalized URIs), and gurifuncs might want to deal with that.
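As a rough illustration of the "struct for the decomposed form" option, the sketch below splits a URI reference into its five generic components using the non-validating regular expression from RFC 3986 Appendix B (with a trailing "$" added) and POSIX regex.h. UriParts, uri_split() and uri_parts_clear() are hypothetical names, not SoupURI, GDecodedUri or any proposed GLib API. The sketch stops at the authority level: it does not split out userinfo, host or port, and it does no %-decoding.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical decomposed form; a NULL member means "component absent". */
typedef struct
{
  char *scheme;
  char *authority;
  char *path;      /* "" when the URI has no path */
  char *query;
  char *fragment;
} UriParts;

/* Copy one regex submatch, or return NULL if the group did not match. */
static char *
dup_match (const char *s, const regmatch_t *m)
{
  size_t len;
  char *copy;

  if (m->rm_so < 0)
    return NULL;
  len = (size_t) (m->rm_eo - m->rm_so);
  copy = malloc (len + 1);
  if (copy != NULL)
    {
      memcpy (copy, s + m->rm_so, len);
      copy[len] = '\0';
    }
  return copy;
}

void
uri_parts_clear (UriParts *parts)
{
  static const UriParts empty = { NULL, NULL, NULL, NULL, NULL };

  free (parts->scheme);
  free (parts->authority);
  free (parts->path);
  free (parts->query);
  free (parts->fragment);
  *parts = empty;
}

/* Returns 1 on success, 0 on failure. Splitting only, no validation. */
int
uri_split (const char *uri, UriParts *parts)
{
  /* RFC 3986 Appendix B; groups 2, 4, 5, 7 and 9 are the components. */
  static const char pattern[] =
    "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?$";
  static const UriParts empty = { NULL, NULL, NULL, NULL, NULL };
  regex_t re;
  regmatch_t m[10];
  int matched;

  assert (uri != NULL && parts != NULL);
  *parts = empty;

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return 0;
  matched = (regexec (&re, uri, 10, m, 0) == 0);
  regfree (&re);
  if (!matched)
    return 0;

  parts->scheme = dup_match (uri, &m[2]);
  parts->authority = dup_match (uri, &m[4]);
  parts->path = dup_match (uri, &m[5]);
  parts->query = dup_match (uri, &m[7]);
  parts->fragment = dup_match (uri, &m[9]);
  return 1;
}
```

A caller would run uri_split (uri, &parts), read the members it needs, and finish with uri_parts_clear (&parts). Out-of-memory handling is left minimal here: a failed allocation simply leaves that member NULL.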
GVfs does not "deal" with character encoding. It assumes uris decode/encode into raw bytes, sidestepping the character encoding (although it does some charset encoding handling in the display name attribute handling, but that is an i/o function beside the raw uri handling).
Please can we have a real GObject GURI or something so we can have a ref counted object. It also makes a tonne of sense from a coherency POV now that we have a GFile. A proper GObject also lets us add (overridable) convenience functions analogous to the g_file_() ones; in particular: GInputStream* g_uri_read (...);
No, that is so entirely a different bug. This is about parsing and reassembling URIs. This API would be used by the API you talk about, but there is no reason to make them part of the same API, any more than we want g_basename and g_build_path to be part of GFile.
(In reply to comment #6)
> (In reply to comment #5)
> > My actual idea was to add an internally used function
> >
> > G_GNUC_WARN_UNUSED_RESULT gboolean
> > g_uri_parse (const char *uri,
> >              char **scheme,
> >              char **user_info,
> >              char **host,
> >              guint *port,
> >              char **path,
> >              char **query,
> >              char **fragment);
>
> Yeah, you definitely want to parse the URI fully. Not sure if the right way to
> do that is to parse it into multiple variables like that, or to have a
> struct for the decomposed form like SoupURI (and gvfs's internal GDecodedUri).

Dan, sorry for being unclear, my comment was meant mostly as a reaction to the above. What I was trying to root for was something similar to:

  GURI* g_uri_new (const gchar *uri, GError **error);
  const gchar* g_uri_get_scheme (GURI *uri);
  const gchar* g_uri_get_user (GURI *uri);
  ... etc ...

Focussing on API coherency I don't think it makes sense to regard "parsing and reassembling URIs" as completely disjoint from what else you might want to do on a URI.
Anyone working on this? Dan, what stops glib devs from merging this branch? https://github.com/danwinship/glib/commits/guri I found one more glib based library for uri parsing (maybe not as idiomatic as Dan's version) https://github.com/toffaletti/libguri
(In reply to comment #11)
> Anyone working on this?

I'm not currently actively working on this, and I don't know of anyone else who is.

> Dan, what stops glib devs from merging this branch?

AFAIR, that branch does not actually compile. And I don't claim that the API that's currently there is in any sense "right".

One thing that slowed me down is that it turns out it's actually really hard to make this fully generic. You either have to not automatically handle %-decoding for the user and then make them parse certain subfields themselves (which is lame) or else you need a zillion flags to indicate particular special parsing behaviors for different URI schemes (which is lame). Or keep track of both encoded and unencoded versions of each component so the caller can automatically get the decoded ones for "simple" fields but is still able to reparse the annoying fields themselves... or something...
I think we should use the Pareto principle [1] here and make GUri usable in 80% of cases. For the other 20% there should be some flags or additional method calls from developers. I believe our 80% consists of http(s)://, file://, ftp://, mailto: and maybe some others.

I've rebased your branch on top of current master and made it buildable. I've also added simple unit tests for the HTML5 parser.

I think we should define this 80% of cases and define a nice parser API via unit tests. As for me, this 80% of cases is enough to add GUri to GLib. What do you think, guys? Let's solve this bug.

[1]: http://en.wikipedia.org/wiki/Pareto_principle
Oh. And link to my branch: https://github.com/antono/glib/tree/guri2
hey, I would love to get that, along perhaps with simple ipv4/ipv6 parsing check
is someone still working on this?
Probably the most up-to-date version is here: https://github.com/chergert/mongo-glib/blob/master/cut-n-paste/guri.c But no one is preparing this for merge.
Also, Christian Hergert has some ideas: https://github.com/chergert/guri
Also the one from qemu, which inherits from libxml2 and libvirt: http://git.qemu.org/?p=qemu.git;a=blob;f=util/uri.c
Someone on another bug mentioned GUri and I realized I should probably dump my work-in-progress since I seem unlikely to ever finish it... It is now rebased and pushed to wip/danw/guri (on git.gnome.org). As compared to the earlier version, this has more extensive API, to support both "I want a GUri structure" use cases and "I just want to split it into other strings" use cases (and likewise the "I want to assemble a valid URI string from these pieces" case, which seems to be pretty common, and which is not supported well by the older API, or SoupURI). It is, at least theoretically, working and ready to land (well, except that you should "git rm glib/guri-notes.txt" first). But it seemed like it wouldn't make sense to land it until someone had done test ports of some of the existing URI-using code in GNOME (eg, libsoup, gvfs, multiple places in evolution) to make sure it really is what we want, API-wise. (The libsoup and evolution uses involve public APIs, so porting them to use GUri is likely to be messy, since it actually has to map between GUri and their existing APIs. Porting gvfs ought to be a little cleaner...)
FWIW, we have something like this in GStreamer too now: http://cgit.freedesktop.org/gstreamer/gstreamer/tree/gst/gsturi.h#n191
Some of the differences between GstUri and GUri:
- guri has gerrors
- it seems guri implicitly normalizes, gsturi not
- gsturi can "join" a reference URI onto a base URI (vs only g_uri_parse_relative)
- gsturi allows to compare, copy and modify
- gsturi has more path and query functions

Sebastian, gst_uri_set_path() == gst_uri_set_path_string()

I'd consider adding more functions to copy and modify GUri. I'd leave compare out. In GstUri, I doubt the path and query manipulation functions are so useful, for example GHashTable API is enough for query.
more differences:
- it seems gsturi implicitly unescapes, guri not
- guri has more unescape functions (string, segment, bytes), and gsturi only relies on g_uri_unescape
(In reply to Marc-Andre Lureau from comment #22)
> - gsturi can "join" a reference URI onto a base URI (vs only
> g_uri_parse_relative)

A GUri always represents an absolute URI, so as it is now you couldn't have a version of g_uri_parse_relative() that took two GUris rather than a GUri and a string. Does that API actually get used in GstUri?

> - gsturi allows to compare, copy and modify

Comparing can't happen in a generic URI API, because comparison rules are scheme-specific. (default ports, default path, default parameters, case sensitivity, etc)

SoupURI is modifiable, although generally URIs only ever get modified as part of initially building them. Eg:

    port = soup_server_get_port (test_server);
    test_uri = soup_uri_new ("http://localhost");
    soup_uri_set_port (test_uri, port);

But even though no one ever *actually* modifies URIs after building them, we still end up having to make copies all the time, just in case someone did modify one. So it has always seemed to me that having immutable refcounted URIs would be better, memory-management-wise, as long as you also had enough good URI-building functions that you didn't need to build them in multiple steps. Eg:

    port = soup_server_get_port (test_server);
    test_uri = g_uri_build (G_URI_FLAGS_NONE,
                            "http", NULL, "localhost", port,
                            NULL, NULL, NULL);

Maybe not actually an improvement... I haven't tried writing much code with GUri, so maybe it would turn out that this idea was wrong.

(In reply to Marc-Andre Lureau from comment #23)
> - it seems gsturi implicitly unescapes, guri not

It depends on whether you pass G_URI_ENCODED in the flags. There are situations where unescaping will change the meaning of the URI, so it has to be avoided.

> - guri has more unescape functions (string, segment, bytes), and gsturi only
> relies on g_uri_unescape

The string and segment functions already exist in glib and were just moved to guri.h from gurifuncs.h. The bytes function is to address bug 620417.
Just a random comment: if I remember correctly in GNet it was always a bit painful to deal with URIs because while it provided functions to escape/unescape them, it was never clear what the 'current state' was, one would have to track that externally, and also it was/is not always clear what one may get as input from certain places (even if it should be of course).
I was about to make a similar comment as Tim: Dan, in the documentation, could you describe what GUri does implicitly wrt "normalize" and "unescape"? (normalize in gsturi also deals with path resolution for ex) It's a bit unfortunate if GUri and GstUri end up with different implicit rules, it's already confusing enough :)
(In reply to Marc-Andre Lureau from comment #26) > It's a bit unfortunate if GUri and GstUri end up with different implicit > rules, it's already confusing enough :) See my comment here :) https://bugzilla.gnome.org/show_bug.cgi?id=725221#c28 We didn't have GstUri in a public release yet, so can still change it in any way.
(In reply to Dan Winship from comment #24)
> (In reply to Marc-Andre Lureau from comment #22)
> > - gsturi allows to compare, copy and modify
>
> Comparing can't happen in a generic URI API, because comparison rules are
> scheme-specific. (default ports, default path, default parameters, case
> sensitivity, etc)

Perhaps have a basic 1-1 compare function with extra flags? This could be considered as a separate later bug imho.

> port = soup_server_get_port (test_server);
> test_uri = g_uri_build (G_URI_FLAGS_NONE,
>                         "http", NULL, "localhost", port,
>                         NULL, NULL, NULL);
>
> Maybe not actually an improvement... I haven't tried writing much code with
> GUri, so maybe it would turn out that this idea was wrong.

I agree with the rationale for immutable, however I would consider a function to build from an existing URI, similar to gst_uri_new_with_base(uri, scheme, userinfo, host, port...).

Then, it would probably be worth adding a function to build back a query string from a HashTable.
"normalization" in guri just means unescaping characters where it's guaranteed that the escaping is unnecessary. eg, "%41" can always be replaced with "A", regardless of the scheme. (But "%2F" can't always be replaced with "/", because that might change the meaning in some cases.) Not sure what you mean about path resolution. g_uri_parse_relative() / g_uri_resolve_relative() do the relative path handling stuff, but nothing else ever modifies path. I think the docs are pretty clear about when strings are and aren't %-encoded... Eg: * If @flags contains %G_URI_ENCODED, then `%`-encoded characters in * @uri_string will remain encoded in the output strings. (If not, * then all such characters will be decoded.)
(In reply to Marc-Andre Lureau from comment #28)
> I agree with the rationale for immutable, however I would consider a
> function to build from an existing URI, similar to
> gst_uri_new_with_base(uri, scheme, userinfo, host, port...).

That seems entirely plausible.

> Then, it would probably be worth adding a function to build back a query
> string from a HashTable.

Yes. Although in libsoup I ended up adding a GData**->query-string function too (https://developer.gnome.org/libsoup/stable/libsoup-2.4-HTML-Form-Support.html#soup-form-encode-datalist) because some web APIs care about the order the parameters get serialized in.
(In reply to Dan Winship from comment #29)
> "normalization" in guri just means unescaping characters where it's
> guaranteed that the escaping is unnecessary. eg, "%41" can always be
> replaced with "A", regardless of the scheme. (But "%2F" can't always be
> replaced with "/", because that might change the meaning in some cases.)
>
> Not sure what you mean about path resolution. g_uri_parse_relative() /
> g_uri_resolve_relative() do the relative path handling stuff, but nothing
> else ever modifies path.

Ok, why not "normalize paths" too (implicitly or not), that is remove unnecessary "." and ".." ?

> I think the docs are pretty clear about when strings are and aren't
> %-encoded... Eg:
>
> * If @flags contains %G_URI_ENCODED, then `%`-encoded characters in
> * @uri_string will remain encoded in the output strings. (If not,
> * then all such characters will be decoded.)

Sorry, I was grepping for "unescape".. that's clear enough. thanks
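The "%41" versus "%2F" distinction quoted above can be sketched in a few lines of plain C. This assumes that "guaranteed unnecessary" means the unreserved set of RFC 3986 section 2.3 (letters, digits, "-", ".", "_", "~"); whether the guri branch uses exactly that set is not confirmed here, and uri_normalize_escapes() is a hypothetical name, not the function from the branch.

```c
#include <assert.h>
#include <string.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int
hex_value (char c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

/* RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~" */
static int
is_unreserved (int c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= '0' && c <= '9') ||
         c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper: decodes, in place, only those %XX escapes that
 * stand for an unreserved character. Everything else, including "%2F"
 * and malformed escapes, is copied through unchanged. */
void
uri_normalize_escapes (char *s)
{
  char *in, *out;

  assert (s != NULL);

  in = strchr (s, '%');
  if (in == NULL)
    return;                     /* nothing to do */
  out = in;

  while (*in != '\0')
    {
      if (*in == '%')
        {
          /* in[2] is only read when in[1] was a hex digit, so the
           * terminating '\0' is never stepped over. */
          int hi = hex_value (in[1]);
          int lo = (hi >= 0) ? hex_value (in[2]) : -1;

          if (lo >= 0 && is_unreserved (hi * 16 + lo))
            {
              *out++ = (char) (hi * 16 + lo);
              in += 3;
              continue;
            }
        }
      *out++ = *in++;
    }
  *out = '\0';
}
```

Because the output is never longer than the input, the rewrite can happen in the caller's buffer. Deciding whether the remaining escapes may be decoded is left to scheme-specific code.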
(In reply to Tim-Philipp Müller from comment #25) > Just a random comment: if I remember correctly in GNet it was always a bit > painful to deal with URIs because while it provided functions to > escape/unescape them, it was never clear what the 'current state' was, one > would have to track that externally, and also it was/is not always clear > what one may get as input from certain places (even if it should be of > course). It seems GUri could also use a g_uri_get_flags() to check encoding status, so you can have preconditions on !G_URI_ENCODED for ex.
(In reply to Marc-Andre Lureau from comment #31)
> Ok, why not "normalize paths" too (implicitly or not), that is remove
> unnecessary "." and ".." ?

RFC 3986 only says that this should be done as part of the process of resolving a relative URI against a base URI. Maybe it's implied that you can/should do this when parsing as well? What do other URL libraries do?
(In reply to Dan Winship from comment #33)
> (In reply to Marc-Andre Lureau from comment #31)
> > Ok, why not "normalize paths" too (implicitly or not), that is remove
> > unnecessary "." and ".." ?
>
> RFC 3986 only says that this should be done as part of the process of
> resolving a relative URI against a base URI. Maybe it's implied that you
> can/should do this when parsing as well? What do other URL libraries do?

Using repl.it, I checked:

  node url.parse: keep path
  python urlparse: keep path
  go net/url: keep path
  java net URL: keep path
  ruby uri: keep path

Btw, I just found https://url.spec.whatwg.org/ which seems to be a more recent attempt to standardize URLs. It would be worth checking how it aligns with this API.
The WHATWG spec is specifically about URLs in a web context (and is referenced from the HTML5 spec, IIRC). I had thought about having a GUriFlags value to specify using that spec rather than RFC 3986, but never implemented it. There has also been talk about revising/updating 3986 in the IETF, but I'm not sure if that actually started yet.
valgrind complains about:

/uri/parsing/relative: ==4258== Invalid read of size 1
==4258==    at 0x4F4B293: remove_dot_segments (guri.c:1041)

Trivial fix with:

+++ b/glib/guri.c
@@ -1037,6 +1037,9 @@ remove_dot_segments (gchar *path)
 {
   gchar *p, *q;
 
+  if (!*path)
+    return;
+
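For reference, the dot-segment removal described in RFC 3986 section 5.2.4 can be sketched on its own in plain C. This is an independent illustration, not the remove_dot_segments() from guri.c, and the _sketch suffix marks the name as hypothetical. In this version an empty path needs no special guard, because the loop condition already covers it.

```c
#include <assert.h>
#include <string.h>

/* In-place sketch of RFC 3986 section 5.2.4. `in` walks the input and
 * `out` writes the result into the same buffer; `out` never gets ahead
 * of `in`. The step labels refer to the RFC's list. */
void
remove_dot_segments_sketch (char *path)
{
  char *in = path, *out = path;

  assert (path != NULL);

  while (*in != '\0')
    {
      if (strncmp (in, "../", 3) == 0)                  /* 2A */
        in += 3;
      else if (strncmp (in, "./", 2) == 0)              /* 2A */
        in += 2;
      else if (strncmp (in, "/./", 3) == 0)             /* 2B */
        in += 2;
      else if (strcmp (in, "/.") == 0)                  /* 2B */
        *++in = '/';            /* replace trailing "/." with "/" */
      else if (strncmp (in, "/../", 4) == 0 ||
               strcmp (in, "/..") == 0)                 /* 2C */
        {
          /* Replace the prefix with "/" in the input... */
          in += (in[3] == '/') ? 3 : 2;
          *in = '/';
          /* ...and drop the last output segment with its leading "/". */
          while (out > path && *--out != '/')
            ;
        }
      else if (strcmp (in, ".") == 0 ||
               strcmp (in, "..") == 0)                  /* 2D */
        in += strlen (in);
      else                                              /* 2E */
        {
          /* Copy one segment, including its leading "/" if present. */
          do
            *out++ = *in++;
          while (*in != '\0' && *in != '/');
        }
    }

  *out = '\0';
}
```

The two worked examples from the RFC behave as expected with this sketch: "/a/b/c/./../../g" becomes "/a/g" and "mid/content=5/../6" becomes "mid/6".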
*** Bug 550110 has been marked as a duplicate of this bug. ***
Created attachment 300470
guri: new URI parsing and generating functions

Add a set of new URI parsing and generating functions, including a new parsed-URI type GUri. Move all the code from gurifuncs.c into guri.c, reimplementing some of those functions (and g_string_append_uri_encoded()) in terms of the new code.
I just attached an updated version of Dan's GUri for easy review. I fixed a few things:
- added tests, coverage at 98%
- fixed the bug mentioned above
- fixed some misc bugs found during testing
- small leak in tests
- added preconditions
- added g_uri_get_flags()
- added autoptr and boxed type
- renamed GUriFlags G_URI_FLAGS_..
- added _NONE for 0 flags
- updated to 2.46 macros

I have a wip patch for spice-gtk and I am planning to look at gvfs during the weekend.
Created attachment 300472 [details] [review] guri: new URI parsing and generating functions Add a set of new URI parsing and generating functions, including a new parsed-URI type GUri. Move all the code from gurifuncs.c into guri.c, reimplementing some of those functions (and g_string_append_uri_encoded()) in terms of the new code.
Created attachment 300473 [details] [review] guri: new URI parsing and generating functions Add a set of new URI parsing and generating functions, including a new parsed-URI type GUri. Move all the code from gurifuncs.c into guri.c, reimplementing some of those functions (and g_string_append_uri_encoded()) in terms of the new code.
Created attachment 300605 [details] [review] guri: new URI parsing and generating functions Add a set of new URI parsing and generating functions, including a new parsed-URI type GUri. Move all the code from gurifuncs.c into guri.c, reimplementing some of those functions (and g_string_append_uri_encoded()) in terms of the new code.
ping just wanted to give interesting figures from rust cargo (https://crates.io/crates?sort=downloads): ~ 500k downloads of libc (#1) ~ 123k downloads of url (#15) It certainly says something about how useful URL parsing is to devs.
ping? Can we imagine landing GUri next cycle? or what is left?
(In reply to Marc-Andre Lureau from comment #44) > ping? Can we imagine landing GUri next cycle? or what is left? As far as I know, no one has tried porting much existing code to use this API, so we don't really have any idea if it's well-designed for the various use cases or not.
Is this too late for the 3.21 cycle now?
One thing that'd make me more confident in this is copying it into libsoup, and rebasing libsoup's URI parsing on it. Does that make sense?
FWIW, wip/danw/guri in libsoup replaces libsoup's URI-parsing/stringifying code with calls to GUri instead, but I'm not sure that should really make you more confident, since I obviously had libsoup in mind when I wrote this code :). A better test would be whether it can replace *other people's* URI-handling code.
Some patches for GVfs are already proposed, see Bug 746993.
Review of attachment 300605:

Some minor comments:
- we need to update the Since tags
- we need to remove guri-notes

See also the minor comments inline

::: glib/guri.c
@@ +310,3 @@
+  else
+    g_free (decoded);
+  return d - (guchar *)decoded;

you free decoded and then you use it? seems bad

@@ +795,3 @@
+  return TRUE;
+
+ fail:

if you use g_clear_pointer you don't need the ifs

@@ +1246,3 @@
+
+ fail:
+  if (uri)

just do g_clear_pointer (&uri, g_uri_unref) ?

@@ +1737,3 @@
+                        hide_fragment ? NULL : uri->fragment);
+    }
+  else

if you are returning inside the if you don't really need an else block
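The first inline comment is about using a pointer in arithmetic after it has been freed. One common way around that is to take the length while the buffer is still alive. The sketch below shows the pattern in plain C with a made-up stand-in function (copy_without_spaces() is hypothetical and unrelated to the patch's decoding logic); only the ordering of the length computation and the free is the point.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the reviewed function: fills a heap buffer,
 * optionally hands it to the caller through `out`, and returns the
 * number of bytes written (or -1 if allocation fails). */
long
copy_without_spaces (const char *in, char **out)
{
  char *buf, *d;
  long len;

  assert (in != NULL);

  buf = malloc (strlen (in) + 1);
  if (buf == NULL)
    return -1;

  for (d = buf; *in != '\0'; in++)
    if (*in != ' ')
      *d++ = *in;
  *d = '\0';

  /* Compute the length while `buf` is still a valid pointer... */
  len = (long) (d - buf);

  if (out != NULL)
    *out = buf;                 /* ownership moves to the caller */
  else
    free (buf);                 /* ...so nothing touches it after this */

  return len;
}
```

In GLib code the same ordering applies with g_free(), and the later review comments about g_clear_pointer() address the separate cleanup paths under the fail: labels.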
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/110.