GNOME Bugzilla – Bug 89548
Include a UTF-8 safe escaping function
Last modified: 2018-05-23 23:12:23 UTC
g_strescape() is useless since it mangles UTF-8. We should include a less aggressive escaping function. Something like pango/pango/querymodules.c:escape_string().
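To illustrate the complaint, here is a minimal sketch in plain C (no GLib) of byte-wise octal escaping over the range 0x7F-0xFF. The helper name `escape_bytes_octal` is hypothetical, and it is a simplified stand-in for this one aspect of g_strescape(), not a copy of its implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of GLib: escapes every byte outside
 * 0x20..0x7E as a backslash followed by three octal digits.
 * dst must hold at least 4 * strlen (src) + 1 bytes. */
static void
escape_bytes_octal (const char *src, char *dst)
{
  const unsigned char *p = (const unsigned char *) src;

  assert (src != NULL && dst != NULL);

  for (; *p != '\0'; p++)
    {
      if (*p < 0x20 || *p >= 0x7F)
        dst += sprintf (dst, "\\%03o", (unsigned int) *p);
      else
        *dst++ = (char) *p;
    }
  *dst = '\0';
}
```

Given the UTF-8 bytes for "café" (`"caf\xc3\xa9"`), this yields `caf\303\251`: the two bytes of "é" are escaped separately, so the output no longer contains the character itself.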
Created attachment 156276
Implement g_utf8_strescape

The implementation is based on code in Pango, but covers the same characters as g_strescape.
(In reply to comment #1)
> Created an attachment (id=156276)
> Implement g_utf8_strescape
>
> The implementation is based on code in Pango, but covers the same characters as
> g_strescape.

Your patch fixed bug #692746 and works for my friend's project. It really needs to be added to the master branch.
Created attachment 244610
Updated patch for one of the components
(In reply to comment #1)
> Created an attachment (id=156276)
> Implement g_utf8_strescape
>
> The implementation is based on code in Pango, but covers the same characters as
> g_strescape.

>diff --git a/docs/reference/glib/tmpl/string_utils.sgml b/docs/reference/glib/tmpl/string_utils.sgml
>index faaec7d..76cac8a 100644
>--- a/docs/reference/glib/tmpl/string_utils.sgml
>+++ b/docs/reference/glib/tmpl/string_utils.sgml
>@@ -775,6 +775,7 @@ them. Additionally all characters in the range 0x01-0x1F (everything
> below SPACE) and in the range 0x7F-0xFF (all non-ASCII chars) are
> replaced with a '\' followed by their octal representation. Characters
> supplied in @exceptions are not escaped.
>+See g_utf8_strescape() for a function that doesn't mangle UTF-8 characters.
> </para>
>
> <para>

This needs updating.

>diff --git a/glib/glib.symbols b/glib/glib.symbols
>index ee9da31..b959049 100644
>--- a/glib/glib.symbols
>+++ b/glib/glib.symbols
>@@ -1506,6 +1506,7 @@ g_utf8_offset_to_pointer
> g_utf8_pointer_to_offset
> g_utf8_prev_char
> g_utf8_strchr
>+g_utf8_strescape
> g_utf8_strlen
> g_utf8_strncpy
> g_utf8_strrchr

This needs updating: the file was not found.

>diff --git a/glib/gunicode.h b/glib/gunicode.h
>index 78b259e..1e5bb3c 100644
>--- a/glib/gunicode.h
>+++ b/glib/gunicode.h
>@@ -297,6 +297,10 @@ gchar* g_utf8_strncpy (gchar *dest,
> const gchar *src,
> gsize n);
>
>+gchar *
>+g_utf8_strescape (const gchar *source,
>+ const gchar *exceptions);
>+
> /* Find the UTF-8 character corresponding to ch, in string p. These
> functions are equivalants to strchr and strrchr */
> gchar* g_utf8_strchr (const gchar *p,

This needs updating.
Created attachment 244612
Implement g_utf8_strescape. Incomplete; I need help from a developer, e.g. Christian Dywan.
Created attachment 244615
Implement g_utf8_strescape. Updated, but still incomplete; I need help from a developer, e.g. Christian Dywan.
Created attachment 244623
Implement g_utf8_strescape
Please review.
Created attachment 244625
Implement g_utf8_strescape (update since: 2.38)
The implementation is based on code in Pango, but covers the same characters as g_strescape.
Created attachment 244655
Implement g_utf8_strescape (add reference docs)
The implementation is based on code in Pango, but covers the same characters as g_strescape.
Created attachment 244700
Implement g_utf8_strescape.

Patch for review.
* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function to glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c

The implementation is based on code in Pango, but covers the same characters as g_strescape.
Created attachment 244890
suggested function code

Not sure what the exact goal is. Input is UTF-8 and output is UTF-8 with the usual (as in g_strescape) characters escaped, except for exceptions? Is g_strcompress expected to invert the compression?

The attached is closely based on g_strescape but with obvious porting to UTF-8, and actually respects the exceptions argument.

Not tested, of course :)
(In reply to comment #11)
> Created an attachment (id=244890)
> suggested function code
>
> Not sure what the exact goal is. Input is UTF-8 and output is UTF-8 with the
> usual (as in g_strescape) characters escaped, except for exceptions? Is
> g_strcompress expected to invert the compression?
>
> The attached is closely based on g_strescape but with obvious porting to UTF-8,
> and actually respects the exceptions argument.
>
> Not tested, of course :)

> case '"':
>   g_string_append (result, "\\\"");
>   break;

Why did you write this?
Created attachment 244895
Test-compile Peter's patch in a small program

Peter, I tested your patch in a small program.
1. You forgot to declare p. From the original g_strescape:
     const guchar *p;
2. Sorry, I had overlooked something in my previous comment.
Created attachment 244910
Implement g_utf8_strescape, version 2.

The implementation is based on code in Pango, but covers the same characters as g_strescape. The g_utf8_strescape function is based on Peter Bloomfield's patch, with some changes to make it work.

Patch for review.
* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function to glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c
Hi Igor,

On reflection, I believe the "while" loop should be something like:

  while (*p)
    {
      gunichar c = g_utf8_get_char (p);
      ...
      p = g_utf8_next_char (p);
    }

because g_utf8_get_char may never return '\0'. Because the source is valid UTF-8, p will ultimately point to the trailing null byte.
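The loop shape suggested above can be sketched without GLib as follows. `utf8_decode` and `count_code_points` are hypothetical names; `utf8_decode` is only a rough stand-in for the g_utf8_get_char()/g_utf8_next_char() pair and, like them, assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for g_utf8_get_char() + g_utf8_next_char():
 * decodes one code point from valid UTF-8 and returns its byte length. */
static size_t
utf8_decode (const char *s, uint32_t *cp)
{
  const unsigned char *p = (const unsigned char *) s;

  if (p[0] < 0x80)
    {
      *cp = p[0];
      return 1;
    }
  if ((p[0] & 0xE0) == 0xC0)
    {
      *cp = ((uint32_t) (p[0] & 0x1F) << 6) | (p[1] & 0x3F);
      return 2;
    }
  if ((p[0] & 0xF0) == 0xE0)
    {
      *cp = ((uint32_t) (p[0] & 0x0F) << 12)
          | ((uint32_t) (p[1] & 0x3F) << 6)
          | (p[2] & 0x3F);
      return 3;
    }
  *cp = ((uint32_t) (p[0] & 0x07) << 18)
      | ((uint32_t) (p[1] & 0x3F) << 12)
      | ((uint32_t) (p[2] & 0x3F) << 6)
      | (p[3] & 0x3F);
  return 4;
}

/* Walks the string one code point at a time.  The loop condition tests
 * the byte under the pointer, not the decoded character. */
static size_t
count_code_points (const char *s)
{
  size_t n = 0;

  assert (s != NULL);

  while (*s != '\0')
    {
      uint32_t c;

      s += utf8_decode (s, &c);
      n++;
    }
  return n;
}
```

Because every multi-byte sequence in valid UTF-8 is consumed whole, the pointer only ever stops on a lead byte or on the terminating null byte, which is what ends the loop.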
(In reply to comment #15)
> Hi Igor,
>
> On reflection, I believe the "while" loop should be something like:
>
> while (*p)
>   {
>     gunichar c = g_utf8_get_char (p);
>     ...
>     p = g_utf8_next_char (p);
>   }
>
> because g_utf8_get_char may never return '\0'. Because the source is valid
> UTF-8, p will ultimately point to the trailing null byte.

Peter, see my latest attachment please.
Created attachment 244931
Implement g_utf8_strescape, version 3.

The implementation is based on code in Pango, but covers the same characters as g_strescape. The g_utf8_strescape function is based on Peter Bloomfield's patch, with some changes to make it work.

Patch for review.
* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function to glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c

Changes:
v1:
* Based on Christian Dywan's patch, updated for the current files and directories
v2:
* Function code based on Peter Bloomfield's patch, with some changes:
    const gchar *p;
    p = (gchar *) utf8_source;
v3:
* Loop in the function updated:
    while (*p)
      {
        c = g_utf8_get_char (p);
        ...
        p = g_utf8_next_char (p);
      }
Created attachment 244978
suggested function code including octal escapes for c < ' '

So the next question is whether to octal-escape the other control characters (c < ' '), as in g_strescape. As I recall, it prevented some brain-dead printers from self-destructing, but whether it's still needed anywhere is unclear. Anyway, here's a version that includes it (with some other minor simplifications).
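The attachment itself is not reproduced in this report. As a rough illustration of the behaviour being discussed (named escapes as in g_strescape, octal escapes for c < ' ', an exceptions argument that is honoured, and non-ASCII bytes left alone), here is a plain C sketch. The name `escape_keep_utf8` and the caller-supplied output buffer are assumptions made for the example; this is not the attached code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch, not the attached patch.  Escapes \b \f \n \r \t \v,
 * backslash and double quote by name, other bytes below 0x20 in octal,
 * and copies everything else (including bytes >= 0x80) unchanged.
 * Bytes listed in exceptions are never escaped.
 * dst must hold at least 4 * strlen (src) + 1 bytes. */
static void
escape_keep_utf8 (const char *src, const char *exceptions, char *dst)
{
  const unsigned char *p;

  assert (src != NULL && dst != NULL);
  if (exceptions == NULL)
    exceptions = "";

  for (p = (const unsigned char *) src; *p != '\0'; p++)
    {
      char named = 0;

      if (strchr (exceptions, *p) != NULL)
        {
          *dst++ = (char) *p;
          continue;
        }

      switch (*p)
        {
        case '\b': named = 'b'; break;
        case '\f': named = 'f'; break;
        case '\n': named = 'n'; break;
        case '\r': named = 'r'; break;
        case '\t': named = 't'; break;
        case '\v': named = 'v'; break;
        case '\\': named = '\\'; break;
        case '"':  named = '"'; break;
        default:   break;
        }

      if (named != 0)
        {
          *dst++ = '\\';
          *dst++ = named;
        }
      else if (*p < 0x20)
        dst += sprintf (dst, "\\%03o", (unsigned int) *p);
      else
        *dst++ = (char) *p;
    }
  *dst = '\0';
}
```

A byte-wise loop is sufficient in this sketch because every escaped character is ASCII and every byte of a multi-byte UTF-8 sequence is 0x80 or above, so such sequences pass through intact.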
(In reply to comment #18)
> Created an attachment (id=244978)
> suggested function code including octal escapes for c < ' '
>
> So the next question is whether to octal-escape the other control characters (c
> < ' '), as in g_strescape. As I recall, it prevented some brain-dead printers
> from self-destructing, but whether it's still needed anywhere is unclear.
> Anyway, here's a version that includes it (with some other minor
> simplifications).

Peter, the new function ignores \n.
(In reply to comment #18)
> Created an attachment (id=244978)
> suggested function code including octal escapes for c < ' '
>
> So the next question is whether to octal-escape the other control characters (c
> < ' '), as in g_strescape. As I recall, it prevented some brain-dead printers
> from self-destructing, but whether it's still needed anywhere is unclear.
> Anyway, here's a version that includes it (with some other minor
> simplifications).

Peter, I'm sorry. My test program was passing "\n" as an exception and I didn't notice.
Created attachment 244982
Implement g_utf8_strescape, version 4.

Patch for review.
* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function to glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c

Changes:
v1:
* Based on Christian Dywan's patch, updated for the current files and directories
v2:
* Function code based on Peter Bloomfield's patch, with some changes:
    const gchar *p;
    p = (gchar *) utf8_source;
v3:
* Loop in the function updated:
    while (*p)
      {
        c = g_utf8_get_char (p);
        ...
        p = g_utf8_next_char (p);
      }
v4:
* Updated to Peter Bloomfield's newer function (includes octal escapes for c < ' ')
If we're escaping unicode strings and producing unicode strings then we should do it in exactly the way that GVariant does it for its text format (and then I could reuse this function there).

Specifically, see this fragment:

  while (*str)
    {
      gunichar c = g_utf8_get_char (str);

      if (c == quote || c == '\\')
        g_string_append_c (string, '\\');

      if (g_unichar_isprint (c))
        g_string_append_unichar (string, c);
      else
        {
          g_string_append_c (string, '\\');
          if (c < 0x10000)
            switch (c)
              {
              case '\a':
                g_string_append_c (string, 'a');
                break;
              case '\b':
                g_string_append_c (string, 'b');
                break;
              case '\f':
                g_string_append_c (string, 'f');
                break;
              case '\n':
                g_string_append_c (string, 'n');
                break;
              case '\r':
                g_string_append_c (string, 'r');
                break;
              case '\t':
                g_string_append_c (string, 't');
                break;
              case '\v':
                g_string_append_c (string, 'v');
                break;
              default:
                g_string_append_printf (string, "u%04x", c);
                break;
              }
          else
            g_string_append_printf (string, "U%08x", c);
        }

      str = g_utf8_next_char (str);
    }

i.e. we use \unnnn and \Unnnnnnnn style escapes instead of octals and we do it for all non-printable characters (not just the ones in the traditional ascii control range).
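For readers who want to try this escaping scheme outside GLib, the following is a self-contained sketch in plain C. All names are hypothetical. In particular, `is_printable` is a crude assumption that treats only the C0 controls, DEL and the C1 controls as non-printable; the real g_unichar_isprint() consults Unicode character data and will differ for many code points. Valid UTF-8 input is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Byte length of the sequence starting with lead byte b (valid UTF-8 assumed). */
static size_t
utf8_len (unsigned char b)
{
  return b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2 : (b & 0xF0) == 0xE0 ? 3 : 4;
}

/* Decodes the code point of a sequence of the given length. */
static uint32_t
utf8_get (const unsigned char *p, size_t len)
{
  static const unsigned char mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
  uint32_t c = p[0] & mask[len];
  size_t i;

  for (i = 1; i < len; i++)
    c = (c << 6) | (p[i] & 0x3F);
  return c;
}

/* ASSUMPTION: crude replacement for g_unichar_isprint(). */
static int
is_printable (uint32_t c)
{
  return !(c < 0x20 || (c >= 0x7F && c <= 0x9F));
}

/* Hypothetical sketch of the scheme quoted above.
 * dst must hold at least 6 * strlen (src) + 1 bytes. */
static void
escape_utf8 (const char *src, char quote, char *dst)
{
  const unsigned char *p = (const unsigned char *) src;

  assert (src != NULL && dst != NULL);

  while (*p != '\0')
    {
      size_t len = utf8_len (*p);
      uint32_t c = utf8_get (p, len);

      if (c == (unsigned char) quote || c == '\\')
        *dst++ = '\\';

      if (is_printable (c))
        {
          memcpy (dst, p, len);   /* copy the original bytes unchanged */
          dst += len;
        }
      else
        {
          *dst++ = '\\';
          switch (c)
            {
            case '\a': *dst++ = 'a'; break;
            case '\b': *dst++ = 'b'; break;
            case '\f': *dst++ = 'f'; break;
            case '\n': *dst++ = 'n'; break;
            case '\r': *dst++ = 'r'; break;
            case '\t': *dst++ = 't'; break;
            case '\v': *dst++ = 'v'; break;
            default:
              if (c < 0x10000)
                dst += sprintf (dst, "u%04x", (unsigned int) c);
              else
                dst += sprintf (dst, "U%08x", (unsigned int) c);
              break;
            }
        }
      p += len;
    }
  *dst = '\0';
}
```

With the simplified `is_printable` used here the `U%08x` branch is never reached; it is kept only so the structure mirrors the fragment above.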
(In reply to comment #22)
> If we're escaping unicode strings and producing unicode strings then we should
> do it in exactly the way that GVariant does it for its text format (and then I
> could reuse this function there).
>
> Specifically, see this fragment:

> if (c == quote || c == '\\')

I don't understand what 'quote' is here...
Sorry. 'quote' is defined from a bit further up to be either «"» or «'», depending on what type of quote is being used by the printer (to avoid putting this type of quote inside of the string, while still allowing the other).
Owen: what was your original need for this function?
Conversation from IRC:
desrt: ignatenkobrain: in essence, you're creating a function that is trying to escape unicode strings
desrt: but you're doing it in a very ascii-centric way
ignatenkobrain: desrt: can I initialize quote as " ?
desrt: ignatenkobrain: i think it might be helpful to allow the user to provide you with a string of extra characters to escape or not to escape
desrt: sort of like how g_uri_escape_string() takes a 'reserved_chars_allowed' argument
desrt: because if i plan to use "quotes" then it would be perfectly okay to have ' unescaped
desrt: but if i plan to use 'quotes' then ' needs to be escaped but " can be left alone
desrt: iirc GVariant does a fairly simple thing here: if the string contains a «'» then it uses «"», otherwise it defaults to «'»

I'm not a developer and I need help.
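The quote-selection rule recalled at the end of that conversation is small enough to sketch directly. `pick_quote` is a hypothetical name, and the rule is taken from the recollection above rather than checked against the GVariant source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: use a double quote if the string contains a
 * single quote, otherwise default to a single quote. */
static char
pick_quote (const char *s)
{
  assert (s != NULL);

  return strchr (s, '\'') != NULL ? '"' : '\'';
}
```

The result would then be the only quote character the escaping function needs to treat specially, so the other kind of quote can stay unescaped.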
(In reply to comment #24)
> Sorry. 'quote' is defined from a bit further up to be either «"» or «'»,
> depending on what type of quote is being used by the printer (to avoid putting
> this type of quote inside of the string, while still allowing the other).

Do you mean I should use a similar construction? I don't understand which functions need to be involved.

gchar *
g_utf8_strescape (const gchar *source,
                  const gchar *exceptions,
                  const gchar *quotes)
{
  ...
  quote = g_variant_new_string (quotes);
  g_variant_get_variant (quote);
  ...
}
(In reply to comment #25)
> Owen: what was your original need for this function?

2002-07-31 16:52:42 UTC was almost 11 years ago! It could be that I noticed that g_strescape() was broken when I tried to use it in Pango (in the place referenced above), and then filed this bug.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/4.