Bug 89548 – Include a UTF-8 safe escaping function



Summary:	Include a UTF-8 safe escaping function


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	i18n
Version:	2.0.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:	691085

Reported:	2002-07-31 16:52 UTC by Owen Taylor
Modified:	2018-05-23 23:12 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Attachments
Implement g_utf8_strescape (3.05 KB, patch) 2010-03-16 16:01 UTC, Christian Dywan
Update patch for one of component (1.71 KB, patch) 2013-05-18 12:37 UTC, Igor Gnatenko
Implement g_utf8_strescape. Not fully. (2.25 KB, patch) 2013-05-18 13:03 UTC, Igor Gnatenko
Implement g_utf8_strescape. Updated. Not fully. (2.72 KB, patch) 2013-05-18 13:50 UTC, Igor Gnatenko
Implement g_utf8_strescape (2.74 KB, patch) 2013-05-18 15:22 UTC, Igor Gnatenko
Implement g_utf8_strescape (update since: 2.38) (2.74 KB, patch) 2013-05-18 15:45 UTC, Igor Gnatenko
Implement g_utf8_strescape (add reference docs) (3.11 KB, patch) 2013-05-18 19:52 UTC, Igor Gnatenko
Implement g_utf8_strescape. (3.11 KB, patch) 2013-05-19 10:44 UTC, Igor Gnatenko
suggested function code (1.29 KB, text/plain) 2013-05-21 01:00 UTC, Peter Bloomfield
Test compile Peter patch in small program (1.61 KB, text/plain) 2013-05-21 04:37 UTC, Igor Gnatenko
Implement g_utf8_strescape. version 2. (3.55 KB, patch) 2013-05-21 12:16 UTC, Igor Gnatenko
Implement g_utf8_strescape. version 3. (3.55 KB, patch) 2013-05-21 14:56 UTC, Igor Gnatenko
suggested function code including octal escapes for c < ' ' (1.42 KB, text/plain) 2013-05-21 17:47 UTC, Peter Bloomfield
Implement g_utf8_strescape. version 4. (3.69 KB, patch) 2013-05-21 18:37 UTC, Igor Gnatenko

Description Owen Taylor 2002-07-31 16:52:42 UTC

g_strescape() is useless since it mangles UTF-8. We should
include a less aggressive escaping function. Something like
pango/pango/querymodules.c:escape_string().
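
To make the complaint concrete, here is a minimal standalone sketch of byte-wise octal escaping, the rule the g_strescape() documentation quoted later in this report describes for bytes below SPACE and in the range 0x7F-0xFF. It does not use GLib, escape_bytewise is a hypothetical name, and the named escapes such as \n are deliberately left out, so treat it as an illustration rather than the GLib implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not GLib API: replaces every byte below SPACE or
 * at/above 0x7F with a backslash and three octal digits, one byte at a
 * time. The caller frees the result. */
char *
escape_bytewise (const char *source)
{
  assert (source != NULL);

  /* Worst case: each input byte becomes four output bytes. */
  char *dest = malloc (strlen (source) * 4 + 1);
  char *q = dest;

  if (dest == NULL)
    return NULL;

  for (const unsigned char *p = (const unsigned char *) source; *p; p++)
    {
      if (*p < 0x20 || *p >= 0x7F)
        q += sprintf (q, "\\%03o", (unsigned int) *p);
      else
        *q++ = (char) *p;
    }

  *q = '\0';
  return dest;
}
```

With this rule the single character U+00E9 (bytes 0xC3 0xA9) comes out as the eight ASCII characters \303\251, so the output no longer contains the original UTF-8 character.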

Comment 1 Christian Dywan 2010-03-16 16:01:29 UTC

Created attachment 156276 [details] [review]
Implement g_utf8_strescape

The implementation is based on code in Pango, but covers the same characters as g_strescape.

Comment 2 Igor Gnatenko 2013-05-18 11:56:09 UTC

(In reply to comment #1)
> Created an attachment (id=156276) [details] [review]
> Implement g_utf8_strescape
> 
> The implementation is based on code in Pango, but covers the same characters as
> g_strescape.
Your patch fixes bug #692746 and works for my friend's project. It really needs to be added to the master branch.

Comment 3 Igor Gnatenko 2013-05-18 12:37:53 UTC

Created attachment 244610 [details] [review]
Update patch for one of component

Comment 4 Igor Gnatenko 2013-05-18 12:45:45 UTC

(In reply to comment #1)
> Created an attachment (id=156276) [details] [review]
> Implement g_utf8_strescape
> 
> The implementation is based on code in Pango, but covers the same characters as
> g_strescape.
>diff --git a/docs/reference/glib/tmpl/string_utils.sgml b/docs/reference/glib/tmpl/string_utils.sgml
>index faaec7d..76cac8a 100644
>--- a/docs/reference/glib/tmpl/string_utils.sgml
>+++ b/docs/reference/glib/tmpl/string_utils.sgml
>@@ -775,6 +775,7 @@ them. Additionally all characters in the range 0x01-0x1F (everything
> below SPACE) and in the range 0x7F-0xFF (all non-ASCII chars) are
> replaced with a '\' followed by their octal representation. Characters
> supplied in @exceptions are not escaped.
>+See g_utf8_strescape() for a function that doesn't mangle UTF-8 characters.
> </para>
> 
> <para>
Needs updating.
>diff --git a/glib/glib.symbols b/glib/glib.symbols
>index ee9da31..b959049 100644
>--- a/glib/glib.symbols
>+++ b/glib/glib.symbols
>@@ -1506,6 +1506,7 @@ g_utf8_offset_to_pointer
> g_utf8_pointer_to_offset
> g_utf8_prev_char
> g_utf8_strchr
>+g_utf8_strescape
> g_utf8_strlen
> g_utf8_strncpy
> g_utf8_strrchr
Needs updating. File not found.
>diff --git a/glib/gunicode.h b/glib/gunicode.h
>index 78b259e..1e5bb3c 100644
>--- a/glib/gunicode.h
>+++ b/glib/gunicode.h
>@@ -297,6 +297,10 @@ gchar* g_utf8_strncpy (gchar       *dest,
> 		       const gchar *src,
> 		       gsize        n);
> 
>+gchar *
>+g_utf8_strescape      (const gchar *source,
>+                       const gchar *exceptions);
>+
> /* Find the UTF-8 character corresponding to ch, in string p. These
>    functions are equivalants to strchr and strrchr */
> gchar* g_utf8_strchr  (const gchar *p,
Needs updating.

Comment 5 Igor Gnatenko 2013-05-18 13:03:40 UTC

Created attachment 244612 [details] [review]
Implement g_utf8_strescape. Not fully.

I need help from a developer, e.g. Christian Dywan.

Comment 6 Igor Gnatenko 2013-05-18 13:50:46 UTC

Created attachment 244615 [details] [review]
Implement g_utf8_strescape. Updated. Not fully.

I need help from a developer, e.g. Christian Dywan.

Comment 7 Igor Gnatenko 2013-05-18 15:22:35 UTC

Created attachment 244623 [details] [review]
Implement g_utf8_strescape

Review please.

Comment 8 Igor Gnatenko 2013-05-18 15:45:39 UTC

Created attachment 244625 [details] [review]
Implement g_utf8_strescape (update since: 2.38)

The implementation is based on code in Pango, but covers the same characters as
g_strescape.

Comment 9 Igor Gnatenko 2013-05-18 19:52:05 UTC

Created attachment 244655 [details] [review]
Implement g_utf8_strescape (add reference docs)

The implementation is based on code in Pango, but covers the same characters as
g_strescape.

Comment 10 Igor Gnatenko 2013-05-19 10:44:29 UTC

Created attachment 244700 [details] [review]
Implement g_utf8_strescape.

Patch for review.

* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function in glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c

The implementation is based on code in Pango, but covers the same characters as
g_strescape.

Comment 11 Peter Bloomfield 2013-05-21 01:00:26 UTC

Created attachment 244890 [details]
suggested function code

Not sure what the exact goal is.  Input is UTF-8 and output is UTF-8 with the usual (as in g_strescape) characters escaped, except for exceptions?  Is g_strcompress expected to invert the compression?

The attached is closely based on g_strescape but with obvious porting to UTF-8, and actually respects the exceptions argument.

Not tested, of course :)

Comment 12 Igor Gnatenko 2013-05-21 04:20:16 UTC

(In reply to comment #11)
> Created an attachment (id=244890) [details]
> suggested function code
> 
> Not sure what the exact goal is.  Input is UTF-8 and output is UTF-8 with the
> usual (as in g_strescape) characters escaped, except for exceptions?  Is
> g_strcompress expected to invert the compression?
> 
> The attached is closely based on g_strescape but with obvious porting to UTF-8,
> and actually respects the exceptions argument.
> 
> Not tested, of course :)

>            case '"':
>              g_string_append (result, "\\\"");
>              break;

Why did you write this?

Comment 13 Igor Gnatenko 2013-05-21 04:37:51 UTC

Created attachment 244895 [details]
Test compile Peter patch in small program

Peter, I tested your patch in a small program.
1. You forgot to declare p.
From the original g_strescape:
const guchar *p;
2. Sorry, I missed that in my previous comment.

Comment 14 Igor Gnatenko 2013-05-21 12:16:50 UTC

Created attachment 244910 [details] [review]
Implement g_utf8_strescape. version 2.

Implement g_utf8_strescape.

The implementation is based on code in Pango, but covers the same characters as
g_strescape.
The g_utf8_strescape function is based on Peter Bloomfield's patch, with some changes to make it work.

Patch for review.

* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function in glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c

Comment 15 Peter Bloomfield 2013-05-21 13:06:26 UTC

Hi Igor,

On reflection, I believe the "while" loop should be something like:

  while (*p)
    {
      gunichar c = g_utf8_get_char (p);
...
      p = g_utf8_next_char (p);
    }

because g_utf8_get_char may never return '\0'.  Because the source is valid UTF-8, p will ultimately point to the trailing null byte.
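
The loop shape suggested above can be sketched without GLib as follows. The helper utf8_seq_len is a hypothetical stand-in for what g_utf8_next_char() does, count_code_points is likewise an invented name, and the sketch assumes the input is valid UTF-8, as the comment does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for g_utf8_next_char(): byte length of the
 * sequence that starts with lead byte c. Assumes valid UTF-8. */
static size_t
utf8_seq_len (unsigned char c)
{
  if (c < 0x80)
    return 1;
  if ((c & 0xE0) == 0xC0)
    return 2;
  if ((c & 0xF0) == 0xE0)
    return 3;
  return 4;
}

/* Walks the string one code point at a time and counts them. */
size_t
count_code_points (const char *source)
{
  assert (source != NULL);

  const unsigned char *p = (const unsigned char *) source;
  size_t n = 0;

  while (*p)                     /* test the byte, not the decoded character */
    {
      n++;
      p += utf8_seq_len (*p);    /* advance to the next lead byte */
    }

  return n;
}
```

Because the loop condition reads the byte at p, termination depends only on p reaching the trailing null byte, which valid UTF-8 input guarantees.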

Comment 16 Igor Gnatenko 2013-05-21 13:11:56 UTC

(In reply to comment #15)
> Hi Igor,
> 
> On reflection, I believe the "while" loop should be something like:
> 
>   while (*p)
>     {
>       gunichar c = g_utf8_get_char (p);
> ...
>       p = g_utf8_next_char (p);
>     }
> 
> because g_utf8_get_char may never return '\0'.  Because the source is valid
> UTF-8, p will ultimately point to the trailing null byte.
Peter, see my latest attachment please.

Comment 17 Igor Gnatenko 2013-05-21 14:56:34 UTC

Created attachment 244931 [details] [review]
Implement g_utf8_strescape. version 3.

Implement g_utf8_strescape.

The implementation is based on code in Pango, but covers the same characters as
g_strescape.
The g_utf8_strescape function is based on Peter Bloomfield's patch, with some
changes to make it work.

Patch for review.

* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function in glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c

Changes:

v1:
* Based on Christian Dywan's patch, with some changes (updated to the current files and directories)

v2:
* Function code based on Peter Bloomfield's patch, with some changes:
const gchar *p;
p = (gchar *) utf8_source;

v3:
* Loop in the function code updated:
while (*p)
  {
    c = g_utf8_get_char (p);
...
    p = g_utf8_next_char (p);
  }

Comment 18 Peter Bloomfield 2013-05-21 17:47:41 UTC

Created attachment 244978 [details]
suggested function code including octal escapes for c < ' '

So the next question is whether to octal-escape the other control characters (c < ' '), as in g_strescape.  As I recall, it prevented some brain-dead printers from self-destructing, but whether it's still needed anywhere is unclear.  Anyway, here's a version that includes it (with some other minor simplifications).
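
As a rough illustration of the option under discussion, the sketch below escapes control characters below SPACE as octal while leaving multi-byte UTF-8 sequences alone. It is not the attached code: escape_keep_utf8 is a hypothetical name, it uses plain C rather than GLib, and only two of the named escapes (\n and \t) are shown to keep it short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not the attached patch. Bytes >= 0x80 belong to
 * multi-byte UTF-8 sequences and are copied unchanged; ASCII characters
 * listed in exceptions are copied unchanged too. The caller frees the
 * result. */
char *
escape_keep_utf8 (const char *source, const char *exceptions)
{
  assert (source != NULL);

  if (exceptions == NULL)
    exceptions = "";

  /* Worst case: each input byte becomes a four-byte octal escape. */
  char *dest = malloc (strlen (source) * 4 + 1);
  char *q = dest;

  if (dest == NULL)
    return NULL;

  for (const unsigned char *p = (const unsigned char *) source; *p; p++)
    {
      unsigned char c = *p;

      if (c >= 0x80 || strchr (exceptions, c) != NULL)
        {
          *q++ = (char) c;
          continue;
        }

      switch (c)
        {
        case '\n': *q++ = '\\'; *q++ = 'n'; break;
        case '\t': *q++ = '\\'; *q++ = 't'; break;
        case '"':
        case '\\': *q++ = '\\'; *q++ = (char) c; break;
        default:
          if (c < ' ')
            q += sprintf (q, "\\%03o", (unsigned int) c);  /* octal escape */
          else
            *q++ = (char) c;
        }
    }

  *q = '\0';
  return dest;
}
```

Working byte by byte is safe here because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so none of them can be mistaken for an ASCII control character, quote, or backslash.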

Comment 19 Igor Gnatenko 2013-05-21 18:00:51 UTC

(In reply to comment #18)
> Created an attachment (id=244978) [details]
> suggested function code including octal escapes for c < ' '
> 
> So the next question is whether to octal-escape the other control characters (c
> < ' '), as in g_strescape.  As I recall, it prevented some brain-dead printers
> from self-destructing, but whether it's still needed anywhere is unclear. 
> Anyway, here's a version that includes it (with some other minor
> simplifications).

Peter, the new function ignores \n.

Comment 20 Igor Gnatenko 2013-05-21 18:25:29 UTC

(In reply to comment #18)
> Created an attachment (id=244978) [details]
> suggested function code including octal escapes for c < ' '
> 
> So the next question is whether to octal-escape the other control characters (c
> < ' '), as in g_strescape.  As I recall, it prevented some brain-dead printers
> from self-destructing, but whether it's still needed anywhere is unclear. 
> Anyway, here's a version that includes it (with some other minor
> simplifications).

Peter, I'm sorry. My test program passed "\n" as an exception and I didn't notice.

Comment 21 Igor Gnatenko 2013-05-21 18:37:14 UTC

Created attachment 244982 [details] [review]
Implement g_utf8_strescape. version 4.

Implement g_utf8_strescape.

Patch for review.

* Added an entry to docs/reference/glib/glib-sections.txt
* Added a note about the new function in glib/gstrfuncs.c
* Added the function declaration to glib/gunicode.h
* Added the function body to glib/gutf8.c

Changes:

v1:
* Based on Christian Dywan's patch, with some changes (updated to the current
files and directories)

v2:
* Function code based on Peter Bloomfield's patch, with some changes:
const gchar *p;
p = (gchar *) utf8_source;

v3:
* Loop in the function code updated:
while (*p)
  {
    c = g_utf8_get_char (p);
...
    p = g_utf8_next_char (p);
  }

v4:
* Updated to Peter Bloomfield's revised function (includes octal escapes for c < ' ')

Comment 22 Allison Karlitskaya (desrt) 2013-05-22 13:33:50 UTC

If we're escaping unicode strings and producing unicode strings then we should do it in exactly the way that GVariant does it for its text format (and then I could reuse this function there).

Specifically, see this fragment:

        while (*str)
          {
            gunichar c = g_utf8_get_char (str);

            if (c == quote || c == '\\')
              g_string_append_c (string, '\\');

            if (g_unichar_isprint (c))
              g_string_append_unichar (string, c);

            else
              {
                g_string_append_c (string, '\\');
                if (c < 0x10000)
                  switch (c)
                    {
                    case '\a':
                      g_string_append_c (string, 'a');
                      break;

                    case '\b':
                      g_string_append_c (string, 'b');
                      break;

                    case '\f':
                      g_string_append_c (string, 'f');
                      break;

                    case '\n':
                      g_string_append_c (string, 'n');
                      break;

                    case '\r':
                      g_string_append_c (string, 'r');
                      break;

                    case '\t':
                      g_string_append_c (string, 't');
                      break;

                    case '\v':
                      g_string_append_c (string, 'v');
                      break;

                    default:
                      g_string_append_printf (string, "u%04x", c);
                      break;
                    }
                 else
                   g_string_append_printf (string, "U%08x", c);
              }

            str = g_utf8_next_char (str);
          }


i.e. we use \unnnn and \Unnnnnnnn style escapes instead of octals, and we do it for all non-printable characters (not just the ones in the traditional ASCII control range).
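
The two escape widths in the fragment above can be isolated into a small standalone sketch. format_unicode_escape is a hypothetical helper written in plain C (snprintf instead of g_string_append_printf), and it covers only the default branch of the switch, not the named escapes such as \n.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the \uXXXX form for code points below
 * 0x10000 and the \UXXXXXXXX form otherwise, using the same format
 * widths as the fragment above. Returns what snprintf returns. */
int
format_unicode_escape (char *buf, size_t size, unsigned long c)
{
  assert (buf != NULL);

  if (c < 0x10000)
    return snprintf (buf, size, "\\u%04lx", c);

  return snprintf (buf, size, "\\U%08lx", c);
}
```

The fixed widths (four and eight hex digits) mean a reader of the escaped text never has to guess where an escape ends, which is not true of variable-length hex escapes.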

Comment 23 Igor Gnatenko 2013-05-22 14:19:22 UTC

(In reply to comment #22)
> If we're escaping unicode strings and producing unicode strings then we should
> do it in exactly the way that GVariant does it for its text format (and then I
> could reuse this function there).
> 
> Specifically, see this fragment:
>             if (c == quote || c == '\\')
I don't understand what quote is here...

Comment 24 Allison Karlitskaya (desrt) 2013-05-22 14:22:15 UTC

Sorry.  'quote' is defined from a bit further up to be either «"» or «'», depending on what type of quote is being used by the printer (to avoid putting this type of quote inside of the string, while still allowing the other).

Comment 25 Allison Karlitskaya (desrt) 2013-05-22 14:26:15 UTC

Owen: what was your original need for this function?

Comment 26 Igor Gnatenko 2013-05-22 17:13:43 UTC

Conversation from IRC:
desrt:  ignatenkobrain: in essence, you're creating a function that is trying to escape unicode strings
desrt:  but you're doing it in a very ascii-centric way
ignatenkobrain:  desrt: can I initialize quote as " ?
desrt:  ignatenkobrain: i think it might be helpful to allow the user to provide you with a string of extra characters to escape or not to escape
desrt:  sort of like how g_uri_escape_string() takes a 'reserved_chars_allowed' argument
desrt:  because if i plan to use "quotes" then it would be perfectly okay to have ' unescaped
desrt:  but if i plan to use 'quotes' then ' needs to be escaped but " can be left alone
desrt:  iirc GVariant does a fairly simple thing here: if the string contains a «'» then it uses «"», otherwise it defaults to «'»

I'm not a developer and I need help.
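
The quote-selection rule desrt recalls in the log above is small enough to sketch directly. choose_quote is a hypothetical helper in plain C, and it follows the rule only as described in the conversation (which is itself prefixed with "iirc"), not as verified against the GVariant source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: picks the delimiter as described in the IRC log.
 * If the string contains a single quote, use a double quote as the
 * delimiter; otherwise default to a single quote. */
char
choose_quote (const char *source)
{
  assert (source != NULL);

  return strchr (source, '\'') != NULL ? '"' : '\'';
}
```

The returned character would then play the role of quote in the test c == quote from comment 22, so only the delimiter actually in use gets a backslash.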

Comment 27 Igor Gnatenko 2013-05-23 10:18:40 UTC

(In reply to comment #24)
> Sorry.  'quote' is defined from a bit further up to be either «"» or «'»,
> depending on what type of quote is being used by the printer (to avoid putting
> this type of quote inside of the string, while still allowing the other).
Do you mean I should use a similar construction? I don't understand which functions need to be involved.

gchar *
g_utf8_strescape (const gchar *source,
                  const gchar *exceptions,
                  const gchar *quotes)
  {
...
    quote = g_variant_new_string (quotes);
    g_variant_get_variant (quote);
...
  }

Comment 28 Owen Taylor 2013-05-23 19:07:38 UTC

(In reply to comment #25)
> Owen: what was your original need for this function?

2002-07-31 16:52:42 UTC was almost 11 years ago! 

It could be that I noticed that g_strescape() was broken when I tried to use it in Pango (in the place referenced above), and then filed this bug.

Comment 29 GNOME Infrastructure Team 2018-05-23 23:12:23 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/4.