Bug 399216 – New feature: Glib::ustring::compose()


Summary:	New feature: Glib::ustring::compose()


Status:	RESOLVED FIXED

Product:	glibmm
Classification:	Bindings
Component:	strings
Version:	2.13.x
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Daniel Elstner
QA Contact:	gtkmm-forge

URL:
Whiteboard:

Depends on:
Blocks:	447496

Reported:	2007-01-22 00:53 UTC by Daniel Elstner
Modified:	2007-12-03 09:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Proof of concept implementation of the compose API (16.49 KB, patch) 2007-01-22 01:05 UTC, Daniel Elstner

Description Daniel Elstner 2007-01-22 00:53:45 UTC

glibmm should provide the functionality to compose message strings from a format template and a list of arguments.  Substituting references in a template string instead of simply concatenating strings is absolutely necessary to enable proper internationalization.  The STL stream interface is unfortunately not suitable for this purpose.

Prior art:

1. compose mini-library by Ole Laursen: http://people.iola.dk/olau/compose/
2. custom minimal implementation in regexxer: http://svn.gnome.org/viewcvs/regexxer/trunk/src/translation.h?view=markup

Ole Laursen's implementation supports formatting values of arbitrary type by passing the arguments to compose() through an std::wostringstream.  The minimal variant in regexxer supports only string arguments.

I think glibmm should also offer facilities for formatted conversion to strings, as Ole Laursen's compose library already does.  However, I'm currently somewhat in favor of providing a separate format() API, rather than having compose() do both.  The main new user-visible functions would then be:

static ustring ustring::compose(const char* format, const ustring& s1, ...);
template <class T1, ...> static ustring ustring::format(const T1& a1, ...);

Example usage of the proposed API:

using Glib::ustring;
const int percentage = 50;
const ustring text = ustring::compose("%1%% done", ustring::format(percentage));

This is subject to change.  I'm going to outline my reasons for favoring the separate API on the mailing list, so that it can be discussed before the final decision is made.

Comment 1 Daniel Elstner 2007-01-22 01:05:26 UTC

Created attachment 80848
Proof of concept implementation of the compose API

The attached patch is a proof-of-concept implementation of the separated compose and format API I'm leaning towards at the moment.  The main advantage of the separation is that the implementation is small and straightforward, which should be clearly visible from the patch.  Applies to glibmm trunk.

The patch does not yet add the appropriate configure checks for wchar_t and std::wstring.  The final version will need to have these checks.

Comment 2 Murray Cumming 2007-01-23 11:45:02 UTC

> using Glib::ustring;
> const int percentage = 50;
> const ustring text = ustring::compose("%1%% done",
> ustring::format(percentage));

Could you explain why we can't just do:
  const ustring text = ustring::compose("%1%% done", 50);
?

Is %1% something standard with printf, or something new? Would %d be as valid?

Also, I don't understand the need for wchar. I thought that was something we never needed to use with UTF-8.

Comment 3 Daniel Elstner 2007-01-23 14:32:08 UTC

> Could you explain why we can't just do:
>  const ustring text = ustring::compose("%1%% done", 50);
> ?

This would work with Ole Laursen's compose mini-library.  But as I said above, I'm somewhat in favor of separating the compose and format functionality.  I'll hopefully have the time today to outline my reasons for this on the mailing list, as promised.

> Is %1% something standard with printf, or something new? Would %d be as valid?

No, %d would not be valid because compose() is not printf().  However, the format is already used by Qt (and therefore known as qt-format).  It is supported by gettext, too, and I have been using it in regexxer for years now.  Note that the whole idea of compose() is to offer a typesafe means of message formatting.  The %d or %f or whatever would make no sense, as these are used to specify the type of the argument.  %1 just means that the first argument is to be substituted.
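To illustrate the positional substitution described above, here is a minimal sketch in standard C++. It is not the glibmm implementation: the name compose_sketch and the use of std::string and std::vector are assumptions made for this example, and only the placeholders %1 to %9 plus the literal %% are handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only (not part of glibmm).
// Replaces %1..%9 with the corresponding argument and "%%" with "%".
// A placeholder without a matching argument is silently dropped.
std::string compose_sketch(const std::string& fmt,
                           const std::vector<std::string>& args)
{
  std::string result;
  for (std::size_t i = 0; i < fmt.size(); ++i)
  {
    const char c = fmt[i];
    if (c == '%' && i + 1 < fmt.size())
    {
      const char next = fmt[i + 1];
      if (next == '%')
      {
        result += '%'; // "%%" yields a literal percent sign
        ++i;
        continue;
      }
      if (next >= '1' && next <= '9')
      {
        // "%1" refers to args[0], "%2" to args[1], and so on.
        const std::size_t index = static_cast<std::size_t>(next - '1');
        if (index < args.size())
          result += args[index];
        ++i;
        continue;
      }
    }
    result += c; // ordinary character, copied unchanged
  }
  return result;
}
```

Because the placeholder carries only a position and no type, a translator can reorder %1 and %2 in the template without touching the argument list.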

> Also, I don't understand the need for wchar. I thought that was something we
> never needed to use with UTF-8.

We don't.  It's an implementation detail hidden behind the compose/format API.  The problem is that e.g. thousands separators are defined by single characters, not strings.  In a locale where the thousand separator doesn't fit into a single byte, plain std::ostream will truncate it.  In fact, such subtleties are one more reason to have the compose functionality in glibmm, so our users don't have to put up with that.  Also note that using wchar_t has the advantage of avoiding locale->UTF-8 conversion through iconv, since it always holds UCS-4 code points on modern Linux systems, independently of the locale.  This can be detected at compile time; just have a look at the patch.

Comment 4 Murray Cumming 2007-02-10 13:41:00 UTC

> > Also, I don't understand the need for wchar. I thought that was something we
> > never needed to use with UTF-8.
>
> We don't.  It's an implementation detail hidden behind the compose/format API. 
> The problem is that e.g. thousands separators are defined by single characters,
> not strings.  In a locale where the thousand separator doesn't fit into a
> single byte, plain std::ostream will truncate it.

So, wouldn't it be clearer to use gunichar somehow? wchar tends to suggest UCS2 or UCS4. Why is it inconceivable that a thousands separator could be more than one Unicode character?

Comment 5 Daniel Elstner 2007-02-12 04:03:06 UTC

Er, I think you misunderstood. My fault, I should probably have added a little more context to my explanation. What ustring::format() does is this:

    template <class T1> inline // static
    ustring ustring::format(const T1& a1)
    {
      ustring::FormatStream buf;
      buf.stream() << a1;
      return buf.to_string();
    }

ustring::format() is overloaded for up to N arguments, with N=6 in my current patch. It simply writes each argument to a temporary stream and returns the accumulated result as a string. The STL stream is encapsulated in a private FormatStream class in order to move as much code out of the template as possible. buf.stream() returns a reference to the internal STL stream. Now, what Ole Laursen has discovered is that this internal stream must be a wide character stream, i.e. std::wostringstream and not just std::ostringstream.
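The idea of writing each argument to a temporary wide stream can be sketched with a variadic template instead of N overloads. This is an illustration under assumptions, not the code from the patch: format_sketch is a hypothetical name, it returns std::wstring rather than ustring, and the conversion of the wide result to UTF-8 is left out.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for ustring::format(), for illustration only.
// Streams every argument, in order, into one wide string stream and
// returns the accumulated text.
template <class... Ts>
std::wstring format_sketch(const Ts&... args)
{
  std::wostringstream buf;
  // C++11 pack expansion: the braced list is evaluated left to right,
  // so the arguments are written in the order they were passed.
  const int expand[] = { 0, ((void)(buf << args), 0)... };
  (void)expand; // only the side effects matter
  return buf.str();
}
```

Since the arguments are forwarded to operator<< unchanged, I/O manipulators such as std::hex can be passed like any other argument.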

Why? Because some of the locale-defined characters that are implicitly produced by string formatting could end up truncated otherwise. The thousands separator, for instance, could be a code point outside the ASCII range. One realistic example would be U+066C ARABIC THOUSANDS SEPARATOR. All is fine if you deal with code points as a whole -- that is, when all code points have the same fixed storage size, and an object of the stream's character type always holds exactly one such code point.

Now, if you introduce multi-byte encodings such as UTF-8 into the picture, this no longer holds true. An object of the stream's "character" type does not necessarily contain a whole code point anymore. The single wchar_t object 0x66C (assuming Unicode here) becomes a string of 2 char objects: 0xD9 0xAC. Thus, it's no longer a single object of the stream's so-called "character" type, but actually a string.
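The byte values quoted above can be checked with a small UTF-8 encoder. The function name encode_utf8 is made up for this sketch, and it does no validation (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Hypothetical helper for illustration: encode one Unicode code point
// as UTF-8. No validation of the input is performed.
std::string encode_utf8(char32_t cp)
{
  std::string out;
  if (cp < 0x80)
  {
    out += static_cast<char>(cp); // 1 byte: 0xxxxxxx
  }
  else if (cp < 0x800)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));   // 110xxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F)); // 10xxxxxx
  }
  else if (cp < 0x10000)
  {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For U+066C the second branch applies, so the result has two char objects and cannot be returned where the stream interface expects a single char.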

When internationalization was added to the STL stream interface, apparently no-one envisioned that some day people would want to use a multi-byte encoding as internal encoding in an application. The common mantra back then was to use a fixed-width encoding internally and a multi-byte encoding externally. Earlier multi-byte encodings were stateful and thus cumbersome, and often inefficient to process -- especially if the generic C library functions were used. In no way did the C++ standard anticipate the huge success of UTF-8, and most importantly that applications would start using it for everything *independently of the locale*. If UTF-8 is the only encoding, applications can hard-code the multi-byte processing. Because of the cunningly simple design of the UTF-8 encoding scheme, the performance considerations thus become moot.

But since no one envisioned that, we inherited the legacy of a stream locale interface that simply doesn't allow one to use a string as the thousands separator. It must be a single object of the stream's "character" type. See std::numpunct::thousands_sep():

<http://www.tacc.utexas.edu/services/userguides/pgi/pgC++_lib/stdlibcr/num_2619.htm#Public%20Member%20Functionsthousands_sep()>

That's why we must use std::wostringstream to format the string. The fact that probably only very few people will be aware of this nasty catch (I wasn't until I read it on Ole Laursen's website) makes it all the more necessary to add this formatting API to glibmm.
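The constraint can be seen directly in the facet interface: do_thousands_sep() returns exactly one char_type, so U+066C fits only when char_type is wchar_t. The following sketch installs a custom facet rather than relying on an installed Arabic locale. The class ArabicGrouping and the function format_grouped are assumptions made for this example, and it presumes that wchar_t holds Unicode values.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: groups digits in threes and
// uses U+066C ARABIC THOUSANDS SEPARATOR as the separator.
class ArabicGrouping : public std::numpunct<wchar_t>
{
protected:
  // The return type is a single wchar_t, never a string.
  wchar_t do_thousands_sep() const override { return L'\x066C'; }
  std::string do_grouping() const override { return "\3"; }
};

// Formats a number through a wide stream with the facet installed.
std::wstring format_grouped(long value)
{
  std::wostringstream out;
  // The locale takes ownership of the facet object.
  out.imbue(std::locale(out.getloc(), new ArabicGrouping));
  out << value;
  return out.str();
}
```

With std::numpunct<char>, the same override would have to return a single char, which has no room for the two UTF-8 bytes 0xD9 0xAC.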

And on the implementation side, it's not so bad actually. Arguably going through wchar_t is cleaner in some sense, as it e.g. allows for skipping the locale conversion on modern systems (glibc always uses UCS-4 and win32 always uses UTF-16 for wchar_t). We only need to have configure check for its availability. In any case our users won't have to worry about it.

Pheww!

(I still intend to do the promised write-up for the mailing list about my compose/format API split proposal. Gimme a couple of days.)

Comment 6 Jonathon Jongsma 2007-05-02 17:18:41 UTC

So what's the chance of something like this being proposed and included in 2.12?

Comment 7 Murray Cumming 2007-05-30 16:48:47 UTC

> Now, what Ole Laursen has discovered is that this internal stream must be a wide
> character stream, i.e. std::wostringstream and not just std::ostringstream.
>
> Why? Because some of the locale-defined characters that are implicitly produced
> by string formatting could end up truncated otherwise. The thousands separator,
> for instance, could be a code point outside the ASCII range. One realistic
> example would be U+066C ARABIC THOUSANDS SEPARATOR.

OK. I'd like to see this in code comments.

Can wchar handle any Unicode character?

Comment 8 Murray Cumming 2007-06-10 10:45:24 UTC

I would like to see a patch for possible inclusion in glibmm 2.13/2.14.

Comment 9 Daniel Elstner 2007-08-12 02:52:25 UTC

OK, I went ahead and committed a preliminary implementation of the message compose and format API in order to give people something to play with. The API is of course still open to discussion; I'm going to ask for opinions on the mailing list later today. The necessary configure checks for wide stream support are included with this commit, as well as an implementation optimized for UTF-16 on Windows.

2007-08-12  Daniel Elstner  <danielk@openismus.com>

	* glib/glibmm/ustring.{cc,h}: Add preliminary implementation of
	a message compose and format API (bug #399216).  The API design
	is not final and still open for discussion.
	(ustring::compose): New set of static methods for composing
	internationalized text messages by substituting placeholders
	in a template string.
	(ustring::format): New set of static methods for locale-dependent
	formatting of numbers and other streamable objects to strings.
	(ustring::compose_argv): New static method which implements the
	common functionality of the compose() overloads.
	(ustring::FormatStream): New helper class which implements the
	type-independent functionality of the format() templates.
	(operator>>): New operator overload for std::wistream.
	(operator<<): New operator overload for std::wostream.

2007-08-12  Daniel Elstner  <danielk@openismus.com>

	* scripts/dk-feature.m4: New file, defining M4 utility macros for
	feature testing.  These macros are part of my personal autoconf
	library and are not specific to glibmm, as indicated by the "DK_"
	namespace prefix.

	* configure.in (AC_INIT): Switch to the non-deprecated usage of
	AC_INIT() by passing project name and version number as arguments.
	This is necessary to define a couple of auxiliary macros.
	(AC_PREREQ): Bump Autoconf version requirement to 2.58.
	(AC_CONFIG_SRCDIR): Point to project-specific source file.
	(AC_CONFIG_MACRO_DIR): Declare scripts/ as M4 directory.
	(AM_INIT_AUTOMAKE): Switch to non-deprecated usage.
	(AC_CHECK_SIZEOF): Use to determine the size of wchar_t.
	(DK_CHECK_FEATURE): Use new feature test macro to check for
	support of wide-character streams.

	* config.h.in (SIZEOF_WCHAR_T): Add #undef template.
	* glib/glibmmconfig.h.in (GLIBMM_HAVE_WIDE_STREAM): Likewise.

Comment 10 Murray Cumming 2007-08-13 08:14:01 UTC

We are past API freeze, so please make sure that this is only in HEAD, branching if necessary.

Comment 11 Daniel Elstner 2007-08-13 10:32:47 UTC

OK, I retroactively created a branch "glibmm-2-14" that excludes the changes made after July 30th.

Comment 12 Daniel Elstner 2007-08-15 01:41:21 UTC

OK, I committed the API change as discussed on the mailing list:

2007-08-15  Daniel Elstner  <danielk@openismus.com>

	* glib/glibmm/ustring.{cc,h} (ustring::compose_argv): Rename
	"format" argument to "fmt" to avoid name clashes with the method
	of the same name.
	(ustring::compose): Make the type of each substitution parameter
	a template argument, and invoke ustring::format() implicitly for
	non-string arguments.  Explicit invocation of ustring::format() is
	still necessary in order to apply I/O manipulators to an argument.
	(ustring::Stringify): New auxiliary template class used in the
	implementation of ustring::compose().

	* examples/compose/main.cc (show_examples): Omit explicit calls
	to ustring::format() where possible.

Comment 13 Murray Cumming 2007-11-30 11:14:07 UTC

So can we close this bug?

Comment 14 Daniel Elstner 2007-12-03 09:47:24 UTC

OK. Closing the bug because the feature has been implemented in SVN trunk.