Bug 93787 – Outputting ustring with operator << converts implicitly


Summary:	Outputting ustring with operator << converts implicitly


Status:	RESOLVED FIXED

Product:	gtkmm
Classification:	Bindings
Component:	reference documentation
Version:	2.0
Hardware:	Other other

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkmm-forge
QA Contact:	gtkmm-forge

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2002-09-20 17:43 UTC by Ole Laursen
Modified:	2005-09-27 19:56 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch that adds a short notice to the tutorial about this problem (4.24 KB, patch) 2002-11-10 11:18 UTC, Ole Laursen
Second try on a patch to "Programming with gtkmm". (4.45 KB, patch) 2002-11-11 20:07 UTC, Ole Laursen

Description Ole Laursen 2002-09-20 17:43:09 UTC

This results in problems with displaying the error in a user-friendly way
(i.e. using a message dialog instead of just outputting to a perhaps
non-existing terminal). For instance, catching a Glib::Error and
concatenating the return value of the what() member with some extra
information will usually not give a proper UTF-8 ustring when either GTK+
or the program is i18nized.

Perhaps GTK+ converts the messages as a convenience - all the examples I
could find in the documentation simply display the error with
fprintf(stderr, ...). However, the conversion results in surprising
behaviour that is undetectable unless running in an i18nized environment.
And displaying critical errors in the terminal is _really_ unfriendly.

So if not too much Gtkmm application code depends on the format of the
what() messages (I guess this is the case for GTK+ itself), I propose
undoing the conversion in the wrapper with locale_to_utf8.

Else, at least it badly needs documenting. But I won't provide a patch
before a decision is made. :-)

Comment 1 Daniel Elstner 2002-09-21 15:52:55 UTC

Huh?  The messages are of course supposed to be UTF-8 encoded -- I
never encountered anything else.  Could you provide a small test case,
please?

Comment 2 Ole Laursen 2002-09-21 18:45:24 UTC

While constructing the test example, I found out that the what()
strings actually are in UTF-8 (the examples in the GTK+ documentation
without conversions must be buggy, then). So the problem is that the
function I'm using for composing the error message uses an
ostringstream for converting the arguments generically, like:

  compose("%1. Check your installation or contact the distributor.",
          ex.what());

And ustrings are silently converted to the locale encoding when using
operator <<. I've changed the subject line to match this.

IMHO this is too clever. Another scenario where this might silently
break things is outputting to files. What about refraining from doing
the conversion and instead emphasizing that output to the console must
be converted?
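
A minimal sketch of the kind of helper described above, using only the standard library. The name compose1 and its one-argument form are hypothetical; this is not the actual composition library. The point is that the argument's bytes are whatever its operator<< writes into the ostringstream, so a type whose operator<< converts to the locale encoding (as Glib::ustring does here) yields a locale-encoded result.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a compose() helper: replaces every "%1" in fmt
// with arg, formatted generically through an ostringstream.
template <typename T>
std::string compose1(const std::string& fmt, const T& arg)
{
    std::ostringstream os;
    os << arg;                          // arg's operator<< decides the bytes
    const std::string text = os.str();

    std::string result;
    std::string::size_type pos = 0;
    for (;;) {
        const std::string::size_type hit = fmt.find("%1", pos);
        if (hit == std::string::npos) {
            result.append(fmt, pos, std::string::npos);  // copy the tail
            break;
        }
        result.append(fmt, pos, hit - pos);  // text before the placeholder
        result += text;                      // the formatted argument
        pos = hit + 2;                       // skip past "%1"
    }
    return result;
}
```

With plain char data the bytes pass through unchanged; the surprise only appears when the argument type performs its own conversion inside operator<<.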

Comment 3 Daniel Elstner 2002-09-21 19:27:14 UTC

Yes, I wasn't entirely happy with it either -- but it seemed to be the
sanest solution.  The original implementation of operator<< for
Glib::ustring actually did the conversion for cout/cerr/clog only. 
Apart from being even more confusing it causes trouble with string
streams. If you do:

out << 123 << ustring;

then the stream formats 123 according to its locale, in an encoding
that isn't necessarily UTF-8.  For libstdc++-v2 with
setlocale() called, that will always be the locale's encoding.  This
behaviour is broken -- libstdc++-v3 does it right and allows you to
setup a global default std::locale object, and you can also assign a
different locale setting to single streams.
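
To illustrate that even plain numbers are formatted by the stream's locale, here is a small sketch. It uses a hand-written facet (DotGrouping, a hypothetical name) instead of a named system locale, so it does not depend on which locales are installed.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: group digits in threes with '.', as several European
// locales do. A real locale would supply this through its own numpunct.
class DotGrouping : public std::numpunct<char>
{
protected:
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
};

std::string format_with_grouping(long value)
{
    std::ostringstream out;
    // The std::locale object takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new DotGrouping));
    out << value;   // formatted by the imbued locale, not by the caller
    return out.str();
}
```

Whatever the facet emits ends up in the stream's buffer next to any string data, which is why the two need to agree on an encoding.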

But even then you're lost because it's impossible to retrieve the name
of the encoding the stream is using (if there should be a way I've
overlooked please tell me!).  Since the most probable stream encoding
is the locale's default encoding, I decided to always do the conversion.

Yes, file output is an issue.  I'm using ustring::raw() in my programs
to make clear I want to write the raw data, i.e. UTF-8 to the stream.
 I understand this isn't pleasant either -- both ways seem to be
flawed somehow.

It's almost unbelievable:  the darn std::locale concept is immensely
powerful and allows stuff I'm never going to use -- but simply
retrieving a stream's encoding doesn't seem possible :(

Any ideas?

--Daniel

Comment 4 Ole Laursen 2002-09-21 20:22:53 UTC

Ah, I see. I wasn't aware of the stringstream conversion of numbers,
though it somehow makes sense once you think about it. I guess this
leaves one with two possibilities:

 1. Construct ASCII or UTF-8 stringstreams and change the operator <<
to never convert the output.

 2. Do all character manipulation involving streams with the locale's
encoding and convert back using locale_to_utf8.

My version of The C++ Programming Language didn't have the extra
appendix with information about std::locale. Is it easy to construct
streams in the C locale? If so, I think that's the cleanest solution.
Perhaps it would even be possible to provide a few typedefs/templates
such as Glib::ustringstream?

In any case, it probably needs to be mentioned somewhere that ordinary
stringstreams are incompatible with Glibmm, unlike std::string.

Comment 5 Ole Laursen 2002-09-21 20:34:25 UTC

Oops, a C locale isn't good enough. Then floats wouldn't be localized.
Question is whether it is possible to simply set the encoding to UTF-8?

Comment 6 Daniel Elstner 2002-09-21 20:49:59 UTC

Creating streams with an arbitrary locale object is easy, just do:

stream.imbue(std::locale("C"));

but that's not the issue.  You can't use a C locale stream because
then you'd lose any i18n functionality.  Apart from decimal point
differences in Western countries, an exotic language's numbers might
not even be representable in ASCII.  I think this applies to some
Arabian countries -- although they're using the decimal system (which
they invented ;) the digits look different.

So you've to go for option 2 -- at least until everyone has switched
to a UTF-8 locale (which I've already done, by the way ;)

What exactly do you mean by "ordinary stringstreams are incompatible
with Glibmm"?  If we continue using option 2 (as gtkmm does now) you
just have to pay attention to convert strstream.str() to UTF-8 before
assigning the data to a ustring.  Did you mean just that or something
else?

Comment 7 Daniel Elstner 2002-09-21 20:57:27 UTC

Mid-air collision ;)

You can set the encoding to UTF-8 by doing e.g.:

stream.imbue(std::locale("de_DE.UTF-8"));

It's also possible to change the LC_CTYPE category only.  But as you
can see from the example, you've to know a complete locale name which
also has to actually *exist* on the system.  Even if you assume Linux,
there are distros (e.g. Debian) that only generate the locales
explicitly requested by the user.

Probably even worse is that libstdc++-v2 doesn't support std::locale.
 Which in practice means your code will require g++ 3.0 or newer.
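
Because a named locale may not exist on the system, constructing one can throw. A hedged sketch of guarding against that follows; the helper name locale_or_classic is an assumption for illustration only.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns the named locale if the system provides it, otherwise falls back
// to the classic "C" locale. std::locale's constructor throws
// std::runtime_error for names it cannot resolve.
std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

A caller could then write stream.imbue(locale_or_classic("de_DE.UTF-8")), at the price of silently losing localized formatting when the locale is missing.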

Comment 8 Daniel Elstner 2002-09-21 21:18:26 UTC

I just realized that it might be possible to detect a stream's
encoding by retrieving the locale name, temporarily switching to this
locale via setlocale(), and then calling nl_langinfo() to get the charset.

That would only work on Unix with libstdc++-v3 though, and it doesn't
solve the problem that the string returned by stringstream::str() is
most probably not UTF-8.  Darn!
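
The detection idea above could be sketched roughly as follows. This is POSIX-only (nl_langinfo is not standard C++), the function name charset_of is hypothetical, and it is not thread-safe because it temporarily changes the process-wide C locale.

```cpp
#include <clocale>
#include <ios>
#include <langinfo.h>   // POSIX, not standard C++
#include <locale>
#include <sstream>
#include <string>

// Returns the charset name of the stream's locale, or "" if it cannot be
// determined. Usage: std::ostringstream s; charset_of(s);
std::string charset_of(const std::ios_base& stream)
{
    const std::string name = stream.getloc().name();
    if (name == "*")                  // unnamed locale: nothing to look up
        return std::string();

    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    if (!std::setlocale(LC_CTYPE, name.c_str()))
        return std::string();         // setlocale() rejected the name

    const std::string charset = nl_langinfo(CODESET);
    std::setlocale(LC_CTYPE, saved.c_str());   // restore the previous locale
    return charset;
}
```

As noted above, this only tells you the encoding; the string returned by str() still has to be converted.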

Comment 9 Ole Laursen 2002-09-22 10:00:26 UTC

Yes, by incompatible I was thinking of the conversion issue. They are
not directly compatible unless you remember converting the strings. 

One way out of some of the mess would perhaps be to set forth a
convention that all std::strings be in the locale's encoding with
automatic conversions back and forth? With that approach, I wouldn't
have encountered my problem, and I think the semantics are easier to
remember and comprehend then.

I don't think you can be sure that all characters will survive the
conversion back and forth unaltered, though. But I can't see we can do
anything about that in any case. Also, if the strings are localised
using the same language, chances are the locale's encoding will
support all characters.

At least if you know that std::strings are supposed to be locale
dependent, you know that you can't count on not losing data. A
raw() method may still provide an unchanged string for e.g. files or
verbatim keys, then.

Comment 10 Murray Cumming 2002-10-02 10:12:24 UTC

I don't want to read and understand all of that. Can you both tell me
whether you think this is still a bug, or an enhancement, and
summarise why.

Comment 11 Ole Laursen 2002-10-03 20:09:26 UTC

I think it's more of a bug than an enhancement.

To summarize: the problem is that the standard streams treat their
characters in a locale-dependent way (i.e. with different encodings
for different locales). 

Currently, ustrings convert their contents to the locale-dependent
encoding when used in conjunction with operator <<. This makes sense
since the stream needs to get the data in the same encoding as the
characters it already contains. But it also means that using ustrings
in conjunction with stringstreams will silently result in a string
which isn't in UTF-8 (i.e., unless the locale happens to be UTF-8).

Unless you're running the program in an i18nized environment, you'll
never discover this problem. Pango refuses to show labels with invalid
UTF-8, so e.g. it might cause error messages to never show up.
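
One way to catch this during testing is to check strings before handing them to the toolkit. The following is a standalone sketch of a UTF-8 well-formedness check using only the standard library; the name is_valid_utf8 is hypothetical, and it is not the check Pango or GLib performs internally.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8 (no overlong forms, no UTF-16
// surrogates, nothing above U+10FFFF, no truncated sequences).
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned int lead = static_cast<unsigned char>(s[i]);
        if (lead <= 0x7F) { ++i; continue; }       // plain ASCII

        std::size_t extra = 0;                     // continuation bytes expected
        unsigned int lo = 0x80, hi = 0xBF;         // range of the first one
        if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
            if (lead == 0xE0) lo = 0xA0;           // reject overlong forms
            if (lead == 0xED) hi = 0x9F;           // reject surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
            if (lead == 0xF0) lo = 0x90;           // reject overlong forms
            if (lead == 0xF4) hi = 0x8F;           // reject > U+10FFFF
        } else {
            return false;                          // invalid lead byte
        }
        if (n - i <= extra) return false;          // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned int c = static_cast<unsigned char>(s[i + k]);
            const unsigned int min = (k == 1) ? lo : 0x80;
            const unsigned int max = (k == 1) ? hi : 0xBF;
            if (c < min || c > max) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

A Latin-1 encoded "é" (the single byte 0xE9) fails this check, which is exactly the kind of string a non-UTF-8 locale produces.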

So I propose the following:

 * make the decision that std::strings should always be in
   locale-dependent encoding
 * when converting std::strings to ustrings, always convert
   implicitly from locale-dependent encoding to UTF-8
 * keep a few overrides, like raw(), which still produces
   UTF-8 std::strings
 * put a big notice in the tutorial and in the documentation
   for ustrings

This would give a more seamless integration with the standard library.
I don't think it's possible to guarantee that all characters survive
the conversion back and forth, but hopefully people will use a locale
that supports all the characters of their language. So this might be a
non-existent problem.

Comment 12 Daniel Elstner 2002-10-10 14:26:37 UTC

I think this would create more problems than it solves.

1) Using std::string instead of Glib::ustring for UTF-8 strings would
lead to unexpected results.

2) Charset conversion is quite costly and shouldn't be performed in an
implicit type conversion.

3) The encoding of std::string is undefined and should stay that way.
 For instance, we always use std::string for filenames because their
encoding could be either UTF-8 or the locale encoding if
G_BROKEN_FILENAMES is set.

To summarize, I think it's a bad idea to associate std::string with
any particular encoding.  The C++ standard library doesn't do that
either.  Regarding std::ostringstream it's sufficient to write:

Glib::ustring ustr = Glib::locale_to_utf8(stream.str());

That isn't too hard, we should just mention it in the documentation.

--Daniel

Comment 13 Murray Cumming 2002-10-11 18:17:25 UTC

Where should we mention it exactly?

Comment 14 Ole Laursen 2002-10-12 15:20:07 UTC

I'm not feeling quite convinced yet. :-)

As I see it, there is no solution that will solve all possible needs;
the question is really more that of what situations require explicit
character conversions. As it is now, it's possible to put UTF-8
strings in std::strings without any trouble, but using streams
requires explicit conversions. Isn't it much more useful the other way
round? Why wouldn't you just use a Glib::ustring if you wanted to
store UTF-8 strings?

It also seems much more likely that if you really need a std::string,
it needs to be in the locale's encoding. It's usually announced
heavily if a library wants or outputs UTF-8 so it's not too difficult
to remember explicit measures in that case. It's not so the other way
round. With the current approach, boost::lexical_cast results in
corrupted strings without anyone noticing.

Thing is, you can't convert much to a string without going through a
stringstream. So in effect the standard already does enforce a
particular encoding on strings, at least if you want to avoid the mess
of having std::strings with different encodings in the same program.
Besides, we wouldn't really be enforcing anything for all
std::strings, we would just change the assumed encoding of
std::strings used with Gtkmm from UTF-8 to the locale's encoding.
Extra functions can handle the case where this assumption is wrong.

The cost of the conversions is not important here, I think. Unless
you're sloppy with your string types, the conversions are actually
needed, whether implicit or not.


If I've not managed to convince you, I think it would be better to
remove the implicit std::string conversion completely and introduce
something like string_from_utf8 and string_from_locale instead. This
would at least force one to think about the issue.

Comment 15 Daniel Elstner 2002-10-12 21:12:29 UTC

No, you didn't manage to convince me ;)

You completely ignored my filenames example.  Even if I would agree
with all your other points this would still be a show stopper.

Regarding the use of std::string for UTF-8 strings:  This is a) often
absolutely unavoidable e.g. when reading from a standard stream
without using operator>>() (which isn't appropriate most of the time).
 And b) it's useful for optimization purposes.  A lot of generic
string operations work just fine with UTF-8 since it was designed to
be as compatible as possible.

Further:
"Thing is, you can't convert much to a string without going through a
stringstream. So in effect the standard already does enforce a
particular encoding on strings [...]"

This is actually not true.  A stream doesn't have a locale assigned by
default.  The reasons why I decided to convert to the current locale
in operator<<() are a) because libstdc++-v2 (the one delivered with
gcc-2.95.x) is broken and uses the current C locale for all streams.
And b) even if you have proper locale support (as in libstdc++-v3)
there's no portable way to say "use the current locale but replace its
charset with UTF-8".  Which means you kinda have to use the charset
dictated by the current locale in i18nized apps.

Next point: "If I've not managed to convince you, I think it would be
better to remove the implicit std::string conversion completely and
introduce something like string_from_utf8 and string_from_locale
instead. This would at least force one to think about the issue."

While forcing users to think about the issue would be nice, we just
can't do that.  There are several places in the gtkmm API where the
compiler cannot know the encoding of a string but the programmer can.
 Take Glib::convert() for instance.  It's using std::string as
arguments because the encoding is *unknown* to the compiler, even
though the programmer might know it.

Now one could argue "why don't you use Glib::locale_{to,from}_utf8()
instead".  Well, because I might want to acquire a persistent
Glib::IConv handle and use Glib::IConv::convert().  Even if I use it
only to convert to/from UTF-8 it has to support arbitrary conversions
as well.

Convinced? ;)

Cheers,
--Daniel

Comment 16 Ole Laursen 2002-10-14 15:09:37 UTC

Regarding the filenames: I apologise, but the reason I didn't
comment on that is that I didn't quite understand the problem
- I still don't. :-)

Does the situation change? If you're relying on using
std::strings from Gtkmm file names together with ustrings,
your program is buggy, right? Since you will create invalid
ustrings if G_BROKEN_FILENAMES is set? And no one ever checks
that, do they? I've never heard of this problem before now.

Unless there's more to it, it sounds like a separate issue
that should either be fixed by adding documentation or by
shielding users of Gtkmm from it by replacing std::string with
ustring and automatically converting to UTF-8 if
G_BROKEN_FILENAMES is set.


You're somewhat right about reading from a standard (e.g.
file) stream - but you wouldn't go through std::string, would
you? The members return char *'s - so by default interpret
these as locale-encoded, unless you explicitly say what
encoding to use with e.g. a "ustring from_raw(char *)" helper.

It's true that this is more work than it is now, but OTOH lazy
coding still works; only if it's important to output UTF-8 is
it necessary to think a little.

About the generic string operations - is this a big problem? I
myself have 1-2 I use regularly, and it's no big deal to
templatize them if they really need to work on both
std::string and ustring (this also works for all string
algorithms, not just those that don't examine individual
characters):

  template <typename String> String strip(const String &s);

If you don't have access to source code, there's always a
"std::string raw()" method in ustring. I don't want to ban
other std::string encodings, just change the default
assumption of Gtkmm.
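
A possible body for such a templatized helper is sketched below. It only assumes the string type offers size(), operator[], substr() and the size_type/value_type typedefs; it is exercised here with std::string and std::wstring, and whether it is efficient for Glib::ustring (where indexing is not constant time) is left open.

```cpp
#include <string>

// Removes leading and trailing ASCII whitespace from any string-like type.
template <typename String>
String strip(const String& s)
{
    typedef typename String::size_type size_type;

    const auto is_space = [](typename String::value_type c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    };

    size_type first = 0;
    size_type last = s.size();
    while (first < last && is_space(s[first])) ++first;     // skip leading
    while (last > first && is_space(s[last - 1])) --last;   // skip trailing
    return s.substr(first, last - first);
}
```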


I don't quite understand what you write about streams not
having a locale assigned by default. Are you saying that my
point is wrong from the Standard point of view, but right from
the practical point of view? If so, the point, though not the
wording, is still valid, isn't it?


I'm trying to understand what you're saying about banning the
implicit conversions of ustring to std::string and std::string
to ustring. If I get your point, the problem is that you might
change the encoding to UTF-8 with IConv::convert but still get
a std::string. But then you could just pass it through a
"ustring from_raw(std::string)" that didn't perform any
conversions except from changing the type of the C++ object,
couldn't you?


I personally like the implicit locale-encoded std::strings
best since it changes the current behaviour least, except for
when you're interacting with the outside (as with file I/O). I
have a feeling that banning implicit conversions may change
quite a lot more.

Have I missed any points?

Comment 17 Ole Laursen 2002-10-14 19:48:00 UTC

> The members return char *'s - so by default interpret
> these as locale-encoded, unless you explicitly say what
> encoding to use with e.g. a "ustring from_raw(char *)" helper.

I just realised that this isn't quite true - whether to interpret
"char *" as locale-encoded is a separate issue. If such strings are
interpreted as such, the usual gettext scheme will break down. If not,
I think the situation gets too complex.

I looked at the operator<</>> definitions again. Operator >> also
implicitly converts back. So boost::lexical_cast<ustring>(something)
should actually work. Given that Murray just announced complete freeze
Wednesday, I think I'll just shut up then. :-)

I'm not closing the bug since the issue probably needs to be
documented. Too bad we can't get a properly behaving stringstream with
the current design of the Standard.

Comment 18 Daniel Elstner 2002-10-15 00:44:27 UTC

About the stream default locale issue: yeah you're right, what I wrote
was really ambiguous and could be interpreted as both pro and con ;) 
But you got right what I was trying to say: There are standards which
say don't, there is a practical approach which says yes, and neither
of them is really good or even just better than the other.

It seems we've come to an agreement that trying to automate anything
would be quite complicated, and not the perfect solution either.  So I
reckon we've to do it the hard way and start writing documentation ;)

Let's leave this bug open until the documentation has been added.

--Daniel

Comment 19 Daniel Elstner 2002-10-15 00:49:00 UTC

Regarding the filename problem:  You don't have to check for
G_BROKEN_FILENAMES because Glib::filename_{from,to}_utf8() does that
for you.  Just use them and forget about it.

The question is where to put information like that... Maybe right on
the front page in big letters saying "READ ME FIRST OR DIE!" :)

Comment 20 Ole Laursen 2002-10-15 12:54:42 UTC

Hehe.

Regarding where to put the information, as a starter I'd vote for both
the tutorial together with the information about ustring and in the
reference documentation on the ustring page.

Comment 21 Murray Cumming 2002-10-16 09:54:10 UTC

I need to add an "Internationalization and Translation" chapter to the
book sometime. Maybe this would be appropriate for that chapter. Or
maybe it should be added to the information about Glib::ustring.

It would be nice if someone could write some text about this issue so
that it could be used there.

Comment 22 Ole Laursen 2002-10-16 18:24:14 UTC

I was once thinking about what to put in the i18n chapter. Perhaps I
could restart thinking about it this weekend when I get some more free
time, and see if I can get something written down. I recently
installed psgml anyway.

A short notice in the reference documentation on ustring is needed
too, though (currently it says that using std::string instead of
ustring is basically fine; it just needs to mention what happens with
a std::stringstream).

Comment 23 Daniel Elstner 2002-10-17 04:59:25 UTC

I added a paragraph about streams to the Glib::ustring class
documentation, including a tiny std::ostringstream usage example.

Comment 24 Ole Laursen 2002-11-10 11:17:20 UTC

I think it's about time we close this bug. I'm attaching a patch that
adds a short notice in the ustring section in "Programming with gtkmm".

Comment 25 Ole Laursen 2002-11-10 11:18:58 UTC

Created attachment 12214
Patch that adds a short notice to the tutorial about this problem

Comment 26 Murray Cumming 2002-11-10 22:19:04 UTC

The paragraph about "character length" is a repetition of similar text
in the same section.

The other paragraph talks about "outputting with <<" but surely it's
only relevant to using << with the iostreams. It does not explain to
me what conversions need to be done or why - really, I don't know yet.

Also, I suspect that these conversions are only really necessary when
using non-utf8 locales. Or maybe I'm wrong. It would be nice to tell
people exactly when (at runtime) they will hit the problem that you
are telling them how to solve.

Comment 27 Ole Laursen 2002-11-11 20:06:08 UTC

I've tried refactoring the notice. Hope this is better. The reason I
didn't give an example but just referred to the reference documentation
is that the one given there is slightly complex:

 std::ostringstream output;
 output.imbue(std::locale("")); // use the user's locale for this stream
 output << percentage << " % done";
 label->set_text(Glib::locale_to_utf8(output.str()));

I don't know why the second line is necessary. But I've just
copy-yanked it all now.

Comment 28 Ole Laursen 2002-11-11 20:07:13 UTC

Created attachment 12234
Second try on a patch to "Programming with gtkmm".

Comment 29 Murray Cumming 2002-11-11 23:28:22 UTC

That seems better, though I still don't really understand the problem.
At least I know that I should worry whenever I use Glib::ustring with
a std::[i|o]stream.

Comment 30 Murray Cumming 2002-11-13 16:03:56 UTC

I have added this text to the "Internationalization and Translation"
chapter. I have also added a link to the Glib::ustring reference
documentation to the Basics chapter. That reference documentation
talks about iostreams too. Thanks.

Comment 31 Daniel Elstner 2002-12-29 00:27:27 UTC

Re.

 output.imbue(std::locale("")); // use the user's locale for this stream

That line definitely is necessary in standard C++, unless you
explicitly installed a global default locale via
std::locale::global(std::locale(""));

However, I think writing code that depends on global settings to work
correctly is generally a bad idea -- the explicit imbue() will always
work.  And I'd consider the ability to use stream-specific locales an
important feature.  For instance, configuration files should be
locale-independent.  Again, to avoid depending on global settings use:

  output.imbue(std::locale::classic());

and do _not_ use Glib::locale_to_utf8(), because the classic locale is
based on ASCII which is a subset of UTF-8.

Unfortunately libstdc++-v2 (the one that comes with g++ 2.95.x)
doesn't implement std::locale and all streams do always use the global
locale installed by setlocale() (which has no effect in libstdc++-v3).
 Thus you need to #ifdef around the std::locale stuff for now.  But
that won't solve that config file problem -- you have to convert
floats to strings manually to get it right, seriously.
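
For standard libraries that do implement std::locale, the locale-independent config file case might look like the sketch below. The function names are hypothetical; the explicit imbue() makes the result independent of whatever global locale the application installed.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Formats a number for a config file: always '.' as the decimal point and
// no digit grouping, regardless of the global locale.
std::string to_config_string(double value)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << value;
    return out.str();   // pure ASCII, hence already valid UTF-8
}

// Parses a number written by to_config_string(); returns false on failure.
bool from_config_string(const std::string& text, double& value)
{
    std::istringstream in(text);
    in.imbue(std::locale::classic());
    return static_cast<bool>(in >> value);
}
```

Since the output is ASCII, no Glib::locale_to_utf8() call is needed before assigning it to a ustring.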

I hope I could clarify matters a bit ;)

Comment 32 Ole Laursen 2005-09-17 21:10:58 UTC

Although this bug was closed a long time ago, I just want to add that I've discovered
the hard way that always using the encoding of the locale with std::strings is
not always good enough either.

When putting a floating point number into a stringstream, one may get a string
with extended characters out because the decimal point is converted to a special
decimal point character. This is _not_ guaranteed to be saved correctly in a
std::string because the characters may not be wide enough.

In fact I have experienced a crash with libstdc++ because it clipped a multibyte
UTF-8 decimal point character to a single byte instead of two bytes. This
happened _even though_ the encoding of the locale was UTF-8 so that one would
have thought that a valid UTF-8 std::string could have been produced (for the
record, the actual crash appeared when the program afterwards tried to convert
the malformed UTF-8 std::string to a ustring).

Instead I suggest converting to and from wchar_t* with Glib::convert and using
wstringstreams. This has worked flawlessly for me (in my string composition
library), except for occasional bugs in the implementation of specific locales
in the standard library.
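
A rough illustration of why a wide stream avoids the clipping: each wchar_t holds a whole character, so a non-ASCII decimal separator fits in one element. The facet below (WideDecimalPoint, a hypothetical name) stands in for a real locale; U+066B is used purely as an example separator, and the final conversion to UTF-8 (e.g. with Glib::convert) is not shown.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a locale whose decimal separator is
// not an ASCII character (here U+066B, ARABIC DECIMAL SEPARATOR).
class WideDecimalPoint : public std::numpunct<wchar_t>
{
protected:
    wchar_t do_decimal_point() const override { return L'\u066B'; }
};

std::wstring format_wide(double value)
{
    std::wostringstream out;
    // The std::locale object takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new WideDecimalPoint));
    out << value;
    return out.str();   // one wchar_t per character, nothing truncated
}
```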

Comment 33 Murray Cumming 2005-09-27 05:40:10 UTC

This really looks like discussion for the mailing list. Here, you just get to
talk to me, instead of lots more informed people.

> Although this bug is closed long time ago, I just want to add I've discovered
> the hard way that always using the encoding of the locale with std::strings is
> not always good enough either.

I guess that makes sense. If your locale is UTF-8 then you shouldn't be using
std::strings with API that gives/takes Glib::ustrings.

Comment 34 Ole Laursen 2005-09-27 19:56:40 UTC

My fault, I'm not continuing the discussion, I just meant to clarify things for
people who search the bug database since someone recently asked me about the
conclusion of this bug. I don't know how he dug up this bug, but if he could,
others could too, I reckoned. :-)