GNOME Bugzilla – Bug 93787
Outputting ustring with operator << converts implicitly
Last modified: 2005-09-27 19:56:40 UTC
This results in problems with displaying the error in a user-friendly way (i.e. using a message dialog instead of just outputting to a perhaps non-existing terminal). For instance, catching a Glib::Error and concatenating the return value of the what() member with some extra information will usually not give a proper UTF-8 ustring when either GTK+ or the program is i18nized. Perhaps GTK+ converts the messages as a convenience - all the examples I could find in the documentation simply display the error with fprintf(stderr, ...). However, the conversion results in surprising behaviour that is undetectable unless running in an i18nized environment. And displaying critical errors in the terminal is _really_ unfriendly. So if not too much Gtkmm application code depends on the format of the what() messages (I guess this is the case for GTK+ itself), I propose undoing the conversion in the wrapper with locale_to_utf8. Else, at least it badly needs documenting. But I won't provide a patch before a decision is made. :-)
Huh? The messages are of course supposed to be UTF-8 encoded -- I never encountered anything else. Could you provide a small test case, please?
While constructing the test example, I found out that the what() strings actually are in UTF-8 (the examples in the GTK+ documentation without conversions must be buggy, then). So the problem is that the function I'm using for composing the error message uses an ostringstream for converting the arguments generically, like: compose("%1. Check your installation or contact the distributor.", ex.what()); And ustrings are silently converted to the locale encoding when using operator <<. I've changed the subject line to match this. IMHO this is too clever. Another scenario where this might silently break things is outputting to files. What about refraining from doing the conversion and instead emphasizing that output to the console must be converted?
Yes, I wasn't entirely happy with it either -- but it seemed to be the sanest solution. The original implementation of operator<< for Glib::ustring actually did the conversion for cout/cerr/clog only. Apart from being even more confusing it causes trouble with string streams. If you do: out << 123 << ustring; then the stream converts the locale's string representation of 123 into an encoding that isn't necessarily UTF-8. For libstdc++-v2 with setlocale() called, that will always be the locale's encoding. This behaviour is broken -- libstdc++-v3 does it right and allows you to setup a global default std::locale object, and you can also assign a different locale setting to single streams. But even then you're lost because it's impossible to retrieve the name of the encoding the stream is using (if there should be a way I've overlooked please tell me!). Since the most probable stream encoding is the locale's default encoding, I decided to always do the conversion. Yes, file output is an issue. I'm using ustring::raw() in my programs to make clear I want to write the raw data, i.e. UTF-8 to the stream. I understand this isn't pleasant either -- both ways seem to be flawed somehow. It's almost unbelievable: the darn std::locale concept is immensely powerful and allows stuff I'm never going to use -- but simply retrieving a stream's encoding doesn't seem possible :( Any ideas? --Daniel
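To illustrate the point about numbers in the comment above: a stream formats numbers through the std::numpunct facet of whatever locale it is imbued with, so the characters it produces are not under the caller's control. The following is only a minimal sketch in standard C++; the facet DottedGrouping and the function format_grouped are hypothetical names made up for this illustration and are not part of glibmm.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: '.' as thousands separator,
// digits grouped in threes.
struct DottedGrouping : std::numpunct<char>
{
  char do_thousands_sep() const override { return '.'; }
  std::string do_grouping() const override { return "\3"; }
};

// Format an integer through a stream imbued with the facet above.
std::string format_grouped(long value)
{
  std::ostringstream out;
  // The locale object takes ownership of the facet and deletes it later.
  out.imbue(std::locale(std::locale::classic(), new DottedGrouping));
  out << value;
  return out.str();
}
```

With this facet, format_grouped(1234567) yields "1.234.567". A real locale could just as well supply separators outside ASCII, which is why the text already in the stream cannot be assumed to be UTF-8.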
Ah, I see. I wasn't aware of the stringstream conversion of numbers, though it somehow makes sense once you think about it. I guess this leaves one with two possibilities: 1. Construct ASCII or UTF-8 stringstreams and change the operator << to never convert the output. 2. Do all character manipulation involving streams with the locale's encoding and convert back using locale_to_utf8. My version of The C++ Programming Language didn't have the extra appendix with information about std::locale. Is it easy to construct streams in the C locale? If so, I think that's the cleanest solution. Perhaps it would even be possible to provide a few typedefs/templates such as Glib::ustringstream? In any case, it probably needs to be mentioned somewhere that ordinary stringstreams are incompatible with Glibmm, unlike std::string.
Oops, a C locale isn't good enough. Then floats wouldn't be localized. Question is whether it is possible to simply set the encoding to UTF-8?
Creating streams with an arbitrary locale object is easy, just do: stream.imbue(std::locale("C")); but that's not the issue. You can't use a C locale stream because then you'd lose any i18n functionality. Apart from decimal point differences in Western countries, an exotic language's numbers might not even be representable in ASCII. I think this applies to some Arabian countries -- although they're using the decimal system (which they invented ;) the digits look different. So you've to go for option 2 -- at least until everyone has switched to a UTF-8 locale (which I've already done, by the way ;) What exactly do you mean by "ordinary stringstreams are incompatible with Glibmm"? If we continue using option 2 (as gtkmm does now) you just have to pay attention to convert strstream.str() to UTF-8 before assigning the data to a ustring. Did you mean just that or something else?
Mid-air collision ;) You can set the encoding to UTF-8 by doing e.g.: stream.imbue(std::locale("de_DE.UTF-8")); It's also possible to change the LC_CTYPE category only. But as you can see from the example, you have to know a complete locale name which also has to actually *exist* on the system. Even if you assume Linux, there are distros (e.g. Debian) that only generate the locales explicitly requested by the user. Probably even worse is that libstdc++-v2 doesn't support std::locale. Which in practice means your code will require g++ 3.0 or newer.
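The "has to actually exist" caveat can be made concrete: constructing a std::locale from a name that is not installed throws std::runtime_error. A minimal sketch, assuming a fallback to the classic "C" locale is acceptable (which, as noted earlier in the thread, gives up localized number formatting); the helper name locale_or_classic is hypothetical.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Return the named locale if it is installed on this system,
// otherwise fall back to the classic "C" locale.
std::locale locale_or_classic(const std::string& name)
{
  try
  {
    return std::locale(name.c_str());
  }
  catch (const std::runtime_error&)
  {
    // The locale name is unknown to the C++ runtime on this system.
    return std::locale::classic();
  }
}
```

A stream would then be set up with stream.imbue(locale_or_classic("de_DE.UTF-8")), and the caller has to cope with either outcome.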
I just realized that it might be possible to detect a stream's encoding by retrieving the locale name, temporarily switching to this locale via setlocale(), and then calling nl_langinfo() to get the charset. That would only work on Unix with libstdc++-v3 though, and it doesn't solve the problem that the string returned by stringstream::str() is most probably not UTF-8. Darn!
Yes, by incompatible I was thinking of the conversion issue. They are not directly compatible unless you remember to convert the strings. One way out of some of the mess would perhaps be to set forth a convention that all std::strings be in the locale's encoding, with automatic conversions back and forth? With that approach, I wouldn't have encountered my problem, and I think the semantics are easier to remember and comprehend then. I don't think you can be sure that all characters will survive the conversion back and forth unaltered, though. But I can't see that we can do anything about that in any case. Also, if the strings are localised using the same language, chances are the locale's encoding will support all characters. At least if you know that std::strings are supposed to be locale dependent, you know that you can't count on not losing data. A raw() method may still provide an unchanged string for e.g. files or verbatim keys, then.
I don't want to read and understand all of that. Can you both tell me whether you think this is still a bug, or an enhancement, and summarise why.
I think it's more of a bug than an enhancement. To summarize: the problem is that the standard streams treat their characters in a locale-dependent way (i.e. with different encodings for different locales). Currently, ustrings convert their contents to the locale-dependent encoding when used in conjunction with operator <<. This makes sense since the stream needs to get the data in the same encoding as the characters it already contains. But it also means that using ustrings in conjunction with stringstreams will silently result in a string which isn't in UTF-8 (i.e., unless the locale happens to be UTF-8). Unless you're running the program in an i18nized environment, you'll never discover this problem. Pango refuses to show labels with invalid UTF-8, so e.g. it might cause error messages to never show up. So I propose the following:
* make the decision that std::strings should always be in locale-dependent encoding
* when converting std::strings to ustrings, always convert implicitly from locale-dependent encoding to UTF-8
* keep a few overrides, like raw(), which still produces UTF-8 std::strings
* put a big notice in the tutorial and in the documentation for ustrings
This would give a more seamless integration with the standard library. I don't think it's possible to guarantee that all characters survive the conversion back and forth, but hopefully people will use a locale that supports all the characters of their language. So this might be a non-existent problem.
I think this would create more problems than it solves. 1) Using std::string instead of Glib::ustring for UTF-8 strings would lead to unexpected results. 2) Charset conversion is quite costly and shouldn't be performed in an implicit type conversion. 3) The encoding of std::string is undefined and should stay that way. For instance, we always use std::string for filenames because their encoding could be either UTF-8 or the locale encoding if G_BROKEN_FILENAMES is set. To summarize, I think it's a bad idea to associate std::string with any particular encoding. The C++ standard library doesn't do that either. Regarding std::ostringstream it's sufficient to write: Glib::ustring ustr = Glib::locale_to_utf8(stream.str()); That isn't too hard, we should just mention it in the documentation. --Daniel
Where should we mention it exactly?
I'm not feeling quite convinced yet. :-) As I see it, there is no solution that will solve all possible needs; the question is really more that of what situations require explicit character conversions. As it is now, it's possible to put UTF-8 strings in std::strings without any trouble, but using streams requires explicit conversions. Isn't it much more useful the other way round? Why wouldn't you just use a Glib::ustring if you wanted to store UTF-8 strings? It also seems much more likely that if you really need a std::string, it needs to be in the locale's encoding. It's usually announced heavily if a library wants or outputs UTF-8 so it's not too difficult to remember explicit measures in that case. It's not so the other way round. With the current approach, boost::lexical_cast results in corrupted strings without anyone noticing. Thing is, you can't convert much to a string without going through a stringstream. So in effect the standard already does enforce a particular encoding on strings, at least if you want to avoid the mess of having std::strings with different encodings in the same program. Besides, we wouldn't really be enforcing anything for all std::strings, we would just change the assumed encoding of std::strings used with Gtkmm from UTF-8 to the locale's encoding. Extra functions can handle the case where this assumption is wrong. The cost of the conversions is not important here, I think. Unless you're sloppy with your string types, the conversions are actually needed, whether implicit or not. If I've not managed to convince you, I think it would be better to remove the implicit std::string conversion completely and introduce something like string_from_utf8 and string_from_locale instead. This would at least force one to think about the issue.
No, you didn't manage to convince me ;) You completely ignored my filenames example. Even if I would agree with all your other points this would still be a show stopper. Regarding the use of std::string for UTF-8 strings: This is a) often absolutely unavoidable e.g. when reading from a standard stream without using operator>>() (which isn't appropriate most of the time). And b) it's useful for optimization purposes. A lot of generic string operations work just fine with UTF-8 since it was designed to be as compatible as possible. Further: "Thing is, you can't convert much to a string without going through a stringstream. So in effect the standard already does enforce a particular encoding on strings [...]" This is actually not true. A stream doesn't have a locale assigned by default. The reasons why I decided to convert to the current locale in operator<<() are a) because libstdc++-v2 (the one delivered with gcc-2.95.x) is broken and uses the current C locale for all streams. And b) even if you have proper locale support (as in libstdc++-v3) there's no portable way to say "use the current locale but replace its charset with UTF-8". Which means you kinda have to use the charset dictated by the current locale in i18nized apps. Next point: "If I've not managed to convince you, I think it would be better to remove the implicit std::string conversion completely and introduce something like string_from_utf8 and string_from_locale instead. This would at least force one to think about the issue." While forcing users to think about the issue would be nice, we just can't do that. There are several places in the gtkmm API where the compiler cannot know the encoding of a string but the programmer can. Take Glib::convert() for instance. It's using std::string as arguments because the encoding is *unknown* to the compiler, even though the programmer might know it. Now one could argue "why don't you use Glib::locale_{to,from}_utf8() instead". 
Well, because I might want to acquire a persistent Glib::IConv handle and use Glib::IConv::convert(). Even if I use it only to convert to/from UTF-8 it has to support arbitrary conversions as well. Convinced? ;) Cheers, --Daniel
Regarding the filenames: I apologise, but the reason I didn't comment on that is that I didn't quite understand the problem - I still don't. :-) Does the situation change? If you're relying on using std::strings from Gtkmm file names together with ustrings, your program is buggy, right? Since you will create invalid ustrings if G_BROKEN_FILENAMES is set? And no one ever checks that, do they? I've never heard of this problem before now. Unless there's more to it, it sounds like a separate issue that should either be fixed by adding documentation or by shielding users of Gtkmm from it by replacing std::string with ustring and automatically converting to UTF-8 if G_BROKEN_FILENAMES is set. You're somewhat right about reading from a standard (e.g. file) stream - but you wouldn't go through std::string, would you? The members return char *'s - so by default interpret these as locale-encoded, unless you explicitly say what encoding to use with e.g. a "ustring from_raw(char *)" helper. It's true that this is more work than it is now, but OTOH lazy coding still works; only if it's important to output UTF-8 is it necessary to think a little. About the generic string operations - is this a big problem? I myself have 1-2 I use regularly, and it's no big deal to templatize them if they really need to work on both std::string and ustring (this also works for all string algorithms, not just those that don't examine individual characters): template <class String> String strip(const String &s); If you don't have access to source code, there's always a "std::string raw()" method in ustring. I don't want to ban other std::string encodings, just change the default assumption of Gtkmm. I don't quite understand what you write about streams not having a locale assigned by default. Are you saying that my point is wrong from the Standard point of view, but right from the practical point of view? If so, the point, though not the wording, is still valid, isn't it? 
I'm trying to understand what you're saying about banning the implicit conversions of ustring to std::string and std::string to ustring. If I get your point, the problem is that you might change the encoding to UTF-8 with IConv::convert but still get a std::string. But then you could just pass it through a "ustring from_raw(std::string)" that didn't perform any conversions except from changing the type of the C++ object, couldn't you? I personally like the implicit locale-encoded std::strings best since it changes the current behaviour least, except for when you're interacting with the outside (as with file I/O). I have a feeling that banning implicit conversions may change quite a lot more. Have I missed any points?
> The members return char *'s - so by default interpret > these as locale-encoded, unless you explicitly say what > encoding to use with e.g. a "ustring from_raw(char *)" helper. I just realised that this isn't quite true - whether to interpret "char *" as locale-encoded is a separate issue. If such strings are interpreted as such, the usual gettext scheme will break down. If not, I think the situation gets too complex. I looked at the operator<</>> definitions again. Operator >> also implicitly converts back. So boost::lexical_cast<ustring>(something) should actually work. Given that Murray just announced complete freeze Wednesday, I think I'll just shut up then. :-) I'm not closing the bug since the issue probably needs to be documented. Too bad we can't get a properly behaving stringstream with the current design of the Standard.
About the stream default locale issue: yeah you're right, what I wrote was really ambiguous and could be interpreted as both pro and con ;) But you got right what I was trying to say: There are standards which say don't, there is a practical approach which says yes, and neither of them is really good or even just better than the other. It seems we've come to an agreement that trying to automate anything would be quite complicated, and not the perfect solution either. So I reckon we've to do it the hard way and start writing documentation ;) Let's leave this bug open till the documentation got added. --Daniel
Regarding the filename problem: You don't have to check for G_BROKEN_FILENAMES because Glib::filename_{from,to}_utf8() does that for you. Just use them and forget about it. The question is where to put information like that... Maybe right on the front page in big letters saying "READ ME FIRST OR DIE!" :)
Hehe. Regarding where to put the information, as a starter I'd vote for both the tutorial together with the information about ustring and in the reference documentation on the ustring page.
I need to add an "Internationalization and Translation" chapter to the book sometime. Maybe this would be appropriate for that chapter. Or maybe it should be added to the information about Glib::ustring. It would be nice if someone could write some text about this issue so that it could be used there.
I was once thinking about what to put in the i18n chapter. Perhaps I could restart thinking about it this weekend when I get some more free time, and see if I can get something written down. I recently installed psgml anyway. A short notice in the reference documentation on ustring is needed too, though (currently it says that using std::string instead of ustring is basically fine; it just needs to mention what happens with a std::stringstream).
I added a paragraph about streams to the Glib::ustring class documentation, including a tiny std::ostringstream usage example.
I think it's about time we close this bug. I'm attaching a patch that adds a short notice in the ustring section in "Programming with gtkmm".
Created attachment 12214: Patch that adds a short notice to the tutorial about this problem
The paragraph about "character length" is a repetition of similar text in the same section. The other paragraph talks about "outputting with <<" but surely it's only relevant to using << with the iostreams. It does not explain to me what conversions need to be done or why - really, I don't know yet. Also, I suspect that these conversions are only really necessary when using non-utf8 locales. Or maybe I'm wrong. It would be nice to tell people exactly when (at runtime) they will hit the problem that you are telling them how to solve.
I've tried refactoring the notice. Hope this is better. The reason I didn't give an example but just referred to the reference documentation is that the one given there is slightly complex:

    std::ostringstream output;
    output.imbue(std::locale("")); // use the user's locale for this str
    output << percentage << " % done";
    label->set_text(Glib::locale_to_utf8(output.str()));

I don't know why the second line is necessary. But I've just copy-yanked it all now.
Created attachment 12234: Second try on a patch to "Programming with gtkmm".
That seems better, though I still don't really understand the problem. At least I know that I should worry whenever I use Glib::ustring with a std::[i|o]stream.
I have added this text to the "Internationalization and Translation" chapter. I have also added a link to the Glib::ustring reference documentation to the Basics chapter. That reference documentation talks about iostreams too. Thanks.
Re. output.imbue(std::locale("")); // use the user's locale for this str
That line definitely is necessary in standard C++, unless you explicitly installed a global default locale via std::locale::global(std::locale("")); However, I think writing code that depends on global settings to work correctly is generally a bad idea -- the explicit imbue() will always work. And I'd consider the ability to use stream-specific locales an important feature. For instance, configuration files should be locale-independent. Again, to avoid depending on global settings use: output.imbue(std::locale::classic()); and do _not_ use Glib::locale_to_utf8(), because the classic locale is based on ASCII which is a subset of UTF-8. Unfortunately libstdc++-v2 (the one that comes with g++ 2.95.x) doesn't implement std::locale and all streams do always use the global locale installed by setlocale() (which has no effect in libstdc++-v3). Thus you need to #ifdef around the std::locale stuff for now. But that won't solve that config file problem -- you have to convert floats to strings manually to get it right, seriously. I hope I could clarify matters a bit ;)
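The configuration-file advice above might look like this in practice. It is only a sketch for a standard library with working std::locale support (so not libstdc++-v2, as noted); the function names to_config_string and from_config_string are hypothetical.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Write a number in a locale-independent form, e.g. for a config file.
std::string to_config_string(double value)
{
  std::ostringstream out;
  // Explicit imbue: do not depend on whatever global locale is installed.
  out.imbue(std::locale::classic());
  out << value;
  return out.str();  // plain ASCII, therefore also valid UTF-8
}

// Read the number back with the same locale-independent rules.
double from_config_string(const std::string& text)
{
  std::istringstream in(text);
  in.imbue(std::locale::classic());
  double value = 0.0;
  in >> value;
  return value;
}
```

Because the classic locale only produces ASCII here, the result can be assigned to a Glib::ustring directly, without Glib::locale_to_utf8().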
Although this bug was closed a long time ago, I just want to add that I've discovered the hard way that always using the encoding of the locale with std::strings is not always good enough either. When putting a floating point number into a stringstream, one may get a string with extended characters out because the decimal point is converted to a special decimal point character. This is _not_ guaranteed to be saved correctly in a std::string because the characters may not be wide enough. In fact I have experienced a crash with libstdc++ because it clipped a multibyte UTF-8 decimal point character to a single byte instead of two bytes. This happened _even though_ the encoding of the locale was UTF-8 so that one would have thought that a valid UTF-8 std::string could have been produced (for the record, the actual crash appeared when the program afterwards tried to convert the malformed UTF-8 std::string to a ustring). Instead I suggest converting to and from wchar* with Glib::convert and using wstringstreams. This has worked flawlessly for me (in my string composition library), except for occasional bugs in the implementation of specific locales in the standard library.
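A likely reason for the clipping described above is that std::numpunct<char>::decimal_point() returns a single char, so a narrow stream has no room for a multibyte separator, whereas std::numpunct<wchar_t> returns a whole wide character. The sketch below shows only the wide-stream half of the suggestion, in standard C++; the facet WideDecimalPoint (using U+066B, the Arabic decimal separator) and the function format_wide are hypothetical, and the conversion between the wide string and UTF-8 is left out.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: a decimal separator that does not
// fit into a single byte in UTF-8 (U+066B ARABIC DECIMAL SEPARATOR).
struct WideDecimalPoint : std::numpunct<wchar_t>
{
  wchar_t do_decimal_point() const override { return L'\u066B'; }
};

// Format a floating point number through a wide stream.
std::wstring format_wide(double value)
{
  std::wostringstream out;
  // The locale object takes ownership of the facet.
  out.imbue(std::locale(std::locale::classic(), new WideDecimalPoint));
  out << value;
  return out.str();
}
```

Here format_wide(3.5) produces three wide characters: '3', the U+066B separator, and '5'. Nothing is truncated because each element of the string is a full wchar_t.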
This really looks like discussion for the mailing list. Here, you just get to talk to me, instead of lots more informed people.
> Although this bug is closed long time ago, I just want to add I've discovered
> the hard way that always using the encoding of the locale with std::strings is
> not always good enough either.
I guess that makes sense. If your locale is UTF-8 then you shouldn't be using std::strings with API that gives/takes Glib::ustrings.
My fault, I'm not continuing the discussion, I just meant to clarify things for people who search the bug database since someone recently asked me about the conclusion of this bug. I don't know how he dug up this bug, but if he could, others could too, I reckoned. :-)