Bug 583992 – Provide conversion between wide and narrow unicode strings in glibmm

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 583992 - Provide conversion between wide and narrow unicode strings in glibmm


Summary:	Provide conversion between wide and narrow unicode strings in glibmm


Status:	RESOLVED OBSOLETE

Product:	glibmm
Classification:	Bindings
Component:	strings
Version:	2.20.x
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkmm-forge
QA Contact:	gtkmm-forge

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2009-05-27 12:49 UTC by Chris Vine
Modified:	2018-05-22 12:09 UTC

See Also:
GNOME target:	---
GNOME version:	2.25/2.26

Attachments
Patch for unicode.hg and unicode.ccg (6.78 KB, patch) 2009-05-27 12:50 UTC, Chris Vine	needs-work	Details \| Review

Description Chris Vine 2009-05-27 12:49:12 UTC

A steady trickle of posts on the gtkmm mailing lists have related to converting between utf-8 encodings held in Glib::ustring objects and utf-16 or utf-32 encodings held in std::wstring objects.  This seems to be a particular issue with MS Windows users who want to interface gtkmm with windows related modules providing their unicode strings by std::wstring objects.  There seems to be a demand for something for this in glibmm.  For Unix-like system users, it will also be useful to those using wide string unicode libraries but who want to save text to file in a portable, non-endian and relatively space efficient form (utf-8).

A patch is attached which does this, prepared against glibmm-2.20.0.  It follows the format of the conversion functions in convert.h and convert.cc, although it appears in unicode.h and unicode.cc because the utf-8<=>utf-16/ucs-4 conversion functions are in gunicode.h in glib.  It is also logical because both target and source are in unicode format, and the conversions between them are lossless.

unicode.h is included in numerous other glibmm header files and there are some header circularities at work which mean that if glibmm/error.h or glibmm/ustring is included in unicode.h, the compilation of a number of other files will fail.  Instead the patch to unicode.h only provides a forward declaration of Glib::Error and Glib::ustring, and I have #include(d) glibmmconfig.h in it so as to pick up the value of GLIBMM_EXCEPTIONS_ENABLED.  That is fine because types which only form (or form part of) return types and function paramaters, require declaration but not definition in the header file.

An alternative is to place the two new functions in convert.h/convert.cc, where this circularity problem does not arise.

The two new functions are called Glib::narrow_to_wide() and Glib::wide_to_narrow() as this seems to me to be most descriptive name, but they could be called something different.  They are in the unicode group but not the unicode namespace.  I have provided extensive comment documentation for doxygen on their use.

I have tested this on Latin-1, Hangul and Thai characters using a system with wchar_t size of 4.  I do not have a windows machine available but it should work: the code in both cases will be the same - only a different unicode conversion function is required as between utf-16 and ucs-4.

Other information:
Patch to be attached.

Comment 1 Chris Vine 2009-05-27 12:50:33 UTC

Created attachment 135434 [details] [review]
Patch for unicode.hg and unicode.ccg

Comment 2 Daniel Elstner 2009-05-27 13:52:43 UTC

I don't like depending on the if () expression to be resolved at compile time, since it might trigger warnings about unreachable code.  Also, it isn't necessary since the preprocessor is sufficient.  For inspiration, look at how the wide stream conversion is done for ustring::format() in ustring.cc.  The necessary configure magic is already in place, you can just make use of the same macros.  Also, note that there may be platforms where wchar_t is neither UCS-4 nor UTF-16.  The implementation in ustring.cc falls back to iconv() in that case.

Also note that conversion from wide strings to ustrings already works just fine, by using ustring::format():

    const Glib::ustring s = Glib::ustring::format(L"foo");

Comment 3 Daniel Elstner 2009-05-27 14:00:49 UTC

Sorry, I missed this comment in your code:

  // SIZEOF_WCHAR_T is not defined in config.h for all builds

That's something that should be fixed rather than worked around.

Comment 4 Chris Vine 2009-05-27 17:35:19 UTC

The trouble with using Glib::ustring::format() is that it only goes one way. Someone interesting in interfacing with modules passing unicode in wide strings is likely to be interested in going both ways.

On SIZEOF_WCHAR_T, ustring.cc itself has a conditional inclusion of config.h and my recollection is that this is something to do with MSVC builds. I do not know enough about the windows build system to deal with that case, and this form of conditional compilation on compile time constants is quite a common idiom. I do not know of a compiler which does warn about this particular usage (gcc doesn't).

On codesets, the functions I have proposed are clearly documented as only dealing with unicode (ie utf-8<=>utf-16/utf-32 codesets), and if anything else is tried a conversion exception is thrown. That seems to me to be acceptable, and iconv() does not cover every codeset known to the world anyway so there is always going to be the risk of an excluded case. Sometimes the best is the enemy of the good (particularly if the best cannot reasonably be achieved), and it seems to me that the approach should be to document clearly the cases it will cover.

As it happens, if you cover unicode you will get 99.9% of the cases (a percentage which will also increase over time). Possibly if you really think this is an issue the functions could be put into the unicode namespace as well as the unicode module to make the point even clearer.[1]

On a linked issue, I have noticed one other point which is that to catch Glib::ConvertError you need to include glibmm/convert.h, and a user might expect unicode.h to do that for her. However, including it directly in unicode.h causes the cyclical header problems I have referred to.

[1] Yes, the functions don't cover UTF-7 but no-one uses that and it is irrelevant to std::wstring and Glib::ustring.

Comment 5 Daniel Elstner 2009-05-27 19:43:20 UTC

(In reply to comment #4)
> The trouble with using Glib::ustring::format() is that it only goes one way.

Yes, I know.  I just mentioned it because I didn't know if you were aware of it.  Of course conversion into the other direction is important as well.

> On SIZEOF_WCHAR_T, ustring.cc itself has a conditional inclusion of config.h
> and my recollection is that this is something to do with MSVC builds.

Ah, right, it doesn't use glibmmconfig.h but config.h (which is not installed anywhere), for which it is customary to make the include conditional.  But there is no reason to not add config.h to the MSVC builds if one is needed, or alternatively, just define SIZEOF_WCHAR_T using command line arguments.  (We know it is always 2 for MSVC builds).

> I do not know of a compiler which does warn about this particular usage
> (gcc doesn't).

It depends on the warning settings, I guess.  I definitely have seen GCC warn on unreachable code before.

> On codesets, the functions I have proposed are clearly documented as only
> dealing with unicode (ie utf-8<=>utf-16/utf-32 codesets), and if anything else
> is tried a conversion exception is thrown.  That seems to me to be acceptable,
> and iconv() does not cover every codeset known to the world anyway so there is
> always going to be the risk of an excluded case.

This is not really about codesets but about std::wstring.  If we provide a function that operates on std::wstring or returns std::wstring, we should make a reasonable attempt to actually do just that.  We really shouldn't have platform-specific API unless absolutely necessary.

> As it happens, if you cover unicode you will get 99.9% of the cases (a
> percentage which will also increase over time).  Possibly if you really think
> this is an issue the functions could be put into the unicode namespace as well
> as the unicode module to make the point even clearer.

Have you seen the implementation in ustring.cc?  It just falls back to iconv and uses the generic charset alias WCHAR_T.  Everything else is then up to iconv.  It's quite likely that there are iconv implementations which don't support this alias, but for GNU iconv it seems to work.  As an attempt to cover the generic case, I considered this to be good enough.  If it happens to not work on some odd platform, we still handle it as a special case as soon as we know about it.

> On a linked issue, I have noticed one other point which is that to catch
> Glib::ConvertError you need to include glibmm/convert.h, and a user might
> expect unicode.h to do that for her.  However, including it directly in
> unicode.h causes the cyclical header problems I have referred to.

I haven't looked into the dependency problem at all yet.  Let's get the behavior sorted first.

Comment 6 Chris Vine 2009-05-27 21:47:00 UTC

gcc will warn about unreachable code on out of range integer comparisons (such as comparing an unsigned integer with -1) but I have never seen it complain about a comparison on a compile-time constant.  However, I suppose you could assume that if config.h is absent that the size of wchar_t is 2, but as I say why guess when everything you need is there with sizeof?  I suggest using sizeof as a fall-back in cases where config.h is absent or SIZEOF_WHAR_T is defined as 0 (on looking through the configure macros it seemed to me that that was a possible outcome).  In the unlikely event of a compiler choosing to complain, the build system can be changed to provide config.h.

I had seen what you had done in ustring.cc.  On reflection I agree with what you say about catering for non-Unicode wide character sets, since it can be done.  It's use in non-unicode cases however will be next to 0.  That would mean putting it in convert.h/convert.cc, so the issue on unicode.h dependencies ceases to be relevant.

Comment 7 Chris Vine 2009-05-27 21:58:52 UTC

Actually, if g_iconv() is used as a fall-back, any test for the size of wchar_t for conditional compilation can probably be made unnecessary.  You will end up dealing with wchar_t, whatever it may be.  I suspect sizeof can just be used a multiplier to calculate the number of bytes to convert.

Comment 8 Daniel Elstner 2009-05-28 00:33:39 UTC

(In reply to comment #6)
> I had seen what you had done in ustring.cc.  On reflection I agree with what
> you say about catering for non-Unicode wide character sets, since it can be
> done.  It's use in non-unicode cases however will be next to 0.

There is still the option of wchar_t having exactly the same representation as char, which is allowed by the standard and could be reasonable e.g. on an embedded system.

> That would mean putting it in convert.h/convert.cc, so the issue on unicode.h
> dependencies ceases to be relevant.

Yes, I was going to suggest that.  Another option would be to add a constructor to Glib::ustring for the conversion from std::wstring, and a wide() method which returns an std::wstring.  I'm not sure though, what do you think?

(In reply to comment #7)
> Actually, if g_iconv() is used as a fall-back, any test for the size of
> wchar_t for conditional compilation can probably be made unnecessary.  You
> will end up dealing with wchar_t, whatever it may be.  I suspect sizeof can
> just be used a multiplier to calculate the number of bytes to convert.

Yes, that's true.  I remember now that I added the sizeof check only as an additional safety measure, mainly for the WIN32 case. I was a bit worried that some compilers might actually have an option to change the size of wchar_t.  Well, the sizeof check isn't really airtight either.  I guess can live without it.

Sigh.  Curse you, MSVC++!  The sizeof check of autoconf is really neat in that it is able to get the value without actually executing any code.  Thus it works even when cross-compiling for the mingw32 target.  Ah, well, I guess Armin will appreciate the reduced maintenance burden for his MSVC builds...

Comment 9 Chris Vine 2009-05-28 09:02:04 UTC

Gosh, systems with the same size of char and wchar_t but with different locale encodings.  That's getting esoteric.

When I said I had seen your format function in ustring.cc I had seen your stream formatting.  On rereading this I now think you were talking about something different.  Is this something in git?  The new functions cannot be called wide_to_narrow() and narrow_to_wide() any more, because they will not be concerned only with converting between wide and narrow unicode representations.  I suggest either your approach or new functions in convert.h called wide_to_utf8() and wide_from_utf8().  I think I prefer the second approach because I suspect it is where people will look and it is closer to the existing narrow encoding conversion functions - they comprise locale_to/from_utf8() rather than being implemented through a conversion constructor taking a std::string argument and a locale() method returning a std::string object.  However, to be honest I should go where your instincts take you.  Either seem to me to be fine.

Comment 10 Daniel Elstner 2009-05-28 22:57:38 UTC

(In reply to comment #9)
> Gosh, systems with the same size of char and wchar_t but with different locale
> encodings.  That's getting esoteric.

That would indeed be esoteric.  I haven't actually considered that possibility.  I'd expect the standard to require that char is always a subset of wchar_t, so that c == narrow(widen(c)).  At least I hope it does. :-)  That wouldn't rule out ASCII char and ISO-8859-1 wchar_t, though.

But since we can have a catch-all fallback, all is fine.  What's important is that the API works.  __STDC_ISO_10646__ and WIN32 are special-cased mainly to improve performance, and because we *know* what a wchar_t is when either of these is defined.  Well, and also because the iconv WCHAR_T alias is probably not something one should rely on for a major target platform. ;-)

> When I said I had seen your format function in ustring.cc I had seen your
> stream formatting.  On rereading this I now think you were talking about
> something different.  Is this something in git?

No, it's indeed the wide stream formatting I was talking about:
http://git.gnome.org/cgit/glibmm/tree/glib/glibmm/ustring.cc#n1366

Hm, actually it's possible already to use an std::wstringstream to convert between ustring and wstring by using these operators. ;-)  Of course that doesn't mean that we shouldn't add a more convenient and straightforward interface.  After all std::wstring is part of the STL and not all that exotic.

> The new functions cannot be
> called wide_to_narrow() and narrow_to_wide() any more, because they will not
> be concerned only with converting between wide and narrow unicode
> representations.

Actually, the term "narrow" wasn't appropriate for UTF-8 anyway.  The character set doesn't change and is still wide.  A UTF-8 byte is not a character at all.  Back then, when these terms and the corresponding API was standardized, no-one seems to have thought of the possibility that using multi-byte encodings as internal representation would some day become popular and even the norm.  They haven't thought about i18n hard enough to realize that random access indexing of code points is not all that useful in reality.

> I suggest either your approach or new functions in convert.h called
> wide_to_utf8() and wide_from_utf8().  I think I prefer the second approach
> because I suspect it is where people will look and it is closer to the
> existing narrow encoding conversion functions

Funny, I'd have thought it's the other way around and people look at the ustring class first. :-)

Anyway, I think you are right that the functionality is closer to what locale_to_utf8() and locale_from_utf8() do.  Most importantly, these conversions can fail.  Well, actually they cannot fail if the input is valid and wchar_t holds UCS-4 or UTF-16, but unfortunately we cannot assume that Unicode is being used.

Comment 11 Chris Vine 2009-05-29 00:18:18 UTC

My point about char and wchar_t being of the same size but containing bytes with different character encodings was only that if they were the same encodings and the same size then locale_to/from_utf8() would do the trick for either (with a reinterpret_cast for wchar_t since in C++ wchar_t is not a typedef).

When using "narrow" and "wide", I was not using them in the ctype facet sense, which indeed assumes fixed size characters.  The codecvt facet for streams does of course cater for multibyte encodings.  On your reading, "wide" is just as unacceptable as "narrow", as Unicode is just a series of values between 0 and 0x10fff.  How they are encoded into a particular series of memory locations is a separate matter.  Where the representation is in units of bytes (utf-8) having the same size as char I think "narrow" is a reasonable description.  Likewise where the representation is in units of wchar_t, I think "wide" is a reasonable description.  The natural size of Unicode, if you want to view it that way, is 21 bits, which on most (all?) platforms is the size of neither char nor wchar_t.  But let's not argue about semantics.

Are you going to do this or were you expecting me to have a go at it?

Comment 12 Murray Cumming 2010-03-26 09:53:13 UTC

I'm not going to read through all this. Can someone summarize the status. You might with a test case that the change would allow to work.

Comment 13 Murray Cumming 2010-04-07 13:19:57 UTC

Please?

Comment 14 Chris Vine 2010-04-08 09:20:27 UTC

The short version is that I provided a patch in the Unicode module which converted between UTF-8 and UCS4/UTF-16 (depending on sizeof(wchar_t)) and vice versa.  Daniel thought it lacked ambition and should cover a generic (not necessarily UCS4/UTF-16) wide character to UTF-8 conversion and back, which it looks as if he persuaded me was right.

Then there was a discussion about how to deal with the fact that MSVC builds do not include config.h, so not allowing use of the SIZEOF_WCHAR_T autoconf macro.  My suggestion was to use g_iconv() where config.h doesn't exist, which avoids using an if block on the compile time constant sizeof (as opposed to #if block which cannot be used as sizeof is not a manifest constant).  The conversation then ended on questioning who was going to implement this revised proposal.

There also looked to be a problem with cyclical headers.

It is a bit like looking at code which you wrote 2 years ago.  You know you understood it at the time.

Comment 15 Daniel Elstner 2010-04-08 17:21:01 UTC

Sorry for taking so long to follow up on this.

The big picture is: It's not a change, but a feature request. Chris and I converged on what functionality should be offered, and all that is left now is the implementation. Which shouldn't be too difficult as the interesting bits already exist in the wide streaming operators for Glib::ustring.

Comment 16 Clemens Lang 2018-04-16 15:18:24 UTC

> Have you seen the implementation in ustring.cc?  It just falls back to iconv and uses the generic charset alias WCHAR_T.  Everything else is then up to iconv.  It's quite likely that there are iconv implementations which don't support this alias, but for GNU iconv it seems to work.  As an attempt to cover the generic case, I considered this to be good enough.  If it happens to not work on some odd platform, we still handle it as a special case as soon as we know about it.

I'd like to revive this ticket and confirm that this assumption was correct. I am currently seeing a crash on macOS 10.12 when attempting to start inkscape 0.92.2 built against glibmm 2.56.0 due to what I think is a character conversion issue.

See https://trac.macports.org/ticket/56214 for the details and https://trac.macports.org/ticket/56214#comment:6 for a minimal example. The fallback to iconv with WCHAR_T that's currently in the glibmm code does not work on macOS, presumably because libc++ uses UTF32LE for wchar_t, while iconv assumes it's something else.

Should I open a separate ticket to track this crash or is it appropriate in this ticket? Should I ask the developers of iconv to fix (or reject) WCHAR_T on macOS?

Comment 17 GNOME Infrastructure Team 2018-05-22 12:09:34 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glibmm/issues/9.