Bug 747772 – Having hardcoded utf8 strings in the source code does not play cool with msvc

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 747772 - Having hardcoded utf8 strings in the source code does not play cool with msvc


Summary:	Having hardcoded utf8 strings in the source code does not play cool with msvc


Status:	RESOLVED NOTGNOME

Product:	pango
Classification:	Platform
Component:	win32
Version:	1.36.x
Hardware:	Other Windows

Importance:	Normal major
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2015-04-13 11:23 UTC by Ignacio Casal Quinteiro (nacho)
Modified:	2016-03-15 03:34 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Ignacio Casal Quinteiro (nacho) 2015-04-13 11:23:35 UTC

pango provides some sample strings in https://git.gnome.org/browse/pango/tree/pango/pango-language-sample-table.h

These strings do not play well with msvc since it end ups failing the utf-8 validation with them. It would probably be better to handle this with an ini file and load those strings at runtime.

The way to reproduce this issue:
 * build gtk3 i.e using https://github.com/nice-software/gtk-win32/
 * set the system language format to i.e spanish
 * launch gtk3-demo
 * Several warning of the type of "(gtk3-demo.exe:2492): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()" will appear

Comment 1 Ignacio Casal Quinteiro (nacho) 2015-04-13 11:29:07 UTC

Another possibility instead of using an ini file would be to preprocess the file and to convert all the strings with non-ascii chars into escape sequences, which would probably make more sense since also the strings are already there translated and they are suppose to not change at run-time

Comment 2 Ignacio Casal Quinteiro (nacho) 2015-04-13 14:08:42 UTC

Downstream I fixed this problem by escaping the utf-8 chars. See https://github.com/nice-software/gtk-win32/commit/560ce11285ccd57a31a3a1594b2e40f666792ede

Comment 3 Fan, Chun-wei 2015-04-13 14:41:07 UTC

Hello Nacho,

I couldn't reproduce this on my Visual Studio 2008 and 2013 builds.

I think I know the source of the problem after seeing build.ps1, because I see that the script is trying to add BOM to the file to avoid C4819, which actually causes bad binaries to be built (as it bit me before).  I think the way to avoid it would be to set the files that had BOM added to them to be just as UTF8 (*without* BOM) and set the correct "Language for non-Unicode Programs"[1], such as English (United States) or English (United Kingdom), reboot when prompted, and redo the build.

This C4819 issue hits quite a number of projects, including QT, libreoffice, KDE and Mozilla, for references.

Can you try to do these and let me know how things went for you?  I am bending towards closing this as resolved-notgnome as a result.

[1]: See the screen shots for Windows 7 (and 8.x) at https://www.coscom.co.jp/learnjapanese801/japanesefont/nonunicode_win7.html

Hope this clears things up.

With blessings, thank you!

Comment 4 Ignacio Casal Quinteiro (nacho) 2015-04-13 14:47:22 UTC

Fan, this makes more sense. I was actually wonder why those boms were added since when researching about this I saw that they should not be added. I will test and get back to you.

Comment 5 Ignacio Casal Quinteiro (nacho) 2015-04-13 14:58:21 UTC

It works like a charm now after removing the BOMs. Thanks Fan.

Comment 6 Arnav Singh 2015-04-13 16:32:11 UTC

Fan: nacho brought this issue to my attention in HexChat's IRC channel today.

I was aware that you had said adding a BOM would not work in your MSVCCompilationOfGTKStack article, but since in practice I saw HexChat had no problems with rendering non-English text I ignored it. However I do see the warnings nacho is talking about when running gtk-demo.exe so I can take a look into fixing it.

Changing system locale to en-US is certainly an option, but I consider it a desperate last resort.

From some googling I found this: http://stackoverflow.com/a/20005850/545475 Have you tried this? (It is supported on VS 2010 and 2013 but not 2012, apparently.) I'm going to try this later tonight, unless you've already tried it and it doesn't work.

Comment 7 Arnav Singh 2015-04-13 19:02:49 UTC

I've confirmed that adding the pragma from the linked SO answer works with VS2013.

UTF8 string literals are embedded in utf8 in the output binary: http://i.imgur.com/IpK4lVw.png

gtk-demo.exe reports no errors, and localized text on widgets is still rendered correctly: http://i.imgur.com/fhcK3fc.png

Comment 8 Behdad Esfahbod 2015-04-13 19:11:49 UTC

Should we add the pragma to pango itself?

Comment 9 Arnav Singh 2015-04-13 19:18:54 UTC

Up to you.

Note that this is needed for more than pango. These are the files in the whole gtk stack that the script adds the BOM and pragma to. (I wasn't very scientific about this - I just added it to each file one-by-one if cl.exe threw a C4819 for it):

glib:
gio\gdbusaddress.c
gio\gfile.c
glib\gmacros.h
glib\gmain.c

gtk:
gdk\gdkkeyuni.c

pango:
pango\break.c
pango\pango-language-sample-table.h

Comment 10 Behdad Esfahbod 2015-04-13 19:28:45 UTC

Pushed this to Pango:

https://git.gnome.org/browse/pango/commit/?id=c8b1d5bea67155ae51cb9a145d43df79a52faeb4

and something similar to HarfBuzz.  Feel free to file bugs against glib and gtk.  I don't think asking this is unreasonable for any of those files, except for gmacros.h possibly.

Comment 11 Behdad Esfahbod 2015-04-13 19:30:38 UTC

I fixed gmacros.h as well.

Comment 12 Behdad Esfahbod 2015-04-13 19:35:01 UTC

I'm actually fixing the rest of them.  No need to file bug.

Comment 13 Behdad Esfahbod 2015-04-13 19:38:44 UTC

I checked the other ones you listed (glib and gtk), they all have UTF-8 only in comments.  Is that really a problem at all?

Comment 14 Ignacio Casal Quinteiro (nacho) 2015-04-13 19:59:16 UTC

(In reply to Behdad Esfahbod from comment #13)
> I checked the other ones you listed (glib and gtk), they all have UTF-8 only
> in comments.  Is that really a problem at all?

I guess we should just change the comments to not have utf-8 stuff and make msvc happy? Also maybe let's change the bug to fixed?

Comment 15 Behdad Esfahbod 2015-04-13 20:04:17 UTC

(In reply to Ignacio Casal Quinteiro (nacho) from comment #14)
> (In reply to Behdad Esfahbod from comment #13)
> > I checked the other ones you listed (glib and gtk), they all have UTF-8 only
> > in comments.  Is that really a problem at all?
> 
> I guess we should just change the comments to not have utf-8 stuff and make
> msvc happy?

That's what I started doing, but then was asked whether this actually *is* a problem.  Does the compiler warn or err?

> Also maybe let's change the bug to fixed?

Comment 16 Arnav Singh 2015-04-13 22:50:08 UTC

I think it wasn't clear - there are two different problems.

1. The presence of characters anywhere in the file that are not in the system code page causes cl.exe to emit C4819. Since these characters are utf-8 characters, this can be fixed by adding a UTF-8 BOM to the file. C4819 is a warning by default, but in those three projects it is considered an error.

2. The presence of the UTF-8 BOM only lets the compiler *parse* the file without warnings. It does not affect the emit of these characters in the final executable. So if these characters are present inside string literals, these constants will be mangled in the binary. There is no warning or error for this. This is fixed by adding the pragma.

Comment 17 Behdad Esfahbod 2015-04-13 23:56:56 UTC

Ok.  The glib ones are easy enough to fix.  The GDK one, I'm fine with removing the accents from the comments.  In both cases Matthias should ok it first.

Comment 18 Fan, Chun-wei 2015-04-14 04:11:55 UTC

Hi,

Apparently quite a bit of things happened overnight (in my part of the world).

My take on adding the pragmas would be as follows (I know it's a bit long):
-If we are on Visual Studio 2008 or 2012, I take this pragma as
unsupported. There is a hotfix KB980263 for 2008, but it's not
readily available for download from Microsoft (it's an obtain-by-
request thing).
-Note that if there are tools or scripts that process the source
files, the addition of BOM *may* not be liked by such tools/scripts.
-This pragma actually broke some glib tests for me, namely hostutils,
utf8-pointer and search (and possibly some that I have missed). My
original intention upon seeing what happened in this bug was to use
the pragma on 2010 and 2013 (and later) in msvc_recommended_pragmas.h,
but running the glib tests reveal that this pragma is not the optimal
solution. This happens whether or not the source file is in UTF8
(no BOM or with BOM) and when the pragma is used, although as Arnav
pointed out the GTK+ widgets displayed the texts properly. This means
the built binaries are still considered broken, in my point of view.
-The source files where we get C4819 varies on the language version
of Windows (as Arnav mentioned), so it's going to be hard to get the
right files to add the BOM programatically that works for all people,
as HexChat did in the build scripts.
Those C4819 warnings can come with error C2001 (newline in constant),
as the compiler would not parse the sources properly during a C4819
scenario, before we add BOM to it. Such C4819 warnings (and C2001
errors) will occur unless we add BOM (which is bad due to the previous
point) or we set the "Language for non-Unicode Programs" properly to
English (United States).

As a result, I would recommend against using this pragma for Visual Studio builds and build the stack after setting "Language for non-Unicode Programs" correctly. Behdad, probably you want to revert changes where you had using the pragmas for Visual Studio for Pango, Harfbuzz and anything else that came across you for this.

With blessings, thank you!

Comment 19 Fan, Chun-wei 2015-04-14 04:55:54 UTC

Hi,

"as HexChat did" should read "as HexChat tried to do"...

Sorry for the noise.

With blessings.

Comment 20 Arnav Singh 2015-04-14 05:13:09 UTC

Correct. HC's build script requires both the BOM and the pragma, so even with the commits to add the pragma, it'll continue to have to add the BOM.

Regarding the failing tests, perhaps it's because the tests are wrong? I haven't checked.

Either way, I'm totally fine with upstream not adding the pragma and accomodating non-en-US locale builds. For HC, since it seems to not be a problem I will stick with adding the BOM and pragma for now, (and investigate the failing tests to see if they're a genuine problem).

Comment 21 Behdad Esfahbod 2015-04-14 06:03:20 UTC

(In reply to Fan, Chun-wei from comment #18)

> As a result, I would recommend against using this pragma for Visual Studio
> builds and build the stack after setting "Language for non-Unicode Programs"
> correctly.  Behdad, probably you want to revert changes where you had using
> the pragmas for Visual Studio for Pango, Harfbuzz and anything else that
> came across you for this.

I'm not following.  What's the problem with keeping the pragma?  I don't understand what goes wrong.  I can, of course, revert, but want to understand.

Comment 22 Fan, Chun-wei 2015-04-14 07:22:15 UTC

Hello Behdad,

(In reply to Behdad Esfahbod from comment #21)
> I'm not following.  What's the problem with keeping the pragma?  I don't
> understand what goes wrong.  I can, of course, revert, but want to
> understand.

For example, with the pragma (when building the code under English-US and no BOM is added to the source files), the hostutils (under glib/tests) test would fail with the following:
GLib:ERROR:hostutils.c:107:test_to_ascii: assertion failed (non_round_trip_names
[i].ascii_name == ascii): ("xn--5ca.idn.icann.org" == "xn--iba3im31h.idn.icann.o
rg")

Which corresponded to the test line:
{ "Å.idn.icann.org", "xn--5ca.idn.icann.org", TRUE, TRUE }, (line 64 of hostutils.c)

The test also fails when adding BOM to the sources and using the pragma.

Which seems to me that the utf-8 character was not correctly handled by the pragma in this case.

But without the pragma during the build, when building under English-US, the test will pass.

Similar things goes for the search-utils and utf8-pointer test programs.

With blessings.

Comment 23 Arnav Singh 2015-04-14 09:20:23 UTC

The test passes for me with BOM and pragma on all the files I listed in the previous post. I'm on US English Windows 7 set to Japanese locale.

Looking at the binary, the constant is correctly present http://i.imgur.com/2yjupqF.png (0xe2 0x84 0xab is the utf-8 encoding for Å)

When debugging in the VS debugger, it gets displayed as "â„«.idn.icann.org" by the inspector / watch window, but that's just because they display each character individually instead of as utf-8. The actual call to g_hostname_to_ascii correctly returns "xn--5ca.idn.icann.org".

"xn--iba3im31h.idn.icann.org" would be the punycode for "â„«.idn.icann.org" Can you check how it's encoded in your binary?

Comment 24 Arnav Singh 2015-04-14 09:26:03 UTC

(To clarify, "xn--iba3im31h.idn.icann.org" would be the punycode for "â„«.idn.icann.org" if each character of the latter were *individually* encoded as a separate utf-8 character, and not the sequence 0xe2 0x84 0xab that corresponds to the single utf-8 character Å.)

Comment 25 Fan, Chun-wei 2015-04-14 14:56:32 UTC

Hi,

I think all in all, this is what I'd like to say about this:

-I am not all that comfortable with this pragma, as:
--It is only supported on Visual Studio 2010 and 2013,
and 2008 SP1, if one can obtain the hotfix from
Microsoft under KB980263, which is something that needs
to be requested directly from Microsoft, not as a ready-to
-download item. Note that 2012 does not support this, so the
only known reliable way on 2008 and 2012 (and possibly 2015 and
later) is to set "Language for non-Unicode Programs" properly
prior to building.
--This is undocumented, so is subject to change (as we seen
in the 2012 case)

-It can be hard to maintain the list files where we want to add BOM
to, as C4819 pops up depending on the language version and selected
locale of the system.
For instance, I am running a Chinese version of Windows 7/8.1, and
even disregarding C4819 in my builds does not give me a successful
build because at times as I would get the C2001 "newline in constant"
error along with C4819, and there are a number of gtk+-3.x sources that
are not listed in HexChat's build scripts that I need to add BOM
to in order to have the build successfully continue.

-Assuming what Arnav says is true in comments 23 and 24, not all
people will know that. The test programs are not likely to change
due to the use of the pragma, as the expected strings are for
all standard builds (For Windows, that means MinGW binaries and MSVC
binaries, provided that "Language for non-Unicode Programs" is set
correctly. The failed tests here would not be a good sign for people,
given that they pass before using the pragma.

-At some point we may be using a tool or script to process the sources
or headers to generate items, and adding BOM could interfere with the
process. Things like this have hit me when I was trying to build
KDE items previously, before knowing my remedy to the C4819 issue.

-I think it is still easier for one to have "Language for
non-Unicode Programs" set properly prior to building, by all
means.

---

If, we are still going by the pragma route, I think the best way to do it would be to add that into GLib's msvc_recommended_pragmas.h instead of being in the individual sources/headers of the GNOME (or GNOME-component-using) parts of the GTK+ stack, but don't turn it on by default. This is done as the Visual Studio projects of the GNOME components all force-include this header file. It can be activated with both of these conditions being true:
-The compiler supports the pragma (i.e. 2010 and 2013 for now)
-A specific preprocessor macro is defined in the projects (or property
sheets), with possibly another macro defined for 2008 if the user
is certain that he has the hotfix for KB980263.

This means that we still treat C4819 as an error by default, unless the 2 aforementioned conditions are met.

So, this is another reason why I think we don't want to do this unconditionally for Visual Studio builds directly in pango-language-sample-table.h, for instance.

With blessings.

Comment 26 Arnav Singh 2015-04-14 15:59:11 UTC

I agree. As I said in #20, HC would need both the BOM and the pragma, so upstream just adding the pragma is not helpful anyway. It's okay for upstream to officially require en-US system locale to compile the sources. HC will continue to add BOM and pragma to files that I discover cannot be compiled in my locale and support building on non-en-US locales, since I do not see any problems with it for HC's use.

Comment 27 Behdad Esfahbod 2015-04-14 20:33:40 UTC

Thanks.  Reverted pragmas.

Comment 28 Fan, Chun-wei 2016-03-15 03:34:28 UTC

Hi,

In case someone might be interested in this piece of info... Though I think Visual Studio should have supported this long time ago.

https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

With blessings and cheers!