GNOME Bugzilla – Bug 747772
Having hardcoded utf8 strings in the source code does not play cool with msvc
Last modified: 2016-03-15 03:34:28 UTC
pango provides some sample strings in https://git.gnome.org/browse/pango/tree/pango/pango-language-sample-table.h These strings do not play well with msvc since it end ups failing the utf-8 validation with them. It would probably be better to handle this with an ini file and load those strings at runtime. The way to reproduce this issue: * build gtk3 i.e using https://github.com/nice-software/gtk-win32/ * set the system language format to i.e spanish * launch gtk3-demo * Several warning of the type of "(gtk3-demo.exe:2492): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()" will appear
Another possibility instead of using an ini file would be to preprocess the file and to convert all the strings with non-ascii chars into escape sequences, which would probably make more sense since also the strings are already there translated and they are suppose to not change at run-time
Downstream I fixed this problem by escaping the utf-8 chars. See https://github.com/nice-software/gtk-win32/commit/560ce11285ccd57a31a3a1594b2e40f666792ede
Hello Nacho, I couldn't reproduce this on my Visual Studio 2008 and 2013 builds. I think I know the source of the problem after seeing build.ps1, because I see that the script is trying to add BOM to the file to avoid C4819, which actually causes bad binaries to be built (as it bit me before). I think the way to avoid it would be to set the files that had BOM added to them to be just as UTF8 (*without* BOM) and set the correct "Language for non-Unicode Programs"[1], such as English (United States) or English (United Kingdom), reboot when prompted, and redo the build. This C4819 issue hits quite a number of projects, including QT, libreoffice, KDE and Mozilla, for references. Can you try to do these and let me know how things went for you? I am bending towards closing this as resolved-notgnome as a result. [1]: See the screen shots for Windows 7 (and 8.x) at https://www.coscom.co.jp/learnjapanese801/japanesefont/nonunicode_win7.html Hope this clears things up. With blessings, thank you!
Fan, this makes more sense. I was actually wonder why those boms were added since when researching about this I saw that they should not be added. I will test and get back to you.
It works like a charm now after removing the BOMs. Thanks Fan.
Fan: nacho brought this issue to my attention in HexChat's IRC channel today. I was aware that you had said adding a BOM would not work in your MSVCCompilationOfGTKStack article, but since in practice I saw HexChat had no problems with rendering non-English text I ignored it. However I do see the warnings nacho is talking about when running gtk-demo.exe so I can take a look into fixing it. Changing system locale to en-US is certainly an option, but I consider it a desperate last resort. From some googling I found this: http://stackoverflow.com/a/20005850/545475 Have you tried this? (It is supported on VS 2010 and 2013 but not 2012, apparently.) I'm going to try this later tonight, unless you've already tried it and it doesn't work.
I've confirmed that adding the pragma from the linked SO answer works with VS2013. UTF8 string literals are embedded in utf8 in the output binary: http://i.imgur.com/IpK4lVw.png gtk-demo.exe reports no errors, and localized text on widgets is still rendered correctly: http://i.imgur.com/fhcK3fc.png
Should we add the pragma to pango itself?
Up to you. Note that this is needed for more than pango. These are the files in the whole gtk stack that the script adds the BOM and pragma to. (I wasn't very scientific about this - I just added it to each file one-by-one if cl.exe threw a C4819 for it): glib: gio\gdbusaddress.c gio\gfile.c glib\gmacros.h glib\gmain.c gtk: gdk\gdkkeyuni.c pango: pango\break.c pango\pango-language-sample-table.h
Pushed this to Pango: https://git.gnome.org/browse/pango/commit/?id=c8b1d5bea67155ae51cb9a145d43df79a52faeb4 and something similar to HarfBuzz. Feel free to file bugs against glib and gtk. I don't think asking this is unreasonable for any of those files, except for gmacros.h possibly.
I fixed gmacros.h as well.
I'm actually fixing the rest of them. No need to file bug.
I checked the other ones you listed (glib and gtk), they all have UTF-8 only in comments. Is that really a problem at all?
(In reply to Behdad Esfahbod from comment #13) > I checked the other ones you listed (glib and gtk), they all have UTF-8 only > in comments. Is that really a problem at all? I guess we should just change the comments to not have utf-8 stuff and make msvc happy? Also maybe let's change the bug to fixed?
(In reply to Ignacio Casal Quinteiro (nacho) from comment #14) > (In reply to Behdad Esfahbod from comment #13) > > I checked the other ones you listed (glib and gtk), they all have UTF-8 only > > in comments. Is that really a problem at all? > > I guess we should just change the comments to not have utf-8 stuff and make > msvc happy? That's what I started doing, but then was asked whether this actually *is* a problem. Does the compiler warn or err? > Also maybe let's change the bug to fixed?
I think it wasn't clear - there are two different problems. 1. The presence of characters anywhere in the file that are not in the system code page causes cl.exe to emit C4819. Since these characters are utf-8 characters, this can be fixed by adding a UTF-8 BOM to the file. C4819 is a warning by default, but in those three projects it is considered an error. 2. The presence of the UTF-8 BOM only lets the compiler *parse* the file without warnings. It does not affect the emit of these characters in the final executable. So if these characters are present inside string literals, these constants will be mangled in the binary. There is no warning or error for this. This is fixed by adding the pragma.
Ok. The glib ones are easy enough to fix. The GDK one, I'm fine with removing the accents from the comments. In both cases Matthias should ok it first.
Hi, Apparently quite a bit of things happened overnight (in my part of the world). My take on adding the pragmas would be as follows (I know it's a bit long): -If we are on Visual Studio 2008 or 2012, I take this pragma as unsupported. There is a hotfix KB980263 for 2008, but it's not readily available for download from Microsoft (it's an obtain-by- request thing). -Note that if there are tools or scripts that process the source files, the addition of BOM *may* not be liked by such tools/scripts. -This pragma actually broke some glib tests for me, namely hostutils, utf8-pointer and search (and possibly some that I have missed). My original intention upon seeing what happened in this bug was to use the pragma on 2010 and 2013 (and later) in msvc_recommended_pragmas.h, but running the glib tests reveal that this pragma is not the optimal solution. This happens whether or not the source file is in UTF8 (no BOM or with BOM) and when the pragma is used, although as Arnav pointed out the GTK+ widgets displayed the texts properly. This means the built binaries are still considered broken, in my point of view. -The source files where we get C4819 varies on the language version of Windows (as Arnav mentioned), so it's going to be hard to get the right files to add the BOM programatically that works for all people, as HexChat did in the build scripts. Those C4819 warnings can come with error C2001 (newline in constant), as the compiler would not parse the sources properly during a C4819 scenario, before we add BOM to it. Such C4819 warnings (and C2001 errors) will occur unless we add BOM (which is bad due to the previous point) or we set the "Language for non-Unicode Programs" properly to English (United States). As a result, I would recommend against using this pragma for Visual Studio builds and build the stack after setting "Language for non-Unicode Programs" correctly. Behdad, probably you want to revert changes where you had using the pragmas for Visual Studio for Pango, Harfbuzz and anything else that came across you for this. With blessings, thank you!
Hi, "as HexChat did" should read "as HexChat tried to do"... Sorry for the noise. With blessings.
Correct. HC's build script requires both the BOM and the pragma, so even with the commits to add the pragma, it'll continue to have to add the BOM. Regarding the failing tests, perhaps it's because the tests are wrong? I haven't checked. Either way, I'm totally fine with upstream not adding the pragma and accomodating non-en-US locale builds. For HC, since it seems to not be a problem I will stick with adding the BOM and pragma for now, (and investigate the failing tests to see if they're a genuine problem).
(In reply to Fan, Chun-wei from comment #18) > As a result, I would recommend against using this pragma for Visual Studio > builds and build the stack after setting "Language for non-Unicode Programs" > correctly. Behdad, probably you want to revert changes where you had using > the pragmas for Visual Studio for Pango, Harfbuzz and anything else that > came across you for this. I'm not following. What's the problem with keeping the pragma? I don't understand what goes wrong. I can, of course, revert, but want to understand.
Hello Behdad, (In reply to Behdad Esfahbod from comment #21) > I'm not following. What's the problem with keeping the pragma? I don't > understand what goes wrong. I can, of course, revert, but want to > understand. For example, with the pragma (when building the code under English-US and no BOM is added to the source files), the hostutils (under glib/tests) test would fail with the following: GLib:ERROR:hostutils.c:107:test_to_ascii: assertion failed (non_round_trip_names [i].ascii_name == ascii): ("xn--5ca.idn.icann.org" == "xn--iba3im31h.idn.icann.o rg") Which corresponded to the test line: { "Å.idn.icann.org", "xn--5ca.idn.icann.org", TRUE, TRUE }, (line 64 of hostutils.c) The test also fails when adding BOM to the sources and using the pragma. Which seems to me that the utf-8 character was not correctly handled by the pragma in this case. But without the pragma during the build, when building under English-US, the test will pass. Similar things goes for the search-utils and utf8-pointer test programs. With blessings.
The test passes for me with BOM and pragma on all the files I listed in the previous post. I'm on US English Windows 7 set to Japanese locale. Looking at the binary, the constant is correctly present http://i.imgur.com/2yjupqF.png (0xe2 0x84 0xab is the utf-8 encoding for Å) When debugging in the VS debugger, it gets displayed as "â„«.idn.icann.org" by the inspector / watch window, but that's just because they display each character individually instead of as utf-8. The actual call to g_hostname_to_ascii correctly returns "xn--5ca.idn.icann.org". "xn--iba3im31h.idn.icann.org" would be the punycode for "â„«.idn.icann.org" Can you check how it's encoded in your binary?
(To clarify, "xn--iba3im31h.idn.icann.org" would be the punycode for "â„«.idn.icann.org" if each character of the latter were *individually* encoded as a separate utf-8 character, and not the sequence 0xe2 0x84 0xab that corresponds to the single utf-8 character Å.)
Hi, I think all in all, this is what I'd like to say about this: -I am not all that comfortable with this pragma, as: --It is only supported on Visual Studio 2010 and 2013, and 2008 SP1, if one can obtain the hotfix from Microsoft under KB980263, which is something that needs to be requested directly from Microsoft, not as a ready-to -download item. Note that 2012 does not support this, so the only known reliable way on 2008 and 2012 (and possibly 2015 and later) is to set "Language for non-Unicode Programs" properly prior to building. --This is undocumented, so is subject to change (as we seen in the 2012 case) -It can be hard to maintain the list files where we want to add BOM to, as C4819 pops up depending on the language version and selected locale of the system. For instance, I am running a Chinese version of Windows 7/8.1, and even disregarding C4819 in my builds does not give me a successful build because at times as I would get the C2001 "newline in constant" error along with C4819, and there are a number of gtk+-3.x sources that are not listed in HexChat's build scripts that I need to add BOM to in order to have the build successfully continue. -Assuming what Arnav says is true in comments 23 and 24, not all people will know that. The test programs are not likely to change due to the use of the pragma, as the expected strings are for all standard builds (For Windows, that means MinGW binaries and MSVC binaries, provided that "Language for non-Unicode Programs" is set correctly. The failed tests here would not be a good sign for people, given that they pass before using the pragma. -At some point we may be using a tool or script to process the sources or headers to generate items, and adding BOM could interfere with the process. Things like this have hit me when I was trying to build KDE items previously, before knowing my remedy to the C4819 issue. -I think it is still easier for one to have "Language for non-Unicode Programs" set properly prior to building, by all means. --- If, we are still going by the pragma route, I think the best way to do it would be to add that into GLib's msvc_recommended_pragmas.h instead of being in the individual sources/headers of the GNOME (or GNOME-component-using) parts of the GTK+ stack, but don't turn it on by default. This is done as the Visual Studio projects of the GNOME components all force-include this header file. It can be activated with both of these conditions being true: -The compiler supports the pragma (i.e. 2010 and 2013 for now) -A specific preprocessor macro is defined in the projects (or property sheets), with possibly another macro defined for 2008 if the user is certain that he has the hotfix for KB980263. This means that we still treat C4819 as an error by default, unless the 2 aforementioned conditions are met. So, this is another reason why I think we don't want to do this unconditionally for Visual Studio builds directly in pango-language-sample-table.h, for instance. With blessings.
I agree. As I said in #20, HC would need both the BOM and the pragma, so upstream just adding the pragma is not helpful anyway. It's okay for upstream to officially require en-US system locale to compile the sources. HC will continue to add BOM and pragma to files that I discover cannot be compiled in my locale and support building on non-en-US locales, since I do not see any problems with it for HC's use.
Thanks. Reverted pragmas.
Hi, In case someone might be interested in this piece of info... Though I think Visual Studio should have supported this long time ago. https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/ With blessings and cheers!