GNOME Bugzilla – Bug 301935
Invalid byte sequence with g_locale_from_utf8()
Last modified: 2011-02-18 16:14:05 UTC
Distribution/Version: Gentoo Base System version 1.6.11

As per http://mail.gnome.org/archives/gtkmm-list/2005-April/msg00261.html, here is a test case. I'm using a patched version of 2.10.0. I'll attach the patch, test app, and test XML.

$ g++ -Wall -ggdb3 $(pkg-config --cflags --libs libxml++-2.6) test.cc -o test
$ ./test test.xml
ch = 'ò</name> </maintainer> '
s = 'Diego Petten'
text = 'Diego Petten'
Invalid byte sequence in conversion input

Note: I had the same results when I ran the dom_parser/sax_parser example on the same XML.
Created attachment 45656 [details] test.cc
Created attachment 45657 [details] test.xml
Created attachment 45658 [details] [review] libxmlpp-2.10.0-debug.diff
Prints ch+len and the resulting Glib::ustring. Also includes a fix for the SaxParserCallback::on_characters() bug.
Confirmed with libxml++ from cvs. Without catching the Glib::Error exception, this is the backtrace:
Trace 58706
I guess we need to find out exactly what bytes are in the Glib::ustring.
Breakpoint 1, print_node (node=0x80554b8, indentation=4) at main.cc:65
65 std::cout << "text = \"" << nodeText->get_content() << "\"" << std::endl;
(gdb) print nodeText->get_content()
$1 = {static npos = 4294967295, string_ = {static npos = 4294967295, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x804c31c "Diego Pettenò"}}}
(gdb)

The bytes in the Glib::ustring look good. I could reproduce the problem independently of libxml++ (cf. attachment).
Created attachment 45694 [details] testcase
Compile with: g++ -g `pkg-config --cflags glibmm-2.4` -o test test.cc `pkg-config --libs glibmm-2.4`
I don't think you can put Unicode directly into C sources. You should use English literals and gettext().
Created attachment 45774 [details] g_locale_from_utf8.c
Here is a C test case. Please discuss this on gtk-list@gnome.org if you disagree.
Then again, the gettext() approach doesn't help you when reading from your XML file. Maybe a glib coder can explain.
Putting Unicode in C should be fine, as far as gcc is concerned. This looks like an iconv bug to me. It doesn't seem to accept the byte sequence \xc3\xb2 (the UTF-8 encoding of ò).
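For reference, the two bytes in question can be derived by hand. The following is a minimal sketch, not code from any of the attachments; the helper name utf8_encode_2byte is hypothetical and only covers code points that need exactly two UTF-8 bytes, which is enough for ò (U+00F2).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode a code point in the range U+0080..U+07FF
 * as two UTF-8 bytes. Returns 2 on success, 0 if the code point is
 * outside that range. */
size_t utf8_encode_2byte(unsigned int cp, unsigned char out[2])
{
    assert(out != NULL);
    if (cp < 0x80 || cp > 0x7FF)
        return 0;
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte: 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation: 10xxxxxx */
    return 2;
}
```

For U+00F2 this yields 0xC3 0xB2, so "Diego Pettenò" occupies 14 bytes in UTF-8: 12 ASCII bytes plus 2 for the final character.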
Well, I've confirmed that iconv is returning -1 and setting errno to EILSEQ, however I am unable to reproduce it outside of glib. I've attached a test case which works as expected.
Created attachment 45823 [details] iconv-test.c
Any update on this? Yea or nay on whether it's really an iconv problem? I'm hoping this gets resolved soon, as the UTF-8 support is really my sole reason for using glib/glibmm.
The error does not occur when doing the explicit UTF-8 -> ISO8859-1 conversion using g_convert() either, so the problem seems to be neither in iconv nor in the glib iconv wrapper, but rather in g_locale_from_utf8().
You should probably insert

    const gchar *charset;
    g_get_charset (&charset);
    g_print ("charset %s\n", charset);

in your example and verify that glib's idea of the locale charset coincides with what you believe it is.
charset ANSI_X3.4-1968
setlocale returned 'en_US.UTF-8'
strlen("Diego Pettenò") == 14
iconv: Invalid or incomplete multibyte or wide character
result = 'Diego Petten'

That's after replacing ISO8859-1 with ANSI_X3.4-1968. Not sure what this means, though.
ANSI_X3.4-1968 is a fancy name for ASCII, so it is no wonder it can't handle that last character. What is your locale set to?
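To see the difference between the two target charsets in isolation, here is a rough sketch using plain iconv rather than glib. The function name convert_from_utf8 is made up for illustration, and the behaviour described assumes glibc's iconv; other implementations may substitute a replacement character instead of failing.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to
 * `to_charset` with iconv. Writes at most out_size - 1 bytes plus a
 * terminating NUL into out. Returns 0 on success, or the errno value
 * set by iconv_open()/iconv() on failure. */
int convert_from_utf8(const char *to_charset, const char *in,
                      char *out, size_t out_size)
{
    assert(to_charset != NULL && in != NULL && out != NULL && out_size > 0);

    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return errno;

    char *in_ptr = (char *)in;       /* iconv wants char **, input is not modified */
    size_t in_left = strlen(in);
    char *out_ptr = out;
    size_t out_left = out_size - 1;  /* reserve room for the NUL */

    int err = 0;
    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1)
        err = errno;

    *out_ptr = '\0';                 /* keep whatever was converted before a failure */
    iconv_close(cd);
    return err;
}
```

With "ISO-8859-1" as the target, "Diego Pettenò" should convert cleanly to 13 bytes. With "ASCII" (or ANSI_X3.4-1968) as the target, glibc is expected to stop at the ò with EILSEQ and leave only 'Diego Petten' in the output buffer, which matches the truncated result reported above.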
As displayed in comment #16, my locale is set to en_US.UTF-8.
Are we any closer to an explanation for this? It's very odd.
You need to figure out why g_get_charset() thinks that your locale charset is ASCII when the locale is set to en_US.UTF-8.
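One way to start narrowing this down, independent of glib, is to ask libc directly which codeset it associates with a locale. This is only a sketch under the assumption that the codeset reported by libc is relevant to what g_get_charset() returns; the helper name codeset_for_locale is hypothetical.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: select the LC_CTYPE locale `name` ("" means
 * "take it from the environment") and return the codeset libc then
 * reports, or NULL if the locale could not be selected. The returned
 * string is owned by libc and may be overwritten by later calls. */
const char *codeset_for_locale(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_CTYPE, name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Comparing codeset_for_locale("C") with codeset_for_locale("") on the affected machine may be informative. On glibc the "C" locale typically reports ANSI_X3.4-1968, and a C program stays in the "C" locale until it calls setlocale() itself, whatever LANG says. A NULL result for "" would suggest that the locale named in the environment is not actually installed. Whether either of these explains this particular report is not established in the thread.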
No response in more than a year, closing.