Bug 301935 - Invalid byte sequence with g_locale_from_utf8()
Status: RESOLVED INCOMPLETE
Product: glib
Classification: Platform
Component: general
Version: 2.6.x
Hardware: Other Linux
Importance: Normal major
Target Milestone: ---
Assigned To: Christophe de Vienne
Christophe de Vienne
Depends on:
Blocks:
Reported: 2005-04-25 16:41 UTC by Aaron Walker
Modified: 2011-02-18 16:14 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
test.cc (1.30 KB, text/plain), 2005-04-25 16:42 UTC, Aaron Walker
test.xml (138 bytes, text/plain), 2005-04-25 16:43 UTC, Aaron Walker
libxmlpp-2.10.0-debug.diff (1021 bytes, patch), 2005-04-25 16:44 UTC, Aaron Walker
testcase (160 bytes, text/plain), 2005-04-26 14:04 UTC, Christophe de Vienne
g_locale_from_utf8.c (527 bytes, text/x-csrc), 2005-04-28 09:24 UTC, Murray Cumming
iconv-test.c (927 bytes, text/plain), 2005-04-29 12:15 UTC, Aaron Walker

Description Aaron Walker 2005-04-25 16:41:48 UTC
Distribution/Version: Gentoo Base System version 1.6.11

As per http://mail.gnome.org/archives/gtkmm-list/2005-April/msg00261.html, here
is a test case.

I'm using a patched version of libxml++ 2.10.0. I'll attach the patch, the test
app, and the test XML.

$ g++ -Wall -ggdb3 $(pkg-config --cflags --libs libxml++-2.6) test.cc -o test
$ ./test test.xml
ch = 'ò</name>
</maintainer>
'
s = 'Diego Petten'
text = 'Diego Petten'
Invalid byte sequence in conversion input

Note: I got the same results when I ran the dom_parser/sax_parser examples on
the same XML.
Comment 1 Aaron Walker 2005-04-25 16:42:41 UTC
Created attachment 45656: test.cc
Comment 2 Aaron Walker 2005-04-25 16:43:07 UTC
Created attachment 45657: test.xml
Comment 3 Aaron Walker 2005-04-25 16:44:53 UTC
Created attachment 45658: libxmlpp-2.10.0-debug.diff

Prints ch+len and the resulting Glib::ustring. Also includes a fix for the
SaxParserCallback::on_characters() bug.
Comment 4 Murray Cumming 2005-04-25 17:57:50 UTC
Confirmed with libxml++ from cvs.

Without catching the Glib::Error exception, this is the backtrace:
  • #0 __kernel_vsyscall
  • #1 raise
    from /lib/tls/i686/cmov/libc.so.6
  • #2 abort
    from /lib/tls/i686/cmov/libc.so.6
  • #3 __cxa_call_unexpected
    from /usr/lib/libstdc++.so.5
  • #4 std::terminate
    from /usr/lib/libstdc++.so.5
  • #5 __cxa_throw
    from /usr/lib/libstdc++.so.5
  • #6 Glib::ConvertError::throw_func
    at convert.cc line 320
  • #7 Glib::Error::throw_exception
    at error.cc line 174
  • #8 Glib::locale_from_utf8
    at convert.cc line 192
  • #9 Glib::operator<<
    at ustring.cc line 1202
  • #10 print_node
    at main.cc line 65
  • #11 print_node
    at main.cc line 108
  • #12 print_node
    at main.cc line 108
  • #13 main
    at main.cc line 140

I guess we need to find out exactly what bytes are in the Glib::ustring.
Comment 5 Christophe de Vienne 2005-04-26 14:02:33 UTC
Breakpoint 1, print_node (node=0x80554b8, indentation=4) at main.cc:65  
65          std::cout << "text = \"" << nodeText->get_content() << "\"" <<  
std::endl;  
(gdb) print nodeText->get_content()  
$1 = {static npos = 4294967295, string_ = {static npos = 4294967295,  
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> =  
{<No data fields>}, <No data fields>}, _M_p = 0x804c31c "Diego Pettenò"}}}  
(gdb)     
  
The bytes in the Glib::ustring look good.

I could reproduce the problem independently of libxml++ (see the attached
testcase).
Comment 6 Christophe de Vienne 2005-04-26 14:04:37 UTC
Created attachment 45694: testcase

Compile with:

g++ -g `pkg-config --cflags glibmm-2.4` -o test test.cc `pkg-config --libs glibmm-2.4`
Comment 7 Murray Cumming 2005-04-28 09:17:56 UTC
I don't think you can put Unicode directly into C sources. You should use
English literals and gettext().
Comment 8 Murray Cumming 2005-04-28 09:24:58 UTC
Created attachment 45774: g_locale_from_utf8.c

Here is a C test case. Please discuss this on gtk-list@gnome.org if you
disagree.
Comment 9 Murray Cumming 2005-04-28 09:27:28 UTC
Then again, the gettext() approach doesn't help you when reading from your XML
file. Maybe a glib coder can explain.
Comment 10 Matthias Clasen 2005-04-28 20:31:08 UTC
Putting Unicode in C should be fine, as far as gcc is concerned.
This looks like an iconv bug to me. It doesn't seem to accept the byte sequence
\xc3\xb2 (the UTF-8 encoding of ò).
Comment 11 Aaron Walker 2005-04-29 12:14:18 UTC
Well, I've confirmed that iconv is returning -1 and setting errno to EILSEQ;
however, I am unable to reproduce it outside of glib. I've attached a test case
which works as expected.
Comment 12 Aaron Walker 2005-04-29 12:15:08 UTC
Created attachment 45823: iconv-test.c
Comment 13 Aaron Walker 2005-05-08 12:04:16 UTC
Any update on this? Yea or nay on whether it's really an iconv problem or not?
I'm hoping this gets resolved soon, as the UTF-8 support is really my sole
reason for using glib/glibmm.
Comment 14 Matthias Clasen 2005-05-08 14:58:33 UTC
The error does not occur when doing the explicit UTF-8 -> ISO8859-1 conversion
using g_convert either, so the problem seems to be neither in iconv nor in the
glib iconv wrapper, but rather in g_locale_from_utf8.
Comment 15 Matthias Clasen 2005-05-08 15:10:13 UTC
You should probably insert

    const gchar *charset;
    g_get_charset (&charset);
    g_print ("charset %s\n", charset);

in your example and verify that glib's idea of the locale charset coincides with
what you believe it is.
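
A related check that does not depend on glib is to ask the C library which codeset the current locale uses. The sketch below is only an illustration: the helper names current_codeset and adopt_environment_locale are made up for this example, and it assumes (without confirmation from this report) that glib's answer is derived from the same C-library locale state. It relies only on the POSIX calls setlocale() and nl_langinfo().

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: name of the codeset for the locale that is
   currently active in the C library (for example "UTF-8"). */
const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: switch from the startup "C" locale to whatever
   LANG / LC_ALL / LC_CTYPE request. Returns the new locale name, or
   NULL if the request cannot be honoured, in which case the locale is
   left unchanged. */
const char *adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}
```

Printing current_codeset() before and after calling adopt_environment_locale() shows whether the environment's locale was actually applied. With glibc, a program that is still in the "C" locale reports ANSI_X3.4-1968 here.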
Comment 16 Aaron Walker 2005-05-10 15:36:12 UTC
charset ANSI_X3.4-1968
setlocale returned 'en_US.UTF-8'
strlen("Diego Pettenò") == 14
iconv: Invalid or incomplete multibyte or wide character
result = 'Diego Petten'

That's after replacing ISO8859-1 with ANSI_X3.4-1968. I'm not sure what this
means, though.
Comment 17 Matthias Clasen 2005-05-17 17:39:04 UTC
ANSI_X3.4-1968 is a fancy name for ASCII, so it is no wonder it can't handle
that last character.

What is your locale set to?
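
To illustrate why an ASCII target cannot take the final character, here is a minimal sketch that calls POSIX iconv() directly rather than going through glib. The helper name converts_cleanly and the fixed 256-byte output buffer are assumptions made for this example, and the EILSEQ behaviour described is what glibc's iconv does; other implementations may substitute a replacement character instead.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: try to convert a UTF-8 string to to_charset.
   Returns 1 if the whole string converts, 0 if iconv() fails with
   EILSEQ (invalid input, or with glibc a character the target charset
   cannot represent), and -1 on any other failure, such as an unknown
   charset name or output that does not fit in the buffer. */
int converts_cleanly(const char *to_charset, const char *utf8)
{
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char out[256];                 /* large enough for short test strings only */
    char *in_ptr = (char *)utf8;   /* iconv() wants char **; the input is not modified */
    char *out_ptr = out;
    size_t in_left = strlen(utf8);
    size_t out_left = sizeof out;

    errno = 0;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    int saved_errno = errno;       /* keep errno before iconv_close() can change it */
    iconv_close(cd);

    if (rc != (size_t)-1)
        return 1;
    return saved_errno == EILSEQ ? 0 : -1;
}
```

With glibc, converts_cleanly("ASCII", "Diego Petten\xc3\xb2") should return 0, while the same string with "ISO-8859-1" as the target should return 1, which matches the difference between the failing and working cases in this report.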
Comment 18 Aaron Walker 2005-05-18 00:56:49 UTC
As displayed in comment #16, my locale is set to en_US.UTF-8.
Comment 19 Murray Cumming 2005-07-18 08:49:42 UTC
Are we any closer to an explanation for this? It's very odd.
Comment 20 Matthias Clasen 2006-04-05 15:18:57 UTC
You need to figure out why g_get_charset() thinks that your locale charset is
ASCII when the locale is set to en_US.UTF-8.
Comment 21 Matthias Clasen 2007-12-23 01:13:34 UTC
No response in more than a year, closing.