Bug 301935 – Invalid byte sequence with g_locale_from_utf8()




Summary:	Invalid byte sequence with g_locale_from_utf8()


Status:	RESOLVED INCOMPLETE

Product:	glib
Classification:	Platform
Component:	general
Version:	2.6.x
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	Christophe de Vienne
QA Contact:	Christophe de Vienne

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-04-25 16:41 UTC by Aaron Walker
Modified:	2011-02-18 16:14 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
test.cc (1.30 KB, text/plain) 2005-04-25 16:42 UTC, Aaron Walker
test.xml (138 bytes, text/plain) 2005-04-25 16:43 UTC, Aaron Walker
libxmlpp-2.10.0-debug.diff (1021 bytes, patch) 2005-04-25 16:44 UTC, Aaron Walker
testcase (160 bytes, text/plain) 2005-04-26 14:04 UTC, Christophe de Vienne
g_locale_from_utf8.c (527 bytes, text/x-csrc) 2005-04-28 09:24 UTC, Murray Cumming
iconv-test.c (927 bytes, text/plain) 2005-04-29 12:15 UTC, Aaron Walker

Description Aaron Walker 2005-04-25 16:41:48 UTC

Distribution/Version: Gentoo Base System version 1.6.11

As per http://mail.gnome.org/archives/gtkmm-list/2005-April/msg00261.html, here
is a test case.

I'm using a patched version of libxml++ 2.10.0.  I'll attach the patch, test app, and
test xml.

$ g++ -Wall -ggdb3 $(pkg-config --cflags --libs libxml++-2.6) test.cc -o test
$ ./test test.xml
ch = 'ò</name>
</maintainer>
'
s = 'Diego Petten'
text = 'Diego Petten'
Invalid byte sequence in conversion input

Note: I had the same results when I ran the dom_parser/sax_parser examples on the
same XML.

Comment 1 Aaron Walker 2005-04-25 16:42:41 UTC

Created attachment 45656
test.cc

Comment 2 Aaron Walker 2005-04-25 16:43:07 UTC

Created attachment 45657
test.xml

Comment 3 Aaron Walker 2005-04-25 16:44:53 UTC

Created attachment 45658
libxmlpp-2.10.0-debug.diff

Prints ch+len and the resulting Glib::ustring. Also includes a fix for the
SaxParserCallback::on_characters() bug.

Comment 4 Murray Cumming 2005-04-25 17:57:50 UTC

Confirmed with libxml++ from cvs.

Without catching the Glib::Error exception, this is the backtrace:

#0 __kernel_vsyscall
#1 raise from /lib/tls/i686/cmov/libc.so.6
#2 abort from /lib/tls/i686/cmov/libc.so.6
#3 __cxa_call_unexpected from /usr/lib/libstdc++.so.5
#4 std::terminate from /usr/lib/libstdc++.so.5
#5 __cxa_throw from /usr/lib/libstdc++.so.5
#6 Glib::ConvertError::throw_func at convert.cc line 320
#7 Glib::Error::throw_exception at error.cc line 174
#8 Glib::locale_from_utf8 at convert.cc line 192
#9 Glib::operator<< at ustring.cc line 1202
#10 print_node at main.cc line 65
#11 print_node at main.cc line 108
#12 print_node at main.cc line 108
#13 main at main.cc line 140


I guess we need to find out exactly what bytes are in the Glib::ustring.

Comment 5 Christophe de Vienne 2005-04-26 14:02:33 UTC

Breakpoint 1, print_node (node=0x80554b8, indentation=4) at main.cc:65  
65          std::cout << "text = \"" << nodeText->get_content() << "\"" <<  
std::endl;  
(gdb) print nodeText->get_content()  
$1 = {static npos = 4294967295, string_ = {static npos = 4294967295,  
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> =  
{<No data fields>}, <No data fields>}, _M_p = 0x804c31c "Diego PettenÃ²"}}}  
(gdb)     
  
The bytes in the Glib::ustring look good.

I could reproduce the problem independently of libxml++ (see attachment).

Comment 6 Christophe de Vienne 2005-04-26 14:04:37 UTC

Created attachment 45694
testcase

Compile with:

g++ -g `pkg-config --cflags glibmm-2.4` -o test test.cc `pkg-config --libs glibmm-2.4`

Comment 7 Murray Cumming 2005-04-28 09:17:56 UTC

I don't think you can put Unicode directly into C sources. You should use
English literals and gettext().

Comment 8 Murray Cumming 2005-04-28 09:24:58 UTC

Created attachment 45774
g_locale_from_utf8.c

Here is a C test case. Please discuss this on gtk-list@gnome.org if you
disagree.

Comment 9 Murray Cumming 2005-04-28 09:27:28 UTC

Then again, the gettext() thing doesn't help you when reading from your XML
file. Maybe a glib coder can explain.

Comment 10 Matthias Clasen 2005-04-28 20:31:08 UTC

Putting Unicode in C should be fine, as far as gcc is concerned.
This looks like an iconv bug to me. It doesn't seem to accept the bytes \xc3\xb2.
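
For reference, the two bytes in question (0xC3 0xB2) are the UTF-8 encoding of U+00F2 (ò), which is also why the test string is 14 bytes long even though it shows 13 characters. The following is a minimal sketch that checks this by hand; the function name decode_utf8_2byte is hypothetical and is not part of glib or of any attachment on this bug.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from glib): decode a two-byte UTF-8 sequence
 * of the form 110xxxxx 10xxxxxx into a code point.
 * Returns -1 if the bytes are not a valid, shortest-form two-byte sequence. */
long decode_utf8_2byte(unsigned char b0, unsigned char b1)
{
    /* The lead byte must start with bits 110, the trail byte with bits 10. */
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;

    long cp = ((long)(b0 & 0x1F) << 6) | (long)(b1 & 0x3F);

    /* Code points below 0x80 must use one byte; reject overlong forms. */
    return cp >= 0x80 ? cp : -1;
}
```

Calling decode_utf8_2byte(0xC3, 0xB2) yields 0xF2, so the input is well-formed UTF-8. U+00F2 has a single-byte form in ISO-8859-1 (0xF2) but no representation at all in 7-bit ASCII.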

Comment 11 Aaron Walker 2005-04-29 12:14:18 UTC

Well, I've confirmed that iconv is returning -1 and setting errno to EILSEQ,
however I am unable to reproduce it outside of glib.  I've attached a test case
which works as expected.

Comment 12 Aaron Walker 2005-04-29 12:15:08 UTC

Created attachment 45823
iconv-test.c

Comment 13 Aaron Walker 2005-05-08 12:04:16 UTC

Any update on this?  Yea or nay on whether it's really an iconv problem or not?
I'm hoping this gets resolved soon, as the UTF-8 support is really my sole
reason for using glib/glibmm.

Comment 14 Matthias Clasen 2005-05-08 14:58:33 UTC

The error does not occur when doing the explicit UTF-8 -> ISO8859-1 conversion
using g_convert either, so the problem seems to be not in iconv and not in the
glib iconv wrapper, but rather in g_locale_from_utf8.

Comment 15 Matthias Clasen 2005-05-08 15:10:13 UTC

You should probably insert

    const gchar *charset;
    g_get_charset (&charset);
    g_print ("charset %s\n", charset);

in your example and verify that glib's idea of the locale charset coincides with
what you believe it is.

Comment 16 Aaron Walker 2005-05-10 15:36:12 UTC

charset ANSI_X3.4-1968
setlocale returned 'en_US.UTF-8'
strlen("Diego PettenÃ²") == 14
iconv: Invalid or incomplete multibyte or wide character
result = 'Diego Petten'

That's after replacing ISO8859-1 with ANSI_X3.4-1968.  Not sure what this means,
though.

Comment 17 Matthias Clasen 2005-05-17 17:39:04 UTC

ANSI_X3.4-1968 is a fancy name for ASCII, so it is no wonder it can't handle
that last character.

What is your locale set to?
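
The difference between the two target charsets can be reproduced with plain POSIX iconv(), outside glib. This is a rough sketch and not the attached iconv-test.c; the helper name convert_from_utf8 is hypothetical, and the failure mode described is what glibc's iconv does (other implementations may substitute a replacement character instead of failing).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string into the
 * charset named by `to`. On success, returns the number of bytes written
 * and NUL-terminates `out`. On failure, returns -1 with errno set. */
long convert_from_utf8(const char *to, const char *in, char *out, size_t outsize)
{
    if (outsize == 0) {
        errno = E2BIG;
        return -1;
    }

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                    /* errno set by iconv_open */

    char *inp = (char *)in;           /* iconv wants char **, input is not modified */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;     /* reserve room for the terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    int saved_errno = errno;          /* iconv_close may overwrite errno */
    iconv_close(cd);

    if (rc == (size_t)-1) {
        errno = saved_errno;
        return -1;
    }

    *outp = '\0';
    return (long)(outp - out);
}
```

With the input "Diego Petten\xc3\xb2", a target of "ISO-8859-1" should succeed and produce 13 bytes ending in 0xF2, while a target of "ASCII" should fail with EILSEQ on glibc, which matches the "Invalid or incomplete multibyte or wide character" message quoted in comment #16.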

Comment 18 Aaron Walker 2005-05-18 00:56:49 UTC

As displayed in comment #16, my locale is set to en_US.UTF-8.

Comment 19 Murray Cumming 2005-07-18 08:49:42 UTC

Are we any closer to an explanation for this? It's very odd.

Comment 20 Matthias Clasen 2006-04-05 15:18:57 UTC

You need to figure out why g_get_charset() thinks that your locale charset is
ASCII, when the locale is set to en_US.UTF-8
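
One way to start investigating is to look at what the C library itself reports as the codeset, before and after the process adopts the environment's locale. The sketch below uses only POSIX calls; the function names are hypothetical. It rests on an assumption this thread does not confirm: that g_get_charset() ultimately reflects the same information as nl_langinfo(CODESET). A program that has not yet called setlocale(LC_ALL, "") runs in the "C" locale, which glibc typically reports as ANSI_X3.4-1968, but nothing here establishes that this was the cause of the report.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the codeset name of the currently active
 * locale into buf (the pointer from nl_langinfo may be invalidated by
 * later setlocale calls, so a copy is kept). */
const char *active_codeset(char *buf, size_t size)
{
    snprintf(buf, size, "%s", nl_langinfo(CODESET));
    return buf;
}

/* Hypothetical helper: print the codeset before and after adopting the
 * locale from the environment (LANG, LC_ALL, ...).
 * Returns 0 if setlocale() succeeded, -1 otherwise. */
int report_codesets(FILE *stream)
{
    char before[64], after[64];

    active_codeset(before, sizeof before);
    const char *name = setlocale(LC_ALL, "");
    active_codeset(after, sizeof after);

    fprintf(stream, "codeset before setlocale: %s\n", before);
    fprintf(stream, "setlocale returned:       %s\n", name ? name : "(failed)");
    fprintf(stream, "codeset after setlocale:  %s\n", after);

    return name ? 0 : -1;
}
```

If the "after" line does not show UTF-8 while the locale name is en_US.UTF-8, the locale data on the system may be missing or broken; if it does show UTF-8, the remaining question is when the charset was queried relative to the setlocale() call.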

Comment 21 Matthias Clasen 2007-12-23 01:13:34 UTC

No response in more than a year, closing.