GNOME Bugzilla – Bug 311337
g_convert_with_iconv Converts incorrectly cp1255
Last modified: 2011-02-18 16:11:14 UTC
Please describe the problem: When converting CP1255 to UTF-8, the last letter in the output buffer is missing, if len is passed without couting the terminating null. A five letter cp1255 will be converted to 8 utf-8 bytes if the length is 5 or -1, but will be converted correctly (to 11 bytes) if len is 6 (counting the terminating null). Doing the same conversion on cp1253, suppling 5 as the length produces a correct string, so I assume there's a problem with cp1255 specificly. Steps to reproduce: C source is attached as other information. Actual results: Output buffer has a missing letter. Expected results: Text will be converted correctly. Does this happen every time? It does. Other information: #include <glib.h> #include <stdio.h> void tohex(char *str) { char *ch; for (ch = str ; *ch ; ch++) { printf("%02x ", *ch & 0xff); } printf("\n"); } int main() { // A five-letter hebrew word (cp1255) - null terminated char buf[] = { 0xf4, 0xe5, 0xf8, 0xe5, 0xed, 0x00 }; // Last char missing int buflen = 5; // Same thing here // int buflen = -1; // Works OK... // int buflen = 6; GIConv giconv = g_iconv_open("UTF8", "CP1255"); gsize bytes_read, bytes_written; GError *err = 0; char *outbuf = (char*)g_convert_with_iconv((gchar*)buf, buflen, giconv, &bytes_read, &bytes_written, &err); printf("buf: "); tohex(buf); printf("buflen: %d\n", buflen); printf("bytes_read: %d\n", bytes_read); // outbuf should be 10 bytes long + null (total 11 bytes) printf("bytes_written: %d\n", bytes_written); printf("outbuf: "); tohex(outbuf); return 0; }
Any problem that's specific to a particular 8-bit encoding is a system (GNU libc on Linux) problem.
I suspect the problem lies in the wrapping of the libc iconv methods and not in libc itself. Running the same code above using iconv(3) instead of g_convert_with_iconv will return correct results.
Correcting myself. iconv(3) return invalid unicode, which is probably removed by g_convert_with_iconv.
Hmmm, so is bug #138299 a duplicate of this one?
It appears so. I've reported this to glibc guys, <a href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=1124">bug #1124</a>.
After certain discussion in libc's bugzilla, it appears that the problem is cp1255 specific, since it is a stateful encoding, and there's a solution for it - see comment #4 at http://sources.redhat.com/bugzilla/show_bug.cgi?id=1124#c4 (this time no HTML tags :). I've tested it using the code in the libc bugreport, and it seems to be working - which means that g_convert_with_iconv might need a change after all... Cheers,
Created attachment 50096 [details] [review] a patch
Created attachment 50097 [details] [review] new version A new version of the patch which reuses the error handling code in g_convert_with_iconv() and protects against iconv returning something other than E2BIG in g_convert_with_fallback().
I don't think the change should be in g_convert_with_iconv(). After all, iconv *is* a stateful coder so we shouldn't reset its state in every call to iconv but only when we're sure our entire stream has ended. It's a possible scenario that a user of g_convert_with_iconv will call it for every block he receives from the network and he'll want to retain the state between the calls (which's exactly what the iconv descriptor provides -- a place to store the state). Only in a function which does a one-shot at conversion, when its clear to us when the end of the stream is (and when we afterwards close the iconv descriptor), we should perform the iconv reset (inbuf = NULL, inbytesleft=NULL). And that is - only at the end of the g_convert function.
Good point.
Created attachment 50110 [details] [review] new version
The last patch is a bit larger since it splits g_convert_with_iconv into two functions and moves them around a bit.
Matthias, thanks! The patch seems solid to me. For the benefit of future generations, I'd also add a small comment above g_convert_with_iconv_internal explaining its' necessity. But wait, I think you raised another issue! g_convert_with_iconv users also need a way to flush the iconv descriptor properly and the current g_convert_with_iconv doesn't allow them to do it. So I propose we add a new public API function and also use it ourselves: static gchar* g_flush_iconv(GIConv converter, gsize *bytes_written, GError **error);
I think g_convert_with_iconv() was meant to be "g_convert() using a cached iconv_t" rather than "better interface to iconv()"... may be the latter is useful though...
2001-08-24 Alexander Larsson <alla@lysator.liu.se> * glib/gconvert.[ch] (g_convert_with_iconv): New function, doing the same as g_convert but taking a GIConv argument. The old g_convert is just a call to this with a newly opened GIConv.
I think a simple fix for the problem raised in comment #13 is to do away with g_convert_with_iconv_internal and just allow NULL as first argument to g_convert_with_iconv. That would also considerably shrink the patch. Owen, do you think g_convert_with_iconv should emit the final shift, then ?
Created attachment 50135 [details] [review] final patch as committed