GNOME Bugzilla – Bug 360375
non-ascii subject is mangled
Last modified: 2007-01-25 06:38:27 UTC
A subject containing MIME-encoded characters is not converted to UTF-8 correctly in the body pane (both in the full headers and in the Subject header), although it is converted correctly in the header list. When writing a follow-up to such a message, the subject is also mangled in the reply composition window. Attached is a message that triggers the bug.
Created attachment 74207 [details] message causing mangled subject
Confirmed. It does work correctly if the group's charset is set to UTF-8, but not when it's set to ISO-8859-15.
Unfortunately, my ISP has restricted UTF-8 message posts to only some groups, which is why most of my groups are not configured to use UTF-8.
Well, I was only leaving a breadcrumb to help fix this bug, not proposing a workaround.
Created attachment 74245 [details] [review] Patch to fix this problem. This patch fixes the problem: header_to_utf8() calls g_mime_utils_8bit_header_decode(), which has already converted the subject to UTF-8. content_to_utf8() then attempts to convert it again, which corrupts the string. I don't quite get the reason behind the original code, though: why still convert when the string is already UTF-8?
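To illustrate why a second conversion corrupts the string, here is a minimal standalone sketch. It does not use Pan's or GMime's code: latin1_to_utf8() is a hypothetical helper that assumes the source charset is ISO-8859-1 (ISO-8859-15 differs in a few code points, but not for the accented letters used here).

```cpp
#include <string>

// Hypothetical helper, not part of Pan or GMime: converts ISO-8859-1 bytes
// to UTF-8. Each Latin-1 byte equals its Unicode code point, so bytes
// >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Applied once to the Latin-1 bytes of "Synthèse", this yields valid UTF-8. Applied a second time, it treats the two UTF-8 bytes of "è" (0xC3 0xA8) as two separate Latin-1 characters and emits four bytes, which display as "SynthÃ¨se". The second conversion never fails, which is why the corruption goes unnoticed.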
Chris: g_mime_utils_8bit_header_decode() appears to only convert the encoded parts to UTF-8. I think the unencoded parts are passed through unchanged. Disgusting suggestion: before calling _header_decode(), build a string that converts the non-encoded segments into UTF-8. Pass that string into _header_decode(), so both the encoded and unencoded segments will have been converted to UTF-8.
You've lost me: are you addressing the case where a header is a mix of encoded and non-encoded 8-bit characters?
Yes. From my reading of g_mime_utils_8bit_header_decode(), it looks like only the header parts inside the =? ?= block are passed through iconv.
Created attachment 74307 [details] [review] Updated patch OK, it'd be quite unusual to have non-encoded and encoded characters in the same header, but the updated patch addresses that. Essentially, it converts to utf-8 first (which doesn't look at the encoded strings), and then calls g_mime_utils_8bit_header_decode (which doesn't look at the non-encoded characters). This seems to work with encoded characters, non-encoded characters and a mix of both.
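The first pass of the approach described above can be sketched as follows. This is not the attached patch: convert_unencoded_segments() and encoded_word_end() are hypothetical names, the source charset is assumed to be ISO-8859-1, and the encoded-word detection is a simplification of RFC 2047 (it does not check surrounding whitespace or validate the charset and encoding fields). The second pass, decoding the =?...?= blocks, is left to g_mime_utils_8bit_header_decode() and is not shown.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: if an encoded word (=?charset?encoding?text?=) starts
// at pos, return the index one past its end; otherwise return npos.
std::size_t encoded_word_end(const std::string& s, std::size_t pos) {
    const std::size_t npos = std::string::npos;
    if (s.compare(pos, 2, "=?") != 0) return npos;
    const std::size_t charset_end = s.find('?', pos + 2);
    if (charset_end == npos) return npos;
    const std::size_t encoding_end = s.find('?', charset_end + 1);
    if (encoding_end == npos) return npos;
    const std::size_t close = s.find("?=", encoding_end + 1);
    if (close == npos) return npos;
    return close + 2;
}

// Hypothetical first pass: convert raw 8-bit bytes (assumed ISO-8859-1) to
// UTF-8 while copying encoded words through untouched for the later decoder.
std::string convert_unencoded_segments(const std::string& header) {
    std::string out;
    std::size_t i = 0;
    while (i < header.size()) {
        const std::size_t end = encoded_word_end(header, i);
        if (end != std::string::npos) {
            out.append(header, i, end - i);  // leave =?...?= as-is
            i = end;
            continue;
        }
        const unsigned char c = static_cast<unsigned char>(header[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
        ++i;
    }
    return out;
}
```

Because the two passes touch disjoint parts of the header, neither one re-converts what the other has already produced, which is the property the updated patch relies on.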
I like the first part of that, but the second part looks like it would cause a regression on bug #356835 .
Created attachment 74369 [details] [review] Updated patch You're right. It would be a regression. Fixed now.
Looks good to me. Feel free to commit. (It's been a long time since I've said that. Nice to have the code back into CVS... :)
Committed.
Reopening: with Pan 0.117, the subject is still mangled when replying to a message with a non-ASCII subject (check the subject entry in the compose window).
Works for me. Do you have an example?
OK, I found how to reproduce the problem: you first need to set the group's default charset to ISO-8859-15 (or ISO-8859-1); the following headers will then cause the subject to be decoded incorrectly (I've anonymised the message):

Path: news.free.fr!not-for-mail
From: Foo Bar <foo@bar.net>
Newsgroups: proxad.test
Subject: Synthèse toto
Date: Wed, 30 Jun 2004 18:19:40 +0200
Organization: Free
Lines: 9
Sender: foo@bar.net
Message-ID: <cbup34$neb$1@news.free.fr>
NNTP-Posting-Host: foobar.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
NNTP-Posting-Date: Wed, 30 Jun 2004 16:17:41 +0000 (UTC)
User-Agent: Mozilla Thunderbird 0.6 (X11/20040502)
X-Accept-Language: en

If the group's default charset is UTF-8, the subject is decoded correctly. Note that the subject is not MIME-encoded but raw 8-bit.
Confirmed.
grrr ... content_to_utf8() is a mess. The reason for this bug is that BodyPane :: create_followup_or_reply() calls g_mime_message_get_subject() and then converts it to utf8 (since it may still be in original charset). The problem is that, if the returned string *is* in UTF-8, the content_to_utf8() will do a conversion from the article's charset to UTF-8 ... and succeed, thereby garbling the subject (similar to what's happening in bug #363268). What content_to_utf8() should do is to not do any conversion if the message is already in utf-8, but it can't/doesn't because of bug #356835. Charles: any thoughts?
Created attachment 75536 [details] [review] Proposed additional patch against stock 0.117
Actually, we can do what the rest of body-pane.cc does and use g_mime_message_get_header(), since that returns the raw header, not converted to UTF-8.

BTW, I don't really get this code in BodyPane :: create_followup_or_reply():

  v = normalize_subject_re (h);
  std::string val (v.str, v.len);
  if (!val.find("RE:") || !val.find ("Re:"))
    val.replace (0, 3, "Re:"); // be polite & force lowercase 'e'
  else
    val.insert (0, "Re: "); // no Re: -- add one.

normalize_subject_re() already strips the leading 'Re:'. Why the code to do it again?
I've just tested the patch and it fixes the issue.
Chris: patch looks good to me. Please commit. normalize_subject_re() doesn't actually change the string, it just shrinks StringView's view to prune redundant leading Re's. We have to convert that StringView to a std::string before changing case ("RE:" -> "Re:") or prepending an "Re: " if one isn't already there.
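The division of labour described above can be sketched as follows. This is not Pan's code: View is a hypothetical stand-in for StringView (only the str and len members seen in the quoted snippet are assumed), and prune_redundant_re() is an illustrative guess at what "pruning redundant leading Re's" means, namely keeping at most one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for Pan's StringView: a non-owning pointer + length.
struct View {
    const char* str;
    std::size_t len;
};

// True if the text begins with "Re:" in any letter case.
bool starts_with_re(const char* s, std::size_t len) {
    return len >= 3 &&
           (s[0] == 'R' || s[0] == 'r') &&
           (s[1] == 'E' || s[1] == 'e') &&
           s[2] == ':';
}

// Narrow the view past redundant leading "Re:" prefixes so that at most one
// remains. The underlying bytes are never modified.
View prune_redundant_re(View v) {
    while (starts_with_re(v.str, v.len)) {
        View rest = { v.str + 3, v.len - 3 };
        while (rest.len > 0 && *rest.str == ' ') { ++rest.str; --rest.len; }
        if (!starts_with_re(rest.str, rest.len)) break;
        v = rest;
    }
    return v;
}

// The caller needs an owning std::string before it can change the case of
// the prefix or prepend a new one.
std::string make_reply_subject(const std::string& original) {
    View v = { original.data(), original.size() };
    v = prune_redundant_re(v);
    std::string val(v.str, v.len);
    if (starts_with_re(val.data(), val.size()))
        val.replace(0, 3, "Re:");   // normalise "RE:" / "re:" to "Re:"
    else
        val.insert(0, "Re: ");      // no prefix yet, so add one
    return val;
}
```

Under these assumptions, "RE: RE: hello" becomes "Re: hello" and "hello" becomes "Re: hello". The view only selects a sub-range of the original buffer, so the copy into a std::string is the first point at which the text can be edited.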
Chris: I've cleaned up normalize_subject_re() and its caller a bit so that they're not so overlapping.
And reopening again (I know, you're going to hate this :) With Pan 0.118, some posts made with Pan/0.14.2.91 are not decoded properly when the group is configured to use ISO-8859-1/15 instead of UTF-8:

From: Foo bar <news_01@boofar.net>
Subject: Re: Audits de sécurité
Date: Sun, 05 Nov 2006 02:16:37 +0100
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table (Debian GNU/Linux))

In this case, the subject is decoded correctly in the thread pane whether the group encoding is set to UTF-8 or ISO-8859-1 (or -15). But in the message pane (and when replying to the message), the subject is decoded incorrectly if the group encoding is set to ISO-8859-1 (or -15); it is correct only for UTF-8. Enjoy ;)
... oh boy. :) Can you give me the message id + group, or preferably, attach the full message?
Created attachment 76052 [details] message causing problem here is a message causing problem
Hmm, that article:
- specifies a charset ISO-8859-15 in its content type
- has a body in ISO-8859-15 (correct)
- has a subject in UTF-8 (wrong)
To be honest, I don't see a clean solution to this and, because the article is essentially invalid, I'd be tempted to close that as WONTFIX. Charles: any thoughts?
Well, even if the message is invalid (it is ironic that it was sent by an earlier version of Pan ;), for full consistency its subject should also be displayed incorrectly when the group is configured for UTF-8 ;) IMHO, the group encoding shouldn't affect message display when an encoding is specified in the message. A possible heuristic to work around this problem: if the group encoding is not UTF-8, the message charset is not UTF-8, and the subject passes g_utf8_validate(), don't try to convert it from the message charset to UTF-8.
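The proposed heuristic could look roughly like this. It is only a sketch: is_valid_utf8() is a simplified stand-in for GLib's g_utf8_validate() (it checks lead and continuation bytes but does not reject every overlong or surrogate encoding), should_convert_subject() is a hypothetical name, and charset names are compared as exact strings for brevity.

```cpp
#include <cstddef>
#include <string>

// Simplified stand-in for g_utf8_validate(): checks that every lead byte is
// followed by the right number of continuation bytes.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;                       // stray continuation or bad lead
        if (i + extra >= s.size()) return false; // sequence is truncated
        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char cc = static_cast<unsigned char>(s[i + j]);
            if ((cc & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical decision function for the heuristic: skip the charset
// conversion when neither charset is UTF-8 but the subject already validates.
bool should_convert_subject(const std::string& group_charset,
                            const std::string& message_charset,
                            const std::string& subject) {
    if (group_charset != "UTF-8" &&
        message_charset != "UTF-8" &&
        is_valid_utf8(subject))
        return false;
    return true;
}
```

Note that the check cannot tell genuine UTF-8 apart from text in another charset whose bytes happen to validate as UTF-8; such text would be left unconverted as well.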
The problem with your workaround is that it breaks support for charsets that are UTF-8 clean (see bug #356835). Charles: transferring to you, since I won't have any time to look at this this week.
Chris: transferring to you, since I've done nothing with this ticket over the last five weeks and it makes me twitch.
Marking as FIXED, since two of the three cases in this bug report are addressed. Frederic/Charles: if either of you feels strongly about the final case (displaying an invalid message), feel free to open a new bug report.