Bug 360375 - non-ascii subject is mangled
Status: RESOLVED FIXED
Product: Pan
Classification: Other
Component: general
Version: pre-1.0 betas
Hardware: Other Linux
Importance: Normal normal
Target Milestone: 1.0
Assigned To: Christophe Lambin
QA Contact: Pan QA Team
Depends on:
Blocks:
 
 
Reported: 2006-10-07 12:36 UTC by Frederic Crozat
Modified: 2007-01-25 06:38 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
message causing mangled subject (1.56 KB, text/plain)
2006-10-07 12:37 UTC, Frederic Crozat
Patch to fix this problem. (620 bytes, patch)
2006-10-07 20:01 UTC, Christophe Lambin
Updated patch (1.19 KB, patch)
2006-10-08 18:55 UTC, Christophe Lambin
Updated patch (748 bytes, patch)
2006-10-09 18:48 UTC, Christophe Lambin, committed
Proposed additional patch against stock 0.117 (625 bytes, patch)
2006-10-27 23:17 UTC, Christophe Lambin, committed
message causing problem (2.15 KB, text/plain)
2006-11-05 21:10 UTC, Frederic Crozat

Description Frederic Crozat 2006-10-07 12:36:49 UTC
A subject containing MIME-encoded characters is not converted to UTF-8 correctly in the message body pane (both in the full headers and in the Subject header). In the header list, however, it is converted correctly.

If you write a follow-up to such a message, the subject is mangled in the reply composition window.

Attached is a message that triggers the bug.
Comment 1 Frederic Crozat 2006-10-07 12:37:21 UTC
Created attachment 74207
message causing mangled subject
Comment 2 Christophe Lambin 2006-10-07 12:49:11 UTC
Confirmed.  It does work correctly if the group's charset is set to UTF-8, but not when it's set to ISO-8859-15.
Comment 3 Frederic Crozat 2006-10-07 14:55:34 UTC
Unfortunately, my ISP has restricted UTF-8 message posts to only some groups, which is why most of my groups are not configured to use UTF-8.
Comment 4 Christophe Lambin 2006-10-07 15:31:51 UTC
Well, I was only leaving a breadcrumb to fix this bug, not proposing a workaround.
Comment 5 Christophe Lambin 2006-10-07 20:01:18 UTC
Created attachment 74245
Patch to fix this problem.

This patch fixes the problem: header_to_utf8() calls g_mime_utils_8bit_header_decode(), which has already converted the subject to UTF-8. content_to_utf8() then attempts to convert it again, which corrupts the string.

I don't quite get the reason behind the original code, though: why still convert when the string is already UTF-8?
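
As a minimal illustration of the double conversion described above (not Pan's actual code), the sketch below uses a hypothetical latin1_to_utf8() helper and assumes the article charset is ISO-8859-1. Because every byte is a valid ISO-8859-1 character, the second pass cannot fail; it simply re-encodes each byte of the existing UTF-8 sequence.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Pan or GMime function): treat every byte of
// `in` as an ISO-8859-1 code point and emit its UTF-8 encoding.
// This conversion never fails, which is why the bug goes unnoticed.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII stays as-is
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}

// Converting once is correct; converting the result again garbles it.
void demo_double_conversion() {
    const std::string latin1 = "Synth\xE8se";  // 'è' is the single byte 0xE8
    const std::string once = latin1_to_utf8(latin1);
    const std::string twice = latin1_to_utf8(once);
    assert(once == "Synth\xC3\xA8se");           // valid UTF-8 for "Synthèse"
    assert(twice == "Synth\xC3\x83\xC2\xA8se");  // displays as "SynthÃ¨se"
}
```

The second result is the typical mangled form: the two UTF-8 bytes of 'è' are each treated as a separate ISO-8859-1 character.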
Comment 6 Charles Kerr 2006-10-08 02:22:29 UTC
Chris: g_mime_utils_8bit_header_decode() appears to only
convert the encoded parts to UTF-8.  I think the unencoded
parts are passed through unchanged.

Disgusting suggestion: before calling _header_decode(),
build a string that converts the non-encoded segments
into UTF-8. pass that string into _header_decode(),
so both the encoded and unencoded segments will have
been converted to utf-8.
Comment 7 Christophe Lambin 2006-10-08 08:28:21 UTC
You've lost me: are you addressing the case where a header is a mix of encoded and non-encoded 8-bit characters?

Comment 8 Charles Kerr 2006-10-08 16:58:42 UTC
Yes.  From my reading of g_mime_utils_8bit_header_decode(), it
looks like only the header parts inside the =? ?= block are
passed through iconv.

Comment 9 Christophe Lambin 2006-10-08 18:55:13 UTC
Created attachment 74307
Updated patch

OK, it'd be quite unusual to have non-encoded and encoded characters in the same header, but the updated patch addresses that.

Essentially, it converts to utf-8 first (which doesn't look at the encoded strings), and then calls g_mime_utils_8bit_header_decode (which doesn't look at the non-encoded characters).

This seems to work with encoded characters, non-encoded characters and a mix of both.
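
The two-step order described above can be sketched as follows. This is not the attached patch: all function names are hypothetical, the fallback charset is assumed to be ISO-8859-1, and only Q-encoded words in ISO-8859-1 or UTF-8 are handled. Step 1 leaves encoded words alone because they are pure ASCII; step 2 only touches the =?...?= blocks.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Assumed fallback charset: ISO-8859-1, where each byte is one code point.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Decode a Q-encoded payload: '_' means space, "=XX" is one hex-encoded byte.
std::string decode_q(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const bool hex_escape = text[i] == '=' && i + 2 < text.size() &&
            std::isxdigit(static_cast<unsigned char>(text[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(text[i + 2]));
        if (hex_escape) {
            out += static_cast<char>(std::stoi(text.substr(i + 1, 2), nullptr, 16));
            i += 2;
        } else {
            out += (text[i] == '_') ? ' ' : text[i];
        }
    }
    return out;
}

bool is_utf8_name(std::string name) {
    for (char& c : name) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return name == "utf-8" || name == "utf8";
}

// Replace each =?charset?Q?payload?= block by its UTF-8 text. Simplified:
// whitespace between adjacent encoded words is kept, B-encoding is left as-is.
std::string decode_encoded_words(const std::string& header) {
    const std::size_t npos = std::string::npos;
    std::string out;
    std::size_t pos = 0;
    while (pos < header.size()) {
        const std::size_t start = header.find("=?", pos);
        if (start == npos) break;
        const std::size_t q1 = header.find('?', start + 2);  // end of charset
        const std::size_t q2 = (q1 == npos) ? npos : header.find('?', q1 + 1);
        const std::size_t end = (q2 == npos) ? npos : header.find("?=", q2 + 1);
        if (end == npos) break;
        out.append(header, pos, start - pos);  // text before the encoded word
        const std::string charset = header.substr(start + 2, q1 - start - 2);
        const std::string encoding = header.substr(q1 + 1, q2 - q1 - 1);
        const std::string payload = header.substr(q2 + 1, end - q2 - 1);
        if (encoding == "Q" || encoding == "q") {
            const std::string bytes = decode_q(payload);
            out += is_utf8_name(charset) ? bytes : latin1_to_utf8(bytes);
        } else {
            out.append(header, start, end + 2 - start);  // unsupported: keep raw
        }
        pos = end + 2;
    }
    out.append(header, pos, npos);  // remaining text after the last word
    return out;
}

std::string header_to_utf8_sketch(const std::string& raw) {
    const std::string step1 = latin1_to_utf8(raw);  // raw 8-bit bytes only
    return decode_encoded_words(step1);             // =?...?= blocks only
}

void demo_mixed_header() {
    // Raw 0xE8 outside an encoded word, plus a Q-encoded word with two 0xE9 bytes.
    const std::string raw = "Synth\xE8se =?ISO-8859-1?Q?=E9t=E9?=";
    assert(header_to_utf8_sketch(raw) == "Synth\xC3\xA8se \xC3\xA9t\xC3\xA9");
}
```

Note that step 1 assumes the raw bytes really are in the fallback charset; if they were already UTF-8, it would garble them.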
Comment 10 Charles Kerr 2006-10-08 23:27:26 UTC
I like the first part of that, but the second part
looks like it would cause a regression on bug #356835 .
Comment 11 Christophe Lambin 2006-10-09 18:48:28 UTC
Created attachment 74369
Updated patch

You're right. It would be a regression. Fixed now.
Comment 12 Charles Kerr 2006-10-10 05:10:54 UTC
Looks good to me.  Feel free to commit.

(It's been a long time since I've said that.
Nice to have the code back into CVS... :)
Comment 13 Christophe Lambin 2006-10-10 05:42:33 UTC
Committed.

Comment 14 Frederic Crozat 2006-10-26 12:26:16 UTC
Reopening: with pan 0.117, the title is still mangled when replying to a message with a non-ASCII subject (check the title entry in the compose window).
Comment 15 Christophe Lambin 2006-10-26 16:08:20 UTC
works for me. example ?
Comment 16 Frederic Crozat 2006-10-27 13:08:25 UTC
OK, I found out how to reproduce the problem:

You need to set the group's default charset to ISO-8859-15 (or ISO-8859-1) first; then the following headers will cause the title to be decoded incorrectly (I've anonymised the message):

Path: news.free.fr!not-for-mail
From: Foo Bar <foo@bar.net>
Newsgroups: proxad.test
Subject: Synthèse toto
Date: Wed, 30 Jun 2004 18:19:40 +0200
Organization: Free
Lines: 9
Sender: foo@bar.net
Message-ID: <cbup34$neb$1@news.free.fr>
NNTP-Posting-Host: foobar.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
NNTP-Posting-Date: Wed, 30 Jun 2004 16:17:41 +0000 (UTC)
User-Agent: Mozilla Thunderbird 0.6 (X11/20040502)
X-Accept-Language: en

If the group's default charset is UTF-8, the title is decoded correctly. Note that the title is not MIME-encoded; it is sent as raw 8-bit text.

Comment 17 Christophe Lambin 2006-10-27 16:56:09 UTC
Confirmed.

Comment 18 Christophe Lambin 2006-10-27 19:07:59 UTC
grrr ... content_to_utf8() is a mess.  

The reason for this bug is that BodyPane::create_followup_or_reply() calls g_mime_message_get_subject() and then converts the result to UTF-8 (since it may still be in the original charset). The problem is that, if the returned string *is* already in UTF-8, content_to_utf8() will still do a conversion from the article's charset to UTF-8 ... and succeed, thereby garbling the subject (similar to what's happening in bug #363268).

What content_to_utf8() should do is to not do any conversion if the message is already in utf-8, but it can't/doesn't because of bug #356835.

Charles: any thoughts? 
Comment 19 Christophe Lambin 2006-10-27 23:17:45 UTC
Created attachment 75536
Proposed additional patch against stock 0.117

Actually, we can do what the rest of body-pane.cc does and use g_mime_message_get_header(), since that returns the raw header, not converted to UTF-8.

BTW, I don't really get this code in BodyPane::create_followup_or_reply():

    v = normalize_subject_re (h);
    std::string val (v.str, v.len);
    if (!val.find("RE:") || !val.find ("Re:"))
      val.replace (0, 3, "Re:"); // be polite & force lowercase 'e'
    else
      val.insert (0, "Re: "); // no Re: -- add one.

normalize_subject_re() already strips the leading 'Re:'. Why the code to do it again ?
Comment 20 Frederic Crozat 2006-10-28 13:16:22 UTC
I've just tested the patch and it fixes the issue.
Comment 21 Charles Kerr 2006-10-30 20:13:33 UTC
Chris: patch looks good to me.  Please commit.

normalize_subject_re() doesn't actually change the string, it just
shrinks StringView's view to prune redundant leading Re's.
We have to convert that StringView to a std::string before changing
case ("RE:" -> "Re:") or prepending an "Re: " if one isn't already there.
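
The view-versus-copy distinction can be sketched with std::string_view standing in for Pan's StringView. The function names are hypothetical, and for simplicity this sketch strips every leading "Re:", which may differ from what normalize_subject_re() actually does.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for normalize_subject_re(): narrows the view past
// leading "Re:" prefixes (any case) without modifying the underlying bytes.
std::string_view strip_leading_re(std::string_view subject) {
    for (;;) {
        while (!subject.empty() && subject.front() == ' ')
            subject.remove_prefix(1);
        const bool has_re = subject.size() >= 3 &&
            (subject[0] == 'R' || subject[0] == 'r') &&
            (subject[1] == 'E' || subject[1] == 'e') &&
            subject[2] == ':';
        if (!has_re)
            return subject;
        subject.remove_prefix(3);
    }
}

// A view cannot be edited, so copy it into an owned std::string before
// prepending the canonical "Re: ".
std::string make_reply_subject(std::string_view original) {
    const std::string_view core = strip_leading_re(original);
    assert(core.size() <= original.size());  // the view only ever shrinks
    std::string val(core);
    val.insert(0, "Re: ");
    return val;
}
```

For example, make_reply_subject("RE: Re: Audits") yields "Re: Audits", while the original buffer is left untouched.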
Comment 22 Charles Kerr 2006-10-30 20:36:06 UTC
Chris: I've cleaned up normalize_subject_re() and its caller a bit
so that they're not so overlapping.
Comment 23 Christophe Lambin 2006-10-30 22:35:44 UTC
Committed.
Comment 24 Frederic Crozat 2006-11-05 11:17:43 UTC
And reopening again (I know, you're going to hate me :)

With pan 0.118, some posts made with Pan/0.14.2.91 are not decoded properly when the group is configured to use ISO-8859-1/15 instead of UTF-8:

From: Foo bar <news_01@boofar.net>
Subject: Re: Audits de sécurité
Date: Sun, 05 Nov 2006 02:16:37 +0100
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table (Debian GNU/Linux))

In this case, the title is decoded correctly in the thread pane whether the group encoding is set to UTF-8 or ISO-8859-1 (or -15).

But in the message pane (and when replying to the message), the title is not decoded correctly if the group encoding is set to ISO-8859-1 (or -15). It is correct only with UTF-8.

Enjoy ;)
Comment 25 Christophe Lambin 2006-11-05 16:23:22 UTC
... oh boy. :)

Can you give me the message id + group, or preferably, attach the full message?
Comment 26 Frederic Crozat 2006-11-05 21:10:40 UTC
Created attachment 76052
message causing problem

here is a message causing problem
Comment 27 Christophe Lambin 2006-11-05 21:56:54 UTC
Hmm, that article:
- specifies a charset ISO-8859-15 in its content type
- has a body in ISO-8859-15 (correct)
- has a subject in UTF-8 (wrong)

To be honest, I don't see a clean solution to this and, because the article is essentially invalid, I'd be tempted to close that as WONTFIX.

Charles: any thoughts ?
Comment 28 Frederic Crozat 2006-11-06 06:51:33 UTC
Well, even if the message is invalid (it is ironic that it was sent by an earlier version of pan ;), for full consistency, displaying its subject should then be broken when the group is configured for UTF-8 too ;)
IMHO, the group encoding shouldn't affect message display when an encoding is specified in the message.

Possible heuristic to work around this problem:

If the group encoding is not set to UTF-8, the message charset is not UTF-8, and the subject passes g_utf8_validate(), don't try to convert it from the message charset to UTF-8.
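
A rough sketch of this heuristic, under stated assumptions: looks_like_utf8() is a simplified stand-in for g_utf8_validate(), the declared charset is assumed to be ISO-8859-1 when it is not UTF-8, and the group-encoding condition is left out for brevity. None of these names come from Pan.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Structural UTF-8 check: each lead byte must be followed by the right number
// of continuation bytes. Simplified: unlike g_utf8_validate(), it does not
// reject every overlong or surrogate encoding.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;
        if (i + extra >= s.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// Assumed message charset for this sketch: ISO-8859-1.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical wrapper: skip the conversion when the subject already
// validates as UTF-8, even though the declared charset says otherwise.
std::string subject_to_utf8(const std::string& raw, bool declared_is_utf8) {
    if (declared_is_utf8 || looks_like_utf8(raw))
        return raw;
    std::string out = latin1_to_utf8(raw);
    assert(looks_like_utf8(out));  // the conversion always yields valid UTF-8
    return out;
}
```

One known limit of this check: any pure-ASCII byte sequence passes it, so text in a 7-bit encoding would also be left unconverted.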
Comment 29 Christophe Lambin 2006-11-06 19:37:20 UTC
The problem with your workaround is that it breaks support for charsets that are utf-8 clean (see bug #356835).

charles: transferring to you, since I won't have any time to look at this this week.
Comment 30 Charles Kerr 2007-01-19 19:28:59 UTC
Chris: transferring to you, since I've done nothing with this ticket over
the last five weeks and it makes me twitch.
Comment 31 Christophe Lambin 2007-01-25 06:38:27 UTC
Marking as FIXED, since two of the three cases in this bug report are addressed.

Frederic/Charles: if either of you feels strongly about the final case (displaying an invalid message), feel free to open a new bug report.