Bug 360375 – non-ascii subject is mangled




Summary:	non-ascii subject is mangled


Status:	RESOLVED FIXED

Product:	Pan
Classification:	Other
Component:	general
Version:	pre-1.0 betas
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	1.0
Assigned To:	Christophe Lambin
QA Contact:	Pan QA Team

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2006-10-07 12:36 UTC by Frederic Crozat
Modified:	2007-01-25 06:38 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
message causing mangled subject (1.56 KB, text/plain) 2006-10-07 12:37 UTC, Frederic Crozat
Patch to fix this problem. (620 bytes, patch) 2006-10-07 20:01 UTC, Christophe Lambin	none
Updated patch (1.19 KB, patch) 2006-10-08 18:55 UTC, Christophe Lambin	none
Updated patch (748 bytes, patch) 2006-10-09 18:48 UTC, Christophe Lambin	committed
Proposed additional patch against stock 0.117 (625 bytes, patch) 2006-10-27 23:17 UTC, Christophe Lambin	committed
message causing problem (2.15 KB, text/plain) 2006-11-05 21:10 UTC, Frederic Crozat

Description Frederic Crozat 2006-10-07 12:36:49 UTC

A subject containing MIME-encoded characters is not converted to UTF-8 correctly in the body pane (both in the full headers and in the Subject header). In the header list, however, it is converted correctly.

When writing a follow-up to such a message, the subject is also mangled in the reply composition window.

Attached is a message that triggers the bug.

Comment 1 Frederic Crozat 2006-10-07 12:37:21 UTC

Created attachment 74207 [details]
message causing mangled subject

Comment 2 Christophe Lambin 2006-10-07 12:49:11 UTC

Confirmed.  It does work correctly if the group's charset is set to UTF-8, but not when it's set to ISO-8859-15.

Comment 3 Frederic Crozat 2006-10-07 14:55:34 UTC

Unfortunately, my ISP has restricted UTF-8 message posts to only some groups, which is why most of my groups are not configured to use UTF-8.

Comment 4 Christophe Lambin 2006-10-07 15:31:51 UTC

Well, I was only leaving a breadcrumb for fixing this bug, not proposing a workaround.

Comment 5 Christophe Lambin 2006-10-07 20:01:18 UTC

Created attachment 74245 [details] [review]
Patch to fix this problem.

This patch fixes the problem: header_to_utf8() calls g_mime_utils_8bit_header_decode(), which already converts the subject to UTF-8. content_to_utf8() then attempts to convert it again, which corrupts the string.

I don't quite get the reason behind the original code, though: why still convert when the string's already utf-8 ?
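
The corruption described above can be reproduced outside Pan. The sketch below is only an illustration: latin1_to_utf8() is a hypothetical stand-in for an iconv-style conversion (it is not Pan's content_to_utf8()), and it uses ISO-8859-1 rather than ISO-8859-15 for simplicity, since "è" has the same byte value in both. Converting once gives valid UTF-8; converting the result a second time treats each UTF-8 byte as a separate Latin-1 character and produces the familiar "Ã¨" garbage.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an ISO-8859-1 -> UTF-8 conversion: every input
// byte is taken as one Latin-1 character and re-encoded as UTF-8.
std::string latin1_to_utf8(const std::string& in)
{
  std::string out;
  for (unsigned char c : in) {
    if (c < 0x80) {
      out += static_cast<char>(c);                 // ASCII passes through
    } else {
      out += static_cast<char>(0xC0 | (c >> 6));   // two-byte lead
      out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
    }
  }
  return out;
}

// Shows one correct conversion followed by the damaging second one.
void demo_double_conversion()
{
  const std::string raw = "Synth\xE8se";            // "Synthèse" in Latin-1
  const std::string once = latin1_to_utf8(raw);
  assert(once == "Synth\xC3\xA8se");                // valid UTF-8 for "Synthèse"

  const std::string twice = latin1_to_utf8(once);   // the bug: converting again
  assert(twice == "Synth\xC3\x83\xC2\xA8se");       // renders as "SynthÃ¨se"
}
```

The second conversion "succeeds" because every byte is a valid Latin-1 character, which is why the failure is silent.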

Comment 6 Charles Kerr 2006-10-08 02:22:29 UTC

Chris: g_mime_utils_8bit_header_decode() appears to only
convert the encoded parts to UTF-8.  I think the unencoded
parts are passed through unchanged.

Disgusting suggestion: before calling _header_decode(),
build a string that converts the non-encoded segments
into UTF-8. pass that string into _header_decode(),
so both the encoded and unencoded segments will have
been converted to utf-8.

Comment 7 Christophe Lambin 2006-10-08 08:28:21 UTC

You've lost me: are you addressing the case where a header is a mix of encoded and non-encoded 8-bit characters?

Comment 8 Charles Kerr 2006-10-08 16:58:42 UTC

Yes.  From my reading of g_mime_utils_8bit_header_decode(), it
looks like only the header parts inside the =? ?= block are
passed through iconv.
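
To make the mixed case concrete, here is a rough sketch that scans a raw header value and reports whether it contains "=?charset?encoding?text?=" tokens, raw 8-bit bytes outside such tokens, or both. It is a simplified scanner written for this illustration, not GMime's parser: scan_header() and encoded_word_end() are hypothetical names, and nothing is validated beyond the delimiters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of scanning one raw header value.
struct HeaderScan {
  bool has_encoded_word; // at least one "=?charset?enc?text?=" token was seen
  bool has_raw_8bit;     // at least one byte >= 0x80 outside such tokens
};

// Returns the index one past the encoded-word starting at pos, or npos if
// none starts there. Simplified: only the delimiters are checked.
std::size_t encoded_word_end(const std::string& h, std::size_t pos)
{
  const std::size_t npos = std::string::npos;
  if (h.compare(pos, 2, "=?") != 0) return npos;
  const std::size_t q1 = h.find('?', pos + 2);   // end of charset
  if (q1 == npos) return npos;
  const std::size_t q2 = h.find('?', q1 + 1);    // end of encoding letter
  if (q2 == npos) return npos;
  const std::size_t end = h.find("?=", q2 + 1);  // terminator
  return end == npos ? npos : end + 2;
}

HeaderScan scan_header(const std::string& h)
{
  HeaderScan r = { false, false };
  std::size_t i = 0;
  while (i < h.size()) {
    const std::size_t end = encoded_word_end(h, i);
    if (end != std::string::npos) {              // skip the whole token
      r.has_encoded_word = true;
      i = end;
      continue;
    }
    if (static_cast<unsigned char>(h[i]) >= 0x80)
      r.has_raw_8bit = true;                     // raw byte outside a token
    ++i;
  }
  return r;
}

void demo_scan_header()
{
  const HeaderScan mixed = scan_header("=?ISO-8859-1?Q?Synth=E8se?= caf\xE9");
  assert(mixed.has_encoded_word && mixed.has_raw_8bit);
}
```

A header for which both flags are set is the case under discussion: the part inside the =? ?= block and the raw part outside it each need their own conversion.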

Comment 9 Christophe Lambin 2006-10-08 18:55:13 UTC

Created attachment 74307 [details] [review]
Updated patch

OK, it'd be quite unusual to have non-encoded and encoded characters in the same header, but the updated patch addresses that.

Essentially, it converts to utf-8 first (which doesn't look at the encoded strings), and then calls g_mime_utils_8bit_header_decode (which doesn't look at the non-encoded characters).

This seems to work with encoded characters, non-encoded characters and a mix of both.

Comment 10 Charles Kerr 2006-10-08 23:27:26 UTC

I like the first part of that, but the second part
looks like it would cause a regression on bug #356835 .

Comment 11 Christophe Lambin 2006-10-09 18:48:28 UTC

Created attachment 74369 [details] [review]
Updated patch

You're right. It would be a regression. Fixed now.

Comment 12 Charles Kerr 2006-10-10 05:10:54 UTC

Looks good to me.  Feel free to commit.

(It's been a long time since I've said that.
Nice to have the code back into CVS... :)

Comment 13 Christophe Lambin 2006-10-10 05:42:33 UTC

Committed.

Comment 14 Frederic Crozat 2006-10-26 12:26:16 UTC

Reopening: with pan 0.117, the title is still mangled when replying to a message with a non-ASCII subject (check the title entry in the compose window).

Comment 15 Christophe Lambin 2006-10-26 16:08:20 UTC

works for me. example ?

Comment 16 Frederic Crozat 2006-10-27 13:08:25 UTC

OK, I found how to reproduce the problem:

You need to set the group's default charset to ISO-8859-15 (or ISO-8859-1) first; the following headers will then cause the title to be decoded incorrectly (I've anonymised the message):

Path: news.free.fr!not-for-mail
From: Foo Bar <foo@bar.net>
Newsgroups: proxad.test
Subject: Synthèse toto
Date: Wed, 30 Jun 2004 18:19:40 +0200
Organization: Free
Lines: 9
Sender: foo@bar.net
Message-ID: <cbup34$neb$1@news.free.fr>
NNTP-Posting-Host: foobar.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
NNTP-Posting-Date: Wed, 30 Jun 2004 16:17:41 +0000 (UTC)
User-Agent: Mozilla Thunderbird 0.6 (X11/20040502)
X-Accept-Language: en

If the group's default charset is UTF-8, the title is decoded correctly. Note that the title is not MIME-encoded but sent as raw 8-bit.

Comment 17 Christophe Lambin 2006-10-27 16:56:09 UTC

Confirmed.

Comment 18 Christophe Lambin 2006-10-27 19:07:59 UTC

grrr ... content_to_utf8() is a mess.  

The reason for this bug is that BodyPane :: create_followup_or_reply() calls g_mime_message_get_subject() and then converts it to utf8 (since it may still be in original charset).  The problem is that, if the returned string *is* in UTF-8, the content_to_utf8() will do a conversion from the article's charset to UTF-8 ... and succeed, thereby garbling the subject (similar to what's happening in bug #363268).

What content_to_utf8() should do is to not do any conversion if the message is already in utf-8, but it can't/doesn't because of bug #356835.

Charles: any thoughts?

Comment 19 Christophe Lambin 2006-10-27 23:17:45 UTC

Created attachment 75536 [details] [review]
Proposed additional patch against stock 0.117

Actually, we can do what the rest of body-pane.cc does and use g_mime_message_get_header(), since that returns the raw header, not converted to UTF-8.

BTW, I don't really get this code in BodyPane :: create_followup_or_reply()

    v = normalize_subject_re (h);
    std::string val (v.str, v.len);
    if (!val.find("RE:") || !val.find ("Re:"))
      val.replace (0, 3, "Re:"); // be polite & force lowercase 'e'
    else
      val.insert (0, "Re: "); // no Re: -- add one.

normalize_subject_re() already strips the leading 'Re:'. Why the code to do it again ?

Comment 20 Frederic Crozat 2006-10-28 13:16:22 UTC

I've just tested the patch and it fixes the issue.

Comment 21 Charles Kerr 2006-10-30 20:13:33 UTC

Chris: patch looks good to me.  Please commit.

normalize_subject_re() doesn't actually change the string, it just
shrinks StringView's view to prune redundant leading Re's.
We have to convert that StringView to a std::string before changing
case ("RE:" -> "Re:") or prepending an "Re: " if one isn't already there.
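
The view-then-copy pattern described here can be sketched as follows. This is an illustration under stated assumptions, not Pan's code: View is a minimal stand-in for StringView (same str/len members as in the snippet quoted in comment 19), prune_redundant_re() and make_reply_subject() are hypothetical names, and the pruning is assumed to leave at most one leading "Re:". Unlike the quoted snippet, the prefix test here is case-insensitive.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal stand-in for a StringView: a non-owning pointer/length pair.
struct View { const char* str; std::size_t len; };

// True if the view starts with "Re:" in any letter case.
static bool starts_with_re(const View& v)
{
  return v.len >= 3
      && (v.str[0] == 'R' || v.str[0] == 'r')
      && (v.str[1] == 'E' || v.str[1] == 'e')
      && v.str[2] == ':';
}

// Moves the start of the view past redundant leading "Re:" prefixes so that
// at most one remains. The underlying bytes are never modified.
View prune_redundant_re(View v)
{
  while (starts_with_re(v)) {
    View rest = { v.str + 3, v.len - 3 };
    while (rest.len > 0 && rest.str[0] == ' ') { ++rest.str; --rest.len; }
    if (!starts_with_re(rest)) break;   // v holds the last remaining prefix
    v = rest;                           // drop one redundant prefix
  }
  return v;
}

// Copies the pruned view into an owning string, then fixes up the prefix.
std::string make_reply_subject(const std::string& subject)
{
  const View in = { subject.data(), subject.size() };
  const View v = prune_redundant_re(in);
  std::string val(v.str, v.len);        // owning copy: now safe to edit
  if (starts_with_re(v))
    val.replace(0, 3, "Re:");           // normalise the case of the prefix
  else
    val.insert(0, "Re: ");              // no prefix yet: add one
  return val;
}

void demo_reply_subject()
{
  assert(make_reply_subject("RE: Re: hello") == "Re: hello");
  assert(make_reply_subject("hello") == "Re: hello");
}
```

The copy into std::string is the step that makes the edit legal: the view only points into the original header, so changing case or prepending text through it would either modify the source buffer or be impossible.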

Comment 22 Charles Kerr 2006-10-30 20:36:06 UTC

Chris: I've cleaned up normalize_subject_re() and its caller a bit
so that they're not so overlapping.

Comment 23 Christophe Lambin 2006-10-30 22:35:44 UTC

Committed.

Comment 24 Frederic Crozat 2006-11-05 11:17:43 UTC

And reopening again (I know, you're going to hate me :)

With pan 0.118, some posts made with Pan/0.14.2.91 are not decoded properly when the group is configured to use ISO-8859-1/15 instead of UTF-8:

From: Foo bar <news_01@boofar.net>
Subject: Re: Audits de sÃ©curitÃ©
Date: Sun, 05 Nov 2006 02:16:37 +0100
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table (Debian GNU/Linux))

In this case, the title is decoded correctly in the thread pane whether the group encoding is set to UTF-8 or ISO-8859-1 (or -15).

But in the message pane (and when replying to the message), the title is not decoded correctly if the group encoding is set to ISO-8859-1 (or -15). It is correct only with UTF-8.

Enjoy ;)

Comment 25 Christophe Lambin 2006-11-05 16:23:22 UTC

... oh boy. :)

Can you give me the message id + group, or preferably, attach the full message?

Comment 26 Frederic Crozat 2006-11-05 21:10:40 UTC

Created attachment 76052 [details]
message causing problem

here is a message causing problem

Comment 27 Christophe Lambin 2006-11-05 21:56:54 UTC

Hmm, that article:
- specifies a charset ISO-8859-15 in its content type
- has a body in ISO-8859-15 (correct)
- has a subject in UTF-8 (wrong)

To be honest, I don't see a clean solution to this and, because the article is essentially invalid, I'd be tempted to close that as WONTFIX.

Charles: any thoughts ?

Comment 28 Frederic Crozat 2006-11-06 06:51:33 UTC

Well, even if the message is invalid (it is ironic that it was sent by an earlier version of Pan ;), for full consistency, displaying its subject when the group is configured for UTF-8 should be broken too ;)
IMHO, the group encoding shouldn't affect message display when an encoding is specified in the message.

Possible heuristic to work around this problem:

If the group encoding is not set to UTF-8, the message charset is not UTF-8, and the subject passes g_utf8_validate(), don't try to convert it from the message charset to UTF-8.
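
A rough sketch of that heuristic, for discussion only. looks_like_utf8() is a simplified structural check standing in for g_utf8_validate() (it does not reject overlong forms or surrogates), heuristic_skips_conversion() is a hypothetical name, and the charset comparison is a plain string match on "UTF-8".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified stand-in for g_utf8_validate(): checks only that lead bytes
// are followed by the right number of continuation bytes.
bool looks_like_utf8(const std::string& s)
{
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t extra;
    if (c < 0x80)                extra = 0;   // ASCII
    else if ((c & 0xE0) == 0xC0) extra = 1;   // two-byte sequence
    else if ((c & 0xF0) == 0xE0) extra = 2;   // three-byte sequence
    else if ((c & 0xF8) == 0xF0) extra = 3;   // four-byte sequence
    else return false;                        // stray continuation byte
    if (i + extra >= s.size()) return false;  // sequence is truncated
    for (std::size_t k = 1; k <= extra; ++k)
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
        return false;                         // not a continuation byte
    i += extra + 1;
  }
  return true;
}

// True when the heuristic says: leave the subject alone, it is already UTF-8.
bool heuristic_skips_conversion(const std::string& group_charset,
                                const std::string& message_charset,
                                const std::string& raw_subject)
{
  return group_charset != "UTF-8"
      && message_charset != "UTF-8"
      && looks_like_utf8(raw_subject);
}

void demo_heuristic()
{
  // UTF-8 subject in a message that declares ISO-8859-15: skip conversion.
  assert(heuristic_skips_conversion("ISO-8859-15", "ISO-8859-15",
                                    "Audits de s\xC3\xA9" "curit\xC3\xA9"));
  // Genuine Latin-1 subject: not valid UTF-8, so it is still converted.
  assert(!heuristic_skips_conversion("ISO-8859-15", "ISO-8859-1",
                                     "Synth\xE8se toto"));
}
```

One caveat with any check of this kind: text in a 7-bit encoding (ISO-2022-JP, for instance) consists entirely of ASCII-range bytes, so it also passes validation even though it still needs converting.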

Comment 29 Christophe Lambin 2006-11-06 19:37:20 UTC

The problem with your workaround is that it breaks support for charsets that are utf-8 clean (see bug #356835).

charles: transferring to you, since I won't have any time to look at this this week.

Comment 30 Charles Kerr 2007-01-19 19:28:59 UTC

Chris: transferring to you, since I've done nothing with this ticket over
the last five weeks and it makes me twitch.

Comment 31 Christophe Lambin 2007-01-25 06:38:27 UTC

Marking as FIXED, since two of the three cases in this bug report are addressed.

Frederic/Charles: if either of you feels strongly about the final case (displaying an invalid message), feel free to open a new bug report.