Bug 224026 – try harder to not send headers in UTF-8



Summary:	try harder to not send headers in UTF-8


Status:	RESOLVED FIXED

Product:	evolution
Classification:	Applications
Component:	Mailer
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	Future
Assigned To:	Jeffrey Stedfast
QA Contact:	Evolution QA team

URL:
Whiteboard:

Duplicates:	242549 252624 (view as bug list)
Depends on:	223988
Blocks:

Reported:	2002-04-30 06:09 UTC by Xavier Cho
Modified:	2007-12-26 02:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
24026.patch (work-in-progress fix - attaching now before I lose it) (8.67 KB, patch) 2003-10-16 17:42 UTC, Jeffrey Stedfast	none
save prefer encoding in mime-message-object, and then encode by using it. (8.45 KB, patch) 2004-05-01 04:59 UTC, KANDA Daisuke	rejected
gmime subject encoding test programs. the file is encoded in euc-jp. (1.23 KB, text/plain) 2004-05-02 08:41 UTC, KANDA Daisuke
24026.patch (10.84 KB, patch) 2004-07-22 17:34 UTC, Jeffrey Stedfast	needs-work
update feji's patch against eds. (9.68 KB, patch) 2005-12-22 07:08 UTC, Hiroyuki Ikezoe	none

Description Xavier Cho 2002-04-30 06:09:57 UTC

I've found that Evolution sends mail with UTF-8 encoded headers, which
show up broken in the Outlook Express 6.0 mail client. It seems OE doesn't
recognize UTF-8 yet, but this is a major problem for Evolution users, since if
the recipient replies to the broken message, Evolution renders
the reply completely unreadable (see #23988).

I hope both contents and headers can be encoded with the same charset, which
could be set from the Edit menu of the new message window.

Without this feature, I have to forward all reply messages to some web mail
site to read them.

Could you please fix this? Thanks.

=================================================
Korean GNOME Community - http://gnome.or.kr

Comment 1 Jeffrey Stedfast 2002-04-30 06:26:52 UTC

1. this is Outlook's problem
2. we only encode headers in UTF-8 if we can't squeeze them into
something else.

Comment 2 Xavier Cho 2002-04-30 07:05:11 UTC

Surely it's an Outlook problem, but isn't it really a big problem if I
can't exchange messages with the more than 90% of internet users who
use Outlook as their default mail client? (Replies will be
completely unreadable in Evolution.)

Why is it not possible to encode headers using the encoding chosen by
the user or the system default?

A mail client that fails to show replies is simply unusable; who
causes the problem is of no concern to end users. It's really
regrettable that, after all those nice Korean-related bugfixes, Evolution
still fails to be even a usable mail client for Korean users.

I hope there can be a workaround for this problem.

Comment 3 kz 2002-04-30 09:31:03 UTC

jeff: I think you have an opinion on this issue. What do you think?

Comment 4 kz 2002-04-30 09:44:49 UTC

oops. jeff is already here. :)

If I remember correctly, jeff, you said that gmime would be flexible
about user-defined encodings,
and I see this issue applies to gmime too.
Why not let users freely choose their preferred encoding? ;)

Comment 5 Jeffrey Stedfast 2002-04-30 18:22:27 UTC

gmime does the same as evolution actually. It has a table of some
charsets and tries to guess the most appropriate charset to encode the
text into. If it can't find one, it too will use UTF-8.

The difference between gmime and Evolution is that gmime also includes
some multibyte charsets in the table, whereas Evolution doesn't. I
think the reason is that "if a subject header, for example, has
text in Greek and Japanese, it would encode as Shift-JIS rather than
encoding in UTF-8 like it should, since Greek will fit into Shift-JIS"
or some such.

I'll look into re-adding some multibyte charsets to the tables
(including euc-kr) but no guarantees.
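
The Greek-plus-Japanese concern can be illustrated with a minimal sketch of a table-driven guess. Everything here is hypothetical (the table, the coverage ranges, and the names guess_charset and charsets_for are made up for illustration and are not the camel or gmime code): each character maps to a bitmask of charsets assumed to cover it, the masks are intersected, and the first surviving table entry wins, with UTF-8 as the fallback.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical charset table: one bit per charset, in order of preference. */
enum {
    CS_ISO_8859_1 = 1u << 0,
    CS_ISO_8859_7 = 1u << 1,
    CS_SHIFT_JIS  = 1u << 2,
    CS_ALL        = CS_ISO_8859_1 | CS_ISO_8859_7 | CS_SHIFT_JIS
};

static const struct {
    unsigned int bit;
    const char *name;
} charset_table[] = {
    { CS_ISO_8859_1, "ISO-8859-1" },
    { CS_ISO_8859_7, "ISO-8859-7" },
    { CS_SHIFT_JIS,  "Shift_JIS"  },
};

/* Very rough coverage rules over Unicode code points, for illustration only. */
static unsigned int
charsets_for (uint32_t c)
{
    if (c < 0x80)
        return CS_ALL;                        /* ASCII: treated as fitting everywhere */
    if (c >= 0xA0 && c <= 0xFF)
        return CS_ISO_8859_1;                 /* Latin-1 supplement */
    if (c >= 0x391 && c <= 0x3C9)
        return CS_ISO_8859_7 | CS_SHIFT_JIS;  /* Greek letters (approximate) */
    if (c >= 0x3041 && c <= 0x3093)
        return CS_SHIFT_JIS;                  /* hiragana */
    return 0;                                 /* not covered by this toy table */
}

const char *
guess_charset (const uint32_t *text, size_t len)
{
    unsigned int mask = CS_ALL;
    size_t i;

    /* Intersect the candidate sets of every character. */
    for (i = 0; i < len; i++)
        mask &= charsets_for (text[i]);

    /* The first table entry that survived the intersection wins. */
    for (i = 0; i < sizeof (charset_table) / sizeof (charset_table[0]); i++) {
        if (charset_table[i].bit & mask)
            return charset_table[i].name;
    }

    return "UTF-8";
}
```

With this toy table, Greek-only text picks ISO-8859-7, but once a multibyte charset is in the table, Greek mixed with hiragana picks Shift_JIS instead of falling through to UTF-8, which is the behaviour described above.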

Comment 6 Jeffrey Stedfast 2002-04-30 18:26:18 UTC

As to why we don't encode to user's locale: you cannot guarantee that
it *will* encode to the user's locale. With people communicating
across locale boundaries, it is very likely that a header will
contain text in multiple locales that will not fit into a single
charset.

anyways, it's still my feeling that Outlook 6 (especially since it was
released after 2000) *should* know UTF-8 - it's just pure laziness on
their part for not supporting it. All clients should support UTF-8
these days :\

Comment 7 kz 2002-04-30 22:39:11 UTC

If I could set the charset for the message view, like Mozilla allows me
to do for a page, I could read almost every broken message by switching
to another charset.

IMHO, UTF-8 is at an early stage of adoption. It causes a lot of trouble
for native locales, so it would be good to allow workarounds for the moment.
(I know. OE is an out-dated and f*cking buggy product.)

Comment 8 Xavier Cho 2002-05-01 01:50:12 UTC

Evolution already lets users choose the encoding to use for the
message content; only the encoding for headers is the problem.

But from Jeff's comment, I see why Evolution doesn't allow
changing the header encoding. And if euc-kr is included in
the charset table, that could be a workaround for this problem.

I know Outlook 6 *SHOULD* support UTF-8, and that not doing so is sheer
laziness on the part of the development team. But if everybody uses
Outlook, then why should they change? Even if someone develops
a web-based mail client, he will certainly test against Outlook
but probably not against Evolution. So virtually all day-to-day
email traffic can be handled by Outlook without a problem,
however seriously flawed the mail or its client program may be.

It's like the browser wars. Only minority products like Evolution (or
Netscape) fail miserably on some of that mail. It's not their
fault. But an end user like me, who puts those minority
products to daily work, can't accept a browser that fails to show
20% of web pages or a mail client that cannot read
some replies.

I can't force you to do this or that. But for the end users'
sake, please give them at least a workaround so they can
use Evolution in their day-to-day work.

Comment 9 Not Zed 2002-05-01 09:43:12 UTC

Perhaps we could try to encode in the user's locale, and fall back to the
charset check if it fails.  It wouldn't be that hard to add to the
rfc2047 encoder, would it?

I thought it already did something like this anyway.

As for Outlook Express, please open a support request with them; it's
not our problem (tm).

Comment 10 Jeffrey Stedfast 2002-05-01 17:59:29 UTC

notzed: yea, maybe that would be better than adding multibyte charsets
to camel_charset_best()?

Comment 11 Thomas O'Dowd 2002-05-03 08:13:43 UTC

Subjects with Japanese in them mailed to NTT DoCoMo phones also appear
garbled as the UTF-8 charset is not understood. The body is of course
encoded in shift-jis/euc or iso-2022-jp and works, but UTF-8 doesn't.
Encoding the subject in one of those charsets works as expected.

Comment 12 Not Zed 2002-06-11 12:21:37 UTC

Well, I guess that's another company that should be fixing their code then.

Comment 13 Xavier Cho 2002-06-12 01:00:07 UTC

Yahoo mail also doesn't handle utf-8 correctly.

Yes, they will also have to change their code to support utf-8 someday, but
do you think we Evolution users shouldn't mail anyone using Outlook
Express, Yahoo mail, or whatever mail client is not as up to date as
Evolution until they all fix their code?

If we can't use it in our day-to-day work, then why are you developing
this? I know this is not the place to start a long debate, but I was
quite shocked to see how usability could be neglected like this. If you
think it's not a problem that users of multibyte languages can't
use Evolution, fine! We won't use it. There are at least a couple of
alternatives out there which care more about international users.

Comment 14 Jeffrey Stedfast 2002-06-12 01:08:28 UTC

I'm still looking into a fix for this btw. In fact I have a fix, but
it is not ideal. I think what we want to do is to try and use the same
charset that the user chose for the message-body, but I'm not sure how
to go about doing that since I don't think the headers can easily get
at that information without rearranging some code.

Comment 15 Not Zed 2002-07-24 12:33:52 UTC

this is not a 1.2 bug because it's not our bug.

Comment 16 Thomas O'Dowd 2002-07-24 13:03:30 UTC

I disagree and think that this is totally an Evolution bug. All the
Japanese mail software that I have had the pleasure of using uses the
iso-2022-jp, sjis or euc encodings. Most email is sent using
iso-2022-jp, but people do send using the others too.

Evolution does this correctly in the body of mails already. Headers,
on the other hand, in Japan are basically always encoded using the
same encoding as used in the body. If you are sending a mail in
Japanese and select the encoding iso-2022-jp, you usually also type in
a Japanese subject line and expect it to use the same encoding. I know
that in a perfect world it shouldn't matter, but it does in Japan.

So, here is what I believe should happen... You type in an email in
Japanese, you select iso-2022-jp (or maybe it's the default already) and
Evolution tries to encode the subject using the same encoding. If it
can't because someone has typed in "Greek" (not likely), then fall back
to utf-8 if you really want to. Personally, I'd prefer that Evolution
complain and ask me to select a different charset. But I guess it
won't really happen that often anyway.

This is important for us over here. Right now, I can't use Japanese
subject lines as most Japanese email programs don't understand utf-8. 

By the way, most of the existing Japanese command line encoding tools
on linux also don't understand utf-8 and can't translate it to euc for
me. I'm talking about nkf in particular here. And as another aside,
there are still encoding/decoding issues with sjis->utf-8 and
euc->utf-8 and vice versa. This is another reason it's not used very
often as an encoding in Japan yet.

So, we'd all be grateful over here in Japan if Evolution could sort
this out for us... Thanks!!!

Comment 17 Xavier Cho 2002-07-25 01:04:57 UTC

notzed: Call it an enhancement request if you don't want to admit it's
a bug. I also don't think it's Evolution's fault.

But do you really think it's ok to exclude Asian users from using
Evolution at all? Consider how much effort has been put into i18n in
GNOME2/GTK2, and also Evolution's current position as the default
mail client for GNOME. I believe Evolution is something more than just
some hackers' leisure time hobby.

And if you still insist it's none of your problem, please think about
why Mozilla had to include a compatibility rendering mode in its new
beta release.

Comment 18 kz 2002-07-25 13:42:39 UTC

xavier:
please calm down. :)
I believe notzed did not mean to say 'multibyte users go home',
but rather 'OE sucks, and UTF-8 is the future.'

notzed:
but you'd better think about backward compatibility once more.
UTF-8 is the future. yeah, I'd bet on it.
but it's in an early (very early) stage of adoption.

the trouble with native-to-utf8 locales is system-wide, not limited to one app.
gtk2/gnome2 has a phantom menace in this issue,
and FreeBSD and the rest of the *nix family are also not familiar with UTF-8 yet.
please take this into consideration.

Comment 19 Jeffrey Stedfast 2003-05-09 01:25:42 UTC

*** bug 242549 has been marked as a duplicate of this bug. ***

Comment 20 Wesley Tanaka 2003-06-17 15:31:35 UTC

I filed bug 244991.  It may possibly be a duplicate of this one, but
they sound different because this one reportedly does not affect the
message body.

Comment 21 Yanko Kaneti 2003-07-21 10:49:05 UTC

Just a note that "Headers encoded differently than the forced message
body" also affects windows-1251 users like us - Bulgarians. Currently
in 1.4.3 a cyrillic subject is encoded in koi8-r.

Comment 22 Jeffrey Stedfast 2003-07-21 15:18:54 UTC

koi8-r is probably the better choice of charset to use tho... since
windows charsets are not always available on Unix systems, they
should be avoided if possible anyway.

Comment 23 Jeffrey Stedfast 2003-10-16 17:42:18 UTC

Created attachment 42989
24026.patch (work-in-progress fix - attaching now before I lose it)

Comment 24 Jeffrey Stedfast 2004-01-06 13:17:40 UTC

*** bug 252624 has been marked as a duplicate of this bug. ***

Comment 25 KANDA Daisuke 2004-05-01 04:59:28 UTC

Created attachment 43652
save prefer encoding in mime-message-object, and then encode by using it.

Comment 26 KANDA Daisuke 2004-05-01 05:32:01 UTC

I've made and posted a patch against Evolution-1.4.6.

The approach is to save the encodings used to encode the message body
and use them to encode the subject.

I think the way evolution-1.4.6 determines the encoding is not right.
From a glance at the evolution source code, in camel_charset_best() and
camel-charset-private.h (generated by camel-charset-map.c#main()), I
think Evolution determines the language by looking at which characters
appear in the string.
It is impossible to detect which language is used by looking at the
character data alone, because CJK ideographs share the same code points
in Unicode.

- add a new field prefer_charsets to the CamelMimeMessage struct
- set prefer_charsets when creating the CamelMimeMessage instance (in
e-msg-composer.c#build_message())
- create a new function camel_charset_select(), which first tries to encode
with prefer_charsets and then passes to camel_charset_best() if that fails.
- replace the call to camel_charset_best() with camel_charset_select()
where the subject is encoded (camel-mime-utils.c#header_encode_string()).
(it would be better to use this mechanism for other headers such as From: too)


I hope this patch gets merged to help all CJK Evolution users.

Comment 27 Jeffrey Stedfast 2004-05-01 14:29:52 UTC

by default, the first time a user starts up evolution (at least on a
new distro), their charset settings will default to UTF-8.

your patch breaks the current logic to encode text using iso-8859-* if
at all possible which breaks the following rule from rfc2047, Section 3:

   When there is a possibility of using more than one character set to
   represent the text in an 'encoded-word', and in the absence of
   private agreements between sender and recipients of a message, it is
   recommended that members of the ISO-8859-* series be used in
   preference to other character sets.

A similar rule that should be applied to cjk charsets is that
Evolution should encode using one of the universally accepted charsets
 for internet use if at all possible. Since the user can enter in
anything for his charsets, we cannot possibly control that if we were
to use your patch.

I've got a fix for this very issue that does not break the rules of
rfc2047 and at the same time fixes some other issues with the current
charset stuff, I just haven't committed it because I need to get
Michael to review the changes (it's a fairly large change).

If you'd like to test it out, check out the gmime module from gnome cvs.

Comment 28 KANDA Daisuke 2004-05-02 04:10:04 UTC

> your patch breaks the current logic to encode text using iso-8859-* if
> at all possible which breaks the following rule from rfc2047, Section 3:

There are some fixes that could be applied to my patch: 1) add ISO-8859-*
to the front of the prefer-charsets list, 2) in my charset_select()
function, first call charset_best() and, if it returns "UTF-8", try to
encode with prefer-charsets, 3) set the Evolution default charset to
"none" instead of "UTF-8".

> A similar rule that should be applied to cjk charsets is that
> Evolution should encode using one of the universally accepted charsets
> for internet use if at all possible. Since the user can enter in
> anything for his charsets, we cannot possibly control that if we were
> to use your patch.
I think the quotation from RFC2047 section 3 is not a universal rule
but just a European local rule.

There are no universally accepted charsets in CJK. Of course Unicode/UTF-8
could be one, but we are talking about MUAs which cannot
recognize UTF-8. Each CJK region has its own charsets, and one region
cannot understand the other regions' charsets. That is an important
difference from Europe.

I've checked out and tested gmime, and it successfully encodes with ISO-2022-JP.
But I suspect the subject would also be encoded with ISO-2022-JP for
Chinese or Korean users if the subject string is encodable in ISO-2022-JP.
If so, CK users will surely reject such an MUA.

I say again that it is impossible to detect which language is used by
looking at the characters, because CJK ideographs share the same code
points in Unicode.

And gmime's merging mechanism is not promising. At least, many programs
which handle mail in Japan expect ideographic characters to be
encoded and ASCII characters not to be. For example, a subject string such as
"Re: XXX" (where XXX stands for ideographic characters) should be encoded as "Re:
=?iso-2022-jp?b?...?=".

Comment 29 Jeffrey Stedfast 2004-05-02 06:01:23 UTC

gmime has logic to choose the proper cjk charset based on the locale
lang, so that is not an issue.

as for the merging... huh? it splits/merges words the way rfc2047
describes. if other mailers can't handle that, then that's their
fault, not gmime's.

gmime will encode:

ascii-foo <multibyte-foo>

as

ascii-foo =?charset?b?...?=

so I have no idea what you are talking about.

anyways, I much prefer gmime's solution and it works without having to
add kludgy interfaces to CamelMimeMessage.
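
For readers unfamiliar with the "ascii-foo =?charset?b?...?=" shape, here is a minimal sketch of building a single B-encoded word (base64 payload) in the form RFC 2047 defines. The function name encode_word_b is made up for illustration and is not a gmime or camel API; the bytes passed in are assumed to be already converted to the named charset, and the 75-character limit on an encoded-word is not enforced.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/*
 * Write "=?charset?B?<base64 of bytes>?=" into out as a NUL-terminated string.
 * Returns the length written (excluding the NUL), or 0 if out is too small.
 */
size_t
encode_word_b (const char *charset, const unsigned char *bytes, size_t len,
               char *out, size_t outsize)
{
    /* "=?" + charset + "?B?" + base64 + "?=" + NUL */
    size_t need = 2 + strlen (charset) + 3 + ((len + 2) / 3) * 4 + 2 + 1;
    size_t i;
    char *p = out;

    if (outsize < need)
        return 0;

    p += sprintf (p, "=?%s?B?", charset);

    /* Standard base64: 3 input bytes become 4 output characters. */
    for (i = 0; i < len; i += 3) {
        unsigned long chunk = (unsigned long) bytes[i] << 16;

        if (i + 1 < len)
            chunk |= (unsigned long) bytes[i + 1] << 8;
        if (i + 2 < len)
            chunk |= bytes[i + 2];

        *p++ = b64_alphabet[(chunk >> 18) & 0x3f];
        *p++ = b64_alphabet[(chunk >> 12) & 0x3f];
        *p++ = (i + 1 < len) ? b64_alphabet[(chunk >> 6) & 0x3f] : '=';
        *p++ = (i + 2 < len) ? b64_alphabet[chunk & 0x3f] : '=';
    }

    strcpy (p, "?=");
    return (size_t) (p - out) + 2;
}
```

For instance, the two bytes "hi" with charset "UTF-8" come out as =?UTF-8?B?aGk=?=. A header built the way described above would leave the ASCII word untouched and append one such encoded word for the multibyte run.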

Comment 30 KANDA Daisuke 2004-05-02 08:38:07 UTC

I understand your decision, since the locale is one way for the user to
specify their language.
But I don't see where the locale is checked.

g_mime_header_set_subject(message, subject)
  - message_set_subject(message, subject)
    - message->subject = g_strstrip(g_strdup(subject))
  - g_mime_utils_header_encode_text(message->subject)
    - rfc2047_encode(in, IS_ESAFE)
      - words = rfc2047_encode_get_rfc822_words(in, safemask & IS_PSAFE)
      - rfc2047_encode_merge_rfc822_words (&words)
      - while (word) {
        - switch (word->type) {
        - case WORD_2047:
          - if (word->encoding == 1)
          - else
          - rfc2047_encode_word (out, start, len, g_mime_charset_best (start, len), safemask);

rfc2047_encode_word() and g_mime_charset_best() do not seem to look at the locale.

And rfc2047_encode_merge_rfc822_words() merges words depending on
word_types_compatable(), which returns true when the former word is an ATOM
and the latter one is a WORD_2047 (the case of an "<ascii> <multibyte>" word
sequence).

I attach my sample source code. Is my code wrong?

> it splits/merges words the way rfc2047 describes. if other mailers
> can't handle that, then that's their fault, not gmime's.

There are many programs that predate rfc2047, and Evolution is a tool for
people to communicate with others who may use old MUAs. Merging "<ascii>
<multibyte>" is not such a big problem, but it is seen as a negative.

Comment 31 KANDA Daisuke 2004-05-02 08:41:26 UTC

Created attachment 43655
gmime subject encoding test programs. the file is encoded in euc-jp.

Comment 32 KANDA Daisuke 2004-05-02 08:46:58 UTC

above program output:

Subject: =?iso-2022-jp?q?Re=3A_=1B$B$O$8$a$^$7$F=1B=28B?=
Subject: =?iso-2022-jp?q?Re=3A_=1B$B=3C+8J=3ER2p=1B=28B?=
Subject: =?iso-2022-jp?q?Re=3A_=1B$B$40'=3B=22=1B=28B?=

The "Re:" prefix is merged to Japanese strings and encoded.
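
To see the merge directly, the Q-encoded payload can be decoded back to raw bytes. The sketch below is a minimal Q decoder following the RFC 2047 rules ('_' is a space, "=XX" is a hex-encoded byte); decode_q_payload is a name made up for illustration, not a gmime function, and it only handles the payload between "?q?" and "?=".

```c
#include <stddef.h>

static int
hex_value (char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Decode the payload of a Q-encoded word.
 * '_' becomes a space and "=XX" becomes the byte 0xXX; everything else is
 * copied through unchanged. Returns the number of bytes written to out,
 * which must have room for at least strlen(in) bytes. No NUL is appended.
 */
size_t
decode_q_payload (const char *in, unsigned char *out)
{
    size_t n = 0;

    while (*in) {
        if (*in == '_') {
            out[n++] = ' ';
            in++;
        } else if (*in == '=' && hex_value (in[1]) >= 0 && hex_value (in[2]) >= 0) {
            out[n++] = (unsigned char) (hex_value (in[1]) * 16 + hex_value (in[2]));
            in += 3;
        } else {
            out[n++] = (unsigned char) *in++;
        }
    }

    return n;
}
```

Fed the first payload above, "Re=3A_=1B$B$O$8$a$^$7$F=1B=28B", the decoded bytes begin with "Re: " and are immediately followed by ESC $ B, the ISO-2022-JP escape that switches to JIS X 0208. In other words the ASCII prefix sits inside the same encoded word as the Japanese text.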

Comment 33 Jeffrey Stedfast 2004-05-02 14:07:22 UTC

ah, maybe I was wrong about the merging. in any event, doesn't really
matter.

g_mime_charset_best does check the locale lang btw.

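/* Return the first charset in the charinfo table whose bit is set in
 * mask and which suits the locale language. */
static const char *
charset_best_mask (unsigned int mask)
{
	const char *lang;
	int i;

	for (i = 0; i < G_N_ELEMENTS (charinfo); i++) {
		if (charinfo[i].bit & mask) {
			lang = g_mime_charset_language (charinfo[i].name);

			/* accept charsets with no associated language, or whose
			 * language matches the first two letters of locale_lang */
			if (!lang || (locale_lang && !strncmp (locale_lang, lang, 2)))
				return charinfo[i].name;
		}
	}

	/* nothing suitable in the table: fall back to UTF-8 */
	return "UTF-8";
}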

Comment 34 Not Zed 2004-07-22 05:14:11 UTC

sending headers in utf8 isn't strictly a bug and certainly isn't a
regression.

gmime has nothing to do with evolution, apart from the fact that it
appears to be a fork of camel.

Comment 35 Jeffrey Stedfast 2004-07-22 17:34:15 UTC

Created attachment 43982
24026.patch

Comment 36 Yanko Kaneti 2004-07-22 22:53:09 UTC

Just a nitpick, but I am pretty sure that in all the places where a
comment reads "Russian", you actually mean "Cyrillic"

Comment 37 Jeffrey Stedfast 2004-07-26 18:16:34 UTC

sorry, yes - I meant cyrillic.

Comment 38 Jeffrey Stedfast 2004-07-28 21:01:30 UTC

punting

only part of the patch made it in (camel_charset_best_mask)

Comment 39 Jeffrey Stedfast 2004-08-03 13:20:04 UTC

*** http://bugzilla.ximian.com/show_bug.cgi?id=62345 has been marked as a duplicate of this bug. ***

Comment 40 André Klapper 2005-03-05 14:47:47 UTC

adding "patch" keyword

Comment 41 Not Zed 2005-03-17 08:00:05 UTC

don't we just do locale based stuff now?

i.e. this patch is not valid anymore

Comment 42 André Klapper 2005-07-31 14:37:40 UTC

related to bug 250087

Comment 43 André Klapper 2005-09-25 00:54:45 UTC

setting the first patch to obsolete

Comment 44 André Klapper 2005-09-25 00:58:23 UTC

seems like the last patch has not been committed yet; needs-work because of
camel move from evo to eds

Comment 45 Jeffrey Stedfast 2005-09-26 14:34:04 UTC

even the last patch is wrong actually in that it's not 100% reliable

Comment 46 Hiroyuki Ikezoe 2005-12-22 07:08:17 UTC

Created attachment 56284
update feji's patch against eds.

Comment 47 Hiroyuki Ikezoe 2005-12-22 07:15:34 UTC

I think feji's patch works in almost all cases.  

As far as I confirmed, it works fine on all CJK locales.

Comment 48 Yanko Kaneti 2005-12-22 11:26:08 UTC

My e-d-s is running under en_US.ISO-8859-1 with attachment 56284 applied.
The headers are sent UTF-8 encoded for both Windows-1251 and UTF-8 forced message bodies.
This is still better than the unpatched e-d-s, which sends koi8-r encoded headers for both windows-1251 and utf-8 forced message bodies. Apparently Outlook Express in a recent XP gets confused by these.

To answer a previous comment, using koi8-r in bg correspondence is very much inappropriate, despite being used in a standards compliant manner.

Comment 49 Simos Xenitellis 2006-04-16 14:03:29 UTC

Related report is bug 338550,
"Evolution encodes greek Subject: as 8859-7, though configured UTF-8".

I think I hold the other side of the discussion as I would prefer all in UTF-8.
In my case, GMail appears to ignore the message body encoding and follow the Subject: encoding, whatever that is.

It looks like a bug report should be sent to GMail.

Comment 50 Jeffrey Stedfast 2006-07-26 19:33:01 UTC

er, didn't mean to mark as NOTABUG (wtf happened?)

Comment 51 Jeffrey Stedfast 2007-04-13 15:36:43 UTC

I've just implemented a solution to this very same problem in GMime svn (will appear in the 2.2.7 release)

What I did was allow the client to set a list of user-specified "preferred charsets". What GMime will do is, when encoding headers it will iterate thru that list of charsets and the first one that can fit the text into it will be the one used to encode.

So for example, if Simos wanted his mail client to always use UTF-8 when encoding headers... all he'd have to do is set UTF-8 as his first charset in the list.

Xavier, on the other hand, wanting to avoid sending mail with headers encoded in UTF-8, could set his charset list with "euc-kr" or something as his first choice and UTF-8 as his last (or not even bother listing UTF-8).
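
The "first charset in the list that fits" idea can be sketched roughly as follows. This is not the GMime implementation: the names choose_header_charset and charset_can_encode, the explicit fallback parameter, and the hard-coded coverage ranges are all assumptions made for illustration. A real implementation would attempt the actual conversion (for example with iconv) rather than test code point ranges.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Crude, hypothetical coverage test over Unicode code points. */
static int
charset_can_encode (const char *charset, const uint32_t *text, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        uint32_t c = text[i];

        if (c < 0x80)
            continue;                       /* ASCII */
        if (strcmp (charset, "utf-8") == 0)
            continue;                       /* covers all of Unicode */
        if (strcmp (charset, "iso-8859-1") == 0 && c <= 0xFF)
            continue;
        if (strcmp (charset, "euc-kr") == 0 && c >= 0xAC00 && c <= 0xD7A3)
            continue;                       /* Hangul syllables (approximate) */
        return 0;                           /* this character does not fit */
    }

    return 1;
}

/* Return the first preferred charset that can hold the whole text,
 * or fallback if none of them can. */
const char *
choose_header_charset (const char *const *preferred, size_t n_preferred,
                       const uint32_t *text, size_t len, const char *fallback)
{
    size_t i;

    for (i = 0; i < n_preferred; i++) {
        if (charset_can_encode (preferred[i], text, len))
            return preferred[i];
    }

    return fallback;
}
```

With a list of { "euc-kr", "utf-8" }, a Hangul subject would be encoded as euc-kr while a Greek one would fall through to utf-8; with a list holding only "utf-8", everything is encoded as utf-8, which matches the two cases described above.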

Comment 52 Srinivasa Ragavan 2007-08-23 06:06:45 UTC

In any case, the patch no longer seems to apply to HEAD. Marking it as obsolete.

Comment 53 Krzysztof Lubański 2007-08-27 20:40:04 UTC

Hello.

I am using Evolution 2.10.2 from Debian testing now. LANG is pl_PL.UTF-8, but as most Polish users still use ISO-8859-2, I choose ISO-8859-2 as the message character set. But even though I use only Polish special characters, which all fall into ISO-8859-2, in headers too, the headers are encoded in UTF-8. So it looks like Evolution doesn't respect RFC 2047 section 3 here.

This causes non-Unicode MUAs to garble headers in replies, but is essentially just inconsistent - message body is encoded in ISO-8859-* and the headers, though using characters from the same set, are in UTF-8.

I have also noticed strange behavior when there are pure-ASCII words and encoded words in headers - filed it as bug 438438 some time ago.

Comment 54 Jeffrey Stedfast 2007-12-26 02:09:50 UTC

fixed in svn