GNOME Bugzilla – Bug 224026
try harder to not send headers in UTF-8
Last modified: 2007-12-26 02:09:50 UTC
I've found that Evolution sends mails with UTF-8 encoded headers, which show up broken in the Outlook Express 6.0 mail client. It seems OE doesn't recognize UTF-8 yet, but this is a major problem for Evolution users: if the recipient replies to the broken message, Evolution renders the reply completely unreadable (see #23988). I would like both the contents and the headers to be encoded with the same charset, which could be set from the Edit menu of the new message window. Without this feature, I have to forward all reply messages to some web mail site to read them. Could you please fix this? Thanks. ================================================= Korean GNOME Community - http://gnome.or.kr
1. This is Outlook's problem. 2. We only encode headers in UTF-8 if we can't squeeze them into something else.
Surely it's an Outlook problem, but isn't it really a big problem if I can't exchange messages with the more than 90% of internet users who use Outlook as their default mail client? (Replied mail will be completely unreadable in Evolution.) Why is it not possible to encode headers using the encoding chosen by the user, or the system default? A mail client which fails to show replied mails is simply unusable; who causes the problem is not a concern for end users. It's really regrettable that, after all those nice Korean-related bugfixes, Evolution fails to be at least a usable mail client for Korean users. I hope there can be a workaround for this problem.
jeff: I think you have an opinion on this issue. What do you think?
oops. jeff is already here. :) If I remember correctly, jeff, you said that gmime would be flexible about user-defined encodings, and I see this issue applies to gmime too. Why not let users choose their preferred encoding freely? ;)
gmime does the same as evolution actually. It has a table of some charsets and tries to guess the most appropriate charset to encode the text into. If it can't find one, it too will use UTF-8. The difference between gmime and Evolution is that gmime also includes some multibyte charsets in the table, whereas Evolution doesn't. I think the reason is that "if a subject header, for example, has text in Greek and Japanese, it would encode as Shift-JIS rather than encoding in UTF-8 like it should, since Greek will fit into Shift-JIS" or some such. I'll look into re-adding some multibyte charsets to the tables (including euc-kr) but no guarantees.
As to why we don't encode to the user's locale: you cannot guarantee that it *will* encode to the user's locale. With people communicating across locale boundaries, it is very likely that a header will contain text in multiple languages that will not fit into a single charset. Anyways, it's still my feeling that Outlook 6 (especially since it was released after 2000) *should* know UTF-8 - it's just pure laziness on their part for not supporting it. All clients should support UTF-8 these days :\
If I could set the charset for the page, as Mozilla allows, I could read almost every broken message by trying one charset after another. IMHO, UTF-8 is at an early stage of adoption, and it causes a lot of trouble for native locales, so it would be good to allow workarounds for the moment. (I know. OE is an outdated and f*cking buggy product.)
Evolution already lets users choose the encoding to use for the message content; only the encoding of the headers is the problem. From Jeff's comment, I see why Evolution doesn't allow changing the header encoding. And if euc-kr is included in the charset table, that could be a workaround for this problem. I know Outlook 6 *SHOULD* support UTF-8, or else it is sheer laziness on the part of the development team. But if everybody uses Outlook, then why should they change? Even if someone develops a web-based mail client, he will certainly test against Outlook but probably not against Evolution. So virtually all day-to-day email traffic can be handled by Outlook without a problem, however seriously flawed the mail or its client program may be. It's like the browser war. Only minority products like Evolution (or Netscape) fail miserably on some of it. It's not their fault. But an end user like me, who puts those minority products to daily work, can't accept a browser that fails to view 20% of web pages or a mail client which cannot read some replied mails. I can't force you to do this or that. But for the end users' sake, please give them at least a workaround so they can use Evolution in their day-to-day work.
Perhaps we could try to encode in the user's locale charset, and fall back to the charset check if that fails. It wouldn't be that hard to add to the rfc2047 encoder, would it? I thought it already did something like this anyway. As for Outlook Express, please open a support request with them; it's not our problem (tm).
notzed: yea, maybe that would be better than adding multibyte charsets to camel_charset_best()?
Subjects with Japanese in them mailed to NTT DoCoMo phones also appear garbled as the UTF-8 charset is not understood. The body is of course encoded in shift-jis/euc or iso-2022-jp and works, but UTF-8 doesn't. Encoding the subject in one of those charsets works as expected.
Well I guess that's another company that should be fixing their code then.
Yahoo mail also doesn't handle utf-8 correctly. Yes, they also have to change their code to support utf-8 someday, but do you think we Evolution users shouldn't mail anyone using Outlook Express, Yahoo mail, or whatever mail client is not as up to date as Evolution until they all fix their code? If we can't use it in our day-to-day work, then why are you developing this? I know this is not the place for a long debate, but I was quite shocked to see usability neglected like this. If you think it's not a problem that users of multibyte languages can't use Evolution, fine! We won't use it. There are at least a couple of alternatives out there which care more about international users.
I'm still looking into a fix for this btw. In fact I have a fix, but it is not ideal. I think what we want to do is to try and use the same charset that the user chose for the message body, but I'm not sure how to go about doing that since I don't think the headers can easily get at that information without rearranging some code.
this is not a 1.2 bug because it's not our bug.
I disagree and think that this is totally an Evolution bug. All Japanese mail software that I have had the pleasure of using uses the iso-2022-jp, sjis or euc encodings. Most email is sent using iso-2022-jp, but people do send using the others also. Evolution already does this correctly in the body of mails. Headers, on the other hand, are in Japan basically always encoded using the same encoding as the body. If you are sending a mail in Japanese and select the encoding iso-2022-jp, you usually also type in a Japanese subject line and expect it to use the same encoding. I know that in a perfect world it shouldn't matter, but it does in Japan. So, here is what I believe should happen... You type in an email in Japanese, you select iso-2022-jp (or maybe it's the default already) and Evolution tries to encode the subject using the same encoding. If it can't because someone has typed in "Greek" (not likely), then fall back to utf-8 if you really want to. Personally, I would prefer that Evolution complain and ask me to select a different charset. But I guess it won't really happen that often anyway. This is important for us over here. Right now, I can't use Japanese subject lines, as most Japanese email programs don't understand utf-8. By the way, most of the existing Japanese command line encoding tools on Linux also don't understand utf-8 and can't translate it to euc for me. I'm talking about nkf in particular here. And as another aside, there are still encoding/decoding issues with sjis->utf-8 and euc->utf-8 and vice versa. This is another reason it's not used so often as an encoding in Japan yet. So, we'd all be grateful over here in Japan if Evolution could sort this out for us... Thanks!!!
notzed: Call it an enhancement request if you don't like to admit it's a bug. I also don't think it's Evolution's fault. But do you really think it's ok to exclude Asian users from using Evolution at all? Suppose how many efforts have been made for i18n in GNOME2/GTK2, and also the current position of Evolution as a default mail client for GNOME. I believe Evolution is something more than just some hackers' leisure time hobby. And if you still insist it's none of your problem, please think about why Mozilla has to include compatibility rendering mode to its new beta release.
xavier: please calm down. :) I believe notzed did not mean 'multibyte users go home' but rather 'OE sucks, and UTF-8 is the future.' notzed: but you'd better think about backward compatibility once more. UTF-8 is the future, yes, I agree. But it's at a very early stage of adoption. The native-to-UTF-8 locale trouble is system-wide, not limited to one app. gtk2/gnome2 are haunted by this issue, and FreeBSD and the rest of the *nix family are not familiar with UTF-8 yet either. Please consider this.
*** bug 242549 has been marked as a duplicate of this bug. ***
I filed bug 244991. It may possibly be a duplicate of this one, but they sound different because this one reportedly does not affect the message body.
Just a note that "Headers encoded differently than the forced message body" also affects windows-1251 users like us Bulgarians. Currently in 1.4.3 a Cyrillic subject is encoded in koi8-r.
koi8-r is probably the better choice of charsets to use tho... since windows charsets are not always available on Unix systems and so thus should be avoided if possible anyway.
Created attachment 42989 24026.patch (work-in-progress fix - attaching now before I lose it)
*** bug 252624 has been marked as a duplicate of this bug. ***
Created attachment 43652 save the preferred encoding in the mime-message object, and then encode using it.
I've made and posted a patch against Evolution 1.4.6. The approach is to save the encoding used for the message body and use that same encoding for the subject. I think the way Evolution 1.4.6 determines the encoding is not right. From a glance at the Evolution source code, in camel_charset_best() and camel-charset-private.h (generated by camel-charset-map.c#main()), I think Evolution determines the language by looking at which characters appear in the string. It is impossible to detect which language is used by looking at character data, because CJK ideographic characters share the same code points in Unicode.
- add a new field prefer_charsets to struct CamelMimeMessage
- set prefer_charsets when creating the CamelMimeMessage instance (in e-msg-composer.c#build_message())
- create a new function camel_charset_select(), which first tries to encode with prefer_charsets and then falls back to camel_charset_best() if that fails
- replace the call to camel_charset_best() with camel_charset_select() where the subject is encoded (camel-mime-utils.c#header_encode_string()). (It would be better to use this mechanism for other headers such as From: too.)
I hope this patch gets merged to help all CJK Evolution users.
by default, the first time a user starts up evolution (at least on a new distro), their charset settings will default to UTF-8. your patch breaks the current logic to encode text using iso-8859-* if at all possible, which breaks the following rule from rfc2047, Section 3:
    When there is a possibility of using more than one character set to represent the text in an 'encoded-word', and in the absence of private agreements between sender and recipients of a message, it is recommended that members of the ISO-8859-* series be used in preference to other character sets.
A similar rule that should be applied to cjk charsets is that Evolution should encode using one of the universally accepted charsets for internet use if at all possible. Since the user can enter anything for his charsets, we cannot possibly control that if we were to use your patch. I've got a fix for this very issue that does not break the rules of rfc2047 and at the same time fixes some other issues with the current charset stuff. I just haven't committed it because I need to get Michael to review the changes (it's a fairly large change). If you'd like to test it out, check out the gmime module from gnome cvs.
> your patch breaks the current logic to encode text using iso-8859-* if at all possible which breaks the following rule from rfc2047, Section 3:
There are some fixes which could be applied to my patch: 1) add ISO-8859-* to the front of the prefer-charsets list, 2) in my charset_select() function, call charset_best() first, and if it returns "UTF-8", try to encode with prefer-charsets, 3) set the Evolution default charset to "none" rather than "UTF-8".
> A similar rule that should be applied to cjk charsets is that Evolution should encode using one of the universally accepted charsets for internet use if at all possible. Since the user can enter in anything for his charsets, we cannot possibly control that if we were to use your patch.
I think the quotation from RFC 2047 section 3 is not a universal rule but just a European local rule. There are no universally accepted charsets in CJK. Of course Unicode/UTF-8 may be one, but we are talking about MUAs which cannot recognize UTF-8. Each CJK region has its own charsets, and one region cannot understand another region's charsets. This is an important difference from Europe. I've checked out and tested gmime, and it successfully encodes with ISO-2022-JP. But I suspect that for a Chinese or Korean user it would also encode the subject with ISO-2022-JP whenever the subject string is encodable in ISO-2022-JP. If so, Chinese and Korean users will surely reject such an MUA. I say again that it is impossible to detect which language is used by looking at the characters, because CJK ideographic characters share the same code points in Unicode. Also, gmime's merging mechanism is not promising. At least, many programs which handle mail in Japan expect ideographic characters to be encoded and ASCII characters not. For example, a subject string such as "Re: XXX" (where XXX stands for ideographic characters) should be encoded as "Re: =?iso-2022-jp?b?...?=".
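The "Re: =?iso-2022-jp?b?...?=" form asked for above can be illustrated with a minimal sketch in C. It is not code from Camel or GMime: base64_encode() and encode_subject() are hypothetical helpers, the caller is assumed to have already converted the non-ASCII part of the subject into the target charset, and the 75-character limit that RFC 2047 places on an encoded-word is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Plain base64 encoder (hypothetical helper, not a Camel/GMime function).
 * Returns the number of characters written, or 0 if `out` is too small. */
size_t base64_encode(const unsigned char *in, size_t len, char *out, size_t outsize)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t need = 4 * ((len + 2) / 3);
    size_t o = 0;

    if (outsize < need + 1)
        return 0;

    for (size_t i = 0; i < len; i += 3) {
        unsigned long v = (unsigned long) in[i] << 16;
        if (i + 1 < len) v |= (unsigned long) in[i + 1] << 8;
        if (i + 2 < len) v |= in[i + 2];
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = (i + 1 < len) ? tbl[(v >> 6) & 63] : '=';
        out[o++] = (i + 2 < len) ? tbl[v & 63] : '=';
    }
    out[o] = '\0';
    return o;
}

/* Build "<ascii prefix>=?charset?b?...?=": the ASCII prefix stays readable
 * and only the bytes in `raw` (already in `charset`) are encoded.
 * Returns 0 on success, -1 if the input is empty or a buffer is too small. */
int encode_subject(const char *prefix, const char *charset,
                   const unsigned char *raw, size_t rawlen,
                   char *out, size_t outsize)
{
    char b64[256];
    int n;

    assert(prefix != NULL && charset != NULL && raw != NULL && out != NULL);
    if (rawlen == 0 || base64_encode(raw, rawlen, b64, sizeof b64) == 0)
        return -1;

    n = snprintf(out, outsize, "%s=?%s?b?%s?=", prefix, charset, b64);
    return (n < 0 || (size_t) n >= outsize) ? -1 : 0;
}
```

Calling encode_subject("Re: ", "iso-2022-jp", raw, rawlen, out, sizeof out) leaves "Re: " as plain ASCII in front of a single encoded-word, which is the layout older Japanese mail tools are said to expect.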
gmime has logic to choose the proper cjk charset based on the locale lang, so that is not an issue. as for the merging... huh? it splits/merges words the way rfc2047 describes. if other mailers can't handle that, then that's their fault, not gmime's. gmime will encode: ascii-foo <multibyte-foo> as ascii-foo =?charset?b?...?= so I have no idea what you are talking about. anyways, I much prefer gmime's solution and it works without having to add kludgy interfaces to CamelMimeMessage.
I understand your decision, because using the locale is one way for the user to specify his language. But I don't understand where the locale is checked.
g_mime_header_set_subject(message, subject)
- message_set_subject(message, subject)
- message->subject = g_strstrip(g_strdup(subject))
- g_mime_utils_header_encode_text(message->subject)
- rfc2047_encode(in, IS_ESAFE)
- words = rfc2047_encode_get_rfc822_words(in, safemask & IS_PSAFE)
- rfc2047_encode_merge_rfc822_words (&words)
- while (word) {
- switch (word->type) {
- case WORD_2047:
- if (word->encoding == 1)
- else
- rfc2047_encode_word (out, start, len, g_mime_charset_best (start, len), safemask);
rfc2047_encode_word() and g_mime_charset_best() do not seem to look at the locale. And rfc2047_encode_merge_rfc822_words() merges words depending on word_types_compatable(), which returns true when the former word is an ATOM and the latter one is a WORD_2047 (the case of an "<ascii> <multibyte>" word sequence). I attach my sample source code. Is my code wrong?
>it splits/merges words the way rfc2047 describes. if other mailers can't handle that, then that's their fault, not gmime's.
There are many programs that predate rfc2047, and Evolution is a tool for people to communicate with others who may use old MUAs. "<ascii> <multibyte>" is not such a big problem, but it counts as a negative.
Created attachment 43655 gmime subject encoding test programs. the file is encoded in euc-jp.
Output of the above program:
Subject: =?iso-2022-jp?q?Re=3A_=1B$B$O$8$a$^$7$F=1B=28B?=
Subject: =?iso-2022-jp?q?Re=3A_=1B$B=3C+8J=3ER2p=1B=28B?=
Subject: =?iso-2022-jp?q?Re=3A_=1B$B$40'=3B=22=1B=28B?=
The "Re:" prefix is merged into the Japanese strings and encoded.
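To see the merging directly, the payload of such a "Q" encoded-word can be decoded by hand. The following is a minimal sketch in C, not taken from GMime; q_decode() is a hypothetical helper that only handles the two RFC 2047 "Q" rules needed here ("_" means space, "=XX" means the byte with hex value XX).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Decode the payload of an RFC 2047 "Q" encoded-word (the part between
 * "?q?" and "?=") into `out`. Returns the number of bytes written, or
 * (size_t) -1 on malformed input or if `out` is too small.
 * Hypothetical helper for illustration only. */
size_t q_decode(const char *in, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in) {
        unsigned char c;

        if (*in == '_') {                 /* "_" stands for a space */
            c = ' ';
            in++;
        } else if (*in == '=') {          /* "=XX" is one hex-encoded byte */
            char hex[3];
            if (!isxdigit((unsigned char) in[1]) || !isxdigit((unsigned char) in[2]))
                return (size_t) -1;
            hex[0] = in[1];
            hex[1] = in[2];
            hex[2] = '\0';
            c = (unsigned char) strtol(hex, NULL, 16);
            in += 3;
        } else {                          /* everything else is literal */
            c = (unsigned char) *in++;
        }

        if (n >= outsize)
            return (size_t) -1;
        out[n++] = c;
    }
    return n;
}
```

Applied to the first payload above, Re=3A_=1B$B$O$8$a$^$7$F=1B=28B, the decoded bytes start with the literal ASCII text "Re: " followed by the escape sequence ESC $ B, so the ASCII prefix really does sit inside the iso-2022-jp encoded-word.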
ah, maybe I was wrong about the merging. in any event, doesn't really matter. g_mime_charset_best does check the locale lang btw.

static const char *
charset_best_mask (unsigned int mask)
{
    const char *lang;
    int i;

    for (i = 0; i < G_N_ELEMENTS (charinfo); i++) {
        if (charinfo[i].bit & mask) {
            lang = g_mime_charset_language (charinfo[i].name);

            if (!lang || (locale_lang && !strncmp (locale_lang, lang, 2)))
                return charinfo[i].name;
        }
    }

    return "UTF-8";
}
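The effect of the locale check can be tried in isolation with a small stand-alone sketch. It mirrors the loop above but is not GMime code: the three-entry table, the FITS_* bits, the explicit locale_lang parameter and the name best_for_mask() are all assumptions made for illustration, and each bit is assumed to mean "this charset can represent the whole text".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bits: one per charset that can hold the whole text. */
enum {
    FITS_LATIN1    = 1 << 0,
    FITS_ISO2022JP = 1 << 1,
    FITS_EUCKR     = 1 << 2
};

struct charset_entry {
    const char *name;
    unsigned int bit;
    const char *lang;   /* language tied to the charset, NULL if none */
};

/* Heavily trimmed stand-in for the real charinfo table. */
static const struct charset_entry table[] = {
    { "iso-8859-1",  FITS_LATIN1,    NULL },
    { "iso-2022-jp", FITS_ISO2022JP, "ja" },
    { "euc-kr",      FITS_EUCKR,     "ko" },
};

/* Same selection rule as the loop above: take the first charset whose bit
 * is set in `mask` and which is either language-neutral or matches the
 * first two letters of the locale language; otherwise use UTF-8. */
const char *best_for_mask(unsigned int mask, const char *locale_lang)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (!(table[i].bit & mask))
            continue;
        if (!table[i].lang ||
            (locale_lang && strncmp(locale_lang, table[i].lang, 2) == 0))
            return table[i].name;
    }
    return "UTF-8";
}
```

With a mask of FITS_ISO2022JP | FITS_EUCKR (text made only of shared ideographs), a "ko_KR" locale selects euc-kr, a "ja_JP" locale selects iso-2022-jp, and any other locale falls through to UTF-8, which is the behaviour described in the comment above.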
sending headers in utf8 isn't strictly a bug and certainly isn't a regression. gmime has nothing to do with evolution, apart that it appears to be a fork of camel.
Created attachment 43982 24026.patch
Just a nitpick, but I am pretty sure that in all the places where a comment reads "Russian", you actually mean "Cyrillic"
sorry, yes - I meant cyrillic.
punting; only part of the patch made it in (camel_charset_best_mask)
*** http://bugzilla.ximian.com/show_bug.cgi?id=62345 has been marked as a duplicate of this bug. ***
adding "patch" keyword
don't we just do locale based stuff now? i.e. this patch is not valid anymore
related to bug 250087
setting the first patch to obsolete
seems like the last patch has not been committed yet; needs-work because of camel move from evo to eds
even the last patch is wrong actually in that it's not 100% reliable
Created attachment 56284 update feji's patch against eds.
I think feji's patch works in almost all cases. As far as I can confirm, it works fine on all CJK locales.
My e-d-s is running under en_US.ISO-8859-1 with attachment 56284 applied. The headers are sent UTF-8 encoded for both windows-1251 and UTF-8 forced message bodies. This is still better than the unpatched e-d-s, which sends koi8-r encoded headers for both windows-1251 and UTF-8 forced message bodies. Apparently Outlook Express in a recent XP gets confused by these. To answer a previous comment, using koi8-r in bg correspondence is very much inappropriate despite being used in a standards compliant manner.
Related report is bug 338550, "Evolution encodes greek Subject: as 8859-7, though configured UTF-8". I think I hold the other side of the discussion as I would prefer all in UTF-8. In my case, GMail appears to ignore the message body encoding and follow the Subject: encoding, whatever that is. It looks that a bug report to GMail should be sent.
er, didn't mean to mark as NOTABUG (wtf happened?)
I've just implemented a solution to this very same problem in GMime svn (will appear in the 2.2.7 release). What I did was allow the client to set a list of user-specified "preferred charsets". When encoding headers, GMime will iterate through that list of charsets, and the first one that the text fits into will be the one used to encode. So for example, if Simos wanted his mail client to always use UTF-8 when encoding headers... all he'd have to do is set UTF-8 as the first charset in his list. Xavier, on the other hand, wanting to avoid sending mail with headers encoded in UTF-8, could set up his charset list with "euc-kr" or something as his first choice and UTF-8 as his last (or not even bother listing UTF-8).
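The "first charset in the list that fits" rule can be sketched in C on top of POSIX iconv. This is not GMime's API or implementation: fits_charset() and pick_header_charset() are hypothetical names, the input is assumed to be UTF-8, and the result depends on which charsets the platform's iconv knows about.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the UTF-8 string `text` converts to `charset` without
 * hitting an unrepresentable character, 0 otherwise (including when
 * iconv does not know the charset). Hypothetical helper. */
int fits_charset(const char *charset, const char *text)
{
    char outbuf[256];
    char *in = (char *) text;   /* iconv() takes a non-const pointer */
    size_t inleft = strlen(text);
    int ok = 1;
    iconv_t cd = iconv_open(charset, "UTF-8");

    if (cd == (iconv_t) -1)
        return 0;

    while (inleft > 0) {
        char *out = outbuf;
        size_t outleft = sizeof outbuf;

        /* E2BIG only means the scratch buffer filled up; keep going.
         * EILSEQ/EINVAL mean the text does not fit this charset. */
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t) -1 &&
            errno != E2BIG) {
            ok = 0;
            break;
        }
    }

    iconv_close(cd);
    return ok;
}

/* Walk the user's preferred charsets in order and return the first one
 * the text fits into; fall back to UTF-8 if none does. */
const char *pick_header_charset(const char **preferred, size_t n, const char *text)
{
    assert(text != NULL);
    for (size_t i = 0; i < n; i++) {
        if (fits_charset(preferred[i], text))
            return preferred[i];
    }
    return "UTF-8";
}
```

With a list such as { "EUC-KR" }, a Korean subject would be reported as EUC-KR, while a subject that mixes in characters outside that charset would fall back to UTF-8. Putting "UTF-8" first in the list makes every header come out as UTF-8.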
In any case, the patch seems to fail to be applied to head. Marking it as obsolete.
Hello. I am using Evolution 2.10.2 from Debian testing now. LANG is pl_PL.UTF-8, but as most Polish users still use ISO-8859-2, I choose ISO-8859-2 as the message character set. But even though I use only Polish special characters - also in headers - which fall into ISO-8859-2, headers are encoded in UTF-8. So, it looks that Evolution doesn't respect RFC 2047 section 3 here. This causes non-Unicode MUAs to garble headers in replies, but is essentially just inconsistent - message body is encoded in ISO-8859-* and the headers, though using characters from the same set, are in UTF-8. I have also noticed strange behavior when there are pure-ASCII words and encoded words in headers - filed it as bug 438438 some time ago.
fixed in svn