GNOME Bugzilla – Bug 138218
Encode headers in article's charset if possible
Last modified: 2010-09-26 13:02:08 UTC
Hello there, I won't explain the problem at length; a few examples should show it. This is what a Pan header looks like:
    From: =?koi8-r?b?9M/NydPMwdcg7cHSy8/X08vJ?= <tome@set.com.mk>
    Subject: =?koi8-r?b?0MHO0MHO0MHO0MHO?=
    Content-Type: text/plain; charset=UTF-8
The default locale encoding of my system is UTF-8 (mk_MK.UTF-8 to be exact). I configured Pan to use UTF-8, too. Is there a way to make the From/Subject headers use the selected encoding as well? The following should give a clearer view. These are correctly formed/encoded headers:
    From: =?UTF-8?B?0LTQsNC80ZjQsNC9INCzLg==?= <tome@set.com.mk>
    Subject: =?UTF-8?B?0KLQtdGB0YI=?=
NOTE: This issue only concerns the headers. I know the headers show up fine in Pan even though koi8-r is used, but some newsreaders are having problems with this. Thank you very much for your time and effort. You're doing a great job. Keep it up! Take care.
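The mismatch described above can be checked mechanically. The following is only a sketch: it assumes the general encoded-word layout `=?charset?encoding?text?=`, and the helper names `encoded_word_charset` and `header_matches_body` are made up for illustration; they are not Pan or GMime functions.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy the charset field of an encoded-word ("=?charset?enc?text?=") into
 * out. Returns 1 on success, 0 if the word does not start like an
 * encoded-word or the charset does not fit. Hypothetical helper. */
int encoded_word_charset(const char *word, char *out, size_t outlen)
{
    const char *start, *end;
    size_t len;

    if (strncmp(word, "=?", 2) != 0)
        return 0;
    start = word + 2;
    end = strchr(start, '?');
    if (end == NULL || end == start)
        return 0;
    len = (size_t)(end - start);
    if (len >= outlen)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}

/* ASCII case-insensitive equality; charset names are case-insensitive. */
int charset_names_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Returns 1 if the header's encoded-word names the same charset as the body. */
int header_matches_body(const char *word, const char *body_charset)
{
    char charset[64];

    if (!encoded_word_charset(word, charset, sizeof(charset)))
        return 0;
    return charset_names_equal(charset, body_charset);
}
```

With the headers quoted in the report, `header_matches_body("=?koi8-r?b?0MHO0MHO0MHO0MHO?=", "UTF-8")` yields 0, while the UTF-8 Subject example yields 1.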
Can you explain what problems a newsreader may have with this? If a newsreader has problems interpreting an encoded koi8-r text, I would expect it to have the same problem interpreting the same text encoded in UTF-8 (with the same encoding mechanism). Which newsreaders in particular? Not that I doubt you, but I'd like to understand. :)
People have reported that slrn has problems with this (there was one report about knode, but I think it was misconfigured). Another question: why would a newsreader automatically encode any Cyrillic header text in koi8-r instead of the specified charset (e.g. ISO-8859-5, UTF-8, or even cp1251; perhaps the system locale would be more logical)? I'm a programmer myself; do you feel that this is correct? It's not a clean solution, in my opinion. Finally, don't get me wrong: I do not feel this is a bug, more of a proper-coding suggestion, I guess... Kind regards, Tomislav Markovski (tome@set.com.mk) P.S. Other newsreaders encode the headers in the same encoding that Content-Type specifies for the body. That said, other newsreaders simply can't compare to Pan!
Marking enhancement for bluesky: it would be more consistent to force encoded headers to be in the same charset as the body, when possible. Not strictly a requirement, since Content-Type relates to the mime part (body), but not the headers. Tomislav: since you're a programmer, patches are welcome. :)
Another issue: the subject line becomes unreadable when someone replies to the post but (forcibly) uses a different encoding. It is possible to read the encoded text properly in non-Pan readers; Pan, however, displays something else... About the patch: I'm not very familiar with the Pan code, and I haven't got the time to go through it. I just thought it would be better to inform you guys that something isn't going well with Pan. It's up to you if you want to fix it.
> Finally, don't get me wrong, I do not feel this is a bug, more of proper
> coding suggestion I guess...
This is a bug. KOI8-R is only suitable for the Russian Cyrillic alphabet. The Macedonian language has letters that KOI8-R doesn't have, and I believe that other Cyrillic languages (Serbian, Bulgarian...) would have the same problem. I also believe that the headers should be encoded with the same charset as the body, not the locale. Pan is a great newsreader, but because of this bug we here mostly use knode or slrn. I'm surprised that nobody filed this bug earlier. :)
fejj: putting you in CC since GMime is handling the encoding for us. Looks like GMime may need some work in distinguishing between different cyrillic languages. An alternative would be to allow Pan to override the charset used for encoding (rather than relying on best_charset()).
fejj: ok, *now* you're in CC :)
wow, this new bugzilla (and the changelog mails) is funkadelic. anyways, I've got an idea... I'll attach a patch here hopefully tonight.
Created attachment 26123
gmime-charset.patch
The attached patch makes it so that special c/j/k/r charsets aren't used unless they match the user's locale. for example:
    koi8-r won't be used unless the user's locale lang is "ru"
    koi8-u won't be used unless the user's locale lang is "uk"
    euc-kr won't be used unless the user's locale lang is "ko"
    and so on...
this should fix it for everyone except for those users who are composing, say, japanese in a non japanese locale - in which case the headers will be encoded as UTF-8, but that's better than the current situation so I guess that's acceptable.
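The rule the patch describes can be sketched roughly as follows. This is not the code from the attachment: the table holds only the three pairs named in the comment, and `charset_allowed_for_locale` is a hypothetical name used for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative table only: a language-specific charset paired with the
 * locale language that must be active before the charset may be used. */
struct charset_lang {
    const char *charset;
    const char *lang;
};

static const struct charset_lang gated_charsets[] = {
    { "koi8-r", "ru" },
    { "koi8-u", "uk" },
    { "euc-kr", "ko" },
};

/* Return 1 if `charset` may be used under `locale` (e.g. "mk_MK.UTF-8").
 * Charsets that are not listed in the table are always allowed. */
int charset_allowed_for_locale(const char *charset, const char *locale)
{
    size_t i, len;
    size_t n = sizeof(gated_charsets) / sizeof(gated_charsets[0]);
    char next;

    for (i = 0; i < n; i++) {
        if (strcmp(charset, gated_charsets[i].charset) != 0)
            continue;
        len = strlen(gated_charsets[i].lang);
        if (strncmp(locale, gated_charsets[i].lang, len) != 0)
            return 0;
        /* The language code must be the whole first locale component. */
        next = locale[len];
        return next == '\0' || next == '_' || next == '.' || next == '@';
    }
    return 1;
}
```

Under the reporter's `mk_MK.UTF-8` locale this rejects koi8-r but leaves charsets outside the table, such as iso-8859-5, available; which charset is chosen instead is decided elsewhere and is not modelled here.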
Would be good if someone could test this before I commit to GMime CVS
The patch isn't working for me. I tried to patch 0.14.2.91 and this is what I got:
    patching file ChangeLog
    Hunk #1 FAILED at 1.
    1 out of 1 hunk FAILED -- saving rejects to file ChangeLog.rej
    patching file gmime/gmime-charset.c
    Hunk #1 FAILED at 121.
    Hunk #2 FAILED at 131.
    Hunk #3 FAILED at 345.
    Hunk #4 FAILED at 482.
    Hunk #5 FAILED at 536.
    Hunk #6 succeeded at 368 (offset -346 lines).
    5 out of 6 hunks FAILED -- saving rejects to file gmime/gmime-charset.c.rej
Any ideas?
I guess I'll have to backport the patch to whatever version of gmime pan is using. stay tuned... :-)
Created attachment 26164
pan-gmime-charset.patch
Okay, new patch which also syncs Pan's copy of gmime-charset.[c,h] up with GMime CVS (which includes a number of fixes and some extra functionality)
The patch worked... kind of. The headers are now encoded in iso-8859-5, which does contain the letters of the alphabet I use, but it is neither my system locale's encoding, nor the encoding set in Pan, nor even the encoding of the post I'm following up to. Somewhat peculiar, but it does the job. Conclusion: gmime needs more development! Other than that, great job Jeffrey. Thanks a lot!
iso-8859-5 was actually the intended target for encoding the headers in my patch.
With header encoding, since mailers need to be able to render the result as clearly as possible, the RFCs suggest using the "lowest common factor" charset (i.e. use an 8bit charset before using a multibyte charset, and when using an 8bit charset, use the lowest-numbered iso charset possible: use iso-8859-1 if possible, rather than iso-8859-2, and so on).
We *really* don't want to be using UTF-8 to encode headers if we can avoid it, because so few mailers/news readers actually support it (ironically, Linux mailers/news readers are far better about this than Windows mailers in general). The situation is, however, improving over time (more and more support it every year; a few years ago, I could probably have counted the number of Linux news/mail readers that supported UTF-8/proper charset conversion on a single hand).
This is why GMime has code to try and use some popular asian charsets rather than UTF-8 as well (and this is non-trivial). From bug reports I've gotten, Outlook 10 still doesn't support UTF-8 and many of the popular mailers used in China/Japan/Korea don't support UTF-8 either.
Sorry about the techno-babble, just figured I'd lay down the issues in case anyone was interested. Anyways, glad it now does the Right Thing (tm) and thanks for reporting the bug, always makes me happy to make GMime even better :-)
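A minimal sketch of the "lowest common factor" idea, assuming the text is already available as Unicode code points. It is not GMime's implementation: only three candidate charsets are listed, the iso-8859-5 coverage test is an approximation limited to ASCII plus the Cyrillic letters, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Coverage tests on Unicode code points. Deliberately incomplete:
 * just enough to illustrate the preference order. */
int covered_by_ascii(uint32_t cp)      { return cp < 0x80; }
int covered_by_iso_8859_1(uint32_t cp) { return cp <= 0xFF; }

int covered_by_iso_8859_5(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    /* Cyrillic letters U+0401..U+045F, minus three code points that
     * iso-8859-5 lacks (U+040D, U+0450, U+045D). Punctuation such as
     * the numero sign is ignored in this approximation. */
    if (cp >= 0x0401 && cp <= 0x045F)
        return cp != 0x040D && cp != 0x0450 && cp != 0x045D;
    return 0;
}

struct charset_candidate {
    const char *name;
    int (*covers)(uint32_t cp);
};

/* Preference order: 7-bit first, then 8-bit iso charsets by number. */
static const struct charset_candidate charset_candidates[] = {
    { "us-ascii",   covered_by_ascii },
    { "iso-8859-1", covered_by_iso_8859_1 },
    { "iso-8859-5", covered_by_iso_8859_5 },
};

/* Return the first candidate that covers every code point, else UTF-8. */
const char *lowest_common_charset(const uint32_t *cps, size_t n)
{
    size_t i, j;
    size_t count = sizeof(charset_candidates) / sizeof(charset_candidates[0]);

    for (i = 0; i < count; i++) {
        for (j = 0; j < n; j++) {
            if (!charset_candidates[i].covers(cps[j]))
                break;
        }
        if (j == n)
            return charset_candidates[i].name;
    }
    return "UTF-8";
}
```

Text containing a Macedonian letter such as U+0403 falls through us-ascii and iso-8859-1 and lands on iso-8859-5, which matches the result reported in the thread; text mixing U+00E9 with Cyrillic fits none of the 8bit candidates and ends up as UTF-8.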
chris: shall I commit to Pan CVS?
That'd be great. Thanks for your help.
Thanks, Jeffrey, for the explanation. Call me annoying, but I would like to ask you to take a look at this snapshot: http://www.set.com.mk/pan.png You will notice that some followups are not displayed correctly, mostly when they reply in iso-8859-5; one post at the bottom uses windows-1251. As far as I know, most newsreaders encode the headers and body in the same encoding as the post they reply to, unless they are explicitly told to use their own configured encoding. I would ignore these errors and say that the users haven't set up their clients as they should, but then again Gecko-based newsreaders (Moz, TB), knode and slrn display this correctly. Could you explain why this is happening? Is it another gmime issue? I figured I should post this message while we're on the subject of charsets. Thanks again...
without having seen the message sources, it's hard to say (especially since the encoded word is not fully displayed) - but this is likely the same problem that's been reported a few times before, which is that the encoded-word is not a proper encoded-word token. For example, one of the subjects had "Re:" in the middle of an encoded-word; this is *illegal* and so GMime doesn't decode it. GMime uses a tokenising approach to parsing and so it will get "=?iso-8859-5?Q?Re" as a token, ":" as the next token, and then "...?=" (replace ... with the encoded text) as the third token. Obviously none of the 3 decode by themselves.
Another common brokenness is this:
    Foo=?iso-8859-1?Q?...?=Bar
i.e. the encoded-word token is in the middle of a larger token, whereas according to the RFC encoded-word tokens have to be their own token. The way some readers get around this is by using the following approach to parsing (which is sick and wrong):
    if ((enc_start = strstr (subject, "=?"))) {
        enc_end = strstr (enc_start, "?=");
        decode (enc_start, enc_end);
    }
So what needs to be done is for those broken readers to be fixed and to send properly formatted encoded text :)
Don't have access to that newsgroup, but from the little evidence in the screenshot, I think fejj is right. This bug is tracking that: http://bugzilla.gnome.org/show_bug.cgi?id=116269
fwiw, gmime has been greatly improved in the area of decoding brokenly encoded rfc2047 tokens.