Bug 138218 – Encode headers in article's charset if possible



Summary:	Encode headers in article's charset if possible


Status:	RESOLVED FIXED

Product:	Pan
Classification:	Other
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	0.14.3
Assigned To:	Christophe Lambin
QA Contact:	Pan QA Team

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-03-26 15:10 UTC by Tomislav Markovski
Modified:	2010-09-26 13:02 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gmime-charset.patch (2.61 KB, patch) 2004-03-30 22:17 UTC, Jeffrey Stedfast
pan-gmime-charset.patch (16.33 KB, patch) 2004-03-31 14:48 UTC, Jeffrey Stedfast

Description Tomislav Markovski 2004-03-26 15:10:27 UTC

Hello there,

Rather than explain the problem at length, I will give some examples.
This is what a Pan header looks like:

From:            =?koi8-r?b?9M/NydPMwdcg7cHSy8/X08vJ?= <tome@set.com.mk>
Subject:         =?koi8-r?b?0MHO0MHO0MHO0MHO?=
Content-Type:    text/plain; charset=UTF-8

The default locale encoding of my system is UTF-8 (mk_MK.UTF-8 to be exact). I
configured Pan to use UTF-8, too. Is there a way to have the From/Subject headers
encoded in the selected encoding as well?
The following should give a clearer picture. These are correctly formed/encoded headers:

From:       =?UTF-8?B?0LTQsNC80ZjQsNC9INCzLg==?= <tome@set.com.mk>
Subject:    =?UTF-8?B?0KLQtdGB0YI=?=

NOTE: This issue only concerns the headers. I know that the headers encoded in
koi8-r show up fine in Pan, but some newsreaders are having problems with
them.
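
For reference, both forms are valid RFC 2047 encoded-words and differ only in the charset they declare. A minimal Python sketch using the standard library's email.header module decodes the two Subject values quoted above. The helper name decode_mime_header is made up for this illustration and is not part of Pan or GMime:

```python
from email.header import decode_header


def decode_mime_header(value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value to a Unicode string.

    Hypothetical helper for illustration only.
    """
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Unencoded runs may come back as bytes with charset None.
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(data)
    return "".join(parts)


# Subject as Pan sent it (KOI8-R) and the UTF-8 subject from the second example.
print(decode_mime_header("=?koi8-r?b?0MHO0MHO0MHO0MHO?="))
print(decode_mime_header("=?UTF-8?B?0KLQtdGB0YI=?="))
```

The KOI8-R subject decodes to "панпанпанпан" and the UTF-8 one to "Тест", so a reader that handles both charsets sees correct text either way.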

Thank you very much for your time and effort. You're doing a great job. Keep it up!

Take care.

Comment 1 Christophe Lambin 2004-03-26 17:43:24 UTC

Can you explain what problems a newsreader may have with this? If a newsreader
has problems interpreting an encoded koi8-r text, I would expect it to have the
same problem interpreting the same text encoded in UTF-8 (with the same encoding
mechanism).  Which newsreaders in particular?

Not that I doubt you, but I'd like to understand. :)

Comment 2 Tomislav Markovski 2004-03-28 14:16:00 UTC

People have reported that slrn has problems with this (there was also one report
about knode, but I think it was misconfigured).
Another question: why would a newsreader automatically encode any Cyrillic
header text in koi8-r instead of the specified charset (e.g. ISO-8859-5, UTF-8,
or even cp1251; perhaps the system locale would be more logical)?
I'm a programmer myself; do you feel that this is correct? It's not a clean
solution (in my opinion).
Finally, don't get me wrong, I do not feel this is a bug, more of a proper
coding suggestion I guess...

Kind regards,
Tomislav Markovski
(tome@set.com.mk)

P.S. Other newsreaders encode the headers in the same encoding that Content-Type
specifies for the body. However, other newsreaders are simply incomparable to Pan!

Comment 3 Christophe Lambin 2004-03-28 15:17:31 UTC

Marking enhancement for bluesky: it would be more consistent to force encoded
headers to be in the same charset as the body, when possible. Not strictly a
requirement, since Content-Type relates to the mime part (body), but not the
headers.

Tomislav: since you're a programmer, patches are welcome. :)

Comment 4 Tomislav Markovski 2004-03-29 13:26:48 UTC

Another issue: the subject line becomes unreadable when someone replies to the post
but (forcibly) uses a different encoding. It is possible (as seen in non-Pan
readers) to read the encoded text properly. However, Pan displays something else...

About the patch, I'm not very familiar with the pan code, and I haven't got the
time to go through it. I just thought it would be better to inform you guys that
something isn't going well with Pan. It's up to you if you want to fix it.

Comment 5 Georgi Stanojevski 2004-03-30 09:06:36 UTC

> Finally, don't get me wrong, I do not feel this is a bug, more of a proper
> coding suggestion I guess...

This is a bug.

KOI8-R is only used with the Russian Cyrillic alphabet.
The Macedonian language has letters that KOI8-R doesn't have; I believe that
other Cyrillic languages (Serbian, Bulgarian...) would also have the same problem.
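
The coverage gap is easy to check. A minimal Python sketch, assuming Python's built-in koi8-r and iso-8859-5 codecs match the charsets newsreaders use (can_encode is a hypothetical helper, not something from Pan or GMime):

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# Letters used in Macedonian but not in Russian.
MACEDONIAN_LETTERS = "ѓѕјљњќџ"

for charset in ("koi8-r", "iso-8859-5", "utf-8"):
    missing = [ch for ch in MACEDONIAN_LETTERS if not can_encode(ch, charset)]
    print(charset, "is missing:", "".join(missing) or "nothing")
```

KOI8-R has none of these letters, while ISO-8859-5 and UTF-8 have all of them.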

I also believe that the headers should be encoded with the same charset as the
body, not the locale.

Pan is a great newsreader, but because of this bug we here mostly use knode or
slrn. I'm surprised that no one filed this bug earlier. :)

Comment 6 Christophe Lambin 2004-03-30 16:40:21 UTC

fejj: putting you in CC since GMime is handling the encoding for us. Looks like
GMime may need some work in distinguishing between different cyrillic languages.
An alternative would be to allow Pan to override the charset used for encoding
(rather than relying on best_charset()).
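
The override idea could look roughly like this in Python. This is only a sketch of the policy, not Pan or GMime code; choose_header_charset and its fallback parameter are hypothetical names:

```python
def choose_header_charset(header_text: str, body_charset: str,
                          fallback: str = "utf-8") -> str:
    """Prefer the article's body charset for headers when it can represent them.

    Hypothetical policy sketch: fall back to `fallback` when the body charset
    is unknown to Python or lacks a character used in the header.
    """
    try:
        header_text.encode(body_charset)
    except (UnicodeEncodeError, LookupError):
        return fallback
    return body_charset


print(choose_header_charset("Тест", "utf-8"))       # body charset is usable
print(choose_header_charset("Тест", "iso-8859-1"))  # Latin-1 has no Cyrillic
```

With this policy the headers follow the body charset whenever that is possible, and a best-guess charset is only needed for the fallback case.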

Comment 7 Christophe Lambin 2004-03-30 16:40:56 UTC

fejj: ok, *now* you're in CC :)

Comment 8 Jeffrey Stedfast 2004-03-30 17:51:12 UTC

wow, this new bugzilla (and the changelog mails) is funkadelic.

anyways, I've got an idea... I'll attach a patch here hopefully tonight.

Comment 9 Jeffrey Stedfast 2004-03-30 22:17:28 UTC

Created attachment 26123
gmime-charset.patch

The attached patch makes it so that special c/j/k/r charsets aren't used unless
they match the user's locale.

for example:

koi8-r won't be used unless the user's locale lang is "ru"
koi8-u won't be used unless the user's locale lang is "uk"
euc-kr won't be used unless the user's locale lang is "ko"

and so on...

this should fix it for everyone except for those users who are composing, say,
Japanese in a non-Japanese locale - in which case the headers will be encoded
as UTF-8, but that's better than the current situation so I guess that's
acceptable.
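
A rough Python sketch of the rule described in this comment. Only the three charset/language pairs listed above are included; the table name, the function name and the locale parsing are assumptions made for illustration, not the patch's actual code:

```python
import re

# Charsets that should only be picked when the locale language matches.
# Only the pairs named in the comment; the real patch covers more.
LOCALE_SPECIFIC_CHARSETS = {
    "koi8-r": "ru",
    "koi8-u": "uk",
    "euc-kr": "ko",
}


def charset_allowed(charset: str, locale_name: str) -> bool:
    """Return False for a locale-specific charset in a non-matching locale."""
    # "mk_MK.UTF-8" -> "mk"; also copes with "ru.KOI8-R" or "sr@latin".
    lang = re.split(r"[_.@]", locale_name)[0].lower()
    required = LOCALE_SPECIFIC_CHARSETS.get(charset.lower())
    return required is None or required == lang


print(charset_allowed("koi8-r", "mk_MK.UTF-8"))   # Macedonian locale
print(charset_allowed("koi8-r", "ru_RU.KOI8-R"))  # Russian locale
```

Charsets that are not in the table, such as iso-8859-5, are always allowed under this sketch.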

Comment 10 Jeffrey Stedfast 2004-03-30 22:18:46 UTC

Would be good if someone could test this before I commit to GMime CVS

Comment 11 Tomislav Markovski 2004-03-31 12:25:02 UTC

The patch isn't working for me. I tried to patch 0.14.2.91 and this is what I got:

patching file ChangeLog
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- saving rejects to file ChangeLog.rej
patching file gmime/gmime-charset.c
Hunk #1 FAILED at 121.
Hunk #2 FAILED at 131.
Hunk #3 FAILED at 345.
Hunk #4 FAILED at 482.
Hunk #5 FAILED at 536.
Hunk #6 succeeded at 368 (offset -346 lines).
5 out of 6 hunks FAILED -- saving rejects to file gmime/gmime-charset.c.rej

Any ideas?

Comment 12 Jeffrey Stedfast 2004-03-31 13:59:56 UTC

I guess I'll have to backport the patch to whatever version of gmime pan is using.

stay tuned... :-)

Comment 13 Jeffrey Stedfast 2004-03-31 14:48:50 UTC

Created attachment 26164
pan-gmime-charset.patch

Okay, new patch which also syncs Pan's copy of gmime-charset.[c,h] up with
GMime CVS (which includes a number of fixes and some extra functionality)

Comment 14 Tomislav Markovski 2004-04-01 10:07:35 UTC

The patch worked... kind of. The headers are now encoded in iso-8859-5, which
does contain the letters of the alphabet I use, but it is neither my system
locale's encoding, nor the encoding in my Pan settings, nor even the encoding
in which I'm posting a followup. Somewhat peculiar, but it does the job.
Conclusion: gmime needs more development!
Other than that, great job Jeffrey. Thanks a lot!

Comment 15 Jeffrey Stedfast 2004-04-01 13:42:07 UTC

iso-8859-5 was actually the intended target for encoding the headers in my
patch. With header encoding, since mailers need to be able to render that as
clearly as possible, the RFCs suggest using the "lowest common factor" charset
(ie. use an 8bit charset before using a multibyte charset, and when using an
8bit charset - use the lowest numbered iso charset possible: ie. use iso-8859-1
if possible, rather than iso-8859-2 and so on).
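
The "lowest common factor" selection can be sketched in a few lines of Python. The candidate list and the function name are assumptions for illustration; GMime's real logic (including its handling of Asian charsets) is more involved:

```python
# Candidates in order of preference: 7-bit first, then the ISO 8859 family
# from the lowest number up, with UTF-8 as the last resort.
CANDIDATES = ["us-ascii"] + [f"iso-8859-{n}" for n in range(1, 11)] + ["utf-8"]


def lowest_common_charset(text: str) -> str:
    """Return the first candidate charset that can represent all of text."""
    for charset in CANDIDATES:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    return "utf-8"  # only reached for text that even UTF-8 rejects


for sample in ("hello", "Grüße", "Тест", "日本語"):
    print(sample, "->", lowest_common_charset(sample))
```

For Cyrillic text the first four ISO charsets fail and iso-8859-5 is the first match, which is consistent with the result reported in comment 14.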

We *really* don't want to be using UTF-8 to encode headers if at all possible
because so few mailers/news readers actually support it (ironically, Linux
mailers/news readers are far better about this than Windows mailers in general).
The situation is, however, improving over time (more and more support it every
year - a few years ago, I could probably have counted the number of Linux
news/mail readers that supported UTF-8/proper charset conversion on a single hand).

This is why GMime has code to try and use some popular asian charsets rather
than UTF-8 as well (and this is non-trivial). From bug reports I've gotten,
Outlook 10 still doesn't support UTF-8 and many of the popular mailers used in
China/Japan/Korea don't support UTF-8 either.

Sorry about the techno-babble, just figured I'd lay down the issues in case
anyone was interested.

Anyways, glad it now does the Right Thing (tm) and thanks for reporting the bug,
always makes me happy to make GMime even better :-)

Comment 16 Jeffrey Stedfast 2004-04-01 13:44:13 UTC

chris: shall I commit to Pan CVS?

Comment 17 Christophe Lambin 2004-04-01 16:53:48 UTC

That'd be great. Thanks for your help.

Comment 18 Tomislav Markovski 2004-04-02 07:31:17 UTC

Thanks Jeffrey for the explanation. Call me annoying, but I would like to ask
you to take a look at this snapshot:

http://www.set.com.mk/pan.png

You will notice that some followups are not displayed correctly, mostly when
they reply in iso-8859-5; one post at the bottom uses windows-1251. As far as
I know, most newsreaders encode the headers and body in the same encoding as the
post they reply to, unless the user explicitly tells them to use their own
configured encoding. I would ignore these errors and say that the users haven't
set up their clients properly, but then again Gecko-based newsreaders (Moz, TB),
knode and slrn display this correctly. Could you explain why this is happening?
Is it another gmime issue?
I figured I should post this message while we're on the subject of charsets.

Thanks again...

Comment 19 Jeffrey Stedfast 2004-04-02 13:25:48 UTC

without having seen the message sources, it's hard to say (especially since the
encoded word is not fully displayed) - but this is likely the same problem
that's been reported a few times before, which is that the encoded-word is not a
proper encoded-word token.

For example, one of the subjects had "Re:" in the middle of an encoded-word,
this is *illegal* and so GMime doesn't decode it. GMime uses a tokenising
approach to parsing and so it will get "=?iso-8859-5?Q?Re" as a token, ":" as
the next token, and then "...?=" (replace ... with the encoded text) as the
third token. Obviously none of the 3 decode by themselves.

Another common brokenness is this:

Foo=?iso-8859-1?Q?...?=Bar

i.e. the encoded-word is embedded in the middle of a larger token, whereas
according to the RFC an encoded-word has to be its own token.

the way some readers get around this is by using the following approach to
parsing (which is sick and wrong):

if ((enc_start = strstr (subject, "=?"))) {
   enc_end = strstr (enc_start, "?=");
   decode (enc_start, enc_end);
}

So what needs to be done is for those broken readers to be fixed and to send
properly formatted encoded text :)
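
To make the difference concrete, here is a small Python sketch. It splits on whitespace only, which is a simplification of GMime's tokeniser, and the regular expression is a loose reading of the encoded-word grammar in RFC 2047. All names and the sample subjects are hypothetical:

```python
import re

# encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
# Neither charset nor encoded-text may contain "?" or whitespace.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]+\?=")


def is_encoded_word(token: str) -> bool:
    """Strict check: the whole token must be an encoded-word."""
    return ENCODED_WORD.fullmatch(token) is not None


def contains_encoded_word(token: str) -> bool:
    """Lenient check, similar in spirit to the strstr() approach."""
    return ENCODED_WORD.search(token) is not None


# Made-up subjects: one well-formed, one embedded in a larger token,
# one with "Re: " (and therefore a space) inside the encoded-word.
for subject in ("=?UTF-8?B?0KLQtdGB0YI=?=",
                "Foo=?iso-8859-1?Q?abc?=Bar",
                "=?iso-8859-5?Q?Re: =C2=D5=E1=E2?="):
    for token in subject.split():
        print(repr(token), is_encoded_word(token), contains_encoded_word(token))
```

Only the first subject passes the strict check. The second passes only the lenient search, and the third splits into two tokens, neither of which is an encoded-word by itself.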

Comment 20 Christophe Lambin 2004-04-02 18:19:10 UTC

Don't have access to that newsgroup, but from the little evidence in the
screenshot, I think fejj is right. This bug is tracking that:
http://bugzilla.gnome.org/show_bug.cgi?id=116269

Comment 21 Jeffrey Stedfast 2010-09-26 13:02:08 UTC

fwiw, gmime has been greatly improved in the area of decoding brokenly encoded rfc2047 tokens.