GNOME Bugzilla – Bug 315513
make evolution less strict on decoding multibyte characters
Last modified: 2007-12-26 00:27:42 UTC
Please describe the problem:
Hello. I received mail with the following header (please look at the Subject field below):

++++++++++message++++++++++++++++
From XXX@rcline.ru Thu Sep 8 07:53:31 2005
Return-Path: <XXX@rcline.ru>
Delivered-To: YYY@rcline.ru
Received: (qmail 45607 invoked from network); 8 Sep 2005 07:53:31 -0000
Received: from unknown (HELO smtp-1.masterhost.ru) (83.222.24.101) by mx1.masterhost.ru with SMTP; 8 Sep 2005 07:53:31 -0000
Received: (qmail 44988 invoked from network); 8 Sep 2005 07:53:15 -0000
Received: from unknown (HELO ?172.17.0.8?) (XXX@rcline.ru@213.234.228.114) by smtp1.masterhost.ru with SMTP; 8 Sep 2005 07:53:15 -0000
Date: Thu, 8 Sep 2005 11:53:28 +0400
From: =?utf-8?Q?=D0=9F=D0=B5=D1=86=D0=BA=D0=B0_=D0=9F=D0=B5=D1=82=D1=80?= <XXX@rcline.ru>
X-Mailer: The Bat! (v3.51.10) UNREG / CD5BF9353B3B7091
Reply-To: =?utf-8?Q?=D0=9F=D0=B5=D1=86=D0=BA=D0=B0_=D0=9F=D0=B5=D1=82=D1=80?= <XXX@rcline.ru>
X-Priority: 3 (Normal)
Message-ID: <1596600470.20050908115328@rcline.ru>
To: Peter <YYY@rcline.ru>
Subject: =?utf-8?Q?Re=5B2=5D=3A_=D0=9F=D0=BB=D0=B0=D0=BD_=D0=BF=D0=BE_=D0=B2=D0=B0?=
 =?utf-8?Q?=D0=BB=D1=83=2C_=D0=B1=D1=83=D0=B4=D0=B5=D0=BC_=D1=81=D1=82=D0?=
 =?utf-8?Q?=B0=D0=B2=D0=B8=D1=82=D1=8C_=D0=B3=D0=B0=D0=BB=D0=BE=D1=87=D0?=
 =?utf-8?Q?=BA=D0=B8_=D1=87=D1=82=D0=BE_=D1=81=D0=B4=D0=B5=D0=BB=D0=B0=D0?=
 =?utf-8?Q?=BD=D0=BE=2E?=
In-Reply-To: <1126165186.8060.19.camel@localhost>
References: <1689165981.20050908000525@rcline.ru> <1126165186.8060.19.camel@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
X-SpamTest-Info: Profile: Formal (268/050901)
X-SpamTest-Info: Profile: based on Detect Hard No RBL (4/030526)
X-SpamTest-Info: Profile: SysLog
X-SpamTest-Status: Not detected
X-SpamTest-Version: SMTP-Filter Version 2.1.0 [0148], SpamtestISP/Release
X-Evolution-Source: pop://YYY%40rcline.ru@pop.masterhost.ru/
Content-Transfer-Encoding: 8bit
++++++++++end of message++++++++++++++++

The Subject field consists of multiple lines. And the following is what I see in Evolution as the Subject:

Subject: Re[2]: План по ва =?utf-8?Q?=D0=BB=D1=83=2C_=D0=B1=D1=83=D0=B4=D0=B5=D0=BC_=D1=81=D1=82=D0?= =?utf-8?Q?=B0=D0=B2=D0=B8=D1=82=D1=8C_=D0=B3=D0=B0=D0=BB=D0=BE=D1=87=D0?= =?utf-8?Q?=BA=D0=B8_=D1=87=D1=82=D0=BE_=D1=81=D0=B4=D0=B5=D0=BB=D0=B0=D0?= =?utf-8?Q?=BD=D0=BE=2E?=

Thus only the first line is shown correctly (the first letters of the subject, in Russian). I think all the others should be shown as well.

Steps to reproduce:
Actual results:
Expected results:
Does this happen every time?
Other information: If you need it, I can send you the whole message by mail as an attachment.
unless something got broken in the code, it does handle this - but what is probably happening is that the other encoded tokens do not decode properly (e.g. they are invalidly encoded, don't actually fit into UTF-8 like they claim, etc) anyway, I'll leave this up to a current mail hacker to look into.
adding I18N keyword. could you attach a sample message if it does not contain confidential data? e.g. change the mail addresses to "example@example.com" or something... thanks in advance.
Created attachment 52451: sample message that reproduces the problem. Hello. Sorry for the long silence. Here is my message. Peter.
yeah, right. confirming, though i do not know if it's evolution's fault or if it's just badly encoded by the sender's email program.
I just wanted to add that some subjects are completely unreadable:

=?utf-8?B?W1VUTTUgMDAwMDcxNF06INC90LUg0YDQsNCx0L7RgtCw0LXRgiDQv9C10YfQ?=
=?utf-8?B?sNGC0Ywg0LDQutGC0L7QsiDQstGL0L/QvtC70L3QtdC90L3Ri9GFINGA0LDQ?=
=?utf-8?B?sdC+0YI=?=
I just looked into this... the problem is that the second and third encoded-word tokens contain chars that are not valid utf-8 and so evolution bails as it has no idea what charset they are actually in (and thus, no matter what it does - the subject will be corrupted) this is a problem with the sending mailer (unless it is a bug in GNU libc's charset library)
this is what I get when I try to manually decode the subject: Subject: Re[2]: План по валу, будем ст??вить галоч??и что сдела??о. the ?'s are place holders for invalid characters
Sorry, but I did not manage to reproduce your result. It's possible that I'm wrong, but I managed to get the correct result: I took all the strings, removed =?utf-8?Q? and ?= from each, and concatenated them into one long string. After that I ran qprint -d on it:

qprint -d Re=5B4=5D=3A_=D0=9F=D0=BB=D0=B0=D0=BD_=D0=BF=D0=BE_=D0=B2=D0=B0=D0=BB=D1=83=2C_=D0=B1=D1=83=D0=B4=D0=B5=D0=BC_=D1=81=D1=82=D0=B0=D0=B2=D0=B8=D1=82=D1=8C_=D0=B3=D0=B0=D0=BB=D0=BE=D1=87=D0=BA=D0=B8_=D1=87=D1=82=D0=BE_=D1=81=D0=B4=D0=B5=D0=BB=D0=B0=D0=BD=D0=BE=2E
Re[4]:_План_по_валу,_будем_ставить_галочки_что_сделано.

The strings from comment 5 are encoded differently, but it's possible to extract the Subject from them too, with the base64 utility. Did I miss anything? Thank you.
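The manual procedure in this comment (strip the =?charset?encoding? and ?= wrappers, then join what is left) can be sketched in C roughly as follows. This is only an illustration of the steps described; join_encoded_text is a hypothetical helper name, and the sketch ignores the charset and encoding labels of the individual tokens, exactly as the manual procedure does.

```c
#include <stddef.h>
#include <string.h>

/* Copy the encoded-text of every "=?charset?X?text?=" token found in
 * header into out, back to back, and return the number of tokens seen.
 * out must have room for strlen(header) + 1 bytes.
 * Hypothetical helper for illustration; the charset and encoding labels
 * are skipped without being checked. */
size_t join_encoded_text(const char *header, char *out)
{
    size_t count = 0;
    const char *p = header;

    *out = '\0';
    while ((p = strstr(p, "=?")) != NULL) {
        /* skip the charset name */
        const char *charset_end = strchr(p + 2, '?');
        if (charset_end == NULL)
            break;
        /* skip the encoding letter ("Q" or "B") */
        const char *enc_end = strchr(charset_end + 1, '?');
        if (enc_end == NULL)
            break;
        /* the encoded-text runs up to the closing "?=" */
        const char *text = enc_end + 1;
        const char *end = strstr(text, "?=");
        if (end == NULL)
            break;
        strncat(out, text, (size_t) (end - text));
        count++;
        p = end + 2;
    }
    return count;
}
```

The joined string can then be handed to any quoted-printable decoder, which is what qprint -d did above.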
I don't know what qprint is (I don't have it), but it obviously isn't doing any checks on the output... after decoding the quoted-printable, I fed it to the following piece of code:

    char *p = (char *) decoded;
    int len = declen;

    while (!g_utf8_validate (p, len, (const char **) &p)) {
        len = declen - (p - (char *) decoded);
        *p = '?';
    }

and then I printed it out... (hence the ? marks)

the g_utf8_validate() function exists inside glib and is what glib/gtk use to validate unicode text. if the text does not pass, then it does not display properly (it will truncate at the first bad character).

now... evolution doesn't actually use g_utf8_validate() on the output after it qp decodes. Instead, it feeds the text into iconv() to convert to UTF-8, so we effectively have the following code in evolution.

in the following code, inbuf is the input string (the qp-decoded text of one of the encoded-word tokens in the subject field), inlen is the length of the input string, and charset is the name of the input charset (in your case, "utf-8"):

    char *outbase, *outbuf;
    size_t ret, outlen;
    iconv_t ic;

    outlen = inlen * 6 + 16;
    outbase = g_alloca (outlen);
    outbuf = outbase;

    if ((ic = iconv_open ("UTF-8", charset)) != (iconv_t) -1) {
        ret = iconv (ic, &inbuf, &inlen, &outbuf, &outlen);
        if (ret != (size_t) -1) {
            iconv (ic, NULL, 0, &outbuf, &outlen);
            *outbuf = '\0';
            return g_strdup (outbase);
        } else {
            /* charset conversion failed, illegal characters in the input */
        }
    } else {
        /* failed to open charset conversion */
    }
aha, I see the problem... the problem is that the sending client broke encoded words in the middle of a multibyte encoded character. see here, original encoded text:

Subject: =?utf-8?Q?Re=5B2=5D=3A_=D0=9F=D0=BB=D0=B0=D0=BD_=D0=BF=D0=BE_=D0=B2=D0=B0?=
 =?utf-8?Q?=D0=BB=D1=83=2C_=D0=B1=D1=83=D0=B4=D0=B5=D0=BC_=D1=81=D1=82=D0?=
 =?utf-8?Q?=B0=D0=B2=D0=B8=D1=82=D1=8C_=D0=B3=D0=B0=D0=BB=D0=BE=D1=87=D0?=
 =?utf-8?Q?=BA=D0=B8_=D1=87=D1=82=D0=BE_=D1=81=D0=B4=D0=B5=D0=BB=D0=B0=D0?=
 =?utf-8?Q?=BD=D0=BE=2E?=

if you look at the encoded-word tokens, you'll notice that the multibyte sequences all begin with \xD0 or \xD1 (aka =D0 or =D1). you'll also notice that the second encoded-word token ENDS with =D0 and the third encoded-word token BEGINS in the MIDDLE of a multi-byte sequence... hence, starting with the second encoded-word token, the charset conversion routine fails because of incomplete multibyte sequences.

Let's see what the RFC has to say about this (I already know, but just so I don't get any arguments, I'll paste it here):

rfc2047, section 5 part 3:

The 'encoded-text' in an 'encoded-word' must be self-contained; 'encoded-text' MUST NOT be continued from one 'encoded-word' to another. This implies that the 'encoded-text' portion of a "B" 'encoded-word' will be a multiple of 4 characters long; for a "Q" 'encoded-word', any "=" character that appears in the 'encoded-text' portion will be followed by two hexadecimal characters.

Each 'encoded-word' MUST encode an integral number of octets. The 'encoded-text' in each 'encoded-word' must be well-formed according to the encoding specified; the 'encoded-text' may not be continued in the next 'encoded-word'. (For example, "=?charset?Q?=?= =?charset?Q?AB?=" would be illegal, because the two hex digits "AB" must follow the "=" in the same 'encoded-word'.)

Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.
the sending client clearly fails to conform to the last sentence
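To make the split visible, here is a minimal sketch in plain C (no glib) that Q-decodes the encoded-text of one encoded-word and reports whether the decoded octets stop in the middle of a UTF-8 sequence. The names q_decode and utf8_dangling_tail are made up for this illustration and are not taken from Evolution or glib; the sketch only looks at the tail of the buffer and is not a full UTF-8 validator.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode the "Q" encoded-text of a single encoded-word: '_' stands for a
 * space and "=XX" for the octet 0xXX. out must have room for strlen(in)
 * bytes. Returns the number of octets written. (Hypothetical helper.) */
size_t q_decode(const char *in, unsigned char *out)
{
    size_t n = 0;

    while (*in != '\0') {
        if (*in == '_') {
            out[n++] = ' ';
            in++;
        } else if (*in == '=' && hex_value(in[1]) >= 0 && hex_value(in[2]) >= 0) {
            out[n++] = (unsigned char) (hex_value(in[1]) * 16 + hex_value(in[2]));
            in += 3;
        } else {
            out[n++] = (unsigned char) *in++;   /* literal octet, copied as-is */
        }
    }
    return n;
}

/* Number of octets at the end of buf that start a UTF-8 sequence but do
 * not finish it; 0 means the buffer ends on a character boundary (or the
 * tail is not a truncated sequence at all). (Hypothetical helper.) */
size_t utf8_dangling_tail(const unsigned char *buf, size_t len)
{
    size_t back = 0;

    /* Step back over at most 3 continuation octets (10xxxxxx). */
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0;               /* no lead octet in sight */

    unsigned char lead = buf[len - 1 - back];
    size_t need;                /* sequence length announced by the lead octet */

    if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;              /* ASCII or not a lead octet */

    return (back + 1 < need) ? back + 1 : 0;
}
```

Fed the encoded-text of the second encoded-word above, q_decode produces a buffer whose last octet is 0xD0, and utf8_dangling_tail reports one dangling octet; for the first encoded-word it reports zero.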
Jeffrey, thank you for your detailed answer. That is what I was afraid of. But is there any chance of changing the decoding behavior? Could we consider this a feature request? Some people are not progressive and use old programs... Some programs have bugs, and we have to communicate with the people using them. I keep old mail with corrupted subjects, so this problem will not go away until I decide to remove that mail... There are mail clients which do not have this problem, so maybe evolution should be less strict on decoding? I reopened the bug, but that does not mean I'm forcing anybody to fix this. I just want to keep this on record, as it does not seem so hard to fix, and of course it would make evolution a bit better for end-users. :)
please tell people to not use ancient software then. i'd say that *all* programs have bugs (evolution has also many of them), that's why we all file bugs against the affected programs and not against all the other programs that do things right... i would close this here as WONTFIX.
Andre, not all people are computer gurus. In the science community most people do not care to upgrade anything while it works. Some people are not so young and are far from knowing even how to install programs; they just use them. They never read about security problems and are not going to change their habits. While I can change my own environment, it's hard to change people who hardly know me and live in other cities.

This is a really big problem for me, as normally after I receive such mail I have to look for a workaround to find out what the subject was and what the mail itself was. Yes, I know that the mail does not follow some RFC, but my supervisor does not have such problems with "but" - mail client in windows... Most of the people I have to communicate with do not have such problems, and personally I'd like to have fewer problems in evolution...

We have an RFC which tells us what we MUST do, and that is what we should expect to receive from others. But consider bugs in software, and consider old software on the other side. What do we do when what we received and need to read does not follow our expectations? Mailing "Please, upgrade your mail client dude, or file a bug and wait until it is resolved" is not an option. You need to communicate right now. I'd say evolution should send everything following the RFC, but it cannot control what it receives, and thus it should do its best to show us the mail... Without this it's hard to use it or recommend it in business or science, in fields where people just use computers and do not care to know how they work. They just remove a program that does not work and install another...

While of course there are many other, more important problems, let's keep this at low or even lower than low priority. But I'd like to have this bug open, in the hope that one day such problems will be fixed too, and by that time evolution will dominate the world :)
the problem with trying to work around this particular bug is that it requires the parser to string together the raw qp/b64 decoded content of multiple encoded-word tokens before converting the charset... once you understand why each token has its own charset defined, you can easily see how this can't possibly work :)

imagine, if you will, that you have:

=?utf-8?Q?<some encoded text>?= =?iso-8859-1?Q?<more encoded text>?= =?koi8-r?Q?<yet more text>?=

there are 3 charsets there... and that is a perfectly valid string of encoded-word tokens that can (and does) occur in the Real World. You can't just combine the qp-decoded content of each encoded-word token and then simply convert to UTF-8... it's impossible. Hence the rules outlined in the RFC.

(Note: if you are wondering where the "convert to UTF-8" requirement comes from, Evolution uses UTF-8 internally for strings because it can represent any character)

Now... my guess is that the mail clients that actually handle that (wrongly) implemented their decoders the way you decoded the text in comment #8, with the baseless assumption that each encoded-word token is in the same charset as the first... and seeing as how I've had users submit bugs to me for a MIME encoder project of mine (unrelated to Evolution) under the assumption that it is not legal to encode different tokens with different charsets (because some mailer doesn't handle it), I know for a fact that broken mailers like that do in fact exist, sadly...

my point is that in order to handle this kind of brokenness, Evolution's parser would either have to make assumptions that would break for other, VALID, encodings (which, imho, is far worse than not handling some other mail client's broken encoding scheme) or else add a fair bit of complexity in order to try and awkwardly work around this particular sending client's bug, which just makes the Evolution code more prone to bugs (which is also not good).

so it's not quite as "easy" as you envisioned ;)
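The extra complexity mentioned above would include, at the least, a guard that only strings raw octets together when neighbouring tokens carry the same labels. A rough sketch of such a guard follows, assuming two adjacent encoded-word tokens are already isolated as C strings. can_merge and prefix_length are hypothetical names, and this is not a description of how Evolution actually handles (or later worked around) the problem.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the "=?charset?X?" prefix of an encoded-word, or 0 if the
 * token does not start like one. (Hypothetical helper.) */
static size_t prefix_length(const char *word)
{
    const char *q;

    if (word[0] != '=' || word[1] != '?')
        return 0;
    q = strchr(word + 2, '?');          /* end of the charset name */
    if (q == NULL || q == word + 2)
        return 0;
    if (q[1] == '\0' || q[2] != '?')    /* one-letter encoding, then '?' */
        return 0;
    return (size_t) (q - word) + 3;
}

/* Nonzero when two encoded-words declare the same charset and the same
 * encoding (compared case-insensitively). Only in that case could their
 * raw decoded octets be concatenated before one charset conversion;
 * tokens with different labels must still be converted one by one. */
int can_merge(const char *a, const char *b)
{
    size_t la = prefix_length(a), lb = prefix_length(b);
    size_t i;

    if (la == 0 || la != lb)
        return 0;
    for (i = 0; i < la; i++) {
        if (tolower((unsigned char) a[i]) != tolower((unsigned char) b[i]))
            return 0;
    }
    return 1;
}
```

Under these assumptions the utf-8 / iso-8859-1 / koi8-r example above never merges, while the five utf-8 "Q" tokens from the reported Subject would.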
Discussion about this bug takes place here. The discussion contains a patch, but a.t.m. the patch is flawed (has buffer overflows): http://mail.gnome.org/archives/evolution-hackers/2007-December/msg00061.html
actually, I don't believe that patch fixes this particular problem.
FWIW, even Thunderbird doesn't handle multi-byte characters split across multiple encoded-word tokens. I was just checking out their implementation in mozilla/netwerk/src/nsMIMEHeaderParamImpl.cpp:DecodeRFC2047Str()
worked around in svn