GNOME Bugzilla – Bug 536457
RFC2047 encoded recipients from gmail imap not parsed properly
Last modified: 2008-09-28 11:36:54 UTC
Please describe the problem:
When camel parses recipient headers that arrive rfc2047-encoded from gmail imap, they are not parsed correctly. Gmail delivers them badly encoded.

Steps to reproduce:
1. Send an email from evolution with recipients containing accents that force rfc2047 encoding (for example, myself, José Dapena Paz <address@mail>), to the gmail imap account you have configured in evolution.
2. Fetch new headers from the gmail imap account in evolution.
3. The header of the message you sent is retrieved.

Actual results:
The message list shows the encoded string and does not do the rfc2047 conversion. It looks like this in the headers view:
=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?=

Expected results:
The message list should show the recipient properly, without any rfc2047 formatting:
José Dapena Paz <address@mail>

Does this happen every time?
Yes

Other information:
The problem is that gmail encodes the recipients badly with rfc2047. It wraps the whole string in the encoded-word, instead of only the left part (the display name). Instead of:
=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?=
it should be:
=?ISO-8859-1?Q?Jos=E9_Dapena_Paz?= <address@mail>
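To illustrate the difference between the two forms, here is a minimal Python sketch (not camel code). The names ENCODED_WORD, decode_q and outside_encoded_words are made up for this illustration, and only the "Q" encoding is handled. Both forms decode to the same visible text, but only the correct form leaves the <address@mail> part outside the encoded-word, where an address tokenizer can find it without decoding anything.

```python
import re

# One RFC 2047 encoded-word, "Q" encoding only (enough for this illustration).
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Qq]\?([^?]*)\?=")


def decode_q(match):
    """Decode the payload of a single Q-encoded word."""
    charset, payload = match.groups()
    text = re.sub(r"=([0-9A-Fa-f]{2})",
                  lambda h: chr(int(h.group(1), 16)),
                  payload.replace("_", " "))
    return text.encode("latin-1").decode(charset)


def outside_encoded_words(header):
    """Return what a tokenizer can see without decoding anything."""
    return ENCODED_WORD.sub("", header).strip()


broken = "=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?="
correct = "=?ISO-8859-1?Q?Jos=E9_Dapena_Paz?= <address@mail>"

for header in (broken, correct):
    print(repr(ENCODED_WORD.sub(decode_q, header)),
          repr(outside_encoded_words(header)))
```

For the broken form the second value is empty: the angle-bracketed address is hidden inside the encoded-word.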
I've prepared a patch for tinymail fixing this. We add a special parse workaround for this case. I'll adapt the patch for camel and send for review.
Created attachment 112069
Patch: fix broken rfc2047 recipients from imap

This patch fixes broken rfc2047 recipient headers from imap. It simply moves the trailing ?= so that it comes before the <> part.

Changelog entry would be:
* evolution-data-server/camel/camel-mime-utils.c: Parse properly broken rfc2047 recipient headers sent from gmail imap.
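The idea of moving the trailing ?= can be sketched as follows. This is a Python illustration only, not the attached C patch: the BROKEN pattern and the move_terminator name are assumptions made for this example, and it only covers a single encoded-word that ends in one <...> part.

```python
import re

# Hypothetical pattern: a Q-encoded word whose payload ends in "_<addr>".
BROKEN = re.compile(r"(=\?[^?]+\?[Qq]\?[^?<]*?)_?(<[^<>?]+>)\?=")


def move_terminator(header):
    """Close the encoded-word before the <...> part instead of after it."""
    return BROKEN.sub(r"\1?= \2", header)


broken = "=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?="
print(move_terminator(broken))
```

A header that is already well formed does not match the pattern and is returned unchanged. Anything more complex than the single-address case is also left untouched by this sketch.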
Nice thing about online email services like GMail is, as soon as they fix their server we can remove nasty workarounds like this. What do you think, Jeff?
one problem with this patch (not sure if it's the only one) is that the 'in' string passed to header_decode_mailbox() may contain more than a single address, so the str[r]str() hack is broken. since this is to work around a GMail IMAP problem, it should probably be handled in the IMAP provider. Unfortunately, I just realised that the current IMAP provider uses a header-fetch rather than fetching the ENVELOPE, which means that it's gonna be more problematic to solve since you won't get individually shrink-wrapped addresses :\ (actually, would switching to an ENVELOPE fetch magically fix this?)
The IMAP server must print the ENVELOPE in a specifically formatted way (with each person's name separated from their e-mail address), so yes. But ENVELOPE is not sufficient for what Evolution wants, and the code that accepts these TOP-like pieces of e-mail doesn't cope with ENVELOPE replies.
Confirmed, patch broken when more than one address comes in the "in" string. I'll try to do a better workaround.
Created attachment 112195
Patch: fix broken rfc2047 recipients from imap

New version of the patch. Now it works with the multi-address headers that gmail delivers.
same as bug #537088 ?
This does look like what is happening to me in bug #537088.
(In reply to comment #9)
> This does look like what is happening to me in bug #537088.

Sorry about that, I'm mixing up bugs I've reported. This looks like what I was seeing in bug #536962. Bug #537088 is also using GMail, but is a completely different beast I think.
Did the reporter (or anyone else) report this as a Gmail IMAP "issue" too? I couldn't find this in Gmail's Help Center. (Please note that reporting issues with Gmail isn't very rewarding. I reported bug #517440, but I never got any response from Gmail whatsoever, not even a confirmation that they at least received my report. Since it also didn't show up in their list of known IMAP issues, it's impossible for me to see what has happened with my report.)
*** Bug 523259 has been marked as a duplicate of this bug. ***
*** Bug 536773 has been marked as a duplicate of this bug. ***
0) With evolution 2.22.2 (as currently shipped in Fedora 9) a message sent to:
José Dapena Paz <pebolle@tiscali.nl>
(over Gmail's smtp server and read through Gmail's IMAP server) will have this To header:
To: =?ISO-8859-1?Q?Jos=E9?= Dapena Paz <pebolle@tiscali.nl>
which will be displayed (incorrectly) by Evolution (but only in the message list "header", in the To column) as (copied by hand):
=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_ <pebolle@tiscali.nl>
The To header seems to be the one generated by Evolution, left untouched by Gmail, and displayed incorrectly (in one part of the UI) by Evolution.
1) Could the reporter provide more details? At this stage I'd guess it would be interesting to see which programs/servers are actually involved. For instance, what is the format of the headers when the message is still in Evolution's outbox (try to send with your network interfaces down to have a chance to analyze that).
2) As it stands, I cannot reproduce this bug.
0) I finally managed to reproduce this bug.
1) I'm not sure what the "headers view" (that was mentioned in the bug report) is, but when I started evolution with the "CAMEL_DEBUG=imap" environment variable, the debugging output contained messages like:

Literal: -->Return-Path: [...]
From: Paul Bolle <pebolle@tiscali.nl>
To: =?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<pebolle@tiscali.nl>?=
Content-Type: [...]
<--

2) My comment #14 is no longer relevant. Based on previous comments, I'd have to say this is indeed a bug in the Gmail IMAP server.
3) If (something like) the patch suggested in comment #7 were added, shouldn't we also add:
- some warning message (e.g. "fixed broken rfc2047 encoding in string '$STRING'"); and/or
- a check for an environment variable (say "CAMEL_SKIP_RFC2047_FIX") to disable this (or a similar) workaround?
That would allow us to notice and/or test that Gmail fixed their IMAP servers and the workaround could be dropped.
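The proposal in 3) could look roughly like this. It is written in Python purely for illustration (camel itself is C), CAMEL_SKIP_RFC2047_FIX is only the name suggested above and not an existing setting, and the logger name and the apply_workaround helper are hypothetical. The actual repair is passed in as the fix argument.

```python
import logging
import os

log = logging.getLogger("camel.rfc2047")   # hypothetical logger name
SKIP_VAR = "CAMEL_SKIP_RFC2047_FIX"        # name proposed above, not an existing setting


def apply_workaround(header, fix, environ=None):
    """Run fix(header) unless the skip variable is set; warn when it changed something."""
    environ = os.environ if environ is None else environ
    if SKIP_VAR in environ:
        return header
    fixed = fix(header)
    if fixed != header:
        log.warning("fixed broken rfc2047 encoding in string '%s'", header)
    return fixed
```

With the variable set, broken headers show up unmodified again, which makes it easy to check whether the server still produces them.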
Fejj, can you please look at the above patch ?
I think the header munging should be done in the IMAP code.

decode_mailbox() should not be modifying its input string, you never know if a static read-only string was passed in.
*** Bug 538428 has been marked as a duplicate of this bug. ***
setting the patch status according to comment #17
(In reply to comment #17)
> I think the header munging should be done in the IMAP code
>
> decode_mailbox() should not be modifying its input string, you never know if a
> static read-only string was passed in.

But we can only apply this filtering once the headers are decoded. Where should I add such decoding in imap code? (Just hint me please and I'll try to get the patch asap done :).
Jose, We have a bug for this open and it's taking a bit too long in my opinion, so I approve the patch for Tinymail's camel-lite. We can refactor it to the proper fix that also went into camel-upstream later.
0) Further investigation: gmail spits out rfc2047 encoded headers in chunks of about 40 (or about 60, depending on what you count) characters, each chunk encoded. Example:

Cc: =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?=
 =?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?=
 =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?=
 =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=

(I removed an additional newline (^M) after each line. Added by debugging code?)
1) Python handles these just fine:

>>> from email.header import decode_header
>>> decode_header('=?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?=\
... =?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?=\
... =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?=\
... =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=')
[('"Thomas Hellstr\xf6m" <thomas@xxxxxxxxxxxxxxxx.com>, "Jiri Slaby" <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org', 'iso-8859-1')]

2) So, maybe evolution's rfc2047 decoding is at fault after all.
3) A solution might be to:
- decode all rfc2047 encoded chunks first
- concatenate these chunks and regular chunks to one string
- parse that string into (names and) addresses.
Not sure yet whether that is doable without a major rewrite of camel_header_address_decode() and friends.
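The three steps in 3) can be sketched with Python's standard library, assuming Python 3 (where decode_header returns bytes for encoded chunks). The decode_then_parse name is made up for this illustration, and nothing here reflects how camel parses headers. It only shows the decode-first order of operations on the Cc header above.

```python
from email.header import decode_header
from email.utils import getaddresses


def decode_then_parse(header):
    """Decode every chunk, join the results, then parse the address list."""
    # Steps 1 and 2: decode each chunk and concatenate into one string.
    decoded = "".join(
        chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
        for chunk, charset in decode_header(header)
    )
    # Step 3: parse the decoded text as an ordinary address list.
    return decoded, getaddresses([decoded])


cc = (
    '=?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= '
    '=?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?= '
    '=?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= '
    '=?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?='
)
decoded, addresses = decode_then_parse(cc)
for name, addr in addresses:
    print(repr(name), addr)
```

This works here because the decoded text happens to be a well-formed address list. The sketch makes no attempt to tell decoded characters apart from structural ones.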
that's probably how the python parser is doing it, but that's not the proper way of decoding things and you can end up misparsing valid address lists if you do things that way too (which is worse than misparsing badly formed address lists like your example). evolution's parser is not at fault here, gmail's encoding is completely broken. I suggest making the IMAP code special-case gmail by issuing an ENVELOPE request and using the server-parsed addresses rather than trying to parse them from the raw headers. It might be worth doing that for all servers but some performance regression testing (against multiple IMAP server implementations) would be in order before going through with such a change.
0) comment #23 just arrived before I wanted to comment this:

$ cat camel_header_decode_string.c
#include <stdio.h>
#include <camel/camel.h>

int main (void)
{
    /* Adjacent string literals are concatenated by the compiler. */
    char *in = " =?ISO-8859-1?Q?\"Thomas_Hellstr=F6m\"_<thomas@xxxxxxxxxxxxxx?="
               " =?ISO-8859-1?Q?xx.com>,_\"Jiri_Slaby\"_<jirislaby@xxxxx.co?="
               " =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?="
               " =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=";
    char *out;

    out = camel_header_decode_string(in, NULL);
    printf("in : %s\n", in);
    printf("out: %s\n", out);
    g_free(out);
    return 0;
}

$ gcc camel_header_decode_string.c -g -o camel_header_decode_string $(pkg-config --cflags --libs camel-1.2 gnome-vfs-2.0) -Wall
$ ./camel_header_decode_string
in : =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= =?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?= =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=
out: "Thomas Hellström" <thomas@xxxxxxxxxxxxxxxx.com>, "Jiri Slaby" <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org

1) It does look to me like e-d-s can handle this just like python!
yea, but what happens if the decoded string has commas other than between addresses? :-) "oops" That's why you can't do it the way python does it (and why no serious application that handles mail is written using the python implementation).
0) comma between double quotes:

$ ./camel_header_decode_string
in : =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= =?ISO-8859-1?Q?xx.com>,_"Slaby,_Jiri"_<jirislaby@xxxxx.co?= =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=
out: "Thomas Hellström" <thomas@xxxxxxxxxxxxxxxx.com>, "Slaby, Jiri" <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org

1) comma not quoted:

$ ./camel_header_decode_string
in : =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= =?ISO-8859-1?Q?xx.com>,_Slaby,_Jiri_<jirislaby@xxxxx.co?= =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=
out: "Thomas Hellström" <thomas@xxxxxxxxxxxxxxxx.com>, Slaby, Jiri <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org

2) Not sure what the issue would be: both out strings seem to resemble the sort of headers that evolution has to deal with already: out in 0) is a correct header, out in 1) would be just another incorrect header.
3) That evolution would be more forgiving in handling rfc2047 encoded headers [*] and would also decode rfc2047 at a different stage in the parsing of the (address) headers doesn't seem to change the sort of problems it already has to deal with.
4) I do not yet see an issue here, but chances are you were trying to raise another issue.

* I haven't been able to determine whether gmail's encoding really is invalid or just a different interpretation of rfc2047 (and friends). Besides, even if it is invalid, that doesn't mean evolution shouldn't at least try to parse it.
here's an example for you:

./a.out
in : =?iso-8859-1?q?Hellstr=F6m=2C?= Thomas <thomas@xxx.com>, =?iso-8859-1?q?j=F6seph=40=F6lson=2Ecom?= <joe@realaddr.com>
out: Hellström, Thomas <thomas@xxx.com>, jöseph@ölson.com <joe@realaddr.com>

that's a big friggin "oops" if you try to parse it the python way. This is why developers need to read the spec and not just pull stuff out of their proverbials ;-)
0) Another example:

./camel_header_decode_string
in : Hellstrom, Thomas <thomas@xxx.com>, joseph@olson.com <joe@realaddr.com>
out: Hellstrom, Thomas <thomas@xxx.com>, joseph@olson.com <joe@realaddr.com>

1) The example here in 0), the example in comment #27 and the example in comment #26 in 1) all have unquoted commas. (The example of comment #27 doesn't really differ that much from example 1) in comment #26.) As far as I can tell all those (address) headers are thus invalid. Why should the fact that some chunks of two of those three headers are rfc2047 encoded matter?
because addresses are parsed according to the tokenization rules expressed in the BNF grammar of rfc0822

In my example, the original string would be parsed thusly:

word token: =?iso-8859-1?q?Hellstr=F6m=2C?=
LWSP token: SPACE
word token: Thomas
LWSP token: SPACE
CHAR token: <
word token: thomas
CHAR token: @
word token: xxx
CHAR token: .
word token: com
CHAR token: >
CHAR token: ,

at this point, you can piece together what you got:

the name will be composed of the following tokens:
=?iso-8859-1?q?Hellstr=F6m=2C?= (which, when decoded, becomes "Hellström,")
SPACE
Thomas

the address will be comprised of these tokens:
thomas
@
xxx
.
com

thus, we get:
name = "Hellström, Thomas";
addr = "thomas@xxx.com";

before you waste any more cycles meaninglessly, I'll advise you to read http://www.ietf.org/rfc/rfc0822.txt

Once you have read that and understood the BNF grammar in it (specifically the bits in Section 6.1 as that is the most relevant portion), you should then continue on to reading http://www.ietf.org/rfc/rfc2047.txt

Pay close attention to Section 5.3
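The tokenize-first order described above can be sketched in Python. This is a simplified illustration, not camel's parser: TOKEN, decode_word and parse_mailboxes are names invented here, only "Q" encoded-words are decoded, and groups, comments and routes from the rfc0822 grammar are ignored. The point is that the raw header is split into tokens before anything is decoded, so an encoded comma or @ can never act as a separator.

```python
import re

ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Qq]\?([^?]*)\?=")
# Quoted strings, the specials that matter here, or a run of other characters.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[<>,@.]|[^\s<>,@."]+')


def decode_word(word):
    """Decode one token if it is a Q-encoded word, else return it unchanged."""
    m = ENCODED_WORD.fullmatch(word)
    if not m:
        return word
    charset, payload = m.groups()
    text = re.sub(r"=([0-9A-Fa-f]{2})",
                  lambda h: chr(int(h.group(1), 16)),
                  payload.replace("_", " "))
    return text.encode("latin-1").decode(charset)


def parse_mailboxes(header):
    """Tokenize the raw header first, then decode only the name words."""
    mailboxes, name, addr, in_angle = [], [], [], False

    def flush():
        if addr:
            mailboxes.append((" ".join(name), "".join(addr)))
        elif name:  # bare addr-spec such as user@example.org
            mailboxes.append(("", "".join(name)))
        name.clear()
        addr.clear()

    for tok in TOKEN.findall(header):
        if tok == "<":
            in_angle = True
        elif tok == ">":
            in_angle = False
        elif tok == "," and not in_angle:
            flush()                      # structural comma: end of one mailbox
        elif in_angle:
            addr.append(tok)
        elif tok.startswith('"'):
            name.append(tok[1:-1])       # quoted-string: taken literally
        else:
            name.append(decode_word(tok))
    flush()
    return mailboxes


header = ("=?iso-8859-1?q?Hellstr=F6m=2C?= Thomas <thomas@xxx.com>, "
          "=?iso-8859-1?q?j=F6seph=40=F6lson=2Ecom?= <joe@realaddr.com>")
print(parse_mailboxes(header))

# Decoding first produces text with an extra comma and an extra "@".
decoded_first = ENCODED_WORD.sub(lambda m: decode_word(m.group(0)), header)
print(decoded_first)
```

The raw header contains one structural comma, while the decoded-first string contains two, which is where a decode-then-parse approach goes wrong.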
Created attachment 115669
simple hack to work around gmail's rfc2047 encoding interpretation

0) Notwithstanding the advice to not "waste any more cycles meaninglessly" I wrote a simple hack to test my suggestion. (Patch against 2.22.3.)
1) This patch now renders all (previously) troublesome address headers correctly. (As far as the legit messages in my folders are concerned. But I even tested that with a number of spam messages with - for me - unreadable headers (in some Asian language). Even those seem to come out correct (evolution then renders those identical to gmail's web interface in firefox).)
2) I haven't yet run into regressions, but that doesn't mean this patch doesn't break anything else. Still, people who have run into this problem might try whether this works for them too, for instance as long as a patch along the lines discussed in comment #4, comment #5 and comment #23 has not been released.
3) If the patch does break something one should be able to clean up the mess by just deleting the "summary" in your IMAP folders and having an unpatched version of evolution regenerate those files (when contacting gmail again).
Created attachment 115670
simple hack to work around gmail's rfc2047 encoding interpretation

same patch, forward ported to trunk (entirely untested!)
2) I've already given you a valid example of where your patch introduces regressions
> 2) I've already given you a valid example of where your patch introduces > regressions Replying to this comment will add little to what I already stated under 1) and 2) in comment #30.
*** Bug 531698 has been marked as a duplicate of this bug. ***
*** Bug 532825 has been marked as a duplicate of this bug. ***
I just noticed a new (at least to me; it is hard to say when this was added) item in Gmail's known IMAP issues: "Non-Latin characters can corrupt message headers. Message headers contain technical information necessary for the successful delivery of messages between email servers. Gmail's IMAP implementation re-encodes the information stored in message headers, but non-ASCII characters may become garbled. For example, this can affect the 'To:' line in an email message if a name is written in a language that uses non-Latin characters. Several issues can result from corrupt message headers, including delivery problems. The Gmail Team is working to resolve this issue." (See: http://mail.google.com/support/bin/answer.py?answer=78771&topic=12922). This seems to cover this bug. So the easiest to implement solution (wait until the Gmail server is fixed) is getting more appealing. Could this bug just be resolved as NOTGNOME?
NOTGNOME sounds good to me.
*** Bug 552388 has been marked as a duplicate of this bug. ***
*** Bug 552898 has been marked as a duplicate of this bug. ***
*** Bug 554153 has been marked as a duplicate of this bug. ***