Bug 536457 – RFC2047 encoded recipients from gmail imap not parsed properly


Summary:	RFC2047 encoded recipients from gmail imap not parsed properly


Status:	RESOLVED NOTGNOME

Product:	evolution-data-server
Classification:	Platform
Component:	Mailer
Version:	2.22.x (obsolete)
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	evolution-mail-maintainers
QA Contact:	Evolution QA team

URL:
Whiteboard:

Duplicates:	523259 531698 532825 536773 538428 552388 552898 554153
Depends on:
Blocks:

Reported:	2008-06-03 16:33 UTC by Jose Dapena Paz
Modified:	2008-09-28 11:36 UTC

See Also:
GNOME target:	---
GNOME version:	2.21/2.22

Attachments
Patch: fix broken rfc2047 recipients from imap (956 bytes, patch) 2008-06-03 16:55 UTC, Jose Dapena Paz	none
Patch: fix broken rfc2047 recipients from imap (1.50 KB, patch) 2008-06-05 08:25 UTC, Jose Dapena Paz	needs-work
simple hack to work around gmail's rfc2047 encoding interpretation (933 bytes, patch) 2008-08-01 11:37 UTC, Paul Bolle	rejected
simple hack to work around gmail's rfc2047 encoding interpretation (978 bytes, patch) 2008-08-01 11:39 UTC, Paul Bolle	rejected

Description Jose Dapena Paz 2008-06-03 16:33:02 UTC

Please describe the problem:
Camel does not correctly parse rfc2047-encoded recipient headers fetched from gmail imap, because gmail delivers them badly encoded.

Steps to reproduce:
1. Send an email from evolution with a recipient whose name contains accented characters, which forces rfc2047 encoding (for example, myself: José Dapena Paz <address@mail>), to the gmail imap account you have configured in evolution
2. Fetch new headers from the gmail imap in evolution
3. The header of the message you sent is retrieved.


Actual results:
The message list shows the string still encoded and does not do the rfc2047 conversion. It looks like this in the headers view:
=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?=

Expected results:
The message list should show the recipient properly, without any rfc2047 encoding:
José Dapena Paz <address@mail>

Does this happen every time?
Yes

Other information:
The problem is that gmail encodes the recipients badly with rfc2047. It wraps the whole string in the encoded-word, instead of only the left part (the display name).

Instead of:
=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?=
it should be:
=?ISO-8859-1?Q?Jos=E9_Dapena_Paz?= <address@mail>
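
A minimal Python sketch of how the two forms can be told apart (illustrative only, not Camel code; the helper name `invalid_phrase_words` is made up for this example). It assumes the rule in RFC 2047 section 5 that a Q-encoded word inside a phrase may only contain letters, digits and `!*+-/=_`, so a raw `<` or `@` in the payload marks the word as malformed:

```python
import re

# One RFC 2047 encoded-word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")

# Characters allowed in a Q-encoded word that appears inside a phrase
# (such as a display name), per RFC 2047 section 5, rule 3.
PHRASE_SAFE_Q = re.compile(r"[A-Za-z0-9!*+\-/=_]*")


def invalid_phrase_words(header):
    """Return the Q-encoded words whose payload contains characters
    that are not allowed in a phrase (for example a raw '<' or '@')."""
    bad = []
    for match in ENCODED_WORD.finditer(header):
        encoding, payload = match.group(2), match.group(3)
        if encoding in "Qq" and not PHRASE_SAFE_Q.fullmatch(payload):
            bad.append(match.group(0))
    return bad


broken = "=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?="
correct = "=?ISO-8859-1?Q?Jos=E9_Dapena_Paz?= <address@mail>"

print(invalid_phrase_words(broken))   # the whole encoded-word is flagged
print(invalid_phrase_words(correct))  # nothing is flagged
```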

Comment 1 Jose Dapena Paz 2008-06-03 16:40:10 UTC

I've prepared a patch for tinymail fixing this. We add a special parse workaround for this case. I'll adapt the patch for camel and send it for review.

Comment 2 Jose Dapena Paz 2008-06-03 16:55:15 UTC

Created attachment 112069
Patch: fix broken rfc2047 recipients from imap

This patch fixes broken rfc2047 recipient headers from imap. It simply moves the trailing ?= so that it comes before the <> part.

Changelog entry would be:
* evolution-data-server/camel/camel-mime-utils.c:
  Parse properly broken rfc2047 recipient headers sent from gmail imap.
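
The idea of moving the trailing ?= can be sketched in Python as follows. This is not the attached patch (which is C code in camel-mime-utils.c); the regular expression and the name `move_terminator` are assumptions made for illustration, and the sketch only covers a single address inside a single encoded-word:

```python
import re

# Matches a Q-encoded word that has swallowed a trailing "<addr>":
#   =?charset?Q?name_<addr>?=
SWALLOWED = re.compile(r"(=\?[^?]+\?[Qq]\?[^?<]*?)_?<([^<>?]+)>\?=")


def move_terminator(header):
    """Rewrite '=?cs?Q?name_<addr>?=' as '=?cs?Q?name?= <addr>'.

    Headers that do not match the pattern are returned unchanged.
    """
    return SWALLOWED.sub(r"\1?= <\2>", header)


print(move_terminator("=?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<address@mail>?="))
```

A header that is already well formed does not match the pattern and passes through untouched. An encoded-word holding several addresses does not match either, so this sketch leaves it unrepaired.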

Comment 3 Matthew Barnes 2008-06-03 17:26:35 UTC

Nice thing about online email services like GMail is, as soon as they fix their server we can remove nasty workarounds like this.

What do you think, Jeff?

Comment 4 Jeffrey Stedfast 2008-06-04 15:30:30 UTC

one problem with this patch (not sure if it's the only one) is that the 'in' string passed to header_decode_mailbox() may contain more than a single address, so the str[r]str() hack is broken.

since this is to work-around a GMail IMAP problem, it probably should be handled in the IMAP provider.

Unfortunately, I just realised that the current IMAP provider uses a header-fetch rather than fetching the ENVELOPE, which means that it's gonna be more problematic to solve since you won't get individually shrink-wrapped addresses :\

(actually, would switching to an ENVELOPE fetch magically fix this?)

Comment 5 Philip Van Hoof 2008-06-04 16:09:12 UTC

The IMAP server must print the ENVELOPE in a specifically formatted way (with the name of each person separated from their e-mail address), so yes.

But ENVELOPE is not sufficient for what Evolution wants. And the code that accepts these TOP-like pieces of E-mail doesn't cope with ENVELOPE replies.
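
For reference, an ENVELOPE address in IMAP (RFC 3501) is a four-field structure: personal name, source route, mailbox name and host name. A minimal Python sketch of why that helps, assuming the response has already been parsed into a tuple (the sample tuple and both helper names are hypothetical, and nothing here shows what Gmail actually returns):

```python
from email.header import decode_header


def decode_words(value):
    """Decode any RFC 2047 encoded-words in value and return plain text."""
    out = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        out.append(chunk)
    return "".join(out)


def envelope_address(fields):
    """Turn one parsed ENVELOPE address into (display name, addr-spec).

    fields is (personal name, source route, mailbox, host); None stands
    for NIL. The name never has to be separated from the address by a
    header parser, because the server already did that.
    """
    name, _route, mailbox, host = fields
    display = decode_words(name) if name else ""
    return display, f"{mailbox}@{host}"


# Hypothetical, already-parsed address structure.
sample = ("=?ISO-8859-1?Q?Jos=E9_Dapena_Paz?=", None, "address", "mail")
print(envelope_address(sample))
```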

Comment 6 Jose Dapena Paz 2008-06-05 06:45:19 UTC

Confirmed, patch broken when more than one address comes in the "in" string. I'll try to do a better workaround.

Comment 7 Jose Dapena Paz 2008-06-05 08:25:35 UTC

Created attachment 112195
Patch: fix broken rfc2047 recipients from imap

New version of the patch. Now it works with the multiple-address headers that gmail delivers.

Comment 8 Srinivasa Ragavan 2008-06-09 04:43:21 UTC

same as bug #537088 ?

Comment 9 Jeroen Hoek 2008-06-09 10:27:35 UTC

This does look like what is happening to me in bug #537088.

Comment 10 Jeroen Hoek 2008-06-09 10:31:26 UTC

(In reply to comment #9)
> This does look like what is happening to me in bug #537088.
> 

Sorry about that, I'm mixing up bugs I've reported. This looks like what I was seeing in bug #536962. Bug #537088 is also using GMail, but is a completely different beast I think.

Comment 11 Paul Bolle 2008-06-11 10:17:13 UTC

Did the reporter (or anyone else) report this as a Gmail IMAP "issue" too? I couldn't find this in Gmail's Help Center. 

(Please note that reporting issues with Gmail isn't very rewarding. I reported bug #517440, but I never got any response from Gmail whatsoever, not even a confirmation that they at least received my report. Since it also didn't show up in their list of known IMAP issues, it's impossible for me to see what has happened with my report.)

Comment 12 Jeffrey Stedfast 2008-06-11 15:29:16 UTC

*** Bug 523259 has been marked as a duplicate of this bug. ***

Comment 13 Jonas Eberle 2008-06-11 15:47:24 UTC

*** Bug 536773 has been marked as a duplicate of this bug. ***

Comment 14 Paul Bolle 2008-06-11 20:10:27 UTC

0) With evolution 2.22.2 (as currently shipped in Fedora 9) a message sent to:
    José Dapena Paz <pebolle@tiscali.nl>

(over Gmail's smtp server and read through Gmail's IMAP server) will have this To header:
    To: =?ISO-8859-1?Q?Jos=E9?= Dapena Paz <pebolle@tiscali.nl>

which will be displayed (incorrectly) by Evolution (but only in the message list "header", in the To column) as (copied by hand):
    =?ISO-8859-1?Q?Jos=E9_Dapena_Paz_ <pebolle@tiscali.nl>

The To header seems to be the one generated by Evolution, left untouched by Gmail, and displayed incorrectly (in one part of the UI) by Evolution. 

1) Could the reporter provide more details? At this stage I'd guess it would be interesting to see which programs/servers are actually involved. For instance, what is the format of the headers when the message is still in Evolution's outbox? (Try to send with your network interfaces down to have a chance to analyze that.)

2) As it stands, I cannot reproduce this bug.

Comment 15 Paul Bolle 2008-06-12 14:02:59 UTC

0) I finally managed to reproduce this bug.

1) I'm not sure what the "headers view" (that was mentioned in the bug report) is, but when I started evolution with the "CAMEL_DEBUG=imap" environment variable, the debugging output contained messages like:
    Literal: -->Return-Path: [...]
    From: Paul Bolle <pebolle@tiscali.nl>
    To: =?ISO-8859-1?Q?Jos=E9_Dapena_Paz_<pebolle@tiscali.nl>?=
    Content-Type: [...]
    
    <--

2) My comment #14 is no longer relevant. Based on previous comments, I'd have to say this is indeed a bug in the Gmail IMAP server.

3) If (something like) the patch suggested in comment #7 were added, shouldn't we also add:
- some warning message (e.g. "fixed broken rfc2047 encoding in string '$STRING'"); and/or
- a check for an environment variable (say "CAMEL_SKIP_RFC2047_FIX") to disable this (or a similar) workaround? That would allow us to notice and/or test that Gmail fixed their IMAP servers, so that the workaround could be dropped.
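
A minimal Python sketch of such a switch, using the variable name and warning text suggested above. The function name `maybe_fix` and the `fix` callback are hypothetical; nothing like this exists in Camel:

```python
import logging
import os

log = logging.getLogger("rfc2047-workaround")


def maybe_fix(header, fix, environ=os.environ):
    """Apply the workaround `fix` unless CAMEL_SKIP_RFC2047_FIX is set.

    A warning is logged whenever the workaround actually changes the
    header, so it is easy to notice once the server no longer needs it.
    """
    if environ.get("CAMEL_SKIP_RFC2047_FIX"):
        return header
    fixed = fix(header)
    if fixed != header:
        log.warning("fixed broken rfc2047 encoding in string '%s'", header)
    return fixed
```

Passing `environ` explicitly keeps the sketch easy to test; in real use the default `os.environ` would be read.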

Comment 16 Srinivasa Ragavan 2008-06-13 03:47:47 UTC

Fejj, can you please look at the above patch ?

Comment 17 Jeffrey Stedfast 2008-06-13 15:25:56 UTC

I think the header munging should be done in the IMAP code

decode_mailbox() should not be modifying its input string, you never know if a static read-only string was passed in.

Comment 18 André Klapper 2008-06-17 11:35:45 UTC

*** Bug 538428 has been marked as a duplicate of this bug. ***

Comment 19 Srinivasa Ragavan 2008-06-18 16:30:05 UTC

setting the patch status according to comment #17

Comment 20 Jose Dapena Paz 2008-07-01 15:40:27 UTC

(In reply to comment #17)
> I think the header munging should be done in the IMAP code
> 
> decode_mailbox() should not be modifying its input string, you never know if a
> static read-only string was passed in.
> 

But we can only apply this filtering once the headers are decoded. Where should I add such decoding in the imap code? (Just give me a hint, please, and I'll try to get the patch done asap :)

Comment 21 Philip Van Hoof 2008-07-01 15:43:23 UTC

Jose,

We have a bug for this open and it's taking a bit too long in my opinion, so I approve the patch for Tinymail's camel-lite. We can refactor it to the proper fix that also went into camel-upstream later.

Comment 22 Paul Bolle 2008-07-29 21:08:46 UTC

0) Further investigation: gmail spits out rfc2047 encoded headers in chunks of about 40 (or about 60, depending on what you count) characters, each chunk encoded separately. Example:

Cc: =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?=
 =?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?=
 =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?=
 =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=

(I removed an additional newline (^M) after each line. Added by debugging code?)

1) Python handles these just fine:
>>> from email.header import decode_header
>>> decode_header('=?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?=\
...  =?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?=\
...  =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?=\
...  =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=')
[('"Thomas Hellstr\xf6m" <thomas@xxxxxxxxxxxxxxxx.com>, "Jiri Slaby" <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org', 'iso-8859-1')]

2) So, maybe evolution's rfc2047 decoding is at fault after all.

3) A solution might be to:
- decode all rfc2047 encoded chunks first
- concatenate these chunks and regular chunks to one string
- parse that string into (names and) addresses.

Not sure yet whether that is doable without a major rewrite of camel_header_address_decode() and friends.
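
The three steps can be sketched in Python 3 as follows. This is only an illustration of the proposed order (decode first, parse afterwards), using Python's `email` module as a stand-in for Camel's parser; the helper names are made up:

```python
from email.header import decode_header
from email.utils import getaddresses


def decode_words(value):
    """Steps 1 and 2: decode every encoded chunk and join the results."""
    out = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        out.append(chunk)
    return "".join(out)


def decode_then_parse(header):
    """Step 3: parse the decoded string into (name, address) pairs."""
    return getaddresses([decode_words(header)])


raw = ('=?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= '
       '=?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?= '
       '=?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= '
       '=?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=')

for name, addr in decode_then_parse(raw):
    print(repr(name), addr)
```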

Comment 23 Jeffrey Stedfast 2008-07-29 23:49:43 UTC

that's probably how the python parser is doing it, but that's not the proper way of decoding things and you can end up misparsing valid address lists if you do things that way too (which is worse than misparsing badly formed address lists like your example).

evolution's parser is not at fault here, gmail's encoding is completely broken.

I suggest making the IMAP code special-case gmail by issuing an ENVELOPE request and using the server-parsed addresses rather than trying to parse them from the raw headers.

It might be worth doing that for all servers but some performance regression testing (against multiple IMAP server implementations) would be in order before going through with such a change.

Comment 24 Paul Bolle 2008-07-29 23:57:43 UTC

0) comment #23 arrived just before I could post this:

$ cat camel_header_decode_string.c
#include <stdio.h>
#include <camel/camel.h>

int
main (void) {
	char *in = " =?ISO-8859-1?Q?\"Thomas_Hellstr=F6m\"_<thomas@xxxxxxxxxxxxxx?= =?ISO-8859-1?Q?xx.com>,_\"Jiri_Slaby\"_<jirislaby@xxxxx.co?= =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=";
	char *out;

	out = camel_header_decode_string(in, NULL);

	printf("in : %s\n", in);
	printf("out: %s\n", out);
	g_free(out);

	return 0;
}	

$ gcc camel_header_decode_string.c -g -o camel_header_decode_string $(pkg-config --cflags --libs camel-1.2 gnome-vfs-2.0) -Wall

$ ./camel_header_decode_string 
in :  =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= =?ISO-8859-1?Q?xx.com>,_"Jiri_Slaby"_<jirislaby@xxxxx.co?= =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=
out:  "Thomas Hellström" <thomas@xxxxxxxxxxxxxxxx.com>, "Jiri Slaby" <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org

1) It does look to me like e-d-s can handle this just like python!

Comment 25 Jeffrey Stedfast 2008-07-30 13:31:00 UTC

yea, but what happens if the decoded string has commas other than between addresses? :-)

"oops"

That's why you can't do it the way python does it (and why no serious application that handles mail is written using the python implementation).

Comment 26 Paul Bolle 2008-07-30 16:49:55 UTC

0) comma between double quotes:

$ ./camel_header_decode_string 
in :  =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= =?ISO-8859-1?Q?xx.com>,_"Slaby,_Jiri"_<jirislaby@xxxxx.co?= =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=
out:  "Thomas Hellström" <thomas@xxxxxxxxxxxxxxxx.com>, "Slaby, Jiri" <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org

1) comma not quoted:

$ ./camel_header_decode_string 
in :  =?ISO-8859-1?Q?"Thomas_Hellstr=F6m"_<thomas@xxxxxxxxxxxxxx?= =?ISO-8859-1?Q?xx.com>,_Slaby,_Jiri_<jirislaby@xxxxx.co?= =?ISO-8859-1?Q?m>,_airlied@xxxxx.ie,_dri-devel@lists.sou?= =?ISO-8859-1?Q?rceforge.net,_linux-kernel@vger.kernel.org?=
out:  "Thomas Hellström" <thomas@xxxxxxxxxxxxxxxx.com>, Slaby, Jiri <jirislaby@xxxxx.com>, airlied@xxxxx.ie, dri-devel@lists.sourceforge.net, linux-kernel@vger.kernel.org

2) Not sure what the issue would be: both out strings seem to resemble the sort of headers that evolution has to deal with already: out in 0) is a correct header, out in 1) would be just another incorrect header.

3) That evolution would be more forgiving in handling rfc2047 encoded headers [*] and would also decode rfc2047 at a different stage in the parsing of the (address) headers doesn't seem to change the sort of problems it already has to deal with.

4) I do not yet see an issue here, but chances are you were trying to raise another issue.

* I haven't been able to determine whether gmail's encoding really is invalid or just a different interpretation of rfc2047 (and friends). Besides, even if it is invalid, that doesn't mean evolution shouldn't at least try to parse it.

Comment 27 Jeffrey Stedfast 2008-07-30 19:44:39 UTC

here's an example for you:

./a.out 
in : =?iso-8859-1?q?Hellstr=F6m=2C?= Thomas <thomas@xxx.com>, =?iso-8859-1?q?j=F6seph=40=F6lson=2Ecom?= <joe@realaddr.com>
out: Hellström, Thomas <thomas@xxx.com>, jöseph@ölson.com <joe@realaddr.com>

that's a big friggin "oops" if you try to parse it the python way.

This is why developers need to read the spec and not just pull stuff out of their proverbials ;-)

Comment 28 Paul Bolle 2008-07-30 20:58:31 UTC

0) Another example:
./camel_header_decode_string 
in : Hellstrom, Thomas <thomas@xxx.com>, joseph@olson.com <joe@realaddr.com>
out: Hellstrom, Thomas <thomas@xxx.com>, joseph@olson.com <joe@realaddr.com>

1) The example here in 0), the example in comment #27 and the example in comment #26 in 1) all have unquoted commas. (The example of comment #27 doesn't really differ that much from example 1) in comment #26.)

As far as I can tell all those (address) headers are thus invalid. Why should the fact that some chunks of two of those three headers are rfc2047 encoded matter?

Comment 29 Jeffrey Stedfast 2008-07-30 21:31:03 UTC

because addresses are parsed according to the tokenization rules expressed in the BNF grammar of rfc0822

In my example, the original string would be parsed thusly:

word token: =?iso-8859-1?q?Hellstr=F6m=2C?=
LWSP token: SPACE
word token: Thomas
LWSP token: SPACE
CHAR token: <
word token: thomas
CHAR token: @
word token: xxx
CHAR token: .
word token: com
CHAR token: >
CHAR token: ,

at this point, you can piece together what you got:

the name will be composed of the following tokens:
  =?iso-8859-1?q?Hellstr=F6m=2C?= (which, when decoded, becomes "Hellström,")
  SPACE
  Thomas

the address will be comprised of these tokens:
  thomas
  @
  xxx
  .
  com

thus, we get:
  name = "Hellström, Thomas";
  addr = "thomas@xxx.com";


before you waste any more cycles meaninglessly, I'll advise you to read http://www.ietf.org/rfc/rfc0822.txt

Once you have read that and understood the BNF grammar in it (specifically the bits in Section 6.1, as that is the most relevant portion), you should then continue on to reading http://www.ietf.org/rfc/rfc2047.txt

Pay close attention to Section 5.3
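
A minimal Python sketch of that order (tokenize the raw header first, decode each name afterwards), applied to the example from comment #27. Python's `email.utils.getaddresses` stands in for an RFC 822 tokenizer here; it is not Camel's parser, and the helper names are made up:

```python
from email.header import decode_header
from email.utils import getaddresses


def decode_words(value):
    """Decode any RFC 2047 encoded-words in value and return plain text."""
    out = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        out.append(chunk)
    return "".join(out)


def parse_then_decode(header):
    """Split the raw header into addresses, then decode only the names."""
    return [(decode_words(name), addr)
            for name, addr in getaddresses([header])]


raw = ("=?iso-8859-1?q?Hellstr=F6m=2C?= Thomas <thomas@xxx.com>, "
       "=?iso-8859-1?q?j=F6seph=40=F6lson=2Ecom?= <joe@realaddr.com>")

for name, addr in parse_then_decode(raw):
    print(repr(name), addr)

# The raw header has one separator comma; decoding first adds a second
# comma that a later parse can no longer tell apart from a separator.
print(raw.count(","), decode_words(raw).count(","))
```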

Comment 30 Paul Bolle 2008-08-01 11:37:20 UTC

Created attachment 115669
simple hack to work around gmail's rfc2047 encoding interpretation

0) Notwithstanding the advice to not "waste any more cycles meaninglessly", I wrote a simple hack to test my suggestion. (Patch against 2.22.3.)

1) This patch now renders all (previously) troublesome address headers correctly, at least as far as the legit messages in my folders are concerned. I even tested it with a number of spam messages with (for me) unreadable headers in some Asian language. Even those seem to come out correct: evolution then renders them identically to gmail's web interface in firefox.

2) I haven't yet run into regressions, but that doesn't mean this patch doesn't break anything else. Still, people who have run into this problem might check whether this works for them too, for instance as long as a patch along the lines discussed in comment #4, comment #5 and comment #23 has not been released.

3) If the patch does break something, one should be able to clean up the mess by just deleting the "summary" in your IMAP folders and having an unpatched version of evolution regenerate those files (when contacting gmail again).

Comment 31 Paul Bolle 2008-08-01 11:39:14 UTC

Created attachment 115670
simple hack to work around gmail's rfc2047 encoding interpretation

same patch, forward ported to trunk (entirely untested!)

Comment 32 Jeffrey Stedfast 2008-08-01 11:53:13 UTC

2) I've already given you a valid example of where your patch introduces regressions

Comment 33 Paul Bolle 2008-08-01 12:04:43 UTC

> 2) I've already given you a valid example of where your patch introduces
> regressions

Replying to this comment will add little to what I already stated under 1) and 2) in comment #30.

Comment 34 André Klapper 2008-08-14 19:55:32 UTC

*** Bug 531698 has been marked as a duplicate of this bug. ***

Comment 35 André Klapper 2008-08-21 20:38:14 UTC

*** Bug 532825 has been marked as a duplicate of this bug. ***

Comment 36 Paul Bolle 2008-08-28 09:55:25 UTC

I just noticed a new (at least to me; it is hard to say when this was added) item in Gmail's known IMAP issues:

"Non-Latin characters can corrupt message headers.

Message headers contain technical information necessary for the successful delivery of messages between email servers. Gmail's IMAP implementation re-encodes the information stored in message headers, but non-ASCII characters may become garbled. For example, this can affect the 'To:' line in an email message if a name is written in a language that uses non-Latin characters. Several issues can result from corrupt message headers, including delivery problems.

The Gmail Team is working to resolve this issue."

(See: http://mail.google.com/support/bin/answer.py?answer=78771&topic=12922).

This seems to cover this bug. So the easiest to implement solution (wait until the Gmail server is fixed) is getting more appealing. Could this bug just be resolved as NOTGNOME?

Comment 37 Jeffrey Stedfast 2008-08-28 12:22:46 UTC

NOTGNOME sounds good to me.

Comment 38 Paul Bolle 2008-09-16 09:44:12 UTC

*** Bug 552388 has been marked as a duplicate of this bug. ***

Comment 39 Jeffrey Stedfast 2008-09-19 16:46:28 UTC

*** Bug 552898 has been marked as a duplicate of this bug. ***

Comment 40 Jeffrey Stedfast 2008-09-28 11:36:54 UTC

*** Bug 554153 has been marked as a duplicate of this bug. ***