GNOME Bugzilla – Bug 417000
Generated email subject shows without space in Microsoft Outlook
Last modified: 2008-04-04 15:12:04 UTC
Please describe the problem:
I wrote an email with the following subject: "Proyecto USAID Paraguay - Análisis de situación". When the email is read in Microsoft Outlook, the subject shows up as "Proyecto USAID Paraguay - Análisis desituación" (there is a missing space before "situación"). Looking at the source code of the email I found that just after the word "de", there is a line break which might cause this problem. Here is the exact text:

Subject: Proyecto USAID Paraguay - =?ISO-8859-1?Q?An=E1lisis?= de =?ISO-8859-1?Q?situaci=F3n?=

The email subject shows up perfectly if you use Evolution or Gmail, but not under Microsoft Outlook. Thunderbird generates an email with the following subject:

Subject: Proyecto USAID Paraguay - =?ISO-8859-1?Q?An=E1lisis_de_situa?= =?ISO-8859-1?Q?ci=F3n?=

Steps to reproduce:
1. Write an email with the specified subject
2. Check the result in Microsoft Outlook

Actual results:
There is a missing space in the subject line in Microsoft Outlook

Expected results:
The subject shows up as in Evolution

Does this happen every time?
Yes

Other information:
I guess it's an Outlook bug, but it forced me to switch to Thunderbird given that most of my customers use Outlook and it doesn't look professional to "write" emails with orthography mistakes (even if they are not mine).
hmm... yes, please file a bug against outlook. :-/
From what I'm seeing in my emails, Mozilla Thunderbird encodes (is "encode" the right word for this?) everything after the first word that has an accented letter. On the other hand, Microsoft Outlook, as well as Yahoo Mail, encodes the whole subject line if it contains an accented letter. It seems that Evolution only "encodes" the words that have an accented letter.
I am unsure on this one as being Outlook-related. I looked at RFC2045, and it says the following (section 6.7 "Quoted-Printable Content-Transfer-Encoding"):

(...)

(3) (White Space) Octets with values of 9 and 32 MAY be represented as US-ASCII TAB (HT) and SPACE characters, respectively, but MUST NOT be so represented at the end of an encoded line. Any TAB (HT) or SPACE characters on an encoded line MUST thus be followed on that line by a printable character. In particular, an "=" at the end of an encoded line, indicating a soft line break (see rule #5) may follow one or more TAB (HT) or SPACE characters. It follows that an octet with decimal value 9 or 32 appearing at the end of an encoded line must be represented according to Rule #1. This rule is necessary because some MTAs (Message Transport Agents, programs which transport messages from one user to another, or perform a portion of such transfers) are known to pad lines of text with SPACEs, and others are known to remove "white space" characters from the end of a line. Therefore, when decoding a Quoted-Printable body, any trailing white space on a line must be deleted, as it will necessarily have been added by intermediate transport agents.

(...)

(5) (Soft Line Breaks) The Quoted-Printable encoding REQUIRES that encoded lines be no more than 76 characters long. If longer lines are to be encoded with the Quoted-Printable encoding, "soft" line breaks must be used. An equal sign as the last character on a encoded line indicates such a non-significant ("soft") line break in the encoded text.

Additionally, RFC 2047 states (section 2 "Syntax of encoded-words"):

(...)

IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's by an RFC 822 parser. As a consequence, unencoded white space characters (such as SPACE and HTAB) are FORBIDDEN within an 'encoded-word'. For example, the character sequence

=?iso-8859-1?q?this is some text?=

would be parsed as four 'atom's, rather than as a single 'atom' (by an RFC 822 parser) or 'encoded-word' (by a parser which understands 'encoded-words'). The correct way to encode the string "this is some text" is to encode the SPACE characters as well, e.g.

=?iso-8859-1?q?this=20is=20some=20text?=

The characters which may appear in 'encoded-text' are further restricted by the rules in section 5.

and, later in RFC2047 (section 4.2 "The "Q" encoding"):

(...)

(2) The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be represented as "_" (underscore, ASCII 95.). (This character may not pass through some internetwork mail gateways, but its use will greatly enhance readability of "Q" encoded data with mail readers that do not support this encoding.) Note that the "_" always represents hexadecimal 20, even if the SPACE character occupies a different code position in the character set in use.

(3) 8-bit values which correspond to printable ASCII characters other than "=", "?", and "_" (underscore), MAY be represented as those characters. (But see section 5 for restrictions.) In particular, SPACE and TAB MUST NOT be represented as themselves within encoded words.

So it seems Evo is not fully respecting the RFCs, at least regarding the Subject header. Comments, as always, are welcome. I have confirmed this behaviour on Evo 2.11.5 and e-d-s 1.11.5.
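To make the quoted "Q" encoding rules concrete, here is a minimal Python sketch of an encoder for a single encoded-word. The function name `q_encode_word` is hypothetical, the set of characters left unencoded is a conservative choice based on the restrictions in RFC 2047 section 5, and no line-length limit is enforced. It is not how Evolution or any other client actually implements this.

```python
# Characters that are passed through unencoded. This is a conservative
# choice (letters, digits and "!*+-/"), not taken from any mail client.
SAFE = set(b"abcdefghijklmnopqrstuvwxyz"
           b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           b"0123456789!*+-/")


def q_encode_word(text, charset="iso-8859-1"):
    """Encode text as one RFC 2047 'Q' encoded-word (no length check)."""
    out = []
    for byte in text.encode(charset):
        if byte == 0x20:
            out.append("_")             # rule (2): SPACE may be written as "_"
        elif byte in SAFE:
            out.append(chr(byte))       # rule (3): safe printable ASCII
        else:
            out.append("=%02X" % byte)  # everything else as "=" plus hex
    return "=?%s?Q?%s?=" % (charset, "".join(out))


print(q_encode_word("this is some text"))
print(q_encode_word("Análisis de situación", "ISO-8859-1"))
```

Neither result contains a literal SPACE or TAB, so each one is a single atom to an RFC 822 parser.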
(Note: rfc2045 section 6.7 is completely irrelevant to this case, that applies only to MIME part bodies, not headers)

you mean that Outlook isn't following the RFCs, correct? because if you read what you just posted, you'd see evolution is following the rules precisely.

Subject: Proyecto USAID Paraguay - =?ISO-8859-1?Q?An=E1lisis?= de =?ISO-8859-1?Q?situaci=F3n?=

atom: Proyecto
atom: USAID
atom: -
atom: =?ISO-8859-1?Q?An=E1lisis?=
atom: de
atom: =?ISO-8859-1?Q?situaci=F3n?=

each encoded word looks like a legal atom token to me...

anyways, your assumption that evo encodes on a word-by-word basis is also wrong (unless someone changed it since I left the team), it gathers words into like-encodings but not to exceed the remainder of the line length (up to 78 chars or some such). Thus, since combining the encoding of 'situación' with 'Análisis' + ' de ' would have exceeded 78 chars, they were split - since they were split, ' de ' was not added to the encoding of 'Análisis' because it was us-ascii and didn't need to be.

If I was to hazard a guess, I would say that Outlook assumes it is supposed to ignore linear white space between all atom tokens when 1-or-more are encoded... this is not true. RFC 2047 states in section 6.2:

When displaying a particular header field that contains multiple 'encoded-word's, any 'linear-white-space' that separates a pair of adjacent 'encoded-word's is ignored. (This is to allow the use of multiple 'encoded-word's to represent long strings of unencoded text, without having to separate 'encoded-word's where spaces occur in the unencoded text.)

notice that it says "a pair of adjacent encoded-words"... since:

atom: de
atom: =?ISO-8859-1?Q?situaci=F3n?=

'de' is not an encoded-word token, any lwsp between 'de' and '=?ISO-8859-1?Q?situaci=F3n?=' MUST be preserved in the display.

hence... Outlook bug.
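The section 6.2 rule discussed in this comment can be sketched as a small Python decoder. This is an illustration only: the helper names are hypothetical, only the "Q" encoding is handled, and the fold positions in the sample headers are assumptions (a break plus TAB after "de" for Evolution, as described elsewhere in this report, and a break plus SPACE between the two encoded-words for Thunderbird).

```python
import re

# One RFC 2047 encoded-word using the "Q" encoding: =?charset?Q?text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[Qq]\?([^?\s]*)\?=")


def decode_q(charset, text):
    """Decode a 'Q' payload: '_' is a space, '=XX' is one octet in hex."""
    octets = re.sub(
        r"=([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text.replace("_", " "),
    )
    return octets.encode("latin-1").decode(charset)


def decode_header_value(value):
    """Unfold a header value and decode its encoded-words."""
    # Unfolding removes the line break but keeps the whitespace after it.
    value = re.sub(r"\r?\n(?=[ \t])", "", value)
    out = []
    prev_encoded = False
    pending_ws = ""
    for token in re.findall(r"[ \t]+|[^ \t]+", value):
        if token[0] in " \t":
            pending_ws = token
            continue
        match = ENCODED_WORD.fullmatch(token)
        if match:
            # Section 6.2: whitespace is ignored only when it separates
            # two adjacent encoded-words.
            if not prev_encoded:
                out.append(pending_ws)
            out.append(decode_q(match.group(1), match.group(2)))
        else:
            # Plain atom: the whitespace before it is always kept.
            out.append(pending_ws)
            out.append(token)
        prev_encoded = match is not None
        pending_ws = ""
    return "".join(out)


# Assumed folding: CRLF + TAB after "de" (Evolution), CRLF + SPACE (Thunderbird).
evolution = ("Proyecto USAID Paraguay - =?ISO-8859-1?Q?An=E1lisis?= de\r\n"
             "\t=?ISO-8859-1?Q?situaci=F3n?=")
thunderbird = ("Proyecto USAID Paraguay - =?ISO-8859-1?Q?An=E1lisis_de_situa?=\r\n"
               " =?ISO-8859-1?Q?ci=F3n?=")

print(repr(decode_header_value(evolution)))
print(repr(decode_header_value(thunderbird)))
```

In the Evolution case the whitespace between "de" and "situación" survives (as a TAB, given the assumed fold), because "de" is not an encoded-word. In the Thunderbird case the whitespace between the two encoded-words is dropped, which rejoins "situa" and "ción".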
Thank you. I was indeed led on a wild goose chase. Went back to look at it and... Ah, RFC2047 is the one I should have looked at. Sorry. My question came up exactly because I could not make sense out of 2045. But, anyway: the original reporter (at https://bugs.edge.launchpad.net/evolution/+bug/115844) stated: "I guess it's an Outlook bug, but it forced me to switch to Thunderbird given that most of my customers use Outlook and it doesn't look professional to "write" emails with orthography mistakes (even if they are not mine). I hope that Ubuntu, since it's very focussed on making things work, can push for a fix to this interoperability problem, even if Evolution actually follows the spec." It is now clear to me that indeed Evo is following the specs. But, in the interest of usage, would it be possible to accept this as -- at least -- a wishlist? This would mean, I guess, encoding the lwsp that begins a continuation line. Of course, I am REALLY not sure of potential ramifications. Meanwhile, I am leaving this as new.
that's up to the current evo mail maintainers... I personally don't mind, so long as whatever patch goes in to work around this Outlook bug doesn't break evo's rfc2047 compliance wrt output :)
(In reply to comment #4)
> atom: de
> atom: =?ISO-8859-1?Q?situaci=F3n?=
>
> 'de' is not an encoded-word token, any lwsp between 'de' and
> '=?ISO-8859-1?Q?situaci=F3n?=' MUST be preserved in the display.
>
> hence... Outlook bug.

I just made some tests and when the following subject is set "Proyecto USAID Paraguay - Análisis de situación", the "Message Source" is the following:

Subject: Proyecto USAID Paraguay - =?ISO-8859-1?Q?An=E1lisis?= de =?ISO-8859-1?Q?situaci=F3n?=

(Note that there is no space after "de")

If I use the following subject: "Proyecto USAID Paraguay - Análisis de   situación" (three spaces after "de"), the following is generated:

Subject: Proyecto USAID Paraguay - =?ISO-8859-1?Q?An=E1lisis?= de =?ISO-8859-1?Q?situaci=F3n?=

(Note that there are two spaces after "de")

I assume that the spaces before "=?ISO-8859-1?Q?situaci=F3n?=" are actually a representation of a TAB because they are 8 characters. Where does the RFC specify that a TAB should be represented as a space in a mail client while unfolding? I have searched the internet and found that many MUAs replace CR/LF and TABs while unfolding. There is a message on the Mailman users mailing list that makes a comment along those lines: http://mail.python.org/pipermail/mailman-users/2007-June/057499.html

If that is the case, wouldn't it be better to replace the TAB with a SPACE so that no MUA misrepresents the space of the original subject? (I'm not an expert in this field, but I am trying to investigate this problem as much as possible because it is really annoying in my day-to-day work)
It's not specified, but many (most?) mail clients do it to make the raw message header formatting look nicer. I would accept a patch which makes Evolution use the WSP char from the pre-folded text instead of always using a TAB.
I am hitting the following bug, which seems to be the same as what is described above minus any word encodings: If the message goes over the 78 character limit and is wrapped, a tab is inserted after the CR/LF instead of a WSP. Sounds similar to what is described above... except the result when viewing the email is that a tab character is inserted where the space ought to be; both on Evolution as well as Outlook (2007). One caveat: I've encountered this bug on Evo 1.4.5. Yeah, I know, ancient, but I'm currently stuck on RHEL3. Anyway, it looks like the bug still exists in some form.
Created attachment 106827: Patch against version *1.4.5* to use spaces to fold long headers instead of tabs. This is against 1.4.5. Make of it what you will. Note: the *real* solution should probably look at what the actual whitespace character is and use that (because as written, if someone uses tabs in their subject line instead of spaces, a tab might get replaced with a space; although I don't know how to insert tabs into the subject line in evolution anyway).
fejj, could I ask you to review this, please? Thanks.
as madcap has noted, the patch isn't correct because it forces the use of a space when it should really use the lwsp char it folded on. anyways... fixed this myself in svn
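For illustration, folding on the original lwsp char could look roughly like the Python sketch below. This is not the change that went into svn: the function names are hypothetical, the 78-character limit is taken from the discussion above, and the code only handles a single unencoded header line.

```python
def fold_header(line, limit=78):
    """Fold a header line at whitespace, reusing the original WSP char."""
    folded = []
    while len(line) > limit:
        # Last SPACE or TAB at or before the limit (position 0 is skipped
        # because a continuation line already starts with whitespace).
        cut = max(line.rfind(" ", 1, limit + 1),
                  line.rfind("\t", 1, limit + 1))
        if cut <= 0:
            break  # nowhere to fold; leave the rest as one long line
        folded.append(line[:cut])
        line = line[cut:]  # continuation starts with the original WSP
    folded.append(line)
    return "\r\n".join(folded)


def unfold_header(text):
    """Unfold by removing each CRLF that is followed by SPACE or TAB."""
    return text.replace("\r\n ", " ").replace("\r\n\t", "\t")


subject = "Subject: " + " ".join(["palabra"] * 20)
folded = fold_header(subject)
print(folded)

# Unfolding gives back exactly what the user typed, spaces included.
print(unfold_header(folded) == subject)
```

Because the CRLF is inserted in front of a whitespace character that was already there, a reader that only strips the CRLF sees the original text. A folder that always emits a TAB instead hands the reader a TAB where the user typed a SPACE.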
*** Bug 523259 has been marked as a duplicate of this bug. ***