GNOME Bugzilla – Bug 533741
copying text when composing message in Evolution inserts BOM in the middle of text
Last modified: 2013-06-20 16:20:55 UTC
Please describe the problem:
It was pointed out to me that my emails sent from Evolution 2.22 sometimes contain weird characters -- specifically, "=EF=BB=BF" in quoted-printable encoding at the beginning of some lines (this is the UTF-8 encoding of the BOM, aka ZWNBSP, U+FEFF). It turns out this happens when I copy text from the composer to a different place in the same composer window, because get_selection_string() intentionally inserts it there. The string, in both its plaintext and HTML versions, is encoded as a UTF-8 char*, so putting the BOM there doesn't really make sense (as it would if the string were in UTF-16, for example). The attached patch (against SVN r8844) fixes this by simply not using a BOM with UTF-8.

Steps to reproduce:
1. In Evolution 2.22, switch to offline mode and compose a new message.
2. Enter "foo" in the body, select it, and press Ctrl+C.
3. Go to a new line and press Ctrl+V.
4. Send the mail (we're in offline mode, so it's only stored in the outbox).
5. Go to the outbox and press Ctrl+U on the mail.

Actual results:
The email's body reads:
foo
=EF=BB=BFfoo
(unless evo decides to use base64, in which case you can decode it to see there's a character in front of the second "foo" too)

Expected results:
No unnecessary BOMs in the body:
foo
foo

Does this happen every time?
Yes.

Other information:
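To make the reported symptom concrete, here is a minimal C sketch of why a U+FEFF in front of "foo" shows up as "=EF=BB=BFfoo" in the message source. The helper name qp_encode is hypothetical and is not Evolution or GtkHTML code; it is a simplified quoted-printable-style encoder (no soft line breaks, no special handling of trailing whitespace) written only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not taken from Evolution/GtkHTML.
 * Simplified quoted-printable-style encoding: printable ASCII other than
 * '=' is copied through; every other byte becomes "=XX" (uppercase hex).
 * Returns the number of characters written, excluding the terminating NUL. */
size_t qp_encode(const unsigned char *in, size_t len, char *out, size_t out_size)
{
    size_t o = 0;

    if (out_size == 0)
        return 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c >= 0x20 && c <= 0x7E && c != '=') {
            if (o + 1 >= out_size)      /* need room for the char and the NUL */
                break;
            out[o++] = (char)c;
        } else {
            if (o + 3 >= out_size)      /* need room for "=XX" and the NUL */
                break;
            snprintf(out + o, 4, "=%02X", (unsigned)c);
            o += 3;
        }
    }
    out[o] = '\0';
    return o;
}
```

Feeding it the six bytes EF BB BF 66 6F 6F (the UTF-8 BOM followed by "foo") produces the string seen in the outbox copy of the mail.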
Created attachment 111111
suggested patch to fix the issue
Guys, can you review it?
mcrha: Can you review this?
No no, I would prefer other approach, please rework patch in a way it will remove the order marker on paste, if present, because you want to have there this marker in case you are pasting text to other application. Same as other application can add this marker to its text on copy. Does it make sense? Please do not forget to fill a ChangeLog entry too. Thanks in advance.
(In reply to comment #4)
> No no, I would prefer other approach, please rework patch in a way it will
> remove the order marker on paste, if present, because you want to have there
> this marker in case you are pasting text to other application. Same as other
> application can add this marker to its text on copy. Does it make sense?

Not at all -- once again, the text put into the clipboard is encoded as *UTF-8* and it is known to be UTF-8. Therefore, it doesn't make any sense to use a BOM -- UTF-8 is endianness-neutral and a BOM is only useful for identifying endianness (UTF-8 is easily identified without it).

Also, putting a BOM on the UTF-8 clipboard breaks any application that is not aware of the BOM (and/or not playing it safe and filtering the BOM out of a UTF-8 stream). As a real-life example, consider this:

1. somebody shows you a small shell script in an email
2. you want to try it, so you do cat>test.sh, copy & paste the body of the script from the email, and run it
3. tough; the file doesn't start with "#!", but with <BOM>+"#!", which has no special meaning

Yes, in an ideal world, Evolution/Gtkhtml would filter out the BOM on insert too. But frankly, I don't care about input from other broken apps, because I have yet to see any other app misbehaving like this. And the BOM shouldn't be included in UTF-8 clipboard output regardless of whether it is or is not filtered out on input.
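The shell-script example can be illustrated with a small C sketch. The function name starts_with_shebang is hypothetical; it only mimics the idea that a script is recognized by its first two bytes being "#!", so three invisible BOM bytes in front of them defeat the check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration only.
 * True only if the very first two bytes are "#!"; any preceding byte,
 * including an invisible UTF-8 BOM (EF BB BF), makes the check fail. */
bool starts_with_shebang(const char *buf, size_t len)
{
    return len >= 2 && memcmp(buf, "#!", 2) == 0;
}
```

Under this sketch, "#!/bin/sh" passes, while the same text pasted with a leading BOM does not, even though both look identical in a terminal or editor that hides U+FEFF.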
(In reply to comment #5)
> Not at all -- once again, the text put into clipboard is encoded as *UTF-8* and
> it is known to be UTF-8. Therefore, it doesn't make any sense to use BOM --
> UTF-8 is endianness-neutral and BOM is only useful for identifying endianness
> (UTF-8 is easily identified without it).

OK, I see now. I looked into http://tools.ietf.org/html/rfc3629#section-6 and, even if it's useless, it can be there; the second paragraph says:

   It is important to understand that the character U+FEFF appearing at any position other than the beginning of a stream MUST be interpreted with the semantics for the zero-width non-breaking space, and MUST NOT be interpreted as a signature. When interpreted as a signature, the Unicode standard suggests than an initial U+FEFF character may be stripped before processing the text. Such stripping is necessary in some cases (e.g., when concatenating two strings, because otherwise the resulting string may contain an unintended "ZERO WIDTH NO-BREAK SPACE" at the connection point), but might affect an external process at a different layer (such as a digital signature or a count of the characters) that is relying on the presence of all characters in the stream. It is therefore RECOMMENDED to avoid stripping an initial U+FEFF interpreted as a signature without a good reason, to ignore it instead of stripping it when appropriate (such as for display) and to strip it only when really necessary.

So, I agree with your patch now, but I still want from you to extend it and strip the BOM on paste if it exists there. I would rather strip it, because we concatenate strings. It will be fine to just strip the first bytes if they contain this BOM (not at any other position).
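The requested paste-side behaviour (strip the signature only when it is the first three bytes, and leave a U+FEFF at any other position alone) could look roughly like the following C sketch. The function name skip_leading_utf8_bom is an assumption made for illustration; this is not the code from the attached patch or from clipboard_paste_received_cb.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the committed GtkHTML change.
 * If the buffer begins with the UTF-8 BOM (EF BB BF), return a pointer just
 * past it and the reduced length; otherwise return the input unchanged.
 * A U+FEFF anywhere else in the buffer is deliberately left in place. */
const char *skip_leading_utf8_bom(const char *text, size_t len, size_t *out_len)
{
    static const char bom[] = "\xEF\xBB\xBF";

    if (len >= 3 && memcmp(text, bom, 3) == 0) {
        text += 3;
        len -= 3;
    }
    *out_len = len;
    return text;
}
```

Returning a pointer into the original buffer avoids a copy; a caller that needs an owned string would duplicate the returned range before the clipboard data is freed.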
Created attachment 112305
revised patch to fix the whole BOM mess

(In reply to comment #6)
> but I still want from you to extend it

You know, if I had the balls to demand fixes for loosely related bugs as precondition for patch's acceptance on my projects, I'd at least ask for it nicely.

Updated patch that fixes both of GtkHtml's BOM handling bugs is attached. Sorry for forgetting to include changelog record:

** Fix for bug #533741

* gtkhtml.c: (get_selection_string), (clipboard_paste_received_cb):
Don't insert BOM into UTF-8 text when copying to clipboard; filter it
out when pasting from clipboard. Patch from Vaclav Slavik.
(In reply to comment #7)
> (In reply to comment #6)
> > but I still want from you to extend it
>
> You know, if I had the balls to demand fixes for loosely related bugs as
> precondition for patch's acceptance on my projects, I'd at least ask for it
> nicely.

Oh, I'm sorry, I didn't want to sound rude or impolite, it really wasn't intended to be like that.

I reviewed it and it seems fine, I'll commit to trunk and stable. Thanks for the patch.
Committed to trunk. Committed revision 8861.
Committed to gnome-2-22. Committed revision 8862.
*** Bug 540810 has been marked as a duplicate of this bug. ***
*** Bug 544661 has been marked as a duplicate of this bug. ***
*** Bug 541373 has been marked as a duplicate of this bug. ***