Bug 533741 – copying text when composing message in Evolution inserts BOM in the middle of text




Summary:	copying text when composing message in Evolution inserts BOM in the middle of...


Status:	RESOLVED FIXED

Product:	GtkHtml
Classification:	Other
Component:	Editing
Version:	3.23.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkhtml-maintainers
QA Contact:	Evolution QA team

URL:
Whiteboard:

Duplicates:	540810 541373 544661 (view as bug list)
Depends on:
Blocks:

Reported:	2008-05-18 19:09 UTC by Vaclav Slavik
Modified:	2013-06-20 16:20 UTC

See Also:
GNOME target:	---
GNOME version:	2.21/2.22

Attachments
suggested patch to fix the issue (2.13 KB, patch) 2008-05-18 19:10 UTC, Vaclav Slavik	needs-work
revised patch to fix the whole BOM mess (3.21 KB, patch) 2008-06-07 09:15 UTC, Vaclav Slavik	committed

Description Vaclav Slavik 2008-05-18 19:09:13 UTC

Please describe the problem:
It was pointed out to me that my emails sent from Evolution 2.22 sometimes contain weird characters -- specifically, "=EF=BB=BF" in quoted-printable encoding at the beginning of some lines (this is the UTF-8 encoding of the BOM, U+FEFF, also known as ZWNBSP).
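
As a quick check of the byte sequence described above, here is a minimal Python sketch. It is not taken from Evolution or GtkHtml; the helper name `to_quoted_printable` is made up for illustration, and it assumes the mailer encodes the body as UTF-8 before applying quoted-printable.

```python
import quopri

BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE / byte order mark


def to_quoted_printable(text: str) -> bytes:
    """Encode text as UTF-8, then as quoted-printable (hypothetical helper)."""
    return quopri.encodestring(text.encode("utf-8"))


# U+FEFF takes three bytes in UTF-8: EF BB BF.
bom_utf8 = BOM.encode("utf-8")

# A pasted line that carries a stray BOM in front of "foo".
encoded_line = to_quoted_printable(BOM + "foo")

print(bom_utf8.hex())
print(encoded_line)
```

The three non-ASCII bytes are escaped one by one, which is why the marker shows up as three `=XX` groups in the message source.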

Turns out it happens when I copy text from the composer to a different place (in the same composer window). And this is because get_selection_string() intentionally inserts it there. The string, both in plaintext and HTML versions, is encoded as UTF-8 char*, so putting the BOM there doesn't really make sense (as it would if the string were in UTF-16, for example).

Attached patch (against SVN r8844) fixes this by simply not using BOM with UTF-8.

Steps to reproduce:
1. in Evolution 2.22, switch to offline mode and compose a new message
2. enter "foo" in the body, select it, Ctrl+C
3. go to a new line, Ctrl+V
4. send the mail (we're in offline mode, so it's only stored in outbox)
5. go to outbox, Ctrl+U on the mail


Actual results:
The email's body reads

foo
=EF=BB=BFfoo

(unless evo decides to use base64, in which case you can decode it to see
there's a character in front of the second "foo" too)
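
For the base64 case, a small sketch like the following could be used to confirm the extra character. The function name `bom_offsets` and the sample body are assumptions for illustration, not anything Evolution provides.

```python
import base64

BOM_UTF8 = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def bom_offsets(body_b64: bytes) -> list:
    """Return the byte offset of every UTF-8 BOM in a base64-encoded body."""
    raw = base64.b64decode(body_b64)
    offsets = []
    pos = raw.find(BOM_UTF8)
    while pos != -1:
        offsets.append(pos)
        pos = raw.find(BOM_UTF8, pos + 1)
    return offsets


# Hypothetical body matching the report: "foo", then a pasted "foo" with a BOM.
sample_body = base64.b64encode("foo\n\ufefffoo\n".encode("utf-8"))
print(bom_offsets(sample_body))
```

An empty list means the body is clean; any offset greater than zero points at a BOM in the middle of the text.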

Expected results:
No unnecessary BOMs in the body:

foo
foo

Does this happen every time?
yes

Other information:

Comment 1 Vaclav Slavik 2008-05-18 19:10:02 UTC

Created attachment 111111
suggested patch to fix the issue

Comment 2 Srinivasa Ragavan 2008-05-19 04:12:19 UTC

Guys, can you review it?

Comment 3 Srinivasa Ragavan 2008-05-27 17:40:27 UTC

mcrha: Can you review this?

Comment 4 Milan Crha 2008-05-28 16:25:09 UTC

No no, I would prefer other approach, please rework patch in a way it will remove the order marker on paste, if present, because you want to have there this marker in case you are pasting text to other application. Same as other application can add this marker to its text on copy. Does it make sense?

Please do not forget to fill a ChangeLog entry too. Thanks in advance.

Comment 5 Vaclav Slavik 2008-05-28 17:03:06 UTC

(In reply to comment #4)
> No no, I would prefer other approach, please rework patch in a way it will
> remove the order marker on paste, if present, because you want to have there
> this marker in case you are pasting text to other application. Same as other
> application can add this marker to its text on copy. Does it make sense?

Not at all -- once again, the text put into clipboard is encoded as *UTF-8* and it is known to be UTF-8. Therefore, it doesn't make any sense to use BOM -- UTF-8 is endianness-neutral and BOM is only useful for identifying endianness (UTF-8 is easily identified without it).
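
To illustrate the endianness point, here is a short Python sketch (the function name `guess_utf16_byte_order` is hypothetical). In UTF-16 the two byte orders encode U+FEFF differently, so a leading BOM tells a reader which one is in use; UTF-8 has only one byte sequence for it, so the marker carries no such information.

```python
BOM = "\ufeff"

# UTF-16 has two possible byte orders, and the BOM distinguishes them.
utf16_le = BOM.encode("utf-16-le")
utf16_be = BOM.encode("utf-16-be")

# UTF-8 is a byte-oriented encoding with a single possible sequence.
utf8 = BOM.encode("utf-8")


def guess_utf16_byte_order(data: bytes) -> str:
    """Guess UTF-16 byte order from a leading BOM, if there is one."""
    if data.startswith(b"\xff\xfe"):
        return "little"
    if data.startswith(b"\xfe\xff"):
        return "big"
    return "unknown"


print(utf16_le.hex(), utf16_be.hex(), utf8.hex())
print(guess_utf16_byte_order("\ufeffhi".encode("utf-16-be")))
```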

Also, putting a BOM on the UTF-8 clipboard breaks any application not aware of BOM (and/or not playing it safe and filtering BOM out of UTF-8 stream). As a real-life example consider this:
1. somebody shows you a small shell script in email 
2. you want to try it, so you do cat>test.sh, copy & paste body of the script
   from the email, and run it
3. tough; the file doesn't start with "#!", but with <BOM>+"#!", which has no special meaning
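
The shell script scenario above can be sketched in Python as follows. The helper `starts_with_shebang` and the sample script are made up for illustration; the check simply looks at the first two bytes of the file.

```python
def starts_with_shebang(data: bytes) -> bool:
    """True if the file content begins with the two bytes '#!'."""
    return data.startswith(b"#!")


script = "#!/bin/sh\necho hello\n"

# What ends up in test.sh when the clipboard text is clean...
clean = script.encode("utf-8")
# ...and when the clipboard text carries a leading BOM.
with_bom = ("\ufeff" + script).encode("utf-8")

print(starts_with_shebang(clean))
print(starts_with_shebang(with_bom))
```

With the BOM in front, the file begins with the bytes EF BB BF rather than "#!", so the check fails even though the two files look identical in most editors.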

Yes, in the ideal world, Evolution/Gtkhtml would filter out BOM on insert too. But frankly, I don't care about input from other broken apps, because I have yet to see any other app misbehaving like this. And the BOM shouldn't be included in UTF-8 clipboard output regardless of whether it is or is not filtered out on input.

Comment 6 Milan Crha 2008-05-29 10:32:18 UTC

(In reply to comment #5)
> Not at all -- once again, the text put into clipboard is encoded as *UTF-8* and
> it is known to be UTF-8. Therefore, it doesn't make any sense to use BOM --
> UTF-8 is endianness-neutral and BOM is only useful for identifying endianness
> (UTF-8 is easily identified without it).

OK, I see now.
I looked into http://tools.ietf.org/html/rfc3629#section-6
and even if it's useless, it can be there, and the second paragraph says:
   It is important to understand that the character U+FEFF appearing at
   any position other than the beginning of a stream MUST be interpreted
   with the semantics for the zero-width non-breaking space, and MUST
   NOT be interpreted as a signature.  When interpreted as a signature,
   the Unicode standard suggests than an initial U+FEFF character may be
   stripped before processing the text.  Such stripping is necessary in
   some cases (e.g., when concatenating two strings, because otherwise
   the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
   SPACE" at the connection point), but might affect an external process
   at a different layer (such as a digital signature or a count of the
   characters) that is relying on the presence of all characters in the
   stream.  It is therefore RECOMMENDED to avoid stripping an initial
   U+FEFF interpreted as a signature without a good reason, to ignore it
   instead of stripping it when appropriate (such as for display) and to
   strip it only when really necessary.

So, I agree with your patch now, but I still want from you to extend it and strip the BOM on paste if it exists there. I would rather strip it, because we concatenate strings. It will be fine to just strip the first bytes if they have there this BOM (not on any position).
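
The stripping behaviour requested here can be sketched in Python as below. This is only an illustration of the idea, not the code from the patch; `strip_leading_bom` and `paste` are hypothetical names, and the document is modelled as a plain string.

```python
BOM = "\ufeff"


def strip_leading_bom(text: str) -> str:
    """Drop one U+FEFF at the very start; leave any other occurrence alone."""
    return text[1:] if text.startswith(BOM) else text


def paste(document: str, clipboard_text: str) -> str:
    """Append clipboard text to the document, as a stand-in for a real paste."""
    return document + strip_leading_bom(clipboard_text)


result = paste("foo\n", BOM + "foo")
print(repr(result))
```

Only the first character is examined, which matches the RFC text quoted above: a U+FEFF at any other position is a zero-width no-break space and is left untouched.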

Comment 7 Vaclav Slavik 2008-06-07 09:15:52 UTC

Created attachment 112305
revised patch to fix the whole BOM mess

(In reply to comment #6)
> but I still want from you to extend it

You know, if I had the balls to demand fixes for loosely related bugs as precondition for patch's acceptance on my projects, I'd at least ask for it nicely.

Updated patch that fixes both of GtkHtml's BOM handling bugs is attached.

Sorry for forgetting to include changelog record:


	** Fix for bug #533741

	* gtkhtml.c: (get_selection_string), (clipboard_paste_received_cb):
	Don't insert BOM into UTF-8 text when copying to clipboard; filter
	it out when pasting from clipboard. Patch from Vaclav Slavik.

Comment 8 Milan Crha 2008-06-09 14:24:12 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > but I still want from you to extend it
> 
> You know, if I had the balls to demand fixes for loosely related bugs as
> precondition for patch's acceptance on my projects, I'd at least ask for it
> nicely.

Oh, I'm sorry, I didn't want to sound rude or impolite, it really wasn't intended to be like that.

I reviewed it and it seems fine, I'll commit to trunk and stable.
Thanks for the patch.

Comment 9 Milan Crha 2008-06-09 14:53:16 UTC

Committed to trunk. Committed revision 8861.
Committed to gnome-2-22. Committed revision 8862.

Comment 10 Matthew Barnes 2008-06-29 21:18:47 UTC

*** Bug 540810 has been marked as a duplicate of this bug. ***

Comment 11 Milan Crha 2013-06-20 16:20:31 UTC

*** Bug 544661 has been marked as a duplicate of this bug. ***

Comment 12 Milan Crha 2013-06-20 16:20:55 UTC

*** Bug 541373 has been marked as a duplicate of this bug. ***