Bug 693020 – Encoding conversion problem with UTF-8 file starting in BOM



Summary:	Encoding conversion problem with UTF-8 file starting in BOM


Status:	RESOLVED FIXED

Product:	tepl
Classification:	Other
Component:	File loading and saving
Version:	2.0.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Gtef maintainer(s)
QA Contact:	Gtef maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-02-01 16:14 UTC by Tomáš Hnyk
Modified:	2017-04-25 09:59 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Tomáš Hnyk 2013-02-01 16:14:01 UTC

LibreOffice saves documents in UTF-8 by default (at least on my system), with no way to configure it. It also puts a BOM at the start of the file (which does not violate the Unicode specification: http://unicode.org/faq/utf_bom.html#BOM: "Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" of at the beginning of Unix shell scripts.")

Gedit opens such a file just fine, but interprets the BOM as "ZERO WIDTH NON-BREAKING SPACE (ZWNBSP)"*, which seems to be correct when such a character is in the middle of the file. However, it also treats it this way when it is at the beginning of the file, which seems to be a bug. The problem shows up especially when one needs to convert the file to a different encoding: gedit complains with "The document contains one or more characters that cannot be encoded using the specified character encoding."

* In practice, when the cursor is to the right of such a character, pressing the left arrow seemingly does nothing, but it actually moves the cursor to the left of the character. Backspace also seems to do nothing, but it deletes the character.
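
The behaviour described above can be reproduced outside gedit. The following is a minimal Python sketch, not gedit code; the sample text and the target encoding ISO-8859-2 are assumptions chosen only for illustration. It shows that a leading UTF-8 BOM decoded as ordinary text becomes U+FEFF, and that this character alone makes a conversion to a legacy encoding fail.

```python
import codecs

# Bytes as an application might write them: UTF-8 BOM followed by the text.
# The sample text is an assumption for illustration.
data = codecs.BOM_UTF8 + "příklad\n".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as U+FEFF at the start of the text.
text_with_bom = data.decode("utf-8")
first_char = text_with_bom[0]

# Converting to a legacy encoding then fails on that single character.
try:
    text_with_bom.encode("iso-8859-2")
    conversion_failed = False
except UnicodeEncodeError as err:
    conversion_failed = True
    offending = text_with_bom[err.start:err.end]

# Python's "utf-8-sig" codec treats a leading BOM as a signature and drops it,
# after which the same conversion succeeds.
text_without_bom = data.decode("utf-8-sig")
converted = text_without_bom.encode("iso-8859-2")
```

Everything except the BOM in the sample text is representable in ISO-8859-2, so the failure is caused by the BOM alone.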

Comment 1 Ignacio Casal Quinteiro (nacho) 2013-02-01 16:17:04 UTC

Yes, this is indeed a known problem. We should improve our detection to check the first char and skip it if it is a BOM char.
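
The check suggested here could look roughly like the sketch below. It is written in Python for brevity and is not the gedit or Gtef implementation; the function names `strip_leading_bom` and `load_utf8` are hypothetical. Only a BOM at the very start of the data is skipped, so a U+FEFF in the middle of the file is still kept as a character.

```python
import codecs


def strip_leading_bom(data: bytes) -> bytes:
    """Return data without a UTF-8 BOM at the start (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def load_utf8(data: bytes) -> str:
    """Decode UTF-8 content, skipping only an initial BOM (hypothetical helper)."""
    return strip_leading_bom(data).decode("utf-8")


# A leading BOM is dropped; one in the middle of the content is preserved.
leading = load_utf8(codecs.BOM_UTF8 + b"abc")
middle = load_utf8(b"a" + codecs.BOM_UTF8 + b"b")
```

A loader that wants to write the file back unchanged would also need to remember whether a BOM was present; that is left out of this sketch.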

Comment 2 Tomáš Hnyk 2013-02-01 16:21:07 UTC

Thanks for the information, I tried searching for existing reports and could not find any.

(BTW: would it be possible to show which characters "cannot be encoded using the specified character encoding." when showing that warning? I can imagine that it might be hard, but then again it might not. It would certainly help with clearing the file of such characters when they are not there by design. As of now, the error message is a bit unhelpful. Should I open a ticket for that?)
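
To illustrate what such a report could contain, here is a minimal Python sketch. It is not related to how gedit builds its warning; the function name `unencodable_chars` is hypothetical, and testing one character at a time is a simplification that may not suit stateful encodings.

```python
import unicodedata


def unencodable_chars(text: str, encoding: str):
    """List (offset, character, Unicode name) for each character that
    cannot be encoded in the given encoding (hypothetical helper)."""
    problems = []
    for offset, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            name = unicodedata.name(char, "UNKNOWN")
            problems.append((offset, char, name))
    return problems


# Example: a leading BOM and a euro sign, checked against ISO-8859-1.
report = unencodable_chars("\ufeffabc\u20ac", "iso-8859-1")
```

For this example the report lists the BOM at offset 0 (Unicode name "ZERO WIDTH NO-BREAK SPACE") and the euro sign at offset 4, which would make an invisible character such as the BOM easy to locate.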

Comment 3 Sébastien Wilmet 2017-04-25 09:59:00 UTC

It seems to be fixed with GtefFileLoader.

It would have been easier to reproduce the bug if there was an attachment with a sample file starting with BOM.