GNOME Bugzilla – Bug 693020
Encoding conversion problem with UTF-8 file starting in BOM
Last modified: 2017-04-25 09:59:00 UTC
LibreOffice saves documents in UTF-8 by default (at least on my system), with no way to configure this. It also writes a BOM at the start of the file, which does not violate the Unicode specification. From http://unicode.org/faq/utf_bom.html#BOM:

"Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order? A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" of at the beginning of Unix shell scripts."

Gedit opens such a file just fine, but interprets the BOM as "ZERO WIDTH NON-BREAKING SPACE (ZWNBSP)"*. That seems correct when the character appears in the middle of the file. However, gedit treats it the same way when it is at the very beginning of the file, which seems to be a bug. The problem shows up especially when one needs to convert the file to a different encoding: gedit complains with "The document contains one or more characters that cannot be encoded using the specified character encoding."

* In practice, when the cursor is to the right of this character, pressing the left arrow appears to do nothing, but it actually moves the cursor to the left of the character. Backspace likewise appears to do nothing, but it deletes the character.
Yes, this is indeed a known problem. We should improve our detection to check the first character and skip it if it is a BOM.
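For illustration only, the detection idea from this comment (check the first character and skip it when it is a byte order mark) could be sketched roughly as follows. This is Python rather than gedit's actual C code, and the helper name decode_utf8_skipping_bom is hypothetical. It only shows why dropping the leading signature makes a later conversion to ISO-8859-1 succeed.

```python
import codecs


def decode_utf8_skipping_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM (EF BB BF) if present.

    Hypothetical helper for illustration; not gedit's loader code.
    A U+FEFF that appears later in the data is left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


# A small file as an editor might receive it: BOM followed by text.
raw = codecs.BOM_UTF8 + "caf\u00e9\n".encode("utf-8")

# Naive decoding keeps the BOM as a U+FEFF character at index 0,
# which ISO-8859-1 cannot represent.
naive = raw.decode("utf-8")
try:
    naive.encode("iso-8859-1")
    naive_ok = True
except UnicodeEncodeError:
    naive_ok = False

# Skipping the signature first lets the conversion go through.
converted = decode_utf8_skipping_bom(raw).encode("iso-8859-1")
```

Python's built-in "utf-8-sig" codec applies the same rule when decoding, so raw.decode("utf-8-sig") would give the same text as the helper above.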
Thanks for the information. I tried searching for existing reports and could not find any. (BTW: would it be possible to show which characters "cannot be encoded using the specified character encoding" when showing that warning? I can imagine that it might be hard, but then again it might not. It would certainly help with clearing the file of such characters when they are not there by design. As of now, the error message is a bit unhelpful. Should I open a ticket for that?)
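As a rough idea of what such a report could contain, here is a minimal Python sketch that lists every character a target encoding cannot represent, with its position, code point, and Unicode name. The function name find_unencodable is made up for this example and says nothing about how gedit would implement it. Encoding one character at a time is a simplification that is adequate for single-byte encodings such as ISO-8859-1.

```python
import unicodedata


def find_unencodable(text: str, encoding: str):
    """Return (index, code point, name) for each character that
    cannot be encoded in `encoding`.

    Hypothetical helper for illustration only.
    """
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            name = unicodedata.name(char, "<unnamed>")
            problems.append((index, f"U+{ord(char):04X}", name))
    return problems


# Text that starts with a stray BOM and also contains a euro sign.
report = find_unencodable("\ufeffcaf\u00e9 \u20ac", "iso-8859-1")
for index, codepoint, name in report:
    print(f"offset {index}: {codepoint} {name}")
```

With output like this, the leading U+FEFF would show up at offset 0, which makes it much easier to see that the BOM is what blocks the conversion.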
It seems to be fixed with GtefFileLoader. It would have been easier to reproduce the bug if a sample file starting with a BOM had been attached.