Bug 788283 – UnicodeDecodeError: 'utf8' codec can't decode byte




Summary:	UnicodeDecodeError: 'utf8' codec can't decode byte


Status:	RESOLVED DUPLICATE of bug 787685

Product:	libgda
Classification:	Other
Component:	general
Version:	5.0.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	malerba
QA Contact:	gnome-db Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-09-28 10:59 UTC by Ryan Schmidt
Modified:	2017-12-23 12:00 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
script to convert ISO to UTF-8 encoding (328 bytes, application/x-sh) 2017-10-01 17:56 UTC, Ryan Schmidt
script to convert ISO to UTF-8 encoding (374 bytes, application/x-sh) 2017-10-01 18:05 UTC, Ryan Schmidt

Description Ryan Schmidt 2017-09-28 10:59:43 UTC

glib-mkenums seems to dislike files whose contents are not UTF-8, such as the files that are part of libgda 5.2.4:


```
Traceback (most recent call last):
  File "/opt/local/bin/glib-mkenums", line 688, in <module>
    process_file(fname)
  File "/opt/local/bin/glib-mkenums", line 420, in process_file
    line = curfile.readline()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 296, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 167: invalid continuation byte
```



I know glib-mkenums was rewritten in Python recently; presumably this is a consequence of that.
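
For readers who want to reproduce the error class outside of a libgda build, the following is a minimal sketch. The header line and the name in it are made up for illustration; only the byte 0xf3 and the error message come from the traceback above. In ISO-8859-1 the byte 0xf3 is "ó", while in UTF-8 it announces a multi-byte sequence, so the plain ASCII byte that follows it is rejected.

```python
# Hypothetical header line: a Latin-1 "ó" (byte 0xf3) inside a copyright comment.
raw = b" * Copyright (C) 2008 Jos\xf3 Example\n"


def try_decode_utf8(data):
    """Return (text, None) on success, or (None, error) if data is not valid UTF-8."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, err = try_decode_utf8(raw)
if err is not None:
    # err.start is the offset of the offending byte, err.reason the codec's explanation.
    print(f"byte {raw[err.start]:#x} at position {err.start}: {err.reason}")
```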

Comment 1 Philip Withnall 2017-09-28 11:26:05 UTC

Which files, in particular?

What are you proposing as a fix? glib-mkenums can’t automatically work out the encoding of a file. It would need to grow an argument to specify the input encoding of the file. Given that GLib and GTK+ require all UI strings to be UTF-8, I’m not convinced anything good can come of having source files be encoded differently.

Comment 2 Emmanuele Bassi (:ebassi) 2017-09-28 11:56:07 UTC

While porting glib-mkenums I did try and fix the headers that had ISO-8859-15 encoding, but I realise now I only fixed the master branch.

This commit should be backported to the 5.x branches:
https://git.gnome.org/browse/libgda/commit/?id=b611c805b3a2248e2f4f85f993f96c13a05b4730

The only reason why the old glib-mkenums worked was that Perl doesn't care about UTF-8 and will happily apply a regular expression to any chunk of bytes it finds — which, of course, causes many more hilarious failures.
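
The byte-oriented behaviour described above can be imitated in Python with a bytes pattern. This is only an illustrative sketch: the regular expression, the `GdaExample` enum and its values are invented for the example and are not how either version of glib-mkenums parses headers.

```python
import re

# Hypothetical header content: an enum preceded by a Latin-1 encoded comment.
raw = b"/* Author: Jos\xf3 */\ntypedef enum {\n  GDA_FOO,\n  GDA_BAR\n} GdaExample;\n"

# A bytes pattern is matched against raw bytes, so nothing is ever decoded
# and the stray 0xf3 byte in the comment cannot raise an error.
ENUM_RE = re.compile(rb"typedef\s+enum\s*\{(.*?)\}\s*(\w+)\s*;", re.DOTALL)

match = ENUM_RE.search(raw)
name = match.group(2).decode("ascii")
values = [v.strip().decode("ascii") for v in match.group(1).split(b",") if v.strip()]
print(name, values)
```

The trade-off is that any non-ASCII text the tool copies into its output stays in whatever encoding the input happened to use.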

Comment 3 Philip Withnall 2017-09-28 11:59:58 UTC

Moving to be a libgda bug.

Comment 4 Emmanuele Bassi (:ebassi) 2017-09-28 12:05:41 UTC

Incidentally, if you want to know which files have issues, you can increase the verbosity of glib-mkenums.

Comment 5 Ryan Schmidt 2017-09-28 12:07:33 UTC

I have not yet identified all the affected files. But so far, the encoding errors are just due to developers' names written in various encodings (sometimes with multiple different encodings within the same file) in the copyright portion of the comment header. That libgda commit you showed does address some but not all of them.

I agree source files should be UTF-8 these days, and I will patch the remaining libgda files to fix the headers to be so, and I will report the problem to the developers of libgda.

My point was just that previous versions of glib-mkenums processed these files without complaint, and the current version fails with an error. It is annoying to people who maintain collections of software, such as me with MacPorts, when software that we've already published in our collection that used to build no longer does because a new version of a build utility has become more strict about its input.

It would be nice if we at MacPorts, and other maintainers of software collections, did not have to become the UTF-8 police to fix this issue.

Comment 6 Emmanuele Bassi (:ebassi) 2017-09-28 12:46:32 UTC

(In reply to Ryan Schmidt from comment #5)
> I have not yet identified all the affected files. But so far, the encoding
> errors are just due to developers' names written in various encodings
> (sometimes with multiple different encodings within the same file) in the
> copyright portion of the comment header.

Sadly, glib-mkenums has to parse comment blocks because they are used for directives for enumerations.

This means we cannot just ignore everything.

Additionally, without knowing the encoding of the file (which is impossible to know for C headers), we cannot ask Python to convert everything to UTF-8.
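
A small sketch of why the encoding cannot simply be guessed: the same bytes decode without error under several single-byte encodings, each giving different text. The byte string and the list of candidate encodings are chosen for illustration only.

```python
# Two bytes that are valid in many single-byte encodings but mean different things.
raw = b"\xa4 \xf3"

# Decode the same bytes under three candidate encodings; none of them fails.
decodings = {enc: raw.decode(enc) for enc in ("iso-8859-1", "iso-8859-15", "cp1251")}

for enc, text in decodings.items():
    print(enc, repr(text))
```

Since every candidate succeeds, a tool has no error to tell it which interpretation the author intended.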

> That libgda commit you showed does
> address some but not all of them.

As far as I know, it did for libgda master.

> I agree source files should be UTF-8 these days, and I will patch the
> remaining libgda files to fix the headers to be so, and I will report the
> problem to the developers of libgda.

Thanks.

> My point was just that previous versions of glib-mkenums processed these
> files without complaint, and the current version fails with an error. It is
> annoying to people who maintain collections of software, such as me with
> MacPorts, when software that we've already published in our collection that
> used to build no longer does because a new version of a build utility has
> become more strict about its input.

The behaviour of the old glib-mkenums was undefined, and unspecified. Sadly, in this case, the old behaviour fell through the cracks of how different languages handle files with mixed encoding.

> It would be nice if we at MacPorts, and other maintainers of software
> collections, did not have to become the UTF-8 police to fix this issue.

Welcome to the wonderful world of packaging software, where you build things that may or may not be maintained, and branches that may or may not be regularly built against newer versions of the toolchain.

It is your responsibility, as the person building software, to report build failures and, occasionally, be the "UTF-8 police".

Comment 7 Philip Withnall 2017-09-28 13:11:05 UTC

(In reply to Emmanuele Bassi (:ebassi) from comment #6)
> It is your responsibility, as the person building software, to report build
> failures and, occasionally, be the "UTF-8 police".

So thank you for doing so. :-)

Comment 8 Ryan Schmidt 2017-09-28 20:38:48 UTC

Very well! I do agree it's better for the future to enforce UTF-8 for input files, and the number of projects affected by this issue is finite and decreasing.

Here is the patch I committed to address the issue in MacPorts:

https://github.com/macports/macports-ports/blob/2c6b1919577a96760a639ddea8f64a7cb9ba86d4/databases/libgda5/files/UTF-8.patch

Comment 9 Philip Withnall 2017-09-30 20:03:50 UTC

The suggested backport in comment #2 turns out to be non-trivial, so I’ll leave this open for the libgda developers to deal with. A quick run of iconv over `git ls-files` shows that a *lot* of other source files in the repository are not UTF-8; they should also be fixed.
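
As an alternative to running iconv, a check along these lines could list the offending files. It is a rough sketch, not part of any GLib or libgda tooling: `find_non_utf8` is a hypothetical helper, it walks the directory tree rather than `git ls-files`, and it will also flag binary files, so the result needs manual review.

```python
from pathlib import Path


def find_non_utf8(root):
    """Yield (path, byte offset) for each regular file under root that is not valid UTF-8."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that could not be decoded.
            yield path, exc.start


# Example use from the top of a source tree:
#     for path, offset in find_non_utf8("."):
#         print(f"{path}: invalid UTF-8 at byte {offset}")
```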

Comment 10 Ryan Schmidt 2017-10-01 17:56:42 UTC

Created attachment 360731
script to convert ISO to UTF-8 encoding

Comment 11 Ryan Schmidt 2017-10-01 17:57:53 UTC

The MacPorts patch I linked above fixes all files in libgda 5.2.4. At the time I didn't wish to spend additional time providing a version of the patch for master. But here's how I made it.

To search for files that need conversion, I used:

find . -type f -print0 | xargs -0 file | grep ISO

To convert the files, I used a sed script, attached. It uses macOS sed (BSD sed); may need adjustment for GNU sed.

After conversion, the only remaining ISO file is installers/Windows/gda-browser-tmpl.nsi. I don't know Windows installerisms so I don't know if this file should be converted or not.
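
The attached sed script is not reproduced here. For comparison, a similar conversion could be sketched in Python as below. `to_utf8` is a hypothetical helper, and ISO-8859-1 as the fallback is an assumption: any line that is not valid UTF-8 is treated as if it were in that encoding, which is a guess rather than detection. Working line by line lets it cope with files where some lines are already UTF-8 and others are not.

```python
def to_utf8(data, fallback="iso-8859-1"):
    """Re-encode bytes as UTF-8, line by line.

    Lines that already decode as UTF-8 are kept unchanged; any other line is
    assumed (not detected) to be in the fallback encoding and is converted.
    """
    out = []
    for line in data.splitlines(keepends=True):
        try:
            line.decode("utf-8")
            out.append(line)
        except UnicodeDecodeError:
            out.append(line.decode(fallback).encode("utf-8"))
    return b"".join(out)
```

A single line that mixes both encodings would still need fixing by hand, because the fallback would mangle the part of it that was already UTF-8.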

Comment 12 Ryan Schmidt 2017-10-01 18:05:24 UTC

Created attachment 360732
script to convert ISO to UTF-8 encoding

Comment 13 Andrea Zagli 2017-12-23 09:34:58 UTC

Look at bug #787685.

Comment 14 Philip Withnall 2017-12-23 09:53:37 UTC

Let’s close this as a duplicate of bug #787685 then.

*** This bug has been marked as a duplicate of bug 787685 ***

Comment 15 Jan Tojnar 2017-12-23 12:00:22 UTC

Note that some files contain names in both UTF-8 and ISO encodings. https://bugzilla.gnome.org/show_bug.cgi?id=787685 manually fixes this everywhere.