GNOME Bugzilla – Bug 788283
UnicodeDecodeError: 'utf8' codec can't decode byte
Last modified: 2017-12-23 12:00:22 UTC
glib-mkenums seems to dislike files whose contents are not UTF-8, such as the files that are part of libgda 5.2.4:
Traceback (most recent call last):
    process_file(fname)
    line = curfile.readline()
    (result, consumed) = self._buffer_decode(data, self.errors, final)
I know glib-mkenums was rewritten in Python recently; presumably this is a consequence of that.
Which files, in particular? What are you proposing as a fix? glib-mkenums can’t automatically work out the encoding of a file. It would need to grow an argument to specify the input encoding of the file. Given that GLib and GTK+ require all UI strings to be UTF-8, I’m not convinced anything good can come of having source files be encoded differently.
While porting glib-mkenums I did try and fix the headers that had ISO-8859-15 encoding, but I realise now I only fixed the master branch. This commit should be backported to the 5.x branches: https://git.gnome.org/browse/libgda/commit/?id=b611c805b3a2248e2f4f85f993f96c13a05b4730 The only reason why the old glib-mkenums worked was that Perl doesn't care about UTF-8 and will happily apply a regular expression to any chunk of bytes it finds — which, of course, causes many more hilarious failures.
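To illustrate the difference described above, here is a minimal sketch in Python. The header line and the name in it are made-up sample data, not taken from libgda, and the regular expression is only a stand-in for the kind of byte-level matching the Perl version performed. It shows that strict UTF-8 decoding rejects an ISO-8859-15 byte, while a byte-oriented match never looks at the encoding at all.

```python
import re

# Hypothetical sample: a copyright line saved as ISO-8859-15, so the
# accented letter is the single byte 0xE9 rather than a UTF-8 sequence.
raw = "/* Copyright (C) 2009 José Example */\n".encode("iso-8859-15")

# Strict UTF-8 decoding rejects the lone 0xE9 byte.
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc.reason)

# A bytes pattern matches regardless of how the non-ASCII bytes are encoded.
match = re.search(rb"Copyright \(C\) (\d{4})", raw)
year = match.group(1)
print("year matched on raw bytes:", year)
```

The byte-level match succeeds here only because the pattern is pure ASCII; anything that depends on the non-ASCII bytes would be interpreted inconsistently.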
Moving to be a libgda bug.
Incidentally, if you want to know which files have issues, you can increase the verbosity of glib-mkenums.
I have not yet identified all the affected files. But so far, the encoding errors are just due to developers' names written in various encodings (sometimes with multiple different codings within the same file) in the copyright portion of the comment header. That libgda commit you showed does address some but not all of them. I agree source files should be UTF-8 these days, and I will patch the remaining libgda files to fix the headers to be so, and I will report the problem to the developers of libgda. My point was just that previous versions of glib-mkenums processed these files without complaint, and the current version fails with an error. It is annoying to people who maintain collections of software, such as me with MacPorts, when software that we've already published in our collection that used to build no longer does because a new version of a build utility has become more strict about its input. It would be nice if we at MacPorts, and other maintainers of software collections, did not have to become the UTF-8 police to fix this issue.
(In reply to Ryan Schmidt from comment #5)
> I have not yet identified all the affected files. But so far, the encoding
> errors are just due to developers' names written in various encodings
> (sometimes with multiple different codings within the same file) in the
> copyright portion of the comment header.

Sadly, glib-mkenums has to parse comment blocks because they are used for directives for enumerations. This means we cannot just ignore everything. Additionally, without knowing the encoding of the file (which is impossible to know for C headers), we cannot ask Python to encode everything into UTF-8.

> That libgda commit you showed does
> address some but not all of them.

As far as I know, it did for libgda master.

> I agree source files should be UTF-8 these days, and I will patch the
> remaining libgda files to fix the headers to be so, and I will report the
> problem to the developers of libgda.

Thanks.

> My point was just that previous versions of glib-mkenums processed these
> files without complaint, and the current version fails with an error. It is
> annoying to people who maintain collections of software, such as me with
> MacPorts, when software that we've already published in our collection that
> used to build no longer does because a new version of a build utility has
> become more strict about its input.

The behaviour of the old glib-mkenums was undefined, and unspecified. Sadly, in this case, the old behaviour fell through the cracks of how different languages handle files with mixed encoding.

> It would be nice if we at MacPorts, and other maintainers of software
> collections, did not have to become the UTF-8 police to fix this issue.

Welcome to the wonderful world of packaging software, where you build things that may or may not be maintained, and branches that may or may not be regularly built against newer versions of the toolchain.
It is your responsibility, as the person building software, to report build failures and, occasionally, be the "UTF-8 police".
(In reply to Emmanuele Bassi (:ebassi) from comment #6)
> It is your responsibility, as the person building software, to report build
> failures and, occasionally, be the "UTF-8 police".

So thank you for doing so. :-)
Very well! I do agree it's better for the future to enforce UTF-8 for input files, and the number of projects affected by this issue is finite and decreasing. Here is the patch I committed to address the issue in MacPorts: https://github.com/macports/macports-ports/blob/2c6b1919577a96760a639ddea8f64a7cb9ba86d4/databases/libgda5/files/UTF-8.patch
The suggested backport in comment #2 turns out to be non-trivial, so I’ll leave this open for the libgda developers to deal with. A quick run of iconv over `git ls-files` shows that a *lot* of other source files in the repository are not UTF-8; they should also be fixed.
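A check along the lines of the iconv run mentioned above can be approximated in Python. This is a sketch under stated assumptions: the function names `is_valid_utf8` and `find_non_utf8` are hypothetical, the suffix filter is an arbitrary choice for illustration, and it walks the directory tree rather than asking git for the list of tracked files.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(root, suffixes=(".h", ".c")):
    """List files under root (filtered by suffix) that are not valid UTF-8.

    The suffix filter is an assumption; widen it to cover other source types.
    """
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            if not is_valid_utf8(path.read_bytes()):
                bad.append(path)
    return bad
```

Calling `find_non_utf8(".")` from the top of a source tree returns the offending paths. Pure ASCII files pass the check, since ASCII is a subset of UTF-8.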
Created attachment 360731
script to convert ISO to UTF-8 encoding
The MacPorts patch I linked above fixes all files in libgda 5.2.4. At the time I didn't wish to spend additional time providing a version of the patch for master. But here's how I made it.
To search for files that need conversion, I used:
    find . -type f -print0 | xargs -0 file | grep ISO
To convert the files, I used a sed script, attached. It uses macOS sed (BSD sed); may need adjustment for GNU sed.
After conversion, the only remaining ISO file is installers/Windows/gda-browser-tmpl.nsi. I don't know Windows installerisms so I don't know if this file should be converted or not.
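For readers without the attachment, here is a rough Python sketch of one way to do such a conversion. It is not the attached sed script. The function name `to_utf8` and the choice of ISO-8859-15 as the fallback encoding are assumptions made for illustration. It works line by line so that a file mixing UTF-8 and ISO lines is handled: lines that are already valid UTF-8 are left untouched, and only the remaining lines are transcoded.

```python
def to_utf8(data: bytes, fallback: str = "iso-8859-15") -> bytes:
    """Transcode line by line, leaving lines that are already valid UTF-8 alone.

    The fallback encoding is an assumption; a line that mixes both encodings,
    or that uses some other legacy encoding, will not be converted correctly.
    """
    out = []
    for line in data.splitlines(keepends=True):
        try:
            line.decode("utf-8")  # already valid UTF-8: keep the bytes as-is
            out.append(line)
        except UnicodeDecodeError:
            out.append(line.decode(fallback).encode("utf-8"))
    return b"".join(out)
```

Running the function on its own output changes nothing, so it should be safe to apply to a tree that has already been partly fixed. The result still deserves a manual review, since the guess of the fallback encoding cannot be verified automatically.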
Created attachment 360732
script to convert ISO to UTF-8 encoding
Look at bug #787685.
Let’s close this as a duplicate of bug #787685 then. *** This bug has been marked as a duplicate of bug 787685 ***
Note that some files have names both in UTF-8 and ISO. https://bugzilla.gnome.org/show_bug.cgi?id=787685 manually fixes this everywhere.