GNOME Bugzilla – Bug 785113
glib-mkenums Python port fixes
Last modified: 2017-12-13 13:45:19 UTC
Building gtk with glib master fails on Windows. Here are some fixes
Created attachment 355924 [details] [review] glib-mkenums: Use utf-8 for reading files On Windows open() defaults to ANSI and on Python 2 it doesn't take an encoding. Use io.open() instead which provides the same interface on both Python versions.
Created attachment 355925 [details] [review] glib-mkenums: Don't use FileNotFoundError, it's Python 3 only.
Review of attachment 355924 [details] [review]: ACK
Review of attachment 355925 [details] [review]: ACK
another utf8 bug?: gcab (master %)$ glib-mkenums --identifier-prefix GCab --symbol-prefix gcab --template libgcab/gcab-enums.h.etemplate libgcab/gcab-folder.h libgcab/gcab-file.h > libgcab/gcab-enums.h Traceback (most recent call last):
+ Trace 237688
write_output(prod)
print(output, file=output_stream)
Possibly, but all files are marked as UTF-8.
This is also Python 2.x only; Python 3.x parses the files successfully.
Created attachment 356187 [details] [review] glib-mkenums: fix encoding error when writing files Instead of using NamedTemporaryFile, which doesn't take an encoding in Python 2 use mkstemp() to create a file and open it with io.open(), with a proper encoding set.
This is just a guess.. Python 3 should have the same problem with LANG=C
Review of attachment 356187 [details] [review]: Thanks, that fixes the issue.
Thanks for testing
Still happening with glib master, same error as in comment #5, 100% reproducible for me today.
Thanks. Python 2 gives up when redirecting stdout and sets stdout.encoding to None which makes it default to ascii. Python 3 sets it to the locale encoding.
Created attachment 356410 [details] [review] glib-mkenums: Python2: use locale encoding when redirecting stdout In case of Python 2 and stdout being redirected to a file, sys.stdout.encoding is None and it defaults ASCII for encoding text. To match the behaviour of Python 3, which uses the locale encoding also when redirecting to a file, wrap sys.stdout with a StreamWriter using the locale encoding.
Ideally gcab should use "--output" to not depend on the locale encoding.
Why are you attempting to support Python 2?
(In reply to Michael Catanzaro from comment #16) > Why are you attempting to support Python 2? Because glib-mkenums is used to build projects on systems without Python 3.
It also defaults to Python 2 here.
Review of attachment 356410 [details] [review]: This looks okay to me.
2.54.1 contains the above patches, yet I hit something similar when trying to compile xfce4-gtk2-engine 3.2.0: Traceback (most recent call last):
+ Trace 238083
process_file(fname)
line = curfile.readline()
(result, consumed) = self._buffer_decode(data, self.errors, final)
Traceback (most recent call last):
Makefile:737: recipe for target 'xfce_typebuiltin.h' failed 0xd6 in iso-8859-1 is Öaut;, and the author's name in the commented out copyright notice is Tomas Öaut;gren. (glib 2.54.1, python 3.6)
(In reply to Patrick Welche from comment #20) > 2.54.1 contains the above patches, yet I hit something similar when trying > to compile xfce4-gtk2-engine 3.2.0: This issue was not for files with mixed encoding: it was for the output of files in Unicode when the local output encoding was not UTF-8. The parsed files must be UTF-8 because C sources have *no* notion of encoding.
... and from the C point of view, you don't care about what is in the comments. Suddenly the content of every comment in every old bit of source becomes vitally important to the success of a build.
(In reply to Patrick Welche from comment #22) > ... and from the C point of view, you don't care about what is in the > comments. Which is true for a C compiler, *but* glib-mkenums parses comments for directives, and needs to be able to run regular expressions on various bits of C, including comments. > Suddenly the content of every comment in every old bit of source becomes > vitally important to the success of a build. Yes, stuff ends up breaking when behaviour is undefined. The old, Perl-based script was happy to run regular expressions on random byte sequences, which is a recipe for other, exciting undefined behaviour. You have two options: either use ASCII, or use UTF-8. Using other encodings in file formats that have no possibility to define the encoding they use is a recipe for disaster.
OK, you convinced me - in some sense, it isn't C code, but mkenum code.
A possible workaround would be to use the "replace" error handler when decoding, which would give us "?" for every byte failing to decode. Ideally xfce4-gtk2-engine would get converted to utf-8, but I can see how that seems inconvenient, given its more or less abandoned/"finished" state.
Sounds like a good idea, or we'll keep finding strange new build failures in abandonware. (I just spotted the equivalent debian bug with the magic iso-8559-1 -> utf-8 O umlaut patch: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=870738 )
Created attachment 361959 [details] [review] glib-mkenums: Handle source files which are not utf-8, somewhat Hm, so that only affects one case in one file which needs to be patched. If this remains the only case where mkenums breaks I'm not sure working that around in glib is worth the trouble. Old unmaintained software is bound to break over time one way or another when used with newer tools/deps. Anyway, potential patch attached, I'll let others decide if if should be applied.
Probably reporting an error and expecting it to be fixed is the better thing to do. Compilers get more strict about input all the time, and this is comparatively easy to fix.
(In reply to Christoph Reiter (lazka) from comment #27) > Created attachment 361959 [details] [review] [review] > glib-mkenums: Handle source files which are not utf-8, somewhat > > Hm, so that only affects one case in one file which needs to be patched. If > this remains the only case where mkenums breaks I'm not sure working that > around in glib is worth the trouble. Old unmaintained software is bound to > break over time one way or another when used with newer tools/deps. > > Anyway, potential patch attached, I'll let others decide if if should be > applied. Thank you! I will certainly be applying it to pkgsrc.org. Macports discovered the same issue building libgnome debian discovered the same issue building libgnomeui https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=870729 Given that gnome 2 is far more non-linux friendly than gnome 3, users of it will persist, but would you want to patch ancient libgnome? Presumably not, so comment 28 position doesn't really hold.
(In reply to Patrick Welche from comment #29) > Given that gnome 2 is far more non-linux friendly than gnome 3, users of it > will persist, but would you want to patch ancient libgnome? Presumably not, > so comment 28 position doesn't really hold. See: https://git.gnome.org/browse/libgnome/commit/?id=33313713c4f5c1de500859ff128d6fd7e3af5722 https://git.gnome.org/browse/libgnomeui/commit/?id=30334c28794ef85d8973f4ed0779b5ceed6594f2 Distros that are still shipping these libraries should apply the patches.
Even better: distros simply apply the patch in comment 27.
Let's reopen this, instead of constantly commenting on a closed bug.
(In reply to Patrick Welche from comment #31) > Even better: distros simply apply the patch in comment 27. Personally, I'd rather people ported their code to 2017. This should be a (non-fatal) warning, but it should still cause people to port their code to UTF-8. Legacy modules that have long since deprecated notwithstanding, modern code should not be mixing encodings.
I am certainly not advocating that shiny new Gnome 3 code should mix encodings, but it is a shame to force distros to maintain N patches for something as trivial as an author's name containing an iso8859-1 accented character.
Review of attachment 361959 [details] [review]: ::: gobject/glib-mkenums.in @@ +38,3 @@ +# Replace invalid data with a replacement character to keep things working. +# https://bugzilla.gnome.org/show_bug.cgi?id=785113#c20 +decoding_errors = "replace" I'd like to get a warning for this, for people porting code. AFAICS, you can use something like: ``` import codecs def replace_and_warn(err): print_warning('Invalid Unicode at {}'.format(err.start)) return ('?', err.end) codecs.register_error('replace_and_warn', replace_and_warn) ``` and then use `replace_and_warn` as the errors policy.
Sounds good. Better being the enemy of good, I tried to do better with: def replace_and_warn(err): warnings.warn('{} at {}'.format(err.reason, err.start), UnicodeWarning) return ('?', err.end) but see /usr/pkg/bin/glib-mkenums:44: UnicodeWarning: invalid continuation byte at 980 warnings.warn('{} at {}'.format(err.reason, err.start), UnicodeWarning) /usr/pkg/bin/glib-mkenums:44: UnicodeWarning: invalid continuation byte at 980 warnings.warn('{} at {}'.format(err.reason, err.start), UnicodeWarning) Creating xfce_typebuiltin.c... (Why print the warnings.warn source code line?) The principle works though! (Successful build + warning)
*** Bug 789230 has been marked as a duplicate of this bug. ***
Created attachment 362097 [details] [review] replace and warn Given the legwork required to remove the "warnings.warn" line from the build log, I bailed and went for the trivial patch attached which just prints to stderr. I checked +/-7000 for the range for err.object[], and didn't have out of bounds problems, so +/-7 should be safe to get a clue where the problematic character is. Example output: Making all in gtk-3.0 gmake[2]: Entering directory '/tmp/pkgsrc/x11/xfce4-gtk2-engine/work.x86_64/gtk- xfce-engine-3.2.0/gtk-3.0' UnicodeWarning: invalid continuation byte at 980 (b' Tomas \xd6gren <s') UnicodeWarning: invalid continuation byte at 980 (b' Tomas \xd6gren <s') Creating xfce_typebuiltin.c... Comments?
Review of attachment 362097 [details] [review]: ::: gobject/glib-mkenums.in @@ +40,3 @@ +decoding_errors = "replace_and_warn" + +def replace_and_warn(err): Can you move this down, and use `print_warning()` instead? We get the colorised output, and we get the automatic `--g-fatal-warnings` handling for stopping the processing (and the build). @@ +43,3 @@ + sys.stderr.write('UnicodeWarning: {} at {} ({})\n'.format( + err.reason, err.start, +decoding_errors = "replace_and_warn" Hardcoded numbers are a bit meh, especially without a comment on what they mean — or, better yet, a symbolic constant. Additionally, you probably want to add spaces around the `-`, for readability.
Created attachment 362098 [details] [review] replace and warn v2 Output now looks like: gmake[2]: Entering directory '/tmp/pkgsrc/x11/xfce4-gtk2-engine/work.x86_64/gtk-xfce-engine-3.2. 0/gtk-3.0' WARNING: UnicodeWarning: invalid continuation byte at 980 (b' Tomas \xd6gren <s') WARNING: UnicodeWarning: invalid continuation byte at 980 (b' Tomas \xd6gren <s') Creating xfce_typebuiltin.c...
OK to commit?
Review of attachment 362098 [details] [review]: Looks good to me, but somebody with more Python-fu should probably comment on the style.
Review of attachment 362098 [details] [review]: Looks alright to me. If a Pythonista comes along later with suggested improvements, they can always provide a patch. I’d like this one fix in first though, since the decoding_errors variable seems unnecessary. (If you change decoding_errors to another name, everything will break, since it’s not passed as the first parameter of register_error(). I’d just drop the variable though.) ::: gobject/glib-mkenums.in @@ +87,3 @@ +# Replace invalid data with a replacement character to keep things working. +# https://bugzilla.gnome.org/show_bug.cgi?id=785113#c20 +decoding_errors = "replace_and_warn" Is there any reason for this to be defined as a separate variable? I’d be inclined to just pass the string constant 'replace_and_warn' to the errors= arguments.
(In reply to Philip Withnall from comment #43) > Is there any reason for this to be defined as a separate variable? I’d be > inclined to just pass the string constant 'replace_and_warn' to the errors= > arguments. Patrick?
Created attachment 365102 [details] [review] v3 I'll admit I've not actually tested this... Removed the variable and replaced it with string literals.
Review of attachment 365102 [details] [review]: ACK-by: me
I don’t know why git bz didn’t leave a comment here, but I’ve pushed this to master: b6b74402d (HEAD -> master, origin/master, origin/HEAD) glib-mkenums: best effort attempt on non-utf8 encoded files. and (trivially) backported it and some other outstanding glib-mkenums patches to glib-2-54: 2fc3948ba (HEAD -> glib-2-54) glib-mkenums: best effort attempt on non-utf8 encoded files. b6d4edbae mkenums: Don't raise when unlinking a file that does not exist f5d8956ec glib-genmarshal/glib-mkenums: Add comment clarifying licensing 5fe01f182 Sort the list of files being processed by glib-mkenums 0c7d52ba2 glib-mkenums: Add default comment template if none is provided 1d4afde2f glib-mkenums: Fix typo in version string