Bug 787862 - Always open files in text mode and always use utf-8
Status: RESOLVED FIXED
Product: gtk-doc
Classification: Platform
Component: general
Version: unspecified
Hardware: Other Linux
Importance: Normal normal
Target Milestone: 1.27
Assigned To: gtk-doc maintainers
QA Contact: gtk-doc maintainers
Depends on:
Blocks:
 
 
Reported: 2017-09-18 22:24 UTC by Christoph Reiter (lazka)
Modified: 2017-11-01 20:00 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Always open files in text mode and always use utf-8 (17.73 KB, patch)
2017-09-18 22:24 UTC, Christoph Reiter (lazka)
committed
mkdb: Mark multiple Unicode strings as such (2.76 KB, patch)
2017-10-25 19:32 UTC, Christoph Reiter (lazka)
committed
mkdb: Mark multiple Unicode strings as such (2.78 KB, patch)
2017-10-26 18:41 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
committed
Always open files in text mode and always use utf-8 (17.80 KB, patch)
2017-10-26 18:41 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
committed

Description Christoph Reiter (lazka) 2017-09-18 22:24:11 UTC
Created attachment 360014
Always open files in text mode and always use utf-8

(I'm currently trying to get gtk-doc working on Windows: https://github.com/Alexpux/MINGW-packages/pull/2918)

----

Introduces a common.open_text() helper with saner defaults for opening
text files across Python versions.

open() defaults to the locale encoding, which is utf-8 on a properly
configured Unix but cp-1252 on Windows, and cp-1252 can't represent all
of Unicode. Instead of relying on the default, always use utf-8 for
text files.

To reduce the difference of types processed by Python 2 vs 3 use
codecs.open() to open text files in text mode on Python 2. The
resulting file object will return unicode like on Python 3, but still
allows passing in ASCII only str.

Also fixes a few missing file.close() operations, which matters on
Windows because files that are still open can't be renamed or deleted.
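
For readers who don't want to open the attachment, here is a minimal sketch of what a helper like common.open_text() could look like. It is not copied from the patch: the signature, the restriction to 'r'/'w' modes, and the version check are assumptions based only on the description above.

```python
import codecs
import sys


def open_text(filename, mode="r", encoding="utf-8"):
    """Open a text file with an explicit encoding (utf-8 unless told otherwise).

    Hypothetical sketch; the real helper in the patch may differ.
    """
    # Assumption: only plain read/write text modes are needed.
    if mode not in ("r", "w"):
        raise ValueError("mode %r not supported, must be 'r' or 'w'" % mode)
    if sys.version_info[0] >= 3:
        # Python 3: the built-in open() accepts an encoding directly.
        return open(filename, mode, encoding=encoding)
    # Python 2: codecs.open() returns unicode on read and, on write,
    # accepts both unicode and ASCII-only str.
    return codecs.open(filename, mode, encoding=encoding)
```

Callers would then write `with open_text(path) as f:` so the file is also closed reliably, which covers the Windows rename/delete problem mentioned above.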
Comment 1 Stefan Sauer (gstreamer, gtkdoc dev) 2017-10-25 19:10:50 UTC
Do the tests work for you?

When using Python 2, I get:

make[4]: Entering directory `/home/ensonic/projects/gnome/gtk-doc/tests/gobject/docs'
  DOC   00:00:00.004161985: Scanning header files
  DOC   00:00:00.101422715: Introspecting gobjects
  DOC   00:00:00.297362182: Building XML
Traceback (most recent call last):
  • File "/home/ensonic/projects/gnome/gtk-doc/gtkdoc-mkdb", line 61 in <module>
    mkdb.Run(options)
  • File "/home/ensonic/projects/gnome/gtk-doc/gtkdoc/mkdb.py", line 284 in Run
    changed, book_top, book_bottom = OutputDB(os.path.join(ROOT_DIR, MODULE + "-sections.txt"), options)
  • File "/home/ensonic/projects/gnome/gtk-doc/gtkdoc/mkdb.py", line 728 in OutputDB
    sig_synop, sig_desc = GetSignals(symbol)
  • File "/home/ensonic/projects/gnome/gtk-doc/gtkdoc/mkdb.py", line 3315 in GetSignals
    sid, name)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 38: ordinal not in range(128)
  DOC   00:00:00.378855244: Building HTML
Comment 2 Christoph Reiter (lazka) 2017-10-25 19:14:26 UTC
Oh, oops, I'll have a look
Comment 3 Christoph Reiter (lazka) 2017-10-25 19:32:56 UTC
Created attachment 362289
mkdb: Mark multiple Unicode strings as such

These are utf-8 encoded byte strings under Python 2, and when concatenated
with unicode objects they get auto-decoded using the default ascii encoding,
which fails because they are not ascii.
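
The failure mode can be reproduced on Python 3 by doing the ASCII decode explicitly. This is only an illustration: py2_style_concat is a made-up helper that mimics what Python 2 does implicitly for `str + unicode`, and the curly quote is an assumed example of a non-ASCII character (its UTF-8 encoding starts with byte 0xe2, as in the traceback); the actual strings in mkdb.py are not shown here.

```python
def py2_style_concat(byte_str, text):
    """Mimic Python 2's `str + unicode`: decode the bytes as ASCII first."""
    return byte_str.decode("ascii") + text


name = u"changed"
# A left curly quote stored as a utf-8 byte string: b'\xe2\x80\x9c'
quote_bytes = u"\u201c".encode("utf-8")

failed = False
failing_byte = None
try:
    py2_style_concat(quote_bytes, name)
except UnicodeDecodeError as exc:
    failed = True
    # The first byte the ascii codec could not decode.
    failing_byte = exc.object[exc.start:exc.start + 1]

# The fix: keep the literal as unicode so no implicit decode happens.
fixed = u"\u201c" + name
```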
Comment 4 Stefan Sauer (gstreamer, gtkdoc dev) 2017-10-26 18:41:25 UTC
The following fixes have been pushed:
2135887 mkdb: Mark multiple Unicode strings as such
1eeec38 Always open files in text mode and always use utf-8
Comment 5 Stefan Sauer (gstreamer, gtkdoc dev) 2017-10-26 18:41:37 UTC
Created attachment 362365
mkdb: Mark multiple Unicode strings as such

Comment 6 Stefan Sauer (gstreamer, gtkdoc dev) 2017-10-26 18:41:43 UTC
Created attachment 362366
Always open files in text mode and always use utf-8

Comment 7 Stefan Sauer (gstreamer, gtkdoc dev) 2017-10-26 18:42:19 UTC
Thanks!
Comment 8 Dominique Leuenberger 2017-10-27 17:51:40 UTC
While building gmime-2.6, I still see a gtk-doc failure like this:

[   99s] Traceback (most recent call last):
[   99s]   File "/usr/bin/gtkdoc-mkdb", line 61, in <module>
[   99s]     mkdb.Run(options)
[   99s]   File "/usr/share/gtk-doc/python/gtkdoc/mkdb.py", line 281, in Run
[   99s]     ReadSourceDocumentation(sdir, suffix_list, source_dirs, ignore_files)
[   99s]   File "/usr/share/gtk-doc/python/gtkdoc/mkdb.py", line 3638, in ReadSourceDocumentation
[   99s]     ScanSourceFile(fname, ignore_files)
[   99s]   File "/usr/share/gtk-doc/python/gtkdoc/mkdb.py", line 3679, in ScanSourceFile
[   99s]     for line in SRCFILE:
[   99s]   File "/usr/lib/python3.6/codecs.py", line 321, in decode
[   99s]     (result, consumed) = self._buffer_decode(data, self.errors, final)
[   99s] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 6694: invalid start byte

Is this related, or is it a different/new bug?
Comment 9 Christoph Reiter (lazka) 2017-10-27 19:09:46 UTC
Different issue. The files "gmime-iconv-utils.c" and "gmime-filter-charset.c" are Latin-1 encoded instead of utf-8.
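
To illustrate why this fails: byte 0xad on its own is not a valid start byte in utf-8, but it is a valid character (soft hyphen) in Latin-1. The sketch below is a hypothetical fallback decoder, not something gtk-doc does; decode_source and its encoding order are assumptions for the example.

```python
def decode_source(data, encodings=("utf-8", "latin-1")):
    """Try each encoding in order; return (text, encoding_used).

    Latin-1 maps every byte to a code point, so with the default tuple
    the loop always succeeds, at the risk of silently mis-decoding files
    that are in some third encoding.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode the data" % (encodings,))


text, used = decode_source(b"caf\xad")
```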
Comment 10 Stefan Sauer (gstreamer, gtkdoc dev) 2017-11-01 20:00:50 UTC
Still sucks though. I wonder if we could peek at mode-lines in the file and reopen in the right encoding if it is specified. The files in gmime use mode-lines, but don't specify the encoding :/
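
A rough sketch of the "peek at the mode-line" idea, assuming the encoding is declared near the top of the file in a form like `coding: latin-1` or `coding=latin-1` (the convention Python itself uses for source files). sniff_encoding, the regex, and the two-line limit are all assumptions for illustration; nothing like this exists in gtk-doc as far as this report shows.

```python
import re

# Matches e.g. "-*- coding: latin-1 -*-" or "coding=utf-8".
CODING_RE = re.compile(br"coding[:=]\s*([-\w.]+)")


def sniff_encoding(path, default="utf-8", max_lines=2):
    """Return the encoding declared in the first lines of a file, or default."""
    with open(path, "rb") as f:
        for _ in range(max_lines):
            match = CODING_RE.search(f.readline())
            if match:
                return match.group(1).decode("ascii")
    return default
```

A caller could then reopen the file with the returned encoding. As noted above, this would not help with the gmime files, since their mode-lines don't declare an encoding.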