GNOME Bugzilla – Bug 96376
Bugzilla doesn't grok UTF-8
Last modified: 2005-12-30 20:50:03 UTC
Bugzilla doesn't grok UTF-8. UTF-8 characters entered into bug reports turn into garbage. This may be problematic for reports involving i18n and l10n, and especially reports in the l10n component.
This is not a problem that we're likely going to be able to solve here. I'd suggest filing a bug upstream (bugzilla.mozilla.org) with more concrete examples of the problems, Christian.
I suspect this problem isn't so much about the bugzilla code itself as about the database backend. If it's MySQL, MySQL by default stores everything as iso-8859-1. I don't know if there are any free DBMSes that currently allow text storage in UTF-8, unfortunately. To some extent, though, it really is a bugzilla problem: bugzilla doesn't specify the character set used on its pages (and hence not for the input in forms), so everything gets treated as iso-8859-1 (the default that HTTP/1.1 assumes when no charset is specified). So if bugzilla could be modified to specify the character set, this problem would be solved to some extent. In fact, when I went searching just now in Mozilla bugzilla, I found that this is what http://bugzilla.mozilla.org/show_bug.cgi?id=126266 is all about.
This is not directly related to this bug, but I think you may find this information useful: PostgreSQL has been able to store all information as UTF-8 for a long time. Cheers.
Another great reason to upgrade to postgres whenever someone ports bugzilla. :)
Upstream bugs http://bugzilla.mozilla.org/show_bug.cgi?id=44343 and http://bugzilla.mozilla.org/show_bug.cgi?id=126266 seem relevant, FWIW. Neither seems optimistic about a fix in the short term, though.
Another comment: any UTF-8 byte sequence is also a valid ISO-8859-1 byte sequence, so you can store UTF-8 in MySQL without changing anything. Of course, you'd want to convert the existing ISO-8859-1 data to UTF-8 first, and then adding "AddDefaultCharset UTF-8" to the Apache configuration should be sufficient. The only problem with this would be that collation would not be perfect (e.g. strings such as "éto" and "eto" might get sorted far apart), and case-insensitive features might fail on non-ASCII characters, but that's not too big a problem, I think. FWIW, I believe MySQL 4.0 supports UTF-8 as well.
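To illustrate the round trip and the collation caveat, here is a minimal Python sketch. It assumes a store that hands back exactly the bytes it was given and merely labels them ISO-8859-1; the helper names and sample words are made up for illustration and have nothing to do with Bugzilla's or MySQL's actual code.

```python
# Sketch: UTF-8 bytes pass unchanged through an ISO-8859-1 (Latin-1) store,
# but ordering by the stored Latin-1 characters is not what a reader expects.

def store_as_latin1(text: str) -> str:
    """Pretend the database interprets the UTF-8 bytes as Latin-1 characters."""
    return text.encode("utf-8").decode("iso-8859-1")

def load_from_latin1(stored: str) -> str:
    """Recover the original text by reversing that interpretation."""
    return stored.encode("iso-8859-1").decode("utf-8")

words = ["eto", "zebra", "\u00e9to"]  # "\u00e9to" is "éto"

stored = [store_as_latin1(w) for w in words]

# Nothing is lost: every word comes back intact.
round_tripped = [load_from_latin1(s) for s in stored]

# Sorting on the stored form puts "éto" after "zebra", far from "eto",
# because its first stored character is 0xC3.
sorted_words = [load_from_latin1(s) for s in sorted(stored)]
```

So the data itself is safe; only operations that need to understand characters (sorting, case folding) are affected.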
Danilo, it looks like UTF-8 support in MySQL is currently just experimental support in the 4.1 development series. If it was working in 4.0 that would have been nice :)
My mistake, looks like 4.1 isn't a development version, it's the latest recommended version. http://dev.mysql.com/downloads/
I can't find the link offhand, but I believe the next bugzilla release will require 4.1, so if the admins want to think about upgrading at some point... :)
RHEL 4 seems to provide MySQL 4.1.x, so that version of MySQL is likely to get installed when and if the servers get upgraded to RHEL 4. http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/release-notes/as-x86/#id3465361
I've upgraded mysql on button.gnome.org to v4.1, if that helps. GNOME bugzilla is currently working from a v3.23.28 mysql on window.gnome.org, but you could always dump a snapshot onto button, set up a test instance of bugzilla working from that, and well, progress this etc :)
UTF-8 support from the database is probably not needed. It's only needed if the database has to case-fold or count the characters in a string, and I'm not sure that's needed in bugzilla. Any database can store UTF-8, including MySQL 3.
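A small Python sketch of the two operations mentioned above, assuming a backend that only sees raw bytes. The sample string is arbitrary.

```python
# Sketch: what a byte-oriented store gets wrong when it does not know the
# data is UTF-8. Storage is unaffected; counting and case folding are not.

text = "\u00c9TO"            # "ÉTO", three characters
raw = text.encode("utf-8")   # what a byte-oriented database would hold

char_count = len(text)       # characters as a user sees them
byte_count = len(raw)        # bytes as the database sees them ("É" takes two)

# bytes.lower() only lowercases ASCII letters, so "É" is left untouched.
byte_lowered = raw.lower().decode("utf-8")

# A Unicode-aware lowercase handles "É" as well.
text_lowered = text.lower()
```

If bugzilla never asks the database to do either of these on non-ASCII text, the difference never shows up.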
https://bugzilla.mozilla.org/show_bug.cgi?id=126266#c51 details what "database supports UTF-8" means.
To ensure that clients send back UTF-8, any form where the user can enter text should be modified to include the accept-charset attribute. For example:
<form name="changeform" method="post" action="process_bug.cgi" accept-charset="UTF-8">
Without this, if a user manually sets the charset on a bug page (e.g. to correctly display a comment entered before bugzilla started serving its pages as UTF-8), they may submit data in that charset. This attribute is documented here: http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.3 It is supported by Mozilla and, to a lesser extent, by IE (which will respect the attribute provided its value is "UTF-8").
Changed the script from https://bugzilla.mozilla.org/show_bug.cgi?id=280633 to run on b.g.o. Results:
                          total rows  non-ascii  non-utf8
attachments.description:       35229         35        15
attachments.filename:          35229         26        20
attachments.mimetype:          35229          0         0
bugs.bug_file_loc:            136974          0         0
bugs.short_desc:              136974        746       211
bugs.status_whiteboard:       136974          0         0
longdescs.thetext:            533637      28916      2802
namedqueries.name:              2038          4         2
namedqueries.query:             2038          0         0
profiles.realname:             52361       1126      1062
Quoting a comment from that bug: "Nice work :-) We need to remember, of course, that just because something decodes as UTF-8 doesn't necessarily mean that it is."
People can always update their realname. Actually I'm only concerned with comments. I want to change the script for only open bugs and see what that gives us.
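For readers without access to that script, the kind of check it performs can be sketched in Python roughly as follows. This is not the script from the upstream bug; the function names are hypothetical, and it assumes that the "non-ascii" column counts every value containing a byte above 0x7F while "non-utf8" counts the subset of those that fail to decode as UTF-8.

```python
from collections import Counter

def classify(value: bytes) -> str:
    """Label one stored value as 'ascii', 'utf8' or 'non-utf8'."""
    if value.isascii():
        return "ascii"
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return "non-utf8"
    return "utf8"

def tally(values):
    """Return (total rows, non-ascii, non-utf8) for an iterable of byte strings."""
    counts = Counter(classify(v) for v in values)
    total = sum(counts.values())
    non_ascii = counts["utf8"] + counts["non-utf8"]
    return total, non_ascii, counts["non-utf8"]

# Illustrative sample only, not data from b.g.o.
sample = [
    b"plain ascii",
    "caf\u00e9".encode("utf-8"),        # valid UTF-8
    "caf\u00e9".encode("iso-8859-1"),   # a lone 0xE9 byte, not valid UTF-8
]
result = tally(sample)
```

The quoted caveat still applies to a check like this: the ISO-8859-1 bytes for "Ã©" are exactly the UTF-8 bytes for "é", so a value that decodes cleanly is not guaranteed to have been written as UTF-8.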
FYI, it seems Apache has been configured to specify UTF-8 for b.g.o. This wasn't done in Bugzilla itself. I thought this would cause problems, as old comments can be in any charset. But as no one has complained yet, other than to request that the charset also be set in emails (bug 300051), I've set it for the emails as well.
Created attachment 45412
Assume bug-buddy is utf-8
Bug-buddy emails should be UTF-8 (except for possible gdb output), but b.g.o didn't store them as UTF-8. In bug-buddy-import.pl the strings are assumed to be non-UTF-8. When they are stored into the database, the strings are then encoded as UTF-8 again. UTF-8 encoded as UTF-8 a second time will look strange. The problem is explained here: http://www.ahinea.com/en/tech/perl-unicode-struggle.html (unfortunately I didn't know about that page until, after days of trying to figure this out, I finally found the cause).
The patch 'informs' bug-buddy that the data is UTF-8. Octets that are not well-formed UTF-8 are stored as \xHH (HH is the hex representation of the octet that could not be decoded as UTF-8). See 'perlqq mode' at: http://search.cpan.org/~jhi/perl-5.8.1/ext/Encode/Encode.pm#Handling_Malformed_Data
Unfortunately this still caused the email sent by bug-buddy-import.pl to be malformed (bad subject and body). Follow-ups were fine. A local installation with newer perl, perl-CGI, etc. was ok. The cause was the .UTF-8 in LANG="en_US.UTF-8" (although that works fine locally). The patch also modifies the LANG environment to work around that.
As bug-buddy gets an XML I currently believe that Halloween perhaps could be changed to set the encoding/charset of the XML to UTF-8, but that is way more difficult to test, and the current patch works. After some more testing I'll commit this.
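The double-encoding effect and the \xHH fallback can be reproduced outside Perl. The following is a minimal Python sketch, not the patch itself: the sample bytes are made up, and Python's "backslashreplace" error handler is used here only as a rough analogue of Encode's perlqq mode (it emits lowercase hex digits).

```python
# Sketch: why UTF-8 encoded twice "looks strange", and how malformed octets
# can be kept visible instead of being dropped.

original = "\u00e9"                      # "é"
utf8_bytes = original.encode("utf-8")    # b'\xc3\xa9', as sent in the email

# The bug: the bytes are assumed NOT to be UTF-8 (treated as Latin-1
# characters here) and are then encoded as UTF-8 once more on storage.
double_encoded = utf8_bytes.decode("iso-8859-1").encode("utf-8")
shown = double_encoded.decode("utf-8")   # what a UTF-8 page then displays

# The fix: declare the input to be UTF-8 before storing it.
fixed = utf8_bytes.decode("utf-8").encode("utf-8")

# Octets that are not well-formed UTF-8 are kept as visible \x escapes.
malformed = b"ok \xe9 end"
escaped = malformed.decode("utf-8", errors="backslashreplace")
```

Here `shown` comes out as "Ã©" rather than "é", which is the kind of garbage the double encoding produced.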
Committed the patch. I've looked at about 1700 bug-buddy emails. 10 of those weren't UTF-8 (including headers). Some had non-UTF-8 in the From:, some had non-UTF-8 in the gdb data (the contents of a variable -- the patch actually makes the result clearer) and one had a few copy-pasted bytes of non-UTF-8 data in the report. The result will be a big improvement (non-UTF-8 data is still visible).
Buglist didn't specify charset=utf-8 when doing server push (the 'Please Stand by' page seen under Mozilla/Firefox/... browsers). Added that. Searches should now display UTF-8 characters correctly.
Decided current situation is ok (pretend everything is UTF-8, even if some data is not).