GNOME Bugzilla – Bug 96376
Bugzilla doesn't grok UTF-8
Last modified: 2005-12-30 20:50:03 UTC
Bugzilla doesn't grok UTF-8. UTF-8 characters entered into bug reports turn
into garbage. This may be problematic for reports involving i18n and l10n,
and especially reports in the l10n component.
This is not a problem that we're likely going to be able to solve
here. I'd suggest filing a bug upstream (bugzilla.mozilla.org) with
more concrete examples of the problems, Christian.
I suspect this problem isn't as much about the bugzilla code itself as
it really is about the database backend. If it's MySQL, MySQL by
default stores everything as iso-8859-1. I don't know if there are any
free DBMSs that currently allow text storage in UTF-8, unfortunately.
Although, to some extent it really is a bugzilla problem, since
bugzilla doesn't specify the character set used on the pages (and
hence not for the input in forms) and hence everything gets treated as
iso-8859-1 (which is the default for HTML 4 unless otherwise
specified). So if bugzilla could be modified to specify the character
set, this problem would be solved to some extent.
In fact, when I went searching now in Mozilla bugzilla, this is what
http://bugzilla.mozilla.org/show_bug.cgi?id=126266 is all about.
Is not directly related to this bug, but I think you can find this
PostgreSQL allows you to store all information as UTF-8 since long ago.
Another great reason to upgrade to postgres whenever someone ports
Upstream bugs http://bugzilla.mozilla.org/show_bug.cgi?id=44343 and
http://bugzilla.mozilla.org/show_bug.cgi?id=126266 seem relevant, FWIW. Neither
seems optimistic about fixes in the short term, though.
Another comment: UTF-8 is completely valid ISO-8859-1 as well, so you can store
it without changing anything in MySQL. Of course, you'd want to convert
ISO-8859-1 to UTF-8 first, and then adding "AddDefaultCharset UTF-8" in Apache
configuration should be sufficient.
The only problem with this would be that collation would not be perfect (eg.
strings such as "éto" and "eto" might get sorted way apart), and case
insensitive features might fail on non-ASCII, but that's not too big of a
problem I think.
FWIW, I believe MySQL 4.0 supports UTF-8 as well.
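To make the "store UTF-8 in an ISO-8859-1 column" idea concrete, here is a minimal sketch in Python. It does not touch a real database: the helper names store_as_latin1 and load_from_latin1 are made up for illustration, and the Latin-1 column is only modelled as a string obtained by reading the UTF-8 bytes as ISO-8859-1. It shows that the round trip is lossless but that sorting follows byte order.

```python
# Hypothetical model of a Latin-1 column; no real MySQL connection is involved.

def store_as_latin1(text: str) -> str:
    """Encode text as UTF-8 and let a Latin-1 store interpret the raw bytes."""
    return text.encode("utf-8").decode("iso-8859-1")


def load_from_latin1(stored: str) -> str:
    """Recover the original text from what the Latin-1 store holds."""
    return stored.encode("iso-8859-1").decode("utf-8")


words = ["eto", "zebra", "éto"]
stored = [store_as_latin1(w) for w in words]

# Every byte value maps to exactly one ISO-8859-1 code point, so nothing is lost.
round_tripped = [load_from_latin1(s) for s in stored]

# Sorting the stored form is effectively a byte-order sort: "éto" starts with
# byte 0xC3, which sorts after every ASCII letter, so it ends up after "zebra".
sorted_words = [load_from_latin1(s) for s in sorted(stored)]

print(round_tripped)
print(sorted_words)
```

A collation that knows about UTF-8 would be expected to place "éto" next to "eto"; the byte-order sort above is the imperfect collation described in the comment.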
Danilo, it looks like UTF-8 support in MySQL is currently just experimental
support in the 4.1 development series. If it was working in 4.0 that would have
been nice :)
My mistake, looks like 4.1 isn't a development version, it's the latest
I can't find the link offhand, but I believe the next bugzilla release will
require 4.1, so if the admins want to think about upgrading at some point... :)
RHEL 4 seems to provide MySQL 4.1.x, so that version of MySQL is likely to get
installed when and if the servers get upgraded to RHEL 4.
I've upgraded mysql on button.gnome.org to v4.1, if that helps. GNOME bugzilla
is currently working from a v3.23.28 mysql on window.gnome.org, but you could
always dump a snapshot onto button, set up a test instance of bugzilla working
from that, and well, progress this etc :)
UTF-8 support from the database is probably not needed. It's needed if the
database should case-fold or count the characters in a string. I'm not sure that
that's needed in bugzilla.
Any database can store UTF-8, including MySQL 3.
https://bugzilla.mozilla.org/show_bug.cgi?id=126266#c51 details what "database
supports UTF-8" means.
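As a rough illustration of the two operations mentioned above (counting characters and case-folding), the following Python sketch compares what a charset-unaware store sees (raw bytes) with what a UTF-8-aware one would see (characters). The sample string is invented, and bytes.upper() is only a stand-in for a database function that handles ASCII alone.

```python
title = "Éto naïve"             # invented sample text
raw = title.encode("utf-8")     # the bytes a charset-unaware database holds

# Counting: the database would report bytes, not characters.
char_count = len(title)
byte_count = len(raw)

# Case conversion: a byte-oriented routine only touches ASCII letters,
# so the non-ASCII "ï" keeps its lower-case form.
byte_upper = raw.upper().decode("utf-8")
text_upper = title.upper()

print(char_count, byte_count)
print(byte_upper, "vs", text_upper)
```

Storage and retrieval are unaffected in both cases; only operations that need to know where one character ends and the next begins give different answers.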
To ensure that clients send back UTF-8, any form where the user can enter text
should be modified to include the accept-charset attribute. For example:
<form name="changeform" method="post" action="process_bug.cgi" accept-charset="UTF-8">
Without this, if a user manually sets the charset on a bug page (eg. to
correctly display a comment entered before bugzilla started serving its pages as
UTF-8), they may submit data in that charset.
This attribute is documented here:
It is supported by Mozilla and to a lesser extent, by IE (it will respect the
attribute provided its value is "UTF-8").
Changed the script from https://bugzilla.mozilla.org/show_bug.cgi?id=280633 to
run on b.g.o. Results:
column                     total rows   non-ascii   non-utf8
attachments.description         35229          35         15
attachments.filename            35229          26         20
attachments.mimetype            35229           0          0
bugs.bug_file_loc              136974           0          0
bugs.short_desc                136974         746        211
bugs.status_whiteboard         136974           0          0
longdescs.thetext              533637       28916       2802
namedqueries.name                2038           4          2
namedqueries.query               2038           0          0
profiles.realname               52361        1126       1062
Quoting a comment from that bug:
"Nice work :-) We need to remember, of course, that just because something
decodes as UTF-8 doesn't necessarily mean that it is."
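The counting behind the table can be sketched roughly as follows. This is not the script from the Mozilla bug; it is a Python approximation that assumes "non-ascii" means "contains at least one byte above 0x7F" and "non-utf8" means "non-ascii and not decodable as UTF-8". The names classify and summarize and the sample values are made up.

```python
def classify(value: bytes) -> str:
    """Return 'ascii', 'utf8' or 'non-utf8' for one stored value."""
    try:
        value.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        value.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        return "non-utf8"


def summarize(values):
    """Return (total rows, non-ascii rows, non-utf8 rows) for a column."""
    total = non_ascii = non_utf8 = 0
    for value in values:
        total += 1
        kind = classify(value)
        if kind != "ascii":
            non_ascii += 1
        if kind == "non-utf8":
            non_utf8 += 1
    return total, non_ascii, non_utf8


sample = [
    b"plain",
    "caf\u00e9".encode("utf-8"),          # real UTF-8
    "caf\u00e9".encode("iso-8859-1"),     # Latin-1, not valid UTF-8
    "\u00c3\u00a9".encode("iso-8859-1"),  # Latin-1 text that happens to decode as UTF-8
]

print(summarize(sample))
```

The last sample value is the case the quoted comment warns about: it was entered as Latin-1, yet its bytes form a valid UTF-8 sequence, so it is counted as UTF-8.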
People can always update their realname. Actually I'm only concerned with
comments. I want to change the script to look only at open bugs and see what that gives us.
FYI, it seems Apache has been configured to specify UTF-8 for b.g.o. This wasn't
done in Bugzilla. This causes problems as old comments can be in any charset. Or
so I thought. As no one has complained yet, other than requesting that the charset
also be set in emails (bug 300051), I've set it for the emails as well.
Created attachment 45412
Assume bug-buddy is utf-8
Bug-buddy emails should be UTF-8 (except for possible gdb output), but b.g.o
didn't store them as UTF-8.
In bug-buddy-import.pl the strings are assumed to be non-UTF-8. When storing
this into the database, the strings are again reencoded as UTF-8. UTF-8 again
encoded as UTF-8 will look strange. The problem is explained here:
(unfortunately I didn't know about that page until after days of trying to
figure this out I finally found the cause)
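The double encoding described above can be reproduced in a few lines. This is a Python sketch of the effect, not the Perl code from bug-buddy-import.pl, and the sample text is invented.

```python
original = "café"
utf8_bytes = original.encode("utf-8")        # what the email actually contains

# Mistake: the bytes are assumed to be Latin-1 text instead of UTF-8 ...
misread = utf8_bytes.decode("iso-8859-1")

# ... and are then encoded as UTF-8 again when written to the database.
double_encoded = misread.encode("utf-8")

# A page served as UTF-8 now shows two odd characters where "é" should be.
shown = double_encoded.decode("utf-8")

# Declaring the input as UTF-8 avoids the second encoding step doing damage.
correct = utf8_bytes.decode("utf-8").encode("utf-8")

print(shown)
print(correct == utf8_bytes)
```

Each non-ASCII character turns into two characters per original byte, which is the "strange" look of UTF-8 encoded as UTF-8 a second time.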
The patch 'informs' bug-buddy that the data is UTF-8. Non-well-formed UTF-8 data
is stored as \xHH (HH is the hex representation of the octet that could not be
decoded to utf8). See 'perlqq mode' at:
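For readers who do not use Perl, a comparable effect can be sketched in Python with the "backslashreplace" error handler (available for decoding since Python 3.5). This is an analogue, not the patch itself; the function name decode_lenient is made up, and Python writes the hex digits in lower case.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, keeping undecodable octets visible as \\xhh escapes."""
    return raw.decode("utf-8", errors="backslashreplace")


good = decode_lenient(b"caf\xc3\xa9")   # well-formed UTF-8
bad = decode_lenient(b"caf\xe9 ok")     # a stray Latin-1 byte

print(good)
print(bad)
```

The stray byte is kept as readable text instead of being dropped or aborting the import, so non-UTF-8 input remains visible in the stored comment.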
Unfortunately this still caused the email transmitted by bug-buddy-import.pl to
be malformed (bad subject and body). Follow-ups were fine. A local installation
with newer perl, perl-CGI, etc. was OK. The cause was the .UTF-8 in
LANG="en_US.UTF-8" (although that works fine locally). The patch also modifies the
LANG environment to work around that.
As bug-buddy gets an XML I currently believe that Halloween perhaps could be
changed to set the encoding/charset of the XML to UTF-8, but that is way more
difficult to test, and current patch works.
After some more testing I'll commit this.
Committed the patch.
I've looked at about 1700 bug-buddy emails. 10 of those weren't UTF-8 (including
headers). Some had non-UTF-8 in the From:, some had non-UTF-8 in the gdb data
(contents of a variable -- the patch actually makes the result clearer) and one had
a few bytes of non-UTF-8 data copy-pasted into the report. The result will be a big
improvement (non-UTF-8 is still visible).
Buglist didn't specify charset=utf-8 when doing server push (the 'Please Stand
by' page seen under Mozilla/Firefox/... browsers). Added that. Searches should
now display UTF-8 characters correctly.
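For context, a server-push response is a multipart/x-mixed-replace stream in which every part carries its own Content-Type header, so the charset has to be stated per part. The sketch below is an assumption about the general shape of such a response, written in Python rather than taken from Bugzilla's Perl code; the function names and the boundary string are made up.

```python
def build_part(html: str, boundary: str) -> bytes:
    """One pushed page, with the charset declared in the part's own header."""
    headers = f"--{boundary}\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n"
    return headers.encode("ascii") + html.encode("utf-8") + b"\r\n"


def build_push_response(pages, boundary: str) -> bytes:
    """Top-level multipart header, then each page as a replacement part."""
    top = f"Content-Type: multipart/x-mixed-replace;boundary={boundary}\r\n\r\n"
    body = b"".join(build_part(page, boundary) for page in pages)
    closing = f"--{boundary}--\r\n".encode("ascii")
    return top.encode("ascii") + body + closing


response = build_push_response(
    ["<p>Please stand by</p>", "<p>Résultats</p>"], "BOUNDARY"
)
print(response.decode("utf-8"))
```

If the per-part header omitted charset=UTF-8, the browser would have to guess the encoding of each pushed page, which is presumably why search results were displayed incorrectly.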
Decided current situation is ok (pretend everything is UTF-8, even if some data is not).