Bug 96376 - Bugzilla doesn't grok UTF-8
Comment 1 Christian Rose 2002-10-21 06:56:36 UTC
Bugzilla doesn't grok UTF-8. UTF-8 characters entered into bug reports turn
into garbage. This may be problematic for reports involving i18n and l10n,
and especially reports in the l10n component.
Comment 2 Luis Villa 2002-12-03 19:50:09 UTC
This is not a problem that we're likely going to be able to solve
here. I'd suggest filing a bug upstream with
more concrete examples of the problems, Christian.
Comment 3 Christian Rose 2002-12-03 22:59:07 UTC
I suspect this problem isn't as much about the bugzilla code itself as
it really is about the database backend. If it's MySQL, MySQL by
default stores everything as iso-8859-1. I don't know if there are any
free DBMSes that currently allow text storage in UTF-8, unfortunately.

Although, to some extent it really is a bugzilla problem, since
bugzilla doesn't specify the character set used on the pages (and
hence not for the input in forms) and hence everything gets treated as
iso-8859-1 (which is the default for HTML 4 unless otherwise
specified). So if bugzilla could be modified to specify the character
set, this problem would be solved to some extent.
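
A minimal sketch of the failure described above, assuming the browser submits UTF-8 bytes and the reader interprets them as ISO-8859-1. The sample name is only an illustration; none of this is Bugzilla code.

```python
# What the user typed, and the bytes a UTF-8 browser would submit.
text = "Perelló"
submitted = text.encode("utf-8")

# A reader that assumes ISO-8859-1 sees two characters for the one
# accented letter, which is the "garbage" reported in this bug.
misread = submitted.decode("iso-8859-1")
print(misread)  # PerellÃ³

# Decoding with the charset the page should have declared recovers the text.
assert submitted.decode("utf-8") == text
```

The stored bytes are the same in both cases; only the declared character set decides how they are displayed.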

In fact, when I went searching now in Mozilla bugzilla, this is what is all about.
Comment 4 Carlos Perelló Marín 2002-12-03 23:09:30 UTC
This is not directly related to this bug, but I think you may find this
information useful.

PostgreSQL has long allowed you to store all information as UTF-8.

Comment 5 Luis Villa 2002-12-04 03:18:31 UTC
Another great reason to upgrade to postgres whenever someone ports
bugzilla. :)
Comment 6 Luis Villa 2004-03-26 22:50:25 UTC
Upstream bugs and seem relevant, FWIW. Neither
seems optimistic about fixes in the short term, though.
Comment 7 Danilo Segan 2005-02-12 14:17:15 UTC
Another comment: UTF-8 is completely valid ISO-8859-1 as well, so you can store
it without changing anything in MySQL.  Of course, you'd want to convert the
existing ISO-8859-1 data to UTF-8 first; after that, adding
"AddDefaultCharset UTF-8" to the Apache configuration should be sufficient.

The only problem with this would be that collation would not be perfect (eg.
strings such as "éto" and "eto" might get sorted way apart), and case
insensitive features might fail on non-ASCII, but that's not too big of a
problem I think.
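
A rough sketch of both points, assuming a column that treats every byte as one ISO-8859-1 character and sorts bytewise. That is a simplification of whatever collation the database really applies, and the word list is made up.

```python
words = ["eto", "zebra", "éto"]

# Simulate an ISO-8859-1 column: UTF-8 bytes go in, the same bytes come out.
stored = [w.encode("utf-8").decode("iso-8859-1") for w in words]
restored = [s.encode("iso-8859-1").decode("utf-8") for s in stored]
assert restored == words  # the round trip is lossless

# Sorting what the database sees puts "éto" after "zebra", because the
# first byte of "é" (0xC3) is larger than any ASCII letter.
by_db = [s.encode("iso-8859-1").decode("utf-8") for s in sorted(stored)]
print(by_db)  # ['eto', 'zebra', 'éto']
```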

FWIW, I believe MySQL 4.0 supports UTF-8 as well.
Comment 8 Ross Golder 2005-02-13 05:41:14 UTC
Danilo, it looks like UTF-8 support in MySQL is currently just experimental
support in the 4.1 development series. If it was working in 4.0 that would have
been nice :)
Comment 9 Ross Golder 2005-02-13 06:09:50 UTC
My mistake, looks like 4.1 isn't a development version, it's the latest
recommended version.
Comment 10 Luis Villa 2005-02-19 18:35:41 UTC
I can't find the link offhand, but I believe the next bugzilla release will
require 4.1, so if the admins want to think about upgrading at some point... :)
Comment 11 Christian Rose 2005-02-19 23:27:21 UTC
RHEL 4 seems to provide MySQL 4.1.x, so that version of MySQL is likely to get
installed when and if the servers get upgraded to RHEL 4.
Comment 12 Ross Golder 2005-02-20 07:30:43 UTC
I've upgraded mysql on to v4.1, if that helps. GNOME bugzilla
is currently working from a v3.23.28 mysql on, but you could
always dump a snapshot onto button, set up a test instance of bugzilla working
from that, and well, progress this etc :)
Comment 13 Markus Bertheau 2005-03-21 19:48:04 UTC
UTF-8 support from the database is probably not needed. It's needed if the
database should case-fold or count the characters in a string. I'm not sure that
that's needed in bugzilla.
Any database can store UTF-8, including MySQL 3.
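
To illustrate the two operations mentioned above, here is a small sketch comparing a byte-oriented view (what a database without UTF-8 support would have) with a character-aware one. The sample string is arbitrary.

```python
name = "Perelló"
raw = name.encode("utf-8")

# Character count versus byte count: "ó" takes two bytes in UTF-8.
print(len(name), len(raw))  # 7 8

# Byte-oriented case folding only touches ASCII letters.
assert raw.upper().decode("utf-8") == "PERELLó"
# Character-aware case folding handles the accented letter too.
assert name.upper() == "PERELLÓ"
```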
Comment 14 Markus Bertheau 2005-03-21 20:40:40 UTC
supports UTF-8" means.
Comment 15 James Henstridge 2005-03-24 02:45:24 UTC
To ensure that clients send back UTF-8, any form where the user can enter text
should be modified to include the accept-charset attribute.  For example:

  <form name="changeform" method="post" action="process_bug.cgi" accept-charset="UTF-8">

Without this, if a user manually sets the charset on a bug page (eg. to
correctly display a comment entered before bugzilla started serving its pages as
UTF-8), they may submit data in that charset.

This attribute is documented here:
It is supported by Mozilla and to a lesser extent, by IE (it will respect the
attribute provided its value is "UTF-8").
Comment 16 Olav Vitters 2005-03-25 22:03:04 UTC
Changed the script from to
run on b.g.o. Results:

                        total rows  non-ascii  non-utf8
attachments.description:    35229        35        15
attachments.filename:       35229        26        20
attachments.mimetype:       35229         0         0
bugs.bug_file_loc:         136974         0         0
bugs.short_desc:           136974       746       211
bugs.status_whiteboard:    136974         0         0
longdescs.thetext:         533637     28916      2802
(column name missing):       2038         4         2
namedqueries.query:          2038         0         0
profiles.realname:          52361      1126      1062

Quoting a comment from that bug:
"Nice work :-) We need to remember, of course, that just because something
decodes as UTF-8 doesn't necessarily mean that it is."
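
The script itself is not shown here. As a rough idea of how such per-column counts could be produced, the following sketch classifies each stored value by whether it decodes as ASCII, as UTF-8, or as neither. The function names and sample data are hypothetical.

```python
from collections import Counter


def classify(value: bytes) -> str:
    """Return 'ascii', 'utf8', or 'non-utf8' for one stored value."""
    try:
        value.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        value.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        return "non-utf8"


def summarize(values):
    """Return (total rows, non-ascii rows, non-utf8 rows)."""
    counts = Counter(classify(v) for v in values)
    total = sum(counts.values())
    return total, total - counts["ascii"], counts["non-utf8"]


sample = [b"plain", "Rosé".encode("utf-8"), "Rosé".encode("iso-8859-1")]
print(summarize(sample))  # (3, 2, 1)
```

The caveat quoted above applies: a value such as b"\xc3\xa9" passes the UTF-8 check even if its author meant the two ISO-8859-1 characters "Ã©", so the non-utf8 column is a lower bound.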

People can always update their realname. Actually I'm only concerned with
comments. I want to change the script for only open bugs and see what that gives us.
Comment 17 Olav Vitters 2005-04-10 08:06:52 UTC
FYI, it seems Apache has been configured to specify UTF-8 for b.g.o. This wasn't
done in Bugzilla. This causes problems as old comments can be in any charset. Or
so I thought. As no one has complained yet, other than to request that the charset
also be set in emails (bug 300051), I've set it for the emails too.

Comment 18 Olav Vitters 2005-04-18 19:50:57 UTC
Created attachment 45412
Assume bug-buddy is utf-8

Bug-buddy emails should be UTF-8 (except for possible gdb output), but b.g.o
didn't store them as UTF-8.

In the strings are assumed to be non-UTF-8. When storing
this into the database, the strings are again reencoded as UTF-8. UTF-8 again
encoded as UTF-8 will look strange. The problem is explained here:
(unfortunately I didn't know about that page until after days of trying to
figure this out I finally found the cause)
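
The double encoding can be reproduced in a few lines. This is a Python stand-in for the Perl behaviour described above, assuming the already-encoded bytes are mistaken for ISO-8859-1 before being encoded again.

```python
text = "é"
once = text.encode("utf-8")  # b'\xc3\xa9'

# The already-encoded bytes are taken for ISO-8859-1 and encoded again.
twice = once.decode("iso-8859-1").encode("utf-8")  # b'\xc3\x83\xc2\xa9'
print(twice.decode("utf-8"))  # Ã©

# Reversing the extra step recovers the original character.
assert twice.decode("utf-8").encode("iso-8859-1").decode("utf-8") == text
```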

The patch 'informs' bug-buddy that the data is UTF-8. Non-well-formed UTF-8 data
is stored as \xHH (HH is the hex representation of the octet that could not be
decoded to utf8). See 'perlqq mode' at:
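
For readers unfamiliar with that mode, Python's "backslashreplace" error handler gives a similar result when decoding and can serve as an illustration. It is not the patch's Perl code, and Python writes the hex digits in lowercase.

```python
# Well-formed UTF-8 ("café") followed by ISO-8859-1 bytes for "été".
raw = b"caf\xc3\xa9 \xe9t\xe9"

# Bytes that cannot be decoded as UTF-8 are kept visible as \xHH escapes.
decoded = raw.decode("utf-8", errors="backslashreplace")
print(decoded)  # café \xe9t\xe9
```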

Unfortunately this still caused the email transmitted by to
be malformed (bad subject and body). Followups were fine. A local installation
with newer perl, perl-CGI, etc was ok. Cause was the .UTF-8 in
LANG="en_US.UTF-8" (although that works fine locally). Patch also modifies the
LANG environment to work around that.

As bug-buddy gets an XML I currently believe that Halloween perhaps could be
changed to set the encoding/charset of the XML to UTF-8, but that is way more
difficult to test, and current patch works.

After some more testing I'll commit this.
Comment 19 Olav Vitters 2005-04-19 17:04:01 UTC
Committed the patch.

I've looked at about 1700 bug-buddy emails. 10 of those weren't UTF-8 (including
headers). Some had non-UTF-8 in the From:, some non-UTF-8 in the gdb data
(contents of a variable -- patch actually makes the result clearer) and one copy
pasted a few bytes of non-UTF-8 data in the report. Result will be a big
improvement (non-UTF-8 is still visible).
Comment 20 Olav Vitters 2005-04-24 19:51:32 UTC
Buglist didn't specify charset=utf-8 when doing server push (the 'Please Stand
by' page seen under Mozilla/Firefox/... browsers). Added that. Searches should
now display UTF-8 characters correctly.
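
As a sketch of what declaring the charset on a pushed page could look like, assuming a multipart response in which every part carries its own Content-Type header. The function name and boundary string are hypothetical and this is not the Bugzilla code.

```python
def push_part(html: str, boundary: str = "example-boundary") -> bytes:
    """Build one part of a server-push response, declaring UTF-8."""
    header = (
        f"--{boundary}\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        "\r\n"
    ).encode("ascii")
    # The body is encoded with the same charset the header declares.
    return header + html.encode("utf-8") + b"\r\n"


part = push_part("<p>Perelló</p>")
assert b"charset=utf-8" in part
```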
Comment 21 Olav Vitters 2005-12-30 20:50:03 UTC
Decided current situation is ok (pretend everything is UTF-8, even if some data is not).