Bug 96376 – Bugzilla doesn't grok UTF-8




Summary:	Bugzilla doesn't grok UTF-8


Status:	RESOLVED FIXED

Product:	bugzilla.gnome.org
Classification:	Infrastructure
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Bugzilla Maintainers
QA Contact:	Bugzilla Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2002-10-21 06:56 UTC by Christian Rose
Modified:	2005-12-30 20:50 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Assume bug-buddy is utf-8 (1.60 KB, patch) 2005-04-18 19:50 UTC, Olav Vitters	committed

Description Christian Rose 2002-10-21 06:56:36 UTC

Bugzilla doesn't grok UTF-8. UTF-8 characters entered into bug reports turn
into garbage. This may be problematic for reports involving i18n and l10n,
and especially reports in the l10n component.

Comment 1 Luis Villa 2002-12-03 19:50:09 UTC

This is not a problem that we're likely going to be able to solve
here. I'd suggest filing a bug upstream (bugzilla.mozilla.org) with
more concrete examples of the problems, Christian.

Comment 2 Christian Rose 2002-12-03 22:59:07 UTC

I suspect this problem isn't so much about the bugzilla code itself as
it is about the database backend. If it's MySQL, MySQL by
default stores everything as iso-8859-1. Unfortunately, I don't know if
there are any free DBMSs that currently allow text storage in UTF-8.

Although, to some extent it really is a bugzilla problem, since
bugzilla doesn't specify the character set used on the pages (and
hence not for the input in forms) and hence everything gets treated as
iso-8859-1 (which is the default for HTML 4 unless otherwise
specified). So if bugzilla could be modified to specify the character
set, this problem would be solved to some extent.
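
The garbage described above can be reproduced outside Bugzilla. The following is a minimal Python sketch, not Bugzilla code; the function name `misread_as_latin1` is hypothetical. It assumes the browser sent UTF-8 bytes and the receiving side interpreted them as iso-8859-1 because no character set was declared.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1.

    This mimics input that arrives as UTF-8 but is treated as
    ISO-8859-1 because the page never declared a character set.
    """
    return text.encode("utf-8").decode("iso-8859-1")


# "ö" (U+00F6) is two bytes in UTF-8 (0xC3 0xB6). Read as ISO-8859-1,
# each byte becomes its own character, so one letter turns into two.
print(misread_as_latin1("Göteborg"))
```

Plain ASCII text passes through unchanged, which is why the problem only shows up in reports containing non-ASCII characters.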

In fact, when I went searching now in Mozilla bugzilla, this is what
http://bugzilla.mozilla.org/show_bug.cgi?id=126266 is all about.

Comment 3 Carlos Perelló Marín 2002-12-03 23:09:30 UTC

This is not directly related to this bug, but I think you may find this
information useful.

PostgreSQL has allowed you to store all information as UTF-8 for a long time.

Cheers.

Comment 4 Luis Villa 2002-12-04 03:18:31 UTC

Another great reason to upgrade to postgres whenever someone ports
bugzilla. :)

Comment 5 Luis Villa 2004-03-26 22:50:25 UTC

Upstream bugs http://bugzilla.mozilla.org/show_bug.cgi?id=44343 and
http://bugzilla.mozilla.org/show_bug.cgi?id=126266 seem relevant, FWIW. Neither
seems optimistic about fixes in the short term, though.

Comment 6 Danilo Segan 2005-02-12 14:17:15 UTC

Another comment: any UTF-8 byte sequence is also a valid ISO-8859-1 byte
sequence, so you can store it in MySQL without changing anything.  Of course,
you'd want to convert the existing ISO-8859-1 data to UTF-8 first; after that,
adding "AddDefaultCharset UTF-8" to the Apache configuration should be sufficient.

The only problem with this would be that collation would not be perfect (e.g.
strings such as "éto" and "eto" might get sorted far apart), and case
insensitive features might fail on non-ASCII, but that's not too big of a
problem I think.
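
A rough Python sketch of both points, under the assumption that the database simply keeps the bytes it is given and compares them byte by byte. All helper names are hypothetical, and this does not model MySQL's actual collations.

```python
def store_as_latin1(utf8_bytes: bytes) -> str:
    # What an ISO-8859-1 column "sees": exactly one character per byte.
    return utf8_bytes.decode("iso-8859-1")


def load_from_latin1(stored: str) -> bytes:
    # Reading the column back returns the original bytes unchanged.
    return stored.encode("iso-8859-1")


def bytewise_sorted(words):
    # Sort by raw UTF-8 bytes, as a charset-unaware backend would.
    return sorted(words, key=lambda w: w.encode("utf-8"))


def ascii_only_upper(text: str) -> str:
    # bytes.upper() only changes ASCII letters; non-ASCII bytes are untouched.
    return text.encode("utf-8").upper().decode("utf-8")


# The round trip is lossless, so storage itself is not the problem.
roundtrip = load_from_latin1(store_as_latin1("éto".encode("utf-8")))
assert roundtrip.decode("utf-8") == "éto"

# "éto" starts with byte 0xC3, so it sorts after every ASCII word.
print(bytewise_sorted(["éto", "zebra", "eto"]))

# Case conversion skips the non-ASCII letter.
print(ascii_only_upper("éto"))
```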

FWIW, I believe MySQL 4.0 supports UTF-8 as well.

Comment 7 Ross Golder 2005-02-13 05:41:14 UTC

Danilo, it looks like UTF-8 support in MySQL is currently just experimental
support in the 4.1 development series. If it was working in 4.0 that would have
been nice :)

Comment 8 Ross Golder 2005-02-13 06:09:50 UTC

My mistake, looks like 4.1 isn't a development version, it's the latest
recommended version.

http://dev.mysql.com/downloads/

Comment 9 Luis Villa 2005-02-19 18:35:41 UTC

I can't find the link offhand, but I believe the next bugzilla release will
require 4.1, so if the admins want to think about upgrading at some point... :)

Comment 10 Christian Rose 2005-02-19 23:27:21 UTC

RHEL 4 seems to provide MySQL 4.1.x, so that version of MySQL is likely to get
installed when and if the servers get upgraded to RHEL 4.
http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/release-notes/as-x86/#id3465361

Comment 11 Ross Golder 2005-02-20 07:30:43 UTC

I've upgraded mysql on button.gnome.org to v4.1, if that helps. GNOME bugzilla
is currently working from a v3.23.28 mysql on window.gnome.org, but you could
always dump a snapshot onto button, set up a test instance of bugzilla working
from that, and well, progress this etc :)

Comment 12 Markus Bertheau 2005-03-21 19:48:04 UTC

UTF-8 support from the database is probably not needed. It's only needed if the
database has to case-fold or count the characters in a string. I'm not sure
that's needed in bugzilla.
Any database can store UTF-8, including MySQL 3.
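
To illustrate the character-counting case, here is a small Python sketch (the function name is hypothetical). A backend that is unaware of UTF-8 can only count bytes, which differs from the character count as soon as non-ASCII text is involved.

```python
def char_and_byte_lengths(text: str) -> tuple:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# "ó" takes two bytes in UTF-8, so the two counts differ by one.
print(char_and_byte_lengths("Perelló"))
```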

Comment 13 Markus Bertheau 2005-03-21 20:40:40 UTC

https://bugzilla.mozilla.org/show_bug.cgi?id=126266#c51 details what "database
supports UTF-8" means.

Comment 14 James Henstridge 2005-03-24 02:45:24 UTC

To ensure that clients send back UTF-8, any form where the user can enter text
should be modified to include the accept-charset attribute.  For example:

  <form name="changeform" method="post" action="process_bug.cgi"
        accept-charset="UTF-8">

Without this, if a user manually sets the charset on a bug page (eg. to
correctly display a comment entered before bugzilla started serving its pages as
UTF-8), they may submit data in that charset.

This attribute is documented here:
  http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.3
It is supported by Mozilla and to a lesser extent, by IE (it will respect the
attribute provided its value is "UTF-8").
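
As a sketch of what goes wrong on the receiving side when a form is submitted in another charset, the Python below decodes a URL-encoded form body strictly as UTF-8. It is only an illustration: Bugzilla itself is written in Perl, and the field name `comment` and the function name `is_utf8_submission` are made up for the example.

```python
from urllib.parse import parse_qs, quote


def is_utf8_submission(body: str) -> bool:
    """Return True if every field in a URL-encoded body decodes as UTF-8."""
    try:
        # errors="strict" raises instead of silently substituting characters.
        parse_qs(body, encoding="utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The same comment, as a browser would send it in two different charsets.
utf8_body = "comment=" + quote("Perelló", encoding="utf-8")
latin1_body = "comment=" + quote("Perelló", encoding="iso-8859-1")

print(is_utf8_submission(utf8_body))    # submitted as UTF-8
print(is_utf8_submission(latin1_body))  # submitted as ISO-8859-1
```

With accept-charset="UTF-8" on the form, the second case should not arise in browsers that respect the attribute.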

Comment 15 Olav Vitters 2005-03-25 22:03:04 UTC

Changed the script from https://bugzilla.mozilla.org/show_bug.cgi?id=280633 to
run on b.g.o. Results:

                        total rows  non-ascii  non-utf8
attachments.description:    35229        35        15
attachments.filename:       35229        26        20
attachments.mimetype:       35229         0         0
bugs.bug_file_loc:         136974         0         0
bugs.short_desc:           136974       746       211
bugs.status_whiteboard:    136974         0         0
longdescs.thetext:         533637     28916      2802
namedqueries.name:           2038         4         2
namedqueries.query:          2038         0         0
profiles.realname:          52361      1126      1062

Quoting a comment from that bug:
"Nice work :-) We need to remember, of course, that just because something
decodes as UTF-8 doesn't necessarily mean that it is."

People can always update their realname. Actually I'm only concerned with
comments. I want to change the script to look only at open bugs and see what
that gives us.
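
For readers who want to see what such a count involves, here is a minimal Python sketch. It is not the script from the Mozilla bug; the function names and the sample values are invented for illustration. Each stored value is classified by trying to decode it as ASCII and then as UTF-8.

```python
from collections import Counter


def classify(raw: bytes) -> str:
    """Return 'ascii', 'utf8' or 'non-utf8' for one stored value."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        return "non-utf8"


def tally(values):
    """Count rows the way the table above does for one column."""
    values = list(values)
    counts = Counter(classify(v) for v in values)
    return {
        "total rows": len(values),
        "non-ascii": counts["utf8"] + counts["non-utf8"],
        "non-utf8": counts["non-utf8"],
    }


sample = [b"plain", "é".encode("utf-8"), "é".encode("iso-8859-1")]
print(tally(sample))

# The caveat from the quoted comment: ISO-8859-1 text "Ã©" has the same
# bytes as UTF-8 "é", so it is counted as UTF-8 even if it was not meant as such.
print(classify("Ã©".encode("iso-8859-1")))
```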

Comment 16 Olav Vitters 2005-04-10 08:06:52 UTC

FYI, it seems Apache has been configured to specify UTF-8 for b.g.o. This wasn't
done in Bugzilla. This causes problems as old comments can be in any charset. Or
so I thought. As no one has complained yet, other than requesting that the
charset also be set in emails (bug 300051), I've set that for the emails.

Comment 17 Olav Vitters 2005-04-18 19:50:57 UTC

Created attachment 45412
Assume bug-buddy is utf-8

Bug-buddy emails should be UTF-8 (except for possible gdb output), but b.g.o
didn't store them as UTF-8.

In bug-buddy-import.pl the strings are assumed to be non-UTF-8. When they are
stored in the database, the strings are encoded as UTF-8 again. UTF-8 encoded
as UTF-8 a second time will look strange. The problem is explained here:
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
(unfortunately I didn't know about that page until I finally found the cause
after days of trying to figure this out)
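
The double-encoding effect can be sketched in Python. This is an analogy rather than the Perl code in bug-buddy-import.pl, the function names are hypothetical, and it assumes the wrongly assumed source charset was ISO-8859-1.

```python
def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread the bytes as ISO-8859-1, encode again."""
    once = text.encode("utf-8")
    # Wrong assumption: each byte is treated as one ISO-8859-1 character.
    as_chars = once.decode("iso-8859-1")
    return as_chars.encode("utf-8")


def undo_double_encode(garbled: bytes) -> bytes:
    """Reverse one round of double encoding, returning the original UTF-8."""
    return garbled.decode("utf-8").encode("iso-8859-1")


twice = double_encode("é")
print(twice)                  # four bytes instead of two
print(twice.decode("utf-8"))  # what a UTF-8 page would display
print(undo_double_encode(twice).decode("utf-8"))
```

The repair step only works while every character of the garbled text still fits in ISO-8859-1; it is shown here to make the mechanism clear, not as a suggested fix for the database.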

The patch 'informs' bug-buddy that the data is UTF-8. Malformed UTF-8 data
is stored as \xHH (HH is the hex representation of the octet that could not be
decoded to utf8). See 'perlqq mode' at:
http://search.cpan.org/~jhi/perl-5.8.1/ext/Encode/Encode.pm#Handling_Malformed_Data
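
A rough Python analogue of this behaviour, for illustration only: the patch itself uses Perl's Encode module, and the function name below is made up. Python's `backslashreplace` error handler does something similar when decoding, except that it writes the hex digits in lowercase.

```python
def decode_keeping_bad_octets(raw: bytes) -> str:
    """Decode as UTF-8; octets that cannot be decoded become \\xhh text."""
    return raw.decode("utf-8", errors="backslashreplace")


# b"caf\xc3\xa9" is valid UTF-8 for "café"; the trailing 0xE9 is a lone
# ISO-8859-1 "é" and cannot be decoded, so it is kept as visible \xe9.
print(decode_keeping_bad_octets(b"caf\xc3\xa9 \xe9"))
```

The undecodable octet stays visible in the stored text instead of being dropped or replaced by a placeholder character.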


Unfortunately this still caused the email transmitted by bug-buddy-import.pl to
be malformed (bad subject and body). Follow-ups were fine. A local installation
with newer perl, perl-CGI, etc. was ok. The cause was the .UTF-8 in
LANG="en_US.UTF-8" (although that works fine locally). The patch also modifies
the LANG environment to work around that.

As bug-buddy gets an XML I currently believe that Halloween perhaps could be
changed to set the encoding/charset of the XML to UTF-8, but that is way more
difficult to test, and current patch works.

After some more testing I'll commit this.

Comment 18 Olav Vitters 2005-04-19 17:04:01 UTC

Committed the patch.

I've looked at about 1700 bug-buddy emails. 10 of those weren't UTF-8 (including
headers). Some had non-UTF-8 in the From:, some had non-UTF-8 in the gdb data
(contents of a variable -- the patch actually makes the result clearer), and in
one a few bytes of non-UTF-8 data had been copy-pasted into the report. The
result will be a big improvement (non-UTF-8 is still visible).

Comment 19 Olav Vitters 2005-04-24 19:51:32 UTC

Buglist didn't specify charset=utf-8 when doing server push (the 'Please Stand
by' page seen under Mozilla/Firefox/... browsers). Added that. Searches should
now display UTF-8 characters correctly.

Comment 20 Olav Vitters 2005-12-30 20:50:03 UTC

Decided current situation is ok (pretend everything is UTF-8, even if some data is not).