Bug 97856 – need to support more legacy charsets

Summary:	need to support more legacy charsets


Status:	RESOLVED FIXED

Product:	rhythmbox
Classification:	Other
Component:	Monkey Media
Version:	HEAD
Hardware:	Other Linux

Importance:	High major
Target Milestone:	0.6.0
Assigned To:	RhythmBox Maintainers
QA Contact:	RhythmBox Maintainers

Duplicates:	120939 129298 131386 157270

Reported:	2002-11-06 16:07 UTC by Young-Ho Cha
Modified:	2005-08-31 15:43 UTC

Attachments
first row is euc-kr id3tag and second row is utf8 id3tag, both are id3 v1 (28.45 KB, image/png) 2003-08-04 03:23 UTC, Young-Ho Cha
local id3 charset list from gconf (1.87 KB, patch) 2003-08-20 13:57 UTC, Young-Ho Cha	needs-work
legacy id3v1 support fix. (1.23 KB, patch) 2003-11-24 12:55 UTC, Young-Ho Cha	needs-work
some cleanup. (1.62 KB, patch) 2003-11-24 14:52 UTC, Young-Ho Cha	needs-work
new workaround for legacy id3 tag. (1.91 KB, patch) 2003-11-25 06:07 UTC, Young-Ho Cha	none

Description Young-Ho Cha 2002-11-06 16:07:52 UTC

Many ID3 tags still use a local (legacy) charset. (Ogg files use UTF-8, so
their tags are displayed properly.)

Some workaround is needed.

See http://bugzilla.gnome.org/show_bug.cgi?id=80037

Comment 1 Young-Ho Cha 2003-08-04 03:21:52 UTC

I built and installed the current CVS tree, but MP3 ID3 tags are
still displayed broken
(both local-charset and UTF-8 tags).

Comment 2 Young-Ho Cha 2003-08-04 03:23:55 UTC

Created attachment 18887
first row is euc-kr id3tag and second row is utf8 id3tag, both are id3 v1

Comment 3 Colin Walters 2003-08-05 04:06:47 UTC

This is fairly difficult to fix.  It's best to convert all your tags
to id3v2, which specifies a 16-bit Unicode encoding (UCS-2), iirc.

We'll try to come up with some workaround eventually though.

Comment 4 Young-Ho Cha 2003-08-05 12:24:27 UTC

Hmm. I heard id3_ucs4_utf8duplicate has some problems.

liteamp 0.2.3.2 (http://kldp.net/project/showfiles.php?group_id=109,
a small mp3/ogg player for GNOME 2) first duplicates the ID3 tag with
id3_ucs4_latin1duplicate(), then checks it with g_utf8_validate(latin1),
and otherwise converts it with g_locale_convert(latin1) or with
g_convert(latin1) using a codeset defined by the user.

Comment 5 Young-Ho Cha 2003-08-05 12:30:06 UTC

liteamp project page: http://kldp.net/projects/liteamp (Korean page)
    download url: http://download.kldp.net/liteamp/liteamp-0.2.3.2.tar.gz
    Gentoo portage and FreeBSD port available. ;)

Comment 6 Colin Walters 2003-08-17 18:21:03 UTC

The latest Rhythmbox CVS now attempts to convert from the user's
locale encoding if the tag appears to be invalid.

Comment 7 Young-Ho Cha 2003-08-20 13:57:13 UTC

Created attachment 19382
local id3 charset list from gconf

Comment 8 Young-Ho Cha 2003-08-20 14:03:44 UTC

But g_locale_to_utf8 has a problem on systems with a UTF-8 locale.

On such a system, nl_langinfo(CODESET) always returns 'UTF-8',
so ID3 tags in a local charset are converted incorrectly.

So I made a workaround.

With this patch, ID3 tags are converted using a user-defined charset
stored in a gconf key.

Comment 9 Colin Walters 2003-08-21 18:02:49 UTC

Ug.  This patch kind of breaks the internal Rhythmbox dependencies,
where monkey-media is at the bottom.

I think the "right" way to fix this is to use some language/encoding
autodetection framework, and possibly prompt the user.  

Of course the *really* right way is to take the people who write
programs which put the locale charset in mp3s and repeatedly hit them
over the head with the Unicode 4.0 code charts until they fix their
programs.

Comment 10 Colin Walters 2003-08-31 07:38:58 UTC

*** Bug 120939 has been marked as a duplicate of this bug. ***

Comment 11 Sergey Kuleshov 2003-08-31 07:57:42 UTC

I think that prompting the user for the correct encoding of his tags
is probably the easiest, yet most effective, way to solve the issue.

As for the suggestion to "take the people who write
programs which put the locale charset in mp3s and repeatedly hit them
over the head with the Unicode 4.0 code charts", I would say it is not
the best solution. Many international users do not use Unicode (UTF in
particular) because quite a few apps still don't handle it properly.

Comment 12 Eungkyu Song 2003-11-16 06:53:00 UTC

Please fix this bug quickly.
I'm rebuilding rhythmbox every time with Cha's patch.
This bug is fairly important for non-latin people.
We can't see any id3v1-tagged-song's title, album or artist.

Comment 13 Bastien Nocera 2003-11-16 12:18:50 UTC

First of all, rhythmbox doesn't depend on eel, so you'll have to
replace the eel convenience functions.

This change:
-		utf8 = id3_ucs4_utf8duplicate (ucs4);
+		utf8 = id3_ucs4_latin1duplicate (ucs4);
breaks properly tagged files in UTF-8.

Couldn't the legacy codeset be detected depending on the locale?

If you could also provide:
- locale used
- legacy codeset used
and an example file reproducing the problem.
A screenshot showing both the broken and right displays would be
appreciated.

Comment 14 Colin Walters 2003-11-16 17:04:51 UTC

I just stole the code for legacy charset support from
libgnome-desktop, hacked it up a bit, and used it for monkey-media.

It's committed to the latest arch/cvs.  Let me know if it works for you!

Comment 15 Alexander Kirillov 2003-11-17 14:17:08 UTC

Is it possible to allow the user to select the encoding to be used for
tags? E.g., for Russian there are at least 3 commonly used encodings
(windows-1251, koi8-r, utf8) - and most mp3 around use windows-1251
for tags, while I use utf8 as locale encoding on my machine. Of
course, I can convert tags, but if I just want to take a quick look at
a song I found, it'd be so much easier to just select the encoding in
a preferences dialog than fire up a separate tag conversion program....

Comment 16 jensflorian 2003-11-17 16:35:02 UTC

Why not let the user select the proper Encodings in Preferences and
then use UTF-8 as fallback?

Comment 17 Colin Walters 2003-11-18 07:55:02 UTC

Alexander: My patch could be extended to have a GList of legacy
charsets associated with each locale.  I don't personally have enough
expertise about the character sets involved and which would be best to
try first to do the modifications myself.

But the most recent changes should fix a majority of cases.  This is
an issue that can never be fixed entirely.  If you don't know what
charset some data is in, you can never be sure what you've retrieved
from it is valid, even if it doesn't violate the character set definition.

The real solution here is to use ID3V2 which specifies the charset.

I'll leave this bug open for now as a reminder.
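
The per-locale list of legacy charsets suggested above could be sketched roughly as follows. This is only an illustration: it uses a static array instead of a GList, the names legacy_table and legacy_charsets_for_locale are hypothetical, and the two entries (and their order) are taken from the encodings mentioned in this bug, not from any vetted list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One table row: a language code and the legacy charsets to try for it. */
struct legacy_charsets {
    const char *lang;          /* language part of the locale name */
    const char *candidates[4]; /* NULL-terminated, first guess first */
};

/* Illustrative entries only, based on the charsets named in this bug. */
static const struct legacy_charsets legacy_table[] = {
    { "ru", { "WINDOWS-1251", "KOI8-R", NULL } },
    { "ko", { "EUC-KR", NULL } },
};

/*
 * Return the NULL-terminated candidate list for a locale name such as
 * "ru_RU.UTF-8", or NULL if the language has no entry in the table.
 */
const char *const *legacy_charsets_for_locale(const char *locale)
{
    assert(locale != NULL);
    for (size_t i = 0; i < sizeof legacy_table / sizeof legacy_table[0]; i++) {
        size_t n = strlen(legacy_table[i].lang);
        /* Match "ru", "ru_RU", "ru.KOI8-R" or "ru@x", but not "rue". */
        if (strncmp(locale, legacy_table[i].lang, n) == 0 &&
            (locale[n] == '\0' || locale[n] == '_' ||
             locale[n] == '.' || locale[n] == '@'))
            return legacy_table[i].candidates;
    }
    return NULL;
}
```

A caller would try each candidate in turn and keep the first conversion that succeeds. As noted above, a successful conversion still does not prove that the guess was right.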

Comment 18 Young-Ho Cha 2003-11-24 12:54:23 UTC

There are some problems in the id3v1 legacy charset support.

First, id3_ucs4_utf8duplicate is not valid for this (as I said before).

id3_ucs4_utf8duplicate always returns a valid UTF-8 string (which is not
what we want), because it builds the UTF-8 string by slicing the input
one byte at a time, even if the original string is invalid UTF-8.

Also, in get_encoding_from_locale the hash table is set to NULL, so a
valid encoding can never be found in it.

And when getting the encoding from the locale, the "UTF-8" locale must be
skipped, because the string has already been checked as UTF-8 in
MP3_stream_info_impl_id3_tag_get_utf8.

I'll submit a patch.
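
A minimal sketch of why the result is always valid UTF-8, assuming the tag bytes are treated as ISO-8859-1 characters before being re-encoded. latin1_to_utf8 is a hypothetical helper written for this illustration, not a libid3tag function. Every input byte becomes either one ASCII byte or a well-formed two-byte sequence, so a UTF-8 validity check on the output can never fail, whatever charset the tag really used.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Re-encode a buffer as UTF-8, treating every input byte as one
 * ISO-8859-1 character. `out` must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            /* ASCII is copied unchanged. */
            out[n++] = in[i];
        } else {
            /* 0x80..0xFF: lead byte 0xC2 or 0xC3, then one continuation. */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

For the four bytes 0xBF 0xB5 0xBF 0xF8 this produces C2 BF C2 B5 C2 BF C3 B8, which is well-formed UTF-8 for the text "¿µ¿ø" and clearly not what the tag author wrote.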

Comment 19 Young-Ho Cha 2003-11-24 12:55:47 UTC

Created attachment 21757
legacy id3v1 support fix.

Comment 20 Young-Ho Cha 2003-11-24 14:52:18 UTC

Created attachment 21762
some cleanup.

Comment 21 Young-Ho Cha 2003-11-24 16:42:05 UTC

Ah.. I was mistaken.

This problem happens only when id3_field_textencoding ==
ID3_FIELD_TEXTENCODING_ISO_8859_1 (because many ID3 editors
set the text encoding field to ISO-8859-1 only).

So I think this problem can be solved with:
 if (id3_field_textencoding == ID3_FIELD_TEXTENCODING_ISO_8859_1 ||
id3_field_textencoding == NULL )
  call id3_ucs4_latin1duplicate
 else
  call id3_ucs4_utf8duplicate

But there is no id3_field_gettextencoding function.

How can we implement that?

Comment 22 Young-Ho Cha 2003-11-25 06:06:31 UTC

I made a new patch as a workaround.
It first tries to convert with the legacy charset, and only then falls
back to id3_ucs4_utf8duplicate.

It works for most mp3 files (I tested files tagged with id3v2,
iTunes, Winamp, etc.).

Comment 23 Young-Ho Cha 2003-11-25 06:07:25 UTC

Created attachment 21787
new workaround for legacy id3 tag.

Comment 24 Colin Walters 2003-11-25 06:54:05 UTC

For anyone else following this bug, here's an abbreviated IRC log:

<walters>	ganadist: * committed walters@rhythmbox.org--2003b/rhythmbox--mainline--0.6--patch-343
<walters>	ganadist: that has some of those fixes, but not the latin1 thing yet.
<ganadist>	thanks :)
<ganadist>	i tried to get textencoding information from id3taglib, but i can't.
<walters>	ganadist: ok.
<walters>	ganadist: i think what's happening is that libid3 thinks the file is in latin1
<walters>	ganadist: so it converts it to utf8 for us
<walters>	ganadist: which succeeds, since you can interpret EUC-KR as latin1.
<walters>	ganadist: but obviously that's the wrong thing
<ganadist>	hmm.
<walters>	ganadist: i'm not sure how exactly to solve this.
<walters>	ganadist: since the returned string *is* valid UTF-8
<ganadist>	yes.
<walters>	ganadist: probably the only real way to fix it would be to do some language-specific analysis
<walters>	ganadist: if the song name is "¿µ¿ø", well then you figure you got the encoding wrong.
<walters>	the reason your solution works is that it does no conversion on the string
<ganadist>	most legacy id3tags are some :(
<walters>	which is wrong too
<ganadist>	yes. i know.
<walters>	ok
<walters>	maybe in the short term what we could do is look for lots of non-letters in a row
<walters>	if that's true, then we try the _latin1_duplicate
<walters>	ganadist: what do you think?
<ganadist>	i think that is good :)
<walters>	ganadist: do you want to try implementing that?
<ganadist>	i'll try.
<walters>	ganadist: btw, if you can figure out how to peek directly at the buffers without id3 doing any conversion, that would be helpful here i think
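
The "lots of non-letters in a row" idea from the log could look something like the sketch below. It is a guess at one possible reading, not the check that was eventually committed. The function names and the choice of the 0x80..0xBF range (which in ISO-8859-1 holds control codes and symbols, plus a few letter-like signs that are lumped in for simplicity) are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Length of the longest run of consecutive bytes in 0x80..0xBF.
 * Read as ISO-8859-1, that range holds C1 controls and symbols rather
 * than ordinary letters, so long runs are unusual in real Latin-1 text.
 */
size_t longest_symbol_run(const unsigned char *latin1, size_t len)
{
    size_t best = 0, run = 0;

    assert(latin1 != NULL);
    for (size_t i = 0; i < len; i++) {
        if (latin1[i] >= 0x80 && latin1[i] <= 0xBF) {
            if (++run > best)
                best = run;
        } else {
            run = 0;
        }
    }
    return best;
}

/* True if the text contains a symbol run of at least min_run bytes. */
bool looks_misdecoded(const unsigned char *latin1, size_t len, size_t min_run)
{
    return longest_symbol_run(latin1, len) >= min_run;
}
```

With min_run set to 3, the bytes behind "¿µ¿ø" (0xBF 0xB5 0xBF 0xF8) are flagged, while "café" in Latin-1 is not. The check is weak: double-byte text whose bytes mostly fall in 0xC0..0xFF would pass unnoticed, so it can only be a short-term hint.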

Comment 25 Young-Ho Cha 2003-11-25 16:30:13 UTC

I tried, but there is no way to validate a UCS-4 string.

So, unavoidably, we can only check against charsets that can be
validated, such as the legacy charset or UTF-8, and try the UCS-4
conversion afterwards.


PS:
id3_ucs4_latin1duplicate() just duplicates the ID3 tag frame (it does
no decoding), like this, so you can think of it as working like
id3_ucs_rawduplicate():

void id3_latin1_copy(id3_latin1_t *dest, id3_latin1_t const *src)
{
  while ((*dest++ = *src++))
    ;
}

Comment 26 Young-Ho Cha 2003-11-25 16:38:12 UTC

Sorry, that was my misunderstanding.

id3_length_t id3_latin1_encodechar(id3_latin1_t *latin1, id3_ucs4_t ucs4)
{
  *latin1 = ucs4;
  if (ucs4 > 0x000000ffL)
    *latin1 = ID3_UCS4_REPLACEMENTCHAR;
  return 1;
}

Comment 27 Colin Walters 2003-11-27 02:42:04 UTC

So basically libid3 gives us no way to access the raw tag data?

Comment 28 Young-Ho Cha 2003-11-28 15:39:08 UTC

No. We can access it with id3_ucs4_latin1duplicate() for legacy charsets.

id3_ucs4_utf8duplicate returns a valid string only when the ID3 tag is
encoded in UCS-4, but unfortunately ID3 tags can be in any (legacy)
charset, depending on the tagging program.

And id3_ucs4_latin1duplicate fails to return a valid string only when
the ID3 tag is encoded in UCS-4.

That is why I suggested the decoding order in my patch.

PS. UTF-8 validation is done twice: once before rb_unicodify and once
inside rb_unicodify.

Comment 29 Colin Walters 2003-12-14 13:46:05 UTC

*** Bug 129298 has been marked as a duplicate of this bug. ***

Comment 30 Colin Walters 2004-01-23 08:33:25 UTC

*** Bug 131386 has been marked as a duplicate of this bug. ***

Comment 31 Christophe Fergeau 2004-11-03 19:10:01 UTC

*** Bug 157270 has been marked as a duplicate of this bug. ***

Comment 32 Christophe Fergeau 2004-11-03 19:17:37 UTC

Quick update on that bug: rhythmbox no longer uses monkey-media, so most of the
patches that were discussed in that bug are no longer relevant. 
Rhythmbox now uses GStreamer for tag reading, GStreamer behaves in the following
way:
* for id3v2 tags, the charset is encoded in the tag, so GStreamer uses that
encoding (I think it can only be ISO-8859-1, UTF-8 or UTF-16). If you have id3v2
tags using a different encoding, then you'd better fix those tags.
* for id3v1 tags, the encoding is unspecified. GStreamer reads them in the
following way: it tries to read them as UTF-8, if it doesn't work, it tries to
do a current locale=>UTF-8 conversion, and if it still doesn't work, it does an
ISO-8859-1=>UTF-8 conversion. So if you have "badly" encoded id3v1 tags, what
you can do is to set your locale to be something like fr_FR.ISO-8859-15, import
your files and save your library, the encoding should be correct then... It
probably doesn't handle all the corner cases mentioned in this bug though, but
is still a good first step.

Can people being hit by these tag encoding issues comment whether it's enough
for them or not ?
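
The id3v1 fallback order described above can be sketched as follows. This is not GStreamer's code: decode_id3v1_text, the tag_source enum and the locale_to_utf8_fn callback are hypothetical names, the locale step is left to a caller-supplied function, and the UTF-8 check is only structural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum tag_source { TAG_WAS_UTF8, TAG_FROM_LOCALE, TAG_FROM_LATIN1 };

/* Hypothetical callback: convert from the locale charset into `out`
 * (NUL-terminated UTF-8) and return true, or return false on failure. */
typedef bool (*locale_to_utf8_fn)(const unsigned char *in, size_t len,
                                  unsigned char *out, size_t out_size);

/* Structural UTF-8 check: lead bytes and continuation counts only.
 * It does not reject overlong forms or surrogates. */
static bool utf8_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t extra;

        if (s[i] < 0x80) extra = 0;
        else if ((s[i] & 0xE0) == 0xC0) extra = 1;
        else if ((s[i] & 0xF0) == 0xE0) extra = 2;
        else if ((s[i] & 0xF8) == 0xF0) extra = 3;
        else return false;
        if (extra > len - i - 1)
            return false;                 /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += extra + 1;
    }
    return true;
}

/* Try UTF-8, then the locale charset, then ISO-8859-1.
 * out_size must be at least 2 * len + 1. */
enum tag_source decode_id3v1_text(const unsigned char *in, size_t len,
                                  unsigned char *out, size_t out_size,
                                  locale_to_utf8_fn from_locale)
{
    size_t n = 0;

    assert(in != NULL && out != NULL && out_size >= 2 * len + 1);

    if (utf8_well_formed(in, len)) {      /* step 1: already UTF-8 */
        memcpy(out, in, len);
        out[len] = '\0';
        return TAG_WAS_UTF8;
    }
    if (from_locale != NULL && from_locale(in, len, out, out_size))
        return TAG_FROM_LOCALE;           /* step 2: locale charset */

    for (size_t i = 0; i < len; i++) {    /* step 3: Latin-1, cannot fail */
        if (in[i] < 0x80) {
            out[n++] = in[i];
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[n] = '\0';
    return TAG_FROM_LATIN1;
}
```

Because the last step accepts any byte sequence, the order matters: whatever is not caught by the first two steps ends up decoded as ISO-8859-1, which is why changing the locale before importing changes the result.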

Comment 33 Alexander Kirillov 2004-11-04 19:36:29 UTC

No, not enough.
The problem is, even though for id3v2 tags, the charset *should be* encoded in
the tag, most taggers (EasyTag, grip, soundjuicer, not to mention a whole slew
of Windows encoders) do not set the encoding correctly - so the tag encoding bit
says that it is ISO-8859-1 even though it is actually utf8 or worse, koi8-r. It
is not a rhythmbox problem, but it would be good if rhythmbox provided a
workaround, allowing users to manually override the encoding, much as evolution
allows you to manually select encoding of an email you have received.  Yes, it
is ugly -- but better than having unreadable tags. 

If you know of a good program that can fix the tags by setting the encoding bit
correctly, please let me know (this is a FAQ!)

Comment 34 Christophe Fergeau 2004-11-04 23:52:22 UTC

Rhythmbox can't do much about that since this is all handled by gstreamer. I'd
rather have a separate "fix my tags" program than hacking that around in
rhythmbox imo. And obviously, the writers of tagging apps need to be hit with a
clue stick...

Comment 35 Feng Zhou 2004-12-12 02:55:14 UTC

This problem has been such a headache for me that I wrote a command line tool 
(in Java) to convert native encoded tags (v1 or v2) to UTF8 (v2). You can grab 
it at: http://www.cs.berkeley.edu/~zf/id3iconv/

Comment 36 Young-Ho Cha 2005-03-14 13:03:07 UTC

If the gstreamer backend is used, this is solved by bug #149274,
but with the xine backend it needs more work.

Comment 37 Christophe Fergeau 2005-03-14 13:08:55 UTC

The xine backend has more serious issues than that one ;) I'm closing this bug
then, since the xine backend isn't really supported these days.