Bug 642993 – Use culture-dependent code page rather than Latin1

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 642993 - Use culture-dependent code page rather than Latin1


Summary:	Use culture-dependent code page rather than Latin1


Status:	RESOLVED WONTFIX

Product:	taglib-sharp
Classification:	Other
Component:	General
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	taglib-sharp-maint
QA Contact:	taglib-sharp-maint

URL:
Whiteboard:

Duplicates:	645486 649435 (view as bug list)
Depends on:
Blocks:

Reported:	2011-02-22 20:00 UTC by Atsushi Eno
Modified:	2018-08-17 19:49 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Proposed fix (631 bytes, patch) 2011-02-22 20:00 UTC, Atsushi Eno	rejected	Details \| Review

Description Atsushi Eno 2011-02-22 20:00:20 UTC

Created attachment 181629 [details] [review]
Proposed fix

taglib-sharp (specifically ByteVector) has an incorrect assumption that old non-UTF8 tags are all in Latin1. They are actually undefined and most likely platform-language-specific encoding. For example, it is Shift_JIS in Japanese Windows. And this Latin1 assumption brings wrong results for my MP3s.

To use most-likely-correct encoding, I replaced Encoding.GetEncoding ("latin1") with Encoding.GetEncoding (CultureInfo.CurrentCulture.TextInfo.ANSICodePage) in the attached fix.

Comment 1 Alexander Kojevnikov 2011-02-24 14:37:42 UTC

Review of attachment 181629 [details] [review]:

That's what UseBrokenLatin1Behavior is for, set it to true to have this behaviour.

I'm against turning it on by default as your patch effectively does. Tags must specify the correct encoding if they need non-latin1 characters.

Comment 2 Alexander Kojevnikov 2011-02-24 14:40:37 UTC

Closing as WONTFIX, but feel free to re-open if I misunderstood the bug's intent.

Comment 3 Atsushi Eno 2011-02-24 15:46:00 UTC

Yes, I believe you're misunderstanding :) In the current ByteVector, use_broken_latin1 is used like this:

if (use_broken_latin1)
        return Encoding.Default;

try {
        return Encoding.GetEncoding ("latin1");
} catch (ArgumentException) {
        return Encoding.UTF8;
}

But Encoding.Default is *UTF8* on (modern) Linux. It does not solve the actual problem on Linux. Encoding.Default solution solves the problem only when you are on Windows.

The default encoding should be based on the user's culture. My approach to use ANSICodePage saves non-Latin1 people while Latin1 people are still safe.

The assumption that "Latin1 is the worldwide default" is wrong, as I explained above. Mono mcs had made exactly the same "the default C# source encoding must be in Latin1 by default" mistake (it is also Shift_JIS in Japanese on .NET).

(I don't know if/how I can reopen this bug; there is no selectable control shown for me.)

Comment 4 Gabriel Burt 2011-02-24 15:58:13 UTC

Shouldn't your patch also change the if (use_broken_latin1) return Encoding.Default; to return ANSICodePage?

Comment 5 Atsushi Eno 2011-02-24 17:11:31 UTC

I have to say, I don't understand how that flag is used for ;) But then the flag almost does not look making sense anymore...

Comment 6 Alexander Kojevnikov 2011-02-26 05:29:50 UTC

I suggest then to replace:

if (use_broken_latin1)
        return Encoding.Default;

with:

if (use_broken_latin1)
        return Encoding.GetEncoding (CultureInfo.CurrentCulture.TextInfo.ANSICodePage);

I'm very much against enabling this behaviour by default as suggested in your patch. Not specifying the encoding used for the text is bad because it encourages users to create non-portable files.

Let me know if the above fix will work for your scenario.

(In reply to comment #5)
> I have to say, I don't understand how that flag is used for ;)

Did you read the UseBrokenLatin1Behavior's XML comment?

Comment 7 Atsushi Eno 2011-02-26 07:35:53 UTC

I have some questions for you:

1) In which scenario is your solution better than mine? To me, it just sounds like that a) Latin1 people can live with "bad behavior" that lacks Encoding specification while b) non-Latin1 people cannot. It is just cultural discrimination. Note that there is a lot of countries that use non-Latin1 encodings.

2) With my solution, people can even specify their favorite culture and encodings by running Banshee like "LANG=en_US banshee". With UseBrokenLatin1Behavior, how can people switch to that "only Latin1 people thinks it is broken" mode? Banshee can still internally switch to UseBrokenLatin1Behavior by default, but (question (1) goes here again) ?

(So, to answer your question: no, I believe it doesn't work for my scenario.)

3) How indeed UseBrokenLatin1Behavior is indeed used for? (I indeed have already read the XML comment and I asked *how* it is used in reality.) I know that Banshee does not enable it by default, and hence Banshee is very much useless for Japanese with lots of existing mp3 files from years ago. If they are left unreadable, I'd just leave Banshee and stick to other music players such as iTunes.

Comment 8 Atsushi Eno 2011-02-26 07:44:22 UTC

Or simply: 4) Is there any reason why specifying "Latin1" is not a bad behavior?

Comment 9 Gabriel Burt 2011-02-26 18:41:06 UTC

Ah, OK, I think I get it.  From the ID3v2 docs:

  If nothing else is said a string is represented as ISO-8859-1
  [http://www.id3.org/id3v2.3.0#head-1a37d4a15deafc294208ccfde950f77e47000bca]

From the UseBrokenLatin1Behavior docs:

  Many media players and taggers incorrectly treat
  Latin-1 fields as "default encoding" fields.

This is exactly Atsushi's situation.  The bug in taglib-sharp is that on Linux in Japanese locale, UseBrokenLatin1Behavior should use Shift_JIS not Encoding.Default.  I think changing that is the real fix here.

We shouldn't change the "latin1" try/catch bit, because as mentioned in the docs, ID3v2 says that if the encoding is unspecified, the fields are latin1.  In Atsushi's case that's not true, so UseBrokenLatin1Behavior needs to be set true.

Comment 10 Gabriel Burt 2011-02-26 18:43:02 UTC

Atsushi, with regards to your (2) point, UseBrokenLatin1Behavior would need to either be turned on by default in Banshee, or exposed as an option in Preferences.

Comment 11 Gabriel Burt 2011-02-26 18:43:43 UTC

Or taglib-sharp modified to "to do the right thing" automagically somehow.  :)

Comment 12 Atsushi Eno 2011-02-26 19:26:55 UTC

Oh, wow, okay, ID3v2 specification itself is the source of illness of cultural discrimination. No fair-minded association can explain there is rational reason to specify Latin1 as the default but they indeed chose such inappropriate path. Many of Japanese who dig this problem in depth expressed anger against the spec.

Taking the fact into consideration, I'm okay with gabaug's choice (of either) to make required changes in Banshee. I believe that every taglib-sharp-based apps should solve this issue too though.

Finally I'd insist this: no one becomes unhappy if this "support non-Latin1 by default" behavior is default, while most of non-Latin1 people becomes unhappy on the other choice. Banshee won't spread "incorrectly-encoded" song files by this choice so I don't believe Banshee should feed sinful. The implementation can be said sloppy, but it is only against discriminating specification, which I believe should get broken. Wrong rule is just wrong and should be fixed. Note that ID3 specification is informal.

Comment 13 Atsushi Eno 2011-02-26 19:29:18 UTC

One more thing: note that the "default is Latin1" part is added only after ID3v2 and ID3v1 didn't specify that. And there are song files that are encoded by tools that are released before ID3v2.

Comment 14 Alexander Kojevnikov 2011-03-22 01:40:14 UTC

*** Bug 645486 has been marked as a duplicate of this bug. ***

Comment 15 unruledboy 2011-03-23 21:37:02 UTC

I think using Encoding.Default when it is "broken" is NOT a good idea, for example I am using a windows with en-gb as the default encoding, but actually I am parsing files that are with gb2312 encoding, so either latin1 or en-gb default) will be WRONG.

I still want to stick to have the ability for the user to specify the encoding.

What I have done is simply add one more static property called BrokenEncoding, when use broken is true and when BrokenEncoding is NOT null, then taglib will use it, but when use broken is true and BrokenEncoding is null, then taglib will stick to Encoding.Default.

Comment 16 Alexander Kojevnikov 2011-05-20 01:30:06 UTC

*** Bug 649435 has been marked as a duplicate of this bug. ***

Comment 17 Urmas 2011-05-20 18:44:25 UTC

Also the specified character set should be used if file has id3v2 tags which claim to be in "UTF16-LE" encoding, as in 95% of cases it's just zero-expanded local codepage characters.

Comment 18 André Klapper 2018-08-17 19:49:13 UTC

taglib-sharp has moved to Github a while ago.
Furthermore, GNOME Bugzilla will be shut down and replaced by gitlab.gnome.org.

If the problem reported in this Bugzilla ticket is still valid, please report
it to https://github.com/mono/taglib-sharp/issues instead. Thank you!

Closing this report as WONTFIX as part of Bugzilla Housekeeping.