GNOME Bugzilla – Bug 642993
Use culture-dependent code page rather than Latin1
Last modified: 2018-08-17 19:49:13 UTC
Created attachment 181629 [details] [review] Proposed fix taglib-sharp (specifically ByteVector) has an incorrect assumption that old non-UTF8 tags are all in Latin1. They are actually undefined and most likely platform-language-specific encoding. For example, it is Shift_JIS in Japanese Windows. And this Latin1 assumption brings wrong results for my MP3s. To use most-likely-correct encoding, I replaced Encoding.GetEncoding ("latin1") with Encoding.GetEncoding (CultureInfo.CurrentCulture.TextInfo.ANSICodePage) in the attached fix.
Review of attachment 181629 [details] [review]: That's what UseBrokenLatin1Behavior is for, set it to true to have this behaviour. I'm against turning it on by default as your patch effectively does. Tags must specify the correct encoding if they need non-latin1 characters.
Closing as WONTFIX, but feel free to re-open if I misunderstood the bug's intent.
Yes, I believe you're misunderstanding :) In the current ByteVector, use_broken_latin1 is used like this: if (use_broken_latin1) return Encoding.Default; try { return Encoding.GetEncoding ("latin1"); } catch (ArgumentException) { return Encoding.UTF8; } But Encoding.Default is *UTF8* on (modern) Linux. It does not solve the actual problem on Linux. Encoding.Default solution solves the problem only when you are on Windows. The default encoding should be based on the user's culture. My approach to use ANSICodePage saves non-Latin1 people while Latin1 people are still safe. The assumption that "Latin1 is the worldwide default" is wrong, as I explained above. Mono mcs had made exactly the same "the default C# source encoding must be in Latin1 by default" mistake (it is also Shift_JIS in Japanese on .NET). (I don't know if/how I can reopen this bug; there is no selectable control shown for me.)
Shouldn't your patch also change the if (use_broken_latin1) return Encoding.Default; to return ANSICodePage?
I have to say, I don't understand how that flag is used for ;) But then the flag almost does not look making sense anymore...
I suggest then to replace: if (use_broken_latin1) return Encoding.Default; with: if (use_broken_latin1) return Encoding.GetEncoding (CultureInfo.CurrentCulture.TextInfo.ANSICodePage); I'm very much against enabling this behaviour by default as suggested in your patch. Not specifying the encoding used for the text is bad because it encourages users to create non-portable files. Let me know if the above fix will work for your scenario. (In reply to comment #5) > I have to say, I don't understand how that flag is used for ;) Did you read the UseBrokenLatin1Behavior's XML comment?
I have some questions for you: 1) In which scenario is your solution better than mine? To me, it just sounds like that a) Latin1 people can live with "bad behavior" that lacks Encoding specification while b) non-Latin1 people cannot. It is just cultural discrimination. Note that there is a lot of countries that use non-Latin1 encodings. 2) With my solution, people can even specify their favorite culture and encodings by running Banshee like "LANG=en_US banshee". With UseBrokenLatin1Behavior, how can people switch to that "only Latin1 people thinks it is broken" mode? Banshee can still internally switch to UseBrokenLatin1Behavior by default, but (question (1) goes here again) ? (So, to answer your question: no, I believe it doesn't work for my scenario.) 3) How indeed UseBrokenLatin1Behavior is indeed used for? (I indeed have already read the XML comment and I asked *how* it is used in reality.) I know that Banshee does not enable it by default, and hence Banshee is very much useless for Japanese with lots of existing mp3 files from years ago. If they are left unreadable, I'd just leave Banshee and stick to other music players such as iTunes.
Or simply: 4) Is there any reason why specifying "Latin1" is not a bad behavior?
Ah, OK, I think I get it. From the ID3v2 docs: If nothing else is said a string is represented as ISO-8859-1 [http://www.id3.org/id3v2.3.0#head-1a37d4a15deafc294208ccfde950f77e47000bca] From the UseBrokenLatin1Behavior docs: Many media players and taggers incorrectly treat Latin-1 fields as "default encoding" fields. This is exactly Atsushi's situation. The bug in taglib-sharp is that on Linux in Japanese locale, UseBrokenLatin1Behavior should use Shift_JIS not Encoding.Default. I think changing that is the real fix here. We shouldn't change the "latin1" try/catch bit, because as mentioned in the docs, ID3v2 says that if the encoding is unspecified, the fields are latin1. In Atsushi's case that's not true, so UseBrokenLatin1Behavior needs to be set true.
Atsushi, with regards to your (2) point, UseBrokenLatin1Behavior would need to either be turned on by default in Banshee, or exposed as an option in Preferences.
Or taglib-sharp modified to "to do the right thing" automagically somehow. :)
Oh, wow, okay, ID3v2 specification itself is the source of illness of cultural discrimination. No fair-minded association can explain there is rational reason to specify Latin1 as the default but they indeed chose such inappropriate path. Many of Japanese who dig this problem in depth expressed anger against the spec. Taking the fact into consideration, I'm okay with gabaug's choice (of either) to make required changes in Banshee. I believe that every taglib-sharp-based apps should solve this issue too though. Finally I'd insist this: no one becomes unhappy if this "support non-Latin1 by default" behavior is default, while most of non-Latin1 people becomes unhappy on the other choice. Banshee won't spread "incorrectly-encoded" song files by this choice so I don't believe Banshee should feed sinful. The implementation can be said sloppy, but it is only against discriminating specification, which I believe should get broken. Wrong rule is just wrong and should be fixed. Note that ID3 specification is informal.
One more thing: note that the "default is Latin1" part is added only after ID3v2 and ID3v1 didn't specify that. And there are song files that are encoded by tools that are released before ID3v2.
*** Bug 645486 has been marked as a duplicate of this bug. ***
I think using Encoding.Default when it is "broken" is NOT a good idea, for example I am using a windows with en-gb as the default encoding, but actually I am parsing files that are with gb2312 encoding, so either latin1 or en-gb default) will be WRONG. I still want to stick to have the ability for the user to specify the encoding. What I have done is simply add one more static property called BrokenEncoding, when use broken is true and when BrokenEncoding is NOT null, then taglib will use it, but when use broken is true and BrokenEncoding is null, then taglib will stick to Encoding.Default.
*** Bug 649435 has been marked as a duplicate of this bug. ***
Also the specified character set should be used if file has id3v2 tags which claim to be in "UTF16-LE" encoding, as in 95% of cases it's just zero-expanded local codepage characters.
taglib-sharp has moved to Github a while ago. Furthermore, GNOME Bugzilla will be shut down and replaced by gitlab.gnome.org. If the problem reported in this Bugzilla ticket is still valid, please report it to https://github.com/mono/taglib-sharp/issues instead. Thank you! Closing this report as WONTFIX as part of Bugzilla Housekeeping.