GNOME Bugzilla – Bug 735515
Tracker failed to extract mp3 information
Last modified: 2015-07-05 10:39:00 UTC
See the attached file; Tracker fails to extract its information correctly.

gnumdk@arch:~$ tracker-info ~/Musique/Collection/Punk\ Francais/Bolchoi/Bolchoi/04\ -\ Fier.mp3|grep title
'http://purl.org/dc/elements/1.1/title' = '䙩敲'
'nie:title' = '䙩敲'

Here is the mp3 file: https://drive.google.com/file/d/0B3oeICmTBtl7RmVaaUFZeE42Tkk/edit?usp=sharing
Thanks for the bug report, and for providing the file :). This is not a failure to extract tags, but a mishap in detecting their encoding.

In short, ID3 tags supposedly support 4 encodings: ISO-8859-1, UTF-16, UTF-16BE and UTF-8. However, the first of these has traditionally been abused to store other 8-bit encodings, which we attempt to read correctly. To do so, we concatenate all strings together, hand the result to libicu's encoding detection, and then use the detected encoding on all "ISO-8859-1" tags.

This specific file does have valid ISO-8859-1 tags, but the 'ï' in the album title misleads libicu into thinking this is "UTF-16BE" (??), so string conversion goes all kinds of wrong.

However, I checked that libicu is able to return the confidence of the result (which we're not using), and it's extremely low on this string, just 10%. If we propagate and check that value, we can fall back to the default encoding. I'm pushing a patch that does so; it makes the text in this file extract correctly.
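For anyone who wants to see the symptom in isolation, the following is a minimal Python sketch, not Tracker code. It assumes the title is stored as the plain ISO-8859-1 bytes of "Fier" (taken from the file name in the report) and shows what happens when those same bytes are read as UTF-16BE instead: every two bytes collapse into a single code unit.

```python
# The title bytes as a valid ISO-8859-1 tag would store them (assumed from the file name).
raw = "Fier".encode("iso-8859-1")  # 4 bytes: 0x46 0x69 0x65 0x72

# Correct interpretation: one byte per character.
as_latin1 = raw.decode("iso-8859-1")

# Wrong interpretation: each pair of bytes becomes one UTF-16 code unit.
as_utf16be = raw.decode("utf-16-be")

print(as_latin1)
print([hex(ord(c)) for c in as_utf16be])  # code points of the mis-decoded title
```

The two resulting code points are U+4669 and U+6572, which matches the two CJK characters shown in the tracker-info output of the report.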
The following fix has been pushed: ede17cc extract-mp3: Bail out on encoding detection if confidence is too low
Created attachment 306850 [details] [review]
extract-mp3: Bail out on encoding detection if confidence is too low

Libicu encoding detection is able to report the confidence of its result. We should use that: when the confidence is too low, the returned encoding is probably bogus, and we have an encoding to fall back on.

This fixes detection on the file reported in bug #735515, where a couple of 'ï' chars (valid ISO-8859-1) make libicu detect UTF-16BE, although with an extremely low confidence.
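The idea behind the patch can be sketched roughly as follows. This is an illustration in Python, not the actual C change: the names `pick_encoding` and `decode_tag` and the cutoff value of 50 are hypothetical, and the detector result is passed in as plain arguments rather than obtained from libicu. The only figures taken from this report are the detected encoding (UTF-16BE) and its confidence (10 on a 0 to 100 scale).

```python
FALLBACK_ENCODING = "iso-8859-1"  # the encoding the tag itself declares
MIN_CONFIDENCE = 50               # hypothetical cutoff, not taken from the patch


def pick_encoding(detected, confidence,
                  fallback=FALLBACK_ENCODING, min_confidence=MIN_CONFIDENCE):
    """Trust the detector only when it is reasonably sure of its guess."""
    if detected is None or confidence < min_confidence:
        return fallback
    return detected


def decode_tag(raw, detected, confidence):
    """Decode tag bytes with the detected encoding, or the fallback if unsure."""
    return raw.decode(pick_encoding(detected, confidence))


# The case from this bug: a UTF-16BE guess with confidence 10 is ignored.
print(decode_tag(b"Fier", "utf-16-be", 10))

# A confident guess for an abused "ISO-8859-1" tag is still honoured.
print(decode_tag("Привет".encode("cp1251"), "cp1251", 80))
```

Keeping the declared encoding as the fallback means well-formed files are never made worse by a weak guess, while files that really do carry another 8-bit encoding still benefit when the detector is confident.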