Bug 735515 - Tracker failed to extract mp3 information
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
Version: unspecified
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: tracker-extractor
tracker-extractor
Depends on:
Blocks:
 
 
Reported: 2014-08-27 12:59 UTC by Cédric Bellegarde
Modified: 2015-07-05 10:39 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
extract-mp3: Bail out on encoding detection if confidence is too low (5.47 KB, patch)
2015-07-05 10:39 UTC, Carlos Garnacho

Description Cédric Bellegarde 2014-08-27 12:59:47 UTC
See the attached file; Tracker fails to extract its information.

gnumdk@arch:~$ tracker-info ~/Musique/Collection/Punk\ Francais/Bolchoi/Bolchoi/04\ -\ Fier.mp3|grep title
  'http://purl.org/dc/elements/1.1/title' = '䙩敲'
  'nie:title' = '䙩敲'
Comment 1 Cédric Bellegarde 2014-08-27 13:01:44 UTC
Here is the mp3 file:
https://drive.google.com/file/d/0B3oeICmTBtl7RmVaaUFZeE42Tkk/edit?usp=sharing
Comment 2 Carlos Garnacho 2015-07-05 10:30:58 UTC
Thanks for the bug report, and for providing the file :). This is not a failure to extract the tags, but a mishap in detecting their encoding.

In short, ID3 tags supposedly support 4 encodings: ISO-8859-1, UTF-16, UTF-16BE and UTF-8. However, the first of these has traditionally been abused to store other 8-bit encodings, which we attempt to read correctly.
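
For illustration only, here is a minimal Python sketch of how the four declared encodings map onto the text-encoding byte that starts an ID3v2.4 text frame. This is not Tracker's code (the extractor is written in C); the names `ID3V24_TEXT_ENCODINGS` and `decode_text_frame` are hypothetical, and the helper assumes a single string per frame.

```python
# Text-encoding byte of an ID3v2.4 text frame, mapped to Python codec names.
ID3V24_TEXT_ENCODINGS = {
    0x00: "iso-8859-1",  # the slot that is often abused for other 8-bit encodings
    0x01: "utf-16",      # UTF-16 with a byte order mark
    0x02: "utf-16-be",   # UTF-16 big endian, no byte order mark
    0x03: "utf-8",
}


def decode_text_frame(payload: bytes) -> str:
    """Decode a frame body whose first byte declares the text encoding.

    Hypothetical helper: it trusts the declared encoding and only strips
    trailing NUL terminators.
    """
    codec = ID3V24_TEXT_ENCODINGS[payload[0]]
    return payload[1:].decode(codec).rstrip("\x00")


print(decode_text_frame(b"\x00Fier"))
```

A frame that declares 0x00 but actually carries some other 8-bit encoding decodes without error here and silently yields the wrong characters, which is why a detection step is attempted at all.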

In order to do so, we concatenate all the strings and hand the result to libicu's encoding detection, then apply the detected encoding to all "ISO-8859-1" tags.

This specific file does have valid ISO-8859-1 tags, but the 'ï' in the album title misleads libicu into thinking this is "UTF-16BE" (??), so the string conversion goes all kinds of wrong.
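
The garbled title in the report is what this mis-detection produces. As a small Python illustration (not Tracker's code), taking the four ISO-8859-1 bytes of the track title "Fier" and decoding them as UTF-16BE pairs the bytes up into two CJK code points:

```python
title = "Fier"
raw = title.encode("iso-8859-1")   # four bytes: 0x46 0x69 0x65 0x72

# Wrongly treating the bytes as UTF-16BE reads them two at a time:
# 0x4669 and 0x6572, both of which fall in CJK ranges.
garbled = raw.decode("utf-16-be")

print([f"U+{ord(c):04X}" for c in garbled])   # ['U+4669', 'U+6572']
```

Two characters come out instead of four, matching the two-character value that tracker-info printed for nie:title.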

I checked, however, that libicu can return the confidence of its result (which we're not using), and it's extremely low for this string, just 10%. If we propagate and check that value, we can fall back to the default encoding.
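
The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the pushed patch: `pick_encoding` and `MIN_CONFIDENCE` are hypothetical names, the threshold of 50 is an arbitrary value chosen for the example, and the detector result is passed in by hand rather than obtained from libicu (which reports confidence on a 0 to 100 scale).

```python
DEFAULT_ENCODING = "iso-8859-1"  # what the tag itself declares
MIN_CONFIDENCE = 50              # assumed threshold, not taken from the patch


def pick_encoding(detected, confidence,
                  default=DEFAULT_ENCODING,
                  min_confidence=MIN_CONFIDENCE):
    """Return the detected encoding only if the detector was confident enough."""
    if detected is None or confidence < min_confidence:
        return default
    return detected


raw = "Fier".encode("iso-8859-1")

# The case from this bug: UTF-16BE guessed with only 10% confidence.
print(raw.decode(pick_encoding("utf-16-be", 10)))   # Fier
```

With a check like this, a low-confidence guess is discarded and the declared ISO-8859-1 encoding is used, while a confident detection of another 8-bit encoding still overrides the default.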

I'm pushing a patch that does so; with it, the text in this file is extracted correctly.
Comment 3 Carlos Garnacho 2015-07-05 10:38:56 UTC
The following fix has been pushed:
ede17cc extract-mp3: Bail out on encoding detection if confidence is too low
Comment 4 Carlos Garnacho 2015-07-05 10:39:00 UTC
Created attachment 306850
extract-mp3: Bail out on encoding detection if confidence is too low

Libicu encoding detection is able to tell the confidence it got on
the detection, we should be using that in case the confidence is
too low, as that means the returned encoding is probably bogus, and
we have an encoding to fallback on.

This fixes detection on the file reported on bug #735515, where
a couple of 'ï' chars (valid ISO-8859-1) make libicu detect UTF-16BE,
although with an extremely low confidence.