Bug 735515 – Tracker failed to extract mp3 information


Summary:	Tracker failed to extract mp3 information


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	tracker-extractor

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-08-27 12:59 UTC by Cédric Bellegarde
Modified:	2015-07-05 10:39 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
extract-mp3: Bail out on encoding detection if confidence is too low (5.47 KB, patch) 2015-07-05 10:39 UTC, Carlos Garnacho

Description Cédric Bellegarde 2014-08-27 12:59:47 UTC

See the attached file: Tracker fails to extract its information.

gnumdk@arch:~$ tracker-info ~/Musique/Collection/Punk\ Francais/Bolchoi/Bolchoi/04\ -\ Fier.mp3|grep title
  'http://purl.org/dc/elements/1.1/title' = '䙩敲'
  'nie:title' = '䙩敲'

Comment 1 Cédric Bellegarde 2014-08-27 13:01:44 UTC

Here is the mp3 file:
https://drive.google.com/file/d/0B3oeICmTBtl7RmVaaUFZeE42Tkk/edit?usp=sharing

Comment 2 Carlos Garnacho 2015-07-05 10:30:58 UTC

Thanks for the bug report, and for providing the file :). This is not a failure to extract the tags, but a mishap in detecting their encoding.

In short, ID3 tags nominally support 4 encodings: ISO-8859-1, UTF-16, UTF-16BE and UTF-8. However, the first one has traditionally been abused to store other 8-bit encodings, which we attempt to read correctly.
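
For illustration, here is a minimal sketch of how those four encodings map onto the leading encoding byte of an ID3v2.4 text frame. The function name and the handling of trailing NULs are assumptions made for this example; it is not the code in tracker-extract.

```python
# Encoding byte -> Python codec, following the ID3v2.4 text encoding values.
ID3_TEXT_ENCODINGS = {
    0x00: "iso-8859-1",  # the slot that is often abused for other 8-bit encodings
    0x01: "utf-16",      # UTF-16 with a byte order mark
    0x02: "utf-16-be",   # UTF-16 big endian, no byte order mark
    0x03: "utf-8",
}


def decode_text_frame(payload: bytes) -> str:
    """Decode a text frame payload: one encoding byte followed by the text.

    Hypothetical helper for illustration only.
    """
    if not payload:
        return ""
    codec = ID3_TEXT_ENCODINGS.get(payload[0])
    if codec is None:
        raise ValueError(f"unknown ID3 text encoding byte: {payload[0]:#04x}")
    # Drop any terminating NUL characters left after decoding.
    return payload[1:].decode(codec).rstrip("\x00")
```

A frame declared as 0x00 is where the guessing starts, because the bytes may not really be ISO-8859-1.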

To do so, we concatenate all the strings together and hand the result to libicu's encoding detection, then apply the detected encoding to all "ISO-8859-1" tags.
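
Roughly, the approach looks like the sketch below. The `detect` callable stands in for libicu's detector and is purely hypothetical, as are the function and parameter names; the real extractor is written in C and differs in detail.

```python
from typing import Callable, Dict


def decode_latin1_tags(
    tags: Dict[str, bytes],
    detect: Callable[[bytes], str],
) -> Dict[str, str]:
    """Guess one encoding for all nominally ISO-8859-1 tags and decode them.

    `detect` is a hypothetical stand-in for the libicu detector: it takes raw
    bytes and returns a codec name.
    """
    # Concatenate every tag value so the detector gets as much text as possible.
    sample = b"".join(tags.values())
    encoding = detect(sample)
    # Apply the single guessed encoding to every tag.
    return {
        name: raw.decode(encoding, errors="replace")
        for name, raw in tags.items()
    }
```

Detecting once over the combined text gives the detector more input than any single short tag would, but it also means that one wrong guess corrupts every tag at the same time.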

This specific file does have valid ISO-8859-1 tags, but the 'ï' in the album title misleads libicu into thinking the text is "UTF-16BE" (??), so string conversion goes all kinds of wrong.
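
The garbled title in the report is consistent with this. As a small sketch, and assuming the title bytes are simply "Fier" in ISO-8859-1 (inferred from the file name, not checked against the file itself), reading those bytes as UTF-16BE pairs them up into two CJK code points:

```python
# Assumption: the title tag holds the four bytes of "Fier" in ISO-8859-1.
raw = "Fier".encode("iso-8859-1")   # bytes 0x46 0x69 0x65 0x72

# Misread as UTF-16BE: each pair of bytes becomes one 16-bit code unit.
wrong = raw.decode("utf-16-be")     # U+4669 followed by U+6572

# Read with the declared encoding instead.
right = raw.decode("iso-8859-1")

print([f"U+{ord(ch):04X}" for ch in wrong], right)
```

Two characters instead of four is the giveaway: every pair of Latin letters has collapsed into a single code unit.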

However, I checked that libicu can also return its confidence in the result (which we are not using), and it is extremely low for this string, just 10%. If we propagate that value and check it, we can fall back to the default encoding.
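
The check amounts to something like the following sketch. The threshold of 50 and all names here are assumptions for illustration; the cutoff actually used in the patch is not stated in this report.

```python
DEFAULT_ENCODING = "iso-8859-1"  # what the tag claims to be
MIN_CONFIDENCE = 50              # hypothetical cutoff, on a 0-100 scale


def pick_encoding(detected: str, confidence: int) -> str:
    """Return the detected encoding only if the detector was confident enough.

    Otherwise bail out and fall back to the default encoding.
    """
    if confidence < MIN_CONFIDENCE:
        return DEFAULT_ENCODING
    return detected


# The case from this report: "UTF-16BE" guessed with only 10% confidence.
print(pick_encoding("utf-16-be", 10))
```

A confident guess such as `pick_encoding("windows-1251", 80)` is still honoured, so files that really do abuse the ISO-8859-1 slot keep working.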

I'm pushing a patch that does so; with it, the text in this file is extracted correctly.

Comment 3 Carlos Garnacho 2015-07-05 10:38:56 UTC

The following fix has been pushed:
ede17cc extract-mp3: Bail out on encoding detection if confidence is too low

Comment 4 Carlos Garnacho 2015-07-05 10:39:00 UTC

Created attachment 306850
extract-mp3: Bail out on encoding detection if confidence is too low

Libicu encoding detection is able to tell the confidence it got on
the detection, we should be using that in case the confidence is
too low, as that means the returned encoding is probably bogus, and
we have an encoding to fallback on.

This fixes detection on the file reported on bug #735515, where
a couple of 'ï' chars (valid ISO-8859-1) make libicu detect UTF-16BE,
although with an extremely low confidence.