Bug 702183 – Tracker fails to extract properly the title and other metadata of a song



Summary:	Tracker fails to extract properly the title and other metadata of a song


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-06-13 13:52 UTC by Arnel Borja
Modified:	2014-03-21 10:21 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Conversation in #tracker about the issue (9.72 KB, text/x-log) 2013-06-13 13:52 UTC, Arnel Borja
Output of tracker-extract in jhbuild (6.28 KB, text/plain) 2013-06-13 13:53 UTC, Arnel Borja
Output of tracker-extract in Fedora 19 Beta (6.50 KB, text/plain) 2013-06-13 13:54 UTC, Arnel Borja

Description Arnel Borja 2013-06-13 13:52:12 UTC

Created attachment 246719 [details]
Conversation in #tracker about the issue

I have a Japanese song whose title Tracker fails to extract properly.

Expected: よってS.O.S
Result: ¤è¤Ã¤ÆS.O.S

Other info, such as the artist name, is also wrong.

It seems charset guessing failed: tracker-extract guessed IBM866 (a Cyrillic code page). Another possible reason is that the file has both ID3v1 and ID3v2 tags.
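
As a side note on the garbled result above: the reported bytes are consistent with EUC-JP text being read as Latin-1. The following is a minimal Python sketch of that observation, not output from tracker-extract. It assumes the tag stores the title as EUC-JP, which the report does not confirm; the IBM866 line is included only to show that a Cyrillic guess would produce different garbage.

```python
# The part of the title that came out garbled in the report.
expected = "よって"

# Assumption (not confirmed in the report): the tag holds EUC-JP bytes.
raw = expected.encode("euc_jp")

# Reading those bytes as Latin-1 reproduces the garbled text from the report.
as_latin1 = raw.decode("latin-1")

# Reading them as IBM866 (Python codec name "cp866") gives different garbage.
as_ibm866 = raw.decode("cp866")

print(raw.hex(" "))   # the raw byte values
print(as_latin1)
print(as_ibm866)
```

Here `as_latin1` equals the garbled prefix of the "Result" line, so the displayed string alone does not prove which charset the guesser picked; it only shows the bytes were not decoded as Japanese.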

I will attach the IRC conversation about it (from a week ago) and the output of tracker-extract in both jhbuild and Fedora 19 Beta.

Comment 1 Arnel Borja 2013-06-13 13:53:18 UTC

Created attachment 246720 [details]
Output of tracker-extract in jhbuild

Output of running "G_MESSAGES_DEBUG=all /opt/gnome/libexec/tracker-extract -v 3 -f ~/Music/01.RAMMに這いよるXXX\ -\ よってS.O.S.mp3" inside a jhbuild shell.

Comment 2 Arnel Borja 2013-06-13 13:54:51 UTC

Created attachment 246721 [details]
Output of tracker-extract in Fedora 19 Beta

Output of "G_MESSAGES_DEBUG=all /usr/libexec/tracker-extract -v 3 -f ~/Music/01.RAMMに這いよるXXX\ -\ よってS.O.S.mp3" when run in a terminal.

This is where the wrong guess shows up.

Comment 3 Arnel Borja 2013-06-13 15:10:05 UTC

The song that fails:
http://hugefiles.net/9l1mwbyshrgm

By the way, Nautilus and Rhythmbox could properly parse the metadata.

Comment 4 Martyn Russell 2013-07-11 18:14:57 UTC

Yea, we have to guess charsets sometimes and it's clear that here we guess 'IBM866' incorrectly.

We use enca for the guessing and it's not really the best for that.

I recently noticed libicu has an API we should be trying here:

  ucsdet_detect();

The documentation is here:

  http://userguide.icu-project.org/conversion/detection

However, if we use libunistring, we would have to fall back to something like enca :/

Comment 5 Carlos Garnacho 2013-10-02 17:07:05 UTC

I've added ICU-based encoding detection in master, which seems more reliable than the enca library. But checking again with the file in this bug, the strings still came out wrong: Big5 was being picked instead of any Japanese encoding.
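
One plausible reason a detector can land on Big5 for Japanese text: short multibyte strings are often structurally valid in several charsets at once, so validity alone cannot separate them. The sketch below illustrates this with Python's built-in codecs. It is not ICU's or enca's algorithm (both rank candidates statistically), and the sample bytes are an assumption (the title fragment encoded as EUC-JP), not bytes taken from the attached file. `decodable_as` is a hypothetical helper name.

```python
def decodable_as(data: bytes, charsets):
    """Return {charset: text} for every charset that decodes data without error."""
    results = {}
    for name in charsets:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this charset; rule it out.
            pass
    return results


# Assumed sample: part of the expected title, encoded as EUC-JP.
sample = "よって".encode("euc_jp")
candidates = ["utf-8", "euc_jp", "big5", "cp866"]

for name, text in decodable_as(sample, candidates).items():
    print(f"{name}: {text!r}")
```

Strict decoding rules out UTF-8 here, but EUC-JP, Big5 and IBM866 all accept the same bytes, so a detector has to fall back on language statistics, which are weak on a string this short.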

After some more fiddling with the file metadata, I came to realize that the strings in the ID3 tags are all in inconsistent/broken charsets, so GStreamer could not possibly be getting correct UTF-8 out of them. The correct strings had to be stored somewhere else, and checking the file with a hex editor confirmed it: the tags in nice UTF-8 are stored in APE tag format, which the MP3 extractor doesn't know about.

I tried using TagLib to implement a Tracker extractor, as it's supposed to be fast and supports those tags, but it turned out to be impractical: TagLib's C API is quite limited, and its API calls trigger inotify monitor events that get tracker-miner-fs and tracker-extract into a loop.

As the point of having a standalone MP3 extractor is speed compared to the GStreamer one (4x faster in quick testing), I guess the next reasonable goal is to add basic reading support for APE tags in there.
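
For illustration, basic APE tag reading could look roughly like the Python sketch below. This is not the code that went into Tracker (the extractor is written in C). It assumes the APEv2 layout: a 32-byte footer starting with "APETAGEX", followed by little-endian version, tag size (items plus footer), item count and flags, with the tag sitting at the end of the file or just before a 128-byte ID3v1 block. `read_ape_text_items` is a hypothetical name, and the sketch ignores APE headers, Lyrics3 blocks and binary items.

```python
import struct

FOOTER_SIZE = 32   # APEv2 footer length in bytes
ID3V1_SIZE = 128   # ID3v1 block length, starts with b"TAG"


def read_ape_text_items(data: bytes) -> dict:
    """Return the UTF-8 text items of a trailing APEv2 tag, or {} if none is found."""
    end = len(data)
    # An ID3v1 block, if present, sits after the APE tag; skip it.
    if end >= ID3V1_SIZE and data[end - ID3V1_SIZE:end - ID3V1_SIZE + 3] == b"TAG":
        end -= ID3V1_SIZE
    if end < FOOTER_SIZE:
        return {}
    footer = data[end - FOOTER_SIZE:end]
    if footer[:8] != b"APETAGEX":
        return {}
    _version, tag_size, item_count, _flags = struct.unpack("<4I", footer[8:24])
    # tag_size covers all items plus the footer (but not an optional header).
    if tag_size < FOOTER_SIZE or tag_size > end:
        return {}
    pos = end - tag_size
    items_end = end - FOOTER_SIZE
    items = {}
    for _ in range(item_count):
        if pos + 8 > items_end:
            break
        value_size, item_flags = struct.unpack("<2I", data[pos:pos + 8])
        pos += 8
        # The key is ASCII and terminated by a single zero byte.
        key_end = data.find(b"\x00", pos, items_end)
        if key_end < 0:
            break
        key = data[pos:key_end].decode("ascii", errors="replace")
        pos = key_end + 1
        if pos + value_size > items_end:
            break
        value = data[pos:pos + value_size]
        pos += value_size
        # Bits 1-2 of the item flags give the value type; 0 means UTF-8 text.
        if (item_flags >> 1) & 0x3 == 0:
            items[key] = value.decode("utf-8", errors="replace")
    return items
```

Calling `read_ape_text_items(open(path, "rb").read())` on a file like the one in this bug would be expected to return keys such as "Title" and "Artist" with correct UTF-8 values, independent of whatever charset the ID3 frames use. Reading the whole file is only for brevity; a real extractor would seek to the end and read just the tag.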

Comment 6 Martyn Russell 2014-03-21 10:21:10 UTC

Marking this as fixed. Arnel, let us know if you think more is needed, and reopen this bug if so.