GNOME Bugzilla – Bug 702183
Tracker fails to extract properly the title and other metadata of a song
Last modified: 2014-03-21 10:21:10 UTC
Created attachment 246719: Conversation in #tracker about the issue
I have a Japanese song whose title tracker fails to extract properly. Expected: よってS.O.S Result: ¤è¤Ã¤ÆS.O.S Other info, such as the artist name, is also wrong. It seems charset guessing failed; tracker-extract guessed IBM866 (which is Cyrillic). Another possible reason is that the file has both ID3v1 and ID3v2 tags. I will attach the IRC conversation about it (from a week ago) and the output of tracker-extract in both jhbuild and Fedora 19 Beta.
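As a side note on the garbled title: the reported string happens to match what the EUC-JP bytes of the expected title look like when each byte is displayed as ISO-8859-1. The short Python check below only demonstrates that byte-level coincidence. It does not show what tracker-extract did internally, and the assumption that the tag is stored as EUC-JP is mine, not something confirmed in this report.

```python
# Assumption (not confirmed by the report): the title is stored as EUC-JP.
expected = "よって"      # Japanese part of the expected title
reported = "¤è¤Ã¤Æ"      # garbled part of the reported title

raw = expected.encode("euc_jp")     # bytes as they would sit in the tag
as_latin1 = raw.decode("latin-1")   # each byte shown as one ISO-8859-1 character

print(raw.hex(" "))                 # a4 e8 a4 c3 a4 c6
print(as_latin1 == reported)        # True
```

The ASCII tail "S.O.S" is identical in both encodings, which is why only the Japanese part is garbled.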
Created attachment 246720: Output of tracker-extract in jhbuild
Output of running "G_MESSAGES_DEBUG=all /opt/gnome/libexec/tracker-extract -v 3 -f ~/Music/01.RAMMに這いよるXXX\ -\ よってS.O.S.mp3" inside the jhbuild shell.
Created attachment 246721: Output of tracker-extract in Fedora 19 Beta
Output of "G_MESSAGES_DEBUG=all /usr/libexec/tracker-extract -v 3 -f ~/Music/01.RAMMに這いよるXXX\ -\ よってS.O.S.mp3" when run in a terminal. This is where the wrong guess shows up.
The song that fails: http://hugefiles.net/9l1mwbyshrgm By the way, Nautilus and Rhythmbox could properly parse the metadata.
Yeah, we have to guess charsets sometimes, and it's clear that here we guessed 'IBM866' incorrectly. We use enca for the guessing, and it's not really the best for that. I recently noticed libicu has an API we should be trying here: ucsdet_detect(). The documentation is here: http://userguide.icu-project.org/conversion/detection However, if we use libunistring, we would have to fall back to something like enca :/
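To illustrate why guessing is hard for short tag strings, here is a minimal Python sketch that tries a handful of codecs on the same bytes and keeps every one that decodes without error. This is not how enca or ICU's detector work (both rely on statistical analysis rather than plain validity), and the candidate list, the helper name `decodable_as`, and the EUC-JP sample bytes are assumptions chosen for illustration.

```python
# Hypothetical candidate list; a real detector ranks many more charsets.
CANDIDATES = ["euc_jp", "shift_jis", "big5", "cp866", "utf_8"]


def decodable_as(raw, candidates=CANDIDATES):
    """Return {codec: decoded text} for each candidate that decodes cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this codec
    return results


# Sample input: the Japanese part of the title, assumed to be EUC-JP.
raw = "よって".encode("euc_jp")
for name, text in decodable_as(raw).items():
    print(name, repr(text))
```

Several unrelated codecs accept the same six bytes, so validity alone cannot pick the right one; a detector has to weigh how plausible each decoding is, and a very short string gives it little to work with.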
I've added ICU-based encoding detection in master, which seems more reliable than the enca library. Checking again with the file in this bug, though, the strings still came out wrong: Big5 was being picked instead of any Japanese encoding. After some more fiddling with the file metadata, I came to realize that the strings in the ID3 tags are all in inconsistent/broken charsets, so GStreamer could not possibly be getting correct UTF-8 out of them. Those strings had to be stored somewhere else, and checking the file with a hex editor confirmed it: the tags in nice UTF-8 are stored in APE tag format, which the mp3 extractor doesn't know about. I tried using taglib to implement a tracker extractor, as it's supposed to be fast and supports those tags, but it turned out to be impractical: taglib's C API is quite limited, and its API calls trigger inotify monitor events that get tracker-miner-fs and tracker-extract into a loop. As the point of having a standalone MP3 extractor is speed compared with the GStreamer one (4x faster from quick testing), I guess the next sensible goal is to add basic reading support for APE tags there.
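For readers unfamiliar with the APE tag format mentioned above, the following is a rough Python sketch of basic read support, assuming an APEv2 tag with a footer at the end of the file, optionally followed by a 128-byte ID3v1 tag. It is not tracker's implementation (which is in C). The function names are hypothetical, and the sketch ignores APEv1, tags at the start of the file, and non-text items.

```python
import struct

APE_PREAMBLE = b"APETAGEX"
APE_FOOTER_SIZE = 32
ID3V1_SIZE = 128


def read_ape_items(data):
    """Return {key: text} from an APEv2 tag at the end of data, or {} if none."""
    end = len(data)
    # An ID3v1 tag, if present, is the last 128 bytes and starts with "TAG".
    if end >= ID3V1_SIZE and data[end - ID3V1_SIZE:end - ID3V1_SIZE + 3] == b"TAG":
        end -= ID3V1_SIZE
    footer_start = end - APE_FOOTER_SIZE
    if footer_start < 0 or data[footer_start:footer_start + 8] != APE_PREAMBLE:
        return {}
    # Footer: preamble(8) version(4) tag size(4) item count(4) flags(4) reserved(8)
    _version, tag_size, item_count, _flags = struct.unpack_from(
        "<4I", data, footer_start + 8)
    # tag_size covers the items plus the footer, not the optional header.
    pos = end - tag_size
    if tag_size < APE_FOOTER_SIZE or pos < 0:
        return {}
    items = {}
    for _ in range(item_count):
        if pos + 8 > footer_start:
            break
        value_size, item_flags = struct.unpack_from("<2I", data, pos)
        key_end = data.find(b"\x00", pos + 8, footer_start)
        if key_end < 0 or key_end + 1 + value_size > footer_start:
            break  # truncated or corrupt item
        key = data[pos + 8:key_end].decode("ascii", errors="replace")
        value = data[key_end + 1:key_end + 1 + value_size]
        pos = key_end + 1 + value_size
        # Bits 1-2 of the item flags give the value type; 0 means UTF-8 text.
        if (item_flags >> 1) & 0x3 == 0:
            items[key] = value.decode("utf-8", errors="replace")
    return items


def build_demo_tag(fields):
    """Build a header-less APEv2 tag, used only to exercise the reader."""
    body = b""
    for key, text in fields.items():
        value = text.encode("utf-8")
        body += struct.pack("<2I", len(value), 0)
        body += key.encode("ascii") + b"\x00" + value
    footer = APE_PREAMBLE + struct.pack(
        "<4I", 2000, len(body) + APE_FOOTER_SIZE, len(fields), 0) + bytes(8)
    return body + footer


# Fake audio bytes followed by a synthetic tag; no real MP3 file is needed.
demo = b"\xff\xfb" + bytes(64) + build_demo_tag({"Title": "よってS.O.S"})
print(read_ape_items(demo))
```

Because APEv2 text items are defined as UTF-8, a reader like this needs no charset guessing for them, which fits the observation that the correct strings in this file live in the APE tag rather than in the ID3 tags.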
Marking this as fixed. Arnel, let us know if you think more is needed, and reopen this bug if so.