GNOME Bugzilla – Bug 354161
Not All Files Found: Too Large Music Directory to Index?
Last modified: 2006-09-10 01:36:51 UTC
Please describe the problem:
I have a large flat directory (3220 .mp3 files). It seems that not all of it is indexed. Is this a known limit? (Possibly fixed upstream of my distro, Ubuntu 6.06?)

For example, I ask beagle for "Beatles" (using beagle-query or the GUI) and I get 13 songs (plus two Beatles mentions in PDFs and one item that appears to be the contents of a web page in the Firefox cache, 16 hits total). I run "locate Beatles" and I get 71 songs (only songs). Interestingly, the first 13 hits from "locate" are the same 13 songs beagle-query finds (though in a different order). Spot checking the extra songs found by "locate" confirms they do currently exist on my disk. This happens with other music in that directory.

I don't think there has been any change in the contents of that large directory since before I installed beagle. The directory was created by gtkpod.

An example file name that beagle does not find:
    /home/kentborg/gtkpodmusic/The Beatles - Wild Honey Pie.mp3
An example that it does find:
    /home/kentborg/gtkpodmusic/The Beatles - Words Of Love.mp3

I don't know if things other than music are not being indexed.

beagle-status says:
    Scheduler:
    Count: 20888
    Status: Waiting on empty queue
    Pending Tasks:
    Scheduler queue is empty.

Steps to reproduce:
Every song query seems to reproduce the problem; I haven't tried a reindex from scratch. (What are the correct steps for that? Any valuable experiments I should try first?)

Actual results:
Expected results:
Does this happen every time?
Other information:
"beagled -fg" reports version 0.2.6. This is on a notebook running Ubuntu 6.06. The initial indexing ran overnight and got things hot and upset the battery (blinking the charge LED, which I had not seen before), but the OS did not crash.
Just thinking aloud here, could we use the static indexer to try and diagnose this elusive problem?
This problem has been reported before in #341841. I tried to reproduce it with lots of mp3 files in one directory but wasn't able to. It looks like something other than just lots_of_files is playing nasty here. Maybe something in the names of the files, or xdgmime, or something even worse.

Some information which might be helpful:
* Is it reproducible every time? If so, then it is probably not an xdgmime issue.
* If that directory is added as an indexing root, does the problem still happen? If not, then it is probably some directory traversal bug.
* Maybe keep removing half of the files from the directory to see if any particular file is causing this.
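The halving experiment in the last bullet could be scripted roughly as follows. This is only a sketch: `move_half` is a hypothetical helper name, the holding directory is an assumption, and it should be run on a copy of the music directory rather than the original.

```shell
# move_half DIR HOLD_DIR
# Move the second half of the regular files in DIR (in glob order) into
# HOLD_DIR, leaving the first half in place for the next indexing run.
# Hidden files are not matched by the glob and are left alone.
move_half() {
    dir=$1
    hold=$2
    mkdir -p "$hold" || return 1

    # First pass: count the regular files.
    total=0
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            total=$((total + 1))
        fi
    done

    # Keep the first half (rounded up), move the rest.
    keep=$(( (total + 1) / 2 ))
    i=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        i=$((i + 1))
        if [ "$i" -gt "$keep" ]; then
            mv -- "$f" "$hold"/ || return 1
        fi
    done
    return 0
}
```

After each call, reindex and query again. If the problem is still there, halve the remaining files again; if it is gone, swap the held files back in and halve those instead.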
(In reply to comment #1)
> Just thinking aloud here, could we use the static indexer to try and diagnose
> this elusive problem?

Static indexer might help, but it's a different code path. It's probably just as easy to set up a sandbox and test it using the daemon, like so:

    BEAGLE_HOME=/tmp/sandbox BEAGLE_EXERCISE_THE_DOG=1 beagled --debug --fg --allow-backend files
I am not well experienced with beagle, so please be patient with me...

Not wanting to destroy a captive instance of a possibly elusive bug, I did the following:
- create a new user
- "cp -a" my music directory to the new user's home
- fire up "beagled --fg"
- wait
- "beagle-query Beatles"

Result: Same results. Identical files in identical order. Only 13 hits.

Conclusion: I can reproduce this bug.

Another clue: The 13 files returned are the same as the first 13 files returned by locate. I think there is a file that beagled somehow barfed on, someplace between the 13th and 14th Beatles song.

Note, these files do not have traditional computer names. They came from loading CDs into iTunes on a Mac, syncing with an iPod, and using gtkpod to suck those songs into my Ubuntu notebook (BTW, gtkpod is getting confused on these files too). The file names include: spaces, apostrophes, ampersands, dashes, parentheses, exclamation points, underscores, multiple periods, upper and lowercase letters, digits, long names (e.g., 230 characters long for one), diacriticals (e-acute, e-grave, o-diaeresis, e-diaeresis, E-dot, i-diaeresis, o-acute, n-tilde, i-acute), sexed single close quotes, cross-hash pound signs...and others I am sure I missed in my survey (scrolling through the directory in emacs).

How well tested is beagle against inband data in names being interpreted as delimiters? I am guessing some name is a problem.

Any suggestions for what experiments I might try next? (Please be explicit, I am not a beagle expert.)

Thanks,
-kb
Kent: Filenames should not be a problem, with perhaps the exception of the diacritics if the filename isn't UTF-8. But it probably is if you're seeing the names correctly and not garbage. You might want to try it again in your second home directory; nuke the ~/.beagle directory and rerun it; make sure you pass in --debug to the command-line. It would be helpful if you could tar up your ~/.beagle/Log directory and attach it to the bug; that would tell us if files are being detected incorrectly, if there was an error that caused indexing to stop, etc.
(In reply to comment #4)
> Note, these files do not have traditional computer names. They came from
> loading CDs into Itunes on a Mac, syncing with an Ipod, and using gtkpod to
> suck those songs into my Ubuntu notebook (BTW, gtkpod is getting confused on
> these files too). The files names include: spaces, apostrophies, ampersands,
> dashes, parentheses, exclamation points, underscores, multiple periods, upper
> and lowercase letters, digits, long names (e.g., 230 characters long for one),
> diacriticals (e-acute, e-grave, o-diaeresis, e-diaeresis, E-dot, i-diaeresis,
> o-acute, n-tilda, i-acute), sexed single close quotes, cross-hash pound
> signs...and others I am sure I missed in my survey (scrolling though the
> directory in emacs).
>
> How well tested is beagle against inband data in names being interpreted as
> delimiters? I am guessing some name is a problem.

Name problems do show up now and then, but nothing so severe. Beagle generally handles weird naming well enough, as Joe said.

One way to test for a name problem would be to copy one simple, small, known-good mp3 file once for each of the 3220 files, giving the copies the same names as the files in that directory. Output "ls -1" to a file and then run a script that reads each name from that file and copies the fixed mp3 to that name.

Whatever you do, the log files would be very helpful. Don't let the instance run away.
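The name-only test described above might look like the sketch below. `clone_names` is a hypothetical helper, and it walks the directory with a shell glob instead of reading an "ls -1" listing, on the assumption that this keeps unusual characters in the names intact.

```shell
# clone_names SRC_DIR SAMPLE_FILE DEST_DIR
# For every regular file in SRC_DIR, create a file of the same name in
# DEST_DIR whose contents are a copy of SAMPLE_FILE (one known-good mp3).
# Hidden files are not matched by the glob and are skipped.
clone_names() {
    src=$1
    sample=$2
    dest=$3
    mkdir -p "$dest" || return 1
    for path in "$src"/*; do
        [ -f "$path" ] || continue
        # ${path##*/} strips the directory part, leaving only the file name.
        cp -- "$sample" "$dest/${path##*/}" || return 1
    done
    return 0
}
```

For example, `clone_names ~/gtkpodmusic ~/good.mp3 /tmp/namesonly` (the last two paths are made up) would give a directory with the same 3220 names but identical contents. If beagle still misses files there, the names or the directory itself are implicated rather than the contents of any mp3.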
(In reply to comment #5)
> You might want to try it again in your second home directory; nuke the
> ~/.beagle directory and rerun it; make sure you pass in --debug to the
> command-line.

OK. The new indexing still returns the incomplete 13 Beatles songs. I attach the logs.

You will note the user has a Google Earth installation from a few months ago, but I don't think it interfered; there is no other Google Earth installation on this machine.
Created attachment 72279 [details] Log files from debug run
Weird! There are basically no errors here; it seems as though it's just not seeing some of the files. I am going to whip up a test program for you to try to see if the bug is in our DirectoryWalker code.
Created attachment 72334 [details]
Test program

Compile the program like so:

    mcs -debug test-directorywalker.cs -r:Util.dll

(You might have to specify a full path for it; it comes from beagle, so it'd be something like -r:/usr/lib/beagle/Util.dll)

Then to run it you'll need something like:

    LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono test-directorywalker.exe

Run it in your directory with the thousands of mp3 files. It should output them one by one and at the end say how many files there were. Compare this with the output of:

    find . -maxdepth 1 -type f | wc -l

They should be roughly equal. If not, then we have a problem.
(In reply to comment #10)
> Compile the program like so:
>
> mcs -debug test-directorywalker.cs -r:Util.dll

I don't see an "mcs" on my machine. So I cast about in Synaptic and install the mono-mcs package. Now I run "$ mcs -debug test-directorywalker.cs -r:/usr/lib/beagle/Util.dll" and that works.

So I think you want to see this (or, I guess maybe you don't want to see this):

    google-earth-user@bottom:~$ cd gtkpodmusic/
    google-earth-user@bottom:~/gtkpodmusic$ find . -maxdepth 1 -type f | wc -l
    3220
    google-earth-user@bottom:~/gtkpodmusic$ LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono ../test-directorywalker.exe | wc -l
    320
    google-earth-user@bottom:~/gtkpodmusic$

Let me know if I did it wrong.

-kb
Could you just run this line, from

    google-earth-user@bottom:~/gtkpodmusic$ LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono ../test-directorywalker.exe | wc -l
    320

like this:

    google-earth-user@bottom:~/gtkpodmusic$ LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono ../test-directorywalker.exe

so we can see the output of the test program?

Although on a separate note, that would mean our code is catching about 1/10th of the files in that directory....
So I try this:

    google-earth-user@bottom:~/gtkpodmusic$ LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono ../test-directorywalker.exe > /tmp/318_music_list.txt

And I will attach the output...

-kb
Created attachment 72337 [details] Output from test-directorywalker.exe
You did it absolutely right, and there must be a bug in our directorywalker code. Can you also attach the output of the find command, sans the "| wc -l" ?
Created attachment 72340 [details]
complete list of music files

    google-earth-user@bottom:~/gtkpodmusic$ find . -maxdepth 1 -type f > /tmp/3220_music_list.txt
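With both lists in hand, the names the walker skipped can be pulled out by comparing them. A minimal sketch, assuming both files hold one name per line written in the same form (the test program's exact output format is not shown in this report, so its list may need trimming first); `missing_from` is a hypothetical helper name.

```shell
# missing_from WALKER_LIST FIND_LIST
# Print the lines that appear in FIND_LIST but not in WALKER_LIST.
# Both inputs are sorted in the C locale first, because comm requires
# sorted input and non-ASCII names sort differently across locales.
missing_from() {
    walker_sorted=$(mktemp) || return 1
    find_sorted=$(mktemp) || return 1
    LC_ALL=C sort "$1" > "$walker_sorted"
    LC_ALL=C sort "$2" > "$find_sorted"
    # -1 drops lines unique to the first file, -3 drops lines common to both.
    LC_ALL=C comm -13 "$walker_sorted" "$find_sorted"
    rm -f "$walker_sorted" "$find_sorted"
}
```

Something like `missing_from /tmp/318_music_list.txt /tmp/3220_music_list.txt` would then list the skipped names, which makes it easier to spot what they have in common.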
Created attachment 72344 [details] Updated test program, this time precompiled I've attached a precompiled test program, so you don't need to bother with the mcs step. Can you please run it and attach the output? Also the results of piping it through "grep ^got | wc -l" would be helpful.
    google-earth-user@bottom:~/gtkpodmusic$ LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono ../test-directorywalker.exe

    Unhandled Exception: System.IO.FileNotFoundException: No such file or directory ----> Mono.Unix.UnixIOException: No such file or directory
    in <0x00013> Mono.Unix.UnixMarshal:ThrowExceptionForLastError ()
    in <0x00063> Beagle.Util.DirectoryWalker2:readdir (IntPtr dir, System.Text.StringBuilder buffer)
    in <0x0002f> Beagle.Util.DirectoryWalker2+FileEnumerator:MoveNext ()
    in <0x000ec> X:Main ()

Did I run it the wrong way? (Did you build it right for me to use?)

-kb
It's supposed to do that, but it should output more data too. Lemme update the test program just in case.
Created attachment 72348 [details] Try this one, precompiled
Created attachment 72350 [details]
Output from latest test-directorywalker.exe

I don't know what the output means, but it is very interesting--I bet you have it cornered once you see this.

-kb, the Kent who can't spell "output".

    google-earth-user@bottom:~/gtkpodmusic$ LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono ../test-directorywalker.exe > /tmp/test-directory-outout.txt

    Unhandled Exception: System.IO.FileNotFoundException: No such file or directory ----> Mono.Unix.UnixIOException: No such file or directory
    in <0x00013> Mono.Unix.UnixMarshal:ThrowExceptionForLastError ()
    in <0x0007b> Beagle.Util.DirectoryWalker2:readdir (IntPtr dir, System.Text.StringBuilder buffer)
    in <0x0002f> Beagle.Util.DirectoryWalker2+FileEnumerator:MoveNext ()
    in <0x000ec> X:Main ()
Yeah, that is very, very interesting. Thanks for the info, I'll dig into it.
Created attachment 72352 [details] Hopefully a test program which fixes it Can you try this one? Also precompiled.
Created attachment 72353 [details]
Output from latest test-directorywalker.exe

Sounds like you have the fix. 'fess up, what was wrong?

-kb

    google-earth-user@bottom:~/gtkpodmusic$ LD_LIBRARY_PATH=/usr/lib/beagle:$LD_LIBRARY_PATH MONO_PATH=/usr/lib/beagle mono ../test-directorywalker.exe > /tmp/test-directory-output2.txt
Apparently mono was transparently resizing the internal buffer we were using to get the file names to something pretty small. I'm not entirely sure why, but adding a call to StringBuilder.EnsureCapacity() before we used it fixed it. The reason why I never saw it is because I only ever tried it with short filenames: up to 6 characters in length. You probably would only ever see this if you had some extremely long filenames. I just checked in the fix to CVS. Thanks a ton for your help tracking this down!
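Since the explanation above ties the bug to very long file names, a quick check for whether a given directory has any may be useful. This is only a sketch: `longest_names` is a hypothetical helper, and the report does not say how long a name has to be before it triggers the problem.

```shell
# longest_names DIR [COUNT]
# Print the COUNT (default 10) longest file names in DIR, longest first,
# as "<length><TAB><name>". Length is whatever the shell reports for
# ${#name} (characters in most shells with a UTF-8 locale).
longest_names() {
    dir=$1
    count=${2:-10}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        printf '%s\t%s\n' "${#name}" "$name"
    done | sort -rn | head -n "$count"
}
```

Running `longest_names ~/gtkpodmusic 5` on the reporter's directory would presumably surface the roughly 230-character name mentioned earlier in the thread.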
*** Bug 341841 has been marked as a duplicate of this bug. ***
> I just checked in the fix to CVS. Thanks a ton for your help tracking this > down! Thanks for the quick fix. -kb, the Kent who hopes this might be the kind of bug that would arrive in a bug fix in Ubuntu.
Joe: Maybe this should make us do a micro-release (like 0.2.9.1) just for packagers since this is a pretty big showstopper for a lot of people?
I'd prefer to push through with another few bug fixes and do a 0.2.10 in a week or two. It's only affected 2 or 3 people (at least, who have reported it) from what I can tell.
(btw, the reason being that a 0.2.9.1 release isn't really any less work than a 0.2.10 release.)
Not an issue, sounds good, just wanted to put it out there that we should get it out pretty soon. Any specific things on that bug list that need help?
Would you be so kind as to put the link to the patch here?
http://cvs.gnome.org/bonsai/cvsquery.cgi?branch=&dir=beagle&who=joeshaw&date=explicit&mindate=2006-09-06%2018:05&maxdate=2006-09-06%2018:07