GNOME Bugzilla – Bug 476761
Profile zipfiles have duplicated files
Last modified: 2008-02-21 15:45:32 UTC
Igor Morgado reported this. The profile zipfiles have duplicated files in them; this makes them bigger than needed and incorrect. See the attached part of the debug log for dir-monitor;files-source;storage. When the storage module is asked to add a directory, it recurses within the directory, adding everything it finds to the zipfile. However, it is the *file* module that knows what should be added (since dir-monitor already told it the exact files to use). The storage module should not recurse when adding directories.
Created attachment 95559 [details] sabayon-duplicated-files.txt
I looked into this. It looks like it is the zip_directory() method in storage.py which seems to be recursing in a blind way. When a non empty directory is added to a profile, a entry for that goes in the directory section of the metadata, and yet another entry goes into the file section of the metadata, the second entry being for the file inside the directory. When zip_foreach() is called, the file inside the directory is added to the zip multiple times - once while zip_directory (called by zip_foreach) recurses inside the parent directory, and another time when zip_foreach() directly adds the file (due to the entry for the file being in the file section of the metadata). I'm attaching a patch which should make zip_directory() behave.
Created attachment 103469 [details] [review] Patch to make zip_directory() behave
Thanks for tracking this down, Sayamindu :) This looks good; it needs a bit of work: - Doing "filename in DIRECTORIES_TO_IGNORE" and the same for directories is not quite enough. I think it's time to pull out should_ignore_*() from dirmonitor.py and put it somewhere else, maybe in util.py. Then, the rest of the code should use those functions. - The patch calls save_zip.namelist() on every file. This is a bit of overkill, and will give us O(n^2) performance in the end... can we call namelist() only once at the beginning of the save() process, and then update our "own" list with the newly-added files? I'm wondering how to have an automated test for the storage module that lets us catch all of these errors in a single spot... that's a good weekend project, hint hint :)
(In reply to comment #4) > Thanks for tracking this down, Sayamindu :) > > This looks good; it needs a bit of work: > > - Doing "filename in DIRECTORIES_TO_IGNORE" and the same for directories is not > quite enough. I think it's time to pull out should_ignore_*() from > dirmonitor.py and put it somewhere else, maybe in util.py. Then, the rest of > the code should use those functions. > > - The patch calls save_zip.namelist() on every file. This is a bit of > overkill, and will give us O(n^2) performance in the end... can we call > namelist() only once at the beginning of the save() process, and then update > our "own" list with the newly-added files? > > I'm wondering how to have an automated test for the storage module that lets us > catch all of these errors in a single spot... that's a good weekend project, > hint hint :) > I'm not sure whether it would be sane to move around should_ignore_* at this stage of the release cycle. However, I think it will be a good idea to fix at least the duplicate file issue - am attaching a patch for that (I have modified it as per your comment, and we have our own list to keep track of files now).
Created attachment 105513 [details] [review] A less resource intensive version of the previous patch.
Created attachment 105555 [details] [review] sabayon-bgo476761-make-should-ignore-generic.diff Sayamindu, does this help implement what you want? The patch pulls out should_ignore_* from dirmonitor.py, and puts them in util.py for anything to use.
(In reply to comment #7) > Created an attachment (id=105555) [edit] > sabayon-bgo476761-make-should-ignore-generic.diff > > Sayamindu, does this help implement what you want? The patch pulls out > should_ignore_* from dirmonitor.py, and puts them in util.py for anything to > use. > Thanks Federico for this :-). Only that util.py should import fnmatch and the should_ignore_dir() method should call "return should_ignore_dir (base_dir, ignore_dir_list, parent)" instead of "return should_ignore_dir (parent)" during recursion.
Testing before committing is for sissies :) Thanks for catching that; it's in svn now.
Created attachment 105702 [details] [review] Latest version of the patch, making use of util.should_ignore_* This patch makes use of util.should_ignore_*. It also has it's own list of files in the zipfile and updates the list as and when files are added. Ok to commit ?
Based on the discussion with Johnny on IRC, I updated the patch to use util.get_home_dir() instead of os.environ["HOME"], and committed the patch. Closing the bug.