GNOME Bugzilla – Bug 303409
FilterCHM and Supporting Class Patch
Last modified: 2005-06-08 18:53:31 UTC
FilterCHM and supporting files. FilterCHM is a filter for Microsoft Compiled HTML Help (.chm) files. It uses libchm. I did not see any changes to FilterHtml on the no-dbus-branch, so I think there will not be any problems with this source code.
Created attachment 46140 [details] [review] Patch for the CHM file format filter
Created attachment 46400 [details] [review] FilterCHM patch corrections
Add an entry in TileDoc.cs, so we can see the CHM title, if any.
The filter uses an external library (libchm), wrapped by CHMFile.cs, so we need a way to check that the user has it. I really don't know how to do that. Should it be an option (--enable-chm), or should we check whether the library is on the system, or both? I've added a skeleton check to configure.in. It defines HAS_LIBCHM; if it is defined, the necessary files will be included in the compilation.
About the filter: the first idea was to create some kind of FilterMarkup abstract class and put all the *common* functions there (as vvaradhan and dsd said). BUT FilterCHM uses almost all the functions in FilterHtml. FilterCHM is NOT a FilterHtml; FilterCHM 'uses' the HTML filter. But AFAIK we can't code it that way *right now*, so I had to change the access level of some FilterHtml methods to protected and modify FilterHtml's constructor to prevent filter collisions.
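To illustrate the "uses" versus "is a" distinction raised above, here is a minimal Python sketch of the composition approach. It is not Beagle code: `HtmlTextExtractor`, `ChmFilter`, and the `pages` mapping are hypothetical names standing in for the HTML filter, the CHM filter, and whatever the CHM wrapper would extract.

```python
from html.parser import HTMLParser


class HtmlTextExtractor(HTMLParser):
    """Hypothetical stand-in for an HTML filter: keeps the title and other text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


class ChmFilter:
    """Holds no HTML parsing logic itself; it delegates to an extractor per page."""

    def __init__(self, pages):
        # pages: assumed mapping of subfile name -> HTML text (illustrative only).
        self.pages = pages

    def extract_text(self):
        out = []
        for name in sorted(self.pages):
            extractor = HtmlTextExtractor()  # fresh parser state for each subfile
            extractor.feed(self.pages[name])
            extractor.close()
            out.extend(extractor.chunks)
        return out
```

With this shape the HTML filter needs no protected methods or constructor changes, because the CHM filter only calls its public interface.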
*** Bug 302924 has been marked as a duplicate of this bug. ***
I tested the filter on the CHM version of the MySQL reference manual (http://dev.mysql.com/get/Downloads/Manual/manual.chm/from/pick), a 3.3M file. It pegged my CPU for 5 minutes before I finally killed the process. Are you able to filter the MySQL manual on your box, or is something somewhere extremely inefficient? In case it matters, I'm using chmlib 0.35.
At first I thought the CHMFile.cs code was making the process inefficient, but after some testing it looks like the FilterHtml parsing methods are (for some reason) making the process slow (too slow). It's not only a problem with that particular .chm file, but this one took quite a long time to parse: about 8 minutes. I will work on that tonight, see what is going on, and try to make it faster.
Some results from last night. First of all, a CHM file is a collection of HTML files (often compressed). For example, the MySQL manual.chm has approx. 7.1 MB of HTML content to parse. I didn't know why it spent so much time parsing, but my tests show that the time goes into HtmlAgilityPack or the FilterHtml methods used to parse those files. I did a little crude profiling using the logger, and the HTML parsing is what took all the time. Extracting the files from the CHM is quite fast. I think we should be less CPU-aggressive when parsing HTML. How to do that? I don't know... The results depend on the CHM file, because each may have a different HTML structure. Here is what I got:
05-05-19 21.55.16.18 09874 IndexH DEBUG: FilterCHM: Parsing:manual.chm
05-05-19 22.06.57.49 09874 IndexH DEBUG: FilterCHM: Finished
Approx. 12 minutes parsing manual.chm, 3.3 MB (manual.chm has 7.1 MB of HTML in 236 HTML files)
05-05-19 22.17.26.73 08170 IndexH DEBUG: FilterCHM: Parsing:olib.chm
05-05-19 22.20.20.21 08170 IndexH DEBUG: FilterCHM: Finished
Almost 3 minutes on a 1.2 MB CHM (olib.chm has 1.4 MB of HTML in 128 HTML files)
05-05-19 23.27.43.94 08708 IndexH DEBUG: FilterCHM: Parsing:afact.chm
05-05-19 23.30.10.99 08708 IndexH DEBUG: FilterCHM: Finished
Almost 4 minutes on a 40 MB CHM (afact.chm has 2.4 MB of HTML in 146 HTML files)
But that does not prove anything, so I dumped the contents of all of manual.chm's HTML files into one big HTML file and ran beagled. The same thing happened (even worse): FilterHtml behaves pretty badly when parsing big files. Has anyone tested the HTML filter before? I'd really like to hear suggestions for making the filter less of a CPU killer. Sleeping for a short time between HTML files may be an option, but I think that is pretty lame and probably would not work.
NOTE: For some reason beagled parsed some files twice, making the process even more painful. This is not exclusive to CHM files.
I think the only way to make this filter faster is to improve FilterHtml or HtmlAgilityPack.
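The logger-based timing above can be checked mechanically. The following Python sketch pairs each "Parsing:" line with the next "Finished" line and reports the elapsed seconds. It assumes the timestamp layout is yy-MM-dd HH.mm.ss plus a fractional-second field, which is inferred from the pasted lines and not confirmed by the source. The function names are illustrative.

```python
from datetime import datetime

# Log lines copied from the comment above.
LOG_LINES = [
    "05-05-19 21.55.16.18 09874 IndexH DEBUG: FilterCHM: Parsing:manual.chm",
    "05-05-19 22.06.57.49 09874 IndexH DEBUG: FilterCHM: Finished",
    "05-05-19 22.17.26.73 08170 IndexH DEBUG: FilterCHM: Parsing:olib.chm",
    "05-05-19 22.20.20.21 08170 IndexH DEBUG: FilterCHM: Finished",
    "05-05-19 23.27.43.94 08708 IndexH DEBUG: FilterCHM: Parsing:afact.chm",
    "05-05-19 23.30.10.99 08708 IndexH DEBUG: FilterCHM: Finished",
]


def parse_stamp(line):
    """Parse the leading date and time fields (assumed format, see lead-in)."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(date_part + " " + time_part, "%y-%m-%d %H.%M.%S.%f")


def elapsed_per_file(lines):
    """Map each parsed file name to the seconds between its start and finish lines."""
    results = {}
    current_name, started = None, None
    for line in lines:
        stamp = parse_stamp(line)
        if "Parsing:" in line:
            current_name = line.rsplit("Parsing:", 1)[1].strip()
            started = stamp
        elif line.rstrip().endswith("Finished") and current_name is not None:
            results[current_name] = (stamp - started).total_seconds()
            current_name = None
    return results


if __name__ == "__main__":
    for name, seconds in elapsed_per_file(LOG_LINES).items():
        print(f"{name}: {seconds / 60:.1f} min")
```

Dividing the HTML size of each archive by its elapsed time would give a rough parse throughput, which is a more comparable figure across files than wall-clock minutes alone.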
I think the key thing in the last comment is the statement "manual.chm has 7.1 mb of html files in 236 html files" Right now, beagle's filtering infrastructure isn't well-suited for dealing with files containing lots of subfiles. Fixing this is one of our goals for the current development cycle... our target is to be able to support zip and tar files well, and the improvements we need to make should also help here. The problem is that all of the scheduling happens on the daemon-side, and there is no way for the helper process (where the indexing happens) to give hints back to the scheduler about needing to yield but do more work in the future. It would be nice to make the HtmlAgilityPack faster, but that is probably a harder problem. That code is pretty complicated --- parsing the broken html that is out in the wild is tricky. But I'm not sure if it has ever been carefully profiled, so there might be some easy speed-ups in there.
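The idea of a worker that yields but signals it has more work to do can be sketched with Python generators. This is only an in-process illustration of cooperative scheduling, not Beagle's daemon/helper design. `index_archive` and `run_round_robin` are hypothetical names, and the real work (parsing a subfile) is replaced by producing a label.

```python
from collections import deque


def index_archive(name, subfiles):
    """Process one subfile per step, then hand control back to the scheduler."""
    for sub in subfiles:
        # Real work (e.g. parsing one HTML subfile) would happen here.
        yield f"{name}:{sub}"


def run_round_robin(tasks):
    """Advance each task by one step in turn; requeue tasks that still have work."""
    queue = deque(tasks)
    order = []
    while queue:
        task = queue.popleft()
        try:
            order.append(next(task))
        except StopIteration:
            continue  # task finished, do not requeue it
        queue.append(task)
    return order


if __name__ == "__main__":
    tasks = [
        index_archive("manual.chm", ["a.html", "b.html", "c.html"]),
        index_archive("note.txt", ["body"]),
    ]
    print(run_round_robin(tasks))
```

In this sketch a large archive no longer blocks a small file queued behind it, because the scheduler regains control after every subfile.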
Meanwhile, I can send a minimalistic version of this filter that just gets the document title and parses only the main page (a kind of index.html) and the TOC tree (I have not done this yet). This can be accomplished quite fast. When the filtering infrastructure is well suited to this kind of job, I will resend the full filter. I will work on this tonight.
That sounds like a good first step. Thanks.
Created attachment 46985 [details] [review] Minimalistic FilterCHM This is a filter that only reads the title, the TOC file, and the default file (a kind of index.html for CHM files).
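As a rough illustration of how cheap the TOC step can be, here is a Python sketch that pulls entry names out of an HTML Help contents (.hhc-style) file. It assumes the TOC stores each entry as a `<param name="Name" value="...">` element, and it does not reproduce the attached filter. `TocNames` and `toc_entry_names` are hypothetical names.

```python
from html.parser import HTMLParser


class TocNames(HTMLParser):
    """Collects the value of every <param name="Name" ...> element."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "param":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; the "Name" marker is in the value.
        if (attr.get("name") or "").lower() == "name" and attr.get("value"):
            self.names.append(attr["value"])


def toc_entry_names(toc_html):
    """Return TOC entry names in document order (assumed sitemap layout)."""
    parser = TocNames()
    parser.feed(toc_html)
    parser.close()
    return parser.names


if __name__ == "__main__":
    sample = """
    <ul>
      <li><object type="text/sitemap">
        <param name="Name" value="Introduction">
        <param name="Local" value="intro.html">
      </object></li>
    </ul>
    """
    print(toc_entry_names(sample))
```

Only one small file is parsed here, which is why a title-plus-TOC filter avoids the multi-minute runs reported earlier in this thread.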
The minimalistic CHM filter is now in CVS. Thanks!