Bug 303409 – FilterCHM and Supporting Class Patch


Summary:	FilterCHM and Supporting Class Patch


Status:	RESOLVED FIXED

Product:	beagle
Classification:	Other
Component:	General
Version:	0.0.x
Hardware:	Other Linux

Importance:	High enhancement
Target Milestone:	---
Assigned To:	Beagle Bugs
QA Contact:	Beagle Bugs

URL:
Whiteboard:

Duplicates:	302924 (view as bug list)
Depends on:
Blocks:

Reported:	2005-05-07 20:41 UTC by Miguel Fernando Cabrera
Modified:	2005-06-08 18:53 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch for the CHM file format filter (18.44 KB, patch) 2005-05-07 20:42 UTC, Miguel Fernando Cabrera
FilterCHM patch corrections (19.79 KB, patch) 2005-05-13 17:35 UTC, Miguel Fernando Cabrera
Minimalistic FilterCHM (22.57 KB, patch) 2005-05-28 22:23 UTC, Miguel Fernando Cabrera

Description Miguel Fernando Cabrera 2005-05-07 20:41:34 UTC

FilterCHM and supporting files.
FilterCHM is a filter for Microsoft Compiled HTML Help (CHM) files.
It uses libchm.
I did not see any changes to FilterHtml on the no-dbus-branch, so I do not
expect any problems with this source code.

Comment 1 Miguel Fernando Cabrera 2005-05-07 20:42:31 UTC

Created attachment 46140
Patch for the CHM file format filter

Comment 2 Miguel Fernando Cabrera 2005-05-13 17:35:00 UTC

Created attachment 46400
FilterCHM patch corrections

Add an entry in TileDoc.cs so that we can see the CHM title, if any.
The filter uses an external library (libchm), wrapped by CHMFile.cs, so we
need a way to check that the user has it.
I really don't know how to do that. Should it be an option (--enable-chm), or
should we check whether the library is installed on the system? Or both? I've added a
skeleton check to configure.in that defines HAS_LIBCHM. If it is defined, the
necessary files will be included in the compilation.

About the filter:

The first idea was to create some kind of abstract FilterMarkup class
and put all the *common* functions there (as vvaradhan and dsd
suggested). But FilterCHM uses almost all of the functions in FilterHtml.
Conceptually, FilterCHM is NOT a FilterHtml; FilterCHM 'uses' the HTML
filter. AFAIK we can't code it that way *right now*, so I had to
change the access level of some FilterHtml methods to protected and
modify FilterHtml's constructor to prevent filter collisions.
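
To illustrate the 'is a' versus 'uses' distinction above, here is a minimal Python sketch. It is not Beagle's C# code: HtmlFilter, ChmFilterByInheritance and ChmFilterByComposition are hypothetical names, and the tag stripping is a crude placeholder for real HTML parsing.

```python
import re


class HtmlFilter:
    """Hypothetical stand-in for FilterHtml."""

    def extract_text(self, html: str) -> str:
        # Crude tag stripping, only good enough for this illustration.
        return " ".join(re.sub(r"<[^>]+>", " ", html).split())


class ChmFilterByInheritance(HtmlFilter):
    """Inheritance: the CHM filter 'is a' HTML filter."""

    def extract_archive(self, pages: list[str]) -> list[str]:
        # Calls the parent's method directly, so the parent must expose it
        # to subclasses (protected rather than private in C# terms).
        return [self.extract_text(page) for page in pages]


class ChmFilterByComposition:
    """Composition: the CHM filter 'uses' a HTML filter."""

    def __init__(self, html_filter: HtmlFilter) -> None:
        self._html_filter = html_filter

    def extract_archive(self, pages: list[str]) -> list[str]:
        # Delegates each page to the wrapped filter through its public API.
        return [self._html_filter.extract_text(page) for page in pages]


pages = ["<h1>Intro</h1><p>Hello CHM</p>", "<p>Second   page</p>"]
inherited = ChmFilterByInheritance().extract_archive(pages)
composed = ChmFilterByComposition(HtmlFilter()).extract_archive(pages)
print(inherited == composed)
```

Both variants produce the same text; the difference is only in how tightly the CHM filter is coupled to the HTML filter's internals.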

Comment 3 Daniel Drake 2005-05-18 23:24:45 UTC

*** Bug 302924 has been marked as a duplicate of this bug. ***

Comment 4 Jon Trowbridge 2005-05-19 03:30:44 UTC

I tested the filter on the CHM version of the MySQL reference manual
(http://dev.mysql.com/get/Downloads/Manual/manual.chm/from/pick), a 3.3 MB file.
It pegged my CPU for 5 minutes before I finally killed the process. Are you
able to filter the MySQL manual on your box, or is something somewhere extremely
inefficient?

In case it matters, I'm using chmlib 0.35.

Comment 5 Miguel Fernando Cabrera 2005-05-19 21:19:11 UTC

At first I thought the CHMFile.cs code was making the process inefficient,
but after some testing it looks like the FilterHtml parsing methods are (for some
reason) making the process far too slow.
The problem is not limited to that particular .chm file, but that one took a
long time to parse: about 8 minutes on my machine.
I will work on that tonight, see what is going on, and try to make it faster.

Comment 6 Miguel Fernando Cabrera 2005-05-20 16:01:31 UTC

Some results from last night.

First of all, a CHM file is a collection of HTML files (often compressed).
For example, the MySQL manual.chm contains approx. 7.1 MB of HTML content to
be parsed.

I didn't know why it spent so much time parsing, but my tests show that the
time goes to HtmlAgilityPack or the FilterHtml methods used to parse those files.
I did a little crude profiling using the logger, and what took all the time was
the HTML parsing.
Extracting the files from the CHM archive is quite fast.

I think what we should do is be less CPU-aggressive when parsing HTML.
How to do that? I don't know...

All of these results depend on the particular CHM file, because each may have a
different HTML structure.
Here is what I got:
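
As a rough illustration of this kind of logger-based profiling, here is a Python sketch that times two stages and logs each measurement. The stage names and the extract_pages and parse_page functions are hypothetical placeholders, not Beagle code.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s: %(message)s")
log = logging.getLogger("FilterCHM")

# Accumulated wall-clock seconds per stage.
timings = {}


@contextmanager
def timed(stage):
    """Log and accumulate the wall-clock time spent in one stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed
        log.debug("%s took %.3f s", stage, elapsed)


def extract_pages():
    # Placeholder for pulling HTML pages out of the archive.
    return ["<p>page %d</p>" % i for i in range(3)]


def parse_page(html):
    # Placeholder for the HTML parsing step.
    return html.replace("<p>", "").replace("</p>", "")


with timed("extract"):
    pages = extract_pages()

texts = []
for page in pages:
    with timed("parse"):
        texts.append(parse_page(page))

print(timings)
```

Comparing the accumulated totals per stage is enough to see which step dominates, without needing a full profiler.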

05-05-19 21.55.16.18 09874 IndexH DEBUG: FilterCHM: Parsing:manual.chm
05-05-19 22.06.57.49 09874 IndexH DEBUG: FilterCHM: Finished 

Approx. 12 minutes parsing manual.chm, a 3.3 MB file (manual.chm contains 7.1 MB
of HTML in 236 HTML files).


05-05-19 22.17.26.73 08170 IndexH DEBUG: FilterCHM: Parsing:olib.chm
05-05-19 22.20.20.21 08170 IndexH DEBUG: FilterCHM: Finished
Almost 3 minutes on a 1.2 MB CHM (olib.chm contains 1.4 MB of HTML in 128 HTML files).

05-05-19 23.27.43.94 08708 IndexH DEBUG: FilterCHM: Parsing:afact.chm
05-05-19 23.30.10.99 08708 IndexH DEBUG: FilterCHM: Finished
About 2.5 minutes on a 40 MB CHM, going by the timestamps above (afact.chm
contains 2.4 MB of HTML in 146 HTML files).
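
The elapsed times and parsing throughput can be derived from the log timestamps. The following Python sketch assumes the log format is yy-mm-dd hh.mm.ss followed by a fractional-second field, and uses the HTML sizes quoted for each run; the variable names are illustrative only.

```python
from datetime import datetime

# Assumed timestamp layout: yy-mm-dd hh.mm.ss.<fraction of a second>
STAMP = "%y-%m-%d %H.%M.%S.%f"

# (file, start, finish, MB of HTML inside the archive), copied from the runs above
runs = [
    ("manual.chm", "05-05-19 21.55.16.18", "05-05-19 22.06.57.49", 7.1),
    ("olib.chm", "05-05-19 22.17.26.73", "05-05-19 22.20.20.21", 1.4),
    ("afact.chm", "05-05-19 23.27.43.94", "05-05-19 23.30.10.99", 2.4),
]

results = {}
for name, start, finish, html_mb in runs:
    delta = datetime.strptime(finish, STAMP) - datetime.strptime(start, STAMP)
    elapsed = delta.total_seconds()
    # Store seconds elapsed and KB of HTML parsed per second.
    results[name] = (elapsed, html_mb * 1024 / elapsed)
    print(f"{name}: {elapsed:.0f} s, {results[name][1]:.1f} KB/s")
```

Under these assumptions the three runs come to roughly 701 s, 173 s and 147 s, which in every case is on the order of 10 to 17 KB of HTML per second.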

But that alone does not prove anything, so I dumped the contents of all of
manual.chm's HTML files into one big HTML file and ran beagled. The same thing
happened (even worse): FilterHtml behaves pretty badly when parsing big files.
Has anyone tested FilterHtml before? I'd really like to hear some
suggestions for making the filter less of a CPU killer. Sleeping for a short
period between HTML files may be an option, but I think that is pretty lame and
probably would not work.

NOTE: For some reason beagled parsed some files twice, making the process even
more painful. This is not exclusive to CHM files.

I think the only way to make this filter faster is to improve FilterHtml
or HtmlAgilityPack.

Comment 7 Jon Trowbridge 2005-05-20 17:31:02 UTC

I think the key thing in the last comment is the statement "manual.chm has 7.1
mb of html files in 236 html files"

Right now, beagle's filtering infrastructure isn't well-suited for dealing with
files containing lots of subfiles.  Fixing this is one of our goals for the
current development cycle... our target is to be able to support zip and tar
files well, and the improvements we need to make should also help here.

The problem is that all of the scheduling happens on the daemon-side, and there
is no way for the helper process (where the indexing happens) to give hints back
to the scheduler about needing to yield but do more work in the future.
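
To make the idea of "yield now, do more work later" concrete, here is a toy Python sketch using generators. It is not how Beagle's daemon or helper process works: index_archive and run_round_robin are hypothetical names, and everything runs in a single process purely for illustration.

```python
from collections import deque


def index_archive(name, subfiles, log):
    """Hypothetical helper task: index one sub-file, then yield to the scheduler."""
    for subfile in subfiles:
        log.append(f"{name}:{subfile}")  # stand-in for the real filtering work
        yield  # hint to the scheduler: more work remains, others may run now


def run_round_robin(tasks):
    """Toy scheduler: give each task one step at a time until all are done."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, drop it
        queue.append(task)  # task asked for more time later


log = []
run_round_robin([
    index_archive("manual.chm", ["a.html", "b.html", "c.html"], log),
    index_archive("notes.txt", ["notes.txt"], log),
])
print(log)
```

With this arrangement the small file is indexed after the first sub-file of the large archive instead of waiting for the whole archive to finish.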

It would be nice to make the HtmlAgilityPack faster, but that is probably a
harder problem.  That code is pretty complicated --- parsing the broken html
that is out in the wild is tricky.  But I'm not sure if it has ever been
carefully profiled, so there might be some easy speed-ups in there.

Comment 8 Miguel Fernando Cabrera 2005-05-20 21:14:27 UTC

Meanwhile, I can send a minimalistic version of this filter that gets only the
document title and parses only the main page (a kind of index.html) and
the TOC tree (I have not done this yet). This can be accomplished quite fast.
When the filtering infrastructure is well-suited to this kind of job, I
will resend the full filter. I will work on this tonight.

Comment 9 Jon Trowbridge 2005-05-20 21:27:34 UTC

That sounds like a good first step.  Thanks.

Comment 10 Miguel Fernando Cabrera 2005-05-28 22:23:58 UTC

Created attachment 46985
Minimalistic FilterCHM

This is a filter that reads only the title, the TOC file, and the default file
(the CHM equivalent of index.html).
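
The minimalistic approach can be sketched in Python as follows. This is not the attached patch: the archive is modelled as a plain dict rather than read through libchm, the TOC is simplified to ordinary HTML, and TitleAndText and minimal_filter are hypothetical names.

```python
from html.parser import HTMLParser


class TitleAndText(HTMLParser):
    """Collect the <title> and the visible text of the pages fed to it."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())


def minimal_filter(archive, default_page, toc_page):
    """Index only the default page and the TOC, skipping all other pages."""
    parser = TitleAndText()
    for name in (default_page, toc_page):
        parser.feed(archive[name])
    parser.close()
    return parser.title.strip(), parser.text


# Hypothetical archive contents: page name -> HTML.
archive = {
    "index.html": "<html><head><title>Demo Manual</title></head>"
                  "<body><p>Welcome</p></body></html>",
    "toc.html": "<ul><li>Chapter 1</li><li>Chapter 2</li></ul>",
    "chapter1.html": "<p>Never parsed by the minimal filter</p>",
}
title, text = minimal_filter(archive, "index.html", "toc.html")
print(title, text)
```

Because only two pages are parsed regardless of archive size, the cost no longer grows with the number of HTML files in the CHM.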

Comment 11 Jon Trowbridge 2005-06-08 18:53:31 UTC

The minimalistic CHM filter is now in CVS.  Thanks!