GNOME Bugzilla – Bug 314844
[PATCH] Test a new Source Filter : FilterTex.cs
Last modified: 2007-09-28 15:54:01 UTC
I've created a new FilterSource to index TeX files. Basically, I've just customized the FilterC.cs file, adding the LaTeX keywords and a new entry in the FilterSource "LangType". I need help linking this new filter into Beagle and testing it (I cannot compile the CVS version). Thanks for your help and feedback (see FilterSource.cs and FilterTex.cs in the attachments).
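For readers following along: the FilterC-style source filters keep a static keyword table and set a language type when a file is opened. A LaTeX variant along the lines described above might look roughly like this (a minimal sketch; LangType.Tex_Style and the keyword subset are assumptions based on the description, not code from the attachment):

    using System.IO;

    namespace Beagle.Filters {

        public class FilterTex : FilterSource {

            // A small sample of LaTeX commands to treat as keywords
            // rather than plain content. (Illustrative subset only.)
            static string [] strKeyWords = {
                "begin", "end", "documentclass", "usepackage",
                "section", "subsection", "chapter", "item",
                "emph", "textbf", "cite", "ref", "label"
            };

            override protected void DoOpen (FileInfo info)
            {
                // Hypothetical new entry in FilterSource's LangType,
                // as the comment above describes adding one.
                SrcLangType = LangType.Tex_Style;
            }
        }
    }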
Created attachment 51559 [details] New TeX filter
Created attachment 51560 [details] new filter source
That's a good start. However, I wouldn't classify a TeX filter under FilterSource, which is intended for programming/scripting languages. TeX should fall under the category of word-processor documents, as it provides more information than just textual content. I once had a simple Texi filter that parsed TeX files according to the type of keywords; however, I lost the file when I was playing with my disk partitions. Here is the simple outline:

1) TeX has two types of commands: blocks that end with @end <block-name>, and {}.
2) There are also some "escape" commands like "@@", "@{", etc.
3) Other non-block commands like @Title, @Chapter, etc. (IIRC).

Each of these types of commands has to be processed separately (see the sketch below). FilterOpenOffice.cs would be a good place to start to understand how the word-processor filters work. Thanks again for your patch. :)
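To make the outline concrete, here is a rough sketch of how the three command classes might be told apart while scanning (purely illustrative; the names and the block-command list are made up):

    enum TexiCommandKind { Block, Escape, Simple }

    // Classify a command that starts at '@': block commands open an
    // environment closed by "@end <name>", escapes like "@@" or "@{"
    // stand for literal characters, and everything else is a simple
    // one-shot command like @chapter or @title.
    static TexiCommandKind Classify (string token)
    {
        if (token == "@@" || token == "@{" || token == "@}")
            return TexiCommandKind.Escape;

        // Illustrative subset; the real list is much longer.
        string [] blockCommands = { "@example", "@itemize", "@table", "@menu" };
        if (System.Array.IndexOf (blockCommands, token) >= 0)
            return TexiCommandKind.Block;

        return TexiCommandKind.Simple;
    }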
Actually, the TeX filter should be placed in the same place as FilterHTML, in the "Text Documents" section. In both cases, we have structured text with markup that produces a formatted document. The FilterOpenOffice.cs filter seems to only consider the OpenOffice mimetypes. I'll try to transform the FilterSource + FilterTex combination into a standalone "FilterTex" like the HTML one... Just a question: the keywords in FilterTex are LaTeX keywords. Are there specific mime types to distinguish TeX and LaTeX files?
Well, when you say "structured text" with markup: yes, AbiWord/RTF/OpenOffice documents (a zip of a bunch of ASCII files) are also just text files with appropriate markup that they understand. Mimetype detection: Beagle uses the gnome-vfs methods for mimetype detection. Just use Nautilus or gnomevfs-info to find the mime type to use in your filter.
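For example, once gnomevfs-info reports the type, the filter's constructor registers it the way the existing filters do. A sketch (text/x-tex is what gnome-vfs typically reports for both TeX and LaTeX sources; the BibTeX entry is an assumption):

    public FilterTex ()
    {
        // Register the flavors this filter handles. One entry may
        // cover both TeX and LaTeX, since they share a mime type.
        AddSupportedFlavor (FilterFlavor.NewFromMimeType ("text/x-tex"));
        AddSupportedFlavor (FilterFlavor.NewFromMimeType ("text/x-bibtex"));
    }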
I'm attaching a patch against version 1.0.1, which I use on my system here, i.e. the Debian sid source package plus this patch. It works quite well.
Created attachment 53939 [details] [review] Patch for beagle 1.0.1 to index TeX files
I believe the patch works, but a LaTeX/TeX file contains a lot more information than simple words. It has information about the title, authors, their emails, chapter/section names and, most importantly, the bib entries (finding citations is the cool thing a lot of data-mining people are trying to solve). I don't think the FilterSource idea of just finding non-keyword words is the correct approach; I'm just not sure what the right way is either :(
I agree with you, but only a little information can be (and should be) extracted in a canonical way (title, author, section names...); see the sketch below. However, it may be difficult to check emails and citations, since either the syntax is not standard or there may be file dependencies (bibitems are usually stored in an external BibTeX file). I'll try to set up a wiki page with ideas/comments on this filter.
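As a sketch of the canonical part, pulling \title{} and \author{} out with a regular expression is straightforward (a hypothetical helper, not taken from the attached filter; it assumes the argument contains no nested braces):

    using System.Text.RegularExpressions;

    // Extract the argument of a simple one-argument LaTeX command,
    // e.g. \title{...} or \author{...}. Naive: no nested braces.
    static string ExtractCommandArg (string source, string command)
    {
        Match m = Regex.Match (source, @"\\" + command + @"\s*\{([^}]*)\}");
        return m.Success ? m.Groups [1].Value.Trim () : null;
    }

    // Usage:
    //   string title  = ExtractCommandArg (tex, "title");
    //   string author = ExtractCommandArg (tex, "author");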
A wiki will be great. Please paste the link in bugzilla.
I've just created a page in the Beagle wiki about the FilterLaTeX specifications. link: http://www.beagle-project.org/FilterLaTeX_Spec
Created attachment 58010 [details] [review] Patch for beagle 2.0 to support TeX and .sty files. Patch against beagle 2.0, tested with the current Debian sid package. If anyone is interested, I can put up an apt repository for a Beagle patched with TeX support.
Created attachment 60603 [details] FilterTeX based on FilterRTF. DISCLAIMER: This is my first attempt at C#. I think there is a bug where BUG is marked in the file, concerning a change of the stack without any pushes or pops. What do you think?
*** Bug 350571 has been marked as a duplicate of this bug. ***
Text content from TeX/LaTeX files can be extracted relatively easily (and better) with FilterExternal and standalone programs like untex. Unless the filter is able to extract specific metadata like author/title/bib entries, a separate filter doesn't make sense.
Created attachment 82421 [details] Updated FilterTeX. This version parses almost any complex TeX file I could throw at it. It extracts metadata (author, title, abstract) and hot text (emph, section, bibitem, etc.). Suggestions and other feedback are welcome.
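For context, "hot" text in Beagle filters is emphasized content that gets extra weight at query time. A minimal sketch of how a filter typically wraps it, assuming the HotUp/HotDown pattern used by the other filters (EmitHot is a hypothetical helper inside the filter class):

    // Called when the parser hits an emphasized construct such as
    // \emph{...}, \section{...} or \bibitem{...}: wrap its text in
    // a hot region so it is weighted higher when searching.
    void EmitHot (string text)
    {
        HotUp ();          // begin hot region
        AppendText (text); // indexed with extra weight
        HotDown ();        // end hot region
    }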
Created attachment 95956 [details] Updated FilterTeX. This version fixes a bug that caused the filter to omit spaces when collating words, and makes it possible to compile the filter from the Beagle source, courtesy of Dr. Robert Moniot. If you want to compile outside the source tree, you need to add the assembly line from AssemblyInfo.cs to the FilterTeX.cs file and compile according to the instructions in the latter.
I will see how it does with some of my LaTeX files ;-). If all goes well, I will check this in. Assigning to myself.
I don't see where it is extracting metadata like author and title. Also, it outputs all the math-mode symbols verbatim (i.e. with $2^2$). This is no better than just extracting the raw text of the file. Check the sample output of beagle-extract-content below. I can attach the LaTeX document, but you can probably create one anyway; it's a simple math-heavy LaTeX document with a \title{} and an \author{}.

Calling TeXParse (true)
Filter: Beagle.Filters.FilterTeX (determined in .83s)
MimeType: text/x-tex
Properties:
  Timestamp = 2007-08-01 00:07:48 (Utc)
Calling TeXParse (false)
Content:
The circuit to simulate a multi-$Z$ layer looks like this: \input {circuit-diagram-top-level.pdf_t} The top $n$ qubits are the original data qubits. The rest are ancilla qubits. All the qubits are arranged in $n$ blocks $B_1,\ldots,B_n$ of $n$ qubits per block. The qubits in block $B_i$ are labeled $b_{i1},\ldots,b_{in}$. Each $A_i$ subcircuit looks like this: \input {circuit-diagram-A-i.pdf_t} The qubits $c_{i1},\ldots,c_{in}$ are control qubits. For $1\le j\le n$, the qubits $b_{ij}$ and $c_{ij}$ are connected to a Toffoli gate with an ancilla as the target. Note that the controlled multi-$Z$ gate has its control on the $i$th such ancilla in $A_i$, with targets on all the other ancill\ae\ in $A_i$. Here is the state evolution from $\ket{\vec{d}} = \ket{d_1\cdots d_n}$. I'm suppressing the $c_{ij}$ qubits and ancill\ae\ internal to the $A_i$ subcircuits in the ket labels. Note that after the first layer of fanouts, each qubit $b_{ij}$ carries the value $d_j$.
\ket{\vec{d},\vec{0},\ldots,\vec{0}}
  \mapsto \ket{\vec{d},\vec{d},\ldots,\vec{d}}
  \mapsto (-1)^{\sum_i d_i c_{ii} \left( \sum_{j\ne i} d_j c_{ij} \right)} \ket{\vec{d},\vec{d},\ldots,\vec{d}}
  \mapsto (-1)^{\sum_i d_i c_{ii} \left( \sum_{j\ne i} d_j c_{ij} \right)} \ket{\vec{d},\vec{0},\ldots,\vec{0}}
To simulate some multi-$Z$ gate whose control is on the $i$th qubit, say, we do this in block $B_i$ by setting $c_{ii}$ to $1$ and setting $c_{ij}$ to $1$ for every $j$ where the $j$th qubit is a target of the gate. All the other $c$-qubits in $B_i$ are set to $0$. We can do this in separate blocks for multiple gates on the same layer, because no two gates can share the same control qubit. Any $c$-qubits in unused blocks are set to $0$.
Hmm... I did something wrong; the author and title seem to be extracted now. Weird. Anyway... anything about the $math$ stuff? Seems to me, either drop the entire text within $...$ or don't print the '$' (just ignore them, i.e. ab$c$d becomes abcd). The rest looks good. BTW, I had to make several syntactic changes, so if you make any change, can you attach a "diff -u" against the previous attachment? Thanks.
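Both options are cheap to implement. A sketch of the two behaviours, assuming balanced, non-escaped '$' delimiters (hypothetical helpers, not code from the attachment):

    using System.Text.RegularExpressions;

    // Option A: keep the math text, drop only the delimiters,
    // so "ab$c$d" becomes "abcd".
    static string KeepMathText (string line)
    {
        return line.Replace ("$", "");
    }

    // Option B: drop the entire inline math span,
    // so "ab$c$d" becomes "abd".
    static string DropMathSpans (string line)
    {
        return Regex.Replace (line, @"\$[^$]*\$", "");
    }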
I left the math-mode markup in on purpose, so as to add LaTeX rendering of it later in Kerry, for instance. I know that it isn't easy or well-defined; nevertheless, I think the outcome will be worth the trouble. The filter does two passes, one for the metadata (TeXParse (true)) and one for the text (TeXParse (false)); the output of the latter is the one you posted above. Could you post your changes? Thanks.
I will post my changes later tonight. They were mostly style issues (the old RTF filter was horrible in style matters). A few comments now:

1) Why do we need to parse it twice? I changed it to make only one pass and it seems to work fine. Note that filters are _allowed_ to return text in DoPullProperties (the text will be stored and processed later); see the sketch after this list.
2) For LaTeX documents, the raw LaTeX text is more suitable to show the context, and in that sense using the $...$ in snippets is better. So snippet mode should be false (rather, original-is-text marked true).
3) Given #2, $2$ will probably be ignored by the Lucene analyzer anyway and won't be stored in the index. So it doesn't really matter if we keep the $ or remove it. To see what gets stored in Lucene, use --analyze (or something like this) in beagle-extract-content.
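On point 1, a rough sketch of the one-pass shape, using the filter API the way the other filters do (Token and ParseDocument are hypothetical stand-ins for the filter's own parser):

    override protected void DoPullProperties ()
    {
        // Single pass: walk the document once, emitting properties
        // as metadata commands are met and appending body text as
        // it goes. Filters may emit text here; Beagle stores it
        // and processes it later.
        foreach (Token t in ParseDocument ()) {
            if (t.Command == "title")
                AddProperty (Beagle.Property.New ("dc:title", t.Argument));
            else if (t.Command == "author")
                AddProperty (Beagle.Property.New ("dc:author", t.Argument));
            else
                AppendText (t.Text);
        }

        Finished ();
    }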
1) There's no reason for this; I probably copied it verbatim from the RTF filter, so you may find more blunders of this type. I don't really know much about Beagle, only what appears in the filter anatomy tutorial.
2) I don't really know whether snippet mode or original-is-text is preferred, or what exactly their function is.
3) I used --tokenize and the output contained constructs like $k$; I also see the markup in Kerry, so I assume that it does what I intended. Is this correct?

If the property text (e.g. the abstract) is too large for the filesystem extended attribute to store (I assume the maximum size is 1024 characters; I suspect that in fact it is half that, because of character encodings), then it is stored as plain text, not as a Beagle property. Is there a better way to solve this?
(1) and (2): I figured that out. The changes I made are mostly of that nature. (3) Not --tokenize; there should be another option, --analyze or something like this. It could be that it's only in the svn trunk. The property texts are not stored in filesystem attributes (now I know what you meant by the comment in the source), so no worries there. You don't even need to make any checks; just throw everything into the property text and it will be fine.
Created attachment 96203 [details] Proposed for inclusion. I would like to commit this one; please test. About the $...$ issue, do whatever requires less processing; functionality-wise they will cause exactly the same behaviour.
Okie. Checked in the filter in r3996. Yay, finally :) Thanks, guys. Future fixes and requests should be filed as new bugs (w/ patches, please).
I'm sorry for the delay; I couldn't make it work with beagle-0.2.18 and was trying to find out the reason (I think TextReader is at fault here; I was unable to get any output from the filter). It works with the svn version, with the difference that the $ symbols are missing in some cases and there is no hot content. Is there anything I can do to fix this? Thanks again for your patience and effort.
I don't remember offhand what could have changed in trunk since the 0.2.x branch, but I will try to investigate. In any case, this is functionally no different from the one you posted (it's the same basic code), so you can keep using yours till 0.3.0 comes out. I might have forgotten to undo some of the '$' changes; I will try to remember what I did. HotContent is generated but not reported, because Beagle doesn't use hot content (it didn't earlier either, so that part of the code was removed from trunk). BTW, about the LaTeX formatting of the output of beagle-extract-content: you should not be fooled by it. As I mentioned earlier, use beagle-extract-content --analyze to see what content is actually stored in Beagle. Again, only the svn trunk has this option, but the behaviour is the same for 0.2.18.