GNOME Bugzilla – Bug 314844
[PATCH] Test a new Source Filter : FilterTex.cs
Last modified: 2007-09-28 15:54:01 UTC
I've created a new FilterSource to index TeX files. Basically, I've just customized the FilterC.cs file, adding the LaTeX keywords and a new entry in the FilterSource "LangType". I need help linking this new filter into Beagle and testing it (I cannot compile the CVS version). Thanks for your help and feedback (see FilterSource.cs and FilterTex.cs in the attachments).
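For readers following along: the FilterC-style source filters keep a static keyword table and set a language type when a file is opened. A LaTeX variant along the lines described above might look roughly like this (a minimal sketch; LangType.Tex_Style and the keyword subset are assumptions based on the description, not code from the attachment):

    using System.IO;

    namespace Beagle.Filters {

        public class FilterTex : FilterSource {

            // A small sample of LaTeX commands to treat as keywords
            // rather than plain content. (Illustrative subset only.)
            static string [] strKeyWords = {
                "begin", "end", "documentclass", "usepackage",
                "section", "subsection", "chapter", "item",
                "emph", "textbf", "cite", "ref", "label"
            };

            override protected void DoOpen (FileInfo info)
            {
                // Hypothetical new entry in FilterSource's LangType,
                // as the comment above describes adding one.
                SrcLangType = LangType.Tex_Style;
            }
        }
    }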
Created attachment 51559 [details] New TeX filter
Created attachment 51560 [details] new filter source
That's a good start. However, I wouldn't classify a TeX filter under FilterSource, which is intended for programming/scripting languages. TeX should fall under the category of word-processor documents, as it provides more information than just textual content. I once had a simple Texi filter that parsed TeX files according to the type of keywords; however, I lost the file when I was playing with my disk partitions. Here is the simple outline:

1) TeX has two types of commands: blocks that end with @end <block-name>, and {}.
2) There are also some "escape" commands like "@@", "@{", etc.
3) Other non-block commands like @Title, @Chapter, etc. (IIRC).

Each of these types of commands has to be processed separately (see the sketch below). FilterOpenOffice.cs would be a good place to start to understand how the word-processor filters work. Thanks again for your patch. :)
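To make the outline concrete, here is a rough sketch of how the three command classes might be told apart while scanning (purely illustrative; the names and the block-command list are made up):

    enum TexiCommandKind { Block, Escape, Simple }

    // Classify a command that starts at '@': block commands open an
    // environment closed by "@end <name>", escapes like "@@" or "@{"
    // stand for literal characters, and everything else is a simple
    // one-shot command like @chapter or @title.
    static TexiCommandKind Classify (string token)
    {
        if (token == "@@" || token == "@{" || token == "@}")
            return TexiCommandKind.Escape;

        // Illustrative subset; the real list is much longer.
        string [] blockCommands = { "@example", "@itemize", "@table", "@menu" };
        if (System.Array.IndexOf (blockCommands, token) >= 0)
            return TexiCommandKind.Block;

        return TexiCommandKind.Simple;
    }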
Actually, the TeX filter should be placed in the same place as FilterHTML, in the "Text Documents" section. In both cases, we have structured text with markup that produces a formatted document. The FilterOpenOffice.cs filter seems to only consider the OpenOffice mimetypes. I'll try to transform the FilterSource + FilterTex combination into a standalone "FilterTex" like the HTML one... Just a question: the keywords in FilterTex are LaTeX keywords. Are there specific mime types to distinguish TeX and LaTeX files?
Well, when you say "structured text" with markup: yes, AbiWord/RTF/OpenOffice documents (a zip of a bunch of ASCII files) are also just text files with appropriate markup that they understand. Mimetype detection: Beagle uses the gnome-vfs methods for mimetype detection. Just use Nautilus or gnomevfs-info to find the mime type to use in your filter.
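For example, once gnomevfs-info reports the type, the filter's constructor registers it the way the existing filters do. A sketch (text/x-tex is what gnome-vfs typically reports for both TeX and LaTeX sources; the BibTeX entry is an assumption):

    public FilterTex ()
    {
        // Register the flavors this filter handles. One entry may
        // cover both TeX and LaTeX, since they share a mime type.
        AddSupportedFlavor (FilterFlavor.NewFromMimeType ("text/x-tex"));
        AddSupportedFlavor (FilterFlavor.NewFromMimeType ("text/x-bibtex"));
    }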
I'm attaching a patch against version 1.0.1, which I use on my system here, i.e. the Debian sid source package plus this patch. It works quite well.
Created attachment 53939 [details] [review] Patch for beagle 1.0.1 to index TeX files
I believe the patch works, but a LaTeX/TeX file contains a lot more information than simple words. It has information about the title, authors, their emails, chapter/section names and, most importantly, the bib entries (finding citations is the cool thing a lot of data-mining people are trying to solve). I don't think the FilterSource idea of just finding non-keyword words is the correct approach; I'm just not sure what the right way is either :(
I agree with you, but only a little information can be (and should be) extracted in a canonical way (title, author, section names...); see the sketch below. However, it may be difficult to check emails and citations, since either the syntax is not standard or there may be file dependencies (bibitems are usually stored in an external BibTeX file). I'll try to set up a wiki page with ideas/comments on this filter.
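As a sketch of the canonical part, pulling \title{} and \author{} out with a regular expression is straightforward (a hypothetical helper, not taken from the attached filter; it assumes the argument contains no nested braces):

    using System.Text.RegularExpressions;

    // Extract the argument of a simple one-argument LaTeX command,
    // e.g. \title{...} or \author{...}. Naive: no nested braces.
    static string ExtractCommandArg (string source, string command)
    {
        Match m = Regex.Match (source, @"\\" + command + @"\s*\{([^}]*)\}");
        return m.Success ? m.Groups [1].Value.Trim () : null;
    }

    // Usage:
    //   string title  = ExtractCommandArg (tex, "title");
    //   string author = ExtractCommandArg (tex, "author");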
A wiki will be great. Please paste the link in bugzilla.
I've just created a page in the Beagle wiki about the FilterLaTeX specifications. link: http://www.beagle-project.org/FilterLaTeX_Spec
Created attachment 58010 [details] [review] Patch for beagle 2.0 to support TeX and .sty files. Patch against beagle 2.0, tested with the current Debian sid package. If anyone is interested, I can put up an apt repository for a Beagle patched with TeX support.
Created attachment 60603 [details] FilterTeX based on FilterRTF. DISCLAIMER: This is my first attempt at C#. I think there is a bug where BUG is marked in the file, concerning a change of the stack without any pushes or pops. What do you think?
*** Bug 350571 has been marked as a duplicate of this bug. ***
Text content from TeX/LaTeX files can be extracted relatively easily (and better) with FilterExternal and standalone programs like untex. Unless the filter is able to extract specific metadata like author/title/bib entries, a separate filter doesn't make sense.
Created attachment 82421 [details] Updated FilterTeX. This version parses almost any complex TeX file I could throw at it. It extracts metadata (author, title, abstract) and hot text (emph, section, bibitem, etc.). Suggestions and other feedback are welcome.
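For context, "hot" text in Beagle filters is emphasized content that gets extra weight at query time. A minimal sketch of how a filter typically wraps it, assuming the HotUp/HotDown pattern used by the other filters (EmitHot is a hypothetical helper inside the filter class):

    // Called when the parser hits an emphasized construct such as
    // \emph{...}, \section{...} or \bibitem{...}: wrap its text in
    // a hot region so it is weighted higher when searching.
    void EmitHot (string text)
    {
        HotUp ();          // begin hot region
        AppendText (text); // indexed with extra weight
        HotDown ();        // end hot region
    }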
Created attachment 95956 [details] Updated FilterTeX. This version fixes a bug that caused the filter to omit spaces when collating words, and makes it possible to compile the filter from the Beagle source, courtesy of Dr. Robert Moniot. If you want to compile outside the source tree, you need to add the assembly line from AssemblyInfo.cs to the FilterTeX.cs file and compile according to the instructions in the latter.
I will see how it does with some of my LaTeX files ;-). If all goes well, I will check this in. Assigning to myself.
I don't see where it is extracting metadata like author and title. Also, it outputs all the math-mode symbols verbatim (i.e. with $2^2$). This is no better than just extracting the raw text of the file. Check the sample output of beagle-extract-content below. I can attach the LaTeX document, but you can probably create one anyway; it's a simple math-heavy LaTeX document with a \title{} and an \author{}.

Calling TeXParse (true)
Filter: Beagle.Filters.FilterTeX (determined in .83s)
MimeType: text/x-tex
Properties:
  Timestamp = 2007-08-01 00:07:48 (Utc)
Calling TeXParse (false)
Content:
The circuit to simulate a multi-$Z$ layer looks like this: \input {circuit-diagram-top-level.pdf_t} The top $n$ qubits are the original data qubits. The rest are ancilla qubits. All the qubits are arranged in $n$ blocks $B_1,\ldots,B_n$ of $n$ qubits per block. The qubits in block $B_i$ are labeled $b_{i1},\ldots,b_{in}$. Each $A_i$ subcircuit looks like this: \input {circuit-diagram-A-i.pdf_t} The qubits $c_{i1},\ldots,c_{in}$ are control qubits. For $1\le j\le n$, the qubits $b_{ij}$ and $c_{ij}$ are connected to a Toffoli gate with an ancilla as the target. Note that the controlled multi-$Z$ gate has its control on the $i$th such ancilla in $A_i$, with targets on all the other ancill\ae\ in $A_i$. Here is the state evolution from $\ket{\vec{d}} = \ket{d_1\cdots d_n}$. I'm suppressing the $c_{ij}$ qubits and ancill\ae\ internal to the $A_i$ subcircuits in the ket labels. Note that after the first layer of fanouts, each qubit $b_{ij}$ carries the value $d_j$.
\ket{\vec{d},\vec{0},\ldots,\vec{0}}
  \mapsto \ket{\vec{d},\vec{d},\ldots,\vec{d}}
  \mapsto (-1)^{\sum_i d_i c_{ii} \left( \sum_{j\ne i} d_j c_{ij} \right)} \ket{\vec{d},\vec{d},\ldots,\vec{d}}
  \mapsto (-1)^{\sum_i d_i c_{ii} \left( \sum_{j\ne i} d_j c_{ij} \right)} \ket{\vec{d},\vec{0},\ldots,\vec{0}}
To simulate some multi-$Z$ gate whose control is on the $i$th qubit, say, we do this in block $B_i$ by setting $c_{ii}$ to $1$ and setting $c_{ij}$ to $1$ for every $j$ where the $j$th qubit is a target of the gate. All the other $c$-qubits in $B_i$ are set to $0$. We can do this in separate blocks for multiple gates on the same layer, because no two gates can share the same control qubit. Any $c$-qubits in unused blocks are set to $0$.
Hmm... I did something wrong; the author and title seem to be extracted now. Weird. Anyway... anything about the $math$ stuff? Seems to me, either drop the entire text within $...$ or don't print the '$' (just ignore them, i.e. ab$c$d becomes abcd). The rest looks good. BTW, I had to make several syntactic changes, so if you make any change, can you attach a "diff -u" against the previous attachment? Thanks.
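Both options are cheap to implement. A sketch of the two behaviours, assuming balanced, non-escaped '$' delimiters (hypothetical helpers, not code from the attachment):

    using System.Text.RegularExpressions;

    // Option A: keep the math text, drop only the delimiters,
    // so "ab$c$d" becomes "abcd".
    static string KeepMathText (string line)
    {
        return line.Replace ("$", "");
    }

    // Option B: drop the entire inline math span,
    // so "ab$c$d" becomes "abd".
    static string DropMathSpans (string line)
    {
        return Regex.Replace (line, @"\$[^$]*\$", "");
    }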
I left the math-mode markup in on purpose, so as to add LaTeX rendering of it later in Kerry, for instance. I know that it isn't easy or well-defined; nevertheless, I think the outcome will be worth the trouble. The filter does two passes, one for the metadata (TeXParse (true)) and one for the text (TeXParse (false)); the output of the latter is the one you posted above. Could you post your changes? Thanks.
I will post my changes later tonight. They were mostly style issues (the old RTF filter was horrible in style matters). A few comments now:

1) Why do we need to parse it twice? I changed it to make only one pass and it seems to work fine. Note that filters are _allowed_ to return text in DoPullProperties (the text will be stored and processed later); see the sketch after this list.
2) For LaTeX documents, the raw LaTeX text is more suitable to show the context, and in that sense using the $...$ in snippets is better. So snippet mode should be false (rather, original-is-text marked true).
3) Given #2, $2$ will probably be ignored by the Lucene analyzer anyway and won't be stored in the index. So it doesn't really matter if we keep the $ or remove it. To see what gets stored in Lucene, use --analyze (or something like this) in beagle-extract-content.
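On point 1, a rough sketch of the one-pass shape, using the filter API the way the other filters do (Token and ParseDocument are hypothetical stand-ins for the filter's own parser):

    override protected void DoPullProperties ()
    {
        // Single pass: walk the document once, emitting properties
        // as metadata commands are met and appending body text as
        // it goes. Filters may emit text here; Beagle stores it
        // and processes it later.
        foreach (Token t in ParseDocument ()) {
            if (t.Command == "title")
                AddProperty (Beagle.Property.New ("dc:title", t.Argument));
            else if (t.Command == "author")
                AddProperty (Beagle.Property.New ("dc:author", t.Argument));
            else
                AppendText (t.Text);
        }

        Finished ();
    }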
1) There's no reason for this; I probably copied it verbatim from the RTF filter, so you may find more blunders of this type. I don't really know much about Beagle, only what appears in the filter anatomy tutorial.
2) I don't really know whether snippet mode or original-is-text is preferred, or what exactly their function is.
3) I used --tokenize and the output contained constructs like $k$; I also see the markup in Kerry, so I assume that it does what I intended. Is this correct?

If the property text (e.g. the abstract) is too large for the filesystem extended attribute to store (I assume the maximum size is 1024 characters; I suspect that in fact it is half that, because of character encodings), then it is stored as plain text, not as a Beagle property. Is there a better way to solve this?
(1) and (2): I figured that out. The changes I made are mostly of that nature. (3) Not --tokenize; there should be another option, --analyze or something like this. It could be that it's only in the svn trunk. The property texts are not stored in filesystem attributes (now I know what you meant by the comment in the source), so no worries there. You don't even need to make any checks; just throw everything into the property text and it will be fine.
Created attachment 96203 [details] Proposed for inclusion. I would like to commit this one; please test. About the $...$ issue, do whatever requires less processing; functionality-wise they will cause exactly the same behaviour.
Okie. Checked in the filter in r3996. Yay, finally :) Thanks, guys. Future fixes and requests should be filed as new bugs (w/ patches, please).
I'm sorry for the delay; I couldn't make it work with beagle-0.2.18 and was trying to find out the reason (I think TextReader is at fault here; I was unable to get any output from the filter). It works with the svn version, with the difference that the $ symbols are missing in some cases and there is no hot content. Is there anything I can do to fix this? Thanks again for your patience and effort.
I don't remember offhand what could have changed in trunk since the 0.2.x branch, but I will try to investigate. In any case, this is functionally no different from the one you posted (it's the same basic code), so you can keep using yours till 0.3.0 comes out. I might have forgotten to undo some of the '$' changes; I will try to remember what I did. HotContent is generated but not reported, because Beagle doesn't use hot content (it didn't earlier either, so that part of the code was removed from trunk). BTW, about the LaTeX formatting of the output of beagle-extract-content: you should not be fooled by it. As I mentioned earlier, use beagle-extract-content --analyze to see what content is actually stored in Beagle. Again, only the svn trunk has this option, but the behaviour is the same for 0.2.18.