GNOME Bugzilla – Bug 359204
SGF filter
Last modified: 2009-01-19 19:26:30 UTC
I want Beagle to be able to filter my Go games; the file format standard for these is called SGF. So I've written a really rough implementation of such a beast: http://gimmego.googlecode.com/svn/trunk/Source/SGFFilter.cs (It's my first Beagle filter, and even my first C# program, so excuse the mess -- also tell me how to improve it.) See the top of the referenced file, where it has a list of "WAYS THIS FILTER COULD BE BETTER". I'll keep improving the file at this location to resolve these, if I can.
Would you be horribly against uploading a sample .sgf file for us to test this against?
That sounds like a completely reasonable request. I've been using files from the Go Teaching Ladder <http://gtl.xmp.net> for most of my testing. Each of the zip files at <http://gtl.xmp.net/reviews/zip> has 100 SGF files in it. For performance testing, I've been throwing Kogo's Joseki Dictionary at it: <http://waterfire.us/joseki.htm>. This is a 1 MB SGF file -- far larger than any other SGF file you're ever likely to see. On my 5-year-old computer, I can beagle-extract-content on Kogo's (>/dev/null) in around 7 seconds, which isn't too horrible, but it can of course still be improved. What I really want is an SGF file that uses every last feature of the file format. I'll see about creating one of those soon. :-)
Great. We would love to support more file types. A few comments (mostly answering your questions):
1) For the code to be shipped with Beagle, it should be released under a compatible license. See any of the shipped filter files for the license they use; the LICENSE file goes into detail about this. If you agree, please include the license text as it appears in the other filter files.
2) I assume the _explanatory_ comments can be removed or curtailed when it's ready to be included :)
3) Can it be done with something better than regexes :( ? They are horribly expensive. But if they are the only feasible option, then that's what we have to live with. And yes, filters are reused.
4) Make the regexes static. Compiled regexes are better utilized when static.
5) This is scary: "string text = reader.ReadToEnd();" - how large can these files be? Loading the entire file into one large string can really blow up the memory. I am really scared of that line /* shivering */
6) Yes, you can make up your own property names. It's recommended that you be consistent with other filters (i.e. if you are writing a filter for an image and the other image filters use fixme:height for storing height, then use fixme:height in your filter too). Since SGF files are one of a kind, I guess you can go crazy.
7) "I don't do much checking of field types at all" - how does that affect the working of the filter?
8) Escaped characters should be escaped. Whitespace collapsing and adding null or empty fields or text - those come free with Beagle :)
9) See the archive filter or zip filter to see how child indexables work.
10) None. But just to make 10 comments. Feels good :-D
Make the changes, and add an SGF file and the filter file as attachments.
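To make points 4 and 5 concrete, here is a minimal sketch of the two ideas: compile the pattern once and reuse it, and read the file in bounded chunks instead of loading it whole. It is written in Python for brevity rather than in the filter's C#, and the names `PROPERTY_RE`, `iter_properties` and `chunk_size` are hypothetical, not taken from SGFFilter.cs. It only recognizes the first value of each property and ignores tree structure.

```python
import io
import re

# Compiled once at import time and shared by every call: the Python
# analogue of a static, compiled regex in the filter class.
# Group 1 is the property identifier, group 2 the raw (still escaped) value.
PROPERTY_RE = re.compile(r"([A-Z]+)\[((?:\\.|[^\]\\])*)\]", re.DOTALL)


def iter_properties(stream, chunk_size=4096):
    """Yield (identifier, raw_value) pairs without reading the whole stream.

    Only the text after the last complete match is kept between reads,
    so memory use is bounded by the longest single property, not by the
    size of the file.
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        last_end = 0
        for match in PROPERTY_RE.finditer(buffer):
            yield match.group(1), match.group(2)
            last_end = match.end()
        # Drop everything already reported; keep any unfinished property.
        buffer = buffer[last_end:]
        if not chunk:  # an empty read means end of stream
            break


if __name__ == "__main__":
    sample = io.StringIO("(;GM[1]SZ[19]C[hello \\] world])")
    for ident, value in iter_properties(sample, chunk_size=5):
        print(ident, value)
```

In C# the same shape would presumably be a `static readonly Regex` built with `RegexOptions.Compiled`, fed from repeated `TextReader.Read` calls into a buffer; that mapping is an assumption, not something tested against Beagle.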
Thanks for the feedback!
1. If it gets included with Beagle, I'm not picky about license, so sure.
2. If it gets included with Beagle, I'm not picky about much at all, really. :-)
3. There are many ways to implement it, of course. I figured that a clever solution would be to notice that SGF FF[4] is a regular language, if you don't care about branching. But if regexes are expensive in the C#/Mono world, I can certainly implement a parser manually.
4. (OK.)
5. Most are a couple of KB, tops. The one exception (there's always one) is at <http://waterfire.us/joseki.htm>, which is about 1 MB (it's a monstrous dictionary of openings). Even on my ancient computer (900 MHz!), it still only took something like 7 seconds.
6. I guess I need to write a PGN filter for chess games, too. I'm not really a chess player any more, but then I could at least claim it's consistent with another filter. :-)
7. For example, if you made up a fake tag (that isn't in SGF) like "XY[hello, world]", it would register that text, instead of (say) ignoring it, or complaining that the file is invalid. I may have also been referring to encodings: it's possible to declare that parts of an SGF file are in a different encoding (which parts depends on the tag), but my code doesn't handle encodings yet (they're icky in SGF).
8. Nice.
9. Maybe someday...
10. (Ditto!)
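The behaviour described in point 7 can be illustrated with a small sketch: a regex pulls out every `IDENT[value]` pair, and a whitelist decides whether a tag is a recognized SGF property or a made-up one such as `XY`. This is Python rather than the filter's C#, the function name `split_known_unknown` is hypothetical, and `KNOWN_PROPERTIES` is only a small illustrative subset of the FF[4] identifiers, not the full list.

```python
import re

# Illustrative subset of FF[4] property identifiers (assumption: a real
# filter would carry the complete list from the specification).
KNOWN_PROPERTIES = {
    "GM", "FF", "CA", "SZ", "PB", "PW", "RE", "DT", "KM",
    "C", "B", "W", "AB", "AW",
}

# Identifier followed by one bracketed value; "\]" inside a value is
# treated as an escaped bracket rather than the end of the value.
PROPERTY_RE = re.compile(r"([A-Z]+)\[((?:\\.|[^\]\\])*)\]", re.DOTALL)


def split_known_unknown(text):
    """Return (known, unknown) lists of (identifier, value) pairs."""
    known, unknown = [], []
    for ident, value in PROPERTY_RE.findall(text):
        target = known if ident in KNOWN_PROPERTIES else unknown
        target.append((ident, value))
    return known, unknown


if __name__ == "__main__":
    known, unknown = split_known_unknown(
        "(;GM[1]XY[hello, world]C[real comment])"
    )
    print("known:", known)
    print("unknown:", unknown)
```

Whether unknown tags should be indexed anyway, dropped, or treated as a sign of an invalid file is a policy choice the sketch leaves open.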
Hey, what's the status of this filter? Is the attached version the final version?
I'm not dead yet! :-) I've rewritten it using a handwritten parser instead of regexes (it's indeed faster). It also deals with non-ASCII encodings, and escaped characters. Due to hardware trouble, I can't submit it today (and it needs a little bit of cleanup work still), but I'll likely have it ready this weekend.
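The rewritten parser is not attached to this report, so the following is only a rough sketch of what a handwritten scanner with escape handling might look like, not the author's code. It is in Python rather than C#, `parse_properties` is a hypothetical name, and it assumes the input has already been decoded to text, so it does not cover the encoding handling mentioned above. Escapes follow the usual SGF rule: a backslash makes the next character literal, and a backslash before a newline removes the newline.

```python
def _is_ident_char(ch):
    # Property identifiers are treated here as ASCII uppercase letters only.
    return "A" <= ch <= "Z"


def parse_properties(text):
    """Yield (identifier, [values]) pairs, ignoring tree structure.

    Scans character by character, so no regex engine is involved.
    """
    i, n = 0, len(text)
    while i < n:
        if not _is_ident_char(text[i]):
            i += 1  # skip '(', ')', ';', whitespace and anything else
            continue
        start = i
        while i < n and _is_ident_char(text[i]):
            i += 1
        ident = text[start:i]
        values = []
        while True:
            # Whitespace may sit between the identifier and its values.
            j = i
            while j < n and text[j].isspace():
                j += 1
            if j >= n or text[j] != "[":
                break
            i = j + 1
            chars = []
            while i < n and text[i] != "]":
                if text[i] == "\\" and i + 1 < n:
                    i += 1  # take the escaped character literally
                    if text[i] != "\n":  # escaped newline is dropped
                        chars.append(text[i])
                else:
                    chars.append(text[i])
                i += 1
            i += 1  # step over the closing ']'
            values.append("".join(chars))
        if values:
            yield ident, values


if __name__ == "__main__":
    for ident, values in parse_properties("(;GM[1]C[a \\] b]AB[aa][bb])"):
        print(ident, values)
```

Unlike a single-value regex, this scanner also collects multi-valued properties such as `AB[aa][bb]` under one identifier.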
Closing this bug report as no further information has been provided. Please feel free to reopen this bug if you can provide the information asked for. Thanks!