
                        TagSoup - Just Keep On Truckin'

  Introduction

   This is the home page of TagSoup, a SAX-compliant parser written in
   Java that, instead of parsing well-formed or valid XML, parses HTML as
   it is found in the wild: [1]poor, nasty and brutish, though quite
   often far from short. TagSoup is designed for people who have to
   process this stuff using some semblance of a rational application
   design. By providing a SAX interface, it allows standard XML tools to
   be applied to even the worst HTML. TagSoup also includes a
   command-line processor that reads HTML files and can generate either
   clean HTML or well-formed XML that is a close approximation to XHTML.
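
   As a rough illustration of what the SAX interface buys you, the
   sketch below collects element names through an ordinary
   ContentHandler. The class ElementLister and its method are
   hypothetical names chosen for this example, not part of TagSoup. The
   code is written against the generic XMLReader interface, on the
   assumption that you would pass in an instance of TagSoup's parser
   class (org.ccil.cowan.tagsoup.Parser) when the input is real-world
   HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: works with any SAX XMLReader, so the same code
// can be driven by an XML parser or (assumed) by TagSoup's Parser.
public class ElementLister {

    /** Parses the markup and returns element names in document order. */
    public static List<String> listElements(XMLReader reader, String markup)
            throws Exception {
        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);   // record each start-tag as it arrives
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return names;
    }
}
```

   Because nothing here refers to TagSoup directly, the handler can be
   developed against well-formed XML first and then pointed at messy
   HTML by swapping the reader.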

   TagSoup is free and Open Source software, licensed under the
   [2]Academic Free License version 3.0, a cleaned-up and patent-safe
   BSD-style license which allows proprietary re-use. It's also licensed
   under the [3]GNU GPL version 2.0, since unfortunately the GPL and the
   AFL are incompatible. You can choose to license TagSoup from me under
   either the GPL or the AFL.

  Warning: TagSoup will not build in Java 5.0!

   Due to a bug in Java 5.0's default XSLT implementation, TagSoup will
   not build out of the box on Java 5.0. Instead, build it under Java
   1.4, or else install Xalan or Saxon, and the result will then work
   fine under either 1.4 or 5.0.

  TagSoup 1.0.1 released

   One new user-supplied feature, plus two features or bugfixes,
   depending on how you look at them. None are critical, so you needn't
   update unless you care.

   Previous versions of TagSoup always ignored whitespace in elements
   that don't have PCDATA as a possible child. Now, if you turn on the
   ignorableWhitespaceFeature (or use the --ignorable option), that
   whitespace will be returned to your application through the previously
   unused ContentHandler.ignorableWhitespace callback. This isn't done by
   default for backwards compatibility, and also because HTML is an SGML
   application and SGML parsers routinely dropped such whitespace.
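
   A handler that wants this whitespace only has to override the
   callback. The following is a minimal sketch: WhitespaceCounter is a
   hypothetical class name, and the comment about enabling the feature
   assumes the constant is exposed as Parser.ignorableWhitespaceFeature
   and set through the standard XMLReader.setFeature method.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps ordinary text and ignorable
// whitespace apart. With TagSoup one would (assumption) first call
//   parser.setFeature(Parser.ignorableWhitespaceFeature, true);
// otherwise ignorableWhitespace() is never invoked.
public class WhitespaceCounter extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int ignorableChars;

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);      // real character content
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorableChars += length;            // whitespace between elements
    }

    public String getText() { return text.toString(); }

    public int getIgnorableChars() { return ignorableChars; }
}
```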

   If you install a LexicalHandler in order to pick up comments and
   DOCTYPE declarations (or use the --lexical option), you may get
   comments or public identifiers that aren't valid XML: in particular,
   comments may contain -- sequences. TagSoup will now insert a space
   into such sequences, as well as immediately after a final - in a
   comment. Likewise, TagSoup will now change all illegal characters in
   public identifiers to spaces. What's more, the --lexical option will
   now cause a DOCTYPE declaration to be output if there is one in the
   input.
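
   The two clean-up rules can be sketched as follows. This is an
   illustration of the rules as described above, not TagSoup's own
   code: the class and method names are hypothetical, and the set of
   legal public identifier characters is taken from the PubidChar
   production of XML 1.0.

```java
// Hypothetical helpers illustrating the clean-up rules described above.
public class LexicalCleanup {

    /** Breaks up "--" and a trailing "-" so the text is legal in an XML comment. */
    public static String cleanComment(String comment) {
        StringBuilder sb = new StringBuilder(comment.length() + 8);
        for (int i = 0; i < comment.length(); i++) {
            char c = comment.charAt(i);
            sb.append(c);
            boolean last = (i == comment.length() - 1);
            // A hyphen followed by another hyphen, or ending the
            // comment, gets a space after it.
            if (c == '-' && (last || comment.charAt(i + 1) == '-')) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    // Punctuation allowed by the XML 1.0 PubidChar production.
    private static final String PUBID_PUNCT = "-'()+,./:=?;!*#@$_%";

    /** Replaces every character that is illegal in a public identifier with a space. */
    public static String cleanPublicId(String publicId) {
        StringBuilder sb = new StringBuilder(publicId.length());
        for (int i = 0; i < publicId.length(); i++) {
            char c = publicId.charAt(i);
            boolean legal = c == ' ' || c == '\r' || c == '\n'
                    || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || PUBID_PUNCT.indexOf(c) >= 0;
            sb.append(legal ? c : ' ');
        }
        return sb.toString();
    }
}
```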

  TagSoup 1.0 Final released

   Another small change: There is a switch --norestart to prevent
   restartable elements from being restarted.

   This is the end of my current plans for TagSoup. I will continue to
   fix bugs, but it now does everything that I foresaw back in 2002 when
   I started this project, and a great deal more. Thanks to everyone on
   the tagsoup-friends mailing list for their efforts.

   [4]Download the TagSoup 1.0.1 jar file here. It's about 50K long.
   [5]Download the full TagSoup 1.0.1 source here. If you don't have zip,
   you can use jar to unpack it.

  What TagSoup does

   TagSoup is designed as a parser, not a whole application; it isn't
   intended to permanently clean up bad HTML, as [6]HTML Tidy does, only
   to parse it on the fly. Therefore, it does not convert presentation
   HTML to CSS or anything similar. It does guarantee well-structured
   results: tags will wind up properly nested, default attributes will
   appear appropriately, and so on.

   The semantics of TagSoup are as far as practical those of actual HTML
   browsers. In particular, never, never will it throw any sort of syntax
   error: the TagSoup motto is [7]"Just Keep On Truckin'". But there's
   much, much more. For example, if the first tag is LI, it will supply
   the application with enclosing HTML, BODY, and UL tags. Why UL?
   Because that's what browsers assume in this situation. For the same
   reason, overlapping tags are correctly restarted whenever possible:
   text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text

   gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text
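
   The idea behind restarting can be shown with a toy stack-based
   rewriter. This is only a sketch under simplifying assumptions, not
   TagSoup's algorithm: it handles attribute-free tags only, treats
   every element as restartable, and reopens elements immediately,
   whereas TagSoup decides these things from its knowledge of HTML. The
   class name RestartSketch is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of closing and restarting overlapping elements.
public class RestartSketch {
    // Matches attribute-free start-tags and end-tags only.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    public static String fixNesting(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // innermost element first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());      // copy text before the tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {            // start-tag
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {      // end-tag of an open element
                List<String> restart = new ArrayList<>();
                while (!open.peek().equals(name)) {
                    String inner = open.pop();     // close overlapping element
                    out.append("</").append(inner).append('>');
                    restart.add(0, inner);         // remember it, outermost first
                }
                open.pop();
                out.append("</").append(name).append('>');
                for (String r : restart) {         // reopen what was cut short
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // An end-tag with no matching open element is simply dropped.
        }
        out.append(html, pos, html.length());
        while (!open.isEmpty()) {                  // close anything left open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

   Fed the overlapping example above, this sketch closes the I element
   before the B element ends and reopens it afterwards.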

   By intention, TagSoup is small and fast. It does not depend on the
   existence of any framework other than SAX, and should be able to work
   with any framework that can accept SAX parsers. In particular, [8]XOM
   is known to work.

   You can replace the low-level HTML scanner with one based on Sean
   McGrath's [9]PYX format (very close to James Clark's ESIS format). You
   can also supply an AutoDetector that peeks at the incoming byte stream
   and guesses a character encoding for it. Otherwise, the platform
   default is used. If you need an autodetector of character sets,
   consider trying to adapt the [10]Mozilla one; if you succeed, let me
   know.
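
   To give a feel for what such a detector does, here is a deliberately
   simple sketch that only looks for a byte order mark and otherwise
   falls back to a caller-supplied charset. It is not a substitute for
   a real character set detector. The class BomSniffer is a
   hypothetical name, and the code is plain JDK; wiring it into
   TagSoup's AutoDetector hook is left out because that interface is
   not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical, minimal encoding guesser: byte order mark or fallback.
public class BomSniffer {

    /** Peeks at up to three bytes and returns a Reader in the guessed charset. */
    public static Reader autoDetectingReader(InputStream in, Charset fallback)
            throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.readNBytes(head, 0, 3);         // requires Java 9 or later
        Charset cs = fallback;
        int bomLength = 0;
        if (n == 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        if (n > bomLength) {
            pb.unread(head, bomLength, n - bomLength);  // give back non-BOM bytes
        }
        return new InputStreamReader(pb, cs);
    }

    /** Convenience wrapper: decodes a whole byte array using the guess. */
    public static String decode(byte[] data, Charset fallback) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = autoDetectingReader(new ByteArrayInputStream(data), fallback)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```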

  Note: TagSoup in Java 1.1

   If you go through the TagSoup source and replace all references to
   HashMap with Hashtable and recompile under Java 1.4, TagSoup will work
   fine in Java 1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.

  The TSaxon XSLT-for-HTML processor

   [11]I am also distributing [12]TSaxon, a repackaging of version 6.5.5
   of Michael Kay's Saxon XSLT version 1.0 implementation that includes
   TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
   process either HTML or XML documents with XSLT stylesheets.

  TagSoup as a stand-alone program

   It is possible to run TagSoup as a program by saying java -jar
   tagsoup-1.0.1.jar [option ...] [file ...]. Files mentioned on the
   command line will be parsed individually. If no files are specified,
   the standard input is read.

   The following options are understood:

   --files
          Output into individual files, with html extensions changed to
          xhtml. Otherwise, all output is sent to the standard output.

   --html
          Output is in clean HTML: the XML declaration is suppressed, as
          are end-tags for the known empty elements.

   --omit-xml-declaration
          The XML declaration is suppressed.

   --method=html
          End-tags for the known empty HTML elements are suppressed.

   --pyx
          Output is in PYX format.

   --pyxin
          Input is in PYXoid format (need not be well-formed).

   --nons
          Namespaces are suppressed. Normally, all elements are in the
          XHTML 1.x namespace, and all attributes are in no namespace.

   --nobogons
          Bogons (unknown elements) are suppressed. Normally, they are
          treated as empty.

   --nodefaults
          Default attribute values are suppressed.

   --nocolons
          Explicit colons in element and attribute names are changed to
          underscores.

   --norestart
          Normally restartable elements are not restarted.

   --ignorable
          Whitespace in elements with element-only content is output.

   --any
          Bogons are given a content model of ANY rather than EMPTY.

   --lexical
          Pass through HTML comments and any DOCTYPE declaration. Has no
          effect when output is in PYX format.

   --reuse
          Reuse a single instance of TagSoup parser throughout. Normally,
          a new one is instantiated for each input file.

   --nocdata
          Change the content models of the script and style elements to
          treat them as ordinary #PCDATA (text-only) elements, as in
          XHTML, rather than with the special CDATA content model.

   --encoding=encoding
          Specify the input encoding. The default is the Java platform
          default.

   --help
          Print help.

   --version
          Print the version number.

  More information

   I gave a presentation (a nocturne, so it's not on the schedule) at
   [13]Extreme Markup Languages 2004 about TagSoup, updated from the one
   presented in 2002 at the New York City XML SIG and at XML 2002. This
   is the main high-level documentation about how TagSoup works. Formats:
   [14]OpenDocument [15]Powerpoint [16]PDF.

   I also had people add [17]"evil" HTML to a large poster so that I
   could [18]clean it up; View Source is probably more useful than
   ordinary browsing. The original instructions were:

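                        SOUPE DE BALISES (BE EVIL)!
   Ecritez une balise ouvrante (sans attributs)
   ou fermante HTML ici, s.v.p.

   (In English: "Tag soup (be evil)! Please write an HTML start-tag
   (without attributes) or end-tag here.")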

   There is a [19]tagsoup-friends mailing list hosted at [20]Yahoo
   Groups. You can [21]join via the Web, or by sending a blank email to
   [22]tagsoup-friends-subscribe@yahoogroups.com. The [23]archives are
   open to all.

   Online TagSoup processing for publicly accessible HTML documents is
   now [24]available courtesy of Leigh Dodds.

References

   1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
   2. http://www.opensource.org/licenses/afl-3.0.php
   3. http://www.opensource.org/licenses/gpl-license.php
   4. http://ccil.org/~cowan/XML/tagsoup/tagsoup-1.0.1.jar
   5. http://ccil.org/~cowan/XML/tagsoup/tagsoup-1.0.1-src.zip
   6. http://tidy.sf.net/
   7. http://www.crumbmuseum.com/truckin.html
   8. http://www.cafeconleche.org/XOM
   9. http://gnosis.cx/publish/programming/xml_matters_17.html
  10. http://jchardet.sourceforge.net/
  11. http://www.ccil.org/~cowan
  12. http://ccil.org/~cowan/XML/tagsoup/tsaxon
  13. http://www.extrememarkup.com/extreme/2004
  14. http://ccil.org/~cowan/XML/tagsoup/tagsoup.odp
  15. http://ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
  16. http://ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
  17. http://ccil.org/~cowan/XML/tagsoup/extreme.html
  18. http://ccil.org/~cowan/XML/tagsoup/extreme.xhtml
  19. http://groups.yahoo.com/group/tagsoup-friends
  20. http://groups.yahoo.com/
  21. http://groups.yahoo.com/group/tagsoup-friends/join
  22. mailto:tagsoup-friends-subscribe@yahoogroups.com
  23. http://groups.yahoo.com/group/tagsoup-friends/messages
  24. http://xmlarmyknife.org/docs/xhtml/tagsoup/
