GNOME Bugzilla – Bug 321632
htmlReadMemory broken if LIBXML_LEGACY_ENABLED not set
Last modified: 2008-07-19 11:20:14 UTC
Distribution/Version: Ubuntu/5.10

Revision 1.46 of globals.c introduced a change so that inithtmlDefaultSAXHandler(&gs->htmlDefaultSAXHandler); is only called if both LIBXML_HTML_ENABLED and LIBXML_LEGACY_ENABLED are defined.

ctxt->sax is set up by htmlReadMemory along the following path (only when SAX1 is compiled in):

    htmlReadMemory at HTMLparser.c:5941
    xmlCreateMemoryParserCtxt at parser.c:12360
    xmlNewParserCtxt at parserInternals.c:1807
    xmlInitParserCtxt at parserInternals.c:1553
    xmlDefaultSAXHandlerInit at SAX2.c:2754
    xmlSAXVersion at SAX2.c

The ctxt->sax pointer is then initialized by xmlInitParserCtxt(). So at HTMLparser.c:5944, ctxt->sax != NULL, and it has been initialized to the values of xmlDefaultSAXHandler.

    Breakpoint 1, htmlReadMemory (
        buffer=0x8dfa710 "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">\r\n<HTML><HEAD>\r\n<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=us-ascii\">\r\n<TITLE>Message</TITLE>\r\n\r\n<META content=\"MSHTML 6."...,
        size=4951, URL=0x848d7d1 "html:", encoding=0x8df5568 "utf-8",
        options=65633) at HTMLparser.c:5945
    5945        memcpy(ctxt->sax, &htmlDefaultSAXHandler, sizeof(xmlSAXHandlerV1));
    (gdb) print *ctxt->sax
    $1 = {internalSubset = 0x821ecf0 <xmlSAX2InternalSubset>,
      isStandalone = 0x821ec60 <xmlSAX2IsStandalone>,
      hasInternalSubset = 0x821ec90 <xmlSAX2HasInternalSubset>,
      hasExternalSubset = 0x821ecc0 <xmlSAX2HasExternalSubset>,
      resolveEntity = 0x821efd0 <xmlSAX2ResolveEntity>,
      getEntity = 0x821f040 <xmlSAX2GetEntity>,
      entityDecl = 0x821f230 <xmlSAX2EntityDecl>,
      notationDecl = 0x821f5b0 <xmlSAX2NotationDecl>,
      attributeDecl = 0x821f3c0 <xmlSAX2AttributeDecl>,
      elementDecl = 0x821f520 <xmlSAX2ElementDecl>,
      unparsedEntityDecl = 0x821f660 <xmlSAX2UnparsedEntityDecl>,
      setDocumentLocator = 0x821f7b0 <xmlSAX2SetDocumentLocator>,
      startDocument = 0x821f7c0 <xmlSAX2StartDocument>,
      endDocument = 0x821f8c0 <xmlSAX2EndDocument>,
      startElement = 0, endElement = 0,
      reference = 0x8221120 <xmlSAX2Reference>,
      characters = 0x82211a0 <xmlSAX2Characters>,
      ignorableWhitespace = 0x82211a0 <xmlSAX2Characters>,
      processingInstruction = 0x82213f0 <xmlSAX2ProcessingInstruction>,
      comment = 0x8221520 <xmlSAX2Comment>,
      warning = 0x81d4340 <xmlParserWarning>,
      error = 0x81d41b0 <xmlParserError>,
      fatalError = 0x81d41b0 <xmlParserError>,
      getParameterEntity = 0x821f200 <xmlSAX2GetParameterEntity>,
      cdataBlock = 0x8221650 <xmlSAX2CDataBlock>,
      externalSubset = 0x821eda0 <xmlSAX2ExternalSubset>,
      initialized = 3740122799, _private = 0x0,
      startElementNs = 0x82209c0 <xmlSAX2StartElementNs>,
      endElementNs = 0x82210b0 <xmlSAX2EndElementNs>, serror = 0}

This is then overwritten at HTMLparser.c:5945 with the value of htmlDefaultSAXHandler. However, htmlDefaultSAXHandler is not filled in in the per-thread data:

    (gdb) print *(xmlGlobalState *) pthread_getspecific(globalkey)
    $2 = {xmlParserVersion = 0x849a7a2 "20622", xmlDefaultSAXLocator = {
        getPublicId = 0x821ebd0 <xmlSAX2GetPublicId>,
        getSystemId = 0x821ebe0 <xmlSAX2GetSystemId>,
        getLineNumber = 0x821ec00 <xmlSAX2GetLineNumber>,
        getColumnNumber = 0x821ec30 <xmlSAX2GetColumnNumber>},
      xmlDefaultSAXHandler = {internalSubset = 0x821ecf0 <xmlSAX2InternalSubset>,
        isStandalone = 0x821ec60 <xmlSAX2IsStandalone>,
        hasInternalSubset = 0x821ec90 <xmlSAX2HasInternalSubset>,
        hasExternalSubset = 0x821ecc0 <xmlSAX2HasExternalSubset>,
        resolveEntity = 0x821efd0 <xmlSAX2ResolveEntity>,
        getEntity = 0x821f040 <xmlSAX2GetEntity>,
        entityDecl = 0x821f230 <xmlSAX2EntityDecl>,
        notationDecl = 0x821f5b0 <xmlSAX2NotationDecl>,
        attributeDecl = 0x821f3c0 <xmlSAX2AttributeDecl>,
        elementDecl = 0x821f520 <xmlSAX2ElementDecl>,
        unparsedEntityDecl = 0x821f660 <xmlSAX2UnparsedEntityDecl>,
        setDocumentLocator = 0x821f7b0 <xmlSAX2SetDocumentLocator>,
        startDocument = 0x821f7c0 <xmlSAX2StartDocument>,
        endDocument = 0x821f8c0 <xmlSAX2EndDocument>,
        startElement = 0x821ffb0 <xmlSAX2StartElement>,
        endElement = 0x8220710 <xmlSAX2EndElement>,
        reference = 0x8221120 <xmlSAX2Reference>,
        characters = 0x82211a0 <xmlSAX2Characters>,
        ignorableWhitespace = 0x82211a0 <xmlSAX2Characters>,
        processingInstruction = 0x82213f0 <xmlSAX2ProcessingInstruction>,
        comment = 0x8221520 <xmlSAX2Comment>,
        warning = 0x81d4340 <xmlParserWarning>,
        error = 0x81d41b0 <xmlParserError>,
        fatalError = 0x81d41b0 <xmlParserError>,
        getParameterEntity = 0x821f200 <xmlSAX2GetParameterEntity>,
        cdataBlock = 0x8221650 <xmlSAX2CDataBlock>,
        externalSubset = 0x821eda0 <xmlSAX2ExternalSubset>, initialized = 1},
      docbDefaultSAXHandler = {internalSubset = 0, isStandalone = 0,
        hasInternalSubset = 0, hasExternalSubset = 0, resolveEntity = 0,
        getEntity = 0, entityDecl = 0, notationDecl = 0, attributeDecl = 0,
        elementDecl = 0, unparsedEntityDecl = 0, setDocumentLocator = 0,
        startDocument = 0, endDocument = 0, startElement = 0, endElement = 0,
        reference = 0, characters = 0, ignorableWhitespace = 0,
        processingInstruction = 0, comment = 0, warning = 0, error = 0,
        fatalError = 0, getParameterEntity = 0, cdataBlock = 0,
        externalSubset = 0, initialized = 0},
      htmlDefaultSAXHandler = {internalSubset = 0, isStandalone = 0,
        hasInternalSubset = 0, hasExternalSubset = 0, resolveEntity = 0,
        getEntity = 0, entityDecl = 0, notationDecl = 0, attributeDecl = 0,
        elementDecl = 0, unparsedEntityDecl = 0, setDocumentLocator = 0,
        startDocument = 0, endDocument = 0, startElement = 0, endElement = 0,
        reference = 0, characters = 0, ignorableWhitespace = 0,
        processingInstruction = 0, comment = 0, warning = 0, error = 0,
        fatalError = 0, getParameterEntity = 0, cdataBlock = 0,
        externalSubset = 0, initialized = 0},
      xmlFree = 0x8091420 <free>, xmlMalloc = 0x8090ca0 <malloc>,
      xmlMemStrdup = 0x821cb70 <xmlStrdup>, xmlRealloc = 0x8091130 <realloc>,
      xmlGenericError = 0x81b334e <xml_generic_error_handler>,
      xmlStructuredError = 0x81b33ce <xml_structured_error_handler>,
      xmlGenericErrorContext = 0xb7b94584, oldXMLWDcompatibility = 0,
      xmlBufferAllocScheme = XML_BUFFER_ALLOC_EXACT,
      xmlDefaultBufferSize = 4096, xmlSubstituteEntitiesDefaultValue = 0,
      xmlDoValidityCheckingDefaultValue = 0, xmlGetWarningsDefaultValue = 1,
      xmlKeepBlanksDefaultValue = 1, xmlLineNumbersDefaultValue = 0,
      xmlLoadExtDtdDefaultValue = 0, xmlParserDebugEntities = 0,
      xmlPedanticParserDefaultValue = 0, xmlSaveNoEmptyTags = 0,
      xmlIndentTreeOutput = 1, xmlTreeIndentString = 0x849a790 " ",
      xmlRegisterNodeDefaultValue = 0, xmlDeregisterNodeDefaultValue = 0,
      xmlMallocAtomic = 0x8090ca0 <malloc>, xmlLastError = {domain = 0,
        code = 0, message = 0x0, level = XML_ERR_NONE, file = 0x0, line = 0,
        str1 = 0x0, str2 = 0x0, str3 = 0x0, int1 = 0, int2 = 0, ctxt = 0x0,
        node = 0x0}, xmlParserInputBufferCreateFilenameValue = 0,
      xmlOutputBufferCreateFilenameValue = 0}

Compare that with the static version of htmlDefaultSAXHandler:

    (gdb) print htmlDefaultSAXHandler
    $3 = {internalSubset = 0x821ecf0 <xmlSAX2InternalSubset>,
      isStandalone = 0, hasInternalSubset = 0, hasExternalSubset = 0,
      resolveEntity = 0, getEntity = 0x821f040 <xmlSAX2GetEntity>,
      entityDecl = 0, notationDecl = 0, attributeDecl = 0, elementDecl = 0,
      unparsedEntityDecl = 0,
      setDocumentLocator = 0x821f7b0 <xmlSAX2SetDocumentLocator>,
      startDocument = 0x821f7c0 <xmlSAX2StartDocument>,
      endDocument = 0x821f8c0 <xmlSAX2EndDocument>,
      startElement = 0x821ffb0 <xmlSAX2StartElement>,
      endElement = 0x8220710 <xmlSAX2EndElement>, reference = 0,
      characters = 0x82211a0 <xmlSAX2Characters>,
      ignorableWhitespace = 0x82213e0 <xmlSAX2IgnorableWhitespace>,
      processingInstruction = 0x82213f0 <xmlSAX2ProcessingInstruction>,
      comment = 0x8221520 <xmlSAX2Comment>,
      warning = 0x81d4340 <xmlParserWarning>,
      error = 0x81d41b0 <xmlParserError>,
      fatalError = 0x81d41b0 <xmlParserError>, getParameterEntity = 0,
      cdataBlock = 0x8221650 <xmlSAX2CDataBlock>, externalSubset = 0,
      initialized = 1}

So after the call to memcpy, *ctxt->sax has been mostly cleared (as much of it as is covered by xmlSAXHandlerV1, anyway):

    (gdb) print *ctxt->sax
    $11 = {internalSubset = 0, isStandalone = 0, hasInternalSubset = 0,
      hasExternalSubset = 0, resolveEntity = 0, getEntity = 0, entityDecl = 0,
      notationDecl = 0, attributeDecl = 0, elementDecl = 0,
      unparsedEntityDecl = 0, setDocumentLocator = 0, startDocument = 0,
      endDocument = 0, startElement = 0, endElement = 0, reference = 0,
      characters = 0, ignorableWhitespace = 0, processingInstruction = 0,
      comment = 0, warning = 0, error = 0, fatalError = 0,
      getParameterEntity = 0, cdataBlock = 0, externalSubset = 0,
      initialized = 0, _private = 0x0,
      startElementNs = 0x82209c0 <xmlSAX2StartElementNs>,
      endElementNs = 0x82210b0 <xmlSAX2EndElementNs>, serror = 0}

With the #ifdef around inithtmlDefaultSAXHandler from revision 1.46 of globals.c changed so that the call is made, this area of the per-thread data is copied from the static version, and the SAX pointers are present.

The effect of the bug is that the buffer fails to parse, because the SAX callbacks are missing.
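The effect of the V1-sized memcpy described above can be reproduced in isolation. The following is only a toy sketch: the toy_* types and functions are hypothetical stand-ins, not libxml2 API, and the structs are far smaller than the real handlers. It assumes, as in the report, that the V1 layout is a prefix of the full handler layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, not libxml2 types: toy_sax_v1 plays the role of
 * xmlSAXHandlerV1 and toy_sax_full the role of the full xmlSAXHandler. */
typedef void (*toy_callback_fn)(void);

typedef struct {
    toy_callback_fn startElement;
    toy_callback_fn endElement;
    toy_callback_fn characters;
    unsigned int initialized;
} toy_sax_v1;

typedef struct {
    /* Same leading members as toy_sax_v1 ... */
    toy_callback_fn startElement;
    toy_callback_fn endElement;
    toy_callback_fn characters;
    unsigned int initialized;
    /* ... followed by a SAX2-only tail. */
    toy_callback_fn startElementNs;
    toy_callback_fn endElementNs;
} toy_sax_full;

/* The V1-sized copy must stop before the SAX2-only tail. */
_Static_assert(sizeof(toy_sax_v1) <= offsetof(toy_sax_full, startElementNs),
               "V1 layout must be a prefix of the full layout");

/* Placeholder callback so that "set" and "unset" slots can be told apart. */
void toy_callback(void) {}

/* Same shape as the memcpy at HTMLparser.c:5945 in the report. */
void toy_overwrite_v1(toy_sax_full *dst, const toy_sax_v1 *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(toy_sax_v1));
}

/* Counts the V1 callbacks that are still non-NULL. */
int toy_live_v1_callbacks(const toy_sax_full *h)
{
    return (h->startElement != NULL) + (h->endElement != NULL)
         + (h->characters != NULL);
}
```

Copying a zeroed toy_sax_v1 over a fully populated toy_sax_full leaves no live V1 callbacks while the startElementNs/endElementNs tail survives, which is the same pattern as the $11 dump.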
I tried to look at this, but I'm a bit lost. First, you can't remove the LEGACY ifdef: if libxml2 is not compiled with legacy support, inithtmlDefaultSAXHandler simply doesn't exist!

My understanding is that:
- your program is multithreaded (xmllint isn't, so xmllint --html doesn't show the problem)
- your program doesn't call xmlInitParser() (it should; see the page about thread support)
- xmlInitParser() calls htmlDefaultSAXHandlerInit(), which calls xmlSAX2InitHtmlDefaultSAXHandler(), which sets up the default handler.

But honestly, without a test case reproducing the problem I can't really understand and fix it.

Note that none of the default SAXv1 handlers are set up in xmlInitializeGlobalState() if LIBXML_LEGACY_ENABLED is not defined. It's not specific to the HTML parser ...

Daniel
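The point about xmlInitializeGlobalState() can be pictured with a much-reduced model of compile-time-guarded initialization. This is a sketch only: every toy_* name and the TOY_LEGACY_ENABLED macro are hypothetical, and the real function does considerably more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-reduced model; nothing prefixed toy_ is libxml2 API. */
typedef void (*toy_handler_fn)(void);

typedef struct {
    toy_handler_fn startElement;
    toy_handler_fn endElement;
    unsigned int initialized;
} toy_sax_handler;

typedef struct {
    toy_sax_handler xmlDefault;
    toy_sax_handler htmlDefault;
} toy_global_state;

void toy_noop(void) {}

/* Fills in one handler slot, standing in for a legacy init helper. */
void toy_init_handler(toy_sax_handler *h)
{
    assert(h != NULL);
    h->startElement = toy_noop;
    h->endElement = toy_noop;
    h->initialized = 1;
}

/* Per-thread state setup: the handler slots are populated only when the
 * (hypothetical) legacy switch is defined at compile time; otherwise they
 * are left zeroed right after initialization. */
void toy_initialize_global_state(toy_global_state *gs)
{
    assert(gs != NULL);
    memset(gs, 0, sizeof(*gs));
#if defined(TOY_LEGACY_ENABLED)
    toy_init_handler(&gs->xmlDefault);
    toy_init_handler(&gs->htmlDefault);
#endif
}
```

Built without TOY_LEGACY_ENABLED, every handler slot in the model starts out zeroed, not just the HTML one, so any code that later copies from such a slot sees no callbacks unless something else has filled it in first.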
Closing this bug report as no further information has been provided. Please feel free to reopen this bug if you can provide the information asked for. Thanks!