Bug 687301 – [PATCH] Tokenizer doesn't recognize some valid HTML attributes




Summary:	[PATCH] Tokenizer doesn't recognize some valid HTML attributes


Status:	RESOLVED FIXED

Product:	doxygen
Classification:	Other
Component:	general
Version:	1.8.2-SVN
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Dimitri van Heesch
QA Contact:	Dimitri van Heesch

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2012-11-01 01:02 UTC by mason malone
Modified:	2012-12-26 16:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Fix html attribute parsing (959 bytes, patch) 2012-11-01 01:02 UTC, mason malone

Description mason malone 2012-11-01 01:02:20 UTC

Created attachment 227769
Fix html attribute parsing

Many characters that are valid in HTML attribute names are rejected by Doxygen, notably the hyphen. This means you can't use data attributes (which always take the form data-foo="bar") in HTML, which is annoying because data attributes are frequently used to pass data to JavaScript apps.

The attached patch modifies doctokenizer.l to be more liberal in what characters are allowed in attribute names. The regular expression for HTMLATTID was derived from this: 
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#attributes-0
"Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode."
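
To make the quoted rule concrete, here is a minimal Python sketch of the difference between a narrow attribute-name pattern and one derived from the WHATWG text. It is not the flex rule from the attached patch: STRICT_NAME is a hypothetical stand-in for a tokenizer that rejects hyphens (Doxygen's actual HTMLATTID rule is not shown in this report), and LIBERAL_NAME only approximates the spec (Python's \s is slightly broader than the HTML space characters, and undefined Unicode code points are not checked).

```python
import re

# Hypothetical narrow pattern: letters, digits and underscore only.
# An assumption for illustration, NOT Doxygen's actual HTMLATTID rule.
STRICT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Approximation of the quoted WHATWG rule: one or more characters other
# than whitespace, control characters (U+0000-U+001F, U+007F-U+009F),
# quotation mark, apostrophe, '>', '/' and '='.
LIBERAL_NAME = re.compile(r"[^\s\x00-\x1f\x7f-\x9f\"'>/=]+")


def is_attr_name(pattern, name):
    """Return True if the whole string is accepted as an attribute name."""
    return pattern.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["href", "data-foo", "a=b", ""]:
        print(repr(candidate),
              is_attr_name(STRICT_NAME, candidate),
              is_attr_name(LIBERAL_NAME, candidate))
```

Under these assumptions, data-foo is rejected by the narrow pattern and accepted by the liberal one, while a=b and the empty string are rejected by both.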

Comment 1 Dimitri van Heesch 2012-11-17 10:06:09 UTC

I don't mind adding the '-', but allowing even more characters will probably lead to cases where text is suddenly parsed as an attribute.

Besides that, using arbitrary names for attributes is not part of the HTML standard. The 4.01 standard, for instance, lists only these as valid:
http://www.w3.org/TR/REC-html40/index/attributes.html
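
The concern about text being parsed as an attribute can be illustrated with a small Python sketch. It does not reproduce how doctokenizer.l actually scans input: the two patterns, the helper attr_like_names, and the sample string are all hypothetical, chosen only to show that a more permissive name pattern accepts more NAME= sequences than one that merely adds the hyphen.

```python
import re

# Hypothetical conservative pattern: identifier characters plus '-'.
CONSERVATIVE = r"[A-Za-z_][A-Za-z0-9_-]*"

# Hypothetical liberal pattern, loosely following the WHATWG wording.
LIBERAL = r"[^\s\x00-\x1f\x7f-\x9f\"'>/=]+"


def attr_like_names(name_pattern, text):
    """Return every whitespace-delimited NAME that is directly followed by '='."""
    return re.findall(r"(?<!\S)(" + name_pattern + r")=", text)


if __name__ == "__main__":
    sample = "f(x)=1 data-id=7"
    print(attr_like_names(CONSERVATIVE, sample))
    print(attr_like_names(LIBERAL, sample))
```

With this sample, the conservative pattern picks up only data-id, while the liberal one also treats f(x) as an attribute name, which is the kind of false positive the comment warns about.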

Comment 2 Dimitri van Heesch 2012-11-18 11:07:25 UTC

Changed version 'latest' to '1.8.2-SVN' so I can remove 'latest' as an option as it is a moving target.

Comment 3 Dimitri van Heesch 2012-12-26 16:09:10 UTC

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.8.3. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information
that you think can be relevant.