GNOME Bugzilla – Bug 606592
non 2-letter language codes being rejected
Last modified: 2010-11-04 14:41:01 UTC
xmllint accepts only 2-letter language codes and rejects 3-letter language codes:

TeRSirG.xml.p5.formatted:84: element language: Schemas validity error : Element '{http://www.tei-c.org/ns/1.0}language', attribute 'ident': 'mao' is not a valid value of the atomic type 'xs:language'.
Bid001Kahu.xml.p5.formatted:107: element language: Schemas validity error : Element '{http://www.tei-c.org/ns/1.0}language', attribute 'ident': 'rap' is not a valid value of the atomic type 'xs:language'.
Bid001Kahu.xml.p5.formatted:972: element foreign: Schemas validity error : Element '{http://www.tei-c.org/ns/1.0}foreign', attribute '{http://www.w3.org/XML/1998/namespace}lang': 'rap' is not a valid value of the atomic type 'xs:language'.

Both 'mao' and 'rap' are present in /usr/share/xml/iso-codes/iso_639_3.xml and /usr/share/xml/iso-codes/iso_639.xml. I understand that 'mao' is deprecated in favour of 'mi' by recent RFCs, but I believe it's still valid for older XML files. 'rap' does not have a two-letter language code (it's the indigenous language of Easter Island).

xs:Name and locally defined schema types seem to be working as expected.

I'm not sure of the version I'm using, but it reports:

$ xmllint -version
xmllint: using libxml version 20632
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib

This was discussed at http://www.mail-archive.com/xslt@gnome.org/msg01323.html but nothing appears to have happened that I can see.

cheers
stuart
A live file demonstrating the issue can be found at: http://www.nzetc.org/tei-source/Bid001Kahu.xml For an HTML rendering of that see: http://www.nzetc.org/tm/scholarly/tei-Bid001Kahu.html cheers stuart
http://www.w3.org/TR/xmlschema-2/#language references RFC 3066 for the value space. The implementation of the XSD datatype in libxml2 uses the routine xmlCheckLanguageID(), which follows the old productions from XML-1.0:

 * [33] LanguageID ::= Langcode ('-' Subcode)*
 * [34] Langcode ::= ISO639Code | IanaCode | UserCode
 * [35] ISO639Code ::= ([a-z] | [A-Z]) ([a-z] | [A-Z])
 * [36] IanaCode ::= ('i' | 'I') '-' ([a-z] | [A-Z])+
 * [37] UserCode ::= ('x' | 'X') '-' ([a-z] | [A-Z])+
 * [38] Subcode ::= ([a-z] | [A-Z])+

These productions don't allow the two values you pointed out. There is apparently a mismatch between the two specifications, but I don't think I will have much time to upgrade this to the far more flexible syntax allowed in http://www.ietf.org/rfc/rfc3066.txt. Patches are welcome on-list, though.

Daniel
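To make the mismatch concrete, the productions [33] to [38] quoted above can be transcribed into a regular expression. This is only an illustrative sketch in Python; the helper name is_old_language_id is hypothetical, and this is not the code of xmlCheckLanguageID() itself. It shows why a bare 3-letter code cannot match: the only primary code that does not start with 'i-' or 'x-' is exactly two letters long.

```python
import re

# Transcription of productions [33]-[38] from the older XML 1.0 editions.
# Hypothetical sketch; not libxml2's actual xmlCheckLanguageID() code.
_ISO639_CODE = r"[A-Za-z]{2}"      # [35] exactly two letters
_IANA_CODE = r"[iI]-[A-Za-z]+"     # [36] 'i-' followed by letters
_USER_CODE = r"[xX]-[A-Za-z]+"     # [37] 'x-' followed by letters
_SUBCODE = r"[A-Za-z]+"            # [38] one or more letters

# [33] LanguageID ::= Langcode ('-' Subcode)*
_LANGUAGE_ID = re.compile(
    rf"(?:{_ISO639_CODE}|{_IANA_CODE}|{_USER_CODE})(?:-{_SUBCODE})*"
)


def is_old_language_id(tag: str) -> bool:
    """Return True if tag matches the old XML 1.0 LanguageID production."""
    return _LANGUAGE_ID.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ("mi", "en-US", "mao", "rap"):
        print(tag, is_old_language_id(tag))
    # prints:
    # mi True
    # en-US True
    # mao False
    # rap False
```

Under these rules 'mao' and 'rap' fail because no alternative of Langcode can consume three letters, which matches the validity errors in the report.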
FWIW there is a newer RFC/BCP document at http://www.ietf.org/rfc/bcp/bcp47.txt, explicitly allowing ISO 639-3 (I think the one you mention only allowed 3-letter codes from ISO 639-2).

Note: the current (5th) edition of the XML Rec says: "The values of the [xml:lang] attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages; in addition, the empty string may be specified." ( http://www.w3.org/TR/xml/#sec-lang-tag )

In a multi-lingual environment, this is in fact a blocker: it forces you to give up xml:lang (which is sometimes not feasible in e.g. database or corpus settings), to invent phony codes, or to give up xmllint. Currently, I can't validate Northern Sotho, Tonga and Tok Pisin documents (two of them are among the official languages of South Africa and Papua New Guinea, one is a widespread lingua franca). This list can only grow (2-letter codes will never be introduced for such languages, per the ISO 639 Registration Authority Joint Advisory Committee's declaration, cf. the BCP doc).

So I'm wondering if the priority of this bug shouldn't go up. (I realise that it's now primarily a question of getting a proper patch in, but with a higher priority this is more likely; I will try to get some attention to this issue elsewhere in the meantime.)
Created attachment 163691: Patch for bug 606592
Created attachment 163692: Simple program to test xml:lang
Thanks for the patch, but I actually implemented RFC 5646 instead, which is the current successor to RFC 1766 and RFC 4646 (which the XML Rec now references). It should now parse most interoperable language tags. It makes for a more complex patch, but I think it's closer to what people now expect.

thanks!
Daniel
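For readers wondering what the RFC 5646 syntax permits, here is a rough Python sketch of the well-formedness grammar for a language tag (language, optional extlangs, script, region, variants, extensions, private use). It is a simplification under stated assumptions: it omits the grandfathered tags, does not consult the IANA subtag registry, and is not the code that went into libxml2. The name looks_like_langtag is hypothetical.

```python
import re

_ALNUM = "[A-Za-z0-9]"

# Simplified "langtag" production from RFC 5646, section 2.1.
# Assumption: grandfathered tags and registry validation are left out.
_LANGTAG = re.compile(
    rf"""
    (?:
        [A-Za-z]{{2,3}}(?:-[A-Za-z]{{3}}){{0,3}}   # 2-3 letter code, up to 3 extlangs
      | [A-Za-z]{{4,8}}                            # reserved or registered language subtag
    )
    (?:-[A-Za-z]{{4}})?                            # script, e.g. Hant
    (?:-(?:[A-Za-z]{{2}}|[0-9]{{3}}))?             # region, e.g. US or 419
    (?:-(?:{_ALNUM}{{5,8}}|[0-9]{_ALNUM}{{3}}))*   # variants
    (?:-[0-9A-WY-Za-wy-z](?:-{_ALNUM}{{2,8}})+)*   # extensions (singleton is not x)
    (?:-[xX](?:-{_ALNUM}{{1,8}})+)?                # trailing private use
    """,
    re.VERBOSE,
)

# A tag may also consist of a private-use sequence only.
_PRIVATE_USE = re.compile(rf"[xX](?:-{_ALNUM}{{1,8}})+")


def looks_like_langtag(tag: str) -> bool:
    """Return True if tag is well-formed under the simplified grammar above."""
    return bool(_LANGTAG.fullmatch(tag) or _PRIVATE_USE.fullmatch(tag))


if __name__ == "__main__":
    # 'mao' and 'rap' are the codes from the original report;
    # 'nso' (Northern Sotho) and 'tpi' (Tok Pisin) are ISO 639 3-letter codes.
    for tag in ("mi", "mao", "rap", "nso", "tpi", "en-US", "zh-Hant-TW", "en--US"):
        print(tag, looks_like_langtag(tag))
```

With this grammar the 3-letter codes from the report are well-formed, while malformed input such as 'en--US' is still rejected. Well-formedness is weaker than validity: the sketch accepts any 3-letter sequence, whether or not it is a registered language.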
Wonderful news! :-) I'll be sure to check it out as soon as I can. Thanks a bunch!