Bug 606592 - non 2-letter language codes being rejected
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Reported: 2010-01-11 01:48 UTC by stuart yeates
Modified: 2010-11-04 14:41 UTC
Attachments
Patch for bug 606592 (3.53 KB, patch)
2010-06-15 16:06 UTC, Jimmy O'Regan
Simple program to test xml:lang (861 bytes, text/x-csrc)
2010-06-15 16:12 UTC, Jimmy O'Regan

Description stuart yeates 2010-01-11 01:48:41 UTC
xmllint is only accepting 2-letter language codes and not 3-letter language codes.

TeRSirG.xml.p5.formatted:84: element language: Schemas validity error :
Element '{http://www.tei-c.org/ns/1.0}language', attribute 'ident':
'mao' is not a valid value of the atomic type 'xs:language'.

Bid001Kahu.xml.p5.formatted:107: element language: Schemas validity
error : Element '{http://www.tei-c.org/ns/1.0}language', attribute
'ident': 'rap' is not a valid value of the atomic type 'xs:language'.

Bid001Kahu.xml.p5.formatted:972: element foreign: Schemas validity error
: Element '{http://www.tei-c.org/ns/1.0}foreign', attribute
'{http://www.w3.org/XML/1998/namespace}lang': 'rap' is not a valid value
of the atomic type 'xs:language'.

Both 'mao' and 'rap' are present in /usr/share/xml/iso-codes/iso_639_3.xml and /usr/share/xml/iso-codes/iso_639.xml.

I understand that 'mao' is deprecated in favour of 'mi' by recent RFCs, but I believe it's still valid for older XML files. 'rap' does not have a two-letter language code (it's the indigenous language of Easter Island).

xs:Name and locally defined schema types seem to be working as expected.

I'm not sure of the version I'm using, but it reports:

$ xmllint -version
xmllint: using libxml version 20632
    compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1
FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv
ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug
Zlib

This was discussed at http://www.mail-archive.com/xslt@gnome.org/msg01323.html but, as far as I can see, nothing has come of it.

cheers
stuart
Comment 1 stuart yeates 2010-01-11 02:26:53 UTC
A live file demonstrating the issue can be found at:

http://www.nzetc.org/tei-source/Bid001Kahu.xml 

For an HTML rendering of that see:

http://www.nzetc.org/tm/scholarly/tei-Bid001Kahu.html

cheers
stuart
Comment 2 Daniel Veillard 2010-01-11 13:05:43 UTC
http://www.w3.org/TR/xmlschema-2/#language

references RFC 3066 for the value space.
The implementation of the XSD datatype in libxml2
uses the routine xmlCheckLanguageID(), which follows the old productions
from XML-1.0:

 * [33] LanguageID ::= Langcode ('-' Subcode)*
 * [34] Langcode ::= ISO639Code |  IanaCode |  UserCode
 * [35] ISO639Code ::= ([a-z] | [A-Z]) ([a-z] | [A-Z])
 * [36] IanaCode ::= ('i' | 'I') '-' ([a-z] | [A-Z])+
 * [37] UserCode ::= ('x' | 'X') '-' ([a-z] | [A-Z])+
 * [38] Subcode ::= ([a-z] | [A-Z])+

which doesn't allow the 2 values you pointed out.
There is apparently a mismatch between the two but I don't
think I will have much time to upgrade this to the far more
flexible syntax allowed in 

http://www.ietf.org/rfc/rfc3066.txt

but patches welcome on-list,

Daniel
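
The XML 1.0 productions [33] to [38] quoted in Comment 2 can be sketched as a regular expression to show why 'mao' and 'rap' fail. This is only a minimal illustration in Python, not the code of xmlCheckLanguageID(); the names OLD_LANGUAGE_ID and is_old_language_id are hypothetical.

```python
import re

# Productions [33]-[38] as quoted in Comment 2.
# Hypothetical names; this is not libxml2's implementation.
OLD_LANGUAGE_ID = re.compile(
    r"(?:[A-Za-z]{2}"      # [35] ISO639Code: exactly two letters
    r"|[iI]-[A-Za-z]+"     # [36] IanaCode
    r"|[xX]-[A-Za-z]+)"    # [37] UserCode
    r"(?:-[A-Za-z]+)*"     # [38] zero or more Subcodes
)


def is_old_language_id(value: str) -> bool:
    """Return True if value matches the old LanguageID production [33]."""
    return OLD_LANGUAGE_ID.fullmatch(value) is not None


if __name__ == "__main__":
    for code in ("mi", "en-US", "x-klingon", "mao", "rap"):
        print(code, is_old_language_id(code))
```

Under these productions a primary code must be exactly two letters unless it carries an 'i-' or 'x-' prefix, so any bare three-letter ISO 639 code is rejected.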
Comment 3 Piotr Banski 2010-02-28 16:04:14 UTC
FWIW there is a newer rfc/bcp document at http://www.ietf.org/rfc/bcp/bcp47.txt , explicitly allowing ISO-639-3 (I think the one you mention only allowed 3-letter codes from ISO-639-2).

Note: the current (5th) edition of the XML Rec says:

"The values of the [xml:lang] attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages; in addition, the empty string may be specified." ( http://www.w3.org/TR/xml/#sec-lang-tag )

In a multi-lingual environment, this is in fact a blocker: it forces you to stop using xml:lang (which is sometimes not feasible, e.g. in database or corpus settings), to invent phony codes, or to give up xmllint.

Currently, I can't validate Northern Sotho, Tonga and Tok Pisin documents (two of them are among the official languages of South Africa and Papua New Guinea, one is a widespread lingua franca). This list can only grow (2-letter codes will never be introduced for such languages, per the ISO 639 Registration Authority Joint Advisory Committee's declaration, cf. the bcp doc).

So I'm wondering whether the priority of this bug shouldn't go up. (I realise that it's now primarily a question of getting a proper patch in, but with a higher priority this is more likely; I will try to draw some attention to this issue elsewhere in the meantime.)
Comment 4 Jimmy O'Regan 2010-06-15 16:06:13 UTC
Created attachment 163691
Patch for bug 606592
Comment 5 Jimmy O'Regan 2010-06-15 16:12:20 UTC
Created attachment 163692
Simple program to test xml:lang
Comment 6 Daniel Veillard 2010-11-04 14:28:16 UTC
Thanks for the patch, but I actually implemented RFC 5646 instead, which is
the current successor to RFC 1766 and RFC 4646 (which the XML REC references now).
It should parse most interoperable language tags now; it makes for a more
complex patch, but I think it's closer to what people now expect.

 thanks !

Daniel
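
For comparison, the main langtag production of RFC 5646 mentioned in Comment 6 can be sketched the same way. This is a simplified illustration, not the code committed to libxml2: it omits the grandfathered tags and does not check subtags against the IANA registry, and the names LANGTAG and is_rfc5646_langtag are hypothetical.

```python
import re

# Simplified sketch of the RFC 5646 "langtag" and "privateuse" productions.
# Assumptions: grandfathered tags are omitted and no registry lookup is done.
_LANGUAGE = r"(?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})"
_SCRIPT = r"(?:-[a-z]{4})?"
_REGION = r"(?:-(?:[a-z]{2}|[0-9]{3}))?"
_VARIANT = r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*"
_EXTENSION = r"(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*"
_PRIVATE = r"x(?:-[a-z0-9]{1,8})+"

LANGTAG = re.compile(
    "(?:"
    + _LANGUAGE + _SCRIPT + _REGION + _VARIANT + _EXTENSION
    + "(?:-" + _PRIVATE + ")?"
    + "|" + _PRIVATE      # a tag consisting only of private-use subtags
    + ")",
    re.IGNORECASE,
)


def is_rfc5646_langtag(value: str) -> bool:
    """Return True if value is well-formed under the simplified grammar."""
    return LANGTAG.fullmatch(value) is not None


if __name__ == "__main__":
    for tag in ("mi", "mao", "rap", "en-US", "zh-Hans-CN", "en--US"):
        print(tag, is_rfc5646_langtag(tag))
```

Because the primary language subtag may be two or three letters, the codes 'mao' and 'rap' from the original report are well-formed under this grammar.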
Comment 7 Piotr Banski 2010-11-04 14:41:01 UTC
Wonderful news! :-) I'll be sure to check it out as soon as I can.

Thanks a bunch!