GNOME Bugzilla – Bug 590183
Fix bidi implementation with regards to Unicode 5.2.0 clarifications
Last modified: 2012-08-25 20:31:04 UTC
Specifically, the shortcuts we use to avoid calling into FriBidi need to be updated.
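For illustration only, a shortcut of this kind could be sketched in Python as below. This is not Pango's actual code: the function name `needs_bidi` and the set `RTL_TRIGGER_CLASSES` are hypothetical, and including AN (Arabic number) in that set is an assumption about what an updated shortcut would have to consider.

```python
import unicodedata

# Bidi classes that make this sketch fall back to the full algorithm.
# Including "AN" is an assumption of this sketch, not Pango's actual list.
RTL_TRIGGER_CLASSES = {"R", "AL", "AN", "RLE", "RLO", "RLI"}


def needs_bidi(text: str) -> bool:
    """Return True if any character has a bidi class in RTL_TRIGGER_CLASSES."""
    return any(
        unicodedata.bidirectional(ch) in RTL_TRIGGER_CLASSES for ch in text
    )


# Latin A, Arabic-Indic digit one, hyphen-minus, Arabic-Indic digit two.
print(needs_bidi("A\u0661-\u0662"))   # True: the digits have class AN
print(needs_bidi("Hello, world 123")) # False: only L, CS, WS and EN
```

A check that looked only for R and AL would return False for the first string, because it contains no strongly right-to-left character.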
Here are the details, BTW: https://bugzilla.mozilla.org/show_bug.cgi?id=762710#c30
Pasting for the record. Here's my original report from 2009:

=============

The rule N1 from http://www.unicode.org/reports/tr9/#N1 reads:

"""
N1. A sequence of neutrals takes the direction of the surrounding strong text if the text on both sides has the same direction. European and Arabic numbers act as if they were R in terms of their influence on neutrals. Start-of-level-run (sor) and end-of-level-run (eor) are used at level run boundaries.

R N R → R R R
L N L → L L L
R N AN → R R AN
AN N R → AN R R
R N EN → R R EN
EN N R → EN R R

Note that any AN or EN remaining after W7 will be in an right-to-left context.
"""

Bug 1: The text of the first paragraph says "European and Arabic numbers act as if they were R in terms of their influence on neutrals." It is not clear what this means. There are at least the following two possible interpretations:

* The text is trying to loosely describe the logic behind the six rules that follow and should not be taken literally. In particular, the sequences "AN N AN", "EN N EN", "AN N EN", and "EN N AN" are NOT processed as if AN and EN act like an R. This is most probably what the rule was meant to be. The text, however, is definitely wrong. My colleague's testing suggests that this is what OS X implements.

* Before applying the six rules listed, temporarily convert any AN or EN type to R, then proceed to apply the rules. This reading is what I implemented in FriBidi years ago. I just checked and the Java reference implementation also reads it like this. I didn't check the code, but I'm fairly sure that the C++ reference implementation does the same. The problems with reading it like this are numerous:

  - It conflicts with the six rules listed: there will be no EN or AN left, so the rules should be simplified to only:

    R N R → R R R
    L N L → L L L

  - The major problem with this approach, however, is that it can produce strongly RTL characters in an otherwise LTR paragraph. This is inconsistent with the following paragraph from the Implementation Notes:

    """
    One of the most effective optimizations is to first test for right-to-left characters and not invoke the Bidirectional Algorithm unless they are present.
    """

Here is the test case:

<U+0041,U+0661,U+002D,U+0662>

That's Latin capital letter A, Arabic digit 1, hyphen-minus, Arabic digit 2. The original bidi types are <L,AN,ES,AN>, and they reach rule N1 as <L,AN,N,AN>, at which point this reading of the rule N1 changes them to <L,R,R,R> and things go south from there.

Bug 2: The last line in rule N1 reads: "Note that any AN or EN remaining after W7 will be in an right-to-left context." This is wrong, as my example above shows. The "L,AN" sequence reaches N1 fine and it's NOT in a "right-to-left context", whatever that means. That sentence should simply be removed.

===================

Here's the draft of the standard in which this was applied: http://www.unicode.org/reports/tr9/tr9-20.html
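To make the two readings of N1 concrete, here is a small Python sketch that applies each one to the test case from the report. It is not FriBidi or reference-implementation code. The function names are hypothetical, the input is assumed to be a single level run with sor and eor both L, and neutrals are assumed to be already reduced to "N". The second function leaves the AN/EN conversion in place, since that is the outcome the report describes.

```python
def neutral_runs(types):
    """Yield (start, end) index pairs for each maximal run of 'N'."""
    i = 0
    while i < len(types):
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < len(types) and types[j] == "N":
            j += 1
        yield i, j
        i = j


def n1_literal_rules(types, sor="L", eor="L"):
    """Reading 1: apply only the six listed rules; AN N AN etc. stay unresolved."""
    out = list(types)
    r_like = {"R", "AN", "EN"}
    for start, end in neutral_runs(types):
        left = types[start - 1] if start > 0 else sor
        right = types[end] if end < len(types) else eor
        if left == "L" and right == "L":
            fill = "L"
        elif left in r_like and right in r_like and "R" in (left, right):
            fill = "R"  # covers R N R, R N AN, AN N R, R N EN, EN N R
        else:
            continue  # left for later rules
        out[start:end] = [fill] * (end - start)
    return out


def n1_numbers_as_r(types, sor="L", eor="L"):
    """Reading 2: convert AN and EN to R first, then apply R N R and L N L."""
    converted = ["R" if t in ("AN", "EN") else t for t in types]
    out = list(converted)
    for start, end in neutral_runs(converted):
        left = converted[start - 1] if start > 0 else sor
        right = converted[end] if end < len(converted) else eor
        if left == right and left in ("L", "R"):
            out[start:end] = [left] * (end - start)
    return out


test_case = ["L", "AN", "N", "AN"]  # <U+0041,U+0661,U+002D,U+0662> at rule N1
print(n1_literal_rules(test_case))  # ['L', 'AN', 'N', 'AN']
print(n1_numbers_as_r(test_case))   # ['L', 'R', 'R', 'R']
```

Under the first reading the neutral is left for later rules to resolve. Under the second, strongly right-to-left types appear in a run that contained no strongly right-to-left character.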
Fixed in Pango master.