Bug 590183 – Fix bidi implementation with regards to Unicode 5.2.0 clarifications


Summary:	Fix bidi implementation with regards to Unicode 5.2.0 clarifications


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:	585426

Reported:	2009-07-29 19:33 UTC by Behdad Esfahbod
Modified:	2012-08-25 20:31 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Behdad Esfahbod 2009-07-29 19:33:35 UTC

Specifically, the shortcuts we use to avoid calling into FriBidi need to be updated.

Comment 1 Behdad Esfahbod 2012-08-21 18:47:35 UTC

Here are the details, BTW:

https://bugzilla.mozilla.org/show_bug.cgi?id=762710#c30

Comment 2 Behdad Esfahbod 2012-08-21 18:48:58 UTC

Pasting for the record.

Here's my original report from 2009:

=============
The rule N1 from http://www.unicode.org/reports/tr9/#N1 reads:

"""
N1. A sequence of neutrals takes the direction of the surrounding strong text
if the text on both sides has the same direction. European and Arabic numbers
act as if they were R in terms of their influence on neutrals.
Start-of-level-run (sor) and end-of-level-run (eor) are used at level run
boundaries.

    R  N  R  → R  R  R

    L  N  L  → L  L  L

    R  N  AN → R  R  AN

    AN N  R  → AN R  R

    R  N  EN → R  R  EN

    EN N  R  → EN R  R

Note that any AN or EN remaining after W7 will be in an right-to-left context.
"""


Bug 1:

The text of the first paragraph says "European and Arabic numbers act as if
they were R in terms of their influence on neutrals."  It is not clear what
this means.  There are at least the following two possible interpretations:

  * The text is trying to loosely describe the logic behind the six rules that
follow and should not be taken literally.  In particular, the sequences "AN N
AN", "EN N EN", "AN N EN", and "EN N AN" are NOT processed as if AN and EN act
like an R.  This is most probably what the rule was meant to say.  The text,
however, is definitely wrong.  My colleague's testing suggests that this is
what OS X implements.

  * Before applying the 6 rules listed, temporarily convert any AN or EN type
to R, then proceed to apply the rules.  This reading is what I implemented in
FriBidi years ago.  I just checked and the Java reference implementation also
reads it like this.  I didn't check the code but I'm fairly sure that the C++
reference implementation does the same.  The problems with reading it like
this are numerous:

    - It conflicts with the 6 rules listed as there will be no EN and AN
anymore and the rules should be simplified to only:

      R N R → R R R
      L N L → L L L

    - The major problem with this approach, however, is that it can produce
strongly RTL characters in an otherwise LTR paragraph.  This is inconsistent
with the following paragraph from Implementation Notes:

"""
One of the most effective optimizations is to first test for right-to-left
characters and not invoke the Bidirectional Algorithm unless they are present.
"""
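The optimization quoted above can be sketched roughly as follows. This is only an illustration in Python using the standard `unicodedata` module; the function name `needs_bidi` and the set `RTL_CLASSES` are hypothetical and are not taken from Pango or FriBidi. The sketch shows that a check of this shape finds no right-to-left characters in the string A, Arabic digit 1, hyphen-minus, Arabic digit 2, so the Bidirectional Algorithm would be skipped for it.

```python
import unicodedata

# Bidi classes treated as "right-to-left" by this sketch.
# The exact set is an assumption made for illustration only.
RTL_CLASSES = {"R", "AL", "RLE", "RLO"}

def needs_bidi(text, rtl_classes=RTL_CLASSES):
    """Return True if any character has a bidi class in rtl_classes."""
    return any(unicodedata.bidirectional(ch) in rtl_classes for ch in text)

# Latin capital letter A, Arabic digit 1, hyphen-minus, Arabic digit 2.
sample = "\u0041\u0661\u002D\u0662"
sample_types = [unicodedata.bidirectional(ch) for ch in sample]

print(sample_types)        # bidi classes of the four characters
print(needs_bidi(sample))  # whether the shortcut would invoke the algorithm
```

If the algorithm can turn a neutral into R in a string like this one, a shortcut of this kind and the full algorithm would disagree, which is the inconsistency described above.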

Here is the test case:

  <U+0041,U+0661,U+002D,U+0662>

That's Latin capital letter A, Arabic digit 1, hyphen-minus, Arabic digit 2.
The original bidi types are <L,AN,ES,AN>, and they reach rule N1 as
<L,AN,N,AN>, at which point this reading of the rule N1 changes them to
<L,R,R,R> and things go south from there.
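The two readings of N1 can be compared with a small sketch. It assumes a single level run whose types have already passed through the W rules, writes every neutral as "N" as this report does, and uses a hypothetical function `resolve_n1`; none of this is FriBidi or Pango code. The sketch rewrites only the neutrals and leaves the digits as AN, whereas the report writes the outcome of the second reading as <L,R,R,R>.

```python
NEUTRAL = "N"

# The six patterns listed in rule N1, as (type before, type after).
LISTED_PATTERNS = {
    ("R", "R"), ("L", "L"),
    ("R", "AN"), ("AN", "R"),
    ("R", "EN"), ("EN", "R"),
}

def resolve_n1(types, sor, eor, numbers_as_r):
    """Resolve runs of neutrals under one of the two readings of N1.

    numbers_as_r=False: only the six listed patterns apply.
    numbers_as_r=True:  AN and EN are treated as R before comparing sides.
    Neutrals that N1 does not resolve are left as "N" (rule N2 would follow).
    """
    out = list(types)
    i, n = 0, len(types)
    while i < n:
        if types[i] != NEUTRAL:
            i += 1
            continue
        j = i
        while j < n and types[j] == NEUTRAL:
            j += 1
        before = types[i - 1] if i > 0 else sor
        after = types[j] if j < n else eor
        resolved = None
        if numbers_as_r:
            a, b = ("R" if t in ("AN", "EN") else t for t in (before, after))
            if a == b and a in ("L", "R"):
                resolved = a
        elif (before, after) in LISTED_PATTERNS:
            resolved = "L" if before == "L" else "R"
        if resolved is not None:
            out[i:j] = [resolved] * (j - i)
        i = j
    return out

types = ["L", "AN", "N", "AN"]  # the test case as it reaches N1
print(resolve_n1(types, "L", "L", numbers_as_r=False))
print(resolve_n1(types, "L", "L", numbers_as_r=True))
```

Under the first reading the neutral is left for N2 to resolve from the embedding direction; under the second it becomes R even though the paragraph contains no strong right-to-left character.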




Bug 2:

The last line in rule N1 reads: "Note that any AN or EN remaining after W7
will be in an right-to-left context."  This is wrong as my example above
shows.  The "L,AN" sequence reaches N1 fine and it's NOT in a "right-to-left
context", whatever that means.  That sentence should simply be removed.
===================

Here's the draft in which this was applied to the standard:
http://www.unicode.org/reports/tr9/tr9-20.html

Comment 3 Behdad Esfahbod 2012-08-21 18:50:42 UTC

Fixed in Pango master.