Bug 91542 – Make some characters neutral for shaper selection



Summary:	Make some characters neutral for shaper selection


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	future
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:	112503 118302

Reported:	2002-08-23 19:00 UTC by Owen Taylor
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
pango-script.c (9.65 KB, text/plain) 2002-11-18 02:25 UTC, Owen Taylor	Details
pango-script.h (3.92 KB, text/plain) 2002-11-18 02:25 UTC, Owen Taylor	Details
testscript.c (7.01 KB, text/plain) 2002-11-18 02:27 UTC, Owen Taylor	Details
gen-script-table.pl (1.55 KB, text/plain) 2002-11-18 02:29 UTC, Owen Taylor	Details

Description Owen Taylor 2002-08-23 19:00:49 UTC

Support needs to be added for identifying characters as
"neutral" with respect to the choice of language engine.
Currently, a block of, say, Arabic text is split into
one-word runs for the Arabic shaper, with intervening
one-character runs for the Basic shaper covering each space
character. As might be imagined, this is a fairly major
performance problem.
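
To make the cost concrete, here is a minimal C sketch of per-character shaper selection with no notion of neutral characters. The names Shaper, shaper_for_char and count_runs are hypothetical and are not Pango API; the lookup is a toy that sends the Arabic block to one shaper and everything else, including U+0020 SPACE, to another.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shaper identifiers; not Pango's real engine names. */
typedef enum { SHAPER_BASIC, SHAPER_ARABIC } Shaper;

/* Toy per-character lookup (an assumption for illustration): the Arabic
 * block U+0600..U+06FF goes to the Arabic shaper, everything else,
 * including U+0020 SPACE, goes to the Basic shaper. */
static Shaper shaper_for_char(uint32_t ch)
{
    return (ch >= 0x0600 && ch <= 0x06FF) ? SHAPER_ARABIC : SHAPER_BASIC;
}

/* Count the runs produced when a new run starts every time the
 * shaper changes from one character to the next. */
size_t count_runs(const uint32_t *text, size_t len)
{
    assert(text != NULL || len == 0);

    size_t runs = 0;
    for (size_t i = 0; i < len; i++) {
        if (i == 0 || shaper_for_char(text[i]) != shaper_for_char(text[i - 1]))
            runs++;
    }
    return runs;
}
```

Under these assumptions, three Arabic words separated by two spaces yield five runs rather than one, and the run count keeps growing with the number of words.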

Comment 1 Owen Taylor 2002-11-17 17:45:43 UTC

Some references:

http://mail.gnome.org/archives/gtk-i18n-list/2001-December/msg00013.html

http://mail.gnome.org/archives/gtk-i18n-list/2002-August/msg00062.html

http://www.unicode.org/unicode/reports/tr24/

Comment 2 Owen Taylor 2002-11-17 18:33:48 UTC

The code in ICU Eric was referring to is:

http://oss.software.ibm.com/cvs/icu/icu/source/extra/scrptrun/

Looks pretty simple, given a function to compute the UTR #24 script of
a given character.
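
A rough C sketch of that idea follows. It is not the ICU code and leaves out ICU's handling of paired punctuation; it only shows how characters whose script is Common can be absorbed into the surrounding run once a per-character script lookup exists. Script, ScriptRun, script_for_char and next_script_run are hypothetical names, and the lookup is a toy stand-in for a table generated from Scripts.txt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced script set. */
typedef enum { SCRIPT_COMMON, SCRIPT_LATIN, SCRIPT_ARABIC } Script;

/* Toy stand-in for a real UTR #24 lookup. The ranges are a rough
 * approximation for illustration, not the real Scripts.txt data. */
static Script script_for_char(uint32_t ch)
{
    if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
        return SCRIPT_LATIN;
    if (ch >= 0x0621 && ch <= 0x064A)
        return SCRIPT_ARABIC;
    return SCRIPT_COMMON;   /* spaces, digits, punctuation, ... */
}

typedef struct {
    size_t start;   /* index of the first character in the run */
    size_t end;     /* index one past the last character */
    Script script;  /* SCRIPT_COMMON only if the run has no real script */
} ScriptRun;

/* Find the run beginning at `start`. Returns 1 and fills `run`, or
 * returns 0 when `start` is at or past the end of the text. */
int next_script_run(const uint32_t *text, size_t len, size_t start,
                    ScriptRun *run)
{
    assert(run != NULL);
    if (start >= len)
        return 0;

    Script script = SCRIPT_COMMON;
    size_t i = start;
    for (; i < len; i++) {
        Script s = script_for_char(text[i]);
        if (s == SCRIPT_COMMON)
            continue;           /* neutral: never ends a run */
        if (script == SCRIPT_COMMON)
            script = s;         /* first real script names the run */
        else if (s != script)
            break;              /* a different real script ends the run */
    }

    run->start = start;
    run->end = i;
    run->script = script;
    return 1;
}
```

A caller would loop, passing the previous run's end as the next start, until the function returns 0. In this sketch, Common characters between two different scripts attach to the preceding run.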

Comment 3 Owen Taylor 2002-11-17 18:41:08 UTC

Source for script information is:

 http://www.unicode.org/Public/UNIDATA/Scripts.txt

Comment 4 Owen Taylor 2002-11-17 20:21:29 UTC

Appears that the ICU link above is a C++ prototype; there
is a C implementation:

 http://oss.software.ibm.com/cvs/icu/icu/source/common/usc_impl.c
 http://oss.software.ibm.com/cvs/icu/icu/source/common/usc_impl.h

That appears to be the current code.

Comment 5 Owen Taylor 2002-11-18 02:25:38 UTC

Created attachment 12359
pango-script.c

Comment 6 Owen Taylor 2002-11-18 02:25:49 UTC

Created attachment 12360
pango-script.h

Comment 7 Owen Taylor 2002-11-18 02:27:10 UTC

Created attachment 12361
testscript.c

Comment 8 Owen Taylor 2002-11-18 02:29:27 UTC

Created attachment 12362
gen-script-table.pl

Comment 9 Owen Taylor 2002-11-18 02:32:37 UTC

Attached port of the ICU algorithm to Pango, along with
code for looking up the script assignments.

(At least the script assignments should eventually go into
GLib, maybe the iterator too, so it probably makes sense to
protect this stuff with PANGO_ENABLE_ENGINE, to avoid
it being generally relied upon.)

Now just need to figure out how to hook it up to the engines.
I think it makes most sense to treat each engine as handling
some set of scripts, but the problem with this is that the
COMMON and INHERITED characters will result in engines getting
characters that they didn't have to handle before, so all the
engines will need to be audited in this regard.

Comment 10 Owen Taylor 2003-08-03 22:00:06 UTC

I've checked the script-range detection code into CVS now;
I'm still working on figuring out how to use it.

Comment 11 Owen Taylor 2003-09-23 23:58:14 UTC

OK, a complete rewrite of itemization is now in CVS. The algorithm
is more or less:

 - Correct language tags for rendering characters based on
   script information. (If Arabic script text is tagged as
   'en', change the language tag to 'ar'.)

 - Pick fonts for rendering characters based on corrected
   language and font description.

 - Pick fonts for non-rendering characters using the font
   for adjacent rendering characters.

Seems to work reasonably well, though I'm sure we'll discover
some additional problems that need to be fixed up.
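
As a rough illustration of the first and third steps, here is a small C sketch. It is not the code that went into CVS. All names (Script, NO_FONT, language_uses_script, correct_language, fill_neutral_fonts) are hypothetical, the language table is a toy, and the font selection of the second step is left out: the sketch assumes fonts have already been chosen for rendering characters and that non-rendering characters are marked NO_FONT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced script set. */
typedef enum { SCRIPT_COMMON, SCRIPT_LATIN, SCRIPT_ARABIC } Script;

/* Marker for a character that has not been assigned a font. */
#define NO_FONT (-1)

/* Toy table (an assumption): which languages are written in which script. */
static int language_uses_script(const char *lang, Script script)
{
    if (script == SCRIPT_COMMON)
        return 1;
    if (script == SCRIPT_ARABIC)
        return strcmp(lang, "ar") == 0 || strcmp(lang, "fa") == 0;
    return strcmp(lang, "en") == 0;     /* SCRIPT_LATIN */
}

/* Step 1: keep the tag if it fits the script, otherwise fall back to
 * a default language for that script. */
const char *correct_language(const char *lang, Script script)
{
    assert(lang != NULL);
    if (language_uses_script(lang, script))
        return lang;
    return script == SCRIPT_ARABIC ? "ar" : "en";
}

/* Step 3: give each NO_FONT entry the font of an adjacent character
 * that has one, preferring the character before it. */
void fill_neutral_fonts(int *fonts, size_t len)
{
    assert(fonts != NULL || len == 0);

    int last = NO_FONT;
    for (size_t i = 0; i < len; i++) {      /* copy from the left */
        if (fonts[i] == NO_FONT)
            fonts[i] = last;
        else
            last = fonts[i];
    }

    last = NO_FONT;
    for (size_t i = len; i-- > 0; ) {       /* leading gaps copy from the right */
        if (fonts[i] == NO_FONT)
            fonts[i] = last;
        else
            last = fonts[i];
    }
}
```

With this sketch, a space between two words that share a font ends up with that same font, so it no longer forces a separate run. If no character in the text has a font, every entry stays NO_FONT.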