GNOME Bugzilla – Bug 91542
Make some characters neutral for shaper selection
Last modified: 2004-12-22 21:47:04 UTC
Support needs to be added for identifying characters as "neutral" with respect to the choice of language engine. Currently, a block of, say, Arabic text will be split into one-word runs of Arabic, with intervening one-character runs that go to the Basic shaper for each space character. This is, as might be imagined, a fairly major performance problem.
Some references:
http://mail.gnome.org/archives/gtk-i18n-list/2001-December/msg00013.html
http://mail.gnome.org/archives/gtk-i18n-list/2002-August/msg00062.html
http://www.unicode.org/unicode/reports/tr24/
The ICU code Eric was referring to is: http://oss.software.ibm.com/cvs/icu/icu/source/extra/scrptrun/ It looks pretty simple, given a function to compute the UTR #24 script of a given character.
Source for script information is: http://www.unicode.org/Public/UNIDATA/Scripts.txt
Appears that the ICU link above is a C++ prototype; there is a C implementation: http://oss.software.ibm.com/cvs/icu/icu/source/common/usc_impl.c http://oss.software.ibm.com/cvs/icu/icu/source/common/usc_impl.h That appears to be the current code.
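To make the idea concrete, here is a rough sketch of script-run splitting in which COMMON and INHERITED characters are neutral and simply stay in the current run. It is not the ICU or Pango code: the names (Script, ScriptRun, script_for_char, next_script_run, count_script_runs) are hypothetical, the classifier only knows a few ranges, and refinements such as matching paired punctuation (which, as far as I recall, the ICU code handles) are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical script codes; real values would be derived from Scripts.txt. */
typedef enum { SCRIPT_COMMON, SCRIPT_INHERITED, SCRIPT_LATIN, SCRIPT_ARABIC } Script;

typedef struct {
    size_t start;   /* index of the first character in the run */
    size_t end;     /* index one past the last character in the run */
    Script script;  /* resolved script, SCRIPT_COMMON if the run is all neutral */
} ScriptRun;

/* Toy classifier covering only a few ranges (an assumption for illustration). */
static Script script_for_char(unsigned int ch)
{
    if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
        return SCRIPT_LATIN;
    if (ch >= 0x0300 && ch <= 0x036F)   /* combining diacritical marks */
        return SCRIPT_INHERITED;
    if ((ch >= 0x0621 && ch <= 0x063A) || (ch >= 0x0641 && ch <= 0x064A))
        return SCRIPT_ARABIC;           /* basic Arabic letters */
    return SCRIPT_COMMON;               /* spaces, digits, punctuation, ... */
}

/* Find the run beginning at `start`. Returns 0 once `start` reaches `len`. */
static int next_script_run(const unsigned int *text, size_t len,
                           size_t start, ScriptRun *run)
{
    size_t i;
    Script run_script = SCRIPT_COMMON;

    assert(text != NULL && run != NULL);
    if (start >= len)
        return 0;

    for (i = start; i < len; i++) {
        Script s = script_for_char(text[i]);

        if (s == SCRIPT_COMMON || s == SCRIPT_INHERITED)
            continue;               /* neutral: never ends a run */
        if (run_script == SCRIPT_COMMON)
            run_script = s;         /* first real script decides the run */
        else if (s != run_script)
            break;                  /* a different real script ends the run */
    }

    run->start = start;
    run->end = i;
    run->script = run_script;
    return 1;
}

/* Count the runs in a whole text by iterating until the end is reached. */
static size_t count_script_runs(const unsigned int *text, size_t len)
{
    ScriptRun run;
    size_t pos = 0, n = 0;

    while (next_script_run(text, len, pos, &run)) {
        pos = run.end;
        n++;
    }
    return n;
}
```

With this approach, two Arabic words separated by a space come out as a single run rather than three, and neutral characters after a Latin word stay attached to the Latin run until the first Arabic letter appears.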
Created attachment 12359: pango-script.c
Created attachment 12360: pango-script.h
Created attachment 12361: testscript.c
Created attachment 12362: gen-script-table.pl
Attached a port of the ICU algorithm to Pango, along with code for looking up the script assignments. (At least the script assignments should eventually go into GLib, maybe the iterator too, so it probably makes sense to protect this stuff with PANGO_ENABLE_ENGINE, to avoid it being generally relied upon.) Now I just need to figure out how to hook it up to the engines. I think it makes most sense to treat each engine as handling some set of scripts, but the problem with this is that the COMMON and INHERITED characters will result in engines getting characters that they didn't have to handle before, so all the engines will need to be audited in this regard.
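For readers without the attachments, a script lookup of this kind can be sketched as a sorted table of code point ranges searched by binary search. This is only an illustration and not the contents of pango-script.c: the names (ScriptRange, script_table, script_lookup) are hypothetical, the table is a tiny hand-written excerpt rather than generated output, and falling back to COMMON for anything not listed is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical script codes for this sketch. */
typedef enum {
    SCRIPT_COMMON, SCRIPT_INHERITED, SCRIPT_LATIN, SCRIPT_GREEK, SCRIPT_ARABIC
} Script;

typedef struct {
    unsigned int first;  /* first code point of the range, inclusive */
    unsigned int last;   /* last code point of the range, inclusive */
    Script script;
} ScriptRange;

/* Small hand-written excerpt; must be sorted by `first` and non-overlapping.
 * A real table would be generated from Scripts.txt. */
static const ScriptRange script_table[] = {
    { 0x0041, 0x005A, SCRIPT_LATIN },
    { 0x0061, 0x007A, SCRIPT_LATIN },
    { 0x0300, 0x036F, SCRIPT_INHERITED },
    { 0x0391, 0x03A1, SCRIPT_GREEK },
    { 0x0621, 0x063A, SCRIPT_ARABIC },
    { 0x0641, 0x064A, SCRIPT_ARABIC },
};

/* Binary search over the range table; unlisted code points fall back to
 * SCRIPT_COMMON (an assumption made for this sketch). */
static Script script_lookup(unsigned int ch)
{
    size_t lo = 0;
    size_t hi = sizeof script_table / sizeof script_table[0];

    assert(ch <= 0x10FFFF);  /* outside the Unicode code space otherwise */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ch < script_table[mid].first)
            hi = mid;               /* look in the lower half */
        else if (ch > script_table[mid].last)
            lo = mid + 1;           /* look in the upper half */
        else
            return script_table[mid].script;
    }
    return SCRIPT_COMMON;
}
```

A range table keeps the data compact, since most scripts occupy a handful of contiguous blocks, and the search cost grows only with the logarithm of the number of ranges.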
I've checked the script-range detection code into CVS now; I'm still working on figuring out how to use it.
OK, a complete rewrite of itemization is now in CVS. The algorithm is more or less:
- Correct language tags for rendering characters based on script information. (If Arabic script text is tagged as 'en', change the language tag to 'ar'.)
- Pick fonts for rendering characters based on corrected language and font description.
- Pick fonts for non-rendering characters using the font for adjacent rendering characters.
Seems to work reasonably well, though I'm sure we'll discover some additional problems that need to be fixed up.
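The first and third steps can be sketched roughly as follows. This is not the code that went into CVS: the names (corrected_language, fill_nonrendering_fonts, NO_FONT) are hypothetical, fonts are represented as plain integer ids, the list of languages written in Arabic script is a short illustrative sample, and the second step (the actual font selection) is assumed to have already filled in an id for every rendering character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical script codes for this sketch. */
typedef enum { SCRIPT_COMMON, SCRIPT_LATIN, SCRIPT_ARABIC } Script;

/* Marker for characters that have no font of their own yet. */
#define NO_FONT (-1)

/* Step 1: keep the language tag if it is plausible for the script,
 * otherwise substitute a representative language for that script.
 * Only the Arabic case is shown; other scripts pass through unchanged. */
static const char *corrected_language(Script script, const char *lang)
{
    assert(lang != NULL);

    if (script == SCRIPT_ARABIC) {
        /* Illustrative, incomplete list of languages using Arabic script. */
        if (strcmp(lang, "ar") == 0 || strcmp(lang, "fa") == 0 ||
            strcmp(lang, "ur") == 0)
            return lang;
        return "ar";
    }
    return lang;
}

/* Step 3: give each non-rendering character (marked NO_FONT) the font of
 * the nearest preceding rendering character; leading ones take the font
 * of the first rendering character that follows them. */
static void fill_nonrendering_fonts(int *fonts, size_t len)
{
    size_t i;
    int last = NO_FONT;

    assert(fonts != NULL || len == 0);

    /* Forward pass: inherit from the left. */
    for (i = 0; i < len; i++) {
        if (fonts[i] != NO_FONT)
            last = fonts[i];
        else
            fonts[i] = last;
    }

    /* Backward pass: only leading entries are still NO_FONT here. */
    last = NO_FONT;
    for (i = len; i-- > 0; ) {
        if (fonts[i] != NO_FONT)
            last = fonts[i];
        else
            fonts[i] = last;
    }
}
```

Because spaces and similar characters borrow the font of their neighbours, they no longer force a run break on their own, which is the behaviour the original report asked for.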