GNOME Bugzilla – Bug 382437
Thai language engine improvements
Last modified: 2018-05-22 12:24:10 UTC
testboundaries fails because of the way the thai-lang module works. Namely, it adds line breaks where a line break is prohibited. The fix is indeed to just call into libthai for Thai text. Thep believes that the other bits add some kind of context for libthai, but apparently the current code is broken. Going to fall back to my own idea of breaking the text into pieces of Thai and non-Thai chars and calling into libthai only for the Thai ones.
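The run-splitting idea described above can be sketched as follows. This is a minimal illustration, not the thai-lang module's code: the `Run` struct and the `is_thai` and `split_runs` names are hypothetical, and "Thai" is assumed to mean the Unicode Thai block (U+0E00..U+0E7F). Only the runs flagged as Thai would then be handed to libthai.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one maximal run of same-kind characters. */
typedef struct {
    size_t start;  /* index of the first code point in the run */
    size_t len;    /* number of code points in the run */
    int    thai;   /* 1 if the run is Thai, 0 otherwise */
} Run;

/* Assumption: "Thai" means the Unicode Thai block, U+0E00..U+0E7F. */
static int is_thai(uint32_t cp)
{
    return cp >= 0x0E00 && cp <= 0x0E7F;
}

/* Split text into alternating Thai / non-Thai runs.
 * Writes at most max_runs entries to out and returns the number written. */
static size_t split_runs(const uint32_t *text, size_t len,
                         Run *out, size_t max_runs)
{
    size_t start = 0, n = 0;

    assert(text != NULL || len == 0);
    while (start < len && n < max_runs) {
        int thai = is_thai(text[start]);
        size_t end = start + 1;

        /* Extend the run while the next code point is of the same kind. */
        while (end < len && is_thai(text[end]) == thai)
            end++;

        out[n].start = start;
        out[n].len = end - start;
        out[n].thai = thai;
        n++;
        start = end;
    }
    return n;
}
```

Note that with this split, spaces and periods land in the non-Thai runs, so libthai never sees them as context.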
Changing my mind. I'll ship pango 1.15.1 with this problem, but this should be fixed before pango-1.14.9 can be released.
Hmm.. with libthai 0.1.8 in my box, and pango HEAD, "make check" says:
  ...
  Running test program "testboundaries", please wait: passed
  Running test program "testcolor", please wait: passed
  All tests passed.
  PASS: runtests.sh
  ==================
  All 1 tests passed
  ==================
Latest I can find is libthai 0.1.7. And that passes. So, should we require libthai >= 0.1.7 then? I still have this concern, that libthai is probably not implementing the Unicode line-breaking algorithm, so passing non-Thai chars to it is degrading the results.
Oops, sorry. It was libthai CVS snapshot, with post-release version bumped. Yes, you are right. The latest version is 0.1.7. And requiring libthai >= 0.1.7 should be fine. Regarding passing only Thai chars, what do you mean Thai chars, then? Some punctuations like space and period do have effects on analysis of word boundaries. Period on abbreviations, and space on a certain punctuation mark which is not allowed to be wrapped after space, for example. Should we still selectively pass punctuation marks, or simply the whole ASCII range, to libthai, then? I also have a plan to add more fine-grained API to libthai word break, which can return detailed properties, like word start, word end, line break, etc., so it can work better with pango logical attributes.
2006-12-06 Behdad Esfahbod <behdad@gnome.org>
  Bug 382437 – tests/testboundaries fails
  * configure.in: Require libthai >= 0.1.7
(In reply to comment #4)
> Oops, sorry. It was libthai CVS snapshot, with post-release version bumped.
> Yes, you are right. The latest version is 0.1.7. And requiring libthai >= 0.1.7
> should be fine.
>
> Regarding passing only Thai chars, what do you mean Thai chars, then? Some
> punctuations like space and period do have effects on analysis of word
> boundaries. Period on abbreviations, and space on a certain punctuation mark
> which is not allowed to be wrapped after space, for example. Should we still
> selectively pass punctuation marks, or simply the whole ASCII range, to
> libthai, then?
I'm mostly concerned about chars like « and » for example, or NBSP, or any other char that prohibits a line break on one or both of its sides. These do not convert to the 8-bit Thai encoding and are doomed to get different breaking properties assigned to them. To a lesser extent the issue exists for chars like ( and ) too. I'm not sure that libthai knows how to correctly break around them (correct according to the Unicode spec), but those can be fixed, as the ASCII range is at least representable in the Thai encoding. I understand about the context thing, but I don't want to degrade breaking for the common characters around Thai text.
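To make the representability concern concrete, here is a minimal sketch of a Unicode to TIS-620 mapping. It is not libthai's converter; `unicode_to_tis620`, `count_unrepresentable` and `TIS_NONE` are hypothetical names, and the sketch assumes plain TIS-620, where ASCII maps to itself, U+0E01..U+0E3A and U+0E3F..U+0E5B map to 0xA1..0xDA and 0xDF..0xFB, and everything else (including «, » and NBSP) has no code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no TIS-620 code for this character". */
#define TIS_NONE (-1)

/* Map one Unicode code point to its TIS-620 byte, or TIS_NONE.
 * Assumption: plain TIS-620, with no ISO 8859-11 style NBSP at 0xA0. */
static int unicode_to_tis620(uint32_t cp)
{
    if (cp < 0x80)
        return (int)cp;                    /* ASCII maps to itself */
    if ((cp >= 0x0E01 && cp <= 0x0E3A) || (cp >= 0x0E3F && cp <= 0x0E5B))
        return (int)(cp - 0x0E00 + 0xA0);  /* Thai block, shifted to 0xA1.. */
    return TIS_NONE;                       /* e.g. U+00AB, U+00BB, U+00A0 */
}

/* Count the code points that would lose their identity in conversion. */
static size_t count_unrepresentable(const uint32_t *text, size_t len)
{
    size_t i, lost = 0;

    assert(text != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (unicode_to_tis620(text[i]) == TIS_NONE)
            lost++;
    return lost;
}
```

Under these assumptions, ( and ) survive the conversion, while « and » do not, so whatever stands in for them on the 8-bit side cannot carry their Unicode line-breaking class.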
Created attachment 89582: test program (in utf8 encoding)
Compile/run with:
  gcc -Wall `pkg-config --cflags pangocairo` ../test-pango-word-detection.c `pkg-config --libs pangocairo` && ./a.out
With pango 1.16.4 and libthai 0.1.8, I get cases where the start of a word is marked but not the corresponding end; and, furthermore, I get different word boundaries depending on preceding text. In the following, parens mark word starts and ends. The first example has a missing word end mark after the Thai phrase; the second example (a substring of the first example) correctly marks that word end but then marks ‘”,’ as a word and then marks the following space as beginning a word:
  consider the Thai phrase “ทำการบ้าน”, which has three components
  (consider) (the) (Thai) (phrase) “(ทำ)(การ)(บ้าน”, (which) (has) (three) (components)
  บ้าน”, which has three components
  (บ้าน)(”,)( (which) (has) (three) (components)
I suggest that it is a bug to have two is_word_start items without an intervening is_word_end mark (where is_word_end flagged on the second is_word_start item counts as intervening). I attach above source code (using utf8 encoding) to generate the above output.
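The proposed invariant can be checked mechanically. The sketch below is an illustration only: `LogAttr` is a hypothetical stand-in for the three PangoLogAttr bits discussed here, and `first_unclosed_word_start` is a made-up helper name. An is_word_end on the same item as the second is_word_start counts as intervening, as suggested above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the PangoLogAttr bits used in this thread. */
typedef struct {
    unsigned is_line_break : 1;
    unsigned is_word_start : 1;
    unsigned is_word_end   : 1;
} LogAttr;

/* Return the index of the first is_word_start that follows another
 * is_word_start with no is_word_end in between, or -1 if there is none. */
static long first_unclosed_word_start(const LogAttr *attrs, size_t n)
{
    int open = 0;  /* 1 while a started word has not been ended yet */
    size_t i;

    assert(attrs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* An end on the same item closes the previous word first. */
        if (attrs[i].is_word_end)
            open = 0;
        if (attrs[i].is_word_start) {
            if (open)
                return (long)i;
            open = 1;
        }
    }
    return -1;
}
```

Run over the attributes of the first example, a checker like this would be expected to flag the start of "which", since the word opened at บ้าน is never closed.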
Incidentally, the text of the examples is an excerpt from http://www.unifont.org/textlayout/TheBigPicture.pdf, which claims that the correct splitting is (ทำ)(การบ้าน) (or at least that one should avoid a line break between the second and third components, i.e. not set is_line_break at the beginning of that third component, contrary to current behaviour with pango 1.16.4/libthai 0.1.8: modify the test program in the obvious way to show this). The author (Ed Trager) is not Thai, but cites the Thai page http://vuthi.blogspot.com/2004/07/cttex.html for this claim. N.B. OPENING THIS URL IN TWO GECKO-BASED BROWSERS CAUSED MY X SESSION TO CRASH. So I suggest you use a fresh X session to try viewing the page. w3m in gnome-terminal displayed it fine (at least without crashing, though I wouldn't know whether diacritics are correctly placed).
Correction: It was my window manager (fvwm) that crashed, perhaps not liking having Thai in the title bar.
(In reply to comment #8)
> consider the Thai phrase “ทำการบ้าน”, which has three
> components
> (consider) (the) (Thai) (phrase) “(ทำ)(การ)(บ้าน”,
> (which) (has) (three) (components)
>
> บ้าน”, which has three components
> (บ้าน)(”,)( (which) (has) (three) (components)
This case happens because libthai does not put a line break after the last word. So, it relies on pango_default_break() to handle it. With a slightly modified program that uses pango_get_log_attrs() directly, it returns:
  consider the Thai phrase “ทำการบ้าน”, which has three components
  (consider) (the) (Thai) (phrase) “(ทำ)(การ)(บ้าน)(”, (which) (has) (three) (components)
  บ้าน”, which has three components
  (บ้าน)(”,)( (which) (has) (three) (components)
I have an overdue plan to revise the analysis procedure of libthai's word break library. Just busy with other jobs so far. I'll look at it soon.
> I suggest that it is a bug to have two is_word_start items without an
> intervening is_word_end mark (where is_word_end flagged on the second
> is_word_start item counts as intervening).
Right.
(In reply to comment #9)
> Incidentally, the text of the examples is an excerpt from
> http://www.unifont.org/textlayout/TheBigPicture.pdf, which claims that the
> correct splitting is (ทำ)(การบ้าน) (or at least that one
> should avoid a line break between the second and third components, i.e. not set
> is_line_break at the beginning of that third component, contrary to current
> behaviour with pango 1.16.4/libthai 0.1.8: modify the test program in the
> obvious way to show this). The author (Ed Trager) is not Thai, but cites
> the Thai page http://vuthi.blogspot.com/2004/07/cttex.html for this claim.
This is the case of compound words, which libthai's data still lacks. I still keep adding more compound words to its dictionary.
(In reply to comment #11)
> (In reply to comment #8)
>
> > consider the Thai phrase “ทำการบ้าน”, which has three
> > components
> > (consider) (the) (Thai) (phrase) “(ทำ)(การ)(บ้าน”,
> > (which) (has) (three) (components)
> >
> > บ้าน”, which has three components
> > (บ้าน)(”,)( (which) (has) (three) (components)
>
> This case happens because libthai does not put line break after the last word.
> So, it relies on pango_default_break() to handle it.
To speak more correctly, it's because libthai lacks an API with detailed info on word begin/end. It just returns line break positions, from which the pango thai-lang engine emulates word begin/end by setting both attributes at every line break. Extending the libthai API for this is also in my plan.
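The emulation described above can be sketched as follows. This is not the thai-lang engine's actual code: `LogAttr` is a hypothetical stand-in for the PangoLogAttr bits, `mark_every_break_as_word_edge` is a made-up name, and `brk` is assumed to be a list of break positions (indices into the attribute array) such as a word-break routine might return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the PangoLogAttr bits used in this thread. */
typedef struct {
    unsigned is_line_break : 1;
    unsigned is_word_start : 1;
    unsigned is_word_end   : 1;
} LogAttr;

/* Treat every reported line break position as both the end of the
 * previous word and the start of the next one. */
static void mark_every_break_as_word_edge(LogAttr *attrs, size_t n_attrs,
                                          const int *brk, size_t brk_n)
{
    size_t bi;

    assert(attrs != NULL || n_attrs == 0);
    assert(brk != NULL || brk_n == 0);
    for (bi = 0; bi < brk_n; bi++) {
        /* Ignore positions outside the attribute array. */
        if (brk[bi] < 0 || (size_t)brk[bi] >= n_attrs)
            continue;
        attrs[brk[bi]].is_line_break = 1;
        attrs[brk[bi]].is_word_end = 1;
        attrs[brk[bi]].is_word_start = 1;
    }
}
```

Because no break is reported after the last word, nothing in this scheme sets is_word_end at the end of the run; that has to come from pango_default_break(), which matches the missing end mark seen in comment #8.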
How about using pango_default_break, retaining the existing libthai API, and supplementing pango_default_break's info by setting is_word_start and is_word_end at the word break points that libthai determines, but only for points that are in the middle of a word according to pango_default_break, i.e. only for items that don't already have is_word_start or is_word_end set by pango_default_break? In particular, don't set is_word_end at the first (Thai) character after a space or punctuation mark or the like. I haven't thought about this proposal for long; it might need some changes. I'm assuming that we don't want ‘”,’ to be marked as a word in the above examples, i.e. I assume we want ...บ้าน)”, (which... as Pango does for most languages.
I'll need to adjust the libthai word break routine to mark line break positions more correctly anyway. For example, it should not break the line after ",", which causes the weird result "(บ้าน)(”,)( ..." in the last case. (This has been a planned redesign anyway.) Then, we need to adjust text chunk handling in thai-lang as Behdad already pointed out, so that the line is not broken before a right quotation mark (”), and so on. And your suggestion seems worth a try after that. We can check and skip setting is_word_start and is_word_end if either was already set by pango_default_break(). With a dirty hack for testing purposes, this seems to work.
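The check described above might look roughly like this. It is a sketch of the idea only, not the hack that was actually tested: `LogAttr` is a hypothetical stand-in for the PangoLogAttr bits, `merge_word_edges` is a made-up name, and `attrs` is assumed to already hold the result of pango_default_break().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the PangoLogAttr bits used in this thread. */
typedef struct {
    unsigned is_line_break : 1;
    unsigned is_word_start : 1;
    unsigned is_word_end   : 1;
} LogAttr;

/* Add word edges at the dictionary break positions, but leave alone any
 * position where the default breaker already set a word start or end.
 * Returns the number of positions that were changed. */
static size_t merge_word_edges(LogAttr *attrs, size_t n_attrs,
                               const int *brk, size_t brk_n)
{
    size_t bi, changed = 0;

    assert(attrs != NULL || n_attrs == 0);
    assert(brk != NULL || brk_n == 0);
    for (bi = 0; bi < brk_n; bi++) {
        LogAttr *a;

        if (brk[bi] < 0 || (size_t)brk[bi] >= n_attrs)
            continue;                /* position outside this run */
        a = &attrs[brk[bi]];
        if (a->is_word_start || a->is_word_end)
            continue;                /* default breaker already decided */
        a->is_word_start = 1;        /* break inside a word: close the */
        a->is_word_end = 1;          /* previous word, open the next   */
        changed++;
    }
    return changed;
}
```

With this guard, a Thai character that directly follows a space keeps its plain is_word_start and does not also gain an is_word_end.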
Created attachment 89641: use pango_default_break (for discussion)
Given that a Thai run can include SCRIPT_COMMON characters, "in middle of word" isn't the same as "neither is_word_start nor is_word_end from pango_default_break". The attached patch uses the in-middle-of-word approach, and warns for other cases. This patch is for experimentation in conjunction with the test-pango-word-detection.c program.
Changes needed for stable version:
- Remove the "not sure what to do" warning. (Or refine the cases we're not sure about, and include part of the input string in the message.)
- See the comment below g_assert.
- More testing.
From my limited experimentation just now, it appears that it already does the right thing [ignoring the compound word problem, which I gather is just a matter of adding to the compound dictionary]. E.g. it now gives:
  ... |(phrase) |“(ทำ)|(การ)|(บ้าน)”, |(which) ...
(where ‘|’ indicates is_line_break), and
  (บ้าน)¡|¢|£|¤|¥¦§¨©(บ้าน)
which look correct to me (or at least consistent with pango behaviour for other languages).
Created attachment 89642: Program to show word/line breaking (assumes utf8 encoding)
Updated version with a couple more example texts, and showing is_line_break positions.
> Given that a Thai run can include SCRIPT_COMMON characters, "in middle of word"
> isn't the same as "neither is_word_start nor is_word_end from
> pango_default_break".
I mean to get rid of those cases of excessive line breaks by fixing libthai and the handling of encoding conversion. So, given that, "in middle of word" should reduce to "neither is_word_start nor is_word_end from pango_default_break()". But I haven't decided to take that claim yet, as I would like to fix the above issues first. It's a good thing to be defensive, anyway.
About your patch: IIRC, you don't need to call pango_default_break() at the beginning. It was already called before entering the language engine. For the case of "not sure what to do", it's likely to be libthai's bug, which, as a libthai maintainer, I mean to fix.
OTOH, there will be another case outside words to handle: some certain Thai characters, e.g. 'ๆ', are often written with leading space, but it should not be wrapped to the next line. This is the opposite case to the excessive line break after ',' above: pango should obey libthai for the absence of is_line_break in such a case. (I need to fix libthai for that, too. Well, you've heard many "libthai fixes" from me so far. In fact, the current implementation of libthai's pre-itemization before the actual dictionary-based Thai word analysis was just briefly written and was planned to be thrown away. Only the actual analysis was written with design, after the old cttex-based code was totally replaced.)
(In reply to comment #16)
> Created an attachment (id=89641)
> use pango_default_break (for discussion)
No need to call pango_default_break() there. Pango has already run pango_default_break() on the entire paragraph before calling the lang engine.
(In reply to comment #18)
> OTOH, there will be another case outside words to handle: some certain Thai
> characters, e.g. 'ๆ', are often written with leading space, but it should not
> be wrapped to the next line. This is the opposite case to excessive line break
> after ',' above: pango should obey libthai for the absence of is_line_break for
> such case.
In terms of Unicode TR#14, such characters, namely "ฯ" (U+0E2F - THAI CHARACTER PAIYANNOI) and "ๆ" (U+0E46 - THAI CHARACTER MAIYAMOK), should be in the EX (Exclamation/Interrogation) class (the same class as "!", "?"). They are actually neither exclamation marks nor question marks, but according to the Unicode Line Breaking Algorithm, it's the closest class. Among the classes listed in LB13 that prohibit breaks before, they are not CL (Closing Punctuation) because they do not end sentences, nor are they NS (Nonstarter) because no break is allowed before them, even with leading space(s). And they are not used in numeric contexts, so they are not IS or SY, either.
PAIYANNOI is the Thai text omission mark. For example, instead of writing the full Thai name of Bangkok, which is claimed to be the longest city name in the world (~ 140 Thai characters long, excluding intervening spaces), we just write "กรุงเทพฯ" in general documents.
MAIYAMOK is the Thai word repetition mark. It makes the preceding word be pronounced twice. For example "แดง ๆ" is pronounced "Dang Dang" ("Dang" [red] pronounced twice to mean "reddish").
Line breaks are not allowed before either of them. Probably, a more appropriate place to fix this issue is Unicode.org. But while it's still not fixed there, what is the pango/glib strategy for this kind of patching?
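One way such a local override could look, independent of where it would eventually live, is sketched below. It is an assumption-laden illustration, not pango, glib or libthai code: `LogAttr` is a hypothetical stand-in for the PangoLogAttr bits, and `forbids_break_before` and `clear_breaks_before_marks` are made-up names. It relies on the fact that a break opportunity after spaces is recorded at the first non-space character, so clearing is_line_break on the mark itself also covers the leading-space case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the PangoLogAttr bits used in this thread. */
typedef struct {
    unsigned is_line_break : 1;
    unsigned is_word_start : 1;
    unsigned is_word_end   : 1;
} LogAttr;

/* PAIYANNOI (U+0E2F) and MAIYAMOK (U+0E46) behave like class EX here. */
static int forbids_break_before(uint32_t cp)
{
    return cp == 0x0E2F || cp == 0x0E46;
}

/* Clear any line break opportunity recorded directly before the two marks.
 * attrs must have at least len entries. Returns the number of cleared breaks. */
static size_t clear_breaks_before_marks(const uint32_t *text, LogAttr *attrs,
                                        size_t len)
{
    size_t i, cleared = 0;

    assert((text != NULL && attrs != NULL) || len == 0);
    for (i = 0; i < len; i++) {
        if (forbids_break_before(text[i]) && attrs[i].is_line_break) {
            attrs[i].is_line_break = 0;
            cleared++;
        }
    }
    return cleared;
}
```

For "แดง ๆ" this removes the opportunity that a default breaker would place after the space, so the repetition mark stays on the same line as the word it repeats.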
Re pango_default_break call not needed: Heh, after posting the patch and going to bed, the two things I thought about doing next were to try removing the pango_default_break call, and to check that we get correct word breaking for an English word in the middle of Thai text. I did both tests next morning, just before checking my mail and seeing both your messages :) . But yes, English in the middle of Thai did work (for the one test I tried, namely my best translation of "do Pango homework" assuming the same word order as in English).
The patch I posted has a bug of not exiting the loop once bi reaches brk_n. It might be a few days until I post a revised patch, unless you mail me first.
We should think about what we intend to do about TR#29 (http://www.unicode.org/reports/tr29/), the revised recommendation for text boundaries (including word boundaries) other than line breaks (replacing §5.15 of the Unicode 3 book). It talks in terms of boundaries rather than starts & ends, and it tends to put a boundary around every symbol and space, which might not be desirable for current users of pango is_word_{start,end}; though there are probably only a small number of users. We might consider adding an is_word_boundary bit: having two forms of "word" might make it easier for each of them to cater best to different use cases (see opening para of §4 for a list, plus dictionary lookup for spelling/hyphenation). (Incidentally, as a user, I'm looking forward to dropping the word boundaries between ASCII letters and numerals.)
My update: I've finished the redesign of the itemization code of libthai word break. The initial change has been committed to its CVS. It's now more compatible with TR/UAX #14. The case of MAIYAMOK and PAIYANNOI is also addressed. Just testing and fine-tuning before releasing the new version. Still pondering whether to add the new "word boundary" (in addition to existing "word break") API in this version, though.
(In reply to comment #22)
> Still pondering whether to add the new "word boundary" (in addition to existing
> "word break") API in this version, though.
I mean, in addition to existing "line break" API.
LibThai 0.1.9 has been released. [1] Please check it out. To not delay it further by my illness, I have postponed the API adjustment plan for now. In this release, only UAX#14 and compound words issues are done.
[1] http://linux.thai.net/node/73
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/64.