GNOME Bugzilla – Bug 313907
Update break.c to handle new line-breaking types in Unicode 4.1
Last modified: 2005-11-05 00:40:39 UTC
I'm attaching a patch for break.c to handle the new line-breaking types (Conjoining Jamo handling) in Unicode 4.1. The logic is exactly the same in UAX#14 (Line Breaking) and UAX#29 (Text Boundaries), so I have used the same code for both, which is neat. I used the testing patch in bug #97545 (by Noah Levitt) to verify that the Jamo handling in grapheme clusters still functions correctly, and after fixing the bugs, it does. When running that test, I also changed the '\n' that was being added to the end of paragraphs to a PARAGRAPH_SEPARATOR, since '\n' plays tricks if preceded by '\r'. After these changes, it passes all tests in GraphemeClusterBreakTest.txt of Unicode 4.1. (I'm working on the rest too.)
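For readers unfamiliar with the Conjoining Jamo rules mentioned above, here is a minimal sketch of the shared "no break" logic (UAX#29 rules GB6 to GB8, which correspond to the JL/JV/JT/H2/H3 rules in UAX#14). It is not the code from the attached patch: the names `JamoType`, `jamo_type` and `jamo_no_break` are hypothetical, and the code point ranges are simplified assumptions; a real implementation would take the classification from the Unicode data tables in glib.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification, for illustration only. */
typedef enum { JAMO_NONE, JAMO_L, JAMO_V, JAMO_T, JAMO_LV, JAMO_LVT } JamoType;

/* Simplified ranges (an assumption): the whole leading, vowel and trailing
 * blocks of Hangul Jamo, plus the precomposed Hangul syllables. */
static JamoType
jamo_type (uint32_t wc)
{
  if (wc >= 0x1100 && wc <= 0x115F) return JAMO_L;
  if (wc >= 0x1160 && wc <= 0x11A7) return JAMO_V;
  if (wc >= 0x11A8 && wc <= 0x11FF) return JAMO_T;
  if (wc >= 0xAC00 && wc <= 0xD7A3)
    /* A syllable with no trailing consonant is LV, otherwise LVT. */
    return ((wc - 0xAC00) % 28 == 0) ? JAMO_LV : JAMO_LVT;
  return JAMO_NONE;
}

/* Returns true if no break is allowed between prev and next:
 *   L          x (L | V | LV | LVT)
 *   (LV | V)   x (V | T)
 *   (LVT | T)  x T
 */
static bool
jamo_no_break (uint32_t prev, uint32_t next)
{
  JamoType n = jamo_type (next);

  switch (jamo_type (prev))
    {
    case JAMO_L:
      return n == JAMO_L || n == JAMO_V || n == JAMO_LV || n == JAMO_LVT;
    case JAMO_LV:
    case JAMO_V:
      return n == JAMO_V || n == JAMO_T;
    case JAMO_LVT:
    case JAMO_T:
      return n == JAMO_T;
    default:
      return false;
    }
}
```

Because the same predicate answers both questions, one helper of this shape can be consulted from the grapheme-boundary pass and from the line-break pass.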
Created attachment 50962: mentioned patch. This requires the Unicode 4.1 data in glib.
Ok, this can be applied now, after requiring glib 2.9. Awaiting review.
You should probably branch pango before applying this, otherwise you'll bump the glib requirement in the middle of a stable series, which should not happen. I won't claim to understand all the break algorithm changes in the patch, but it looks generally sane to me. One change which makes me wonder is the following one:

@@ -520,7 +606,7 @@
   /* This is how we fill in the last element (end position) of the
    * attr array - assume there's a newline off the end of @text.
    */
-  next_wc = '\n';
+  next_wc = PARAGRAPH_SEPARATOR;
 }
 else
 {

Why is this? It makes the preceding comment wrong, and it scares me a bit if the rest of pango makes the assumption that there is a newline at the end...

I also noted that some comments in the patch refer to Unicode 4.2. You probably want to make sure that the documentation refers to the right versions (both in the comments and in the API docs).
Thanks Matthias. We already have Pango 1.10 branched. HEAD is 1.11 now. About that change, as I wrote originally: "When testing with the mentioned test, I also changed the '\n' that was being added to the end of paragraphs to a PARAGRAPH_SEPARATOR, since '\n' plays tricks if preceded by '\r'." The idea of adding '\n' is an internal implementation detail of pango_default_break, to force a line break opportunity at the end of string, but \n doesn't work if preceded by \r. PARAGRAPH_SEPARATOR does. I will check the docs, and apply. Thanks.
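To illustrate the point about the end-of-text sentinel, here is a toy model, not Pango's actual code. It assumes the only rules in play are "never break between CR and LF" and "always break around a hard line terminator"; the helper names `break_between` and `boundary_at_end` are hypothetical, and the macro below is defined locally as U+2029, the Unicode PARAGRAPH SEPARATOR code point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PARAGRAPH_SEPARATOR 0x2029u  /* U+2029, defined here for the sketch */

/* Toy boundary rule between two adjacent characters. */
static bool
break_between (uint32_t prev, uint32_t next)
{
  /* CR LF is kept together as a single unit. */
  if (prev == '\r' && next == '\n')
    return false;

  /* Otherwise there is a boundary on either side of a hard terminator. */
  if (prev == '\r' || prev == '\n' || prev == PARAGRAPH_SEPARATOR)
    return true;
  if (next == '\r' || next == '\n' || next == PARAGRAPH_SEPARATOR)
    return true;

  /* Toy model: everything else sticks together. */
  return false;
}

/* Is there a boundary at the end position when an imaginary
 * sentinel character is assumed to follow the text? */
static bool
boundary_at_end (const uint32_t *text, size_t len, uint32_t sentinel)
{
  if (len == 0)
    return true;
  return break_between (text[len - 1], sentinel);
}
```

Under these assumptions, text ending in '\r' gets no boundary at its end when the sentinel is '\n', because the pair is read as CR LF, while a PARAGRAPH_SEPARATOR sentinel yields a boundary regardless of the last character.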
2005-11-04  Behdad Esfahbod  <behdad@gnome.org>

        * pango/break.c: Update to handle new line-breaking types in the
        Unicode 4.1 UAX#14. (#313907)

        * configure.in: Bump required glib version to 2.9.0.  Needed for
        above-mentioned line-breaking types.