GNOME Bugzilla – Bug 97545
Make pango_default_break follow Unicode TR #29
Last modified: 2017-08-31 22:50:13 UTC
gtk_text_iter_starts_word, gtk_text_iter_forward_word_end, etc. get confused when dealing with words that have apostrophes: "don't" is recognized as the (non-)word "don" followed by "t". It looks like this is a Pango bug in how words are broken. I'm interested in fixing this myself, but I'm not sure where to begin. Are apostrophes English-specific enough to require an English-specific word-breaking module? Will the potential use of single quotes confuse apostrophe handling? (Also, I feel like I've submitted this before. Sorry if this is a repeat bug, but I can't remember any resolution if there was one. This is an important bug for spell-checking purposes, because I use an internationalized dictionary library and have so far managed not to rely on any language-specific features by letting Pango handle word breaking.)
Created attachment 12007 [details] gtk2 test program exhibiting behavior -- jam on the "forward" and "backward" buttons to see where the words are breaking
Not necessarily a practical answer, but the problem here is that U+0027 is an ambiguous character doing triple duty as:

U+02BC modifier letter apostrophe (what you want)
U+2018 left single quotation mark
U+2019 right single quotation mark

Youʼll note that U+0027 isn't even the right glyph for an apostrophe, though weʼve gotten pretty used to it. You'll find that if you use U+02BC, it wonʼt cause a word break; if you wanted to treat U+0027 as a U+02BC for break determination, I think you'd basically need a dictionary and an English-specific shape engine. I'm not sure of the range of use of the apostrophe. None of the languages I'm familiar with other than English use it much.
Hmm, actually, looking at Unicode TR 29 (http://www.unicode.org/reports/tr29/tr29-1.html), it's just a bug in the Pango break implementation. U+0027 should have the class "MidLetter", though the report does mention that some language-based tailoring may be useful to make, for instance, the French l'objectif two words. [ Not trying to get fancy here after seeing what bugzilla+mozilla did to my last comment ]
TR29 seems to be an update of section 5.15; the current code is based on 5.15 only (TR29 came out a few months after break.c was written). We probably need to do a comprehensive rework of the code in light of TR29, though perhaps this should wait until TR29 leaves draft status.
Retitling to dup various associated bugs here.
*** Bug 61726 has been marked as a duplicate of this bug. ***
*** Bug 97861 has been marked as a duplicate of this bug. ***
*** Bug 63398 has been marked as a duplicate of this bug. ***
> We probably need to do a comprehensive rework of the code in light of
> TR29, though perhaps this should wait until TR29 leaves draft status.

This has happened; it's now known as UAX #29. http://www.unicode.org/reports/tr29/
My primary emotion here is "fear". The new TR does seem to have test data, though, so we could build a test suite based on that, which would help.
Created attachment 19215 [details] [review] new test program using tr29 test data
I'm not sure which of is_word_start or is_word_end corresponds to a word boundary in TR29, so I tentatively made them both count. (Unicode 3.0 5.15 also talks about word boundaries, not start and end, so I'm sure we don't have to add a new field or anything.)
*** Bug 57375 has been marked as a duplicate of this bug. ***
Making them both count looks right when we look at the examples in section 4 of TR #29. Though there may also be boundaries in Unicode that won't be Pango boundaries at all. I think a "word" in Pango is "a word that contains a letter" in TR #29.
*** Bug 138180 has been marked as a duplicate of this bug. ***
*** Bug 144670 has been marked as a duplicate of this bug. ***
Created attachment 30280 [details] [review] Patch against HEAD

Moving discussion from bug 118347 to here. TR29 removes all Format characters (General Category = Cf) before determining word boundaries. One-line patch attached. I agree that a rewrite of break.c is much needed.
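For readers not familiar with break.c: the effect of the patch is roughly the check sketched below (an illustration only, not the actual one-line patch; the helper name is made up):

#include <glib.h>

/* Sketch only: UAX #29 says Format (General Category = Cf) characters
 * are ignored when determining word boundaries, so the word-break pass
 * should treat them as transparent rather than as word separators.
 * This helper name is hypothetical; the real patch adjusts the existing
 * logic inside pango_default_break() instead. */
static gboolean
is_ignorable_for_word_breaks (gunichar ch)
{
  return g_unichar_type (ch) == G_UNICODE_FORMAT;
}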
Should I apply the patch in comment #17?
Owen, now that you are applying patches, would you please apply the minimal patch in comment #17?
Thu Dec 2 15:31:33 2004  Owen Taylor  <otaylor@redhat.com>

	* pango/break.c (pango_default_break): Ignore formatting
	characters when determining word boundaries.
	(Part of #97545, Behdad Esfahbod)
A related bug: Bug 313907: Update break.c to handle new line-breaking types in Unicode 4.1
I'm actually working on this.
Behdad: any news on this bug?
I have a local tree that almost does this. Pretty high priority in the next devel cycle.
Automatic ping service ;) Behdad: any news on this bug? This will be very important if we are going to add spell-checking support to the gtk+ stack (see bug #131576 for why solving this bug is important).
I'm not sure I understood how this patch would work. With this change, "don't" and other strings containing a "'" will be considered single words, is that right?

Paolo, if I am right, it could have negative effects on spell checking for some languages, such as Italian or French. In English "don't" is a single word, but in Italian something like "un'arancia" ("an orange") is two words. AFAIK the dictionaries for these languages do not contain strings formed by an article, an apostrophe and a word; adding them to the dictionaries would mean adding every combination of an article and a word starting with a vowel.
(In reply to comment #26)
> I'm not sure I understood how this patch would work. With this change,
> "don't" and other strings containing a "'" will be considered single
> words, is that right?
>
> Paolo, if I am right, it could have negative effects on spell checking
> for some languages, such as Italian or French. In English "don't" is a
> single word, but in Italian something like "un'arancia" ("an orange")
> is two words. AFAIK the dictionaries for these languages do not contain
> strings formed by an article, an apostrophe and a word; adding them to
> the dictionaries would mean adding every combination of an article and
> a word starting with a vowel.

Now that is a very good reason to write a Latin language engine for Pango. This cycle we've added three lang engines (Arabic, Indic, Thai), and Latin can be next. In a language engine you can tailor all logical attributes based on language.
This blocks a problem labelled "High" priority that was scheduled for GNOME 2.6 and still hasn't happened, although it continues to attract negative commentary. So I'm raising this to High; let's see if it gets any further attention. "Couldn't" is just one word in English, and it shouldn't be impossible to get that right in software as complicated as Pango.
Some may have noticed that I've been working on this since last night. Grapheme boundaries are updated to UAX #29. Working on the rest.

2008-04-24  Behdad Esfahbod  <behdad@gnome.org>

	Part of Bug 97545 – Make pango_default_break follow Unicode TR #29
	Patch from Noah Levitt

	* tests/Makefile.am:
	* tests/runtests.sh.in:
	* tests/testboundaries_ucd.c (count_attrs), (parse_line),
	(attrs_equal), (make_test_string), (do_test), (main):
	Add test driver for UAX#14 and UAX#29 test data from the Unicode
	Character Database.  Just drop the following four files in
	pango/tests for it to use them:

		GraphemeBreakTest.txt
		LineBreakTest.txt
		SentenceBreakTest.txt
		WordBreakTest.txt

2008-04-24  Behdad Esfahbod  <behdad@gnome.org>

	Part of Bug 97545 – Make pango_default_break follow Unicode TR #29

	* pango/break.c (pango_default_break): Make Grapheme Boundary
	code exactly follow UAX#29 of Unicode 5.1.0
Word Boundaries implemented too. Now I need to adjust is_word_start/end to not cross word boundaries.

2008-04-24  Behdad Esfahbod  <behdad@gnome.org>

	Part of Bug 97545 – Make pango_default_break follow Unicode TR #29

	* docs/tmpl/main.sgml:
	* pango/break.c (pango_default_break):
	* pango/pango-break.h:
	* tests/testboundaries_ucd.c (main):
	Add new PangoLogAttr member is_word_boundary that implements
	UAX#29's Word Boundaries semantics.  Test fully passes for it.
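For application code, the new attribute can be queried through the usual log-attrs API. A minimal sketch (error handling omitted; it assumes the is_word_boundary field from the commit above is available):

#include <string.h>
#include <pango/pango.h>

/* Sketch: compute logical attributes for a UTF-8 string and print the
 * character offsets that UAX #29 considers word boundaries. */
static void
print_word_boundaries (const char *text)
{
  glong n_chars = g_utf8_strlen (text, -1);
  PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);

  pango_get_log_attrs (text, strlen (text), -1,
                       pango_language_get_default (),
                       attrs, n_chars + 1);

  for (glong i = 0; i <= n_chars; i++)
    if (attrs[i].is_word_boundary)
      g_print ("word boundary before character %ld\n", i);

  g_free (attrs);
}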
*** Bug 539655 has been marked as a duplicate of this bug. ***
*** Bug 399829 has been marked as a duplicate of this bug. ***
Has there been any progress on this bug recently (i.e. in the last two years)? Fixing it is necessary in order to fix bug 131576. If progress has stopped, can it be re-assigned? Would anyone like to discuss the blocking issue?
*** Bug 633097 has been marked as a duplicate of this bug. ***
Created attachment 242980 [details] [review] Patch: Don't break words between adjacent numbers and letters

If I read UAX #29, version 6.2, correctly, we are not supposed to break "123foo" into two words anymore. This patch fixes that.

I didn't update the WordType enum because I didn't know what to do. Should we have only two types now? Can we do that without breaking anything?
That's probably fine. I think you should update the documentation a bit too. I wouldn't mind if you pushed it to master.
Review of attachment 242980 [details] [review]:

Committed, as approved by Behdad.
Hi, can a developer take care of dependent bug 700103?

Currently pango-break marks word starts/ends incorrectly. For instance, take the string "abc·def", where · is U+00B7.

Word boundaries: "abc·def" is a single word according to word boundaries (marked here with |): "|abc·def|". That's fine, following UAX #29. :)

Word starts: Pango marks 2 word starts: "|abc·|def". That's wrong.

Word ends: Pango marks 2 word ends: "|abc·def|". That's wrong.

Expected word starts: "|abc·def"
Expected word ends: "abc·def|"

This bug is really annoying, because word selection with the mouse, cursor movement and spell-checking fail for Catalan. Thanks for your help.
Hi there, I have made a patch that makes word starts and ends consistent with Unicode UAX #29. The algorithm is pretty simple: if it finds a letter or a number between a pair of word boundaries, then it takes those boundaries as the beginning and end of a word. With this, "shouldn't" and "can't" appear as a single word. It also fixes bug #700103.

Based on some tests, it looks like it works OK, at least as far as English is concerned. However, it might have ruined word breaking for languages that I haven't tested. I also attach a program that breaks a text into words, which can be used for testing purposes.

echo "Irregular forms: \"ain't\", \"don't\", \"won't\", \"shan't\". \"n't\" can only be attached to an auxiliary verb which is itself not contracted." | ./wordbreak
> Irregular <
> forms <
> ain't <
> don't <
> won't <
> shan't <
> n't <
> can <
> only <
> be <
> attached <
> to <
> an <
> auxiliary <
> verb <
> which <
> is <
> itself <
> not <
> contracted <
Created attachment 283412 [details] [review] Patch: calculate word start and end based on word boundaries
Created attachment 283413 [details] Test program to test word starts and ends
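In very rough terms, the idea of the patch in attachment 283412 is the following (a simplified sketch, not the actual patch; it assumes the is_word_boundary attributes have already been computed for the text):

#include <pango/pango.h>

/* Sketch of the approach: walk the segments delimited by word
 * boundaries and mark a segment as a word iff it contains at least one
 * letter or digit, setting is_word_start at its first character and
 * is_word_end just past its last.  The real patch does this inside
 * pango_default_break(). */
static void
mark_words_from_boundaries (const gunichar *chars, glong n_chars,
                            PangoLogAttr   *attrs /* n_chars + 1 entries */)
{
  glong seg_start = 0;

  for (glong i = 1; i <= n_chars; i++)
    {
      if (!attrs[i].is_word_boundary)
        continue;

      gboolean contains_word_char = FALSE;
      for (glong j = seg_start; j < i; j++)
        if (g_unichar_isalpha (chars[j]) || g_unichar_isdigit (chars[j]))
          contains_word_char = TRUE;

      if (contains_word_char)
        {
          attrs[seg_start].is_word_start = TRUE;
          attrs[i].is_word_end = TRUE;
        }

      seg_start = i;
    }
}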
(In reply to comment #42)
> Created an attachment (id=283412) [details] [review]
> Patch: calculate word start and end based on word boundaries

I'm no pango developer, but (sadly) I do have some little experience with the word-breaking rules from the Unicode standard.

In the original code in the repo, the logic is looking for a "word end" to see where to break, which is totally opposite to what Unicode suggests, which is to break at every position "except for when you don't have to break". The rules that specify where there isn't a word break are not many (~13 IIRC), so the logic should try to apply those rules, and if no such rule applies, consider the position a word break. This is what e.g. libunistring or libicu do, and this was actually one of the reasons that kind of forced us in Tracker to switch to a non-Pango-based word breaker.

The Unicode rules will make the word breaking work for any kind of script, including when the text mixes them (e.g. Latin mixed with Katakana). E.g. a raw implementation of the WB algorithm I wrote a while ago:
http://bazaar.launchpad.net/~gnu-pdf-team/gnupdf/trunk/view/head:/src/base/pdf-text-ucd-wordbreak.c#L1017

The new approach in the suggested patch actually tries to follow some of the rules in the standard, but not all of them. It's probably fixing some cases, but likely breaking others (e.g. non-Latin scripts).

The following link shows some of the unit tests we have in Tracker that actually check the word-breaking algorithm used (libunistring- or libicu-based, but with some additional random rules we added...):
https://git.gnome.org/browse/tracker/tree/tests/libtracker-fts/tracker-parser-test.c#n303
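To make the suggested structure concrete, here is a deliberately tiny illustration of the "break everywhere unless a no-break rule matches" control flow. Only two of the real rules are modelled, g_unichar_isalpha() stands in for the ALetter class, and is_mid_letter() lumps together a few MidLetter/MidNumLet characters, so this is not a conforming implementation, just the shape of one:

#include <glib.h>

/* Crude approximation of the MidLetter/MidNumLet classes used below. */
static gboolean
is_mid_letter (gunichar ch)
{
  /* U+0027 APOSTROPHE, U+00B7 MIDDLE DOT, U+2019 RIGHT SINGLE QUOTATION MARK */
  return ch == 0x0027 || ch == 0x00B7 || ch == 0x2019;
}

/* Default answer is "boundary"; the no-break rules veto it. */
static gboolean
word_boundary_before (const gunichar *s, glong len, glong i)
{
  if (i <= 0 || i >= len)
    return TRUE;                                   /* WB1/WB2: sot, eot */

  if (g_unichar_isalpha (s[i - 1]) && g_unichar_isalpha (s[i]))
    return FALSE;                                  /* WB5: letter x letter */

  if (i + 1 < len &&
      g_unichar_isalpha (s[i - 1]) && is_mid_letter (s[i]) &&
      g_unichar_isalpha (s[i + 1]))
    return FALSE;                                  /* WB6 */

  if (i >= 2 &&
      g_unichar_isalpha (s[i - 2]) && is_mid_letter (s[i - 1]) &&
      g_unichar_isalpha (s[i]))
    return FALSE;                                  /* WB7 */

  return TRUE;                                     /* otherwise, break */
}

A conforming implementation handles the full class set (Katakana, Numeric, ExtendNumLet, etc.) the same way, which is why the per-position logic stays simple even when all the rules are present.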
Thanks for the background information, that's useful. I've recently started to add tests to pango; see tests/markup-parse.c and tests/test-layout.c. Having a test in the same style for word boundaries and breakpoints would be awesome to get this moving. There's an older test for boundaries as well; maybe that can serve as a starting point.
(In reply to comment #45)
> Thanks for the background information, that's useful.

Should probably have read the code a bit more before commenting, actually... The current logic in Pango does try to follow some of the WB rules from the Unicode TR, so it's likely that it just needs a review to check why not every rule is being applied properly (I guess). The last suggested patch doesn't help; it actually removes some of the mandatory logic in the WB algorithm.

> I've recently started to add tests to pango; see tests/markup-parse.c and
> tests/test-layout.c. Having a test in the same style for word boundaries
> and breakpoints would be awesome to get this moving. There's an older
> test for boundaries as well; maybe that can serve as a starting point.

That's actually nice to have. I'd suggest using libunistring or libicu outputs to compare with the WB rules in Pango.
(In reply to comment #46)
> The last suggested patch doesn't help; it actually removes some of the
> mandatory logic in the WB algorithm.

The patch I posted in comment #42 sets the word_start and word_end attributes based on the word-boundary computations, which in theory follow the UAX #29 guidelines. The current situation is that these attributes are set according to different rules and are inconsistent with the word boundaries (see comment #40).

UAX #29 suggests the following method for determining what is a word once you have a bunch of word boundaries:

"Proximity tests in searching determines whether, for example, “quick” is within three words of “fox”. That is done with the above boundaries by ignoring any words that do not contain a letter, as in Figure 2. Thus, for proximity, “fox” is within three words of “quick”. This same technique can be used for “get next/previous word” commands or keyboard arrow keys. Letters are not the only characters that can be used to determine the “significant” words; different implementations may include other types of characters such as digits or perform other analysis of the characters."

This is roughly what my patch does. If there is a letter or a digit within a pair of word boundaries, it is a word and we set the word_start and word_end attributes; otherwise it isn't. What other approach do you suggest for setting these attributes?
Created attachment 288560 [details] [review] Draft patch from 2008

Attaching my patch from 2008. Not sure if it's of any use. I don't think I'll get to finish it any time soon.
*** Bug 751125 has been marked as a duplicate of this bug. ***
We have worked on this bug and got to the point where we found that it is due to the implementation of the pango_default_break function in break.c, which does not handle the single quote and treats it as a word separator. Could you shed some light on this problem?
I think this is for the most part done now, in other bugs. Closing, as keeping it open doesn't help anything.