GNOME Bugzilla – Bug 551489
API doc for atk_text_get_text_before/at/after_offset() are not consistent
Last modified: 2011-05-30 06:16:29 UTC
Please describe the problem: The API docs are at http://library.gnome.org/devel/atk/stable/AtkText.html. The specifications of the APIs use the terms "word start", "word end", and "inside a word". I think we first need to clarify these terms (sorry for my ignorance if the definitions are obvious). Take this string for example:
"see a dog" // the string
 0123456789 // the offset
As I understand it, offsets 0, 4, 6 are the "word start"s and 3, 5, 9 are the "word end"s. For offset 4, is it "inside a word" or "outside a word"? Let's take offset 4 as "inside a word" for now, so that the specification below for atk_text_get_text_at_offset() makes sense:
"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start at or before the offset to the word start after the offset. The returned string will contain the word at the offset if the offset is inside a word and will contain the word before the offset if the offset is not inside a word."
But that makes the specification of atk_text_get_text_before_offset() contradict itself:
"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start before the word start before the offset to the word start before the offset. The returned string will contain the word before the offset if the offset is inside a word and will contain the word before the word before the offset if the offset is not inside a word."
Either way, some places in the specifications contradict themselves.
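To make the terms concrete, here is a minimal sketch that computes the "word start" and "word end" offsets the report has in mind. It assumes a word is any run of non-whitespace characters and that a word end is the offset just past the word's last character; `word_boundaries` is a hypothetical helper for illustration, not part of ATK.

```python
import re

def word_boundaries(text):
    """Return (starts, ends) for every run of non-whitespace characters.

    Assumed convention: a word start is the offset of the first character,
    a word end is the offset immediately after the last character.
    """
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    starts = [start for start, _ in spans]
    ends = [end for _, end in spans]
    return starts, ends

starts, ends = word_boundaries("see a dog")
print(starts)  # [0, 4, 6]
print(ends)    # [3, 5, 9]
```

Under this convention offset 4 is both a word start and strictly before the matching word end, which is why the report treats it as "inside a word".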
We are creating automated tests for Mozilla that exercise the methods of text accessibles, so it's very important to get your opinion on Evan's issue. Any feedback? Thank you.
IA2 doesn't have WORD_START/END, just WORD. Using IBM Lotus Symphony as a data point, if you use "see a dog" and call IAText::textAt/After/BeforeOffset with boundary type WORD you get
for offset 4
At: "a", 4, 5
After: "dog", 6, 9
Before: "see", 0, 3
for offset 5
At: "dog", 6, 9 <-- this is a bug, "" should be returned.
After: "dog", 6, 9
Before: "a", 4, 5
BTW, the docs for IA2's IAText::textAt/After/BeforeOffset have some typos. Here are the corrections, which I will make in the IA2 IDL.
textAtOffset: the following sentence should be deleted: "For example, if text type is IA2_TEXT_BOUNDARY_WORD, then the complete word that is closest to and located before offset is returned."
textAfterOffset: the word "before" should be changed to "after" in this sentence: "For example, if text type is IA2_TEXT_BOUNDARY_WORD, then the complete word that is closest to and located before offset is returned."
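The expected values in the comment above can be reproduced with a small model. This is only a sketch of the behaviour described in the comment, not the IA2 or Symphony implementation; the function names are hypothetical, words are assumed to be runs of non-whitespace, and `None` stands in for "nothing to return".

```python
import re

def words(text):
    # (word, start, end) for every run of non-whitespace characters
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def text_at(text, offset):
    # the word whose characters cover the offset, if any
    for word in words(text):
        if word[1] <= offset < word[2]:
            return word
    return None

def text_after(text, offset):
    # the closest complete word that starts after the offset
    for word in words(text):
        if word[1] > offset:
            return word
    return None

def text_before(text, offset):
    # the closest complete word that ends at or before the offset
    result = None
    for word in words(text):
        if word[2] <= offset:
            result = word
    return result

print(text_at("see a dog", 4))      # ('a', 4, 5)
print(text_at("see a dog", 5))      # None
print(text_before("see a dog", 5))  # ('a', 4, 5)
```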
I see there is another bug in the IA2 IDL comments. All three of IAText::textAt/After/BeforeOffset have this sentence: "If the index is valid, but no suitable word (or other text type) is found, an empty text segment is returned." That sentence should be removed. The return value information is correct: S_FALSE ...if there is nothing to return; [out] values are 0s and NULL respectively
BTW, The IA2 docs are at: http://www.linuxfoundation.org/~ptbrunet/ia2/docs/html/
IAccessible2 is simpler here (that's good :)); ATK is more complicated. Since the Gecko accessibility API is similar to ATK in these methods, we first need to clarify Evan's question (he is the reviewer of my patch in https://bugzilla.mozilla.org/show_bug.cgi?id=452769). Once this is clarified, I will make sure we are also correct for IA2.
Currently I don't have time to look into this. Please refer to gail's code, there is implementation of text interface.
(In reply to comment #6)
> Currently I don't have time to look into this. Please refer to gail's code,
> there is implementation of text interface.
Li, is gail's code fully tested? Firefox also has a text interface implementation, but I'm not sure it's 100% correct.
In comment 2 above, I indicated that when IA2::textAtOffset is requested for an offset on whitespace the return should be "". That is wrong. The returned string should be a NULL pointer, the returned offsets should be 0, and the HRESULT should be S_FALSE. BTW, NVDA, which can't detect S_FALSE because of its Python infrastructure, will check for the NULL.
Adding Brian to cc, since he may know more of the history of gail's text interface.
Here is mozilla bug https://bugzilla.mozilla.org/show_bug.cgi?id=452769 where we add automated tests for text interface. Also it would be great if you could find a time to look at it to check if our assumptions about text interface methods are correct.
Evan, as you mentioned, the specification contradicts itself in these two passages:
>"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start at or before the offset to the word start after the offset. The returned string will contain the word at the offset if the offset is inside a word and will contain the word before the offset if the offset is not inside a word."
>"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start before the word start before the offset to the word start before the offset. The returned string will contain the word before the offset if the offset is inside a word and will contain the word before the word before the offset if the offset is not inside a word."
I read them word by word and found a small problem, which I will explain with an example. If I have made a mistake, please point it out. Take the text "many kids here":
1. atk_text_get_text_at_offset()
1.1 First half of the doc: "If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start at or before the offset to the word start after the offset."
- Offset at 'k' in "kids". Situation: "from the word start at the offset to the word start after the offset". "word start at the offset" is 'k', "word start after the offset" is 'h'. Returned: "kids "
- Offset at 'd' in "kids". Situation: "from the word start before the offset to the word start after the offset". "word start before the offset" is 'k', "word start after the offset" is 'h'. Returned: "kids "
- Offset at the blank between "kids" and "here". Situation: "from the word start before the offset to the word start after the offset". "word start before the offset" is 'k', "word start after the offset" is 'h'. Returned: "kids "
1.2 Second half of the doc: "The returned string will contain the word at the offset if the offset is inside a word and will contain the word before the offset if the offset is not inside a word."
- Offset at 'k' or at 'd' in "kids". Situation: "contain the word at the offset if the offset is inside a word". Returned: "kids "
- Offset at the blank between "kids" and "here". Situation: "contain the word before the offset if the offset is not inside a word". Returned: "kids "
2. atk_text_get_text_before_offset()
2.1 First half of the doc: "If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start before the word start before the offset to the word start before the offset."
- Offset at 'd' in "kids". Situation: "FROM the word start before the word start before the offset TO the word start before the offset". "the word start before the word start before the offset" is 'm', "the word start before the offset" is 'k'. Returned: "many "
- Offset at the blank between "kids" and "here". Situation: the same; "the word start before the word start before the offset" is 'm', "the word start before the offset" is 'k'. Returned: "many "
2.2 Second half of the doc: "The returned string will contain the word before the offset if the offset is inside a word and will contain the word before the word before the offset if the offset is not inside a word."
- Offset at 'k' in "kids". Situation: "contain the word before the offset if the offset is inside a word". Returned: "many "
- Offset at 'd' in "kids". Situation: "contain the word before the offset if the offset is inside a word". Returned: "many "
- Offset at the blank between "kids" and "here". Situation: "contain the word before the word before the offset if the offset is not inside a word." Returned: "many"
2.3 Problems
If there is a problem, I think it is in 2.1 (the first half of the doc of atk_text_get_text_before_offset()). It does not say what happens when the offset is at a "word start". For example, take the offset at 'k' in "kids". According to the first half, the function returns the word before "many", but according to the second half it returns the word "many". So, in my opinion, the first half should be "If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start before the word start before or at the offset to the word start before or at the offset." (adding "or at", as in the other two functions, atk_text_get_text_at_offset() and atk_text_get_text_after_offset()).
In short, there is no problem in the doc of atk_text_get_text_at_offset(); the doc of atk_text_get_text_before_offset() is missing "or at". Am I right, Evan?
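The walkthrough above can be checked with a short sketch of the WORD_START rules. This models only the wording of the docs under the assumption that words are runs of non-whitespace; the function names and the `inclusive` flag are hypothetical, and returning an empty string when no earlier word start exists is an assumption of this sketch, not documented ATK behaviour.

```python
import re

def word_starts(text):
    return [m.start() for m in re.finditer(r"\S+", text)]

def at_offset(text, offset):
    # from the word start at or before the offset to the word start after it
    starts = word_starts(text)
    begin = max((s for s in starts if s <= offset), default=0)
    end = min((s for s in starts if s > offset), default=len(text))
    return text[begin:end]

def before_offset(text, offset, inclusive):
    # inclusive=False: "word start before the offset" (original wording)
    # inclusive=True:  "word start before or at the offset" (proposed wording)
    starts = [s for s in word_starts(text)
              if (s <= offset if inclusive else s < offset)]
    if len(starts) < 2:
        return ""  # assumption: nothing to return without two word starts
    return text[starts[-2]:starts[-1]]

text = "many kids here"
print([at_offset(text, o) for o in (5, 7, 9)])  # ['kids ', 'kids ', 'kids ']
print(repr(before_offset(text, 5, False)))      # ''
print(repr(before_offset(text, 5, True)))       # 'many '
```

Only the offset at 'k' (5) distinguishes the two wordings; at 'd' (7) and at the blank (9) both give "many ".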
Thanks for looking into this. You're right. However, the problem is not so simple. Your fix applies to this case, but there are other cases. There are three interfaces, atk_text_get_text_before/at/after_offset(), and each can take different boundary types as its argument. Please go over the API docs of all the interfaces and examine the different cases carefully. See whether we can make all the docs consistent, so that no two interfaces have conflicting definitions.
It would be really nice to define the start word, end word, inside word and outside word offsets. I'll try; please correct me if I'm wrong.
start word offset - the offset where a word starts, i.e. the offset of its first letter. In the 'hello world' example the start word offsets are 0 (letter 'h' of 'hello' word) and 6 (letter 'w' of 'world' word).
end word offset - the offset where a word ends, i.e. the offset immediately after its last letter. In the 'hello word' example the end word offsets are 5 (' ' blank after "hello" word) and 10 (the end of 'hello word' string).
inside word offset - an offset greater than or equal to the start offset but strictly less than the end offset of the same word. In the 'hello world' example the inside word offsets are [0, 4] ("hello" word) and [6, 9] ("world" word).
outside word offset - an offset greater than or equal to the end offset of one word and strictly less than the start offset of the next word. In the "hello world" example the outside word offsets are 5 (' ' (blank) symbol after 'hello' word) and 10 (the end of "hello word" string).
If this sounds correct, then I can see one doc error in the atk_text_get_text_after_offset() function. Let's consider the example "hello my friend" with offset 5 (' ' (blank) symbol after "hello" word).
"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from the word end at or after the offset to the next work end."
The result is " my", because the given offset is at a word end offset.
"The returned string will contain the word after the offset if the offset is inside a word and will contain the word after the word after the offset if the offset is not inside a word."
The result is " friend", because the offset is not inside a word.
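A minimal sketch of the proposed inside/outside definitions, applied to the "hello my friend" example. It assumes words are runs of non-whitespace and uses the half-open rule from the comment (start <= offset < end means inside); `classify` is a hypothetical name.

```python
import re

def classify(text, offset):
    """Return 'inside' if start <= offset < end for some word, else 'outside'."""
    for m in re.finditer(r"\S+", text):
        if m.start() <= offset < m.end():
            return "inside"
    return "outside"

text = "hello my friend"
outside = [o for o in range(len(text) + 1) if classify(text, o) == "outside"]
print(outside)  # [5, 8, 15]
```

With these definitions offset 5 is outside a word, which is what makes the two sentences of the atk_text_get_text_after_offset() doc disagree in the example.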
Thanks for your comments:
> end word offset - the offset where word was ended, i.e. offset immediately
> after of its last letter, in 'hello word' example end word offsets are 5 (' '
> blank after "hello" word) and 10 (the end of 'hello word' string).
In my opinion, the end word offsets in the example 'hello word' are 4 ('o') and 9 ('d').
>"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from
>the word end at or after the offset to the next work end."
>the result is " my" because the given offset at the word end offset.
So here the result is "friend", because the string is "from the word end after the offset to the next work end". Please correct me if I'm wrong. :-)
Thank you for the quick reply.
(In reply to comment #14)
> In my opinion, end word offset in example 'hello word' are 4-'o',9-'d'.
OK, I thought about that, but then the following phrase sounds strange to me: "If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from the word end at or after the offset to the next work end." If the word end offset is 'o' in 'hello world', then the "string from the word end" should start at 'o', because the "string from the word start" includes 'h' in 'hello world'. So we should get 'o worl' for the string 'hello world' at offset 4 ('o').
(In reply to comment #15) > If word end offset is 'o' in 'hello world' then "string from the word end" > should start from 'o' because "string from the word start" includes 'h' in > 'hello world'. So we should get 'o worl' for string 'hello world' at 4-'o' > offset. > I guess 'o world' if "to" in "to the next word end" is inclusive.
Yes, Alexander, you are right. Thinking about it again, I made a mistake.
So I think atk_text_get_text_after_offset() with ATK_TEXT_BOUNDARY_WORD_END is inconsistent. In "If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from the word end at or after the offset to the next work end." the word "at" should be removed to make it correspond with this sentence: "The returned string will contain the word after the offset if the offset is inside a word and will contain the word after the word after the offset if the offset is not inside a word." Then we get the word "world" in the string "hello my world" for offset 5 (' ', the blank after the 'hello' word) in both cases. Sounds right?
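A small sketch of the two readings for "hello my world" at offset 5. It assumes a word end is the offset just past a word's last character; `after_offset_word_end` and its `include_at` flag are hypothetical names used only to contrast "at or after" with "after", and the empty-string fallback is an assumption of the sketch.

```python
import re

def word_ends(text):
    return [m.end() for m in re.finditer(r"\S+", text)]

def after_offset_word_end(text, offset, include_at):
    # include_at=True:  "from the word end at or after the offset" (current doc)
    # include_at=False: "from the word end after the offset" (with "at" removed)
    ends = [e for e in word_ends(text)
            if (e >= offset if include_at else e > offset)]
    if len(ends) < 2:
        return ""  # assumption: nothing to return without two word ends
    return text[ends[0]:ends[1]]

text = "hello my world"
print(repr(after_offset_word_end(text, 5, True)))   # ' my'
print(repr(after_offset_word_end(text, 5, False)))  # ' world'
```

Only the second reading agrees with "the word after the word after the offset" when offset 5 is treated as not inside a word.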
(In reply to comment #13)
> It would be really nice to define start word, end word, inside word and outside
> word offsets. I'll try, please fix me if I'm wrong.
>
> start word offset - the offset where word starts, i.e. offset of its first
> letter, in 'hello world' example start word offsets are 0 (letter 'h' of
> 'hello' word) and 6 (letter 'w' of 'world' word)
>
> end word offset - the offset where word was ended, i.e. offset immediately
> after of its last letter, in 'hello word' example end word offsets are 5 (' '
> blank after "hello" word) and 10 (the end of 'hello word' string).
>
> inside word offset - the offset equals or bigger than start offset but strictly
> lesser than end offset of the same word. In the case of 'hello world' examples
> inside word offsets are [0, 4] ("hello" word), and [6, 9] ("world" word).
>
> outside word offset - the offset equals or bigger than end offset of one word
> and strictly lesser than the start offset of next word. In the case of "hello
> world" outside word offsets are 5 (' ' (blank) symbol after 'hello' word) and
> 10 (the end of "hello word" string).
>
> If this sounds correct then I can see one doc error of
> atk_text_get_text_after_offset () function. Let's consider example "hello my
> friend", offset is 5 (' ' (blank) symbol after "hello" word).
This would make the doc of atk_text_get_text_at_offset wrong too. It seems the docs of both atk_text_get_text_at_offset and atk_text_get_text_after_offset assume offset 5 of "hello world" is inside the word "hello", but the doc of atk_text_get_text_before_offset treats offset 5 as outside the word. So my suggestion is to change the doc of atk_text_get_text_before_offset. Change
"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from the word end before the word end at or before the offset to the word end at or before the offset."
to
"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from the word end before the word end before the offset to the word end before the offset."
Li, I'm still not sure about the definitions of the terms, which makes me read the doc as, for example, "if the start offset is one thing then ... or if the start offset is another thing then ...". Could you please give definitions of the terms? It would help a lot in reading the documentation.
Let's take "see a dog" as an example. Offset 0, 4, 6 are word starts. Offset 3, 5, 9 are word ends. Both word start and word end are "inside a word".
So every offset in this example is inside a word. An offset can be outside any word iff a sequence of more than one whitespace (non-word) character is encountered. This should also mean that the condition "if the offset is inside a word", widely used in the documentation, is always true except in the case of whitespace sequences. Sounds right?
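As a quick check of that reading, here is a sketch that treats both the word start and the word end offset as inside the word (start <= offset <= end) and lists the offsets left outside. Words are assumed to be runs of non-whitespace, and `outside_offsets` is a hypothetical helper.

```python
import re

def outside_offsets(text):
    """Offsets not covered by any word when the end offset counts as inside."""
    inside = set()
    for m in re.finditer(r"\S+", text):
        inside.update(range(m.start(), m.end() + 1))  # end offset included
    return [o for o in range(len(text) + 1) if o not in inside]

print(outside_offsets("see a dog"))   # []
print(outside_offsets("see  a dog"))  # [4]
```

With single blanks every offset is inside a word; an outside offset appears only once two or more whitespace characters are adjacent.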
Yes. The docs of both atk_text_get_text_at_offset and atk_text_get_text_after_offset assume so. We need to change the doc of atk_text_get_text_before_offset from
"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start before the word start before the offset to the word start before the offset. The returned string will contain the word before the offset if the offset is inside a word and will contain the word before the word before the offset if the offset is not inside a word. If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from the word end before the word end at or before the offset to the word end at or before the offset. The returned string will contain the word before the offset if the offset is inside a word or if the offset is not inside a word."
to
"If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string is from the word start before the word start before or at the offset to the word start before or at the offset. The returned string will contain the word before the offset if the offset is inside a word and will contain the word before the word before the offset if the offset is not inside a word. If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string is from the word end before the word end before the offset to the word end before the offset. The returned string will contain the word before the offset if the offset is inside a word or if the offset is not inside a word."
Created attachment 173693 [details] [review] Patch updating the docs according to last comments
Created attachment 173833 [details] [review] Patch updating the docs according to last comments Fix the commit message from the previous patch
Review of attachment 173833 [details] [review]: It would be great if you can change the doc for atk_text_get_text_after_offset at the same time.
what changes are needed for atk_text_get_text_after_offset?
Li: Ping - can you please answer comment 27?
Sorry, I mean before_offset. The changes are similar to at_offset:
 * If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string
 * is from the word start before the word start before the offset to
 * the word start before the offset.
to
 * If the boundary_type is ATK_TEXT_BOUNDARY_WORD_START the returned string
 * is from the word start before the word start at or before the offset to
 * the word start at or before the offset.
and
 * If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string
 * is from the word end before the word end at or before the offset to the
 * word end at or before the offset.
to
 * If the boundary_type is ATK_TEXT_BOUNDARY_WORD_END the returned string
 * is from the word end before the word end before the offset to the
 * word end before the offset.
Fernando: Time to update the patch according to Li's last comment?
(In reply to comment #30) > Fernando: Time to update the patch according to Li's last comment? Fer, ping?
Review of attachment 173833 [details] [review]: Committed.