Bug 643796 - Misleading docs: .NET string's .Length (and Java string's .length) counts characters, not bytes
Status: RESOLVED FIXED
Product: vala
Classification: Core
Component: Documentation
Assigned To: Vala maintainers
Reported: 2011-03-03 16:49 UTC by Roman Polach
Modified: 2013-12-11 23:07 UTC



Description Roman Polach 2011-03-03 16:49:37 UTC
The page http://live.gnome.org/Vala/StringSample
states that, starting with Vala 0.11, the string .length property
is the length in bytes (not characters / code points), and that
"This is in line with string APIs in many libraries (Java, .NET, Qt, Go)"
...

Well, this is not true, at least for .NET.
If you evaluate "ščř".Length in C#, you get
3, not 6. (Tested on .NET 2.0 on Windows.) So the comment should be changed.
The confusion is probably caused by incorrect .NET reference docs,
which also claim the length is in bytes, while in reality .Length is in "characters".
The other .NET string methods also count
characters (both in practice and in the reference docs), not bytes.

Personally, I would prefer Vala's .length to behave the same as .NET's .Length;
I have never needed to measure strings in bytes.
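
A quick way to check the counts discussed above is a minimal Java sketch like the following (Java rather than C#, since the bug title covers both). The class and method names are made up for illustration, and the string is written with Unicode escapes only so that the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not taken from the wiki page or any library.
public class LengthCheck {
    // Number of UTF-16 code units, which is what String.length() reports.
    public static int utf16Units(String s) {
        return s.length();
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // s-caron, c-caron, r-caron: the three letters from the report.
        String s = "\u0161\u010D\u0159";
        System.out.println(utf16Units(s)); // 3
        System.out.println(utf8Bytes(s));  // 6
    }
}
```

Each of the three letters fits in one 16-bit unit but needs two bytes in UTF-8, which is where the 3 versus 6 comes from.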
Comment 1 Roman Polach 2012-08-13 12:42:03 UTC
Any progress? Docs are still misleading.
Comment 2 Roman Polach 2013-06-08 22:04:29 UTC
Can the documentation be corrected?

Additional info:
* Java's String.length counts characters, not bytes
  http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#length%28%29
* .NET's String.Length counts characters, not bytes
  http://msdn.microsoft.com/en-us/library/system.string.length%28v=vs.71%29.aspx
Comment 4 Olav Vitters 2013-06-10 18:52:45 UTC
It is a wiki. Create an account, edit it, done.
Comment 5 Alexandre Franke 2013-12-11 22:32:33 UTC
This problem has been fixed.
Comment 6 Maciej (Matthew) Piechotka 2013-12-11 22:55:17 UTC
(In reply to comment #0)
> Well, this is not true at least for .NET
> If you try evaluate "ščř".Length in C#, you will get
> 3, not 6. (Tested on .NET 2.0 on Windows.) So the comment should be changed.
> The confusion is probably caused by incorrect .NET reference docs,
> which claims length in bytes also, but in real .Length is in "characters".
> In other .NET string methods, there is correctly counted for
> characters (both in real and in reference docs), not for bytes, too.
> 
> Personally I prefer to be Vala .length the same as .NET .Length,
> I have had never need to measure strings in bytes.

But AFAIK that's not true for the SMP (Supplementary Multilingual Plane). The problem is with the representation, and it's way more complicated:

 - Java and Windows used to represent strings as UCS-2 and characters as 16-bit integers. Since the introduction of the SMP, UCS-2 cannot represent all possible code points, so strings are now represented as UTF-16, and a character can take more than one... character. So the length there is the length of the array of 16-bit integers, which is not the same as the number of code points.
 - Vala uses a UTF-8 representation, where the length of a string is the length of the array of 8-bit integers, which is also not the same as the number of code points.
 - Finally, code points are not the same as glyphs or characters (http://utf8everywhere.org/). In general, a programmer should not care about what the string represents (exception: Pango developers), as Unicode is much more complicated than an 'array of characters'.

I'm going to update the page accordingly, but I'm posting it here as well to avoid an edit war.

% cat test.java 
public class test {
    public static void main(String[] args) {
        String s = "
Comment 7 Maciej (Matthew) Piechotka 2013-12-11 23:07:56 UTC
Wonderful - Bugzilla is not SMP secured. I've updated the page with the example.
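
Since the example above was cut off at the first character outside the BMP, here is a minimal sketch of the kind of comparison described in comment 6. It is not the truncated test.java: the class name and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) are assumptions, and the character is spelled as a surrogate-pair escape so that it survives tools that mangle such characters.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only; not the original test.java from comment 6.
public class SmpLength {
    // U+1D11E written as a UTF-16 surrogate pair (assumed sample character).
    public static final String CLEF = "\uD834\uDD1E";

    // Number of Unicode code points in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(CLEF.length());    // 2 UTF-16 code units
        System.out.println(codePoints(CLEF)); // 1 code point
        System.out.println(CLEF.getBytes(StandardCharsets.UTF_8).length); // 4 UTF-8 bytes
    }
}
```

So for a single code point outside the BMP, Java's length() is neither the byte count nor the code point count, which matches the first two points of comment 6.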