GNOME Bugzilla – Bug 643796
Misleading docs: .NET string's .Length (and Java string's .length) counts characters, not bytes
Last modified: 2013-12-11 23:07:56 UTC
On http://live.gnome.org/Vala/StringSample there is a sentence stating that, from Vala 0.11, the string .length property is the length in bytes (not characters / code points) and that "This is in line with string APIs in many libraries (Java, .NET, Qt, Go)" ... Well, this is not true, at least for .NET. If you evaluate "ščř".Length in C#, you get 3, not 6. (Tested on .NET 2.0 on Windows.) So the comment should be changed. The confusion is probably caused by the incorrect .NET reference docs, which also claim the length is in bytes, whereas in reality .Length is in "characters". The other .NET string methods also count in characters, not bytes (both in reality and in the reference docs). Personally, I would prefer Vala's .length to behave the same as .NET's .Length; I have never needed to measure strings in bytes.
Any progress? Docs are still misleading.
Can the documentation be corrected?

Additional info:
* Java's String.length() counts characters, not bytes: http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#length%28%29
* .NET's String.Length counts characters, not bytes: http://msdn.microsoft.com/en-us/library/system.string.length%28v=vs.71%29.aspx
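For anyone who wants to check the Java side themselves, here is a minimal sketch using the three-letter string from the original report. The class and method names are made up for illustration, and the string is written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class BmpLengthDemo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The string from the report: U+0161, U+010D, U+0159.
        String s = "\u0161\u010D\u0159";
        System.out.println(charCount(s));     // prints 3
        System.out.println(utf8ByteCount(s)); // prints 6
    }
}
```

Each of the three letters takes two bytes in UTF-8, which is where the 3 versus 6 difference comes from.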
It is a wiki. Create an account, edit it, done.
This problem has been fixed.
(In reply to comment #0)
> Well, this is not true, at least for .NET.
> If you evaluate "ščř".Length in C#, you get
> 3, not 6. (Tested on .NET 2.0 on Windows.) So the comment should be changed.
> The confusion is probably caused by the incorrect .NET reference docs,
> which also claim the length is in bytes, whereas in reality .Length is in "characters".
> The other .NET string methods also count in characters, not bytes
> (both in reality and in the reference docs).
>
> Personally, I would prefer Vala's .length to behave the same as .NET's .Length;
> I have never needed to measure strings in bytes.

But AFAIK that's not true for the SMP (Supplementary Multilingual Plane). The problem lies in the representation, and it's way more complicated:
- Java and Windows used to represent strings as UCS-2 and characters as 16-bit integers. Since the introduction of the SMP, UCS-2 can no longer represent all possible code points, so strings are now represented as UTF-16, and a character can take more than one... character. So the length there is the length of the array of 16-bit integers, which is not the same as the number of code points.
- Vala uses a UTF-8 representation, where the length of a string is the length of the array of 8-bit integers, which is also not the same as the number of code points.
- Finally, a code point is not the same as a glyph or a character (http://utf8everywhere.org/).

In general, a programmer should not care about what the string represents (exception: Pango developers), as Unicode is much more complicated than an 'array of characters'.

I'm going to update the page accordingly, but I'm posting it here as well to avoid an edit war.

% cat test.java
public class test {
    public static void main(String[] args) {
        String s = "
Wonderful - Bugzilla is not SMP-safe. I've updated the page with the example.
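Since the test program in the previous comment was cut off, the following is not the original code but a minimal Java sketch of the same point. It assumes U+1F600 as an arbitrary code point outside the Basic Multilingual Plane, built from its numeric value so that no supplementary character has to appear in the source. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class SmpLengthDemo {
    // Length as Java reports it: the number of UTF-16 code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding, the unit Vala's .length uses.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One code point outside the BMP (an assumed example value).
        String s = new String(Character.toChars(0x1F600));
        System.out.println(utf16Units(s)); // prints 2 (a surrogate pair)
        System.out.println(codePoints(s)); // prints 1
        System.out.println(utf8Bytes(s));  // prints 4
    }
}
```

The three numbers differ for a single code point, so none of the usual "length" values can be read as a character count in general.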