digitalmars.D.bugs - toUTFxx returns null references

Derek Parnell <derek psych.ward> writes:
I do not know if this is a bug or not.

The toUTF32(), toUTF16(), and toUTF8() routines return a null reference if
the input parameter is an empty string. I would have thought that they
should return an empty string instead. The only exception is when the
parameter is the same type as the return value's type, in that case they
return an empty string.

Example code...
<code>
import std.utf;
import std.stdio;

void main()
{
    char[] s = "";   // empty string literal
    dchar[] d;

    if (s is null)
        writefln("s is null");
    else
        writefln("s length is %d", s.length);

    d = toUTF32(s);  // convert the empty UTF-8 string to UTF-32
    if (d is null)
        writefln("d is null");
    else
        writefln("d length is %d", d.length);
}
</code>

-- 
Derek
Melbourne, Australia
10/02/2005 7:28:31 PM
Feb 10 2005
Anders F Björklund <afb algonet.se> writes:
Derek Parnell wrote:

 I do not know if this is a bug or not.
Confusing, but I don't really think it's a bug... (maybe the std routines need to be more similar to each other, either all returning null or all returning "", but both kinds of return value are OK to use, see below:)
 The toUTF32(), toUTF16(), and toUTF8() routines return a null reference if
 the input parameter is an empty string. I would have thought that they
 should return an empty string instead. The only exception is when the
 parameter is the same type as the return value's type, in that case they
 return an empty string.
I believe that in D, the empty string is "equal" to null. http://www.digitalmars.com/d/cppstrings.html:
  In D, an empty string is just null:
 
 	char[] str;
 	if (!str)
 		// string is empty
That works the same with either null or "", and this too:
 import std.stdio;
 
 void main()
 {
    char[] s = "";
    char[] d = null;
    
    writefln("s is %snull", s is null ? "" : "not ");
    writefln("s length is %d", s.length);
 
    writefln("d is %snull", d is null ? "" : "not ");
    writefln("d length is %d", d.length);
 } 
 s is not null
 s length is 0
 d is null
 d length is 0

Which means that whether it is "" or null, it will compare and behave the same for the rest of the code ?

Unless C is involved, since s.ptr will point to a '\0', but d.ptr is null. But that will work itself out in the toStringz process... (since D strings have to be zero-terminated for C anyway)

--anders
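
A minimal sketch of the toStringz point above, written for a current D2 compiler (so string stands in for the 2005-era char[]). The helper name cLength is made up for illustration; toStringz and strlen are the real std.string and C library routines. It only checks that both a null and an empty D string reach C as a valid zero-length string:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: length of the string as a C function would see it.
size_t cLength(const(char)[] s)
{
    // toStringz always hands back a zero-terminated C string.
    return strlen(toStringz(s));
}

void main()
{
    string emptyStr = "";    // non-null ptr, length 0
    string nullStr = null;   // null ptr, length 0

    assert(cLength(emptyStr) == 0);
    assert(cLength(nullStr) == 0);
    assert(cLength("abc") == 3);
}
```

So on the C side the two cases look identical; any difference is only visible from D, through .ptr or "is".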
Feb 10 2005
Derek <derek psych.ward> writes:
On Thu, 10 Feb 2005 09:59:39 +0100, Anders F Björklund wrote:

 Which means that whether it is "" or null, it'll compare and work the same
 to the rest of code ? Unless C is involved, since s.ptr will point to a
 '\0', but d.ptr points to null. But that will work itself out in the
 toStringz process... (since D strings have to be zero-terminated for C anyway)
I discovered this behaviour when I used an 'in' contract in a function ...

    bool foo(dchar[] X, dchar[] Y)
    in {
        assert( ! (X is null) );
        assert( ! (Y is null) );
    }
    body {
        . . .
    }

So what you seem to be saying is that I shouldn't bother checking that a dynamic array reference is null or not. Instead I can just check the length. However, I was trying to trap the case in which the function was called with an uninitialized array. Calling it with an empty array is ok though.

A fuller example in which it tripped me up ...
<code>
import std.utf;
import std.stdio;

bool foo(dchar[] X, dchar[] Y)
in {
    assert( ! (X is null) );
    assert( ! (Y is null) );
}
body {
    return true;
}

bool foo(char[] X, char[] Y)
{
    return foo( toUTF32(X), toUTF32(Y) );
}

bool foo(wchar[] X, wchar[] Y)
{
    return foo( toUTF32(X), toUTF32(Y) );
}

unittest {
    dchar[] a;
    dchar[] b;

    a = "";
    b = "123";

    debug(1) writefln("UT1");
    assert( foo(toUTF32(a), toUTF32(b) ) );
    debug(1) writefln("UT2");
    assert( foo(toUTF16(a), toUTF16(b) ) );
    debug(1) writefln("UT3");
    assert( foo(toUTF8(a), toUTF8(b) ) );
}
</code>

Compiled with
    dmd test -debug -unittest

-- 
Derek
Melbourne, Australia
Feb 10 2005
Anders F Björklund <afb algonet.se> writes:
Derek wrote:

 So what you seem to be saying is that I shouldn't bother checking that a
 dynamic array reference is null or not. Instead I can just check the
 length. However, I was trying to trap the case in which the function was
 called with an uninitialized array. Calling it with an empty array is ok
 though.
No, I don't think you should bother to distinguish between null and .length == 0.
     bool foo(dchar[] X, dchar[] Y)
     in {
         assert( ! (X is null) );
         assert( ! (Y is null) );
     }
     body {
         return true;
     }
The "recommended" way to write that is:

    assert(X);
    assert(Y);

Since D doesn't have booleans, that is ? (and since the long form is an eye-sore)

I'm not sure what you are trying to test, but:

    int main()
    {
        char[] nullstr = null;
        assert(nullstr == "");
        assert("" == nullstr);
        return 0;
    }

This test does not fail, and does not segfault... (like it would have done if nullstr was an Object:)

    int main()
    {
        Object nullobj = null;
        assert(nullobj == null); // <-- KABOOM
        assert(null == nullobj); // <-- KABOOM
        return 0;
    }

This second program *must* be rewritten with "is". (since using '==' with class objects calls opEquals)

Pointers are OK too:

    int main()
    {
        void* nullptr = null;
        assert(nullptr == null);
        assert(null == nullptr);
        return 0;
    }

To be on the safe side, one can use "is" always... (i.e. with pointers/objects, but *not* with strings since that only compares the references, like in Java)

--anders
Feb 10 2005
Derek <derek psych.ward> writes:
On Thu, 10 Feb 2005 14:10:47 +0100, Anders F Björklund wrote:

 
 I'm not sure what you are trying to test, but:
I'm testing for this ...

    void main()
    {
        char[] nullstr;
        assert( ! (nullstr is null) );
    }

Namely, the attempted use of a string that has never had any assignment yet. But as toUTFxx() returns something that looks like an unassigned string, I can't test for unassigned strings.

I still think that the toUTFxx() functions should return an empty string if an empty string was passed to them.

-- 
Derek
Melbourne, Australia
Feb 10 2005
Anders F Björklund <afb algonet.se> writes:
Derek wrote:

  I'm not sure what you are trying to test, but:
 I'm testing for this ...

     void main()
     {
         char[] nullstr;
         assert( ! (nullstr is null) );
     }

 Namely, the attempted use of a string that has never had any assignment yet.
There is nothing wrong with using an unassigned string, since all arrays (including char[]) default to length 0... You can pass "nullstr" to writefln and friends, just fine.
 But as toUTFxx() returns something that looks like an unassigned
 string, I can't test for unassigned strings. 
If you really, really, want to test for "unassigned" strings - use .ptr:

    void main()
    {
        char[] s = "";
        char[] d = null;
        assert(s.ptr != null);
        assert(d.ptr == null);
    }

This is because the ptr of a string literal will point to a '\0' char.
 I still think that the toUTFxx() functions should return an empty string if
 an empty string was passed to them.
There is *no* difference in D, between null and the empty string. They both have the length property set to 0, and they're equal. (not identical, though, so using "is" between them will fail)

--anders
Feb 10 2005
Derek <derek psych.ward> writes:
On Thu, 10 Feb 2005 15:21:21 +0100, Anders F Björklund wrote:

 There is *no* difference in D, between null and the empty string. They
 both have the length property set to 0, and they're equal. (not identical,
 though, so using "is" between them will fail)
Yes, I understand the technical aspect of this. However, I was attempting to help the coder trap mistakes; namely the use of unassigned strings. The assumption is that if a coder declares a string, and uses it before assigning anything to it, then it might mean that there is a logic error in the code. This is slightly different from the use of numbers, as most people expect that numbers are zero upon declaration.

But still, it's just a philosophy question really. Walter has decided for us that unassigned variables are an acceptable practice, whereas pedantic people such as myself think that they might indicate errors in coding. I will, no doubt, have to adjust to the given situation as it ain't gonna change ;-)

-- 
Derek
Melbourne, Australia
Feb 10 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Thu, 10 Feb 2005 15:21:21 +0100, Anders F Björklund <afb algonet.se> wrote:
 There is *no* difference in D, between null and the empty string.
There is a difference, internally, but D treats them the same. Which is probably what you meant, but I'm just being thorough. :)

A null string has ptr == null, an empty string has ptr == "".

In some instances it is crucial to be able to tell these cases apart:
  1- value does not exist (null)
  2- value is blank       (empty string)

To check for case 1, we can go "if (s is null)"
To check for case 2, we can go "if (s.length == 0)"

eg. Simple example where it is important:

User enters data into a text field (A) on a web page, leaves text field (B) blank, the code is saving the values of these two fields somewhere i.e. in a database containing 3 settings A, B and C.

The presence of the empty field (B) on the page indicates any previous value for that setting should be overwritten with the empty value.

The absence of the field (C) indicates that any previous value of the setting should not be overwritten but kept.

Regan
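
The A/B/C scenario above could be sketched roughly like this. It is written for a current D2 compiler (string rather than char[]), the function name applyField is invented for illustration, and it leans on the literal "" having a non-null ptr, which is exactly the behaviour this thread says is not promised by the spec. Note that the null test has to come first, because a null string also has length 0:

```d
import std.stdio : writeln;

// Hypothetical helper: decide what to store for one setting.
// submitted is null  -> field absent from the form, keep the old value.
// submitted is ""    -> field present but blank, overwrite with "".
string applyField(string stored, string submitted)
{
    // Check for null before anything else: null.length is 0 as well.
    if (submitted is null)
        return stored;
    return submitted;
}

void main()
{
    string blank = "";      // literal: non-null ptr, length 0
    string absent = null;   // null ptr, length 0

    assert(applyField("old A", "new A") == "new A"); // field A: filled in
    assert(applyField("old B", blank) == "");        // field B: blank, overwrite
    assert(applyField("old C", absent) == "old C");  // field C: absent, keep

    writeln("all three cases handled");
}
```

If a library routine in between turns "" into null, field B silently becomes indistinguishable from field C, which is the practical cost of the behaviour reported at the top of the thread.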
Feb 10 2005
Derek Parnell <derek psych.ward> writes:
On Fri, 11 Feb 2005 10:05:06 +1300, Regan Heath wrote:

 On Thu, 10 Feb 2005 15:21:21 +0100, Anders F Björklund <afb algonet.se> wrote:
  There is *no* difference in D, between null and the empty string.
 There is a difference, internally, but D treats them the same. Which is
 probably what you meant, but I'm just being thorough. :)

 A null string has ptr == null, an empty string has ptr == "".

 In some instances it is crucial to be able to tell these cases apart:
   1- value does not exist (null)
   2- value is blank       (empty string)

Exactly! Well said.

-- 
Derek
Melbourne, Australia
11/02/2005 9:49:04 AM
Feb 10 2005
Anders F Björklund <afb algonet.se> writes:
Derek Parnell wrote:

  There is *no* difference in D, between null and the empty string.
 There is a difference, internally, but D treats them the same. Which is
 probably what you meant, but I'm just being thorough. :)

More or less, yes. But that's more of an Implementation Quirk™.

The D specification explicitly says:

http://www.digitalmars.com/d/arrays.html
 Array Initialization
 
     * Dynamic arrays are initialized to having 0 elements.
http://www.digitalmars.com/d/cppstrings.html
 Checking For Empty Strings

  In D, an empty string is just null:
 
 	char[] str;
 	if (!str)
 		// string is empty
But in practice, they do differ - in the ptr to the '\0' (for C).
(but both have a length property of 0, though, as mentioned earlier)

And when you copy the char[], this ptr setting follows as well...
This means that there is a way to trace if it has been set to "".
  A null string has ptr == null, an empty string has ptr == "".

  In some instances it is crucial to be able to tell these cases apart:
    1- value does not exist (null)
    2- value is blank       (empty string)
 Exactly! Well said.
But strings in D are not objects or pointers, they are arrays...
And arrays are initialized to have the length zero, in the spec.
Thus, that makes them similar to e.g. an integer that is initialized
with a zero ? You will have to check if they are modified in some
other way. Or just rely on the "string.ptr" value, since that will
work as long as D supports calling C functions with string literals...

But technically, there is no difference in D between "" and null.
Which is probably why the standard library mixes them freely ?

To recap:

""
     .length = 0
     .ptr = &'\0'

null
     .length = 0
     .ptr = null

 void main()
 {
   char[] emptystr = "";
   char[] nullstr = null;
 
   assert(emptystr == nullstr);
   assert(!(emptystr is nullstr));
 
   assert(emptystr.length == nullstr.length);
   assert(!(emptystr.ptr is nullstr.ptr));
 }
And the D standard library should probably be "fixed" to return null for null and "" for "" anyway, even if it's not in the spec ?

Care to write a full unittest for it ? (at least for all of std.utf)

--anders
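
One way to picture the "null for null and "" for """ rule proposed above is a thin wrapper around the library call. This is only a sketch for a current D2 compiler: toUTF32Preserving is a made-up name, not part of std.utf, and it makes no claim about what toUTF32 itself returns for empty input in any given release:

```d
import std.utf : toUTF32;

// Hypothetical wrapper: null in gives null out, and an empty but
// non-null string in gives an empty but non-null string out.
dstring toUTF32Preserving(const(char)[] s)
{
    if (s is null)
        return null;
    if (s.length == 0)
        return ""d;      // dstring literal: length 0, non-null ptr
    return toUTF32(s);   // non-empty input: defer to the library
}

void main()
{
    string emptyStr = "";
    string nullStr = null;

    assert(toUTF32Preserving(nullStr) is null);
    assert(toUTF32Preserving(emptyStr) !is null);
    assert(toUTF32Preserving(emptyStr).length == 0);
    assert(toUTF32Preserving("abc") == "abc"d);
}
```

The asserts in main are roughly the shape a unittest for each std.utf conversion would take, once it is decided per function which result is wanted.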
Feb 11 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Fri, 11 Feb 2005 17:54:45 +0100, Anders F Björklund <afb algonet.se> wrote:
 Derek Parnell wrote:

   There is *no* difference in D, between null and the empty string.
  There is a difference, internally, but D treats them the same. Which is
  probably what you meant, but I'm just being thorough. :)
 More or less, yes. But that's more of an Implementation Quirk™.
Which worries me because I believe there is a real need to tell them apart. So, I ask that this behaviour be specified, or another method to achieve the same thing be specified.
 The D specification explicitly says:

 http://www.digitalmars.com/d/arrays.html
 Array Initialization
      * Dynamic arrays are initialized to having 0 elements.
 http://www.digitalmars.com/d/cppstrings.html
 Checking For Empty Strings

  In D, an empty string is just null:
  	char[] str;
 	if (!str)
 		// string is empty
 But in practice, they do differ - in the ptr to the '\0' (for C).
 (but both have a length property of 0, though, as mentioned earlier)
Sure, exactly what I said.
 And when you copy the char[], this ptr setting follows as well...
 This means that there is a way to trace if it has been set to "".
Yep, I want this behaviour to be specified. (or some other method to achieve what I want)
 A null string has ptr == null, an empty string has ptr == "".

 In some instances it is crucial to be able to tell these cases apart:
  1- value does not exist (null)
  2- value is blank       (empty string)
  Exactly! Well said.
 But strings in D are not objects or pointers, they are arrays...
And arrays appear to be value types containing a 'reference'. As in, arrays themselves cannot be null, but the reference in them can be.
 And arrays are initialized to have the length zero, in the spec.
 Thus, that makes them similar to e.g. an integer that is initialized
 with a zero ?
I agree arrays are value types, as integers are. For a null string, the length is initialised to 0. For a "" string the length is initialised to the length of "", which happens to be 0. For a "abc" string the length is initialised to the length of "abc", which happens to be 3.
 You will have to check if they are modified in some
 other way. Or just rely on the "string.ptr" value, since that will
 work as long as D supports calling C functions with string literals...
In C strings are pointers, and pointers can be null or point to a piece of memory which may contain a \0, so, in C there is a way to tell the 2 cases apart. In D arrays are value types containing a pointer/reference and a length.

I firmly believe that losing this ability for char[] would become a weakness in D, it would force me and others to resort to other methods to achieve it. I like the current behaviour, I just want to see that it doesn't change.
 But technically, there is no difference in D between "" and null.
 Which is probably why the standard library mixes them freely ?

 To recap:

 ""
      .length = 0
      .ptr = &'\0'

 null
      .length = 0
      .ptr = null
Yep, like I said.
 void main()
 {
   char[] emptystr = "";
   char[] nullstr = null;

   assert(emptystr == nullstr);
   assert(!(emptystr is nullstr));

   assert(emptystr.length == nullstr.length);
   assert(!(emptystr.ptr is nullstr.ptr));
 }
 And the D standard library should probably be "fixed" to return null for null and "" for "" anyway, even if it's not in the spec ?
Definitely. I've been saying null and "" can mean different things depending on the context, you seem to be agreeing, why are we arguing? :)
 Care to write a full unittest for it ? (at least for all of std.utf)
First we have to decide (on a per function basis) whether returning null or "" makes sense, or if indeed both make sense (for different reasons of course) i.e.

  null == failed, cannot convert, malformed?
  ""   == success, result really is ""

Regan
Feb 13 2005