
digitalmars.D.bugs - [Issue 8384] New: Poor wchar/dchar* to string conversion support

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384

           Summary: Poor wchar/dchar* to string conversion support
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P3
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: thecybershadow gmail.com



05:23:29 PDT ---
import std.conv;
import std.string;

unittest
{
    static void test(T)(T lp)
    {
        assert(format("%s", lp) == "Hello, world!");
        assert(to!string(lp)    == "Hello, world!");
    }

    test("Hello, world!" .ptr);
    test("Hello, world!"w.ptr);
    test("Hello, world!"d.ptr);
}

wchar* conversion is commonly needed for Windows programming, as UTF-16 is the
native encoding for Unicode Windows API functions.
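As a rough illustration of that use case, a null-terminated UTF-16 pointer returned by a Windows API could be wrapped as below. `fromWStringz` is a hypothetical helper name, not a Phobos function; the sketch measures the string by hand and lets std.conv.to transcode the resulting slice.

```d
import std.conv : to;

/// Hypothetical helper (not part of Phobos): copies a null-terminated
/// UTF-16 string into a freshly allocated UTF-8 D string.
string fromWStringz(const(wchar)* p)
{
    if (p is null)
        return null;

    // Find the terminating zero code unit.
    size_t len = 0;
    while (p[len] != 0)
        ++len;

    // Slicing gives a const(wchar)[]; to!string transcodes it to UTF-8.
    return to!string(p[0 .. len]);
}

unittest
{
    // D string literals are followed by a zero terminator, so .ptr is safe here.
    assert(fromWStringz("Hello, world!"w.ptr) == "Hello, world!");
    assert(fromWStringz(null) is null);
}
```

The helper assumes the pointer really is null-terminated; it has no way to check.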

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 13 2012
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384


Jonathan M Davis <jmdavisProg gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg gmx.com



PDT ---
So, you expect %s on a pointer to give you the string that it points to? Why?
It's a pointer, not a string. It's going to convert the pointer. That works as
expected.

to!string should take a null-terminated string and give you a string, and it
does that. This code passes:

import std.conv;
import std.string;

void main()
{
    static void test(T)(T lp)
    {
        assert(to!string(lp), "hello world");
    }

    test("Hello, world!" .ptr);
    test("Hello, world!"w.ptr);
    test("Hello, world!"d.ptr);
}

So, I'd say that as far as your code goes, there's nothing wrong with it. It
functions exactly as expected. What _doesn't_ work is this:

import std.conv;
import std.string;

void main()
{
    static void test(T)(T lp)
    {
        assert(to!wstring(lp), "hello world");
        assert(to!dstring(lp), "hello world");
    }

    test("Hello, world!" .ptr);
    test("Hello, world!"w.ptr);
    test("Hello, world!"d.ptr);
}

The code doesn't even compile, giving these errors:

/home/jmdavis/dmd2/linux/bin/../../src/phobos/std/conv.d(819): Error:
incompatible types for
((cast(immutable(dchar)[])_adDupT(&_D12TypeInfo_Aya6__initZ,value[cast(ulong)0..strlen(cast(const(char*))value)]))
? (null)): 'immutable(dchar)[]' and 'string'
/home/jmdavis/dmd2/linux/bin/../../src/phobos/std/conv.d(268): Error: template
instance std.conv.toImpl!(immutable(dchar)[],immutable(char)*) error
instantiating
q.d(8):        instantiated from here: to!(immutable(char)*)
q.d(11):        instantiated from here: test!(immutable(char)*)
q.d(8): Error: template instance
std.conv.to!(immutable(dchar)[]).to!(immutable(char)*) error instantiating
q.d(11):        instantiated from here: test!(immutable(char)*)
q.d(11): Error: template instance q.main.test!(immutable(char)*) error
instantiating

Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




13:36:05 PDT ---
> to!string should take null-terminated string and give you a string, and it does
> that. This code passes:

Is it something that was fixed recently (within the last two weeks)? My
two-week-old dmd git build and dpaste still print offsets for wchar* and
dchar*: http://dpaste.dzfl.pl/26a2b284

> So, you expect %s on a pointer to give you the string that it points to? Why?

I think that, before all else, we should be looking for good reasons why
format("%s", foo) and to!string(foo) produce different results. Why should one
format the offset and the other do a conversion?

Second, I believe that the principle of least surprise makes this case rather
clear: if the programmer tries to print a char*, it's almost certain that they
want to print the null-terminated string at the given address, rather than a
hexadecimal representation of the address (which is rarely useful to the
end-user). Generic code is the only exception I can think of, in which case a
cast to void* is in order.

> What _doesn't_ work is this:

I think this should call the appropriate toUTFx functions from std.utf.
Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




13:42:17 PDT ---
> I think this should call the appropriate toUTFx functions from std.utf.

Sorry about that, misread your example. I guess, ideally, conversion between any
pair of {|w|d}{char*|string} should work.
Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




PDT ---
format and writeln are supposed to behave the same, because they both operate
on format strings (they _don't_ currently behave 100% the same, but format's
current implementation will be replaced with the new xformat's implementation
in a few months - after the "scheduled for deprecation" time period). to!string
is an entirely different beast.

std.conv.to is asking for an explicit conversion to string, whereas format and
writeln are converting according to the format specifiers, and %s indicates the
default string representation of the type. char*, wchar*, and dchar* are
pointers - _not_ strings - and should not be treated as strings. Pointers print
their address with %s. Making char*, wchar*, and dchar* print themselves as
strings would be inconsistent with other pointer types, and operating on char*,
wchar*, and dchar* should be discouraged, not encouraged.

to!string is treated differently, because you're asking for an explicit
conversion, and we _do_ need to be able to convert null-terminated strings to D
strings.

So, while I can see your point, I really don't think that having format or
writeln treat char*, wchar*, or dchar* as null-terminated strings is a good
idea. We should provide a means of converting them to D strings but not do
anything to encourage using them as-is without converting them.
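The distinction drawn above can be sketched as follows. The exact text that %s produces for a pointer is not asserted here, only that it is not the pointed-to string, which is the behavior described in this thread.

```d
import std.conv : to;
import std.format : format;

unittest
{
    immutable(char)* p = "Hello, world!".ptr;

    // Explicit conversion: reads up to the zero terminator and copies.
    assert(to!string(p) == "Hello, world!");

    // %s on a pointer formats the pointer value, not the text behind it.
    assert(format("%s", p) != "Hello, world!");
}
```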

Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384


Vladimir Panteleev <thecybershadow gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Poor wchar/dchar* to string |std.conv.to should allow
                   |conversion support          |conversion between any pair
                   |                            |of
                   |                            |string/wstring/dstring/char
                   |                            |*/wchar*/dchar*



14:25:36 PDT ---
OK, fair enough.

I've updated the enhancement request's title according to my previous comment.

Test:

-----------------------------------------------------------------------------

import std.conv;

void test1(T)(T lp)
{
    test2!( string)(lp);
    test2!(wstring)(lp);
    test2!(dstring)(lp);
    test2!(  char*)(lp);
    test2!( wchar*)(lp);
    test2!( dchar*)(lp);
}

void test2(D, S)(S lp)
{
    D dest = to!D(lp);
    assert(to!string(dest) == "Hello, world!");
}

unittest
{
    test1("Hello, world!" );
    test1("Hello, world!"w);
    test1("Hello, world!"d);
    test1("Hello, world!" .ptr);
    test1("Hello, world!"w.ptr);
    test1("Hello, world!"d.ptr);
}

Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




14:31:04 PDT ---
Oh, I forgot about constness.

I guess that raises the number of combinations to (2*3*3)^2 = 324.

Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384


klickverbot <code klickverbot.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code klickverbot.at



---
Hooray for using "static" foreach to conveniently enumerate all the cases to
test!
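A minimal sketch of that idiom, restricted to the string-to-string pairs so that it does not depend on the pointer conversions requested in this issue. It assumes a current Phobos, where the type list is spelled `AliasSeq` in std.meta (at the time of this thread it was `TypeTuple` in std.typetuple).

```d
import std.conv : to;
import std.meta : AliasSeq;

unittest
{
    alias Strings = AliasSeq!(string, wstring, dstring);

    // foreach over a type sequence is unrolled at compile time,
    // so the body is instantiated once per (Src, Dst) pair.
    foreach (Src; Strings)
    {
        foreach (Dst; Strings)
        {
            Src src = to!Src("Hello, world!");
            Dst dst = to!Dst(src);
            assert(to!string(dst) == "Hello, world!");
        }
    }
}
```

Adding const and immutable variants, or the pointer types, would only mean extending the type list.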

Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




PDT ---
> Hooray for using "static" foreach to conveniently enumerate all the cases to
> test!

Yeah. I do that all of the time when I have to test with multiple types
(especially with strings), and I always push for string-related tests to do
that when I see that someone is looking to submit code to Phobos for a function
that takes one or more strings as templated types, and their tests don't do
that. It's just one of those things that everyone who writes much in the way of
unit tests in D should learn and know about.
Jul 13 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




13:24:08 PDT ---
Another case of confusion due to format treating C strings as pointers:

http://stackoverflow.com/q/11975353/21501

I still think that the current behavior, regardless of how much it makes sense
from a design/consistency/orthogonality/etc. perspective, is simply not useful
and fails the principle of least surprise in most expected cases.

I strongly believe that we should either forbid passing char pointers to
format/writeln (and force the user to cast to void* or convert to a D string),
or print them as C null-terminated strings.

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




PDT ---
char* acts identically to the other pointer types, and I fully believe that it
should stay that way. We've pretty much removed all of the D features which
involved either treating a string as char* or a char* as a string (including
disallowing implicit conversion of string to const char*). The _only_ feature
that the language has which supports that is the fact that string literals have
a null character one past their end and will implicitly convert to const char*.

It would be a huge mistake IMHO to support doing _anything_ with character
pointers which treats them as strings without requiring an explicit conversion
of some kind. Anyone who continues to think of char* as being a string in D is
just asking for trouble. They need to learn to use strings correctly.

If you really want to use char* as a string in functions like format or
writeln, then simply either use to!string or ptr[0 .. strlen(ptr)].
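Both suggestions can be sketched like this, assuming `strlen` is taken from core.stdc.string and the pointer is null-terminated:

```d
import core.stdc.string : strlen;
import std.conv : to;
import std.format : format;

unittest
{
    const(char)* ptr = "Hello, world!".ptr;

    // to!string allocates a new D string from the C string.
    assert(format("%s", to!string(ptr)) == "Hello, world!");

    // Slicing does not copy; the slice is only valid while the memory is.
    assert(format("%s", ptr[0 .. strlen(ptr)]) == "Hello, world!");
}
```

The slice is cheaper but shares memory with the C side, so the copy is the safer choice when the C library may free or reuse the buffer.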

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




13:48:30 PDT ---
Sorry, I don't think that your categorical point of view is constructive. As
long as D interfaces with C libraries and programs, people will continue to
attempt to use C strings together with or in place of D strings, and issues like
the above will continue to appear.

How often would a typical D user want to print / format the address of a
character, versus the null-terminated string at that address?

> It would be a huge mistake IMHO to support doing _anything_ with character
> pointers which treats them as strings without requiring an explicit conversion
> of some kind.

Why would it be a mistake? What exactly do we lose by allowing writeln/format to
understand C strings?

> Anyone who continues to think of char* as being a string in D is
> just asking for trouble.

What kind of trouble?

> They need to learn to use strings correctly.

D printing an address when text was expected will sooner generate a "D sucks"
reaction than an "Oops, I need to learn to use strings correctly" one.

> If you really want to use char* as a string in functions like format or
> writeln, then simply either use to!string or ptr[0 .. strlen(ptr)].

That's not really simple, considering that some spots where that (verbose)
modification needs to be made would be discovered only late at runtime, and even
then the actual problem is not easy to identify (as seen in the SO question
above).
Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




13:56:00 PDT ---
I would like to stress a point that I hope will clarify my view of the
logic that writeln/format should use.

Printing/formatting memory addresses is extremely rarely useful!

Except for some dirty debugging, I can't imagine a case where the user expects
that passing a pointer to something to format would yield the hex
representation of that address.

I believe that printing a pointer as a hex address should be the fallback,
last-resort behavior, if there is no better representation for the said type.
(This also allows discussion of calling toString() on struct pointers.)

For the rare case that the user intends to actually print a pointer, this is
easily accomplished by a cast to size_t and using the appropriate hex format
specifier.
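The cast described above might look like the following sketch; `hexAddress` is a hypothetical helper name used only for illustration.

```d
import std.format : format;

/// Hypothetical helper: renders any pointer as a hexadecimal address.
string hexAddress(const(void)* p)
{
    // size_t is wide enough to hold a pointer on the target platform.
    return format("0x%x", cast(size_t) p);
}

unittest
{
    int x = 42;
    string s = hexAddress(&x);
    assert(s.length > 2 && s[0 .. 2] == "0x");
    assert(hexAddress(null) == "0x0");
}
```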

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




PDT ---
Anyone who does not understand that char* is _not_ a string will continue to
make mistakes like trying to concatenate a char* to a string (
http://stackoverflow.com/questions/11914070/why-can-i-not-concatenate-a-constchar-to-a-string-in-d
) or trying to pass a string directly to a C function. They will constantly run
into problems when dealing with strings. char* is _not_ a string and should not
be treated as such. Treating it as a string with something like writeln will
just help further the misconception that char* is a string and hinder people
learning and using D. D programmers need to understand the difference between
char* and string. char* should _not_ be treated as special, because it's not.

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




14:01:42 PDT ---
First of all, you are conflating ignorance of the difference between the two
string types with my arguments. Users who are aware that D has its own way of
handling strings can still make frustrating mistakes.

Second, getting unexpected output is not a good way to teach people about this.
Hence my earlier proposal to make writeln/format REJECT char pointer types, on
the basis that the user's intention is ambiguous (I don't think so personally,
but obviously that's just my opinion).

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




PDT ---
I'm saying that we shouldn't treat char* differently from int* just because
some newbies expect char* to act like a string. And if you know D, then you
know that char* is _not_ a string, and I don't see how you could expect it to
be treated as one. Either making char* act like a string or disallowing
printing it would make it act differently from other pointer types just to
appease the folks who mistakenly think that char* is a string.

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




14:08:44 PDT ---
Well, then how about removing the pointer-printing feature entirely, and issuing
a compile-time error on all pointer types?

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




14:12:50 PDT ---
> And if you know D, then you know that char* is _not_ a string,
> and I don't see how you could expect it to be treated as one.

I don't think this argument is valid, because it assumes that all D users are
always aware of the types they pass to writeln/format. In the SO case, the
argument is a function result, and the function's return type is not explicitly
written in the user's code.

People often expect the compiler to shout at them if they try to pass
incompatible types to a function. writeln/format accept char pointers, but
ultimately do something with them that in 99% of cases is simply not useful,
and send the user searching for their mistake all across the data flow.
Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384


Adam D. Ruppe <destructionator gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |destructionator gmail.com



14:34:54 PDT ---
I think rejecting might be the best option because if you treat it as a string,
what if it doesn't have a 0 terminator? That could easily happen if you pass it
a pointer to a D string.

I don't think that is technically un-@safe, but it could be a problem anyway to
get an unexpected crash because of it. At least with to!string(char*) you might
think about it for a minute and avoid the problem.


So on one hand, I think it should just work, but on the other hand the compile
time error might be the most sane.
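The missing-terminator concern can be made concrete with a small sketch. It relies on D string literals being followed by a zero byte; for a string built at runtime there may be no terminator at all, in which case the scan would read past the end of the allocation.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

unittest
{
    string full = "Hello, world!";
    string part = full[0 .. 5];       // "Hello"

    // The slice carries its own length, but its pointer does not:
    // a C-style scan runs on to the terminator of the whole literal.
    assert(part.length == 5);
    assert(strlen(part.ptr) == 13);

    // toStringz yields a pointer that is terminated at the slice's end.
    assert(strlen(toStringz(part)) == 5);
}
```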

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




PDT ---
> Well, then how about removing the pointer-printing feature entirely, and issuing
> a compile-time error on all pointer types?

So, you're suggesting that we remove a useful feature because newbies coming
from C/C++ keep mistakenly thinking that char* is a string?
Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




14:44:20 PDT ---
Your formulation misrepresents the weight of the scales. Please seriously
take into account the overall benefit to D of each decision. The feature is
nearly useless and does more harm than good, and "newbies coming
from C/C++" is, again, a misrepresentation, as discussed above. It is also
incorrect: someone used to, e.g., SDL bindings in another language may
expect the types returned by the binding to be compatible with the
language's native functionality.

Aug 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384


Andrej Mitrovic <andrej.mitrovich gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrej.mitrovich gmail.com



10:34:43 PST ---
*** Issue 6157 has been marked as a duplicate of this issue. ***

Jan 13 2013
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=8384




10:35:51 PST ---

> *** Issue 6157 has been marked as a duplicate of this issue. ***

FYI: http://d.puremagic.com/issues/show_bug.cgi?id=6157 has an experimental
implementation in the attachment (for conv.to), but I'm not an expert on things
Unicode.
Jan 13 2013