digitalmars.D - First Impressions
- Geoff Carlton (63/63) Sep 28 2006 Hi,
- Jarrett Billingsley (42/105) Sep 28 2006 They're more just syntactic sugar than member functions. You can, in fa...
- Geoff Carlton (32/50) Sep 28 2006 I'm a fan of utf-8 so it would seem natural to have string, wstring, and...
- Lutger (10/39) Sep 28 2006 Yes, I was too. But although it looks not very nice at first sight, D's
- Derek Parnell (25/47) Sep 28 2006 Yes. It isn't very 'nice' for a modern language. Though as you note belo...
- Walter Bright (7/14) Sep 29 2006 On the other hand, the reasons other languages have strings as classes
- Anders F Björklund (3/5) Sep 29 2006 A string alias might still be, just as the bool alias was.
- Derek Parnell (11/27) Sep 29 2006 And is it there yet? I mean, given that a string is just a lump of text,...
- Georg Wrede (5/10) Sep 29 2006 The string you're talking about is not just a lump of text.
- David Medlock (4/35) Sep 29 2006 I just quickly want to interject my wish for aliases for the basic
- Walter Bright (4/10) Sep 29 2006 I believe it's there. I don't think std::string or java.lang.String have...
- Matthias Spycher (5/20) Sep 29 2006 Immutability and some guarantees about the validity of the state of an
- Derek Parnell (16/28) Sep 29 2006 I'm pretty sure that the phobos routines for search and replace only wor...
- Georg Wrede (20/25) Sep 29 2006 I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61)
- Walter Bright (25/37) Sep 29 2006 That cannot happen, because multibyte sequences *always* have the high
- Derek Parnell (26/66) Sep 30 2006 Thanks. That has cleared up some misconceptions and pre-conceptions tha...
- Walter Bright (21/52) Sep 30 2006 I certainly hope this thread doesn't degenerate into that like some of
- Derek Parnell (6/9) Oct 01 2006 Oh, I threw that away ages ago ;-)
- Lars Ivar Igesund (6/12) Oct 01 2006 Nope, it just looks correct.
- Lionello Lunesu (14/23) Oct 02 2006 I don't think renaming toString to toUTF gets rid of any confusion.
- Georg Wrede (3/5) Oct 01 2006 Let's just say it would be a first step in lessening the confusion _we_
- Kevin Bealer (15/22) Oct 02 2006 I would kind of agree with this, but I think it's a two-edged knife.
- Georg Wrede (24/54) Oct 03 2006 Well, with string, folks would at least be inclined to search for the
- Anders F Björklund (17/32) Oct 03 2006 Which could be a *good* thing, since it would stop users from hurting
- Bruno Medeiros (9/26) Oct 01 2006 Precisely! And even if such conceptual difference didn't exist, or is
- Geoff Carlton (7/13) Oct 01 2006 There are also many cases where char arrays are not strings:
- Thomas Kuehne (31/44) Sep 30 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Thomas Kuehne (11/30) Sep 30 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Sean Kelly (9/12) Sep 30 2006 The wording could be more explicit, but I think the current
- Geoff Carlton (20/35) Sep 29 2006 Hi,
- David Medlock (12/30) Sep 29 2006 The reason *I* want it is _alias_ does not respect the private:
- Anders F Björklund (25/40) Sep 29 2006 Problem of "char[]" is both that it hides the fact that "char" is UTF-8
- Anders F Björklund (7/14) Sep 29 2006 Except the other way around, of course!
- Lionello Lunesu (10/10) Sep 29 2006 I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish
- Anders F Björklund (3/7) Sep 29 2006 And probably only for ASCII string constants, at that...
- Lionello Lunesu (15/23) Sep 29 2006 Right, that too!
- Georg Wrede (21/32) Sep 29 2006 Using char[] as long as you don't know about UTF seems to work pretty
- Chad J (24/64) Sep 29 2006 haha too true.
- Johan Granberg (8/23) Sep 29 2006 I completely agree, char should hold a character independently of
- BCS (15/23) Sep 29 2006 Why isn't performance a problem?
- Chad J (40/71) Sep 29 2006 I will go ahead and say that the current state of char[] is incorrect.
- Anders F Björklund (9/29) Sep 29 2006 But D already uses Unicode for all strings, encoded as UTF ?
- Chad J (38/78) Sep 29 2006 Probably 7-bit. Anything where the size of one character is ALWAYS one
- Anders F Björklund (24/38) Sep 29 2006 It's mostly about looking out for the UTF "control" characters, which is...
- Georg Wrede (4/8) Sep 29 2006 Problem is, using 16-bit you sort-of get away with _almost_ all of it.
- Chad J (35/75) Sep 30 2006 So it seems to me the problem is that those 2 bytes are both 2
- Anders F Björklund (30/57) Oct 01 2006 Code point is the closest thing to a "character", although it might take...
- Anders F Björklund (29/40) Sep 29 2006 This code probably does not work as you think it does...
- Chad J (15/67) Sep 29 2006 ah. And yep the i++ was a typo (oops).
- Georg Wrede (26/31) Sep 29 2006 Wrong.
- Chad J (13/19) Sep 29 2006 But this is what I'm talking about... you can't slice them or index
- Georg Wrede (92/116) Sep 29 2006 Yes. That's why I talked about you falling down once you realise Daddy's...
- Walter Bright (16/31) Sep 29 2006 Yes, you do have to be aware of it being UTF, just like in C you have to...
- Sean Kelly (9/20) Sep 30 2006 As long as you're aware that you are working in UTF-8 I think
- Walter Bright (10/21) Sep 30 2006 It's so broken that there are proposals to reengineer core C++ to add
- Sean Kelly (12/36) Oct 01 2006 True. And I hinted at this above.
- Walter Bright (6/22) Oct 01 2006 That's why the proposals to fix it are rewriting some of the *core* C++
- Johan Granberg (7/18) Sep 29 2006 But is this not a needless source of confusion, that could be eliminated...
- Georg Wrede (26/48) Sep 29 2006 You might begin with pasting this and compiling it:
- Derek Parnell (10/12) Sep 29 2006 The Build program does lots of 'tampering'. I had to rewrite many standa...
- Georg Wrede (7/17) Sep 29 2006 Yes, case insensitive compares are difficult if you want to cater for
- Geoff Carlton (21/29) Sep 29 2006 I agree, but I disagree that there is a problem, or that utf-8 is a bad
- Georg Wrede (2/38) Sep 29 2006 Yes.
- Johan Granberg (6/14) Sep 29 2006 How should we chop strings on character boundaries?
- Walter Bright (2/3) Sep 30 2006 std.utf.toUTFindex() should do the trick.
- Johan Granberg (4/23) Sep 29 2006 I don't think any performance hit will be so big that it causes problems...
- BCS (14/25) Oct 01 2006 If you will note, I said nothing about the size of the hit. While some
- Anders F Björklund (8/15) Oct 01 2006 We have that already:
- BCS (19/44) Oct 01 2006 ubyte is an 8 bit unsigned number not a character encoding.
- Georg Wrede (23/27) Oct 01 2006 Then all Americans would use that instead of UTF-8.
- Anders F Björklund (19/23) Oct 01 2006 Right, I actually meant ubyte[] but void[] might have been
- BCS (18/22) Oct 02 2006 The more I think about it the worse this get.
Hi, I'm a C++ user who's just tried D and I wanted to give my first impressions. I can't really justify moving any of my codebase over to D, so I wrote a quick tool to parse a dictionary file and make a histogram - a bit like the wc demo in the dmd package.

1.) I was a bit underwhelmed by the syntax of char[]. I've used lua which also has strings, functions and maps as basic primitives, so going back to array notation seems a bit low level. Also, char[][] is not the best start in the main() declaration. Is it a 2D array, an array of arrays? Then there is the char[][char[]]. What a mouthful for a simple map! Well, now I need to find elements.. I'd use std::string's find() here, but the wc example has all array operations. Even isalpha is done as 'a', 'z' comparisons on an indexed array. Back to low level C stuff. A simple alias of char[] to string would simplify the first glance code.

string x; // yep, a string
main (string[]) // an array of strings
string[string] m; // map of string to string

I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()? If so, it means that all the string functionality can be added and then used naturally as member functions on this "string" (which is really just the plain old char[] in disguise). This is a small thing, but I think it would help in terms of the mindset of strings being a first class primitive, and clear up simple "hello world" examples at the same time. Put simply, every modern language has a first class string primitive type, except D - at least in terms of nomenclature.

2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does? I was blown away the first time I realised how simple it was for custom iteration in lua. In short, you write a function that returns a delegate (a closure?) that itself returns arguments, terminating in nil. e.g.
for r in rooms_in_level(lvl) // custom function

As lua can handle multiple return arguments, it can also do a key,value sort of thing that D can do. What a wonderful way of allowing any sort of iteration. It beats pages of code in C++ to write an iterator that can go forwards, or one that can go backwards (wow, the power of C++!). C++09 still isn't much of an improvement here, it only sugars the awful iterator syntax.

3.) From the newsgroups, it seems like 'auto' as local raii and 'auto' as automatic type deduction are still linked to the one keyword. Well in lua, 'local' is pretty intuitive for locally scoped variables. Also 'auto' will soon mean automatic type deduction in C++. So those make sense to me personally. Looks like this has been discussed to death, but that's my 2c.

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source.

Well, as first impressions go, I was pleased by D, and am interested to see how well it fares as time goes on. It's just a shame that all the tools/library/IDE is all in C++!

Thanks,
Geoff
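For what it's worth, the alias and the array-call sugar discussed in this post can be sketched in a few lines of D (the `firstSpace` helper below is hypothetical, purely to illustrate the call syntax; `std.string.find` is the Phobos search routine of the time):

```d
import std.string;

alias char[] string;  // the proposed alias - purely cosmetic, no new type

// a free function whose first parameter is an array...
int firstSpace(string s)
{
    return std.string.find(s, ' ');  // index of the first space, or -1
}

void main()
{
    string greeting = "hello world";
    // ...may also be called with member syntax on that array:
    assert(greeting.firstSpace() == 5);

    string[string] m;  // map of string to string
    m["hello"] = "world";
    assert(m["hello"] == "world");
}
```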
Sep 28 2006
"Geoff Carlton" <gcarlton iinet.net.au> wrote in message news:efhp1r$1r9s$1 digitaldaemon.com...Hi, I'm a C++ user who's just tried D and I wanted to give my first impressions. I can't really justify moving any of my codebase over to D, so I wrote a quick tool to parse a dictionary file and make a histogram - a bit like the wc demo in the dmd package. 1.) I was a bit underwhelmed by the syntax of char[]. I've used lua which also has strings,functions and maps as basic primitives, so going back to array notation seems a bit low level. Also, char[][] is not the best start in the main() declaration. Is it a 2D array, an array of arrays? Then there is the char[][char[]]. What a mouthful for a simple map! Well, now I need to find elements.. I'd use std::string's find() here, but the wc example has all array operations. Even isalpha is done as 'a', 'z' comparisons on an indexed array. Back to low level C stuff. A simple alias of char[] to string would simplify the first glance code. string x; // yep, a string main (string[]) // an array of strings string[string] m; // map of string to string I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()? If so, it means that all the string functionality can be added and then used naturally as member functions on this "string" (which is really just the plain old char[] in disguise).They're more just syntactic sugar than member functions. You can, in fact do this with any array type, e.g void foo(int[] arr) { ... } int[] x; x = [4, 5, 6, 7]; // bug in the new array literals ;) x.foo();This is a small thing, but I think it would help in terms of the mindset of strings being a first class primitive, and clear up simple "hello world" examples at the same time. Put simply, every modern language has a first class string primitive type, except D - at least in terms of nomenclature.It does look nicer. 
I suppose the counterargument would be that having an alias char[] string might not be portable -- what about wchar[] and dchar[]? Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to already use UTF-16 as the default string type)? I've never been too incredibly put off by char[], but of course other people have other opinions.2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does? I was blown away the first time I realised how simple it was for custom iteration in lua. In short, you write a function that returns a delegate (a closure?) that itself returns arguments, terminating in nil. e.g. for r in rooms_in_level(lvl) // custom function As lua can handle multiple return arguments, it can also do a key,value sort of thing that D can do. What a wonderful way of allowing any sort of iteration.

Unfortunately the way Lua does "foreach" iteration is exactly the inverse of how D does it. Lua gets an iterator and keeps calling it in the loop; D gives the loop (the entire body!) to the iterator function, which runs the loop. So it's something like a "true" iterator as described in the Lua book:

level.each(function(r) print("Room: " .. r) end)

D does it this way I guess to make it easier to write iterators. Since you're limited to one return value, it's simpler to make the iterator a callback and pass the indices into the foreach body than it is to make the iterator return multiple parameters through "out" parameters. That, and it's easier to keep track of state with a callback iterator. (I'm going through which to use in a Lua-like language that I'm designing too!)It beats pages of code in C++ to write an iterator that can go forwards, or one that can go backwards (wow, the power of C++!). C++09 still isn't much of an improvement here, it only sugars the awful iterator syntax.

Weeeeeeeee! C++

3.) From the newsgroups, it seems like 'auto' as local raii and 'auto' as automatic type deduction are still linked to the one keyword. Well in lua, 'local' is pretty intuitive for locally scoped variables. Also 'auto' will soon mean automatic type deduction in C++. So those make sense to me personally. Looks like this has been discussed to death, but that's my 2c.

I don't even wanna get into it ;) _Technically_ speaking, auto isn't really "used" in type deduction; instead, the syntax is just <storage class> <identifier>, skipping the type. Since the default storage class is auto, it looks like auto is being used to determine the type, but it also works like e.g. static x = 5; I think a better way to do it would be to have a special "stand-in" type, such as

var x = 5;
static var y = 20;
auto var f = new Foo(); // this will be RAII and automatically type-determined

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source. Well, as first impressions go, I was pleased by D, and am interested to see how well it fares as time goes on. It's just a shame that all the tools/library/IDE is all in C++! Thanks, Geoff
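The "D hands the loop body to the iterator" model described above is spelled opApply in D. A minimal sketch, assuming a hypothetical Level class holding room names:

```d
class Level
{
    char[][] rooms;

    // foreach compiles its body into a delegate and passes it here;
    // a non-zero return from the delegate means "break out of the loop"
    int opApply(int delegate(inout char[] room) dg)
    {
        int result = 0;
        for (size_t i = 0; i < rooms.length; i++)
        {
            result = dg(rooms[i]);
            if (result)
                break;
        }
        return result;
    }
}

void listRooms(Level lvl)
{
    foreach (char[] r; lvl)  // calls lvl.opApply with the body as the delegate
    {
        // ... use r ...
    }
}
```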
Sep 28 2006
Jarrett Billingsley wrote:It does look nicer. I suppose the counterargument would be that having an alias char[] string might not be portable -- what about wchar[] and dchar[]? Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to already use UTF-16 as the default string type)?I'm a fan of utf-8 so it would seem natural to have string, wstring, and dstring. IMO utf-16 is backward thinking, and has the dubious property of being mostly fixed width, except when its not. And even utf-32 isn't one-to-one in terms of glyphs rendered on screen. Anyway, as a low level programmer, I appreciate that its all based on very powerful and flexible arrays. But as a high level programmer, I don't want to be reminded of that fact every time I need a to use a string.Unfortunately the way Lua does "foreach" iteration is exactly the inverse of how D does it. Lua gets an iterator and keeps calling it in the loop; D gives the loop (the entire body!) to the iterator function, which runs the loop. So it's something like a "true" iterator as described in the Lua book:Ok, although the advantage of the first method is that you write the iterator once, and then its easy to use for all clients. Wrapping up the loop in a function is just backward, although it is much more palatable in the inline format than a clunky out of line functor or using _1, _2 hackery magic. As an example, I love the fact that I can do this in lua: for r1 in rooms_in_level(lvl) do for r2 in rooms_in_level(lvl) do for c in connections(r1, r2) do print("got connection " .. c) end end end I wrote Floyd's algorithm in lua in the time it would take me in C++ to not even finish thinking about what structures, classes, vectors I would use. I imagine D would be as easy, although not as nice as the above style.D does it this way I guess to make it easier to write iterators. 
Since you're limited to one return value, it's simpler to make the iterator a callback and pass the indices into the foreach body than it is to make the iterator return multiple parameters through "out" parameters. That, and it's easier to keep track of state with a callback iterator. (I'm going through which to use in a Lua-like language that I'm designing too!)

Multiple returns would be tricky. C++ looks like it's getting there with std::tuple and std::tie, but as always the downside is the sheer clunkiness. As heterogeneous arrays aren't in the core language for either C++ or D, it's tricky to come up with a clean solution. Designing a language would be great fun, and I think lua has done a great many things right. Not sure about the typeless state though, it gets messy with large projects. Still, no templates (or rather, every function is like a template).
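The key,value pairing mentioned above already falls out of D's foreach over associative arrays, with two induction variables - a small sketch (the map contents are invented for illustration):

```d
import std.stdio;

void main()
{
    char[][char[]] exits;  // map of room name to direction
    exits["kitchen"] = "north";
    exits["cellar"]  = "down";

    // two loop variables give the key,value iteration Lua users expect
    foreach (char[] room, char[] dir; exits)
        writefln("%s -> %s", room, dir);
}
```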
Sep 28 2006
Geoff Carlton wrote:Hi, I'm a C++ user who's just tried D and I wanted to give my first impressions. I can't really justify moving any of my codebase over to D, so I wrote a quick tool to parse a dictionary file and make a histogram - a bit like the wc demo in the dmd package.

You'll sure be pleased with D coming from C++.

1.) I was a bit underwhelmed by the syntax of char[]...

Yes, I was too. But although it looks not very nice at first sight, D's arrays are nothing like C++ arrays. Strings are first class, array notation is consistent and getting used to them together with concatenation and slicing operators, I found they are quite powerful yet simple to use.

2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does? I was blown away the first time I realised how simple it was for custom iteration in lua. In short, you write a function that returns a delegate (a closure?) that itself returns arguments, terminating in nil.

You can enable a class to use the foreach statement. http://www.digitalmars.com/d/statement.html#foreach

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source.

Check out build: http://www.dsource.org/projects/build

Well, as first impressions go, I was pleased by D, and am interested to see how well it fares as time goes on. It's just a shame that all the tools/library/IDE is all in C++! Thanks, Geoff
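A small illustration of the slicing and concatenation operators mentioned above, as a sketch:

```d
void main()
{
    char[] s = "first impressions";
    char[] head = s[0 .. 5];        // slice: a view into s, no copy made
    char[] msg  = head ~ " class";  // ~ concatenates into a new array
    assert(head == "first");
    assert(msg == "first class");
}
```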
Sep 28 2006
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:Hi, I'm a C++ user who's just tried D

I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot.

alias char[] string;

I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()?

This syntax sugar works for all arrays.

func(T[] x, a)
x.func(a)

are equivalent.

2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does?

Yes it can use anonymous delegates. You can also overload it in classes.

3.) From the newsgroups, it seems like 'auto' as local raii and 'auto' as automatic type deduction are still linked to the one keyword.

There are lots of D users hoping that this wart will be repaired before too long.

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source.

Thanks for the Build comments. One unfortunate thing I find is that one person's defaults are another's exceptions. That is why you can tailor Build to your 'default' behaviour requirements. In this case, create a text file in the same directory that Build.exe is installed in, called 'build.cfg' and place in it the line ...

CMDLINE=-od./obj

Then when you run the tool, the command line switch is applied every time you run it. -- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 29/09/2006 4:44:52 PM
Sep 28 2006
Derek Parnell wrote:On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.I was a bit underwhelmed by the syntax of char[].Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;
Sep 29 2006
Walter Bright wrote:An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.A string alias might still be, just as the bool alias was. --anders
Sep 29 2006
On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:Derek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can. And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems? -- Derek Parnell Melbourne, Australia "Down with mediocrity!"On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.I was a bit underwhelmed by the syntax of char[].Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;
Sep 29 2006
Derek Parnell wrote:On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:The string you're talking about is not just a lump of text. More specifically it's a lump of text, irregularly interspersed with short non-ascii ubyte sequences. The latter being of course the tails of UTF-8 "characters".An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.And is it there yet? I mean, given that a string is just a lump of text
Sep 29 2006
Derek Parnell wrote:On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:I just quickly want to interject my wish for aliases for the basic string array types. -DavidMDerek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can. And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.I was a bit underwhelmed by the syntax of char[].Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;
Sep 29 2006
Derek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
Immutability and some guarantees about the validity of the state of an immutable string in a concurrent setting are what set Java strings apart. Garbage collection without immutable strings in the standard library is quite out of the ordinary. Walter Bright wrote:Derek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like its dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool. -- Derek Parnell Melbourne, Australia "Down with mediocrity!"And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61) may be found within a Japanese multibyte glyph? Or even a very long Japanese text. That is not correct. The designers of UTF-8 knew that this would be dangerous, and created UTF-8 so that such _will_not_happen_. Ever. Therefore, something like std.string.find() doesn't even have to know about it. Basically, std.string.find() and comparable functions, only have to receive two octet sequences, and see where one of them first occurs in the other. No need to be aware of UTF or ASCII. For all we know, the strings may even be in EBCDIC. Still works. If the strings themselves are valid (in whichever encoding you have chosen to use), then the result will also be valid. ((For the sake of completeness, here I've restricted the discussion to the version of such functions that accept ubyte[] compatible input (obviously including char[]). Those taking 16 or 32 bits, and especially if we deliberately feed input of wrong width to any of these, then of course the results will be more complicated.))
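Georg's point can be demonstrated directly: a byte-level search for an ASCII character in UTF-8 text can never land inside a multibyte sequence. A sketch (the Japanese text is just an arbitrary example; `std.string.find` is the Phobos routine under discussion):

```d
import std.string;

void main()
{
    // "日本語" is three characters, each three bytes in UTF-8
    char[] s = "日本語abc";
    assert(s.length == 12);  // 9 bytes of Japanese plus "abc"

    // byte-level search is still correct: 'a' (0x61) cannot occur
    // inside a multibyte sequence, whose bytes all have the high bit set
    int i = std.string.find(s, 'a');
    assert(i == 9);
    assert(s[i .. s.length] == "abc");
}
```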
Sep 29 2006
Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character.That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.It looks for byte-subsets rather than character sub-sets.I don't think it's broken, but if it is, those are bugs, not fundamental problems with char[], and should be filed in bugzilla.It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like its dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool.I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when: 1) they provide an abstraction against the presumption that the underlying type may change 2) they provide a self-documentation purpose (1) certainly doesn't apply to string. (2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2). And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. 
Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively. If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).
Sep 29 2006
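[Editor's note: the high-bit property Walter describes is a fact of the UTF-8 encoding itself, so it can be checked mechanically in any UTF-8-aware language. A minimal Python sketch; the Japanese sample text is arbitrary:]

```python
text = "日本語のテキスト"          # arbitrary Japanese sample text
data = text.encode("utf-8")

# every byte of a UTF-8 multibyte sequence has the high bit (0x80) set
assert all(b & 0x80 for b in data)

# so a plain byte-level search for ASCII 'a' (0x61) can never
# false-match inside a multibyte character
assert data.find(b"a") == -1
```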
On Fri, 29 Sep 2006 23:11:37 -0700, Walter Bright wrote:Derek Parnell wrote:Thanks. That has cleared up some misconceptions and preconceptions that I had with UTF encoding. I can reduce some of my home-grown routines now and reduce the number of times that I (think I) need dchar[] ;-)I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character.That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.No argument there.It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like it's dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool.I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when: 1) they provide an abstraction against the presumption that the underlying type may change 2) they provide a self-documentation purpose (1) certainly doesn't apply to string.(2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).This is a little more debatable, but not worth generating hostility. 
A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *when compared to neighboring characters*. The order of characters in text is significant but not necessarily so in an arbitrary character array. Conceptually a string is different from a char[], even though they are implemented using the same technology.And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.And yet we have "toString" and not "toCharArray" or "toUTF"! And we still have the "printf" in object.d too!If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you needed appeasing is lost though). :-) -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
Sep 30 2006
Derek Parnell wrote:I certainly hope this thread doesn't degenerate into that like some of the others.(2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).This is a lttle more debatable, but not worth generating hostility.A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *where compared to neighboring characters*. The order of characters in text is significant but not necessarily so in a arbitary character array. Conceptually a string is different from a char[], even though they are implemented using the same technology.You do have a point there.True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful. I suppose that since I grew up with char* meaning string, using char[] seems perfectly natural. I tried typedef'ing char* to string now and then, but always wound up going back to just using char*.And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.And yet we have "toString" and not "toCharArray" or "toUTF"!And we still have the "printf" in object.d too!I know many feel that printf doesn't belong there. It certainly isn't there for purity or consistency. It's there purely (!) for the convenience of writing short quickie programs. 
I tend to use it for quick debugging test cases, because it doesn't rely on the rest of D working.No, you certainly don't need to appease me! I do care about maintaining a reasonably consistent style in Phobos, but I don't believe a language should enforce a particular style beyond the standard library. Viva la difference. P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.htmlIf someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you need appeasing is lost though). :-)
Sep 30 2006
On Sat, 30 Sep 2006 21:18:02 -0700, Walter Bright wrote:P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.htmlOh, I threw that away ages ago ;-) -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
Oct 01 2006
Walter Bright wrote:Nope, it just looks correct. -- Lars Ivar Igesund blog at http://larsivi.net DSource & #D: larsiviAnd yet we have "toString" and not "toCharArray" or "toUTF"!True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.
Oct 01 2006
Lars Ivar Igesund wrote:Walter Bright wrote:I don't think renaming toString to toUTF gets rid of any confusion. AFAIK, toString is meant for debugging and char[] should be enough, and yet flexible enough for unicode strings. In fact, "string toString()" would be a good solution too. --- My 4 reasons for the "string" aliases:
* readability: fewer [] pairs;
* safety: char[] is not zero-terminated, so let's not pretend there's a relation with C's char*. In fact: let's hide any relation;
* clarity: a char[] should not be iterated 1 char at a time, which makes it different from an int[];
* consistency: "string toString()".
L.Nope, it just looks correct.And yet we have "toString" and not "toCharArray" or "toUTF"!True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.
Oct 02 2006
Walter Bright wrote:True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
Oct 01 2006
Georg Wrede wrote:Walter Bright wrote:I would kind of agree with this, but I think it's a two-edged knife. If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...) If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work. For instance, from a Java perspective: char[] : Users don't know that it's "String"; users see it as low-level. Some will try to write things like 'find()' by hand since they will figure arrays are low level and not expect this to exist. string : Users will think it's immutable, special; they will ask "how do I get one of the characters out of a string", "how do I convert string to char[]?", and other things that would be obvious without the alias. KevinTrue, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
Oct 02 2006
Kevin Bealer wrote:Georg Wrede wrote:Yes.Walter Bright wrote:I would kind of agree with this, but I think it's a two-edged knife. If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...) If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work. For instance, from a Java perspective: char[] : Users don't know that it's "String"; users see it as low-level. Some will try to write things like 'find()' by hand since they will figure arrays are low level and not expect this to exist.True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.string : Users will think it's immutable, special; they will ask "how do I get one of the characters out of a string", "how do I convert string to char[]?", and other things that would be obvious without the alias.Well, with string, folks would at least be inclined to search for the library function to do it. --- Overall, having string instead of char[] should result in folks learning and doing more with D _before_ they get tangled with UTF issues. (I guess, getting tangled with UTF is unavoidable.) But the more later folks stumble on this, the better they can handle it. If it happens too soon, then they will just run away from D. But substituting string for char[] in D is not enough. More than half the issue is the wording in the docs. --- Another thing intimately connected with this is whether we should have char[] or utf8[] (string or no string, this is an important thing anyway). I understand that "char" is one of the words that a seasoned programmer's fingers know by heart. So it would feel simply disgusting to have to learn (and bother) to write "utf8" which I admit is a lot more work to type. (Seriously.) 
Now, "string" is easy for the fingers, and then you get to skip "[]", which makes it all a little more palatable. Having string would let us have the underlying type be utf8[], which really emphasizes and calls your attention to the fact that it's not byte-by-byte stuff we have there.
Oct 03 2006
Kevin Bealer wrote:If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...) If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work.Which could be a *good* thing, since it would stop users from hurting themselves by pretending that the D strings are arrays of characters ? And when they have read up that they are "arrays of Unicode code units", they should be OK with interpreting the "string" alias as char[] arrays.For instance, from a Java perspective: char[] : Users don't know that it's "String"; users see it as low-level. Some will try to write things like 'find()' by hand since they will figure arrays are low level and not expect this to exist. string : Users will think it's immutable, special; they will ask "how do I get one of the characters out of a string", "how do I convert string to char[]?", and other things that would be obvious without the alias.I think the best answer would be: "to get a char[] from the string, use the std.utf.toUTF8 function", since this also works even if you redeclare the "string" alias to be something else - like wchar_t[] ? Earlier* I suggested adding the alias utf8_t for "char", just like we have int8_t for "byte", but I wouldn't rename the actual D types. Just a little std.stdutf module with some aliases, if ever needed... string std.string.toString( ) utf8_t[] std.utf.toUTF8( ) utf16_t[] std.utf.toUTF16( ) utf32_t[] std.utf.toUTF32( ) --anders * digitalmars.D/11821, 2004-10-15
Oct 03 2006
Derek Parnell wrote:Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart. -- Bruno Medeiros - MSc in CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D(2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).This is a lttle more debatable, but not worth generating hostility. A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *where compared to neighboring characters*. The order of characters in text is significant but not necessarily so in a arbitary character array. Conceptually a string is different from a char[], even though they are implemented using the same technology.
Oct 01 2006
Bruno Medeiros wrote:Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart.There are also many cases where char arrays are not strings:

Single array of characters, not strings:
    char GAME_10PT_LETTERS[] = { 'x', 'z' };

Two-dimensional array of characters, not string arrays:
    char GAME_LETTERS[][] = { GAME_0PT_LETTERS, GAME_1PT_LETTERS, .. };
    char m_scrabbleBoard[20][20];
Oct 01 2006
Derek Parnell schrieb am 2006-09-30:On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:~wow~ Have a look at std.string.find's source and try to stop giggling *g* The correct implementation would be: The same applies to ifind and the like. Thomas
Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.
Sep 30 2006
Thomas Kuehne schrieb am 2006-09-30:Derek Parnell schrieb am 2006-09-30:As it seems, the original code depends on the undocumented index behavior with regards to silent transcoding in foreach. Thomas
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:~wow~ Have a look at std.string.find's source and try to stop giggling *g* The correct implementation would be:Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.
Sep 30 2006
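[Editor's note: the subtlety behind the find/foreach exchange above is that a UTF-8 search can report either a byte (code unit) offset or a character (code point) index, and the two disagree as soon as multibyte characters precede the match. A small Python sketch of the distinction:]

```python
s = "日本a"                       # two 3-byte characters, then ASCII 'a'
data = s.encode("utf-8")

# a byte-level search reports the code-unit offset...
assert data.find(b"a") == 6

# ...while a character-level search reports the code-point index
assert s.find("a") == 2
```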
Thomas Kuehne wrote:As it seems, the original code depends on the undocumented index behavior with regards to silent transcoding in foreach.The wording could be more explicit, but I think the current documentation implies the actual behavior: "The index must be of int or uint type, it cannot be inout, and it is set to be the index of the array element." The docs should probably also be revised to allow for 64-bit indices, where the index would be long or ulong. Something along the lines of: "The index must be an integer type of size equal to size_t.sizeof. . ." Sean
Sep 30 2006
Walter Bright wrote:Derek Parnell wrote:Hi, The main reasons I think are these: It simplifies the initial examples, particularly main(string[]), and maps such as string[string]. More complex examples are a map of words to text lines, string[][string], rather than char[][][char[]]. It clarifies the actual use of the entity. It is a text string, not just a jumbled array of characters. Arrays of char can be used for other things, such as the set of player letters in a scrabble game. A string has the additional usage that we know it as is text string. The alias reflects that intent. Given a user wants to use a string, there is no need to expose the implementation detail of how strings are done in D. Perhaps in perl, strings are a linked list of shorts, but it doesn't mean that you'd have list<short> all over the place. Use of char[] and char[][] looks like low level C. It has also been noted that it encourages char based indexing, which is not a good thing for utf8. Anyway, hope one of those points grabbed you! GeoffAnd is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
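[Editor's note: Geoff's string[][string] example - a map from words to the text lines containing them - can be sketched in Python terms; the names and sample lines are illustrative only:]

```python
lines = ["the quick fox", "jumps over the dog"]

# build the word -> lines-containing-it map (string[][string] in D terms)
index: dict[str, list[str]] = {}
for line in lines:
    for word in line.split():
        index.setdefault(word, []).append(line)

assert index["the"] == ["the quick fox", "jumps over the dog"]
assert index["fox"] == ["the quick fox"]
```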
Walter Bright wrote:Derek Parnell wrote:The reason *I* want it is _alias_ does not respect the private: visibility modifier. So when I pull out an old piece of code which says alias char[] string and import it in my newer module I get conflicts when I compile. Then I must do this silly hack where I include the newer file from the old or vice versa. If you don't add this into phobos, at least adopt a method to discriminate between more than one alias with the same name to resolve the issue. -DavidMAnd is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
Geoff Carlton wrote:A simple alias of char[] to string would simplify the first glance code. string x; // yep, a string main (string[]) // an array of strings string[string] m; // map of string to string I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()? If so, it means that all the string functionality can be added and then used naturally as member functions on this "string" (which is really just the plain old char[] in disguise).Problem of "char[]" is both that it hides the fact that "char" is UTF-8 while at the same time it exposes the fact that it's stored as an array. You can "improve" upon that readability with aliases, like declaring say utf8_t -> char and string -> utf8_t[], but you still need to understand Unicode and Arrays in order to use it outside of the provided methods... I think "hides the implementation" was the biggest argument against it ? http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssuesThis is a small thing, but I think it would help in terms of the mindset of strings being a first class primitive, and clear up simple "hello world" examples at the same time. Put simply, every modern language has a first class string primitive type, except D - at least in terms of nomenclature.I did the big mistake of thinking it would be a good thing to be able to switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like: version(UNICODE) alias char[] string; else // version(ANSI) alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix Still trying to sort out all the code problems with that idea, as there is a ton of toUTF8 and other conversions to make strings work together. In retrospect it would have been much easier to have stuck with char[], and do the conversion from UTF-8 to the local encoding on the C++ side. 
(since there were no guarantees that the "char" and "wchar_t" types in C++ used UTF encodings, even if they did so in Unix/GTK+ for instance) Any (minor) performance issues of having to do the UTF-8 <-> UTF-32 conversions were not worth the hassle of doing it on the D side, IMHO. So I agree with the "alias char[] string;" and the string[string] args. It's going to be used as wx.common.string for instance, in wxD library. --anders
Sep 29 2006
I did the big mistake of thinking it would be a good thing to be able to switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like: version(UNICODE) alias char[] string; else // version(ANSI) alias wchar_t[] string; // wchar[] on Windows, dchar[] on UnixExcept the other way around, of course!

version(UNICODE)
    alias wchar_t[] string;
else // version(ANSI)
    alias char[] string;

Now, to get me some more coffee... :-P --anders
Sep 29 2006
I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish they would be included by default in Phobos.

alias char[]  string;
alias wchar[] wstring;
alias dchar[] dstring;

Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since they were string constants.) L.
Sep 29 2006
Lionello Lunesu wrote:Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)And probably only for ASCII string constants, at that... --anders
Sep 29 2006
Anders F Björklund wrote:Lionello Lunesu wrote:Right, that too!

char[] somestring = "....";
func( somestring[0] );  // WRONG: somestring[x] is not 1 character!

Using "string" would make it less obvious:

string somestring = ".....";
func( somestring[0] );  // [0] means what?

This goes for iteration as well. DMD will still deduce 'char' as the type, but at least one's less likely to type foreach(char c;str). If you want to iterate the UNICODE characters in a string, you'll specify "dchar" as the type and you won't worry about "how come I can use dchar when it's a char[]":

foreach(dchar c; somestring) func(c);  // correct

L.Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)And probably only for ASCII string constants, at that...
Sep 29 2006
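[Editor's note: the code-unit vs. code-point difference that foreach(dchar c; somestring) papers over can be sketched in Python, with bytes standing in for char[] and str iteration for the dchar foreach:]

```python
s = "héllo"
data = s.encode("utf-8")          # the char[] view: UTF-8 code units

assert len(data) == 6             # six code units (é takes two bytes)
assert len(list(s)) == 5          # five code points when iterated as characters

# indexing a single code unit does not yield a whole character
assert data[1:2] != "é".encode("utf-8")
```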
Lionello Lunesu wrote:I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish they would be included by default in Phobos. alias char[] string; alias wchar[] wstring; alias dchar[] dstring; Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)Using char[] as long as you don't know about UTF seems to work pretty well in D. But the moment you realise that we're having potential multibyte characters in what essentially is a ubyte[], you get scared to death, and start to wonder how on earth you haven't yet blown up your hard disk. You start having nightmares about slicing char arrays at the wrong place, extracting single chars that might not be storable in a char, and all of a sudden you decide to stick with your old language "till things calm down". The only medicine to this is simply to shut your eyes and keep coding on like you never did realise anything. It's a little like when you first realised Daddy isn't holding your bike: you instantly fall hurting yourself, instead of realizing that he's probably let go ages ago, and you still haven't fallen, so simply keep going. --- This doesn't mean I'm happy with this either, but I don't have the energy to conjure up a significantly better solution _and_ fight for it till it gets accepted. (Some things are just too hard to fix, like "bit=bool" was, and now "auto/auto".)
Sep 29 2006
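[Editor's note: Georg's slicing nightmare is concrete - cutting a char[] at an arbitrary byte offset can split a multibyte character in half. A Python sketch of exactly that failure; the Swedish sample word is arbitrary:]

```python
s = "smörgås"
data = s.encode("utf-8")

# byte offset 3 falls in the middle of 'ö' (a two-byte sequence)
try:
    data[:3].decode("utf-8")
    split_cleanly = True
except UnicodeDecodeError:
    split_cleanly = False
assert not split_cleanly

# slicing on a character boundary (offset 2) is fine
assert data[:2].decode("utf-8") == "sm"
```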
Georg Wrede wrote:Lionello Lunesu wrote:haha too true. I experienced this too as I read this ng. It hasn't been THAT traumatic for me though, since everything seems to work as long as you stick to English. I don't have the resources to even begin thinking about non-English text (ex: paying people to translate stuff), so I don't lose any sleep about it, at least not yet. Perhaps there should be a string struct/class that has an undefined underlying type (it could be UTF-8, 16, 32, you dunno really), and you could index it to get the *complete* character at any position in the string. Basically, it is like char[], but it /just works/ in all cases. I'd almost rather have the size of a char be undefined, and just have char[] be the said magic string type. If you want something with a .size of 1, then there is byte/ubyte. There would probably have to be some stuff in the phobos internals to handle such a string in a correct manner. Going even further... if you could make char[] be such a magic string type, then wchar[] and dchar[] could probably be deprecated - use ushort and uint instead. Then add the following aliases to phobos:

alias ubyte  utf8;
alias ushort utf16;
alias uint   utf32;

Just a thought. I'm no expert on UTF, but maybe this can start a discussion that will result in the nightmares ending :)I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish they would be included by default in Phobos. alias char[] string; alias wchar[] wstring; alias dchar[] dstring; Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)Using char[] as long as you don't know about UTF seems to work pretty well in D. 
But the moment you realise that we're having potential multibyte characters in what essentially is a ubyte[], you get scared to death, and start to wonder how on earth you haven't yet blown up your hard disk. You start having nightmares about slicing char arrays at the wrong place, extracting single chars that might not be storable in a char, and all of a sudden you decide to stick with your old language "till things calm down". The only medicine to this is simply to shut your eyes and keep coding on like you never did realise anything. It's a little like when you first realised Daddy isn't holding your bike: you instantly fall hurting yourself, instead of realizing that he's probably let go ages ago, and you still haven't fallen, so simply keep going. --- This doesn't mean I'm happy with this either, but I don't have the energy to conjure up a significantly better solution _and_ fight for it till it gets accepted. (Some things are just too hard to fix, like "bit=bool" was, and now "auto/auto".)
Sep 29 2006
Chad J > wrote:Perhaps there should be a string struct/class that has an undefined underlying type (it could be UTF-8, 16, 32, you dunno really), and you could index it to get the *complete* character at any position in the string. Basically, it is like char[], but it /just works/ in all cases. I'd almost rather have the size of a char be undefined, and just have char[] be the said magic string type. If you want something with a .size of 1, then there is byte/ubyte. There would probably have to be some stuff in the phobos internals to handle such a string in a correct manner.I have thought about this too.Going even further... if you could make char[] be such a magic string type, then wchar[] and dchar[] could probably be deprecated - use ushort and uint instead. Then add the following aliases to phobos: alias ubyte utf8; alias ushort utf16; alias uint utf32;I completely agree, char should hold a character independently of encoding and NOT a code unit or something else. I think it would be beneficial to D in the long term if chars were done right (meaning that they can store any character); how it is implemented is not important, and I believe performance is not a problem here, so ease of use and correctness would be appreciated.
Sep 29 2006
Johan Granberg wrote:I completely agree, char should hold a character independently of encoding and NOT a code unit or something else. I think it would be beneficial to D in the long term if chars where done right (meaning that they can store any character) how it is implemented is not important and i believe performance is not a problem here, so ease of use and correctness would be appreciated.Why isn't performance a problem? If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced. If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise. In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I wound use some sort of scripting language. A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want. * OTOH a language should not mandate code to be efficient at the expense of ease of coding.
Sep 29 2006
BCS wrote:Johan Granberg wrote:I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in China, you will be disappointed with the results. It won't matter how fast the program runs, because bad stuff will happen like entire strings becoming unreadable to the user. Technically if you follow UTF and do your char[] manipulations very carefully, it is correct, but realistically few if any people will do such things (I won't). Also, if you do this, your program will probably run as slow as one with the proposed char/string solution, maybe slower (since language/stdlib level support can be heavily optimized). What I'd like then, is a program that is correct and as fast as possible while still being correct. Sure you can get some speed gains by just using ASCII and saying to hell with UTF, but you should probably only do that when profiling has shown that such speed gains are actually useful/needed in your program. Ultimately we have to decide whether we want D to default to UTF code which might run slightly slower but allow better localization and international friendliness, or if we want it to default to ASCII or some such encoding that runs slightly faster but is mostly limited to English. I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routines without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII. Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a separate library, call it "ASCII lib", and there's your library support for ASCII. 
That leaves string literals, which is a slight problem, but I suppose easily fixed: ubyte[] hi = "hello!"a; Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these. Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.
Sep 29 2006
Chad J > wrote:I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routine without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII.But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a seperate library, call it "ASCII lib", and there's your library support for ASCII. That leaves string literals, which is a slight problem, but I suppose easily fixed: ubyte[] hi = "hello!"a;I don't understand this, why can't you use UTF-8 for this ? char[] hi = "hello!";Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these. Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time. --anders
Sep 29 2006
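A minimal sketch of the foreach(dchar c; str) decoding Anders describes above, assuming a D1-era compiler and Phobos; the byte counts follow directly from the UTF-8 encoding rules:

```d
import std.stdio;

void main()
{
    char[] str = "héllo";   // 'é' takes two UTF-8 code units

    // str.length counts code units (bytes), not characters
    writefln("code units:  %d", str.length);

    // a dchar loop variable makes foreach decode one code point per step
    int codepoints = 0;
    foreach (dchar c; str)
        codepoints++;
    writefln("code points: %d", codepoints);
}
```

Here str.length reports 6 while the code point count is 5, which is exactly the unit mismatch the thread keeps returning to.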
Anders F Björklund wrote:Chad J > wrote:Probably 7-bit. Anything where the size of one character is ALWAYS one byte. I am already assuming that ASCII is a subset or at least is mostly a subset of UTF8. However, I talk about it in an exclusive manner because if you handle UTF8 strings properly then the code will probably run at least slightly slower than with ASCII-only strings.I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routine without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII.But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)I was talking about IF we made char[] into a datatype that handles all of those odd corner cases correctly (slices into multibyte strings, for instance) then it will no longer be the same fast ASCII-only routines. So for those who want the fast ASCII-only stuff, it would nice to specify a way to make string literals such that each character in the literal takes only one byte, without ugly casting. To get an ASCII monobyte string from a string literal in D I would have to do the following: ubyte[] hi = cast(ubyte[])"hello!"; hmmm, yuck.Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a seperate library, call it "ASCII lib", and there's your library support for ASCII. That leaves string literals, which is a slight problem, but I suppose easily fixed: ubyte[] hi = "hello!"a;I don't understand this, why can't you use UTF-8 for this ? 
char[] hi = "hello!";I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[]. If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so: class String { char[] data; ... dchar opIndex( int index ) { foreach( int i, dchar c; data ) { if ( i == index ) return c; i++; } } // similar thing for opSlice down here ... } Which is probably slower than could be done. All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these. Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.
Sep 29 2006
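Chad's worry about byte-index slicing is easy to demonstrate; a sketch (assuming std.utf.validate and UtfException from D1-era Phobos) where a slice boundary lands in the middle of 'ö':

```d
import std.utf;

void main()
{
    char[] str = "Björklund";
    // 'ö' occupies bytes 2 and 3, so slicing at byte 3 splits it
    char[] bad = str[0 .. 3];

    bool threw = false;
    try
        validate(bad);          // throws on malformed UTF-8
    catch (UtfException e)
        threw = true;
    assert(threw);              // the naive slice is not valid UTF-8 on its own
}
```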
Chad J > wrote:Probably 7-bit. Anything where the size of one character is ALWAYS one byte. I am already assuming that ASCII is a subset or at least is mostly a subset of UTF8. However, I talk about it in an exclusive manner because if you handle UTF8 strings properly then the code will probably run at least slightly slower than with ASCII-only strings.It's mostly about looking out for the UTF "control" characters, which is not more than a simple assertion in your ASCII-only functions really... I don't think handling UTF-8 properly is a burden for string functions, when you compare it with the enormous gain that it has over ASCII-only.Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term... As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead... But it's still possible to translate, transform, and translate back ?What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so: [...]In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!Using Unicode strings and characters does require a little learning... 
(where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependent, unlike the more universal UTF-8 format. --anders
Sep 29 2006
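To make the surrogate trade-off concrete, here is a sketch of how a single supplementary-plane character splits across the three encodings (U+1D11E, the musical G clef, is an arbitrary example from outside the BMP):

```d
void main()
{
    // U+1D11E lies outside the BMP, so UTF-16 needs a surrogate pair
    wchar[] w = "\U0001D11E";
    assert(w.length == 2);      // two code units for one code point

    dchar[] d = "\U0001D11E";
    assert(d.length == 1);      // UTF-32: always one unit per code point

    char[] c = "\U0001D11E";
    assert(c.length == 4);      // UTF-8: four bytes for this code point
}
```

This is the "16 bits don't suffice" case Georg warns about: code that assumes one wchar per character breaks exactly here.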
Anders F Björklund wrote:If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependant unlike the more universal UTF-8 format.Problem is, using 16-bit you sort-of get away with _almost_ all of it. But as a pay-back, the day your 16 bits don't suffice, you're in deep crap. And that day _will_ come.
Sep 29 2006
Anders F Björklund wrote:So it seems to me the problem is that those 2 bytes are both 2 characters and 1 character at the same time. In this case, I'd prefer being able to index to a safe default (like the ö, instead of the umlauts next to the o), or not being able to index at all.Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term...What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead... But it's still possible to translate, transform, and translate back ?I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint. Maybe you mean a different FAQ here, in which case, could I have a link please? I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :( Also I still am not sure exactly what a code point is. And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either. When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[]. It might be wchar[]. Or dchar[]. Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed). 
So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing. Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those. Maybe this is a bit too complex, but I can dream, hehe.My impression has gone from being quite scared of UTF to being not so worried, but only for myself. D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters. Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings. This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling. It's a newbie trap. Like I said earlier, I either want to be able to index/slice strings safely, or not at all (or better yet, not by any intuitive means).If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so: [...]In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!Using Unicode strings and characters does require a little learning... (where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependant unlike the more universal UTF-8 format. 
--anders
Sep 30 2006
Chad J > wrote:I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint. Maybe you mean a different FAQ here, in which case, could I have a link please? I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :(Also I still am not sure exactly what a code point is. And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either.Code point is the closest thing to a "character", although it might take more than one Unicode code point to represent a single Unicode grapheme. Surrogates are used with UTF-16, to represent "too large" code points... i.e. they always occur in "surrogate pairs", which combine to a single code point.When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[]. It might be wchar[]. Or dchar[]. Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed). So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing. Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those. Maybe this is a bit too complex, but I can dream, hehe.Actually I said that dchar[] (i.e. UTF-32) wasn't ideal, but anyway... (UTF-8 or UTF-16 is preferable, for the reasons in the UTF FAQ above) We already have char[] as the string default in D, but most models for a String class use wchar[] (i.e.
UTF-16), for instance Mango or Java: * http://mango.dsource.org/classUString.html (uses the ICU lib) * http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html All formats do use Unicode, so converting from one UTF to another is mostly a question of memory/performance and not about any data loss. However, it is not converted at compile time (without using templates) so mixing and matching different representations is somewhat of a pain. I think that char[] for string and wchar[] for String are good defaults.My impression has gone from being quite scared of UTF to being not so worried, but only for myself. D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters. Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings. This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling. It's a newbie trap.It is, since it isn't really "arrays of characters" but "arrays of code units". What muddies the waters further is that sometimes they're equal. That is, with ASCII characters each character fits into a a D char unit. Without surrogates, each character (from BMP) fits into one wchar unit. However, all code that handles the shorter formats should be prepared to handle non-ASCII (for UTF-8) and surrogates (for UTF-16), or use UTF-32: bool isAscii(char c) { return (c <= 0x7f); } bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); } But a warning that D uses multi-byte strings might be in order, yes... Another warning that it only supports UTF-8 platforms* might also be ? --anders * "main(char[][] args)" does not work for any non-UTF consoles, as you will get invalid UTF sequences for the non-ASCII chars.
Oct 01 2006
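The two helper predicates from the message above can be exercised like this (a sketch; the byte values follow directly from the ASCII and UTF-16 definitions being discussed):

```d
// restated from the message above so the sketch is self-contained
bool isAscii(char c) { return (c <= 0x7f); }
bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); }

void main()
{
    char[] s = "Bö";
    assert(isAscii(s[0]));       // 'B' is a single ASCII byte
    assert(!isAscii(s[1]));      // first byte of the two-byte 'ö'

    wchar[] w = "\U0001D11E";    // needs a surrogate pair in UTF-16
    assert(isSurrogate(w[0]));
    assert(isSurrogate(w[1]));
}
```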
Chad J > wrote:
char[] data;

dchar opIndex( int index ) {
    foreach( int i, dchar c; data ) {
        if ( i == index ) return c;
        i++;
    }
}
This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")

import std.stdio;

void main()
{
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
        writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
}

Will print the following sequence:

   0 \U00000042 'B'
   1 \U0000006A 'j'
   2 \U000000F6 'ö'
   4 \U00000072 'r'
   5 \U0000006B 'k'
   6 \U0000006C 'l'
   7 \U00000075 'u'
   8 \U0000006E 'n'
   9 \U00000064 'd'

Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders
Sep 29 2006
Anders F Björklund wrote:Chad J > wrote:Ah. And yep, the i++ was a typo (oops). So maybe something like:

dchar opIndex( int index ) {
    int i;
    foreach( dchar c; data ) {
        if ( i == index ) return c;
        i++;
    }
}

The i is no longer the foreach's index, so the i++ isn't a typo anymore. Thanks for the info. I'll check out that FAQ a little later, gotta go.
char[] data;

dchar opIndex( int index ) {
    foreach( int i, dchar c; data ) {
        if ( i == index ) return c;
        i++;
    }
}
This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")

import std.stdio;

void main()
{
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
        writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
}

Will print the following sequence:

   0 \U00000042 'B'
   1 \U0000006A 'j'
   2 \U000000F6 'ö'
   4 \U00000072 'r'
   5 \U0000006B 'k'
   6 \U0000006C 'l'
   7 \U00000075 'u'
   8 \U0000006E 'n'
   9 \U00000064 'd'

Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders
Sep 29 2006
Chad J > wrote:I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in China, you will be disappointed with the results. It won't matter how fast the program runs, because bad stuff will happen like entire strings becoming unreadable to the user.Wrong. And that's precisely what I meant about the Daddy holding bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works. --- The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of, bit-twiddling individual octets in these "char" arrays. So things just keep on working. --- Not convinced yet? Well, a lot of folks here are from Europe, and our languages contain "non-ASCII" characters. Our text manipulating programs still work all right. And, actually, D is pretty popular in Japan. Every once in a while some Japanese guys pop on-and-off here, and some of them don't even speak English, so they use a machine translator(!) to talk with us. Just guess if they use ASCII in their programs. And you know what, most of these guys even use their own characters for variable names in D! And not one of them has complained about "disappointing results". --- That's why I continued with: keep your eyes shut and keep on coding.
Sep 29 2006
Georg Wrede wrote:The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this:

char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
    str[i] = doSomething( str[i] );
}

and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.
Sep 29 2006
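One way to repair the loop above without a String class is to decode, transform, and re-encode; a sketch assuming std.utf.encode from D1-era Phobos, with doSomething as a hypothetical per-character transform standing in for whatever the loop body does:

```d
import std.utf;

// hypothetical transform, standing in for the doSomething in the message
dchar doSomething(dchar c)
{
    return c;
}

void main()
{
    char[] str = "some string in nönenglish text";

    // decode to code points, transform, and encode back, instead of
    // assigning through str[i] one byte at a time
    char[] result;
    foreach (dchar c; str)
        encode(result, doSomething(c));

    assert(result == str);   // the identity transform round-trips intact
}
```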
Chad J > wrote:Georg Wrede wrote:Yes. That's why I talked about you falling down once you realise Daddy's not holding the bike. Part of UTF-8's magic lies in that it is amazingly easy to get working smoothly with truly minor tweaks to "formerly ASCII-only" libraries -- so that even the most exotic languages have no problem. Your concerns about the for loop are valid, and expected. Now, IMHO, the standard library should take care of "all" the situations where you would ever need to split, join, examine, or otherwise use strings, "non-ASCII" or not. (And I really have no complaint (Walter!) about this.) Therefore, in no normal circumstances should you have to twiddle them yourself -- unless. And this "unless" is exactly why I'm unhappy with the situation, too. Problem is, _technology_wise_ the existing setup may actually be the best, both considering ease of writing the library, ease of using it, robustness of both the library and users' code, and the headaches saved from programmers who either haven't heard of the issue (whether they're American or Chinese!), or who simply trust their lives with the machinery. So, where's the actual problem??? At this point I'm inclined to say: the documentation, and the stage props! The latter meaning: exposing the fact that our "strings" are just arrays is psychologically wrong, and even more so is the fact that we're shamelessly storing entities of variable length in arrays which have no notion of such -- even worse, while we brag with slices! If this had been a university course assignment, we'd have been thrown out of class, for both half baked work, and for arrogance towards our client, victimizing the coder. The former meaning: we should not be like "we're bad enough to overtly use plain arrays for variable-length data, now if you have a problem with it, then go home and learn stuff, or then just trust us".
Both "documentation" and "stage props" ultimately meaning that the largest problem here is psychology, pedagogy, and education. --- A lot would already be won by: merely aliasing char[] to string, and discouraging other than guru-level folks from screwing with their internals. This alone would save a lot of Fear, Uncertainty and D-phobia. The documentation should take pains in explaining up front that if you _really_ want to do Character-by-Character ops _and_ you live outside of America, then the Right way to do it (ehh, actually the Canonical Way), is to first convert the string to dchar[]. Period. Then, if somebody else knows enough of UTF-8 and knows he can handle bit twiddling more efficiently than using the Canonical Way, with plain char[] and "foreignish", then let him. But let that be undocumented and Un-Discussed in the docs. Precisely like a lot of other things are. (And should be.) And will be. He's on his own, and he ought to know it. --- In other words, the normal programmer should believe he's working with black-box Strings, and he will be happy with it. That way he'll survive whether he's in Urduland or Boise, Idaho -- without neither ever needing to have heard about UTF nor other crap. Not until in Appendix Z of the manual should we ever admit that the Emperor's Clothes are just plain arrays, and we apologize for the breach of manners of storing variable length data in simple naked arrays. And here would be the right place to explain how come this hasn't blown up in our faces already. And, exactly how you'll avoid it too. (This _needs_ to contain an adequate explanation about the actual format of UTF-8.) --- TO RECAP The _single_ biggest strings-related disservice to our pilgrims is to lead them to believe, that D stores strings in something like utf8[] internally. Now that's an oxymoron, if I ever saw one. (If utf8[] was _actually_ implemented, it would probably have to be an alias of char[][]. Right? Right? 
What we have instead is ubyte[], which is _not_ the same as utf8[].) (Oh, and if it ever becomes obvious that not _everybody_ understood this, then that in itself simply proves my point here.) (*1) And the fault lies in the documentation, not the implementation! This results in braincell-hours wasted, precisely as much as everybody has to waste them, before they realise that the acronym RAII is a filthy lie. Akin only to the former "German _Democratic_ Republic". Only a politician should be capable of this kind of deception. Ok, nobody is doing it on purpose. Things being too clear to oneself often result in difficulties to find ways to express them to new people. (Happens every day at the Math department! :-( ) And since all in-the-know are unable to see it, and all not-in-the-know are too, then both groups might think it's the thing itself that is "the problem", and not merely the chosen _presentation_ of it. Sorry for sounding Righteous, arrogant and whatever. But this really is a 5 minute thing for one person to fix for good, while it wastes entire days or months _per_person_, from _every_ non-defoiled victim who approaches the issue. Originally I was one of them: hence the aggression. ------------------------------------------- (*1) Even I am not simultaneously both literally and theoretically right here. Those who saw it right away, probably won't mind, since it's the point that is the issue here. Now, having to write this disclaimer, IMHO simply again underlines the very point attempted here.The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time.
If I don't know about UTF, and I do just keep on coding, and I do something like this:

char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
    str[i] = doSomething( str[i] );
}

and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.
Sep 29 2006
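Georg's "Canonical Way" of converting to dchar[] before character-level work, then converting back, can be sketched with std.utf's toUTF32/toUTF8 (assuming D1-era Phobos):

```d
import std.utf;

void main()
{
    char[] str = "Björklund";

    // one array element per code point, so indexing and slicing are safe
    dchar[] s32 = toUTF32(str);
    assert(s32[2] == 'ö');
    assert(s32.length == 9);

    // convert back once the character-level work is done
    char[] back = toUTF8(s32);
    assert(back == str);
    assert(back.length == 10);   // 'ö' is two bytes again in UTF-8
}
```

The round trip costs an allocation and a pass over the data, which is exactly the overhead the thread weighs against char[]'s compactness.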
Chad J > wrote:But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this: char[] str = "some string in nonenglish text"; for ( int i = 0; i < str.length; i++ ) { str[i] = doSomething( str[i] ); } and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.Yes, you do have to be aware of it being UTF, just like in C you have to be aware that strings are 0 terminated. But once aware of it, there is plenty of support for it in the core language and in std.utf. You can also simply use dchar[], which has a one to one mapping between characters and indices, if you prefer. Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode. You can also wrap char[] inside a class that provides a view of the data as if it were dchar's. But I don't think the performance of such a class would be competitive. Interestingly, it turns out that most string operations do not need to be concerned with the number of char's in a character (like "find this substring"), and forcing them to care just makes for inefficiency.
Sep 29 2006
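Walter's point that substring search needs no code-point awareness follows from UTF-8 being self-synchronizing: a multi-byte sequence can never match starting in the middle of another character. A sketch with std.string.find (D1-era Phobos assumed):

```d
import std.string;

void main()
{
    char[] s = "Björklund";

    // find() compares raw bytes, yet multi-byte needles still match correctly
    assert(find(s, "rk") == 4);   // byte index: 'ö' counted as two bytes
    assert(find(s, "ö") == 2);
    assert(find(s, "xyz") == -1); // not found
}
```

Note the result is a code unit (byte) index, which is fine for feeding back into slices of the same char[].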
Walter Bright wrote:Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.You can also wrap char[] inside a class that provides a view of the data as if it were dchar's. But I don't think the performance of such a class would be competitive. Interestingly, it turns out that most string operations do not need to be concerned with the number of char's in a character (like "find this substring"), and forcing them to care just makes for inefficiency.Yup. I realized this while working on array operations and it came as a surprise--when I began I figured I would have to provide overloads for char strings, but in most cases it simply isn't necessary. Sean
Sep 30 2006
Sean Kelly wrote:Walter Bright wrote:It's so broken that there are proposals to reengineer core C++ to add support for UTF types.

1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]
2) none of the iteration, insertion, appending, etc., operations can handle multibyte
3) no UTF conversion or transliteration
4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)

Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.
Sep 30 2006
Walter Bright wrote:Sean Kelly wrote:Oops, forgot about this.Walter Bright wrote:It's so broken that there are proposals to reengineer core C++ to add support for UTF types. 1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.2) none of the iteration, insertion, appending, etc., operations can handle multibyteTrue. And I hinted at this above.3) no UTF conversion or transliteration 4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)Personally, I see this as a language deficiency more than a deficiency in std::string. std::string is really just a vector with some search capabilities thrown in. It's not that great for a string class, but it works well enough as a general sequence container. And it will work a tad better once they impose the same data contiguity guarantee that vector has (I believe that's one of the issues set to be resolved for 0x). Overall, I do agree with you. Though I suppose that's obvious as I'm a former C++ advocate who now uses D quite a bit :-) Sean
Oct 01 2006
Sean Kelly wrote:
>> 3) no UTF conversion or transliteration 4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)
> Personally, I see this as a language deficiency more than a deficiency in std::string.

That's why the proposals to fix it are rewriting some of the *core* C++ language.

> std::string is really just a vector with some search capabilities thrown in.

Another difficulty with it is that it doesn't have a connection with std::vector<char>.

> It's not that great for a string class, but it works well enough as a general sequence container. And it will work a tad better once they impose the same data contiguity guarantee that vector has (I believe that's one of the issues set to be resolved for 0x). Overall, I do agree with you. Though I suppose that's obvious as I'm a former C++ advocate who now uses D quite a bit :-)

:-)
Oct 01 2006
Georg Wrede wrote:
> Wrong. And that's precisely what I meant about the Daddy-holding-the-bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works.

But is this not a needless source of confusion, one that could be eliminated by defining char as "big enough to hold a Unicode code point", or something else that eliminates the possibility of incorrectly dividing UTF tokens? I will have to try using char[] with non-ASCII characters, though; I have been using dchar for that up till now.
Sep 29 2006
Johan Granberg wrote:
> Georg Wrede wrote:
>> Wrong. And that's precisely what I meant about the Daddy-holding-the-bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works.
> But is this not a needless source of confusion [...] I will have to try using char[] with non-ASCII characters, though; I have been using dchar for that up till now.

You might begin with pasting this and compiling it:

    import std.stdio;
    void main()
    {
        int öylätti;
        int ШеФФ;
        öylätti = 37;
        ШеФФ = 19;
        writefln("Köyhyys 1 on %d ja nöyrä 2 on %d, että näin.", öylätti, ШеФФ);
    }

It will compile, and run just fine. (The source file having been read into DMD as a single big string, and then having gone through comment removal, tokenizing, parsing, lexing, compiling, optimizing, and finally the variable names having found their way into the executable. Even though the front end has been written in D itself, with simply char[] all over the place.)

(Then you might see that the Windows "command prompt" window renders the output wrong, but that's only because Windows itself doesn't handle UTF-8 right in the command window.)

The next thing you might do is write a grep program (one that takes a file as input and writes the matching lines as output). Write the program as if you had never heard this discussion. Then feed it the Kalevala in Finnish, or Mao's Red Book in Chinese. It should still work.

As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.
Sep 29 2006
On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
> As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.

The Build program does lots of 'tampering'. I had to rewrite many standard routines and create some new ones to deal with Unicode characters, because the standard ones just don't work. And Build still fails to do some things correctly (e.g. case-insensitive compares), but that's on the TODO list. I have to think about UTF, because it doesn't work unless I do.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Sep 29 2006
Derek Parnell wrote:
> On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
>> As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.
> The Build program does lots of 'tampering'. I had to rewrite many standard routines and create some new ones to deal with Unicode characters because the standard ones just don't work.

Do you still remember which they were?

> And Build still fails to do some things correctly (e.g. case insensitive compares) but that's on the TODO list.

Yes, case-insensitive compares are difficult if you want to cater for non-ASCII strings. While it may not be unreasonably difficult to get American, European and Russian strings right, there will always be languages and character sets where even the Unicode guys aren't sure what is right. Unfortunately.
Sep 29 2006
Georg Wrede wrote:
> The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of, bit-twiddling individual octets in these "char" arrays. So things just keep on working.

I agree, but I disagree that there is a problem, or that UTF-8 is a bad choice, or that char[] or string should perhaps be called utf8 instead. As a note here, I actually had a page of text localised into Chinese last week - it came back as a UTF-8 text file.

The only thing with UTF-8 is that glyphs aren't represented by a single char. But UTF-16 is no better! And even UTF-32 code points can be combined into a single rendered glyph. So truncating a string at an arbitrary index is not going to slice on a glyph boundary in any encoding.

However, that doesn't mean UTF-8 is ASCII mixed with "garbage" bytes. That "garbage" is a unique series of bytes that represents a code point. This is a property not found in any other encoding. As such, everything works - strstr, strchr, strcat, printf, scanf - for ASCII, normal Unicode, and the "astral planes". It all just works.

The only thing that breaks is if you try to index or truncate the data by hand. But even that mostly works: you can iterate through looking for ASCII sequences, chop out ASCII, and string together more stuff; it all works because you can just ignore the higher-order bytes. Pretty much the only thing that fails is if you say "I don't know what's in the string, but chop it off at index 12".
Sep 29 2006
Geoff Carlton wrote:
> I agree, but I disagree that there is a problem, or that utf-8 is a bad choice, or that perhaps char[] or string should be called utf8 instead. [...] Pretty much the only thing that fails is if you said "I don't know whats in the string, but chop it off at index 12".

Yes.
Sep 29 2006
Georg Wrede wrote:
> Geoff Carlton wrote:
>> But even that mostly works, you can iterate through, looking for ASCII sequences, chop out ASCII and string together more stuff, it all works because you can just ignore the higher order bytes. Pretty much the only thing that fails is if you said "I don't know whats in the string, but chop it off at index 12".
> Yes.

How should we chop strings on character boundaries, then? I have a text-rendering function that uses FreeType, and I want to restrict the width of the rendered string by truncating it (I have to use some sort of search here, binary or linear). Right now I use dchar, but if char is sufficient it would save me conversions all over the place.
Sep 29 2006
Johan Granberg wrote:
> How should we chop strings on character boundaries?

std.utf.toUTFindex() should do the trick.
Sep 30 2006
BCS wrote:
> Why isn't performance a problem? If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced. If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise. In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language. A good set of libs should make most of this moot. Leave char as it is and define a typedef, struct or whatever that provides the added functionality that you want.
> * OTOH a language should not mandate code to be efficient at the expense of ease of coding.

I don't think any performance hit will be so big that it causes problems (at most 4x memory and negligible computation overhead). Hope that made clear what I meant.
Sep 29 2006
Johan Granberg wrote:
> BCS wrote:
>> Why isn't performance a problem? [...] If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise.
> I don't think any performance hit will be so big that it causes problems (max 4x memory and negligible computation overhead). Hope that made clear what I meant.

If you will note, I said nothing about the size of the hit. While some may disagree, I think that any unneeded hit is a problem. One alternative that I could live with would use 4 character types:

    char   one code unit in whatever encoding the runtime uses
    schar  one 8-bit code unit (ASCII or UTF-8)
    wchar  one 16-bit code unit (same as before)
    dchar  one 32-bit code unit (same as before)

(Using the same type for ASCII and UTF-8 may be a problem, but this isn't my field.) The point is that char, wchar and dchar do not represent numbers and should be their own types. This also preserves direct access to 8-, 16- and 32-bit types.
Oct 01 2006
BCS wrote:
> One alternative that I could live with would use 4 character types:
>    char   one code unit in whatever encoding the runtime uses
>    schar  one 8-bit code unit (ASCII or UTF-8)
>    wchar  one 16-bit code unit (same as before)
>    dchar  one 32-bit code unit (same as before)

We have that already:

    ubyte  one code unit in whatever encoding the runtime uses
    char   one 8-bit code unit (ASCII or UTF-8)

There is no support in Phobos for runtime/native encodings, but you can use the "iconv" library to do such conversions.

> (using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)

All ASCII characters are valid UTF-8 code units, so it's OK.

--anders
Oct 01 2006
Anders F Björklund wrote:
> BCS wrote:
>> One alternative that I could live with would use 4 character types: char, schar, wchar, dchar [...]
> We have that already:
>    ubyte  one code unit in whatever encoding the runtime uses
>    char   one 8-bit code unit (ASCII or UTF-8)

ubyte is an 8-bit unsigned number, not a character encoding.

[after some more reading] I may be just rambling, but... how about having the type of a value denote the encoding? One type for ASCII would only ever store ASCII (UTF-8 is invalid in it); the same for UTF-8, UTF-16 and UTF-32. Direct assignment would be illegal (as with, say, int[] -> Object) or implicitly converted (as with int -> real). Casts would be provided. Indexing would be by code point. Non-array variables would be big enough to store any code point (ASCII -> 8-bit, non-ASCII -> 32-bit). Some sort of "whatever the system uses" data type (a la C's int) could be used for actual output, maybe even escaping anything that won't get displayed correctly.

This all sort of follows the idea of "call it what it is and don't hide the overhead". 1) Characters are a different type of data than numbers (see the threads on bool), and as such that should be reflected in the type system. 2) I have no problem with high-overhead operations as long as I can avoid using them when I don't want to.

> All ASCII characters are valid UTF-8 code units, so it's OK.

But UTF-8 is not ASCII.
Oct 01 2006
BCS wrote:
> I may be just rambling but... how about have the type of the value denote the encoding. One for ASCII would only ever store ASCII (UTF-8 is invalid)

Then all Americans would use that instead of UTF-8. This is natural, since first you code for yourself, later maybe for your boss, etc. And you'd only become aware of any problems when a Latino tries to use his own name José, talk about Motörhead, or Anaïs the fragrance. And the mail and newsreader you wrote in D simply would not work. Guess whether anybody would heed the warning "Only use this new ASCII encoding when you are absolutely positive the program will never encounter a single foreign sentence or letter".

So, better not.

---

D's current setup and documentation encourage this kind of suggestion, and I don't blame you. Things being like they are, a programmer who wants to write a crossword puzzle generator would of course begin with:

    char[20][20] theGrid;

It's a shame that an otherwise so excellent language (plus the wording in its docs) downright leads you to do this. The guy naturally assumes that, D being a "UTF-8" language, this would work even in Chinese. (Hey,

    char[] foo = "José Motörhead from the band Anaïs is on stage!";

works, so why wouldn't theGrid?) Poor guy. I can't blame anyone for then wanting to stay within ASCII for the rest of D's life.
Oct 01 2006
BCS wrote:
> ubyte is an 8 bit unsigned number not a character encoding.

Right, I actually meant ubyte[], but void[] might have been more accurate for representing any (even non-UTF) encoding. (I used ubyte[] in my mapping functions, since they only used legacy 8-bit encodings like "cp1252" or "macroman".)

Re-reading your post, it seems to me that you were more talking about doing an alias for the UTF type most suitable for the OS? I guess UTF-8 would be a good choice if the operating system doesn't use Unicode, since then it'll have to do lookups anyway. Otherwise the existing "wchar_t" isn't bad for such a UTF type; it will be UTF-16 on Windows and UTF-32 on Unix (Linux, Darwin, ...).

> But UTF-8 is not ASCII.

So you would like a char "type" that would only take ASCII? I guess that is *one* way of dealing with it; you could also have a wchar type that wouldn't accept surrogates (BMP only). Then it would be OK to index them by code unit / character... (since each allowed character would fit into one code unit). Sounds a little like signed vs. unsigned integers, actually? Then again, 5 character types is even worse than the 3 we have now.

--anders
Oct 01 2006
Anders F Björklund wrote:
[...]
> Then again, 5 character types is even worse than the 3 now.

The more I think about it, the worse this gets. What I really would like is a system that allows O(1) operations on strings (slice out chars 7 to 27), allows a somewhat compact encoding (8-bit), and allows safe operations on UTF (if I do something dumb, it complains). All at the same time would be nice, but is not needed.

Come to think about it, a lib that does good, FAST conversion between buffers:

    // note: "in" is intentional, it won't allocate anything
    UTF8to16(in char[], in wchar[]);
    UTF8to32(in char[], in dchar[]);
    UTF16to32(in wchar[], in dchar[]);
    ...

would get most of what I want.

<sarcasm> And while I'm at it, I'd like a million bucks please. </sarcasm>
Oct 02 2006