digitalmars.D - Evolution (Hello World)

Anders F Björklund <afb algonet.se> writes:

Taking a quick look at "Hello World"
shows a remarkable language evolution...


 From C:
 #include <stdio.h>
 #include <stdlib.h>
 
 int main(void)
 {
   puts("Hello, World!");
   return EXIT_SUCCESS;
 }
 
 int main(int argc, char *argv[])
 {
   int i;
   for (i = 0; i < argc; i++)
     printf("%d %s\n", i, argv[i]);
   return EXIT_SUCCESS;
 }

To "old D":
 import std.c.stdio;
 import std.c.stdlib;
 
 int main()
 {
   puts("Hello, World!");
   return EXIT_SUCCESS;
 }
 
 int main(char[][] args)
 {
   for (int i = 0; i < args.length; i++)
     printf("%d %.*s\n", i, args[i]);
   return EXIT_SUCCESS;
 }

To "new D":
 import std.stdio;
 
 void main()
 {
   writeln("Hello, World!");
 }
 
 void main(str[] args)
 {
   foreach (int i, str a; args)
     writefln("%d %s", i, a);
 }


Where I took the liberty of adding a few of my own RFEs:
1) "void main" should return 0 back to the operating system
2) new std.stdio.writeln, a formatless version of writefln
3) the "str" alias for the char[] type, like "bool" for bit

Not too bad for the first five years, if I say so myself ?
Then again, I couldn't even use it here until GDC arrived...
And it was not even a year ago, that David Friedman did that.

--anders

Feb 09 2005

Sebastian Beschke <s.beschke gmx.de> writes:

Anders F Björklund wrote:
 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and 
dchar[]?

I think, considering that UTF-8 is not the optimal encoding in a huge 
number of cases, promoting it as "The One and Only String Type" would 
not be wise.

-Sebastian

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Sebastian Beschke wrote:

 3) the "str" alias for the char[] type, like "bool" for bit

 
 This is an obvious question, but what would you propose for wchar[] and 
 dchar[]?

My proposition was "ustr" for wchar[] (since it rhymes with "uint",
and is to be pronounced as Unicode string, as in Unicode-optimized)

There will be no alias for dchar[], as it is a silly type anyway.
dchar's are cool, but UTF-32 strings are just too much lossage...

 I think, considering that UTF-8 is not the optimal encoding in a huge 
 number of cases, promoting it as "The One and Only String Type" would 
 not be wise.

There is no "one and only one string type" in D, just as there
is no "one and only one boolean type". There are three of each,
and char[] is the preferred string type (what does "main" use ?)
so it gets to be the str. And bit is the type of "true" and "false"
so it gets to be the default bool type. If you want to speed up
or optimize your code, you can change to using wchar[] or wbool[]...
And when it is needed in a few places, you have dchar[] and dbool

The contents are exactly the same, just encoded differently -
all strings are in Unicode and all booleans are in Zero-is-False.

I just find this shortform to be easier on the eyes:

     void main(str[] args);

     str[str] dictionary;

UTF-8 has two major advantages: 1) it's optimized for ASCII and
does not require a BOM mark, making it compatible for files too
2) it is Endian agnostic, no more X86 vs PPC gruffs like the others

If you do a lot of Unicode, or non-Western languages, switch to
ustr instead? It's equally well supported in all D std libraries.
(the only downside of using str is that it's a little bigger/slower)

--anders

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

I wrote:

 3) the "str" alias for the char[] type, like "bool" for bit

 This is an obvious question, but what would you propose for wchar[] 
 and dchar[]?

 
 My proposition was "ustr" for wchar[] (since it rhymes with "uint",
 and is to be pronounced as Unicode string, as in Unicode-optimized)
 
 There will be no alias for dchar[], as it is a silly type anyway.
 dchar's are cool, but UTF-32 strings are just too much lossage...

Another possibility is wstr for wchar[] ("wide string")
and ustr for dchar[] ("Unicode string"), which might
perhaps work better and be a tad more logical too...

I like "str" better than "string", because it:
1) rhymes with int, char, bool and the others
2) is shorter to type, easily 50% saved
3) doesn't confuse anyone with C++ std::string

--anders

Feb 10 2005

"Matthew" <admin stlsoft.dot.dot.dot.dot.org> writes:

"Anders F Björklund" <afb algonet.se> wrote in message
news:cugd6f$2ptf$1 digitaldaemon.com...
I wrote:

 3) the "str" alias for the char[] type, like "bool" for bit

 This is an obvious question, but what would you propose for wchar[] 
 and dchar[]?

 My proposition was "ustr" for wchar[] (since it rhymes with "uint",
 and is to be pronounced as Unicode string, as in Unicode-optimized)

 There will be no alias for dchar[], as it is a silly type anyway.
 dchar's are cool, but UTF-32 strings are just too much lossage...

 Another possibility is wstr for wchar[] ("wide string")
 and ustr for dchar[] ("Unicode string"), which might
 perhaps work better and be a tad more logical too...

 I like "str" better than "string", because it:
 1) rhymes with int, char, bool and the others
 2) is shorter to type, easily 50% saved
 3) doesn't confuse anyone with C++ std::string

I don't think char[] should have an alias. Strings in D are slices, for 
very good reason, and it's good for that to be foremost in peoples' 
minds.

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Matthew wrote:

 I don't think char[] should have an alias. Strings in D are slices, for 
 very good reason, and it's good for that to be foremost in peoples' 
 minds.

I thought that strings in D were sliceable codepoint arrays,
but not necessarily slices always ? It's just an alias, the
type is still char[] ? (and wchar[] and dchar[], but anyway)

But I rethought and found "ustr" to be silly altogether...

alias  char[]   str; // ASCII-optimized
alias wchar[]  wstr; // Unicode-optimized
alias dchar[]  dstr; // codepoint-optimized

More orthogonal that way ? (with "char[]" = "str", always)

char[] by itself is actually not that bad, but this is:
	int main(char[][] args);
	char[][char[]] dictionary;

--anders

Feb 10 2005

"Matthew" <admin stlsoft.dot.dot.dot.dot.org> writes:

 char[] by itself is actually not that bad, but this is:
 int main(char[][] args);
 char[][char[]] dictionary;

Now you've got something of a point there. But, still, I'd prefer to 
leave it as char[]. The example you give is only 1-dim string / 2-dim 
char. What about higher dimensionality (of anything)? We could end up in 
the cow-dung of LPPPCSTR, etc.

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Matthew wrote:

 Now you've got something of a point there. But, still, I'd prefer to 
 leave it as char[]. The example you give is only 1-dim string / 2-dim 
 char. What about higher dimensionality (of anything)? We could end up in 
 the cow-dung of LPPPCSTR, etc. 

Ehrm, nooooo ? "The line must be drawn here". :-)

I just wanted some easier basics, for beginners ?
For the higher levels, you still need to learn
about bit and char[] and other behind-the-scenes.

It's just similar to the "alias foo* fooPtr;",
that seems to always enter the picture after
one has seen one too many stars fly by...

I'll (re)post my Grand Scheme of Std Aliases.

--anders

Feb 10 2005

"Matthew" <admin stlsoft.dot.dot.dot.dot.org> writes:

"Anders F Björklund" <afb algonet.se> wrote in message
news:cugfi7$2sll$2 digitaldaemon.com...
 Matthew wrote:

 Now you've got something of a point there. But, still, I'd prefer to 
 leave it as char[]. The example you give is only 1-dim string / 2-dim 
 char. What about higher dimensionality (of anything)? We could end up 
 in the cow-dung of LPPPCSTR, etc.

 Ehrm, nooooo ? "The line must be drawn here". :-)

 I just wanted some easier basics, for beginners ?
 For the higher levels, you still need to learn
 about bit and char[] and other behind-the-scenes.

I know. And I like your sentiment. It's just that I think that the
string-is-a-slice concept is so important and fundamental to D that an
alias is more likely to be a disservice in the medium/long term.

Feb 10 2005

Derek <derek psych.ward> writes:

On Thu, 10 Feb 2005 20:01:50 +0100, Anders F Björklund wrote:

 Sebastian Beschke wrote:
 
 3) the "str" alias for the char[] type, like "bool" for bit

 
 This is an obvious question, but what would you propose for wchar[] and 
 dchar[]?

 
 My proposition was "ustr" for wchar[] (since it rhymes with "uint",
 and is to be pronounced as Unicode string, as in Unicode-optimized)
 
 There will be no alias for dchar[], as it is a silly type anyway.
 dchar's are cool, but UTF-32 strings are just too much lossage...
 
 I think, considering that UTF-8 is not the optimal encoding in a huge 
 number of cases, promoting it as "The One and Only String Type" would 
 not be wise.

 
 There is no "one and only one string type" in D, just as there
 is no "one and only one boolean type". There are three of each,
 and char[] is the preferred string type (what does "main" use ?)
 so it gets to be the str. And bit is the type of "true" and "false"
 so it gets to be the default bool type. If you want to speed up
 or optimize your code, you can change to using wchar[] or wbool[]...
 And when it is needed in a few places, you have dchar[] and dbool
 
 The contents are exactly the same, just encoded differently -
 all strings are in Unicode and all booleans are in Zero-is-False.
 
 I just find this shortform to be easier on the eyes:
 
      void main(str[] args);
 
      str[str] dictionary;
 
 UTF-8 has two major advantages: 1) it's optimized for ASCII and
 does not require a BOM mark, making it compatible for files too
 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
 
 If you do a lot of Unicode, or non-Western languages, switch to
 ustr instead? It's equally well supported in all D std libraries.
 (the only downside of using str is that it's a little bigger/slower)

One cannot easily address individual code points using utf8. For example...

  char[] SomeText;

You cannot be sure that SomeText[5] addresses the beginning of a code point
or not. Remembering that code points in utf8 are variable length, but are
fixed length in utf32. 

So if using utf8, and one is doing some form of character manipulation, one
should first convert to utf32, do the work, then convert back to utf8.
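
The indexing problem described above can be sketched as follows. This is only an illustration, assuming a current D compiler and Phobos (where string literals are immutable and `std.utf.stride` is available); the helper name `startsCodePoint` and the sample text are made up for the example.

```d
import std.utf : stride;

// Hypothetical helper: true when byte offset i is the first code unit of a
// code point, i.e. not a UTF-8 continuation byte of the form 10xxxxxx.
bool startsCodePoint(const(char)[] text, size_t i)
{
    return (text[i] & 0xC0) != 0x80;
}

void main()
{
    // "Malm" followed by U+00F6 (o with diaeresis), which takes two code units.
    string someText = "Malm\u00F6";

    assert(someText.length == 6);           // 5 characters, 6 code units
    assert(startsCodePoint(someText, 4));   // first byte of U+00F6
    assert(!startsCodePoint(someText, 5));  // someText[5] is mid-character
    assert(stride(someText, 4) == 2);       // the code point at offset 4 spans 2 units
}
```

Here `someText[5]` is a valid array index but lands on the second byte of a two-byte sequence, which is exactly the uncertainty being described.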

-- 
Derek
Melbourne, Australia

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Derek wrote:

UTF-8 has two major advantages: 1) it's optimized for ASCII and
does not require a BOM mark, making it compatible for files too
2) it is Endian agnostic, no more X86 vs PPC gruffs like the others

If you do a lot of Unicode, or non-Western languages, switch to
ustr instead? It's equally well supported in all D std libraries.
(the only downside of using str is that it's a little bigger/slower)

 
 One cannot easily address individual code points using utf8. For example...
 
   char[] SomeText;
 
 You cannot be sure that SomeText[5] addresses the beginning of a code point
 or not. Remembering that code points in utf8 are variable length, but are
 fixed length in utf32. 

This is not that much of a problem, since you should not address
individual code points anyway but treat the code units as a string.

See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:

 Code-point boundaries, iteration, and indexing are very fast with
 UTF-32. Code-point boundaries, accessing code points at a given offset,
 and iteration involve a few extra machine instructions for UTF-16; UTF-8
 is a bit more cumbersome. Indexing is slow for both of them, but in
 practice indexing by different code units is done very rarely, except
 when communicating with specifications that use UTF-32 code units, such
 as XSL.
 
 This point about indexing is true unless an API for strings allows
 access only by code point offsets. This is a very inefficient design:
 strings should always allow indexing with code unit offsets.

But char[] works fine for ASCII and wchar[] works fine for Unicode,
*as long* as you watch out for any surrogates in the code units...

Which means you can have a fast standard route, and extra code
to handle the exceptional characters if and when they occur ?

 So if using utf8, and one is doing some form of character manipulation, one
 should first convert to utf32, do the work, then convert back to utf8.

Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
as D can transparently handle the transition between char[] and dchar...

There are also readily available functions in the std.utf module:
"encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

If you do a lot of loops like that, you can use a dchar[] (dstr alias) as
intermediate storage. But char[] and wchar[] are better for the long term.
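
A minimal sketch of the two routes mentioned above: letting foreach decode UTF-8 into dchar values, and converting wholesale with the std.utf wrappers. It assumes a current D compiler and Phobos rather than the 2005 toolchain; `countCodePoints` is a hypothetical helper name.

```d
import std.utf : toUTF8, toUTF32;

// Counts code points by letting foreach decode the UTF-8 code units
// into one dchar per iteration.
size_t countCodePoints(const(char)[] text)
{
    size_t n = 0;
    foreach (dchar c; text)
        ++n;
    return n;
}

void main()
{
    string someText = "Malm\u00F6";          // U+00F6 takes two UTF-8 code units

    assert(someText.length == 6);            // code units
    assert(countCodePoints(someText) == 5);  // code points

    dstring wide = toUTF32(someText);        // fixed-width intermediate storage
    assert(wide.length == 5);
    assert(toUTF8(wide) == someText);        // round trip back to UTF-8
}
```

The foreach form is read-only with respect to the original char[]; it decodes on the fly without allocating a dchar[] copy.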


--anders

Feb 10 2005

Derek Parnell <derek psych.ward> writes:

On Thu, 10 Feb 2005 22:47:18 +0100, Anders F Björklund wrote:

 Derek wrote:
 
UTF-8 has two major advantages: 1) it's optimized for ASCII and
does not require a BOM mark, making it compatible for files too
2) it is Endian agnostic, no more X86 vs PPC gruffs like the others

If you do a lot of Unicode, or non-Western languages, switch to
ustr instead? It's equally well supported in all D std libraries.
(the only downside of using str is that it's a little bigger/slower)

 
 One cannot easily address individual code points using utf8. For example...
 
   char[] SomeText;
 
 You cannot be sure that SomeText[5] addresses the beginning of a code point
 or not. Remembering that code points in utf8 are variable length, but are
 fixed length in utf32. 

 
 This is not that much of a problem, since you should not address
 individual code points anyway but treat the code units as a string.

I obviously do a very different sort of programming from you. I often need
to look at individual code points (i.e. characters) in a string.

 See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:
 
 Code-point boundaries, iteration, and indexing are very fast with
 UTF-32. Code-point boundaries, accessing code points at a given offset,
 and iteration involve a few extra machine instructions for UTF-16; UTF-8
 is a bit more cumbersome. Indexing is slow for both of them, but in
 practice indexing by different code units is done very rarely, except
 when communicating with specifications that use UTF-32 code units, such
 as XSL.
 
 This point about indexing is true unless an API for strings allows
 access only by code point offsets. This is a very inefficient design:
 strings should always allow indexing with code unit offsets.


Yes, and a simple index into a char[] doesn't do this for you.
 
 But char[] works fine for ASCII and wchar[] works fine for Unicode,
 *as long* as you watch out for any surrogates in the code units...
 
 Which means you can have a fast standard route, and extra code
 to handle the exceptional characters if and when they occur ?

'exceptional' to whom? To latin-based alphabet users maybe, but not the
great majority of the world's population.

 So if using utf8, and one is doing some form of character manipulation, one
 should first convert to utf32, do the work, then convert back to utf8.

 
 Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
 as D can transparently handle the transition between char[] and dchar...

Except for "character manipulation" as 'foreach(inout dchar c; SomeText)'
is not permitted.

 There are also readily available functions in the std.utf module:
 "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

Exactly my point. One needs to use these if *manipulating* characters in a
utf8 or utf16 string. 

 If you do a lot of loops like that, you can use a dchar[] (dstr alias) as
 intermediate storage. But char[] and wchar[] are better for the long term.

'long term' meaning ??? Disk storage? RAM storage? Or until we finally get
rid of all those silly 'alphabets' out there ;-)

-- 
Derek
Melbourne, Australia
11/02/2005 9:38:27 AM

Feb 10 2005

Anders F Björklund <afb algonet.se> writes:

Derek Parnell wrote:

But char[] works fine for ASCII and wchar[] works fine for Unicode,
*as long* as you watch out for any surrogates in the code units...

Which means you can have a fast standard route, and extra code
to handle the exceptional characters if and when they occur ?

 
 'exceptional' to whom? To latin-based alphabet users maybe, but not the
 great majority of the world's population.

No, but it's mine ;-) (the ignorant westerner that I am)

Seriously, in my own language, Swedish, about 10% of the text
is non-ASCII, which means that Walter's optimized US ASCII parts
run for 90% of the time. I assume this is the same for the
rest of the previously ISO-8859-X using Western world languages...

Had I been using another alphabet, like Japanese or Chinese,
then UTF-16 would have been a nice bet. Surrogate characters do not
occur very often; in fact they were only just now introduced
in Java 1.5, since the original 16 bits of Unicode "overflowed".
So I think there's a 90-10 rule here too, with non-Surrogates.
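
To make the surrogate case concrete, here is a small sketch, again assuming a current D compiler and Phobos. The helper names `isSurrogateUnit` and `countCodePoints16` are hypothetical; the sample uses U+1D11E, a code point outside the original 16-bit range.

```d
import std.utf : toUTF16;

// Hypothetical helper: true when a UTF-16 code unit is half of a surrogate pair.
bool isSurrogateUnit(wchar u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}

// Counts code points in UTF-16 text; foreach with dchar joins surrogate pairs.
size_t countCodePoints16(const(wchar)[] text)
{
    size_t n = 0;
    foreach (dchar c; text)
        ++n;
    return n;
}

void main()
{
    wstring w = toUTF16("a\U0001D11E");   // 'a' plus one code point beyond U+FFFF

    assert(w.length == 3);                // 1 unit for 'a', 2 for the pair
    assert(!isSurrogateUnit(w[0]));
    assert(isSurrogateUnit(w[1]) && isSurrogateUnit(w[2]));
    assert(countCodePoints16(w) == 2);
}
```

Text with no surrogate units can take the fast path of one wchar per character; the extra handling is needed only when `isSurrogateUnit` fires.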

So I do think talking about "exceptions" is warranted ?

Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
as D can transparently handle the transition between char[] and dchar...

 
 Except for "character manipulation" as 'foreach(inout dchar c; SomeText)'
 is not permitted.

We are talking Copy-on-Write here, yes ? As in reading from readonly
and writing to readwrite ? Otherwise you could use dchar[] instead,
and do a simple indexing. (or a foreach(inout dchar c; SomeText) on it)

And convert from UTF-8/UTF-16 on the way in, do all the processing
on the UTF-32 internal array, and convert back to UTF-8/UTF-16 on
the way out. (most routines now do include a dchar[] interface too,
you can even use dchar[] in switch/case statements - if you like)
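
The convert-in, process, convert-out pattern can be sketched like this. It is an assumption-laden example rather than anything from the thread: it targets a current D compiler, where the foreach keyword for in-place modification is `ref` instead of the `inout` spelling used above, and `replaceCodePoint` is a hypothetical helper.

```d
import std.utf : toUTF8, toUTF32;

// Replaces every occurrence of one code point with another by editing a
// mutable UTF-32 copy in place, then re-encoding the result as UTF-8.
string replaceCodePoint(const(char)[] text, dchar target, dchar replacement)
{
    dchar[] buffer = toUTF32(text).dup;   // decode on the way in
    foreach (ref dchar c; buffer)         // fixed width, so in-place edits are safe
        if (c == target)
            c = replacement;
    return toUTF8(buffer);                // encode on the way out
}

void main()
{
    // Replacing a two-byte character with a one-byte one changes the UTF-8
    // length, which is why editing the char[] in place would not work.
    assert(replaceCodePoint("Malm\u00F6", '\u00F6', 'o') == "Malmo");
    assert(replaceCodePoint("abc", 'b', '\u00E5').length == 4);
}
```

Because the replacement can change the encoded length, the UTF-8 result is a new string rather than a modification of the input.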

If you do a lot of loops like that, you can use a dchar[] (dstr alias) as
intermediate storage. But char[] and wchar[] are better for the long term.

 
 'long term' meaning ??? Disk storage? RAM storage? Or until we finally get
 rid of all those silly 'alphabets' out there ;-)

Storage. Even with all "silly alphabets" utilized, there are still 11
dead bits in each UTF-32 character. UTF-16 is bound to be more efficient.
Unless you are doing extinct languages research or something? :-)
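
The storage argument can be checked with a few lines of D. The "11 dead bits" follow from code points topping out at U+10FFFF, which fits in 21 of the 32 bits. The sketch below assumes a current D compiler and Phobos; the `bytesAs...` helpers and the sample strings are made up for illustration.

```d
import std.stdio : writefln;
import std.utf : toUTF16, toUTF32;

// Bytes needed to store the same text in each Unicode encoding form.
size_t bytesAsUTF8(const(char)[] text)
{
    return text.length * char.sizeof;
}

size_t bytesAsUTF16(const(char)[] text)
{
    return toUTF16(text).length * wchar.sizeof;
}

size_t bytesAsUTF32(const(char)[] text)
{
    return toUTF32(text).length * dchar.sizeof;
}

void main()
{
    // ASCII, Latin-1 range (U+00F6), and three CJK code points.
    foreach (sample; ["Hello, World!", "Malm\u00F6", "\u65E5\u672C\u8A9E"])
        writefln("%s: UTF-8 %d, UTF-16 %d, UTF-32 %d bytes", sample,
                 bytesAsUTF8(sample), bytesAsUTF16(sample), bytesAsUTF32(sample));
}
```

For the ASCII and Swedish samples UTF-8 is smallest; for the CJK sample UTF-16 is smallest; UTF-32 is largest in all three.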

It's not just me... See http://www.unicode.org/faq/utf_bom.html#UTF32

--anders

Feb 10 2005

James McComb <ned jamesmccomb.id.au> writes:

Anders F Björklund wrote:

 Had I been using another alphabet, like Japanese or Chinese,
 then UTF-16 would have been a nice bet. Surrogate characters do not
 occur very often; in fact they were only just now introduced
 in Java 1.5, since the original 16 bits of Unicode "overflowed".
 So I think there's a 90-10 rule here too, with non-Surrogates.

Does anyone here know if Japanese and Chinese use a lot of ASCII 
punctuation? If they do, then maybe UTF-8 is reasonable.

James McComb

Feb 10 2005

"Walter" <newshound digitalmars.com> writes:

"Derek Parnell" <derek psych.ward> wrote in message
news:uk7573l4ag4s.fkp4buj0rl0e.dlg 40tude.net...
 This is not that much of a problem, since you should not address
 individual code points anyway but treat the code units as a string.

 I obviously do a very different sort of programming from you. I often need
 to look at individual code points (i.e. characters) in a string.

Take a look at the functions std.utf.stride, std.utf.toUCSindex, and
std.utf.toUTFindex. They provide the basic building blocks to manipulate
UTF-8 strings as if they were an array of UCS characters.
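
A short sketch of how those three building blocks fit together, assuming the versions of `std.utf.stride`, `std.utf.toUCSindex`, `std.utf.toUTFindex` and `std.utf.decode` found in current Phobos; `codePointAt` is a hypothetical helper, not a library function.

```d
import std.utf : decode, stride, toUCSindex, toUTFindex;

// Hypothetical helper: returns the n-th code point (zero-based) of UTF-8 text.
dchar codePointAt(const(char)[] text, size_t n)
{
    size_t i = toUTFindex(text, n);   // code point index -> code unit index
    return decode(text, i);           // decodes at i (and advances i past it)
}

void main()
{
    string text = "Malm\u00F6!";      // U+00F6 occupies code units 4 and 5

    assert(stride(text, 4) == 2);          // length in code units of the character at 4
    assert(toUTFindex(text, 5) == 6);      // 6th character starts at code unit 6
    assert(toUCSindex(text, 6) == 5);      // and code unit 6 is character index 5
    assert(codePointAt(text, 4) == '\u00F6');
    assert(codePointAt(text, 5) == '!');
}
```

Note that `toUTFindex` has to walk the string from the start, so character-indexed access is linear in the index rather than constant time.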

Feb 11 2005
