
digitalmars.D - OT - scanf in Java

reply Arcane Jill <Arcane_member pathlink.com> writes:
I realize that this is not a Java forum, but I'm trying to get a feel for how D
compares to other things. I want to know, how does one get a line of input from
the console in Java? I've written some insignificant amount of code in Java in
the past, but none of it ever needed to get a line of input from the console.

Here's the reference program - written in C++. Just as an exercise, I'm
comparing this with other languages. (I know it's not really a fair test, but
what the hell?)

    [C++ listing not preserved in the archive]

I can translate that into most other languages easily - but for Java, I'm stuck.
How would you do this? Especially, how would you do this without using any
deprecated functions?

(D doesn't do very well at this one, incidentally, but that's just a temporary
phase. Things will obviously get better when stream support improves and we get
a native-D scanf replacement, both of which, I gather, are underway).

Arcane Jill
Jul 26 2004
next sibling parent Andy Friesen <andy ikagames.com> writes:
Arcane Jill wrote:

 I realize that this is not a Java forum, but I'm trying to get a feel for how D
 compares to other things. I want to know, how does one get a line of input from
 the console in Java? I've written some insignificant amount of code in Java in
 the past, but none of it ever needed to get a line of input from the console.
 
 Here's the reference program - written in C++. Just as an exercise, I'm
 comparing this with other languages. (I know it's not really a fair test, but
 what the hell?)
 
 
 I can translate that into most other languages easily - but for Java, I'm stuck.
 How would you do this? Especially, how would you do this without using any
 deprecated functions?
 
 (D doesn't do very well at this one, incidentally, but that's just a temporary
 phase. Things will obviously get better when stream support improves and we get
 a native-D scanf replacement, both of which, I gather, are underway).
It's been quite a while, but I think it goes something like this:

    import java.io.*;

    public class TheMainClass {
        public static void main(String[] args) throws IOException {
            // readLine() throws the checked IOException, hence the throws clause
            InputStreamReader isr = new InputStreamReader(System.in);
            BufferedReader br = new BufferedReader(isr);
            String s = br.readLine();
            System.out.println(s);  // echo the line back
        }
    }

Beating Java on this one isn't very hard. :)

 -- andy
Jul 26 2004
prev sibling next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 I realize that this is not a Java forum, but I'm trying to get a feel for how D
 compares to other things. I want to know, how does one get a line of input from
 the console in Java? I've written some insignificant amount of code in Java in
 the past, but none of it ever needed to get a line of input from the console.
 
 Here's the reference program - written in C++. Just as an exercise, I'm
 comparing this with other languages. (I know it's not really a fair test, but
 what the hell?)
 
 
 I can translate that into most other languages easily - but for Java, I'm stuck.
 How would you do this? Especially, how would you do this without using any
 deprecated functions?
 
 (D doesn't do very well at this one, incidentally, but that's just a temporary
 phase. Things will obviously get better when stream support improves and we get
 a native-D scanf replacement, both of which, I gather, are underway).
You have some options. In Java 1.5 there is a new Scanner class - since I haven't played with it much, I will have to stick with the older methods. The System class holds references to stdin and stdout as System.in and System.out respectively. Using System.in, you can wrap it in whatever input streams/readers you need to parse the input as expected.
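For reference, a minimal sketch of the Java 1.5 Scanner approach Berin mentions (the class name EchoLine is made up; Scanner and its nextLine() are the real 1.5 API):

    import java.util.Scanner;

    public class EchoLine {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);  // wraps stdin
            String line = in.nextLine();          // read one line
            System.out.println(line);             // echo it back
        }
    }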
Jul 26 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
 I realize that this is not a Java forum, but I'm trying to get a feel for how D
 compares to other things. I want to know, how does one get a line of input from
 the console in Java? I've written some insignificant amount of code in Java in
 the past, but none of it ever needed to get a line of input from the console.
Andy Friesen's code solves the problem, but since you are comparing languages I would suggest you take a stroll through the java.io classes to get a feel for how the IO library works. It is really well done in my opinion and a shining example of what really well done OO code looks like. The Java API specs are available from http://java.sun.com/j2se/1.4.2/docs/api/

The abstract class InputStream defines a small number of fundamental operations that conceptually define a Stream. The _only_ abstract function is

    abstract int read();

This is the only function a subclass needs to define to have the whole InputStream repertoire available. Some InputStream subclasses provide data from sources like files, socket connections and in-memory data structures:

    FileInputStream in java.io
    Socket.getInputStream() in java.net
    ByteArrayInputStream in java.io

The rest of the functions have meaningful default behavior. So reading bytes into an array in the general case is handled in InputStream by:

    int read(byte[] b, int off, int len)    (not abstract!)

Interesting intermediate behavior is obtained by passing an InputStream subclass to other InputStreams. Interesting intermediate behavior includes:

    Buffering            in java.io.Buffered___Stream
    En/De-cryption       in javax.crypto.Cipher___Stream
    Compression          in java.util.zip.Deflater___Stream
    Decompression        in java.util.zip.Inflater___Stream
    Digesting (eg CRC32) in java.util.zip.Checked___Stream

Interesting final behavior (ie the reason you opened the stream to begin with...) includes:

    Read/Write general Data in java.io.Data___Stream
    Read/Write Objects      in java.io.Object___Stream
    Read/Write zip files    in java.util.zip.Zip___Stream

The end result is mixing and matching Streams to suit your needs. Say you want to read something in. You only need to answer 3 questions:

    1) From where? File, Socket, Data Structure, etc...
    2) How? Buffered, Encrypted, Digested, etc...
    3) What kind? Data, Object, etc...

Say you want to read in 1) from a File 2) compressed, digested and buffered 3) Data. That would be:

    DataInputStream input = new DataInputStream(
        new CheckedInputStream(
            new InflaterInputStream(
                new BufferedInputStream(
                    new FileInputStream("filename.ext")
                )
            ),
            new CRC32()
        )
    );

In this case input will buffer, then decompress, then digest anything you read from filename.ext.

An item to pay particular attention to is the Object___Stream. When combined with a socket's streams you can send Objects to (or read Objects from) a TCP connection. If you write a java.lang.Runnable object to an ObjectOutputStream which then sends it to a server, the server can cast the object read to Runnable and then start a new thread which calls the object's run() method. Thus it is possible to start a Server and leave it running, then later write new code and send it (code the Server has never seen before - provided the receiving JVM can load the class, e.g. from a shared codebase).
Jul 26 2004
next sibling parent reply Berin Loritsch <bloritsch d-haven.org> writes:
parabolis wrote:

 Arcane Jill wrote:
 
 I realize that this is not a Java forum, but I'm trying to get a feel 
 for how D
 compares to other things. I want to know, how does one get a line of 
 input from
 the console in Java? I've written some insignificant amount of code in 
 Java in
 the past, but none of it ever needed to get a line of input from the 
 console.
Andy Friesen's code solves the problem, but since you are comparing languages I would suggest you take a stroll through the java.io classes to get a feel for how the IO library works. It is really well done in my opinion and a shining example of what really well done OO code looks like. The Java API specs are available from http://java.sun.com/j2se/1.4.2/docs/api/
Another place to look, if you want to see how they are planning on improving things is here: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html http://java.sun.com/j2se/1.5.0/docs/api/
Jul 26 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce3o9u$1enc$1 digitaldaemon.com>, Berin Loritsch says...

Another place to look, if you want to see how they are planning on 
improving things is here:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
Hey, cool. They can parse non-Latin digits: "A non-ASCII character c for which Character.isDigit(c) returns true". So, Arabic digits, Bengali digits, no problem.

Not sure how they'd cope with Osmanya digits though - these have codepoints U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.

We'll have this in D eventually, but we won't stop at wchars.

Arcane Jill
Jul 26 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <ce3pln$1fds$1 digitaldaemon.com>, Arcane Jill says...
In article <ce3o9u$1enc$1 digitaldaemon.com>, Berin Loritsch says...

Another place to look, if you want to see how they are planning on 
improving things is here:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
Hey, cool. They can parse non-Latin digits: "A non-ASCII character c for which Character.isDigit(c) returns true". So, Arabic digits, Bengali digits, no problem. Not sure how they'd cope with Osmanya digits though - these have codepoints U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char. We'll have this in D eventually, but we won't stop at wchars.
I've been wondering about this. readf (was scanf) still uses some lame shortcuts like "x - '0'" but that wouldn't be too terribly hard to fix. I don't suppose the unicode isdigit function currently supports these numbering schemes? Also, is it reasonable to assume that every numbering scheme is base 10? I'd certainly think so, but I suppose it's worth asking. Sean
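For what it's worth, Java's Character.digit already handles the non-Latin case that the "x - '0'" shortcut gets wrong. A small illustration (the class name DigitDemo is made up; U+09EB is BENGALI DIGIT FIVE):

    public class DigitDemo {
        public static void main(String[] args) {
            char five = '\u09EB';  // BENGALI DIGIT FIVE
            System.out.println(five - '0');                // 2491: the ASCII shortcut fails
            System.out.println(Character.digit(five, 10)); // 5: Unicode-aware lookup
        }
    }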
Jul 26 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce3qsf$1fqt$1 digitaldaemon.com>, Sean Kelly says...

I've been wondering about this.  readf (was scanf) still uses some lame
shortcuts like "x - '0'" but that wouldn't be too terribly hard to fix.  I don't
suppose the unicode isdigit function currently supports these numbering schemes?
The function getDecimalDigit(dchar) in etc.unicode returns the numeric value in the range 0 to 9 of all Unicode decimal digits. It returns -1 for all non-digits. You can find source code for this function in Deimos on dsource. Temporarily, there is no prebuilt library, but the source code works fine. When I get back to writing code, my very next task will be to tidy up etc.unicode, release the codebuilder code, etc.. Right now I'm still taking a few weeks off coding because I'm still a bit blown away by my gran's death, so, for now, you'll just have to put up with me ranting on this forum without actually /doing/ anything - but I imagine I'll get back onto the task in hand in maybe a couple of weeks or so. There is also a similar function, getDigit(dchar), which is similarly defined, except that it also considers things like SUPERSCRIPT TWO and CIRCLED THREE to be "digits". I imagine, therefore, that for readf(), getDecimalDigit() would be more appropriate than getDigit().
Also, is it reasonable to assume that every numbering scheme is base 10?  I'd
certainly think so, but I suppose it's worth asking.
As far as Unicode is concerned, yes. As far as reality is concerned, no. In the Tamil script, for example, they use base twelve. Unicode simply cannot comprehend this, and (erroneously) declares Tamil digits 0 to 9 to be "decimal". However - for our purposes, /this doesn't matter/. Our job is to implement the Unicode standard, even if it's wrong. Fixing the Unicode code charts is a job for the Unicode Consortium, and that may happen in some future release. For now - as Walter said - we put metaphorical blinkers on and go with what the standard says. For hexadecimal, there's the function getHexValue(), which returns a value in the range 0 to 15 for hex digits, -1 otherwise. (It's possible I may not have implemented that yet, or that I implemented it inefficiently. When I get back to D-coding, I'll fix this). Jill
Jul 26 2004
parent Sean Kelly <sean f4.ca> writes:
Arcane Jill wrote:
 In article <ce3qsf$1fqt$1 digitaldaemon.com>, Sean Kelly says...
 
Also, is it reasonable to assume that every numbering scheme is base 10?  I'd
certainly think so, but I suppose it's worth asking.
As far as Unicode is concerned, yes. As far as reality is concerned, no. In the Tamil script, for example, they use base twelve. Unicode simply cannot comprehend this, and (erroneously) declares Tamil digits 0 to 9 to be "decimal". However - for our purposes, /this doesn't matter/. Our job is to implement the Unicode standard, even if it's wrong. Fixing the Unicode code charts is a job for the Unicode Consortium, and that may happen in some future release. For now - as Walter said - we put metaphorical blinkers on and go with what the standard says.
Makes sense. The scanf spec that I was working off of makes no concession for a base 12 numbering scheme anyway. And I hesitate to add it as it would confuse things.
 For hexadecimal, there's the function getHexValue(), which returns a value in
 the range 0 to 15 for hex digits, -1 otherwise. (It's possible I may not have
 implemented that yet, or that I implemented it inefficiently. When I get back to D-coding, I'll fix this).
Perfect. I'll just use this function for everything. It will simplify the code a bit anyway. Sean
Jul 27 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ce3o9u$1enc$1 digitaldaemon.com>, Berin Loritsch says...
 
 
Another place to look, if you want to see how they are planning on 
improving things is here:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
Hey, cool. They can parse non-Latin digits: "A non-ASCII character c for which Character.isDigit(c) returns true". So, Arabic digits, Bengali digits, no problem.
I am not surprised. I expect one of the areas of programming language development in the near future to be the inclusion of non-ASCII names for things like classes and variables. Just imagine trying to program with class names in Hiragana. I don't think support would be difficult. However, I am still trying to fathom the depths of that Unicode beast.
 
 Not sure how they'd cope with Osmanya digits though - these have codepoints
 U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.
 
(been working on Unicode stuff so I know this...) I believe they would cope using an escape sequence (surrogate pairs). Which I suppose means the String.length() function lies sometimes. From java.nio.Charset ================================ The native coded character set of the Java programming language is that of the first seventeen planes of the Unicode version 3.0 character set; that is, it consists in the basic multilingual plane (BMP) of Unicode version 1 plus the next sixteen planes of Unicode version 3. This is because the language's internal representation of characters uses the UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode the other planes ================================
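A quick Java illustration of how String.length() "lies" for characters outside the BMP (the class name SurrogateDemo is made up; U+104A0 is OSMANYA DIGIT ZERO, written below as its surrogate pair):

    public class SurrogateDemo {
        public static void main(String[] args) {
            String osmanyaZero = "\uD801\uDCA0";       // one character: U+104A0
            System.out.println(osmanyaZero.length());  // prints 2 - code units, not characters
        }
    }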
Jul 26 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce4ec5$1moq$2 digitaldaemon.com>, parabolis says...


 "A non-ASCII character c for which Character.isDigit(c) returns true"
 
 Not sure how they'd cope with Osmanya digits though - these have codepoints
 U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.
 
(been working on Unicode stuff so I know this...) I believe they would cope using an escape sequence (surrogate pairs). Which I suppose means the String.length() function lies sometimes.
Yes, Java uses UTF-16 (which is what you meant by "escape sequence" or "surrogate pairs"). However, that doesn't change the definition above: "A non-ASCII character c for which Character.isDigit(c) returns true". The function Character.isDigit(c) takes a Java char as its parameter, not a UTF-16 sequence.

It doesn't matter for me, though, as I don't use Java, and I intend for D to do better.

Jill
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 
 It doesn't matter for me, though, as I don't use Java, and I intend for D to do
 better.
 
In my opinion D is off to a really bad start with Unicode.
Jul 27 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce68g6$2h7r$1 digitaldaemon.com>, parabolis says...
Arcane Jill wrote:

 
 It doesn't matter for me, though, as I don't use Java, and I intend for D to do
 better.
 
In my opinion D is off to a really bad start with Unicode.
The "start" hasn't even happened yet. What we have now isn't anything like what we're /going/ to have. There are /loads/ of (other) things that D doesn't have yet (like decent streams support), but most of these things are *in progress*. I'd say you made your call too early. Look at it like this. D has only been around for three or four years, and it was basically a one-person project. We're not even at version 1.0 yet, so the best is most definitely yet to come. Now Walter planned for good Unicode support from the start, and, with that in mind, he laid down the foundations, for example by insisting that D strings be Unicode. Those foundations are now being built upon. For example, the library etc.unicode (temporarily on hold for a few weeks due to a family death) currently gives you access to (almost) every Unicode character property. C doesn't give you this. C++ doesn't give you this. Even Java only gives you this for codepoints up to U+FFFF. D covers the lot - and that's /right now/. What's more, this library is robot-built from the actual Unicode database files, and so can be rebuilt with every new version of Unicode as it comes out, /and/ can be rebuilt for old versions of Unicode should that need arise. We're way ahead of Java there, which leaves you stuck with whatever version happens to come with your JVM. And as for the future - well, for stage 2 we've got the normalization, canonical and compatibility equivalence stuff all planned, grapheme boundary detection, full localized casing ... which I think will take us way ahead of Java. And meanwhile, there are guys working on strings and streams who are getting transcoding issues sussed. For stage three - and by this stage we'll be way ahead of the field - we'll have fuzzy matching, collation, and so on, all of which are locale-aware, plus full support for PUA properties. And meanwhile, there will be other guys working on other internationalization translation issues like number formatting and whatnot. I think you have made your judgement too early. Phobos is tiny right now, compared with Java's vast array of classes. Deimos is even tinier, and somewhat more piecemeal. But already D's Unicode support is: * Better than C * Better than C++ * Catching up with Java (and better in some areas) To expect the full whack right at the start is unrealistic (and we /are/ still right at the start). Walter was way too busy getting the core of the language together to start worrying about how you do uppercasing in Deseret*, but the language has now reached the point where we can do that. So tell me. Against what are you comparing D? Java? Tell me in what ways you think D is behind? Tell me what does better than D, and in what way? I suspect you may be hard pressed to come up with examples. Arcane Jill * something which Java can't do, but D can, right now.
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
Arcane Jill wrote:


It doesn't matter for me, though, as I don't use Java, and I intend for D to do
better.
In my opinion D is off to a really bad start with Unicode.
The "start" hasn't even happened yet. What we have now isn't anything like what we're /going/ to have. There are /loads/ of (other) things that D doesn't have yet (like decent streams support), but most of these things are *in progress*. I'd say you made your call too early.
Actually the start has happened. What I was referring to was that the conception of the string in D has seemingly been defined. The Object.toString method returns a UTF sequence. (I will explain further down...)
 
 Look at it like this. D has only been around for three or four years, and it was basically a one-person project. We're not even at version 1.0 yet, so the best
...
 
 And as for the future - well, for stage 2 we've got the normalization, canonical
...
 For stage three - and by this stage we'll be way ahead of the field - we'll have
... Please forgive me, but I have only started figuring out Unicode this week, so a good deal of D's planned implementation consists of features that I at best partially understand. I am happy to know there is a master plan, however. I applaud the robot builds.

I would be curious whether non-ASCII names will be supported (ie class and variable names, etc). I am also curious about whether it will be possible for a non-English speaker to use their language's version of D's reserved words (ie the Swedish word for synchronized, etc).
 
 I think you have made your judgement too early. Phobos is tiny right now,
 compared with Java's vast array of classes. Deimos is even tinier, and somewhat
 more piecemeal. But already D's Unicode support is:
 
 * Better than C
 * Better than C++
 * Catching up with Java (and better in some areas)
 
 To expect the full whack right at the start is unrealistic (and we /are/ still
 right at the start). Walter was way too busy getting the core of the language
 together to start worrying about how you do uppercasing in Deseret*, but the
 language has now reached the point where we can do that.
 
 So tell me. Against what are you comparing D? Java? Tell me in what ways you
 think D is behind? Tell me what does better than D, and in what way? I suspect
 you may be hard pressed to come up with examples.
 
Ok, before I explain what aspects of D's Unicode implementation bother me, I feel, given the context of the thread, that I need to point out that I am not comparing D to any other language. I have used Java's Unicode-related documents only to clarify the diverse Unicode technical vocabulary.

As stated above (and in the 'Source level Java to D converter' thread) I do not agree with D's apparent conception of the string.

The D docs (in Arrays.Special Array Types.Strings) say this:

================================
Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters. String literals become just an easy way to write character arrays.
================================

I agree that a string is a sequence of characters. However, D's conception of a string seems to be a Unicode string, which is most decidedly NOT a sequence of characters. Unicode defines a Character in a sensible fashion:

================================
(from http://www.unicode.org/glossary/)

Character. (1) The smallest component of written language that has semantic value; ...

Unicode String. A code unit sequence ...
================================

What D calls characters are in fact code units. A char, for example, is an 8-bit code unit that may in special cases represent a Character. Of course the type name 'char' was strongly suggested by C compatibility, so the misnomer was not wanton. However, conceptually confusing a String of Characters with a Unicode String (of code units) led to what I consider a fairly glaring omission in even the most basic or unfinished library:

The most basic of String operations, length-query and substring, are not supported. It is clearly possible to count the code units with char[].length, and it is possible to slice the code units with char[i..j]. But no predefined operations actually indicate how many Characters a char[] (or wchar[]) actually contains. To put it another way:

    char[].length != <String>.length
       char[i..k] != <String>.substring(i,j)
       char[].sort  (just amusing to consider)

It may seem like I am being overly pedantic, but I came to D without any knowledge of Unicode. It took me days to finally figure out that when anything D-related says 'string' it actually means something different from the intuitive notion of a string, the formal notion of a string (assuming an alphabet must consist of Characters), and Unicode's technical definition.

I would strongly suggest adding a String class to Phobos which implements a String of Characters, and reserving the term string for that class alone. Hence, if a String class is written, then Object.toString() should return a String reference.

Of course a String class is not the only valid solution, and I do not have enough experience with D or Unicode to suggest that it would be the best. I certainly would not suggest that it should be done because that is how it was done in Java...

With that said, I do have doubts that a feasible solution exists without implementing a String class. Non-class methods would have to parse UTF once for each length and substring call, whereas a proper class implementation can do both in constant time (see the implementation suggestion below if in doubt).

I am under the impression that while Unicode has entities that require more than 16 bits to represent, it has been said that such 32-bit examples will be "vanishingly rare". Thus 16 bits is the normal case, with occasional use of 32-bit entities.
Optimizing the most frequent case suggests using an internal representation of wchar[] for the 16-bit entities, and another sparse wchar array for the cases in which a wchar is too small. The query-length function is obviously constant time in all cases. However, so is the substring operation - thanks to copy-on-write.

I might also suggest considering making String an interface and implementing it in three separate classes (or more):

    1) String8
    2) String16
    3) String32

(these are horrible names, sorry)

The interface implementation of course has the benefit that anybody who wants to tune a String class to work for them can either subclass an existing String class or write their own implementation (without inheriting superclass stuff) and still have the class recognized by Object.toString() and Exception.this(String).
Jul 27 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
parabolis wrote:
 
 What D calls characters are in fact code units. A char for example is an 
 8-bit code unit that may in special cases represent a Character. Of 
 course the type name 'char' was
 strongly suggested for C-compatibility so the misnomer was not wanton.
This is a multifaceted issue. D supports UTF-8, UTF-16, and UTF-32 representations, stored in arrays of char, wchar, and dchar, respectively. While char strings are technically UTF-8, there is a 1-1 correspondence between characters and bytes so long as the values are within the range of the ASCII character set. And in the case of dchars, there (as far as I know) is always a 1-1 correspondence between D characters and Unicode characters.
 However conceptually confusing a String of Characters with a Unicode 
 String (of code units) led to what I consider a fairly glaring omission 
 in even the most basic or unfinished library:
 
 The most basic of String operations length-querry and substring are not 
 supported. It is clearly possible to count the code units with 
 _char[].length and it is possible to slice the code units with 
 char[i..j]. But no predefined operations actually indicate how many 
 Characters a char[] (or wchar[]) actually contains. To put it another way:
 
     char[].length != <String>.length
        char[i..k] != <String>.substring(i,j)
        char[].sort  (just amusing to consider)
Good point. However C++ has this exact same issue with its string class. Perhaps the problem is one of semantics. While C++ merely claims that its strings are an ordered sequence of bytes, the D documentation suggests that these bytes are in a specific encoding format (though the language does not require this).
 I may seem like I am being overly pedantic but I came to D without any 
 knowledge of Unicode. It took me days to finally figure out that when 
 anything D related says 'string' it actually means something different 
 from the intuitive notion of a string, the formal notion of a string 
 (assuming an alphabet must consist of Characters) and Unicode's 
 technical definition.
Part of this has come about because we've been actively discussing internationalization recently, so much of what's said about strings is done so in that context. I'm only passingly familiar with many of the details of Unicode as well, but I do believe that there is room in the language for both definitions of "string."
 I would strongly suggest adding a String class to phobos which 
 implements a String of Characters and reserve the term string to that 
 class alone. Hence if a String class is written then Object.toString() 
 should return a String reference.
True enough. I agree that if a sequence of characters is to be printed then it must be properly encoded. Whether the internal representation is properly encoded, however, isn't much of an issue to me, so long as there is a clear means of producing the encoded string when output is desired.
 With that said I do have doubts that a feasible solution exists without 
 implementing a String class. Non-class methods would have to parse UTF 
 once for each length and substring call whereas a proper class 
 implementation can do it in constant time (see implementation suggestion 
 below if in doubt).
True enough. At the very least, we need some method of determining "true" string length, i.e. how many representable characters a string contains. I have a feeling that there is a Unicode function for this, but I could not tell you its name. Frankly, I suspect that we will begin to use dchar arrays more and more often to avoid the trouble that dealing with multibyte encodings causes.
 I might also suggest considering making String an interface and 
 implementing it in 3 seperate classes (or more):
     1) String8
     2) String16
     3) String32
 (these are horrible names, sorry)
I'm not sure if there's one in the DTL, but it might be worth waiting to see. Assuming there is, I suspect that the signature would be along the lines of:

    class String(CharT) {...}

so

    String!(char);
    String!(wchar);
    String!(dchar);

Sean
Jul 27 2004
parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 parabolis wrote:
 
 What D calls characters are in fact code units. A char for example is 
 an 8-bit code unit that may in special cases represent a Character. Of 
 course the type name 'char' was
 strongly suggested for C-compatibility so the misnomer was not wanton.
This is a multifaceted issue. D supports UTF-8, UTF-16, and UTF-32 representations, stored in arrays of char, wchar, and dchar,
Yes Unicode calls them code units instead of characters because they do not always represent a character.
 respectively.  While char strings are technically UTF-8, there is a 1-1 
 correspondence between characters and bytes so long as the values are 
 within the range of the ASCII character set.  And in the case of dchars, 
 there (as far as I know) is always a 1-1 correspondence between D 
 characters and Unicode characters.
Yes that would be the special case in which a char actually holds sufficient code units to be interpreted as a Character.
 
 However conceptually confusing a String of Characters with a Unicode 
 String (of code units) led to what I consider a fairly glaring 
 omission in even the most basic or unfinished library:

 The most basic of String operations length-querry and substring are 
 not supported. It is clearly possible to count the code units with 
 _char[].length and it is possible to slice the code units with 
 char[i..j]. But no predefined operations actually indicate how many 
 Characters a char[] (or wchar[]) actually contains. To put it another 
 way:

     char[].length != <String>.length
        char[i..k] != <String>.substring(i,j)
        char[].sort  (just amusing to consider)
Good point. However C++ has this exact same issue with its string class. Perhaps the problem is one of semantics. While C++ merely claims that its strings are an ordered sequence of bytes, the D documentation suggests that these bytes are in a specific encoding format (though the language does not require this).
If there were actually a string class I would not expect the above to hold. I simply meant that there is no way currently in D to find either of:

    <String>.length
    <String>.substring(i,j)

because only these are implemented:

    char[].length
    char[i..k]
 
 I may seem like I am being overly pedantic but I came to D without any 
 knowledge of Unicode. It took me days to finally figure out that when 
 anything D related says 'string' it actually means something different 
 from the intuitive notion of a string, the formal notion of a string 
 (assuming an alphabet must consist of Characters) and Unicode's 
 technical definition.
Part of this has come about because we've been actively discussing internationalization recently, so much of what's said about strings is done so in that context. I'm only passingly familiar with many of the details of Unicode as well, but I do believe that there is room in the language for both definitions of "string."
I think you may be missing my point. I am not suggesting eliminating "Unicode string" support for the sake of a 1:1 correspondence between a primitive type and character. I am saying that there is really only one definition of "string", and calling sequences of code units 'strings' does not fit any standard notion of a "string".
 
 With that said I do have doubts that a feasible solution exists 
 without implementing a String class. Non-class methods would have to 
 parse UTF once for each length and substring call whereas a proper 
 class implementation can do it in constant time (see implementation 
 suggestion below if in doubt).
True enough. At the very least, we need some method of determining "true" string length, i.e. how many representable characters a string contains. I have a feeling that there is a Unicode function for this, but I could not tell you its name. Frankly, I suspect that we will begin to use dchar arrays more and more often to avoid the trouble that dealing with multibyte encodings causes.
 I might also suggest considering making String an interface and 
 implementing it in 3 seperate classes (or more):
     1) String8
     2) String16
     3) String32
 (these are horrible names, sorry)
I'm not sure if there's one in the DTL, but it might be worth waiting to see. Assuming there is, I suspect that the signature would be along the lines of: class String(CharT) {...} so String!(char); String!(wchar); String!(dchar);
I think a templated version of String should also implement a String interface, because it would still allow other implementations to be used:

    interface String
    class StringT(CharT) : String {...}
Jul 28 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce6pke$2o2m$1 digitaldaemon.com>, parabolis says...
Actually the start has happened. What I was referring to was 
that the conception of the string in D has seemingly been 
defined. The Object.toString method returns a UTF sequence.
(I will explain further down...)

I would be curious whether non-ASCII names will be supported (ie 
classes and variable names, etc).
You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.
I am also curious about whether there it will be possible for a 
non-English speaker to use their language's version of D's 
reserverved words (ie Swedish word for synchronized, etc).
I'd be surprised if that were so. Syntax analysis happens /before/ semantic analysis, and syntax analysis needs to know all the reserved words. But again, I'm just guessing. Only Walter can be definitive.
Ok before I explain what aspects of D's Unicode implementation 
bother me I feel given the context of the thread that I need to 
point out that I am not comparing D to any other language. I 
have used Java's Unicode related documents only to clarify the 
diverse Unicode technical vocabulary.

As stated above (and in the 'Source level Java to D converter' 
thread) I do not agree with D's apparent conception of the string.

The D docs (in Arrays.Special Array Types.Strings) say this:
================================
Dynamic arrays in D suggest the obvious solution - a string is 
just a dynamic array of characters. String literals become just 
an easy way to write character arrays.
================================

I agree that a string is a sequence of characters. However D's 
conception of string seems to be a Unicode string which is most 
decidedly NOT a sequence of characters. Unicode defines a 
Character in a sensible fashion:

================================
(from http://www.unicode.org/glossary/)

Character. (1) The smallest component of written language that 
has semantic value; ...

Unicode String. A code unit sequence ...
================================

What D calls characters are in fact code units.
Correct.
A char for 
example is an 8-bit code unit that may in special cases 
represent a Character. Of course the type name 'char' was
strongly suggested for C-compatibility so the misnomer was not 
wanton.
I think it was also chosen for ASCII compatibility. It makes sense for Westerners. "hello world\n" has got twelve characters in it, as well as twelve code units. See - D is trying to educate people /gently/. If it had started out with the following as basic types:

    * codeunit   // UTF-8 code unit
    * wcodeunit  // UTF-16 code unit
    * dcodeunit  // UTF-32 code unit
    * char       // 32-bit wide character (same as dcodeunit)

then everything would have worked, but people who used mostly ASCII would likely go: Eh? And ASCII strings would be four times as long.
However conceptually confusing a String of Characters with a 
Unicode String (of code units) led to what I consider a fairly 
glaring omission in even the most basic or unfinished library:

The most basic of String operations length-querry and substring 
are not supported. It is clearly possible to count the code 
units with _char[].length and it is possible to slice the code 
units with char[i..j]. But no predefined operations actually 
indicate how many Characters a char[] (or wchar[]) actually 
contains. To put it another way:

     char[].length != <String>.length
        char[i..k] != <String>.substring(i,j)
That's only partially true. As noted above, it's the /names/ for things that are wrong, not that things are absent. If you pretend that "dchar" is the character type, rather than "char", then you /do/ get the behavior you desire. You /could/ simply pretend that char and wchar don't exist, if you really wanted.
        char[].sort  (just amusing to consider)
This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.
I may seem like I am being overly pedantic but I came to D 
without any knowledge of Unicode.
Most people do, and you're not being overly pedantic.
It took me days to finally 
figure out that when anything D related says 'string' it 
actually means something different from the intuitive notion of 
a string, the formal notion of a string (assuming an alphabet 
must consist of Characters) and Unicode's technical definition.
Mebe, but it's no different from a string in any other computer language. In /no/ language of which I am aware is a string an array of Unicode characters. In C and C++ on Windows, for example, a char is eight bits wide, and so /obviously/ can't store all Unicode characters. In fact, it's very hard for C source code to know the encoding of a C string, and everything will work fine only if everything sticks to the system default. This makes internationalization much harder.
I would strongly suggest adding a String class to phobos which 
implements a String of Characters and reserve the term string to 
that class alone. Hence if a String class is written then 
Object.toString() should return a String reference.
We D users can write a Unicode-aware String class (and I believe Hauke is doing that); we can publish it; we can even /suggest/ that it be moved into Phobos. But Walter is the only one who can approve/disapprove/implement that suggestion. Phobos is Walter's baby. Deimos is one place where we can put things in the meantime, but the tight integration that you suggest can only happen if everything is in the same place.

But I'm tempted to ask why? I mean, what's wrong with a char[] (UTF-8 sequence)? It's good enough for many purposes, especially for mostly-ASCII strings (which Object.toString() is likely to return), and you can always convert it to a String (pending such a class) if you want more functionality.
Of course a String class is not the only valid solution and I do 
not have enough experience with D or Unicode to suggest that it 
would be the best. I certainly would not suggest that it should 
be done because that is how it was done in Java...
True. And Java made the mistake of declaring a String class /final/. I found that damned annoying, as I couldn't extend it. If I wanted additional functionality not provided by Java String, I would have had to have written a brand new class from scratch, and even then it wouldn't have cast. I seriously hope D doesn't make /that/ mistake. However much functionality a String may provide, there's always going to be at least one user who wants /just one more function/.
I am under the impression that while Unicode has enities that 
require more than 16 bits to represent it has been said that 
such 32 bit examples will be "vanishingly rare". Thus 16 bits is 
the normal case with occasional use of 32 bit entities.
Depends what you want to do. As a musician, I've often wanted to use the musical characters U+1D100 to U+1D1DD. As a mathematician, I similarly would want to use the mathematical letters U+1D400 to U+1D7FF. Mystical types would probably like to use the tetragrams between U+1D306 and U+1D356. So you see, the characters beyond U+FFFF are not /all/ strange alphabets we've never heard of, and I certainly wouldn't call the desire to go beyond U+FFFF "vanishingly rare".
Optimizing the most frequent case suggests using an internal 
representation of wchar[] for the 16 bit entities
Makes sense
and another 
sparse wchar array for the cases in which a wchar is too small. 
I don't understand that. UTF-16 is better, from the point of view of most common case and memory usage.
The querry-length function is obviously constant time in all 
cases. However so is the substring operation - thanks to 
copy-on-write.
It's not /that/ hard to count characters in UTF-8 and UTF-16. In UTF-8, you only have to ignore code units between 0x80 and 0xBF, and in UTF-16 you only have to ignore code units between 0xDC00 and 0xDFFF. Count all the rest and you've got the number of characters. Nice thoughts though. Keep them coming. Jill
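A sketch of the counting rule Jill describes, in Java for concreteness (the class and method names are made up; the masks and ranges are straight from her description):

    public class CountDemo {
        // UTF-8: count every code unit except continuation bytes 0x80-0xBF.
        static int countUtf8(byte[] s) {
            int n = 0;
            for (int i = 0; i < s.length; i++)
                if ((s[i] & 0xC0) != 0x80) n++;
            return n;
        }

        // UTF-16: count every code unit except low surrogates 0xDC00-0xDFFF.
        static int countUtf16(char[] s) {
            int n = 0;
            for (int i = 0; i < s.length; i++)
                if (s[i] < 0xDC00 || s[i] > 0xDFFF) n++;
            return n;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(countUtf8("caf\u00E9".getBytes("UTF-8")));  // 4 characters, 5 bytes
            System.out.println(countUtf16("\uD801\uDCA0".toCharArray()));  // 1 character, 2 units
        }
    }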
Jul 28 2004
next sibling parent reply J C Calvarese <jcc7 cox.net> writes:
In article <ce7o9r$2uo$1 digitaldaemon.com>, Arcane Jill says...
In article <ce6pke$2o2m$1 digitaldaemon.com>, parabolis says...
Actually the start has happened. What I was referring to was 
that the conception of the string in D has seemingly been 
defined. The Object.toString method returns a UTF sequence.
(I will explain further down...)

I would be curious whether non-ASCII names will be supported (ie 
classes and variable names, etc).
You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.
Unless I don't understand the question (which is always a strong possibility), DMD already supports non-ASCII names for identifiers: "Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)" http://www.digitalmars.com/d/lex.html I've tested it before and it worked for me. jcc7
Jul 28 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8e8g$al2$1 digitaldaemon.com>, J C Calvarese says...

Unless I don't understand the question (which is always a strong possibility),
DMD already supports non-ASCII names for identifiers:

"Identifiers start with a letter, _, or unicode alpha, and are followed by any
number of letters, _, digits, or universal alphas. Universal alphas are as
defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)" 
So it does. Cool. I looked at that document (ISO/IEC 9899:1999(E) Appendix D). It describes a fixed list of identifier characters, which will never change with time (as opposed to up-to-date Unicode, which contains an ever-growing list, growing with each new version of Unicode). Anyway, I'm impressed. This is brilliant. Jill
Jul 28 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
J C Calvarese wrote:

 In article <ce7o9r$2uo$1 digitaldaemon.com>, Arcane Jill says...
 
In article <ce6pke$2o2m$1 digitaldaemon.com>, parabolis says...

Actually the start has happened. What I was referring to was 
that the conception of the string in D has seemingly been 
defined. The Object.toString method returns a UTF sequence.
(I will explain further down...)

I would be curious whether non-ASCII names will be supported (ie 
classes and variable names, etc).
You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.
Unless I don't understand the question (which is always a strong possibility), DMD already supports non-ASCII names for identifiers: "Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)" http://www.digitalmars.com/d/lex.html I've tested it before and it worked for me. jcc7
Wow I am impressed. That was really forward thinking.
Jul 28 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 
 You'll have to ask Walter that one. (I mean, you'll have to wait and see if
....
 analysis, and syntax analysis needs to know all the reserved words. But again,
 I'm just guessing. Only Walter can be definitive.
I will probably wait to see if he responds to me. As I have said before, I imagine that Unicode acceptance will make these issues solvable for compiler writers, and so languages in the near future will be developed with these aspects in mind.
A char for 
example is an 8-bit code unit that may in special cases 
represent a Character. Of course the type name 'char' was
strongly suggested for C-compatibility so the misnomer was not 
wanton.
I think it was also chosen for ASCII compatibility. It makes sense for Westerners. "hello world\n" has got twelve characters in it, as well as twelve code units. See - D is trying to educate people /gently/. If it had started out with the following as basic types: * codeunit // UTF-8 code unit * wcodeunit // UTF-16 code unit * dcodeunit // UTF-32 code unit * char // 32-bit wide character (same as dcodeunit) then everything would have worked, but people who used mostly ASCII would likely go: Eh? And ASCII strings would be four times as long.
Actually, this facet of UTF is exactly why I want to see the proper terms used for things. People who generally use ASCII can expect a char to represent a Character, and people who generally use the 16-bit subset of Unicode values can expect a wchar to represent a Character. Combine that with the fact that char seems to be short for Character, and it is obvious people will make the wrong overgeneralization that a char or wchar actually represents Characters.

The result is very subtle bugs when assuming char[].length or wchar[].length counts the /Characters/ in an array, or that char[i..k] slices the /Characters/ in an array.
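The slicing half of that pitfall is easy to demonstrate in Java, whose String.substring likewise indexes UTF-16 code units rather than Characters (the class name SliceDemo is made up; U+104A0 is a single supplementary character):

    public class SliceDemo {
        public static void main(String[] args) {
            String s = "a\uD801\uDCA0b";          // 3 characters, 4 code units
            String broken = s.substring(0, 2);    // "a" plus a lone high surrogate
            System.out.println(broken.length());  // 2 code units - but not 2 characters
        }
    }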
 
 
However conceptually confusing a String of Characters with a 
Unicode String (of code units) led to what I consider a fairly 
glaring omission in even the most basic or unfinished library:

The most basic of String operations length-querry and substring 
are not supported. It is clearly possible to count the code 
units with _char[].length and it is possible to slice the code 
units with char[i..j]. But no predefined operations actually 
indicate how many Characters a char[] (or wchar[]) actually 
contains. To put it another way:

    char[].length != <String>.length
       char[i..k] != <String>.substring(i,j)
That's only partially true. As noted above, it's the /names/ for things that are wrong, not that things are absent. If you pretend that "dchar" is the character type, rather than "char", then you /do/ get the behavior you desire. You /could/ simply pretend that char and wchar don't exist, if you really wanted.
But things are absent. D does not currently have the facility to do the two most fundamental String operations, and I really doubt they would be missing if it weren't for an oversight.
 
       char[].sort  (just amusing to consider)
This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.
I agree with you but you missed the multiple code units case, where sort has the nice property that it /destroys/ a valid encoding:

================================================================
char[] threeInOrderCharacters =
[
    0xE6,0x97,0xA5, // u+65E5
    0xE6,0x9C,0xAC, // u+672C
    0xE8,0xAA,0x9E, // u+8A9E
];

void main(char[][] argv)
{
    uint max = threeInOrderCharacters.length;
    threeInOrderCharacters.sort;
    for( uint i = 0; i < max; i++ )
    {
        printf( "%2X ", threeInOrderCharacters[i] );
    }
}
================================================================
Output:
97 9C 9E A5 AA AC E6 E6 E8
================================================================
 
I may seem like I am being overly pedantic but I came to D 
without any knowledge of Unicode.
Most people do, and you're not being overly pedantic.
lol, I know. I just don't want them to hate me.
 
It took me days to finally 
figure out that when anything D related says 'string' it 
actually means something different from the intuitive notion of 
a string, the formal notion of a string (assuming an alphabet 
must consist of Characters) and Unicode's technical definition.
Mebe, but it's no different from a string in any other computer language. In
C uses "string" the same way. If it did not then all the ctype.h functions would have to take pointers to char arrays to be able to answer the questions that they answer. The char type is so named because it was expected that a Character would be represented (wholly) by a char. So string.h was built assuming strlen gives the number of Characters. C++ char[] and wchar[] /arrays/ should not be confused with a String. See STL which defines String to work in a manner consistent with my Characters of Strings notion. Java obviously also uses "string" the same way... see java.lang.String. I /suspect/ that Objective C and ECMA-262 also define Strings in a similar manner.
 /no/ language of which I am aware is a string an array of Unicode characters.
In
 C and C++ on Windows, for example, a char is eight bits wide, and so
/obviously/
 can't store all Unicode characters. In fact, it's very hard for C source code
to
 know the encoding of a C string, and everything will work fine only if
 everything sticks to the system default. This makes internationalization much
 harder.
Perhaps I am being overly pedantic again, but consider u+0000 and u+0001. I believe calling the following Characters is acceptable:

    typedef bit tinyCharacter;
    bit[n] t_string = new bit[n];

Here I have an array which is also a String, since there is /always/ a 1:1 correspondence between elements and Characters.

Let me guess... You want Strings that support a larger subset of Unicode Characters? Well, fortunately C originally supported the Unicode range from u+0000 to u+007F with arrays of Characters.

Arrays of Java's char type do not make a string. Likewise, arrays of char or wchar in C++ do not make strings. Fortunately there are String classes to support the more trying requirement of supporting just the Unicode range from u+0000 to u+FFFF.
 
 
I would strongly suggest adding a String class to phobos which 
implements a String of Characters and reserve the term string to 
that class alone. Hence if a String class is written then 
Object.toString() should return a String reference.
We D users can write a Unicode aware String class (and I believe Hauke is doing that); we can publish it; we can even /suggest/ that it be moved into Phobos. But Walter is the only one who can approve/disapprove/implement that suggestion. Phobos is Walter's baby. Deimos is one place where we can put things in the meantime, but the tight integration that you suggest can only happen if everything is in the same place.
I am happy he controls entries as I am sure phobos' quality will be much improved as a result.
 
 But I'm tempted to ask why? I mean, what's wrong with a char[] (UTF-8
sequence)?
I believe I explain why farther below in my post. If that did not answer the question you are asking then please help me understand better what you want to know.
 
Of course a String class is not the only valid solution and I do 
not have enough experience with D or Unicode to suggest that it 
would be the best. I certainly would not suggest that it should 
be done because that is how it was done in Java...
True. And Java made the mistake of declaring a String class /final/. I found that damned annoying, as I couldn't extend it. If I wanted additional functionality not provided by Java String, I would have had to have written a brand new class from scratch, and even then it wouldn't have cast. I seriously hope D doesn't make /that/ mistake. However much functionality a String may provide, there's always going to be at least one user who wants /just one more function/.
I am under the impression that while Unicode has enities that 
require more than 16 bits to represent it has been said that 
such 32 bit examples will be "vanishingly rare". Thus 16 bits is 
the normal case with occasional use of 32 bit entities.
Depends what you want to do. As a musician, I've often wanted to use the musical characters U+1D100 to U+1D1DD. As a mathematician, I similarly would want to use the mathematical letters U+1D400 to U+1D7FF. Mystical types would probably like to use the tetragrams between U+1D306 and U+1D356. So you see, the characters beyond U+FFFF are not /all/ strange alphabets we've never heard of, and I certainly wouldn't call the desire to go beyond U+FFFF "vanishingly rare".
Actually the "vanishingly rare" from the Unicode documents meant the frequency with which they will be extedning beyond 32 bits. I wish I had a link so I could find it again... However I do still doubt the likelihood of ever seeing full sentence which contains exclusively (or perhaps even mostly) entities above u+FFFF. Perhaps a transcription in Linear B.
and another 
sparse wchar array for the cases in which a wchar is too small. 
I don't understand that. UTF-16 is better, from the point of view of most common case and memory usage.
I apologize, I should have made this much more clear:

================================================================
class String
{
    private wchar[] loBits;
    private SparseArray hiBits;
    // implementation here
}
================================================================

For every Character in the String there is an entry for that Character in loBits. So for length(), returning loBits.length will accurately indicate the number of Characters in the calling String object.

For any Unicode value from u+0000 to u+FFFF, that value is stored in loBits and hiBits remains unchanged. For values greater than u+FFFF, the lowest 16 bits are stored in loBits and the upper 16 bits are stored in hiBits.

Memory usage will be almost exactly the same as encoding with UTF-16. (Identical in big-O terms.)
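A rough Java rendering of that sketch, just to make the constant-time length() claim concrete (all names here are made up, and a HashMap stands in for the SparseArray):

    import java.util.HashMap;
    import java.util.Map;

    public class WideString {
        private final char[] loBits;  // low 16 bits: exactly one slot per Character
        private final Map<Integer, Integer> hiBits
                = new HashMap<Integer, Integer>();  // sparse: index -> high 16 bits

        public WideString(int[] codePoints) {
            loBits = new char[codePoints.length];
            for (int i = 0; i < codePoints.length; i++) {
                loBits[i] = (char) (codePoints[i] & 0xFFFF);
                int hi = codePoints[i] >>> 16;
                if (hi != 0) hiBits.put(i, hi);  // only rare supplementary characters
            }
        }

        public int length() { return loBits.length; }  // constant time, counts Characters

        public int codePointAt(int i) {                // recombine the two halves
            Integer hi = hiBits.get(i);
            return ((hi == null ? 0 : hi) << 16) | loBits[i];
        }

        public static void main(String[] args) {
            WideString s = new WideString(new int[] { 'a', 0x104A0, 'b' });
            System.out.println(s.length());                             // 3
            System.out.println(Integer.toHexString(s.codePointAt(1)));  // 104a0
        }
    }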
 
The query-length function is obviously constant time in all 
cases. However so is the substring operation - thanks to 
copy-on-write.
It's not /that/ hard to count characters in UTF-8 and UTF-16. In UTF-8, you only need to count the bytes which are not of the form 10xxxxxx (the continuation bytes).
It is not an issue of difficulty but rather efficiency. A String class can perform length and substring in constant time, whereas parsing UTF-16 will always require a loop - constant time is simply not possible. So to recap in big O terms:

1) The memory requirements of String are identical to UTF-16.
2) For length():
   2a) The time requirements of String are O(1)
   2b) The time requirements of UTF-16 are O(N)
3) For substring():
   3a) The time requirements of String are O(1)
   3b) The time requirements of UTF-16 are O(N)
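For example, here is a sketch (not anything from Phobos) of the loop that counting the Characters in a wchar[] requires:
================================================================
// O(N): count the Characters in a UTF-16 sequence
uint utf16Length(wchar[] s) {
    uint n = 0;
    for (uint i = 0; i < s.length; i++) {
        // 0xD800-0xDBFF is a high surrogate; it pairs with the
        // next code unit to encode one Character above u+FFFF
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            i++;
        n++;
    }
    return n;
}
================================================================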
Jul 28 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8hhq$c9o$1 digitaldaemon.com>, parabolis says...

The result is very subtle bugs when assuming char[].length or 
wchar[].length counts the /Characters/ in an array or that 
char[i..k] slices the /Character/ in an array.
I'm not disagreeing with you, but see my separate post on graphemes and glyphs and things. There are distinctions in Unicode which never existed in ASCII, so people are not used to them. In ASCII, every character was either a control or a grapheme. This correspondence no longer holds in Unicode, so even basing your strings on characters is not always the desirable thing to do.

What, for example, is (cast(dchar[]) "café")[3..4]? Or... (cast(dchar[]) "café").length? The answer depends on how your text editor composed the "é" when you wrote the source code. To paraphrase you, the result is very subtle bugs when assuming dchar[].length counts the /graphemes/ in an array or that dchar[i..k] slices the /graphemes/ in an array.

Of course "char" doesn't suggest "grapheme" in the same way that it suggests "character" - but in reality, most people don't know the difference (because there pretty much is no difference in ASCII).

So - like I say - I'm not disagreeing with you. But I don't see where you're going with this. I see the flaws in current support, and I think "We can fix that". Hence the planned future functionality. You see the same flaws, but you seem instead to be saying "ditch the char". But you know that's not going to happen. Have I misunderstood you?
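Incidentally, to make the café example above concrete - these are just the two standard Unicode spellings:
================================================================
// "café" with a precomposed e-acute: four Characters, four graphemes
dchar[] composed   = [ 'c', 'a', 'f', '\u00E9' ];

// "café" with a combining acute accent: five Characters, still
// four graphemes
dchar[] decomposed = [ 'c', 'a', 'f', 'e', '\u0301' ];

// composed.length == 4, but decomposed.length == 5
// decomposed[3..4] is a bare 'e'; the accent is left behind in [4..5]
================================================================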
       char[].sort  (just amusing to consider)
This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.
I agree with you but you missed the multiple code units case where sort has the nice property that it /destroys/ a valid encoding:
================================================================
     char threeInOrderCharacters[] = [
         0xE6,0x97,0xA5,    // u+65E5
         0xE6,0x9C,0xAC,    // u+672C
         0xE8,0xAA,0x9E,    // u+8A9E
     ];

     void main(char[][] argv) {
         uint max = threeInOrderCharacters.length;
         threeInOrderCharacters.sort;
         for( uint i = 0; i < max; i++ ) {
             printf( "%2X ", threeInOrderCharacters[i] );
         }
     }
================================================================
     Output:
     97 9C 9E A5 AA AC E6 E6 E8
================================================================
Yeah, my bad. I read that as char[][].sort. You're right that char[].sort will break the (conceptual) char[] invariant. You'll get a UTF conversion exception later on. I see what you're saying, but I'm sure that a string class will exist in the future. That it doesn't exist yet, to me, makes it just something to look forward to, not the end of the world.
See STL which defines String to work in a manner 
consistent with my Characters of Strings notion.
Now that's cheating. std::string (being a typedef for std::basic_string<char>) has the same concept of character as (char *). It's dependent on the source code encoding.
Java obviously also uses "string" the same way... see 
java.lang.String.
I just looked at it. Seems to be based on 16-bit wide Java chars to me. That smells of UTF-16, hence /not/ the 1-1 correspondence you suggest.
Perhaps I am being overly pedantic again but consider u+0000 and 
u+0001. I believe calling the following Characters is acceptable:

     typedef bit tinyCharacter;
Errm. Sort of. Really the only definition of "character" that makes sense is that a character is a member of some character set, so if you first defined a character set with two characters in it, then you could indeed encode such characters with one bit. But you can't just go around picking arbitrary subsets of existing character sets and representing them in fewer than the required number of bits.
Let me guess... You want Strings that support a larger subset of 
Unicode Characters?
Either we're talking Unicode or we're not. There are Unicode strings; there are Latin-1 strings; there are ASCII strings. I don't get the question.
Well fortunately C originally supported the Unicode range from 
u+0000 to u+007F with arrays of Characters.
If we're going to be /really/ pedantic here, it did not. It supported ASCII. The fact that there is a 1-1 correspondence between the codepoints of ASCII and the codepoints U+0000 to U+007F of Unicode was a design feature of Unicode, not a design feature of C. But really, you know - who cares? I mean, I see no point in this little tangent. I think we've drifted into the utterly trivial here, and I'm keen to move out of it.
However I do still doubt the likelihood of ever seeing full 
sentence which contains exclusively (or perhaps even mostly) 
entities above u+FFFF.
Depends what language you speak.
Jul 28 2004
next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 In article <ce8hhq$c9o$1 digitaldaemon.com>, parabolis says...
 
 
The result is very subtle bugs when assuming char[].length or 
wchar[].length counts the /Characters/ in an array or that 
char[i..k] slices the /Character/ in an array.
I'm not disagreeing with you, but see my separate post on graphemes and glyphs and things. There are distinctions in Unicode which never existed in ASCII, so people are not used to them. In ASCII, every character was either a control or a grapheme. This correspondence no longer holds in Unicode, so even basing your strings on characters is not always the desirable thing to do.

What, for example, is (cast(dchar[]) "café")[3..4]? Or... (cast(dchar[]) "café").length? The answer depends on how your text editor composed the "é" when you wrote the source code. To paraphrase you, the result is very subtle bugs when assuming dchar[].length counts the /graphemes/ in an array or that dchar[i..k] slices the /graphemes/ in an array.

Of course "char" doesn't suggest "grapheme" in the same way that it suggests "character" - but in reality, most people don't know the difference (because there pretty much is no difference in ASCII).

So - like I say - I'm not disagreeing with you. But I don't see where you're going with this. I see the flaws in current support, and I think "We can fix that". Hence the planned future functionality. You see the same flaws, but you seem instead to be saying "ditch the char". But you know that's not going to happen. Have I misunderstood you?
Perhaps I am missing something, but the general idea that I am used to operating with is a standard internal to the language. I.e. all strings are encoded UTF-32BE, but the IO should be able to translate the native string to whatever format is necessary/available. So the file (from your editor) might be written in UTF-8, but an encoding scheme on your IO stream would be able to convert it to UTF-32BE--which would be native for the language. I am only using it as an example.

I do the same thing with things bigger than strings myself. For example, I have a model that becomes the basis for decoupling the translation side and the usage side. It works very well.

As long as the library was consistent with its standard, wouldn't that work well for D?
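In D terms I mean something like this sketch - loadText() is just a made-up name, and I am assuming the file on disk holds UTF-8:
================================================================
import std.file;
import std.utf;

// translate at the IO boundary; past this point the program
// only ever sees UTF-32
dchar[] loadText(char[] filename) {
    char[] raw = cast(char[]) std.file.read(filename);  // file holds UTF-8
    return toUTF32(raw);
}
================================================================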
Jul 28 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

I am leaving everything you said there...


 
 Yeah, my bad. I read that as char[][].sort. You're right that char[].sort will
 break the (conceptual) char[] invariant. You'll get a UTF conversion exception
 later on.
Actually on the topic of UTF conversion exceptions... There really is no such thing according to the standard. Personally I also prefer failing fast, but I figured I would point out that it is non-standard behaviour.
 
 I see what you're saying, but I'm sure that a string class will exist in the
 future. That it doesn't exist yet, to me, makes it just something to look
 forward to, not the end of the world.
 
Back to the comment that started this discussion:
================================================
In my opinion D is off to a really bad start with Unicode.
================================================
And the reason for the comment:
================================================
I have only seen the phobos.std.string and the D docs which mistakenly say UTF implements Strings. I was not previously privy to D's Unicode plans. I saw what appeared to be a significant ambiguity between the doc's use of String and Unicode string, and that suggested a bad start.
================================================
I am much less pessimistic now that I know D will support intuitive Strings (and indeed a plethora at that - Character, Grapheme and Glyph Strings). I will be quite blown away if D actually manages to pull this off without requiring any knowledge of Unicode except where Unicode specific features are required.
 
 But really, you know - who cares? I mean, I see no point in this little
tangent.
 I think we've drifted into the utterly trivial here, and I'm keen to move out
of
 it. 
I think if it has any relevence it will come up in the newly started thread and is probably best dealt with there.
However I do still doubt the likelihood of ever seeing full 
sentence which contains exclusively (or perhaps even mostly) 
entities above u+FFFF.
Depends what language you speak.
I would be surprised to find that the Unicode Consortium has defined characters in that range that are used in a living language. I was kind of hoping you might have an example.
Jul 28 2004
prev sibling parent "Carlos Santander B." <carlos8294 msn.com> writes:
"parabolis" <parabolis softhome.net> escribió en el mensaje
news:ce8hhq$c9o$1 digitaldaemon.com
|
| I agree with you but you missed the multiple code units case
| where sort has the nice property that it /destroys/ a valid
| encoding:
| ================================================================
|      char threeInOrderCharacters[] = [
|          0xE6,0x97,0xA5,    // u+65E5
|          0xE6,0x9C,0xAC,    // u+672C
|          0xE8,0xAA,0x9E,    // u+8A9E
|      ];
|
|      void main(char[][] argv) {
|          uint max = threeInOrderCharacters.length;
|          threeInOrderCharacters.sort;
|          for( uint i = 0; i < max; i++ ) {
|              printf( "%2X ", threeInOrderCharacters[i] );
|          }
|      }
| ================================================================
|      Output:
|      97 9C 9E A5 AA AC E6 E6 E8
| ================================================================
|

Do what Jill said: use dchar.

/////////////////////////////
import std.utf;

char threeInOrderCharacters[] = [
    0xE6,0x97,0xA5,    // u+65E5
    0xE6,0x9C,0xAC,    // u+672C
    0xE8,0xAA,0x9E,    // u+8A9E
];

void main(char[][] argv) {
    dchar [] tIOC = toUTF32(threeInOrderCharacters);
    //uint max = threeInOrderCharacters.length;
    //threeInOrderCharacters.sort;
    tIOC.sort;
    char [] tIOC2 = toUTF8(tIOC);
    uint max = tIOC2.length;
    for( uint i = 0; i < max; i++ ) {
        //printf( "%2X ", threeInOrderCharacters[i] );
        printf( "%2X ", tIOC2[i] );
    }
}

/////////////////////////////

================================================================
     Output:
     E6 97 A5 E6 9C AC E8 AA 9E
================================================================

What you expected, right?
And, btw, using wchar gives the same result (and, of course, replacing toUTF32
by toUTF16).

-----------------------
Carlos Santander Bernal
Jul 28 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 The D docs (in Arrays.Special Array Types.Strings) say this:
 ================================
 Dynamic arrays in D suggest the obvious solution - a string is
 just a dynamic array of characters. String literals become just
 an easy way to write character arrays.
 ================================
 
 I agree that a string is a sequence of characters. However D's
 conception of string seems to be a Unicode string which is most
 decidedly NOT a sequence of characters. Unicode defines a
 Character in a sensible fashion:
 
 ================================
 (from http://www.unicode.org/glossary/)
 
 Character. (1) The smallest component of written language that
 has semantic value; ...
The section of the D doc that you quote is followed by an example and then "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format." Would it help to move those sentences to right after the one you quote instead of putting them after the example? That way users will see the UTF-8 right away and realize how Walter is using the word "character" and the type "char". Or maybe change the first sentence to "Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters in UTF-8, UTF-16 or UTF-32 format." Nipping in the bud any questions about what is meant by the word "character".
Jul 28 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
Or maybe change the first sentence to "Dynamic arrays in D suggest the
obvious solution - a string is just a dynamic array of characters in UTF-8,
UTF-16 or UTF-32 format." Nipping in the bud any questions about what is
meant by the word "character".
That works for me.
Jul 28 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Ben Hinkle wrote:

The D docs (in Arrays.Special Array Types.Strings) say this:
================================
Dynamic arrays in D suggest the obvious solution - a string is
just a dynamic array of characters. String literals become just
an easy way to write character arrays.
================================
...
 The section of the D doc that you quote is followed by an example and then
 "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format.
 dchar[] strings are in UTF-32 format."
 Would it help to move those sentences to right after the one you quote
 instead of putting it after the example? That way users will see that UTF-8
 and realize how Walter is using the words "character" and the type "char". 
No, I do not believe that would help. I think that would simply attempt to evade the issue: D has no types corresponding to Characters (and thus in fact effectively has no String support whatsoever), even though the docs also clearly state that D *wants* to provide string support.
 Or maybe change the first sentence to "Dynamic arrays in D suggest the
 obvious solution - a string is just a dynamic array of characters in UTF-8,
 UTF-16 or UTF-32 format." Nipping in the bud any questions about what is
 meant by the word "character".
But that is not true. A string is a sequence of Characters, so it is not at all an obvious solution to implement strings using arrays of encoded data in which a Character occupies anywhere from 1-4 code units and must be parsed according to the appropriate UTF standard to obtain Character data.
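For instance, even finding where one Character ends means decoding its lead byte - a sketch (stride() is just an illustrative name):
================================================================
// how many UTF-8 code units the Character starting at this byte spans
uint stride(char c) {
    if (c < 0x80)           return 1;  // u+0000 .. u+007F
    if ((c & 0xE0) == 0xC0) return 2;  // u+0080 .. u+07FF
    if ((c & 0xF0) == 0xE0) return 3;  // u+0800 .. u+FFFF
    if ((c & 0xF8) == 0xF0) return 4;  // u+10000 .. u+10FFFF
    throw new Exception("not the start of a UTF-8 sequence");
}
================================================================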
Jul 28 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8i48$civ$1 digitaldaemon.com>, parabolis says...

No I do not believe that would help. I think that would simply 
attempt to evade the issue that D has no types corresponding to 
Characters
except of course dchar
(and thus in fact effectively has no String support)
whatsoever
except of course dchar[]

Jill
Jul 28 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Berin Loritsch wrote:
 
 Another place to look, if you want to see how they are planning on 
 improving things is here:
 
 http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
 http://java.sun.com/j2se/1.5.0/docs/api/
Sadly the one thing I really want them to implement will never happen - unsigned primitive types.
Jul 26 2004
parent reply Berin Loritsch <bloritsch d-haven.org> writes:
parabolis wrote:
 Berin Loritsch wrote:
 
 Another place to look, if you want to see how they are planning on 
 improving things is here:

 http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
 http://java.sun.com/j2se/1.5.0/docs/api/
Sadly the one thing I really want them to implement will never happen - unsigned primitive types.
I think the main reason for that is the focus of Java. Java is designed for applications, while C/C++/D is designed to include systems development as well. I have not come across many instances where an unsigned primitive would be useful in the application space. And those few times where it does make a difference, the signed primitives can be used because there are no comparisons to be done.

In short, for the things Java is good at, I haven't run into the need for unsigned primitives myself.
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Berin Loritsch wrote:

 parabolis wrote:
 
 Sadly the one thing I really want them to implement will never happen 
 - unsigned primitive types.
I think the main reason for that is the focus of Java. Java is designed for applications, while C/C++/D is designed to include systems development as well. I have not come across many instances where an unsigned primitive would be useful in the application space. And those few times where it does make a difference, the signed primitives can be used because there are no comparisons to be done. In short, for the things Java is good at, I haven't run into the need for unsigned primitives myself.
================================================================
I thought I remembered reading that Java was originally designed for appliance microprocessors, but I could be wrong.

As for the unsigned primitive... Consider java.lang.String's:

     copyValueOf(char[] data, int offset, int count)

These types of functions are everywhere in the library code. Does a negative offset or count ever make sense? Almost never... So the first few lines of the code check to make sure the values are in fact non-negative. The same thing is true with reading and writing to arrays and other sequential data structures in general. Every read/write is checked to make sure the index is actually non-negative.

The reason it bothers me is that I almost never write any code using signed primitives in any other language. Being forced to declare function parameters as signed and then check that the values are not negative is a double whammy...

You probably wouldn't think about it, but does it really make sense to use a signed value in most of the for loops you write? It may seem like an odd question, but I tend to use unsigned by default and signed when I must. So when I see a for loop with a signed condition variable I wonder why someone would choose to do that.
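To illustrate in D (copyValueOf() here just mimics the Java signature - it is not a real function in either library):
================================================================
// signed parameters force the guard Java libraries are full of:
char[] copyValueOf(char[] data, int offset, int count) {
    if (offset < 0 || count < 0)
        throw new Exception("negative offset or count");
    return data[offset .. offset + count].dup;
}

// unsigned parameters make a negative offset or count impossible
// to even express; a wrapped-around value still trips the array
// bounds check:
char[] copyValueOf(char[] data, uint offset, uint count) {
    return data[offset .. offset + count].dup;
}
================================================================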
Jul 27 2004
next sibling parent reply Berin Loritsch <bloritsch d-haven.org> writes:
parabolis wrote:

 Berin Loritsch wrote:
 
 parabolis wrote:

 Sadly the one thing I really want them to implement will never happen 
 - unsigned primitive types.
I think the main reason for that is the focus of Java. Java is designed for applications, while C/C++/D is designed to include systems development as well. I have not come across many instances where an unsigned primitive would be useful in the application space. And those few times where it does make a difference, the signed primitives can be used because there are no comparisons to be done. In short, for the things Java is good at, I haven't run into the need for unsigned primitives myself.
================================================================
I thought I remembered reading that Java was originally designed for appliance microprocessors, but I could be wrong.
Ok, you are going pre-Sun involvement here...
 
 As for the unsigned primitive... Consider java.lang.String's:
 
<snip/>

Let me just say that it doesn't have a serious impact on day to day programming activities--even if it is not "ideologically pure". Most values used in day to day development fall well within the positive range of a signed value. Most folks don't even worry about whether it would be more efficient to use a byte or an int. We just use ints because the performance gains of using the smaller primitive are nowhere near the gains of improving the algorithm.

But that's just my experience (public projects I have worked on include Apache Avalon, Apache Cocoon, Apache JMeter, Apache Axis, and the D-Haven projects). I know this is a D forum, but I am including these to add weight to the argument that signed vs. unsigned arguments really don't impact most average programs all that much.

Does it affect some people? Sure. But the most common solution is either to ignore the sign or jump up to the next larger data size. It's no biggy.
Jul 27 2004
parent parabolis <parabolis softhome.net> writes:
Berin Loritsch wrote:
 parabolis wrote:
 
 I thought I remembered reading that Java was originally designed
 for appliance microprocessors, but I could be wrong.
Ok, you are going pre sun involvement here...
No, what I remember reading was that Sun wanted to dabble in 'smart' appliances... But this is a vague impression I have from an article I read 5+ years ago...
 
 As for the unsigned primitive... Consider java.lang.String's:
<snip/> Let me just say that it doesn't have a serious impact on day to day programming activities--even if it is not "ideologically pure". Most values used in day to day development fall well within the positive range of a signed value. Most folks don't even worry about whether it would be more efficient to use a byte or an int. We just use ints because the performance gains of using the smaller primitive are nowhere near the gains of improving the algorithm.
(Not that it matters but I believe using a 32 bit condition variable on a 32 bit machine is actually faster than a type with fewer bits...)
 But that's just my experience (public projects I have worked on include
 Apache Avalon, Apache Cocoon, Apache JMeter, Apache Axis, and the
 D-Haven projects). I know this is a D forum, but I am including these
 to add weight to the argument that signed vs. unsigned arguments really
 don't impact most average programs all that much.
I apologize if I seemed to be arguing that unsigned is inherently better. I was just trying to make the point that not only do I have to avoid using my default in Java, but I also have to guard against conditions that are a direct result of my not getting to use unsigned.

Perhaps a better way to make the point is to imagine a language which does not allow the use of integer types. So now fictional.lang.String has the function:

     copyValueOf(char[] data, float offset, float count)

And you have to write similar methods yourself and check to make sure the number is both integral and positive... This is an overstatement of my frustrations but I think it does illustrate what I mean.
Jul 27 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce6cl2$2j39$1 digitaldaemon.com>, parabolis says...

So when I see a for loop with a signed
condition variable I wonder why someone would choose to do that.
Well, here's one possible reason:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

is likely to be a few cycles faster than

     for (uint i=0; i<array.length; ++i) { /* blah */ }

(depending on how good the compiler is at optimizing - a black art about which I know nothing)

Jill
Jul 27 2004
next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 In article <ce6cl2$2j39$1 digitaldaemon.com>, parabolis says...
 
 
So when I see a for loop with a signed
condition variable I wonder why someone would choose to do that.
Well, here's one possible reason:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

is likely to be a few cycles faster than

     for (uint i=0; i<array.length; ++i) { /* blah */ }

(depending on how good the compiler is at optimizing - a black art about which I know nothing)
This will not hold true for all processor types, so it is generally better to code normally and trust the compiler to do the right optimization (if any). But that is a whole other topic (premature optimizations, etc.)
Jul 27 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ce6cl2$2j39$1 digitaldaemon.com>, parabolis says...
 
 
So when I see a for loop with a signed
condition variable I wonder why someone would choose to do that.
Well, here's one possible reason:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

is likely to be a few cycles faster than

     for (uint i=0; i<array.length; ++i) { /* blah */ }

(depending on how good the compiler is at optimizing - a black art about which I know nothing)

Jill
Yes, I would wonder why you wrote that and probably assume it was to save a few cycles... However more often I tend to see:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

which is not likely to be a few cycles faster than:

     for (uint i=0; i<array.length; ++i) { /* blah */ }

But the general case is this horrible version, which calls String.length() every iteration:

     for (int i=0; i<str.length(); i++) { /* blah */ }

Also you should consider that I am assuming you are pointing out that using 0 as a sentinel is faster than another number. Also don't forget that any speed benefit from using 0 as a sentinel is completely negated for a processor which does not implement unsigned addition identically to signed addition. And finally do not dismiss the unsigned alternative to your original suggestion:

     for( uint i = 0xFFFFFFFA; i != 0; i++ ) { /* blah */ }
Jul 27 2004
parent reply Sean Kelly <sean f4.ca> writes:
parabolis wrote:
 But the general case is this horrible version:

     for (int i=0; i<str.length(); i++) { /* blah */ }

   (call String.length() every iteration)
Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the unsigned version. String likely contains a size_t variable to represent string length and it would be trivial for a compiler to inline calls to the String.length() function. Unless you want to compare results on a per-instruction basis, I would likely not be too concerned with performance differences between the calls.
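That is, if String caches its length, length() should boil down to a single field load - a rough sketch:
================================================================
class String {
    private size_t len;               // updated when the data changes
    size_t length() { return len; }   // trivially inlined: one field load
}
================================================================

Sean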
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 parabolis wrote:
 
 But the general case is this horrible version:

     for (int i=0; i<str.length(); i++) { /* blah */ }

   (call String.length() every iteration)
Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the
Emphasis on the potential. That means that you must first be able to guarantee that any compiler that gets your code will optimize it before you should feel safe writing it.
Jul 28 2004
parent Sean Kelly <sean f4.ca> writes:
In article <ce8q1l$g6r$1 digitaldaemon.com>, parabolis says...
Sean Kelly wrote:

 parabolis wrote:
 
 But the general case is this horrible version:

     for (int i=0; i<str.length(); i++) { /* blah */ }

   (call String.length() every iteration)
Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the
Emphasis on the potential. That means that you must first be able to guarantee that any compiler that gets your code will optimize it before you should feel safe writing it.
Perhaps I've been spoiled by the latest set of C++ compilers. The tests I've seen with them tend to suggest that premature optimization often actually slows the resulting code down compared to what the optimizer can generate. But the above example was pretty straightforward. I would be surprised if any production-level D compiler didn't inline such calls.

Sean
Jul 28 2004
prev sibling parent reply Sha Chancellor <schancel pacific.net> writes:
In article <ce3k0a$1co1$1 digitaldaemon.com>,
 parabolis <parabolis softhome.net> wrote:

 DataInputStream input =
    new DataInputStream(
      new CheckedInputStream(
        new DeflaterInputStream(
          new BufferedInputStream(
            new FileInputStream("filename.ext")
          )
        )
      )
    );
Do they have a DecaffinateInputStream and a DefatInputStream class also by any chance? I heard they do.
Jul 26 2004
parent parabolis <parabolis softhome.net> writes:
Sha Chancellor wrote:

 In article <ce3k0a$1co1$1 digitaldaemon.com>,
  parabolis <parabolis softhome.net> wrote:
 
 
DataInputStream input =
   new DataInputStream(
     new CheckedInputStream(
       new DeflaterInputStream(
         new BufferedInputStream(
           new FileInputStream("filename.ext")
         )
       )
     )
   );
Do they have a DecaffinateInputStream and a DefatInputStream class also by any chance? I heard they do.
lol I have surely never used them myself. Maybe java.beans somewhere... :P
Jul 26 2004