
digitalmars.D - OT - scanf in Java

reply Arcane Jill <Arcane_member pathlink.com> writes:
I realize that this is not a Java forum, but I'm trying to get a feel for how D
compares to other things. I want to know, how does one get a line of input from
the console in Java? I've written some insignificant amount of code in Java in
the past, but none of it ever needed to get a line of input from the console.

Here's the reference program - written in C++. Just as an exercise, I'm
comparing this with other languages. (I know it's not really a fair test, but
what the hell?)

    [C++ listing not preserved in the archive]

I can translate that into most other languages easily - but for Java, I'm stuck.
How would you do this? Especially, how would you do this without using any
deprecated functions?

(D doesn't do very well at this one, incidentally, but that's just a temporary
phase. Things will obviously get better when stream support improves and we get
a native-D scanf replacement, both of which, I gather, are underway).

Arcane Jill
Jul 26 2004
next sibling parent Andy Friesen <andy ikagames.com> writes:
Arcane Jill wrote:

 I realize that this is not a Java forum, but I'm trying to get a feel for how D
 compares to other things. I want to know, how does one get a line of input from
 the console in Java? I've written some insignificant amount of code in Java in
 the past, but none of it ever needed to get a line of input from the console.
 
 Here's the reference program - written in C++. Just as an exercise, I'm
 comparing this with other languages. (I know it's not really a fair test, but
 what the hell?)
 
 
 I can translate that into most other languages easily - but for Java, I'm stuck.
 How would you do this? Especially, how would you do this without using any
 deprecated functions?
 
 (D doesn't do very well at this one, incidentally, but that's just a temporary
 phase. Things will obviously get better when stream support improves and we get
 a native-D scanf replacement, both of which, I gather, are underway).
It's been quite a while, but I think it goes something like this:

    import java.io.*;

    public class TheMainClass {
        public static void main(String[] args) throws IOException {
            // readLine() throws the checked IOException, hence the throws clause
            InputStreamReader isr = new InputStreamReader(System.in);
            BufferedReader br = new BufferedReader(isr);
            String s = br.readLine();
            System.out.println(s);  // echo the line back
        }
    }

Beating Java on this one isn't very hard. :)

 -- andy
Jul 26 2004
prev sibling next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 I realize that this is not a Java forum, but I'm trying to get a feel for how D
 compares to other things. I want to know, how does one get a line of input from
 the console in Java? I've written some insignificant amount of code in Java in
 the past, but none of it ever needed to get a line of input from the console.
 
 Here's the reference program - written in C++. Just as an exercise, I'm
 comparing this with other languages. (I know it's not really a fair test, but
 what the hell?)
 
 
 I can translate that into most other languages easily - but for Java, I'm stuck.
 How would you do this? Especially, how would you do this without using any
 deprecated functions?
 
 (D doesn't do very well at this one, incidentally, but that's just a temporary
 phase. Things will obviously get better when stream support improves and we get
 a native-D scanf replacement, both of which, I gather, are underway).
You have some options. In Java 1.5 there is a new Scanner class - since I haven't played with it much, I will have to stick with the older methods. The System class holds references to stdin and stdout as System.in and System.out respectively. Using System.in, you can wrap it in whatever input streams/readers you need to parse the input as expected.
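For reference, a minimal sketch of the Java 1.5 Scanner approach Berin mentions (the class name EchoLine is made up; Scanner and its nextLine() are the real 1.5 API):

    import java.util.Scanner;

    public class EchoLine {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);  // wraps stdin
            String line = in.nextLine();          // read one line
            System.out.println(line);             // echo it back
        }
    }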
Jul 26 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
 I realize that this is not a Java forum, but I'm trying to get a feel for how D
 compares to other things. I want to know, how does one get a line of input from
 the console in Java? I've written some insignificant amount of code in Java in
 the past, but none of it ever needed to get a line of input from the console.
Andy Friesen's code solves the problem, but since you are comparing languages I would suggest you take a stroll through the java.io classes to get a feel for how the IO library works. It is really well done in my opinion and a shining example of what really well done OO code looks like. The Java API specs are available from http://java.sun.com/j2se/1.4.2/docs/api/

The abstract class InputStream defines a small number of fundamental operations that conceptually define a Stream. The _only_ abstract function is

    abstract int read();

This is the only function a subclass needs to define to have the whole InputStream repertoire available. Some InputStream subclasses provide data from sources like files, socket connections and in-memory data structures:

    FileInputStream in java.io
    Socket.getInputStream() in java.net
    ByteArrayInputStream in java.io

The rest of the functions have meaningful default behavior. So reading bytes into an array in the general case is handled in InputStream by:

    int read(byte[] b, int off, int len)    (not abstract!)

Interesting intermediate behavior is obtained by passing an InputStream subclass to other InputStreams. Interesting intermediate behavior includes:

    Buffering            in java.io.Buffered___Stream
    En/De-cryption       in javax.crypto.Cipher___Stream
    Compression          in java.util.zip.Deflater___Stream
    Decompression        in java.util.zip.Inflater___Stream
    Digesting (eg CRC32) in java.util.zip.Checked___Stream

Interesting final behavior (ie the reason you opened the stream to begin with...) includes:

    Read/Write general Data in java.io.Data___Stream
    Read/Write Objects      in java.io.Object___Stream
    Read/Write zip files    in java.util.zip.Zip___Stream

The end result is mixing and matching Streams to suit your needs. Say you want to read something in. You only need to answer 3 questions:

    1) From where? File, Socket, Data Structure, etc...
    2) How? Buffered, Encrypted, Digested, etc...
    3) What kind? Data, Object, etc...

Say you want to read in 1) from a File 2) compressed, digested and buffered 3) Data. That would be:

    DataInputStream input = new DataInputStream(
        new CheckedInputStream(
            new InflaterInputStream(
                new BufferedInputStream(
                    new FileInputStream("filename.ext")
                )
            ),
            new CRC32()
        )
    );

In this case input will buffer, then decompress, then digest anything you read from filename.ext.

An item to pay particular attention to is the Object___Stream. When combined with a socket's streams you can send Objects to (or read Objects from) a TCP connection. If you write a java.lang.Runnable object to an ObjectOutputStream which then sends it to a server, the server can cast the object read to Runnable and then start a new thread which calls the object's run() method. Thus it is possible to start a Server and leave it running, then later write new code and send it (code the Server has never seen before - provided the receiving JVM can load the class, e.g. from a shared codebase).
Jul 26 2004
next sibling parent reply Berin Loritsch <bloritsch d-haven.org> writes:
parabolis wrote:

 Arcane Jill wrote:
 
 I realize that this is not a Java forum, but I'm trying to get a feel 
 for how D
 compares to other things. I want to know, how does one get a line of 
 input from
 the console in Java? I've written some insignificant amount of code in 
 Java in
 the past, but none of it ever needed to get a line of input from the 
 console.
Andy Friesen's code solves the problem, but since you are comparing languages I would suggest you take a stroll through the java.io classes to get a feel for how the IO library works. It is really well done in my opinion and a shining example of what really well done OO code looks like. The Java API specs are available from http://java.sun.com/j2se/1.4.2/docs/api/
Another place to look, if you want to see how they are planning on improving things is here: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html http://java.sun.com/j2se/1.5.0/docs/api/
Jul 26 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce3o9u$1enc$1 digitaldaemon.com>, Berin Loritsch says...

Another place to look, if you want to see how they are planning on 
improving things is here:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
Hey, cool. They can parse non-Latin digits: "A non-ASCII character c for which Character.isDigit(c) returns true". So, Arabic digits, Bengali digits, no problem.

Not sure how they'd cope with Osmanya digits though - these have codepoints U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.

We'll have this in D eventually, but we won't stop at wchars.

Arcane Jill
Jul 26 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <ce3pln$1fds$1 digitaldaemon.com>, Arcane Jill says...
In article <ce3o9u$1enc$1 digitaldaemon.com>, Berin Loritsch says...

Another place to look, if you want to see how they are planning on 
improving things is here:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
Hey, cool. They can parse non-Latin digits: "A non-ASCII character c for which Character.isDigit(c) returns true". So, Arabic digits, Bengali digits, no problem. Not sure how they'd cope with Osmanya digits though - these have codepoints U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char. We'll have this in D eventually, but we won't stop at wchars.
I've been wondering about this. readf (was scanf) still uses some lame shortcuts like "x - '0'" but that wouldn't be too terribly hard to fix. I don't suppose the unicode isdigit function currently supports these numbering schemes? Also, is it reasonable to assume that every numbering scheme is base 10? I'd certainly think so, but I suppose it's worth asking. Sean
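For what it's worth, Java's Character.digit already handles the non-Latin case that the "x - '0'" shortcut gets wrong. A small illustration (the class name DigitDemo is made up; U+09EB is BENGALI DIGIT FIVE):

    public class DigitDemo {
        public static void main(String[] args) {
            char five = '\u09EB';  // BENGALI DIGIT FIVE
            System.out.println(five - '0');                // 2491: the ASCII shortcut fails
            System.out.println(Character.digit(five, 10)); // 5: Unicode-aware lookup
        }
    }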
Jul 26 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce3qsf$1fqt$1 digitaldaemon.com>, Sean Kelly says...

I've been wondering about this.  readf (was scanf) still uses some lame
shortcuts like "x - '0'" but that wouldn't be too terribly hard to fix.  I don't
suppose the unicode isdigit function currently supports these numbering schemes?
The function getDecimalDigit(dchar) in etc.unicode returns the numeric value in the range 0 to 9 of all Unicode decimal digits. It returns -1 for all non-digits. You can find source code for this function in Deimos on dsource. Temporarily, there is no prebuilt library, but the source code works fine. When I get back to writing code, my very next task will be to tidy up etc.unicode, release the codebuilder code, etc.. Right now I'm still taking a few weeks off coding because I'm still a bit blown away by my gran's death, so, for now, you'll just have to put up with me ranting on this forum without actually /doing/ anything - but I imagine I'll get back onto the task in hand in maybe a couple of weeks or so. There is also a similar function, getDigit(dchar), which is similarly defined, except that it also considers things like SUPERSCRIPT TWO and CIRCLED THREE to be "digits". I imagine, therefore, that for readf(), getDecimalDigit() would be more appropriate than getDigit().
Also, is it reasonable to assume that every numbering scheme is base 10?  I'd
certainly think so, but I suppose it's worth asking.
As far as Unicode is concerned, yes. As far as reality is concerned, no. In the Tamil script, for example, they use base twelve. Unicode simply cannot comprehend this, and (erroneously) declares Tamil digits 0 to 9 to be "decimal". However - for our purposes, /this doesn't matter/. Our job is to implement the Unicode standard, even if it's wrong. Fixing the Unicode code charts is a job for the Unicode Consortium, and that may happen in some future release. For now - as Walter said - we put metaphorical blinkers on and go with what the standard says. For hexadecimal, there's the function getHexValue(), which returns a value in the range 0 to 15 for hex digits, -1 otherwise. (It's possible I may not have implemented that yet, or that I implemented it inefficiently. When I get back to D-coding, I'll fix this). Jill
Jul 26 2004
parent Sean Kelly <sean f4.ca> writes:
Arcane Jill wrote:
 In article <ce3qsf$1fqt$1 digitaldaemon.com>, Sean Kelly says...
 
Also, is it reasonable to assume that every numbering scheme is base 10?  I'd
certainly think so, but I suppose it's worth asking.
As far as Unicode is concerned, yes. As far as reality is concerned, no. In the Tamil script, for example, they use base twelve. Unicode simply cannot comprehend this, and (erroneously) declares Tamil digits 0 to 9 to be "decimal". However - for our purposes, /this doesn't matter/. Our job is to implement the Unicode standard, even if it's wrong. Fixing the Unicode code charts is a job for the Unicode Consortium, and that may happen in some future release. For now - as Walter said - we put metaphorical blinkers on and go with what the standard says.
Makes sense. The scanf spec that I was working off of makes no concession for a base 12 numbering scheme anyway. And I hesitate to add it as it would confuse things.
 For hexadecimal, there's the function getHexValue(), which returns a value in
 the range 0 to 15 for hex digits, -1 otherwise. (It's possible I may not have
 implemented that yet, or that I implemented it inefficiently. When I get back to D-coding, I'll fix this).
Perfect. I'll just use this function for everything. It will simplify the code a bit anyway. Sean
Jul 27 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ce3o9u$1enc$1 digitaldaemon.com>, Berin Loritsch says...
 
 
Another place to look, if you want to see how they are planning on 
improving things is here:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
Hey, cool. They can parse non-Latin digits: "A non-ASCII character c for which Character.isDigit(c) returns true". So, Arabic digits, Bengali digits, no problem.
I am not surprised. I expect one of the areas of programming language development in the near future to be the inclusion of non-ASCII names for things like classes and variables. Just imagine trying to program with class names in Hiragana. I don't think support would be difficult. However, I am still trying to fathom the depths of that Unicode beast.
 
 Not sure how they'd cope with Osmanya digits though - these have codepoints
 U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.
 
(been working on Unicode stuff so I know this...) I believe they would cope using an escape sequence (surrogate pairs). Which I suppose means the String.length() function lies sometimes. From java.nio.Charset ================================ The native coded character set of the Java programming language is that of the first seventeen planes of the Unicode version 3.0 character set; that is, it consists in the basic multilingual plane (BMP) of Unicode version 1 plus the next sixteen planes of Unicode version 3. This is because the language's internal representation of characters uses the UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode the other planes ================================
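A quick Java illustration of how String.length() "lies" for characters outside the BMP (the class name SurrogateDemo is made up; U+104A0 is OSMANYA DIGIT ZERO, written below as its surrogate pair):

    public class SurrogateDemo {
        public static void main(String[] args) {
            String osmanyaZero = "\uD801\uDCA0";       // one character: U+104A0
            System.out.println(osmanyaZero.length());  // prints 2 - code units, not characters
        }
    }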
Jul 26 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce4ec5$1moq$2 digitaldaemon.com>, parabolis says...


 "A non-ASCII character c for which Character.isDigit(c) returns true"
 
 Not sure how they'd cope with Osmanya digits though - these have codepoints
 U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.
 
(been working on Unicode stuff so I know this...) I believe they would cope using an escape sequence (surrogate pairs). Which I suppose means the String.length() function lies sometimes.
Yes, Java uses UTF-16 (which is what you meant by "escape sequence" or "surrogate pairs"). However, that doesn't change the definition above: "A non-ASCII character c for which Character.isDigit(c) returns true". The function Character.isDigit(c) takes a Java char as its parameter, not a UTF-16 sequence.

It doesn't matter for me, though, as I don't use Java, and I intend for D to do better.

Jill
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 
 It doesn't matter for me, though, as I don't use Java, and I intend for D to do
 better.
 
In my opinion D is off to a really bad start with Unicode.
Jul 27 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce68g6$2h7r$1 digitaldaemon.com>, parabolis says...
Arcane Jill wrote:

 
 It doesn't matter for me, though, as I don't use Java, and I intend for D to do
 better.
 
In my opinion D is off to a really bad start with Unicode.
The "start" hasn't even happened yet. What we have now isn't anything like what we're /going/ to have. There are /loads/ of (other) things that D doesn't have yet (like decent streams support), but most of these things are *in progress*. I'd say you made your call too early. Look at it like this. D has only been around for three or four years, and it was basically a one-person project. We're not even at version 1.0 yet, so the best is most definitely yet to come. Now Walter planned for good Unicode support from the start, and, with that in mind, he laid down the foundations, for example by insisting that D strings be Unicode. Those foundations are now being built upon. For example, the library etc.unicode (temporarily on hold for a few weeks due to a family death) currently gives you access to (almost) every Unicode character property. C doesn't give you this. C++ doesn't give you this. Even Java only gives you this for codepoints up to U+FFFF. D covers the lot - and that's /right now/. What's more, this library is robot-built from the actual Unicode database files, and so can be rebuilt with every new version of Unicode as it comes out, /and/ can be rebuilt for old versions of Unicode should that need arise. We're way ahead of Java there, which leaves you stuck with whatever version happens to come with your JVM. And as for the future - well, for stage 2 we've got the normalization, canonical and compatibility equivalence stuff all planned, grapheme boundary detection, full localized casing ... which I think will take us way ahead of Java. And meanwhile, there are guys working on strings and streams who are getting transcoding issues sussed. For stage three - and by this stage we'll be way ahead of the field - we'll have fuzzy matching, collation, and so on, all of which are locale-aware, plus full support for PUA properties. And meanwhile, there will be other guys working on other internationalization translation issues like number formatting and whatnot. I think you have made your judgement too early. Phobos is tiny right now, compared with Java's vast array of classes. Deimos is even tinier, and somewhat more piecemeal. But already D's Unicode support is: * Better than C * Better than C++ * Catching up with Java (and better in some areas) To expect the full whack right at the start is unrealistic (and we /are/ still right at the start). Walter was way too busy getting the core of the language together to start worrying about how you do uppercasing in Deseret*, but the language has now reached the point where we can do that. So tell me. Against what are you comparing D? Java? Tell me in what ways you think D is behind? Tell me what does better than D, and in what way? I suspect you may be hard pressed to come up with examples. Arcane Jill * something which Java can't do, but D can, right now.
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
Arcane Jill wrote:


It doesn't matter for me, though, as I don't use Java, and I intend for D to do
better.
In my opinion D is off to a really bad start with Unicode.
The "start" hasn't even happened yet. What we have now isn't anything like what we're /going/ to have. There are /loads/ of (other) things that D doesn't have yet (like decent streams support), but most of these things are *in progress*. I'd say you made your call too early.
Actually the start has happened. What I was referring to was that the conception of the string in D has seemingly been defined. The Object.toString method returns a UTF sequence. (I will explain further down...)
 
 Look at it like this. D has only been around for three or four years, and it was basically a one-person project. We're not even at version 1.0 yet, so the best
...
 
 And as for the future - well, for stage 2 we've got the normalization, canonical
...
 For stage three - and by this stage we'll be way ahead of the field - we'll have
... Please forgive me, but I have only started figuring out Unicode this week, so a good deal of D's planned implementation consists of features that I at best partially understand. I am happy to know there is a master plan, however. I applaud the robot builds.

I would be curious whether non-ASCII names will be supported (ie class and variable names, etc). I am also curious about whether it will be possible for a non-English speaker to use their language's version of D's reserved words (ie the Swedish word for synchronized, etc).
 
 I think you have made your judgement too early. Phobos is tiny right now,
 compared with Java's vast array of classes. Deimos is even tinier, and somewhat
 more piecemeal. But already D's Unicode support is:
 
 * Better than C
 * Better than C++
 * Catching up with Java (and better in some areas)
 
 To expect the full whack right at the start is unrealistic (and we /are/ still
 right at the start). Walter was way too busy getting the core of the language
 together to start worrying about how you do uppercasing in Deseret*, but the
 language has now reached the point where we can do that.
 
 So tell me. Against what are you comparing D? Java? Tell me in what ways you
 think D is behind? Tell me what does better than D, and in what way? I suspect
 you may be hard pressed to come up with examples.
 
Ok, before I explain what aspects of D's Unicode implementation bother me, I feel, given the context of the thread, that I need to point out that I am not comparing D to any other language. I have used Java's Unicode-related documents only to clarify the diverse Unicode technical vocabulary.

As stated above (and in the 'Source level Java to D converter' thread) I do not agree with D's apparent conception of the string.

The D docs (in Arrays.Special Array Types.Strings) say this:

================================
Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters. String literals become just an easy way to write character arrays.
================================

I agree that a string is a sequence of characters. However, D's conception of a string seems to be a Unicode string, which is most decidedly NOT a sequence of characters. Unicode defines a Character in a sensible fashion:

================================
(from http://www.unicode.org/glossary/)

Character. (1) The smallest component of written language that has semantic value; ...

Unicode String. A code unit sequence ...
================================

What D calls characters are in fact code units. A char, for example, is an 8-bit code unit that may in special cases represent a Character. Of course the type name 'char' was strongly suggested by C compatibility, so the misnomer was not wanton. However, conceptually confusing a String of Characters with a Unicode String (of code units) led to what I consider a fairly glaring omission in even the most basic or unfinished library:

The most basic of String operations, length-query and substring, are not supported. It is clearly possible to count the code units with char[].length, and it is possible to slice the code units with char[i..j]. But no predefined operations actually indicate how many Characters a char[] (or wchar[]) actually contains. To put it another way:

    char[].length != <String>.length
       char[i..k] != <String>.substring(i,j)
       char[].sort  (just amusing to consider)

It may seem like I am being overly pedantic, but I came to D without any knowledge of Unicode. It took me days to finally figure out that when anything D-related says 'string' it actually means something different from the intuitive notion of a string, the formal notion of a string (assuming an alphabet must consist of Characters), and Unicode's technical definition.

I would strongly suggest adding a String class to Phobos which implements a String of Characters, and reserving the term string for that class alone. Hence, if a String class is written, then Object.toString() should return a String reference.

Of course a String class is not the only valid solution, and I do not have enough experience with D or Unicode to suggest that it would be the best. I certainly would not suggest that it should be done because that is how it was done in Java...

With that said, I do have doubts that a feasible solution exists without implementing a String class. Non-class methods would have to parse UTF once for each length and substring call, whereas a proper class implementation can do both in constant time (see the implementation suggestion below if in doubt).

I am under the impression that while Unicode has entities that require more than 16 bits to represent, it has been said that such 32-bit examples will be "vanishingly rare". Thus 16 bits is the normal case, with occasional use of 32-bit entities.
Optimizing the most frequent case suggests using an internal representation of wchar[] for the 16-bit entities, and another sparse wchar array for the cases in which a wchar is too small. The query-length function is obviously constant time in all cases. However, so is the substring operation - thanks to copy-on-write.

I might also suggest considering making String an interface and implementing it in three separate classes (or more):

    1) String8
    2) String16
    3) String32

(these are horrible names, sorry)

The interface implementation of course has the benefit that anybody who wants to tune a String class to work for them can either subclass an existing String class or write their own implementation (without inheriting superclass stuff) and still have the class recognized by Object.toString() and Exception.this(String).
Jul 27 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
parabolis wrote:
 
 What D calls characters are in fact code units. A char for example is an 
 8-bit code unit that may in special cases represent a Character. Of 
 course the type name 'char' was
 strongly suggested for C-compatibility so the misnomer was not wanton.
This is a multifaceted issue. D supports UTF-8, UTF-16, and UTF-32 representations, stored in arrays of char, wchar, and dchar, respectively. While char strings are technically UTF-8, there is a 1-1 correspondence between characters and bytes so long as the values are within the range of the ASCII character set. And in the case of dchars, there (as far as I know) is always a 1-1 correspondence between D characters and Unicode characters.
 However conceptually confusing a String of Characters with a Unicode 
 String (of code units) led to what I consider a fairly glaring omission 
 in even the most basic or unfinished library:
 
 The most basic of String operations length-querry and substring are not 
 supported. It is clearly possible to count the code units with 
 _char[].length and it is possible to slice the code units with 
 char[i..j]. But no predefined operations actually indicate how many 
 Characters a char[] (or wchar[]) actually contains. To put it another way:
 
     char[].length != <String>.length
        char[i..k] != <String>.substring(i,j)
        char[].sort  (just amusing to consider)
Good point. However C++ has this exact same issue with its string class. Perhaps the problem is one of semantics. While C++ merely claims that its strings are an ordered sequence of bytes, the D documentation suggests that these bytes are in a specific encoding format (though the language does not require this).
 I may seem like I am being overly pedantic but I came to D without any 
 knowledge of Unicode. It took me days to finally figure out that when 
 anything D related says 'string' it actually means something different 
 from the intuitive notion of a string, the formal notion of a string 
 (assuming an alphabet must consist of Characters) and Unicode's 
 technical definition.
Part of this has come about because we've been actively discussing internationalization recently, so much of what's said about strings is done so in that context. I'm only passingly familiar with many of the details of Unicode as well, but I do believe that there is room in the language for both definitions of "string."
 I would strongly suggest adding a String class to phobos which 
 implements a String of Characters and reserve the term string to that 
 class alone. Hence if a String class is written then Object.toString() 
 should return a String reference.
True enough. I agree that if a sequence of characters is to be printed then it must be properly encoded. Whether the internal representation is properly encoded, however, isn't much of an issue to me, so long as there is a clear means of producing the encoded string when output is desired.
 With that said I do have doubts that a feasible solution exists without 
 implementing a String class. Non-class methods would have to parse UTF 
 once for each length and substring call whereas a proper class 
 implementation can do it in constant time (see implementation suggestion 
 below if in doubt).
True enough. At the very least, we need some method of determining "true" string length, i.e. how many representable characters a string contains. I have a feeling that there is a Unicode function for this, but I could not tell you its name. Frankly, I suspect that we will begin to use dchar arrays more and more often to avoid the trouble that dealing with multibyte encodings causes.
 I might also suggest considering making String an interface and 
 implementing it in 3 seperate classes (or more):
     1) String8
     2) String16
     3) String32
 (these are horrible names, sorry)
I'm not sure if there's one in the DTL, but it might be worth waiting to see. Assuming there is, I suspect that the signature would be along the lines of:

    class String(CharT) {...}

so

    String!(char);
    String!(wchar);
    String!(dchar);

Sean
Jul 27 2004
parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 parabolis wrote:
 
 What D calls characters are in fact code units. A char for example is 
 an 8-bit code unit that may in special cases represent a Character. Of 
 course the type name 'char' was
 strongly suggested for C-compatibility so the misnomer was not wanton.
This is a multifaceted issue. D supports UTF-8, UTF-16, and UTF-32 representations, stored in arrays of char, wchar, and dchar,
Yes Unicode calls them code units instead of characters because they do not always represent a character.
 respectively.  While char strings are technically UTF-8, there is a 1-1 
 correspondence between characters and bytes so long as the values are 
 within the range of the ASCII character set.  And in the case of dchars, 
 there (as far as I know) is always a 1-1 correspondence between D 
 characters and Unicode characters.
Yes that would be the special case in which a char actually holds sufficient code units to be interpreted as a Character.
 
 However conceptually confusing a String of Characters with a Unicode 
 String (of code units) led to what I consider a fairly glaring 
 omission in even the most basic or unfinished library:

 The most basic of String operations length-querry and substring are 
 not supported. It is clearly possible to count the code units with 
 _char[].length and it is possible to slice the code units with 
 char[i..j]. But no predefined operations actually indicate how many 
 Characters a char[] (or wchar[]) actually contains. To put it another 
 way:

     char[].length != <String>.length
        char[i..k] != <String>.substring(i,j)
        char[].sort  (just amusing to consider)
Good point. However C++ has this exact same issue with its string class. Perhaps the problem is one of semantics. While C++ merely claims that its strings are an ordered sequence of bytes, the D documentation suggests that these bytes are in a specific encoding format (though the language does not require this).
If there were actually a string class I would not expect the above to hold. I simply meant that there is no way currently in D to find either of:

    <String>.length
    <String>.substring(i,j)

because only these are implemented:

    char[].length
    char[i..k]
 
 I may seem like I am being overly pedantic but I came to D without any 
 knowledge of Unicode. It took me days to finally figure out that when 
 anything D related says 'string' it actually means something different 
 from the intuitive notion of a string, the formal notion of a string 
 (assuming an alphabet must consist of Characters) and Unicode's 
 technical definition.
Part of this has come about because we've been actively discussing internationalization recently, so much of what's said about strings is done so in that context. I'm only passingly familiar with many of the details of Unicode as well, but I do believe that there is room in the language for both definitions of "string."
I think you may be missing my point. I am not suggesting eliminating "Unicode string" support for the sake of a 1:1 correspondence between a primitive type and character. I am saying that there is really only one definition of "string", and calling sequences of code units 'strings' does not fit any standard notion of a "string".
 
 With that said I do have doubts that a feasible solution exists 
 without implementing a String class. Non-class methods would have to 
 parse UTF once for each length and substring call whereas a proper 
 class implementation can do it in constant time (see implementation 
 suggestion below if in doubt).
True enough. At the very least, we need some method of determining "true" string length, i.e. how many representable characters a string contains. I have a feeling that there is a Unicode function for this, but I could not tell you its name. Frankly, I suspect that we will begin to use dchar arrays more and more often to avoid the trouble that dealing with multibyte encodings causes.
 I might also suggest considering making String an interface and 
 implementing it in 3 seperate classes (or more):
     1) String8
     2) String16
     3) String32
 (these are horrible names, sorry)
I'm not sure if there's one in the DTL, but it might be worth waiting to see. Assuming there is, I suspect that the signature would be along the lines of: class String(CharT) {...} so String!(char); String!(wchar); String!(dchar);
I think a templated version of String should also implement a String interface, because it would still allow other implementations to be used:

    interface String
    class StringT(CharT) : String {...}
Jul 28 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce6pke$2o2m$1 digitaldaemon.com>, parabolis says...
Actually the start has happened. What I was referring to was 
that the conception of the string in D has seemingly been 
defined. The Object.toString method returns a UTF sequence.
(I will explain further down...)

I would be curious whether non-ASCII names will be supported (ie 
classes and variable names, etc).
You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.
I am also curious about whether there it will be possible for a 
non-English speaker to use their language's version of D's 
reserverved words (ie Swedish word for synchronized, etc).
I'd be surprised if that were so. Syntax analysis happens /before/ semantic analysis, and syntax analysis needs to know all the reserved words. But again, I'm just guessing. Only Walter can be definitive.
Ok before I explain what aspects of D's Unicode implementation 
bother me I feel given the context of the thread that I need to 
point out that I am not comparing D to any other language. I 
have used Java's Unicode related documents only to clarify the 
diverse Unicode technical vocabulary.

As stated above (and in the 'Source level Java to D converter' 
thread) I do not agree with D's apparent conception of the string.

The D docs (in Arrays.Special Array Types.Strings) say this:
================================
Dynamic arrays in D suggest the obvious solution - a string is 
just a dynamic array of characters. String literals become just 
an easy way to write character arrays.
================================

I agree that a string is a sequence of characters. However D's 
conception of string seems to be a Unicode string which is most 
decidedly NOT a sequence of characters. Unicode defines a 
Character in a sensible fashion:

================================
(from http://www.unicode.org/glossary/)

Character. (1) The smallest component of written language that 
has semantic value; ...

Unicode String. A code unit sequence ...
================================

What D calls characters are in fact code units.
Correct.
A char for 
example is an 8-bit code unit that may in special cases 
represent a Character. Of course the type name 'char' was
strongly suggested for C-compatibility so the misnomer was not 
wanton.
I think it was also chosen for ASCII compatibility. It makes sense for Westerners. "hello world\n" has got twelve characters in it, as well as twelve code units. See - D is trying to educate people /gently/. If it had started out with the following as basic types:

    * codeunit   // UTF-8 code unit
    * wcodeunit  // UTF-16 code unit
    * dcodeunit  // UTF-32 code unit
    * char       // 32-bit wide character (same as dcodeunit)

then everything would have worked, but people who used mostly ASCII would likely go: Eh? And ASCII strings would be four times as long.
However conceptually confusing a String of Characters with a 
Unicode String (of code units) led to what I consider a fairly 
glaring omission in even the most basic or unfinished library:

The most basic of String operations length-querry and substring 
are not supported. It is clearly possible to count the code 
units with _char[].length and it is possible to slice the code 
units with char[i..j]. But no predefined operations actually 
indicate how many Characters a char[] (or wchar[]) actually 
contains. To put it another way:

     char[].length != <String>.length
        char[i..k] != <String>.substring(i,j)
That's only partially true. As noted above, it's the /names/ for things that are wrong, not that things are absent. If you pretend that "dchar" is the character type, rather than "char", then you /do/ get the behavior you desire. You /could/ simply pretend that char and wchar don't exist, if you really wanted.
        char[].sort  (just amusing to consider)
This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.
I may seem like I am being overly pedantic but I came to D 
without any knowledge of Unicode.
Most people do, and you're not being overly pedantic.
It took me days to finally 
figure out that when anything D related says 'string' it 
actually means something different from the intuitive notion of 
a string, the formal notion of a string (assuming an alphabet 
must consist of Characters) and Unicode's technical definition.
Mebe, but it's no different from a string in any other computer language. In /no/ language of which I am aware is a string an array of Unicode characters. In C and C++ on Windows, for example, a char is eight bits wide, and so /obviously/ can't store all Unicode characters. In fact, it's very hard for C source code to know the encoding of a C string, and everything will work fine only if everything sticks to the system default. This makes internationalization much harder.
I would strongly suggest adding a String class to phobos which 
implements a String of Characters and reserve the term string to 
that class alone. Hence if a String class is written then 
Object.toString() should return a String reference.
We D users can write a Unicode-aware String class (and I believe Hauke is doing that); we can publish it; we can even /suggest/ that it be moved into Phobos. But Walter is the only one who can approve/disapprove/implement that suggestion. Phobos is Walter's baby. Deimos is one place where we can put things in the meantime, but the tight integration that you suggest can only happen if everything is in the same place.

But I'm tempted to ask why? I mean, what's wrong with a char[] (UTF-8 sequence)? It's good enough for many purposes, especially for mostly-ASCII strings (which Object.toString() is likely to return), and you can always convert it to a String (pending such a class) if you want more functionality.
Of course a String class is not the only valid solution and I do 
not have enough experience with D or Unicode to suggest that it 
would be the best. I certainly would not suggest that it should 
be done because that is how it was done in Java...
True. And Java made the mistake of declaring a String class /final/. I found that damned annoying, as I couldn't extend it. If I wanted additional functionality not provided by Java String, I would have had to have written a brand new class from scratch, and even then it wouldn't have cast. I seriously hope D doesn't make /that/ mistake. However much functionality a String may provide, there's always going to be at least one user who wants /just one more function/.
I am under the impression that while Unicode has enities that 
require more than 16 bits to represent it has been said that 
such 32 bit examples will be "vanishingly rare". Thus 16 bits is 
the normal case with occasional use of 32 bit entities.
Depends what you want to do. As a musician, I've often wanted to use the musical characters U+1D100 to U+1D1DD. As a mathematician, I similarly would want to use the mathematical letters U+1D400 to U+1D7FF. Mystical types would probably like to use the tetragrams between U+1D306 and U+1D356. So you see, the characters beyond U+FFFF are not /all/ strange alphabets we've never heard of, and I certainly wouldn't call the desire to go beyond U+FFFF "vanishingly rare".
Optimizing the most frequent case suggests using an internal 
representation of wchar[] for the 16 bit entities
Makes sense
and another 
sparse wchar array for the cases in which a wchar is too small. 
I don't understand that. UTF-16 is better, from the point of view of most common case and memory usage.
The querry-length function is obviously constant time in all 
cases. However so is the substring operation - thanks to 
copy-on-write.
It's not /that/ hard to count characters in UTF-8 and UTF-16. In UTF-8, you only have to ignore code units between 0x80 and 0xBF, and in UTF-16 you only have to ignore code units between 0xDC00 and 0xDFFF. Count all the rest and you've got the number of characters. Nice thoughts though. Keep them coming. Jill
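A sketch of the counting rule Jill describes, in Java for concreteness (the class and method names are made up; the masks and ranges are straight from her description):

    public class CountDemo {
        // UTF-8: count every code unit except continuation bytes 0x80-0xBF.
        static int countUtf8(byte[] s) {
            int n = 0;
            for (int i = 0; i < s.length; i++)
                if ((s[i] & 0xC0) != 0x80) n++;
            return n;
        }

        // UTF-16: count every code unit except low surrogates 0xDC00-0xDFFF.
        static int countUtf16(char[] s) {
            int n = 0;
            for (int i = 0; i < s.length; i++)
                if (s[i] < 0xDC00 || s[i] > 0xDFFF) n++;
            return n;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(countUtf8("caf\u00E9".getBytes("UTF-8")));  // 4 characters, 5 bytes
            System.out.println(countUtf16("\uD801\uDCA0".toCharArray()));  // 1 character, 2 units
        }
    }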
Jul 28 2004
next sibling parent reply J C Calvarese <jcc7 cox.net> writes:
In article <ce7o9r$2uo$1 digitaldaemon.com>, Arcane Jill says...
In article <ce6pke$2o2m$1 digitaldaemon.com>, parabolis says...
Actually the start has happened. What I was referring to was 
that the conception of the string in D has seemingly been 
defined. The Object.toString method returns a UTF sequence.
(I will explain further down...)

I would be curious whether non-ASCII names will be supported (ie 
classes and variable names, etc).
You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.
Unless I don't understand the question (which is always a strong possibility), DMD already supports non-ASCII names for identifiers: "Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)" http://www.digitalmars.com/d/lex.html I've tested it before and it worked for me. jcc7
Jul 28 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8e8g$al2$1 digitaldaemon.com>, J C Calvarese says...

Unless I don't understand the question (which is always a strong possibility),
DMD already supports non-ASCII names for identifiers:

"Identifiers start with a letter, _, or unicode alpha, and are followed by any
number of letters, _, digits, or universal alphas. Universal alphas are as
defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)" 
So it does. Cool. I looked at that document (ISO/IEC 9899:1999(E) Appendix D). It describes a fixed list of identifier characters, which will never change with time (as opposed to up-to-date Unicode, which contains an ever-growing list, growing with each new version of Unicode). Anyway, I'm impressed. This is brilliant. Jill
Jul 28 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
J C Calvarese wrote:

 In article <ce7o9r$2uo$1 digitaldaemon.com>, Arcane Jill says...
 
In article <ce6pke$2o2m$1 digitaldaemon.com>, parabolis says...

Actually the start has happened. What I was referring to was 
that the conception of the string in D has seemingly been 
defined. The Object.toString method returns a UTF sequence.
(I will explain further down...)

I would be curious whether non-ASCII names will be supported (ie 
classes and variable names, etc).
You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.
Unless I don't understand the question (which is always a strong possibility), DMD already supports non-ASCII names for identifiers: "Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)" http://www.digitalmars.com/d/lex.html I've tested it before and it worked for me. jcc7
Wow I am impressed. That was really forward thinking.
Jul 28 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 
 You'll have to ask Walter that one. (I mean, you'll have to wait and see if
....
 analysis, and syntax analysis needs to know all the reserved words. But again,
 I'm just guessing. Only Walter can be definitive.
I will probably wait to see if he responds to me. As I have said before, I imagine that Unicode acceptance will make these issues solvable for compiler writers, and so languages in the near future will be developed with these aspects in mind.
A char for 
example is an 8-bit code unit that may in special cases 
represent a Character. Of course the type name 'char' was
strongly suggested for C-compatibility so the misnomer was not 
wanton.
I think it was also chosen for ASCII compatibility. It makes sense for Westerners. "hello world\n" has got twelve characters in it, as well as twelve code units. See - D is trying to educate people /gently/. If it had started out with the following as basic types: * codeunit // UTF-8 code unit * wcodeunit // UTF-16 code unit * dcodeunit // UTF-32 code unit * char // 32-bit wide character (same as dcodeunit) then everything would have worked, but people who used mostly ASCII would likely go: Eh? And ASCII strings would be four times as long.
Actually, this facet of UTF is exactly why I want to see the proper terms used for things. People who generally use ASCII can expect a char to represent a Character, and people who generally use the 16-bit subset of Unicode values can expect a wchar to represent a Character. Combine that with the fact that char seems to be short for Character, and it is obvious people will make the wrong overgeneralization that a char or wchar actually represents Characters.

The result is very subtle bugs when assuming char[].length or wchar[].length counts the /Characters/ in an array, or that char[i..k] slices the /Characters/ in an array.
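The slicing half of that pitfall is easy to demonstrate in Java, whose String.substring likewise indexes UTF-16 code units rather than Characters (the class name SliceDemo is made up; U+104A0 is a single supplementary character):

    public class SliceDemo {
        public static void main(String[] args) {
            String s = "a\uD801\uDCA0b";          // 3 characters, 4 code units
            String broken = s.substring(0, 2);    // "a" plus a lone high surrogate
            System.out.println(broken.length());  // 2 code units - but not 2 characters
        }
    }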
 
 
However conceptually confusing a String of Characters with a 
Unicode String (of code units) led to what I consider a fairly 
glaring omission in even the most basic or unfinished library:

The most basic of String operations length-querry and substring 
are not supported. It is clearly possible to count the code 
units with _char[].length and it is possible to slice the code 
units with char[i..j]. But no predefined operations actually 
indicate how many Characters a char[] (or wchar[]) actually 
contains. To put it another way:

    char[].length != <String>.length
       char[i..k] != <String>.substring(i,j)
That's only partially true. As noted above, it's the /names/ for things that are wrong, not that things are absent. If you pretend that "dchar" is the character type, rather than "char", then you /do/ get the behavior you desire. You /could/ simply pretend that char and wchar don't exist, if you really wanted.
But things are absent. D does not currently have the facility to do the two most fundamental String operations, and I really doubt they would be missing if it weren't for an oversight.
 
       char[].sort  (just amusing to consider)
This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.
I agree with you but you missed the multiple code units case, where sort has the nice property that it /destroys/ a valid encoding:

================================================================
char[] threeInOrderCharacters =
[
    0xE6,0x97,0xA5, // u+65E5
    0xE6,0x9C,0xAC, // u+672C
    0xE8,0xAA,0x9E, // u+8A9E
];

void main(char[][] argv)
{
    uint max = threeInOrderCharacters.length;
    threeInOrderCharacters.sort;
    for( uint i = 0; i < max; i++ )
    {
        printf( "%2X ", threeInOrderCharacters[i] );
    }
}
================================================================
Output:
97 9C 9E A5 AA AC E6 E6 E8
================================================================
 
I may seem like I am being overly pedantic but I came to D 
without any knowledge of Unicode.
Most people do, and you're not being overly pedantic.
lol, I know. I just don't want them to hate me.
 
It took me days to finally 
figure out that when anything D related says 'string' it 
actually means something different from the intuitive notion of 
a string, the formal notion of a string (assuming an alphabet 
must consist of Characters) and Unicode's technical definition.
Mebe, but it's no different from a string in any other computer language. In
C uses "string" the same way. If it did not then all the ctype.h functions would have to take pointers to char arrays to be able to answer the questions that they answer. The char type is so named because it was expected that a Character would be represented (wholly) by a char. So string.h was built assuming strlen gives the number of Characters. C++ char[] and wchar[] /arrays/ should not be confused with a String. See STL which defines String to work in a manner consistent with my Characters of Strings notion. Java obviously also uses "string" the same way... see java.lang.String. I /suspect/ that Objective C and ECMA-262 also define Strings in a similar manner.
 /no/ language of which I am aware is a string an array of Unicode characters.
In
 C and C++ on Windows, for example, a char is eight bits wide, and so
/obviously/
 can't store all Unicode characters. In fact, it's very hard for C source code
to
 know the encoding of a C string, and everything will work fine only if
 everything sticks to the system default. This makes internationalization much
 harder.
Perhaps I am being overly pedantic again, but consider u+0000 and u+0001. I believe calling the following Characters is acceptable:

    typedef bit tinyCharacter;
    bit[n] t_string = new bit[n];

Here I have an array which is also a String, since there is /always/ a 1:1 correspondence between elements and Characters.

Let me guess... You want Strings that support a larger subset of Unicode Characters? Well, fortunately C originally supported the Unicode range from u+0000 to u+007F with arrays of Characters.

Arrays of Java's char type do not make a string. Likewise, arrays of char or wchar in C++ do not make strings. Fortunately there are String classes to support the more trying requirement of supporting just the Unicode range from u+0000 to u+FFFF.
 
 
I would strongly suggest adding a String class to phobos which 
implements a String of Characters and reserve the term string to 
that class alone. Hence if a String class is written then 
Object.toString() should return a String reference.
We D users can write a Unicode aware String class (and I believe Hauke is doing that); we can publish it; we can even /suggest/ that it be moved into Phobos. But Walter is the only one who can approve/disapprove/implement that suggestion. Phobos is Walter's baby. Deimos is one place where we can put things in the meantime, but the tight integration that you suggest can only happen if everything is in the same place.
I am happy he controls entries as I am sure phobos' quality will be much improved as a result.
 
 But I'm tempted to ask why? I mean, what's wrong with a char[] (UTF-8
sequence)?
I believe I explain why farther below in my post. If that did not answer the question you are asking then please help me understand better what you want to know.
 
Of course a String class is not the only valid solution and I do 
not have enough experience with D or Unicode to suggest that it 
would be the best. I certainly would not suggest that it should 
be done because that is how it was done in Java...
True. And Java made the mistake of declaring a String class /final/. I found that damned annoying, as I couldn't extend it. If I wanted additional functionality not provided by Java String, I would have had to have written a brand new class from scratch, and even then it wouldn't have cast. I seriously hope D doesn't make /that/ mistake. However much functionality a String may provide, there's always going to be at least one user who wants /just one more function/.
I am under the impression that while Unicode has enities that 
require more than 16 bits to represent it has been said that 
such 32 bit examples will be "vanishingly rare". Thus 16 bits is 
the normal case with occasional use of 32 bit entities.
Depends what you want to do. As a musician, I've often wanted to use the musical characters U+1D100 to U+1D1DD. As a mathematician, I similarly would want to use the mathematical letters U+1D400 to U+1D7FF. Mystical types would probably like to use the tetragrams between U+1D306 and U+1D356. So you see, the characters beyond U+FFFF are not /all/ strange alphabets we've never heard of, and I certainly wouldn't call the desire to go beyond U+FFFF "vanishingly rare".
Actually the "vanishingly rare" from the Unicode documents meant the frequency with which they will be extedning beyond 32 bits. I wish I had a link so I could find it again... However I do still doubt the likelihood of ever seeing full sentence which contains exclusively (or perhaps even mostly) entities above u+FFFF. Perhaps a transcription in Linear B.
and another 
sparse wchar array for the cases in which a wchar is too small. 
I don't understand that. UTF-16 is better, from the point of view of most common case and memory usage.
I apologize, I should have made this much more clear:

================================================================
class String
{
    private wchar[] loBits;
    private SparseArray hiBits;
    // implementation here
}
================================================================

For every Character in the String there is an entry for that Character in loBits. So for length(), returning loBits.length will accurately indicate the number of Characters in the calling String object.

For any Unicode value from u+0000 to u+FFFF, that value is stored in loBits and hiBits remains unchanged. For values greater than u+FFFF, the lowest 16 bits are stored in loBits and the upper 16 bits are stored in hiBits.

Memory usage will be almost exactly the same as encoding with UTF-16. (Identical in big-O terms.)
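A rough Java rendering of that sketch, just to make the constant-time length() claim concrete (all names here are made up, and a HashMap stands in for the SparseArray):

    import java.util.HashMap;
    import java.util.Map;

    public class WideString {
        private final char[] loBits;  // low 16 bits: exactly one slot per Character
        private final Map<Integer, Integer> hiBits
                = new HashMap<Integer, Integer>();  // sparse: index -> high 16 bits

        public WideString(int[] codePoints) {
            loBits = new char[codePoints.length];
            for (int i = 0; i < codePoints.length; i++) {
                loBits[i] = (char) (codePoints[i] & 0xFFFF);
                int hi = codePoints[i] >>> 16;
                if (hi != 0) hiBits.put(i, hi);  // only rare supplementary characters
            }
        }

        public int length() { return loBits.length; }  // constant time, counts Characters

        public int codePointAt(int i) {                // recombine the two halves
            Integer hi = hiBits.get(i);
            return ((hi == null ? 0 : hi) << 16) | loBits[i];
        }

        public static void main(String[] args) {
            WideString s = new WideString(new int[] { 'a', 0x104A0, 'b' });
            System.out.println(s.length());                             // 3
            System.out.println(Integer.toHexString(s.codePointAt(1)));  // 104a0
        }
    }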
 
The query-length function is obviously constant time in all 
cases. However so is the substring operation - thanks to 
copy-on-write.
It's not /that/ hard to count characters in UTF-8 and UTF-16. In UTF-8, you only need to count the bytes which are not of the form 10xxxxxx (the continuation bytes).
It is not an issue of difficulty but rather efficiency. A String class can perform length and substring in constant time, whereas parsing UTF-16 will always require a loop - constant time is simply not possible. So to recap in big O terms:

1) The memory requirements of String are identical to UTF-16.
2) For length():
   2a) The time requirements of String are O(1)
   2b) The time requirements of UTF-16 are O(N)
3) For substring():
   3a) The time requirements of String are O(1)
   3b) The time requirements of UTF-16 are O(N)
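For example, here is a sketch (not anything from Phobos) of the loop that counting the Characters in a wchar[] requires:
================================================================
// O(N): count the Characters in a UTF-16 sequence
uint utf16Length(wchar[] s) {
    uint n = 0;
    for (uint i = 0; i < s.length; i++) {
        // 0xD800-0xDBFF is a high surrogate; it pairs with the
        // next code unit to encode one Character above u+FFFF
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            i++;
        n++;
    }
    return n;
}
================================================================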
Jul 28 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8hhq$c9o$1 digitaldaemon.com>, parabolis says...

The result is very subtle bugs when assuming char[].length or 
wchar[].length counts the /Characters/ in an array or that 
char[i..k] slices the /Character/ in an array.
I'm not disagreeing with you, but see my separate post on graphemes and glyphs and things. There are distinctions in Unicode which never existed in ASCII, so people are not used to them. In ASCII, every character was either a control or a grapheme. This correspondence no longer holds in Unicode, so even basing your strings on characters is not always the desirable thing to do.

What, for example, is (cast(dchar[]) "café")[3..4]? Or... (cast(dchar[]) "café").length? The answer depends on how your text editor composed the "é" when you wrote the source code. To paraphrase you, the result is very subtle bugs when assuming dchar[].length counts the /graphemes/ in an array or that dchar[i..k] slices the /graphemes/ in an array.

Of course "char" doesn't suggest "grapheme" in the same way that it suggests "character" - but in reality, most people don't know the difference (because there pretty much is no difference in ASCII).

So - like I say - I'm not disagreeing with you. But I don't see where you're going with this. I see the flaws in current support, and I think "We can fix that". Hence the planned future functionality. You see the same flaws, but you seem instead to be saying "ditch the char". But you know that's not going to happen. Have I misunderstood you?
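Incidentally, to make the café example above concrete - these are just the two standard Unicode spellings:
================================================================
// "café" with a precomposed e-acute: four Characters, four graphemes
dchar[] composed   = [ 'c', 'a', 'f', '\u00E9' ];

// "café" with a combining acute accent: five Characters, still
// four graphemes
dchar[] decomposed = [ 'c', 'a', 'f', 'e', '\u0301' ];

// composed.length == 4, but decomposed.length == 5
// decomposed[3..4] is a bare 'e'; the accent is left behind in [4..5]
================================================================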
       char[].sort  (just amusing to consider)
This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.
I agree with you but you missed the multiple code units case where sort has the nice property that it /destroys/ a valid encoding:
================================================================
     char threeInOrderCharacters[] = [
         0xE6,0x97,0xA5,    // u+65E5
         0xE6,0x9C,0xAC,    // u+672C
         0xE8,0xAA,0x9E,    // u+8A9E
     ];

     void main(char[][] argv) {
         uint max = threeInOrderCharacters.length;
         threeInOrderCharacters.sort;
         for( uint i = 0; i < max; i++ ) {
             printf( "%2X ", threeInOrderCharacters[i] );
         }
     }
================================================================
     Output:
     97 9C 9E A5 AA AC E6 E6 E8
================================================================
Yeah, my bad. I read that as char[][].sort. You're right that char[].sort will break the (conceptual) char[] invariant. You'll get a UTF conversion exception later on. I see what you're saying, but I'm sure that a string class will exist in the future. That it doesn't exist yet, to me, makes it just something to look forward to, not the end of the world.
See STL which defines String to work in a manner 
consistent with my Characters of Strings notion.
Now that's cheating. std::string (being a typedef for std::basic_string<char>) has the same concept of character as (char *). It's dependent on the source code encoding.
Java obviously also uses "string" the same way... see 
java.lang.String.
I just looked at it. Seems to be based on 16-bit wide Java chars to me. That smells of UTF-16, hence /not/ the 1-1 correspondence you suggest.
Perhaps I am being overly pedantic again but consider u+0000 and 
u+0001. I believe calling the following Characters is acceptable:

     typedef bit tinyCharacter;
Errm. Sort of. Really the only definition of "character" that makes sense is that a character is a member of some character set, so if you first defined a character set with two characters in it, then you could indeed encode such characters with one bit. But you can't just go around picking arbitrary subsets of existing character sets and representing them in fewer than the required number of bits.
Let me guess... You want Strings that support a larger subset of 
Unicode Characters?
Either we're talking Unicode or we're not. There are Unicode strings; there are Latin-1 strings; there are ASCII strings. I don't get the question.
Well fortunately C originally supported the Unicode range from 
u+0000 to u+007F with arrays of Characters.
If we're going to be /really/ pedantic here, it did not. It supported ASCII. The fact that there is a 1-1 correspondence between the codepoints of ASCII and the codepoints U+0000 to U+007F of Unicode was a design feature of Unicode, not a design feature of C. But really, you know - who cares? I mean, I see no point in this little tangent. I think we've drifted into the utterly trivial here, and I'm keen to move out of it.
However I do still doubt the likelihood of ever seeing full 
sentence which contains exclusively (or perhaps even mostly) 
entities above u+FFFF.
Depends what language you speak.
Jul 28 2004
next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 In article <ce8hhq$c9o$1 digitaldaemon.com>, parabolis says...
 
 
The result is very subtle bugs when assuming char[].length or 
wchar[].length counts the /Characters/ in an array or that 
char[i..k] slices the /Character/ in an array.
I'm not disagreeing with you, but see my separate post on graphemes and glyphs and things. There are distinctions in Unicode which never existed in ASCII, so people are not used to them. In ASCII, every character was either a control or a grapheme. This correspondence no longer holds in Unicode, so even basing your strings on characters is not always the desirable thing to do.

What, for example, is (cast(dchar[]) "café")[3..4]? Or... (cast(dchar[]) "café").length? The answer depends on how your text editor composed the "é" when you wrote the source code. To paraphrase you, the result is very subtle bugs when assuming dchar[].length counts the /graphemes/ in an array or that dchar[i..k] slices the /graphemes/ in an array.

Of course "char" doesn't suggest "grapheme" in the same way that it suggests "character" - but in reality, most people don't know the difference (because there pretty much is no difference in ASCII).

So - like I say - I'm not disagreeing with you. But I don't see where you're going with this. I see the flaws in current support, and I think "We can fix that". Hence the planned future functionality. You see the same flaws, but you seem instead to be saying "ditch the char". But you know that's not going to happen. Have I misunderstood you?
Perhaps I am missing something, but the general idea that I am used to operating with is a standard internal to the language. I.e. all strings are encoded UTF-32BE, but the IO should be able to translate the native string to whatever format is necessary/available. So the file (from your editor) might be written in UTF-8, but an encoding scheme on your IO stream would be able to convert it to UTF-32BE--which would be native for the language. I am only using it as an example.

I do the same thing with things bigger than strings myself. For example, I have a model that becomes the basis for decoupling the translation side and the usage side. It works very well.

As long as the library was consistent with its standard, wouldn't that work well for D?
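In D terms I mean something like this sketch - loadText() is just a made-up name, and I am assuming the file on disk holds UTF-8:
================================================================
import std.file;
import std.utf;

// translate at the IO boundary; past this point the program
// only ever sees UTF-32
dchar[] loadText(char[] filename) {
    char[] raw = cast(char[]) std.file.read(filename);  // file holds UTF-8
    return toUTF32(raw);
}
================================================================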
Jul 28 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

I am leaving everything you said there...


 
 Yeah, my bad. I read that as char[][].sort. You're right that char[].sort will
 break the (conceptual) char[] invariant. You'll get a UTF conversion exception
 later on.
Actually on the topic of UTF conversion exceptions... There really is no such thing according to the standard. Personally I also prefer failing fast, but I figured I would point out that it is non-standard behaviour.
 
 I see what you're saying, but I'm sure that a string class will exist in the
 future. That it doesn't exist yet, to me, makes it just something to look
 forward to, not the end of the world.
 
Back to the comment that started this discussion:
================================================
In my opinion D is off to a really bad start with Unicode.
================================================
And the reason for the comment:
================================================
I have only seen the phobos.std.string and the D docs which mistakenly say UTF implements Strings. I was not previously privy to D's Unicode plans. I saw what appeared to be a significant ambiguity between the doc's use of String and Unicode string, and that suggested a bad start.
================================================
I am much less pessimistic now that I know D will support intuitive Strings (and indeed a plethora at that - Character, Grapheme and Glyph Strings). I will be quite blown away if D actually manages to pull this off without requiring any knowledge of Unicode except where Unicode specific features are required.
 
 But really, you know - who cares? I mean, I see no point in this little
tangent.
 I think we've drifted into the utterly trivial here, and I'm keen to move out
of
 it. 
I think if it has any relevence it will come up in the newly started thread and is probably best dealt with there.
However I do still doubt the likelihood of ever seeing full 
sentence which contains exclusively (or perhaps even mostly) 
entities above u+FFFF.
Depends what language you speak.
I would be surprised to find that the Unicode Consortium has defined characters in that range that are used in a living language. I was kind of hoping you might have an example.
Jul 28 2004
prev sibling parent "Carlos Santander B." <carlos8294 msn.com> writes:
"parabolis" <parabolis softhome.net> escribió en el mensaje
news:ce8hhq$c9o$1 digitaldaemon.com
|
| I agree with you but you missed the multiple code units case
| where sort has the nice property that it /destroys/ a valid
| encoding:
| ================================================================
|      char threeInOrderCharacters[] = [
|          0xE6,0x97,0xA5,    // u+65E5
|          0xE6,0x9C,0xAC,    // u+672C
|          0xE8,0xAA,0x9E,    // u+8A9E
|      ];
|
|      void main(char[][] argv) {
|          uint max = threeInOrderCharacters.length;
|          threeInOrderCharacters.sort;
|          for( uint i = 0; i < max; i++ ) {
|              printf( "%2X ", threeInOrderCharacters[i] );
|          }
|      }
| ================================================================
|      Output:
|      97 9C 9E A5 AA AC E6 E6 E8
| ================================================================
|

Do what Jill said: use dchar.

/////////////////////////////
import std.utf;

char threeInOrderCharacters[] = [
    0xE6,0x97,0xA5,    // u+65E5
    0xE6,0x9C,0xAC,    // u+672C
    0xE8,0xAA,0x9E,    // u+8A9E
];

void main(char[][] argv) {
    dchar [] tIOC = toUTF32(threeInOrderCharacters);
    //uint max = threeInOrderCharacters.length;
    //threeInOrderCharacters.sort;
    tIOC.sort;
    char [] tIOC2 = toUTF8(tIOC);
    uint max = tIOC2.length;
    for( uint i = 0; i < max; i++ ) {
        //printf( "%2X ", threeInOrderCharacters[i] );
        printf( "%2X ", tIOC2[i] );
    }
}

/////////////////////////////

================================================================
     Output:
     E6 97 A5 E6 9C AC E8 AA 9E
================================================================

What you expected, right?
And, btw, using wchar gives the same result (and, of course, replacing toUTF32
by toUTF16).

-----------------------
Carlos Santander Bernal
Jul 28 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 The D docs (in Arrays.Special Array Types.Strings) say this:
 ================================
 Dynamic arrays in D suggest the obvious solution - a string is
 just a dynamic array of characters. String literals become just
 an easy way to write character arrays.
 ================================
 
 I agree that a string is a sequence of characters. However D's
 conception of string seems to be a Unicode string which is most
 decidedly NOT a sequence of characters. Unicode defines a
 Character in a sensible fashion:
 
 ================================
 (from http://www.unicode.org/glossary/)
 
 Character. (1) The smallest component of written language that
 has semantic value; ...
The section of the D doc that you quote is followed by an example and then "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format." Would it help to move those sentences to right after the one you quote instead of putting them after the example? That way users will see the UTF-8 right away and realize how Walter is using the word "character" and the type "char". Or maybe change the first sentence to "Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters in UTF-8, UTF-16 or UTF-32 format." Nipping in the bud any questions about what is meant by the word "character".
Jul 28 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
Or maybe change the first sentence to "Dynamic arrays in D suggest the
obvious solution - a string is just a dynamic array of characters in UTF-8,
UTF-16 or UTF-32 format." Nipping in the bud any questions about what is
meant by the word "character".
That works for me.
Jul 28 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Ben Hinkle wrote:

The D docs (in Arrays.Special Array Types.Strings) say this:
================================
Dynamic arrays in D suggest the obvious solution - a string is
just a dynamic array of characters. String literals become just
an easy way to write character arrays.
================================
...
 The section of the D doc that you quote is followed by an example and then
 "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format.
 dchar[] strings are in UTF-32 format."
 Would it help to move those sentences to right after the one you quote
 instead of putting it after the example? That way users will see that UTF-8
 and realize how Walter is using the words "character" and the type "char". 
No, I do not believe that would help. I think that would simply attempt to evade the issue: D has no types corresponding to Characters (and thus in fact effectively has no String support whatsoever), even though the docs also clearly state that D *wants* to provide string support.
 Or maybe change the first sentence to "Dynamic arrays in D suggest the
 obvious solution - a string is just a dynamic array of characters in UTF-8,
 UTF-16 or UTF-32 format." Nipping in the bud any questions about what is
 meant by the word "character".
But that is not true. A string is a sequence of Characters, so it is not at all an obvious solution to implement strings using arrays of encoded data in which a Character occupies anywhere from 1-4 code units and must be parsed according to the appropriate UTF standard to obtain Character data.
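For instance, even finding where one Character ends means decoding its lead byte - a sketch (stride() is just an illustrative name):
================================================================
// how many UTF-8 code units the Character starting at this byte spans
uint stride(char c) {
    if (c < 0x80)           return 1;  // u+0000 .. u+007F
    if ((c & 0xE0) == 0xC0) return 2;  // u+0080 .. u+07FF
    if ((c & 0xF0) == 0xE0) return 3;  // u+0800 .. u+FFFF
    if ((c & 0xF8) == 0xF0) return 4;  // u+10000 .. u+10FFFF
    throw new Exception("not the start of a UTF-8 sequence");
}
================================================================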
Jul 28 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8i48$civ$1 digitaldaemon.com>, parabolis says...

No I do not believe that would help. I think that would simply 
attempt to evade the issue that D has no types corresponding to 
Characters
except of course dchar
(and thus in fact effectively has no String support)
whatsoever
except of course dchar[]

Jill
Jul 28 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Berin Loritsch wrote:
 
 Another place to look, if you want to see how they are planning on 
 improving things is here:
 
 http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
 http://java.sun.com/j2se/1.5.0/docs/api/
Sadly the one thing I really want them to implement will never happen - unsigned primitive types.
Jul 26 2004
parent reply Berin Loritsch <bloritsch d-haven.org> writes:
parabolis wrote:
 Berin Loritsch wrote:
 
 Another place to look, if you want to see how they are planning on 
 improving things is here:

 http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
 http://java.sun.com/j2se/1.5.0/docs/api/
Sadly the one thing I really want them to implement will never happen - unsigned primitive types.
I think the main reason for that is the focus of Java. Java is designed for applications, while C/C++/D is designed to include systems development as well. I have not come across many instances where an unsigned primitive would be useful in the application space. And those few times where it does make a difference, the signed primitives can be used because there are no comparisons to be done.

In short, for the things Java is good at, I haven't run into the need for unsigned primitives myself.
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Berin Loritsch wrote:

 parabolis wrote:
 
 Sadly the one thing I really want them to implement will never happen 
 - unsigned primitive types.
I think the main reason for that is the focus of Java. Java is designed for applications, while C/C++/D is designed to include systems development as well. I have not come across many instances where an unsigned primitive would be useful in the application space. And those few times where it does make a difference, the signed primitives can be used because there are no comparisons to be done. In short, for the things Java is good at, I haven't run into the need for unsigned primitives myself.
================================================================
I thought I remembered reading that Java was originally designed for appliance microprocessors, but I could be wrong.

As for the unsigned primitive... Consider java.lang.String's:

     copyValueOf(char[] data, int offset, int count)

These types of functions are everywhere in the library code. Does a negative offset or count ever make sense? Almost never... So the first few lines of the code check to make sure the values are in fact non-negative. The same thing is true with reading and writing to arrays and other sequential data structures in general. Every read/write is checked to make sure the index is actually non-negative.

The reason it bothers me is that I almost never write any code using signed primitives in any other language. Being forced to declare function parameters as signed and then check that the values are not negative is a double whammy...

You probably wouldn't think about it, but does it really make sense to use a signed value in most of the for loops you write? It may seem like an odd question, but I tend to use unsigned by default and signed when I must. So when I see a for loop with a signed condition variable I wonder why someone would choose to do that.
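To illustrate in D (copyValueOf() here just mimics the Java signature - it is not a real function in either library):
================================================================
// signed parameters force the guard Java libraries are full of:
char[] copyValueOf(char[] data, int offset, int count) {
    if (offset < 0 || count < 0)
        throw new Exception("negative offset or count");
    return data[offset .. offset + count].dup;
}

// unsigned parameters make a negative offset or count impossible
// to even express; a wrapped-around value still trips the array
// bounds check:
char[] copyValueOf(char[] data, uint offset, uint count) {
    return data[offset .. offset + count].dup;
}
================================================================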
Jul 27 2004
next sibling parent reply Berin Loritsch <bloritsch d-haven.org> writes:
parabolis wrote:

 Berin Loritsch wrote:
 
 parabolis wrote:

 Sadly the one thing I really want them to implement will never happen 
 - unsigned primitive types.
I think the main reason for that is the focus of Java. Java is designed for applications, while C/C++/D is designed to include systems development as well. I have not come across many instances where an unsigned primitive would be useful in the application space. And those few times where it does make a difference, the signed primitives can be used because there are no comparisons to be done. In short, for the things Java is good at, I haven't run into the need for unsigned primitives myself.
================================================================
I thought I remembered reading that Java was originally designed for appliance microprocessors, but I could be wrong.
Ok, you are going pre-Sun involvement here...
 
 As for the unsigned primitive... Consider java.lang.String's:
 
<snip/>

Let me just say that it doesn't have a serious impact on day to day programming activities--even if it is not "ideologically pure". Most values used in day to day development fall well within the positive range of a signed value. Most folks don't even worry about whether it would be more efficient to use a byte or an int. We just use ints because the performance gains of using the smaller primitive are nowhere near the gains of improving the algorithm.

But that's just my experience (public projects I have worked on include Apache Avalon, Apache Cocoon, Apache JMeter, Apache Axis, and the D-Haven projects). I know this is a D forum, but I am including these to add weight to the argument that signed vs. unsigned arguments really don't impact most average programs all that much.

Does it affect some people? Sure. But the most common solution is either to ignore the sign or jump up to the next larger data size. It's no biggy.
Jul 27 2004
parent parabolis <parabolis softhome.net> writes:
Berin Loritsch wrote:
 parabolis wrote:
 
 I thought I remembered reading that Java was originally designed
 for appliance microprocessors, but I could be wrong.
Ok, you are going pre sun involvement here...
No, what I remember reading was that Sun wanted to dabble in 'smart' appliances... But this is a vague impression I have from an article I read 5+ years ago...
 
 As for the unsigned primitive... Consider java.lang.String's:
<snip/> Let me just say that it doesn't have a serious impact on day to day programming activities--even if it is not "ideologically pure". Most values used in day to day development fall well within the positive range of a signed value. Most folks don't even worry about whether it would be more efficient to use a byte or an int. We just use ints because the performance gains of using the smaller primitive are nowhere near the gains of improving the algorithm.
(Not that it matters but I believe using a 32 bit condition variable on a 32 bit machine is actually faster than a type with fewer bits...)
 But that's just my experience (public projects I have worked on include
 Apache Avalon, Apache Cocoon, Apache JMeter, Apache Axis, and the
 D-Haven projects). I know this is a D forum, but I am including these
 to add weight to the argument that signed vs. unsigned arguments really
 don't impact most average programs all that much.
I apologize if I seemed to be arguing that unsigned is inherently better. I was just trying to make the point that not only do I have to avoid using my default in Java, but I also have to guard against conditions that are a direct result of my not getting to use unsigned.

Perhaps a better way to make the point is to imagine a language which does not allow the use of integer types. So now fictional.lang.String has the function:

     copyValueOf(char[] data, float offset, float count)

And you have to write similar methods yourself and check to make sure the number is both integral and positive... This is an overstatement of my frustrations but I think it does illustrate what I mean.
Jul 27 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce6cl2$2j39$1 digitaldaemon.com>, parabolis says...

So when I see a for loop with a signed
condition variable I wonder why someone would choose to do that.
Well, here's one possible reason:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

is likely to be a few cycles faster than

     for (uint i=0; i<array.length; ++i) { /* blah */ }

(depending on how good the compiler is at optimizing - a black art about which I know nothing)

Jill
Jul 27 2004
next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Arcane Jill wrote:

 In article <ce6cl2$2j39$1 digitaldaemon.com>, parabolis says...
 
 
So when I see a for loop with a signed
condition variable I wonder why someone would choose to do that.
Well, here's one possible reason:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

is likely to be a few cycles faster than

     for (uint i=0; i<array.length; ++i) { /* blah */ }

(depending on how good the compiler is at optimizing - a black art about which I know nothing)
This will not hold true for all processor types, so it is generally better to code normally and trust the compiler to do the right optimization (if any). But that is a whole other topic (premature optimizations, etc.)
Jul 27 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ce6cl2$2j39$1 digitaldaemon.com>, parabolis says...
 
 
So when I see a for loop with a signed
condition variable I wonder why someone would choose to do that.
Well, here's one possible reason:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

is likely to be a few cycles faster than

     for (uint i=0; i<array.length; ++i) { /* blah */ }

(depending on how good the compiler is at optimizing - a black art about which I know nothing)

Jill
Yes, I would wonder why you wrote that and probably assume it was to save a few cycles... However more often I tend to see:

     for (int i=array.length-1; i>=0; --i) { /* blah */ }

which is not likely to be a few cycles faster than:

     for (uint i=0; i<array.length; ++i) { /* blah */ }

But the general case is this horrible version, which calls String.length() every iteration:

     for (int i=0; i<str.length(); i++) { /* blah */ }

Also you should consider that I am assuming you are pointing out that using 0 as a sentinel is faster than another number. Also don't forget that any speed benefit from using 0 as a sentinel is completely negated for a processor which does not implement unsigned addition identically to signed addition. And finally do not dismiss the unsigned alternative to your original suggestion:

     for( uint i = 0xFFFFFFFA; i != 0; i++ ) { /* blah */ }
Jul 27 2004
parent reply Sean Kelly <sean f4.ca> writes:
parabolis wrote:
 But the general case is this horrible version:

     for (int i=0; i<str.length(); i++) { /* blah */ }

   (call String.length() every iteration)
Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the unsigned version. String likely contains a size_t variable to represent string length and it would be trivial for a compiler to inline calls to the String.length() function. Unless you want to compare results on a per-instruction basis, I would likely not be too concerned with performance differences between the calls.
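That is, if String caches its length, length() should boil down to a single field load - a rough sketch:
================================================================
class String {
    private size_t len;               // updated when the data changes
    size_t length() { return len; }   // trivially inlined: one field load
}
================================================================

Sean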
Jul 27 2004
parent reply parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 parabolis wrote:
 
 But the general case is this horrible version:

     for (int i=0; i<str.length(); i++) { /* blah */ }

   (call String.length() every iteration)
Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the
Emphasis on the potential. That means that you must first be able to guarantee that any compiler that gets your code will optimize it before you should feel safe writing it.
Jul 28 2004
parent Sean Kelly <sean f4.ca> writes:
In article <ce8q1l$g6r$1 digitaldaemon.com>, parabolis says...
Sean Kelly wrote:

 parabolis wrote:
 
 But the general case is this horrible version:

     for (int i=0; i<str.length(); i++) { /* blah */ }

   (call String.length() every iteration)
Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the
Emphasis on the potential. That means that you must first be able to guarantee that any compiler that gets your code will optimize it before you should feel safe writing it.
Perhaps I've been spoiled by the latest set of C++ compilers. The tests I've seen with them tend to suggest that premature optimization often actually slows the resulting code down compared to what the optimizer can generate. But the above example was pretty straightforward. I would be surprised if any production-level D compiler didn't inline such calls.

Sean
Jul 28 2004
prev sibling parent reply Sha Chancellor <schancel pacific.net> writes:
In article <ce3k0a$1co1$1 digitaldaemon.com>,
 parabolis <parabolis softhome.net> wrote:

 DataInputStream input =
    new DataInputStream(
      new CheckedInputStream(
        new DeflaterInputStream(
          new BufferedInputStream(
            new FileInputStream("filename.ext")
          )
        )
      )
    );
Do they have a DecaffinateInputStream and a DefatInputStream class also by any chance? I heard they do.
Jul 26 2004
parent parabolis <parabolis softhome.net> writes:
Sha Chancellor wrote:

 In article <ce3k0a$1co1$1 digitaldaemon.com>,
  parabolis <parabolis softhome.net> wrote:
 
 
DataInputStream input =
   new DataInputStream(
     new CheckedInputStream(
       new DeflaterInputStream(
         new BufferedInputStream(
           new FileInputStream("filename.ext")
         )
       )
     )
   );
Do they have a DecaffinateInputStream and a DefatInputStream class also by any chance? I heard they do.
lol I have surely never used them myself. Maybe java.beans somewhere... :P
Jul 26 2004