digitalmars.D.bugs - the D crowd does bobdamn Rocket Science
- Georg Wrede (180/196) Nov 18 2005 I've spent a week studying the UTF issue, and another trying to explain
- Derek Parnell (28/28) Nov 18 2005 On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
- Georg Wrede (26/55) Nov 18 2005 If somebody wants to retain the bit pattern while storing the contents
- Sean Kelly (28/69) Nov 18 2005 I somewhat agree. Since the three char types in D really do represent
- Regan Heath (11/41) Nov 18 2005 Making the cast explicit sounds like a good compromise to me.
- Sean Kelly (8/15) Nov 18 2005 This is the comparison I was thinking of as well. Though I've never
- Regan Heath (7/20) Nov 18 2005 Nope. Kris's post has something about it, here:
- Kris (49/118) Nov 18 2005 FWIW, I agree. And it should be explicit, to avoid unseen /runtime/
- Sean Kelly (17/27) Nov 18 2005 It may well not be. A set of properties is another approach:
- Kris (21/46) Nov 18 2005 I would agree, since it thoroughly isolates the special cases:
- Regan Heath (24/67) Nov 18 2005 You have my vote here.
- Derek Parnell (44/99) Nov 18 2005 Agreed. There are times, I suppose, when the coder does not want this to
- Regan Heath (8/56) Nov 18 2005 Georg/Derek, I replied to Georg here:
- Georg Wrede (5/13) Nov 20 2005 Good suggestion!
- Regan Heath (36/65) Nov 20 2005 Ok. I have taken your reply, clicked reply, and pasted it in here :)
- Derek Parnell (39/120) Nov 20 2005 Are you suggesting that in the situation where multiple function signatu...
- Regan Heath (98/187) Nov 20 2005 I'm suggesting that an undecorated string literal could default to char[...
- Derek Parnell (8/11) Nov 20 2005 Not only do great minds think alike, so you and I! I'm starting to think...
- Kris (4/19) Nov 20 2005 Aye!
- Georg Wrede (69/146) Nov 20 2005 Would you be surprised:
- Derek Parnell (29/197) Nov 20 2005 Surprised about the two conversions? No, I just said that's what it woul...
- Georg Wrede (7/9) Nov 20 2005 Whew, I was just starting to wonder what to do. :-)
- Regan Heath (16/24) Nov 20 2005 I'm interested in both your opinions on:
- Bruno Medeiros (6/13) Nov 20 2005 It should? Why?, what is the problem of using the toUTFxx functions?
- Georg Wrede (5/17) Nov 20 2005 Nothing wrong. But cast should not do the union thing.
- Derek Parnell (7/16) Nov 20 2005 Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong...
- Bruno Medeiros (11/29) Nov 21 2005 No, we don't. But the case is different: between primitive numbers the
- Derek Parnell (12/37) Nov 21 2005 Why? If documented, the user can be prepared.
- Don Clugston (39/81) Nov 21 2005 I would think so. I'd define trivial as: "the assembly code doesn't have...
- Oskar Linde (13/24) Nov 21 2005 String literals already work like this. :)
- Sean Kelly (5/9) Nov 21 2005 Really? I tested this a few days ago and it seemed like literals larger...
- Oskar Linde (20/28) Nov 21 2005 You are right, large integers are automatically treated as longs, but to...
- Sean Kelly (4/7) Nov 21 2005 This seems reasonable though, since it's really a matter of precision
- Bruno Medeiros (16/42) Nov 23 2005 A good question indeed. I was thinking something equivalent to what Don
I've spent a week studying the UTF issue, and another trying to explain it. Some progress, but not enough. Either I'm a bad explainer (hence, skip dreams of returning to teaching CS when I'm old), or, this really is an intractable issue. (Hmmmm, or everybody else is right, and I need to get on the green pills.)

My almost final try:

Let's imagine (please, pretty please!), that Bill Gates descends from Claudius. This would invariably lead to cardinals being represented as Roman Numerals in computer user interfaces, ever since MSDOS times. Then we'd have that as an everyday representation of integers.

Obviously we'd have the functions r2a and a2r for changing string representations between Arabic ("1234...") and Roman ("I II III ...") numerals.

To make this example work we also need to imagine that we need a notation for roman numerals within strings. Let this notation be: "\R" as in "\RXIV" (to represent the number 14).

(Since this in reality is not needed, we have to (again) imagine that it is, for Historical Reasons -- the MSDOS machines sometimes crashed when there were too many capital X on a command line, but at the time nobody found the reason, so the \R notation was created as a [q&d] fix.)

So, since it is politically incorrect to write "December 24", we have to write "December XXIV", but since the ancient bug lurks if this file gets transferred to "a major operating system", we have to be careful and write "December \RXXIV".

Now, programmers are lazy, and they end up writing "\Rxxiv" and getting all kinds of error messages like "invalid string literal". So a few anarchist programmers decided to implement the possibility of writing lower case roman numerals, even if the Romans themselves disapproved of it from the beginning.

The prefix \r is already taken, so they had two choices: either make computers smart and let them understand \Rxxiv, but that would risk Bill getting angry. So they needed another prefix. They chose \N (this choice is a unix inside joke).

---

Then a compiler guru (who happened to descend from Asterix the Gaul) decided to write a new language. In the midst of it all, he stumbled upon the Roman issue. Being diligent (which I've wanted to be all my life too but never succeeded (go ask my mother, my teacher, my bosses)), he decided to implement strings in a non-breakable way.

So now we have: char[], Rchar[] and Nchar[], the latter two being for situations where the string might contain [expletive deleted] roman values.

The logical next step was to decorate the strings themselves, so that the computer can unambiguously know what to assign where. Therefore we now have "", ""R and ""N kinds of strings. Oh, and to be totally unambiguous and symmetric, the redundant ""C was introduced to explicitly denote the non-R, non-N kind of string, in case such might be needed some day.

Now, being modern, the guru had already made the "" kinds of strings Roman Proof, with the help of Ancient Gill, an elusive but legendary oracle.

---

The III wise men had also become aware of this problem space. Since everything in modern times grows exponentially, the letters X, C and M (for ten, hundred and thousand) would sooner or later need to be accompanied by letters for larger numbers. For a million M was already taken, so they chose N. And then G, T, P, E, Z, Y for giga, tera, peta, exa, zetta and yotta. Then the in-betweens had to be worked out too, for 5000, 5000000 etc. 50 was already L and 500 was D..... whatever, you get the picture.
.-) So, they decided that, to make a string spec that lasts "forever", the new string had to be stored with 32 bits per character. (Since exponential growth (it's true, just look how Bill's purse grows!) is reality, they figured the numerals would run out of letters, and that's why new glyphs would have to be invented eventually, ad absurdum. 32 bits would carry us until the end of the Universe.) They called it N. This was the official representation of strings that might contain roman numerals way into the future.

Then some other guys thought "Naaw, t's nuff if we have strings that take us till the day we retire, so 16 bits oughtta be plenty." That became the R string. Practical Bill adopted the R string.

Later the other guys had to admit that their employer might catch the "retiring plot", so they amended the R string to SOMETIMES contain 32 bits. Now, Bill, practical as he is, ignored the issue (probably on the same "retiring plot"). And what Bill does, defines what is right (at least with suits, and hey, they rule -- as opposed to us geeks).

---

Luckily II blessed-by-Bob Sourcerers (notice the spelling) thought the R and N stuff was wasting space, was needed only occasionally, and was in general cumbersome. Everybody's R but Bill's had to be able to handle 32 bits every once in a while, and the N stuff really was overkill.

They figured "The absolute majority of crap needs 7 bits, the absolute majority of the rest needs 9 bits, and the absolute majority of the rest needs 12 bits. So there's pretty little left after all this -- however, since we are blessed-by-Bob, and we do stuff properly, we won't give up until we can handle all this, and handle it gracefully."

They decided that their string (which they christened ""C) has to be compact, handle 7-bit stuff as fast as non-roman-aware programs do, 9 bit stuff almost as fast as the R programs, and it has to be lightning fast to convert to and from.

Also, they wanted the C strings to be usable as much as possible by old library routines, so for example, the old routines should be able to search and sort their strings without upgrades. And they knew that strings do get chopped, so they designed them so that you can start wherever, and just by looking at the particular octet, you'd know whether it's proper to chop the string there. And if it isn't, it should be trivial to look a couple of octets forward (or even back), and just immediately see where the next breakable place is. Ha, and they wanted the C strings to be endianness-proof!!

The II were already celebrities with the Enlightened, so it was decided that the C string would be standard on POSIX systems. Smart crowd.

*** if I don't see light here, I'll write some more one day ***

---

If I write the following strings here, and somebody pastes them in his source code,

    "abracadabra"
    "räyhäpäivä"
    "ШЖЯЮЄШ"

compiles his D program, and runs it, what should (and probably will!) happen, is that the program output looks like the strings here. If the guy has never heard of our Unbelievable utf-discussion, he probably is never aware that some UTF or other crap is or has been involved. (Hell, I've used Finnish letters in my D source code all the time, and never thought anything of it.)

After having seen this discussion, he gets nervous, and quickly changes all his strings so that they are ""c, ""w and ""d decorated. From then on, he hardly dares to touch strings that contain non US content. Like us here.

The interesting thing is, did I originally write them in UTF-8, UTF-16 or UTF-32?
How many times were they converted between these widths while travelling from my keyboard to this newsgroup to his machine to the executable to the output file? Probably they've been in UTF-7 too, since they've gone through mail transport, which still is from the previous millennium.

---

At this point I have to ask, are there any folks here who do not believe that the following proves anything:

  Xref: digitalmars.com digitalmars.D.bugs:5436 digitalmars.D:29904
  Xref: digitalmars.com digitalmars.D.bugs:5440 digitalmars.D:29906

  Those show that the meaning of the program does not change when the source code is transliterated to a different UTF encoding. They also show that editing code in different UTF formats, and inserting "foreign" text even directly to string literals, does survive intact when the source file is converted between different UTF formats. Further, they show that decorating a string literal to c, w, or d, does not change the interpretation of the contents of the string, whether it contains "foreign" literals directly inserted, or not. Most permutations of the above 3 paragraphs were tested.

(Oh, correction to the last line: "_All_ cross permutations of the 3 paragraphs were tested.")

Endianness was not considered, but hey, with wrong endianness, either your text editor can't read the file to begin with, or if it can, then you _can_ edit the strings with even more "foreign characters" and still be ok!

I hereby declare, that it makes _no_ difference whatsoever in which width a string literal is stored, as long as the compiler implicitly casts it when it gets used.

I hereby also declare, that implicit casts of strings (be they literal or heap or stack allocated) carry no risks whatsoever. Period.

I hereby declare that string literal decorations are not only unneeded, they create an enormous amount of confusion. (Even we are totally bewildered, so _every_ newcomer to D will be that too.) There are _no_ upsides to them.

I hereby declare that it should be illegal to implicitly convert char or wchar to any integer type. Further, it should be illegal to even cast char or wchar to any integer type. The cast should have to be via a cast to void! (I.e. difficult but possible.) With dchar even implicit casts are ok. Cast from char or wchar via dchar should be illegal. (Trust me, illegal. While at the same time even implicit casts from char[] and wchar[] to each other and to and from dchar[] are ok!) Casts between char, wchar and dchar should be illegal, unless via void.

A good programmer would use the same width all over the place. An even better programmer would typedef his own anyway. If an idiot has his program convert width at every other assignment, then he'll have other idiocies in his code too. He should go to VB.

---

But some other things are (both now, and even if we fix the above) downright hazardous, and should cause a throw, and in non-release programs a runtime assert failure:

Copying any string to a fixed length array, _if_ the array is either wchar[] or char[]. (dchar[] is ok.) The (throw or) assert should fail if the copied string is not breakable where the receiving array gets full.

    whatever foo = "ää";   // foo and "" can be any of c/w/d.
    char[3] barf = foo;    // write cast if needed
    // Odd number of chars in barf, breaks ää wrong. "ää" is 4 bytes.

Same goes for wchar[3].

---

Once we agree on this, then it's time to see if some more AJ stuff is left to fix. But not before.
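As a small D sketch of the "breakable place" property and the fixed-length-array hazard above (the helper name safeCutPoint is invented for illustration): UTF-8 continuation bytes always look like 10xxxxxx, so stepping back from a proposed cut point to the nearest non-continuation byte finds a legal boundary.

    size_t safeCutPoint(char[] s, size_t want)
    {
        if (want >= s.length)
            return s.length;
        // back up while we sit on a continuation byte (10xxxxxx)
        while (want > 0 && (s[want] & 0xC0) == 0x80)
            --want;
        return want;
    }

    void main()
    {
        char[] src = "ää".dup;              // 4 bytes in UTF-8
        assert(safeCutPoint(src, 3) == 2);  // byte 3 would split the second ä
        assert(safeCutPoint(src, 2) == 2);  // byte 2 is already a boundary
    }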
Nov 18 2005
On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote: [snip] It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does. dchar[] y; wchar[] x; x = cast(wchar[])y; does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that. x = std.utf.toUTF16(y); However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls? dchar y; foo("Some Test Data"); // Which one now? Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call? D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string! -- Derek Parnell Melbourne, Australia 18/11/2005 10:42:31 PM
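For reference, a minimal D sketch of the distinction described here, assuming the Phobos std.utf of the time: the cast merely repaints the bytes (rescaling the length), while toUTF16 actually re-encodes.

    import std.utf;

    void main()
    {
        dchar[] y = "räyhä"d.dup;            // five UTF-32 code units
        wchar[] painted = cast(wchar[]) y;   // same bytes read as UTF-16: garbage
        auto converted = std.utf.toUTF16(y); // a real conversion
        assert(painted.length == 2 * y.length);
        assert(converted.length == 5);       // five BMP characters
    }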
Nov 18 2005
Derek Parnell wrote:On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote: [snip] It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does. dchar[] y; wchar[] x; x = cast(wchar[])y; does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?Both. And everywhere else (in case we forgot to name some situation).dchar y; foo("Some Test Data"); // Which one now?Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain. I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses! (Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not. This also I try to explain in the other posts. (The issue and concepts are crystal clear, maybe it's just me not being able to describe them with the right words. Not to you, or Walter, or the others?) We are all seeing bogeymen all over the place, where there are none. It's like my kids this time of the year, when it is always dark behind the house, under the bed, and on the attic. Aaaaaaaaaaah, now I got it. It's been Halloween again. Sigh!
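A sketch of the union workaround mentioned at the top of this post, for deliberately keeping the bit pattern (illustrative D only; note that, unlike an array cast, the union view does not even rescale the length):

    union RawView
    {
        wchar[] asUtf16;
        dchar[] asUtf32;
    }

    void main()
    {
        RawView v;
        v.asUtf32 = "abc"d.dup;
        wchar[] painted = v.asUtf16;   // same ptr/length pair, nothing converted
        assert(painted.ptr == cast(wchar*) v.asUtf32.ptr);
        assert(painted.length == 3);   // still 3 "elements", now read as UTF-16
    }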
Nov 18 2005
Georg Wrede wrote:Derek Parnell wrote:I somewhat agree. Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK). On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies. Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly. And either way, I would love to have UTF conversion for strings supported in-language. It does make some sense, given that the three encodings exist as distinct value types in D already.On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote: [snip] It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does. dchar[] y; wchar[] x; x = cast(wchar[])y; does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.I disagree. While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?Both. And everywhere else (in case we forgot to name some situation).True.D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.This also I try to explain in the other posts. (The issue and concepts are crystal clear, maybe it's just me not being able to describe them with the right words. Not to you, or Walter, or the others?) We are all seeing bogeymen all over the place, where there are none. It's like my kids this time of the year, when it is always dark behind the house, under the bed, and on the attic.What I like about the current behavior (no implicit conversion), is that it makes it readily obvious where translation needs to occur and thus makes it easy for the programmer to decide if that seems appropriate. That said, I agree that the overall runtime cost is likely consistent between a program with and without implicit conversion--either the API calls will have overloads for all types and thus allow you to avoid conversion, or they will only support one type and require conversion if you've standardized on a different type. It may well be that concerns over implicit conversion are unfounded, but I'll have to give the matter some more thought before I can say one way or the other. My current experience with D isn't such that I've had to deal with this particular issue much. Sean
Nov 18 2005
On Fri, 18 Nov 2005 09:48:51 -0800, Sean Kelly <sean f4.ca> wrote:Georg Wrede wrote:Making the cast explicit sounds like a good compromise to me. The way I see it, casting from int to float is similar to casting from char[] to wchar[]. The data must be converted from one form to another for it to make sense; you'd never 'paint' an 'int' as a 'float', it would be meaningless, and the same is true for char[] to wchar[]. The correct way to paint data as char[], wchar[] or dchar[] is to paint a byte[] (or ubyte[]). In other words, if you have some data of unknown encoding you should be reading it into byte[] (or ubyte[]) and then painting as the correct type, once it is known. ReganDerek Parnell wrote:I somewhat agree. Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK). On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies. Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly. And either way, I would love to have UTF conversion for strings supported in-language. It does make some sense, given that the three encodings exist as distinct value types in D already.On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote: [snip] It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does. dchar[] y; wchar[] x; x = cast(wchar[])y; does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.
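A small D sketch of the pattern Regan describes, assuming std.utf.validate as found in Phobos (the helper name loadAsUtf8 is invented): unknown data comes in as ubyte[], and is painted as char[] only once the encoding claim can be checked.

    import std.utf;

    char[] loadAsUtf8(ubyte[] raw)
    {
        char[] s = cast(char[]) raw;   // paint the bytes; no copying, no conversion
        std.utf.validate(s);           // throws if the bytes are not valid UTF-8
        return s;
    }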
Nov 18 2005
Regan Heath wrote:Making the cast explicit sounds like a good compromise to me. The way I see it casting from int to float is similar to casting from char[] to wchar[]. The data must be converted from one form to another for it to make sense, you'd never 'paint' and 'int' as a 'float' it would be meaningless, the same is true for char[] to wchar[].This is the comparison I was thinking of as well. Though I've never tried casting an array of ints to floats. I suspect it doesn't work, does it? My only other reservation is that the behavior could not be preserved for casting char types, and unlike narrowing conversions (such as float to int), meaning can't even be preserved in narrowing char conversions (such as wchar to char). Sean
Nov 18 2005
On Fri, 18 Nov 2005 15:29:23 -0800, Sean Kelly <sean f4.ca> wrote:Regan Heath wrote:Nope. Kris's post has something about it, here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/30158Making the cast explicit sounds like a good compromise to me. The way I see it casting from int to float is similar to casting from char[] to wchar[]. The data must be converted from one form to another for it to make sense, you'd never 'paint' and 'int' as a 'float' it would be meaningless, the same is true for char[] to wchar[].This is the comparison I was thinking of as well. Though I've never tried casting an array of ints to floats. I suspect it doesn't work, does it?My only other reservation is that the behavior could not be preserved for casting char types, and unlike narrowing conversions (such as float to int), meaning can't even be preserved in narrowing char conversions (such as wchar to char).Indeed. Due to the fact that the meaning (the "character") may be represented as 1 wchar, but 2 char's. The thread above has some more interesting stuff about this. Regan
Nov 18 2005
"Sean Kelly" <sean f4.ca> wrote in messageGeorg Wrede wrote:Amen to that.Derek Parnell wrote:I somewhat agree. Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK). On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies.On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote: [snip] It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does. dchar[] y; wchar[] x; x = cast(wchar[])y; does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly. And either way, I would love to have UTF conversion for strings supported in-language. It does make some sense, given that the three encodings exist as distinct value types in D already.FWIW, I agree. And it should be explicit, to avoid unseen /runtime/ conversion (the performance issue). But, I have a feeling that cast([]) is not the right approach here? One reason is that structs/classes can have only one opCast() method. perhaps there's another approach for such syntax? That's assuming, however, that one does not create a special-case for char[] types (per above inconsistencies).Amen to that, too!I disagree. While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?Both. And everywhere else (in case we forgot to name some situation).Right on!True.D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.This also I try to explain in the other posts. (The issue and concepts are crystal clear, maybe it's just me not being able to describe them with the right words. Not to you, or Walter, or the others?) We are all seeing bogeymen all over the place, where there are none. 
It's like my kids this time of the year, when it is always dark behind the house, under the bed, and on the attic.What I like about the current behavior (no implicit conversion), is that it makes it readily obvious where translation needs to occur and thus makes it easy for the programmer to decide if that seems appropriate.That said, I agree that the overall runtime cost is likely consistent between a program with and without implicit conversion--either the API calls with have overloads for all types and thus allow you to avoid conversion, or they will only support one type and require conversion if you've standardized on a different type.As long as the /runtime/ penalties are clear within the code design (not quietly 'padded' by the compiler), that makes sense.It may well be that concerns over implicit convesion is unfounded, but I'll have to give the matter some more thought before I can say one way or the other. My current experience with D isn't such that I've had to deal with this particular issue much.I'm afraid I have. Both in Mango.io and in the ICU wrappers. While there are no metrics for such things (that I'm aware of) my gut feel was that 'hidden' conversion would not be a good thing. Of course, that depends upon the "level" one is talking about: High level :: slow to medium performance Low level :: high performance A lot of folks just don't care about performance (oh, woe!) and that's fine. But I think it's worth keeping the distinction in mind when discussing this topic. I'd be a bit horrified to find the compiler adding hidden transcoding at the IO level (via Mango.io for example). But then, I'm a dinosaur. So. That doesn't mean that the language should not perhaps support some sugar for such operations. Yet the difficulty there is said sugar would likely bind directly to some internal runtime support (such as utf.d), which may not be the most appropriate for the task (it tends to be character oriented, rather than stream oriented). In addition, there's often a need for multiple return-values from certain types of transcoding ops. I imagine that would be tricky via such sugar? Maybe not. Transcoding is easy when the source content is reasonably small and fully contained within block of memory. It quickly becomes quite complex when streaming instead. That's really worth considering. To illustrate, here's some of the transcoder signatures from the ICU code: uint function (Handle, wchar*, uint, void*, uint, inout Error) ucnv_toUChars; uint function (Handle, void*, uint, wchar*, uint, inout Error) ucnv_fromUChars; Above are the simple ones, where all of the source is present in memory. void function (Handle, void**, void*, wchar**, wchar*, int*, ubyte, inout Error) ucnv_fromUnicode; void function (Handle, wchar**, wchar*, void**, void*, int*, ubyte, inout Error) ucnv_toUnicode; void function (Handle, Handle, void**, void*, void**, void*, wchar*, wchar*, wchar*, wchar*, ubyte, ubyte, inout Error) ucnv_convertEx; And those are the ones for handling streaming; note the double pointers? That's so one can handle "trailing" partial characters. Non trival :-) Thus, I'd suspect it may be appropriate for D to add some transcoding sugar. But it would likely have to be highly constrained (per the simple case). Is it worth it?
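As a rough D sketch of the "trailing partial character" problem mentioned above, for the UTF-8 case only (the helper name completePrefix is invented, and malformed input is only handled crudely): given a block read from a stream, it reports how many leading bytes form complete code points, so the remainder can be carried into the next read.

    size_t completePrefix(ubyte[] block)
    {
        if (block.length == 0)
            return 0;
        size_t i = block.length;
        size_t back = 0;
        // back up over at most three continuation bytes (10xxxxxx)
        while (i > 0 && back < 3 && (block[i - 1] & 0xC0) == 0x80)
        {
            --i;
            ++back;
        }
        if (i == 0)
            return block.length;   // nothing but continuation bytes; punt
        ubyte lead = block[i - 1];
        size_t need = lead < 0x80           ? 1
                    : (lead & 0xE0) == 0xC0 ? 2
                    : (lead & 0xF0) == 0xE0 ? 3
                    : (lead & 0xF8) == 0xF0 ? 4
                    : 1;                      // invalid lead byte; pass it through
        // keep everything if the trailing sequence is complete, else cut before it
        return (block.length - (i - 1)) >= need ? block.length : i - 1;
    }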
Nov 18 2005
Kris wrote:But, I have a feeling that cast([]) is not the right approach here? One reason is that structs/classes can have only one opCast() method. perhaps there's another approach for such syntax? That's assuming, however, that one does not create a special-case for char[] types (per above inconsistencies).It may well not be. A set of properties is another approach:

    char[] c = "abc";
    dchar[] d = c.toDString();

but this would still only work for arrays. Conversion between char types still only makes sense if they are widening conversions. Perhaps I'm simply becoming spoiled by having so much built into D. This may well be simply a job for library code.Transcoding is easy when the source content is reasonably small and fully contained within block of memory. It quickly becomes quite complex when streaming instead. That's really worth considering.Good point. One of the first things I had to do for readf/unFormat was rewrite std.utf to accept delegates. There simply isn't any other good way to ensure that too much data isn't read from the stream by mistake. Thus, I'd suspect it may be appropriate for D to add some transcoding sugar.But it would likely have to be highly constrained (per the simple case). Is it worth it?Probably not :-) But I suppose it's worth discussing. I do like the idea of not having to rely on library code to do simple string transcoding, though this seems of limited use given the above concerns. Sean
Nov 18 2005
"Sean Kelly" <sean f4.ca> wrote ...Kris wrote:I would agree, since it thoroughly isolates the special cases: char[].utf16 char[].utf32 wchar[].utf8 wchar[].utf32 dchar[].utf8 dchar[].utf16 Generics might require the addition of 'identity' properties, like char[].utf8 ?But, I have a feeling that cast([]) is not the right approach here? One reason is that structs/classes can have only one opCast() method. perhaps there's another approach for such syntax? That's assuming, however, that one does not create a special-case for char[] types (per above inconsistencies).It may well not be. A set of properties is another approach: char[] c = "abc"; dchar[] d = c.toDString();but this would still only work for arrays. Conversion between char types still only make sense if they are widening conversions.Aye. If the above set of properties were for arrays only, then one may be able to make a case that it doesn't break consistency. There might be a second, somewhat distinct, set: char.utf16 char.utf32 wchar.utf32 I think your approach is far more amenable than cast(), Sean. And properties don't eat up keyword space <g>Yeah. It would be limited (e.g. no streaming), and would likely be implemented using the heap. Even then, as you note, it could be attractive to some.Transcoding is easy when the source content is reasonably small and fully contained within block of memory. It quickly becomes quite complex when streaming instead. That's really worth considering.Good point. One of the first things I had to do for readf/unFormat was rewrite std.utf to accept delegates. There simply isn't any other good way to ensure that too much data isn't read from the stream by mistake. > Thus, I'd suspect it may be appropriate for D to add some transcoding sugar.But it would likely have to be highly constrained (per the simple case). Is it worth it?Probably not :-) But I suppose it's worth discussing. I do like the idea of not having to rely on library code to do simple string transcoding, though this seems of limited use given the above concerns.
Nov 18 2005
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede <georg.wrede nospam.org> wrote:Derek Parnell wrote:You have my vote here.On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote: [snip] It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does. dchar[] y; wchar[] x; x = cast(wchar[])y; does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.The main argument against this last time it was proposed was that an expression containing several char[] types would implicitly convert any number of times during the expression. This transcoding would be inefficient, and silent, and thus bad, eg.

    char[] a = "this is a test string";
    wchar[] b = "regan was here";
    dchar[] c = "georg posted this thing";
    char[] d = c[0..7] ~ b[6..10] ~ a[10..14] ~ c[20..$] ~ a[14..$] ~ c[16..17];
    //supposed to be: georg was testing strings :)

How many times does the above transcode using the current implicit conversion rules? (The last time this topic was aired it branched into a discussion about how these rules could change to improve the situation.)However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?Both. And everywhere else (in case we forgot to name some situation).Which is what it does currently, right?dchar y; foo("Some Test Data"); // Which one now?Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!I'm still not convinced. I will raise my issues in the later posts you promise.(Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)I tend to agree here but as I say above, last time this aired people complained about this very thing.Ok.Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.Ok. ReganD has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not. This also I try to explain in the other posts.
Nov 18 2005
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:Derek Parnell wrote:Agreed. There are times, I suppose, when the coder does not want this to happen, but those could be coded with a cast(byte[]) to avoid that.On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote: [snip] It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does. dchar[] y; wchar[] x; x = cast(wchar[])y; does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.We have problems with inout and out parameters.

    foo(inout wchar x) {}
    dchar[] y = "abc";
    foo(y);

In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ...

    auto wchar[] temp;
    temp = toUTF16(y);
    foo(temp);
    y = toUTF32(temp);

However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?Both. And everywhere else (in case we forgot to name some situation).Yes, that's what happens now.dchar y; foo("Some Test Data"); // Which one now?Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one. If we have automatic conversion and it chooses one at random, there is no way of knowing that it's doing the 'right' thing to the data we give it. In my opinion, it's a coding error and the coder needs to provide more information to the compiler.(Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called. If the coder had written foo("Some Test Data"w); then it's pretty clear which function was intended. For example, D rightly complains when the similar situation occurs with the various integers.

    void foo(long x) {}
    void foo(int x) {}
    void main()
    {
        short y;
        foo(y);
    }

If D did implicit conversions and chose one at random I'm sure we would complain.Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.
-- Derek Parnell Melbourne, Australia 19/11/2005 8:59:16 AMD has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.
Nov 18 2005
On Sat, 19 Nov 2005 09:19:28 +1100, Derek Parnell <derek psych.ward> wrote:Georg/Derek, I replied to Georg here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5587 saying essentially the same things as Derek has above. I reckon we combine these threads and continue in this one, as opposed to the one I linked above. I or you can link the other thread to here with a post if you're in agreement. ReganI'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!I disagree. Without know what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one. If we have automatic conversion and it choose one at random, there is no way of knowing that its doing the 'right' thing to the data we give it. In my opinion, its a coding error and the coder need to provide more information to the compiler.(Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called. If the coder had written foo("Some Test Data"w); then its pretty clear which function was intended. For example, D rightly complains when the similar situation occurs with the various integers. void foo(long x) {} void foo(int x) {} void main() { short y; foo(y); } If D did implicit conversions and chose one at random I'm sure we would complain.Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.
Nov 18 2005
Regan Heath wrote:Georg/Derek, I replied to Georg here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5587 saying essentially the same things as Derek has above. I reckon we combine these threads and continue in this one, as opposed to the one I linked above. I or you can link the other thread to here with a post if you're in agreement.Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.
Nov 20 2005
On Sun, 20 Nov 2005 17:28:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:Regan Heath wrote:Ok. I have taken your reply, clicked reply, and pasted it in here :) (I hope this post isn't confusing for anyone) ------------------------- Copied from: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5607 ------------------------- On Sun, 20 Nov 2005 17:17:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:Georg/Derek, I replied to Georg here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5587 saying essentially the same things as Derek has above. I reckon we combine these threads and continue in this one, as opposed to the one I linked above. I or you can link the other thread to here with a post if you're in agreement.Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.Regan Heath wrote:You're right. The problem is not limited to string literals, integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why... http://www.digitalmars.com/d/lex.html#integerliteral (see "The type of the integer is resolved as follows") In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?) If so and if I can accept the behaviour for integer literals why can't I for string literals? The only logical reason I can think of for not accepting it, is if there exists a difference between integer literals and string literals which affects this behaviour. I can think of differences, but none which affect the behaviour. So, it seems that if I accept the risk for integers, I have to accept the risk for string literals too. --- Note that string promotion should occur just like integer promotion does, eg: void foo(long i) {} foo(5); //calls foo(long) with no error void foo(wchar[] s) {} foo("test"); //should call foo(wchar[]) with no error this behaviour is current and should not change. ReganOn Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede <georg.wrede nospam.org> wrote: Lets assume there is 2 functions of the same name (unintentionally), doing different things. In that source file the programmer writes: write("test"); DMD tries to choose the storage type of "test" based on the available overloads. There are 2 available overloads X and Y. It currently fails and gives an error. If instead it picked an overload (X) and stored "test" in the type for X, calling the overload for X, I agree, there would be _absolutely no problems_ with the stored data. BUT the overload for X doesn't do the same thing as the overload for Y.Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?
Nov 20 2005
On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote: Ok, I'll comment but only 'cos you asked ;-)On Sun, 20 Nov 2005 17:28:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:Are you suggesting that in the situation where multiple function signatures could possibly match an undecorated string literal, that D should assume that the string literal is actually in utf-8 format, and if that then fails to find a match, it should signal an error?Regan Heath wrote:Ok. I have taken your reply, clicked reply, and pasted it in here :) (I hope this post isn't confusing for anyone) ------------------------- Copied from: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5607 ------------------------- On Sun, 20 Nov 2005 17:17:33 +0200, Georg Wrede <georg.wrede nospam.org> wrote:Georg/Derek, I replied to Georg here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5587 saying essentially the same things as Derek has above. I reckon we combine these threads and continue in this one, as opposed to the one I linked above. I or you can link the other thread to here with a post if you're in agreement.Good suggestion! I actually intended that, but forgot about it while reading and thinking. :-/ So, the reply is to it directly.Regan Heath wrote:You're right. The problem is not limited to string literals, integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why... http://www.digitalmars.com/d/lex.html#integerliteral (see "The type of the integer is resolved as follows") In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede <georg.wrede nospam.org> wrote: Lets assume there is 2 functions of the same name (unintentionally), doing different things. In that source file the programmer writes: write("test"); DMD tries to choose the storage type of "test" based on the available overloads. There are 2 available overloads X and Y. It currently fails and gives an error. If instead it picked an overload (X) and stored "test" in the type for X, calling the overload for X, I agree, there would be _absolutely no problems_ with the stored data. BUT the overload for X doesn't do the same thing as the overload for Y.Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?If so and if I can accept the behaviour for integer literals why can't I for string literals? The only logical reason I can think of for not accepting it, is if there exists a difference between integer literals and string literals which affects this behaviour. I can think of differences, but none which affect the behaviour. So, it seems that if I accept the risk for integers, I have to accept the risk for string literals too.What might be a relevant point about this is that we are trying to talk about strings, but as far as D is concerned, we are really talking about arrays (of code-units). And for arrays, the current D behaviour is self-consistent. If however, D supported a true string data type, then a great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. 
Imagine the problems we would have if integers were regarded as arrays of bits by the compiler!--- Note that string promotion should occur just like integer promotion does, eg: void foo(long i) {} foo(5); //calls foo(long) with no errorBut what happens when ... void foo(long i) {} void foo(short i) {} foo(5); //calls ???void foo(wchar[] s) {} foo("test"); //should call foo(wchar[]) with no error this behaviour is current and should not change.Agreed. void foo(wchar[] s) {} void foo(char[] s) {} foo("test"); //should call ??? I'm now thinking that it should call the char[] signature without error. But in this case ... void foo(wchar[] s) {} void foo(dchar[] s) {} foo("test"); //should call an error. If we had a generic string type we'd probably just code .... void foo(string s) {} foo("test"); // Calls the one function foo("test"d); // Also calls the one function D would convert to an appropriate UTF format silently before (and after calling). -- Derek (skype: derek.j.parnell) Melbourne, Australia 21/11/2005 10:41:35 AM
Nov 20 2005
On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek psych.ward> wrote:On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote: Ok, I'll comment but only 'cos you asked ;-)Thanks <g>.I'm suggesting that an undecorated string literal could default to char[] similar to how an undecorated integer literal defaults to 'int' and that the risk created by that behaviour would be no different in either case.Are you suggesting that in the situation where multiple function signatures could possibly match an undecorated string literal, that D should assume that the string literal is actually in utf-8 format, and if that then fails to find a match, it should signal an error?Regan Heath wrote:You're right. The problem is not limited to string literals, integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why... http://www.digitalmars.com/d/lex.html#integerliteral (see "The type of the integer is resolved as follows") In essence integer literals _default_ to 'int' unless another type is specified or required. This suggested change does that, and nothing else? (can anyone see a difference?)On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede <georg.wrede nospam.org> wrote: Lets assume there is 2 functions of the same name (unintentionally), doing different things. In that source file the programmer writes: write("test"); DMD tries to choose the storage type of "test" based on the available overloads. There are 2 available overloads X and Y. It currently fails and gives an error. If instead it picked an overload (X) and stored "test" in the type for X, calling the overload for X, I agree, there would be _absolutely no problems_ with the stored data. BUT the overload for X doesn't do the same thing as the overload for Y.Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not each already? Isn't this a problem with overloading in general, and not with UTF?I'm not sure it makes any difference that char[] is an array, if you imagine that we removed the current integer literal rules, here: http://www.digitalmars.com/d/lex.html#integerliteral (see "The type of the integer is resolved as follows") then short/int/long would exhibit the same problem that char[]/wchar[]/dchar[] does, this would be illegal: void foo(short i) {} void foo(int i) {} void foo(long i) {} foo(5); requiring: foo(5s); //to call short version foo(5i); //to call int version foo(5l); //to call long version or: foo(cast(short)5); //to call short version foo(cast(int)5); //to call int version foo(cast(long)5); //to call long version just like char[]/wchar[]/dchar[] does today.If so and if I can accept the behaviour for integer literals why can't I for string literals? The only logical reason I can think of for not accepting it, is if there exists a difference between integer literals and string literals which affects this behaviour. I can think of differences, but none which affect the behaviour. So, it seems that if I accept the risk for integers, I have to accept the risk for string literals too.What might be a relevant point about this is that we are trying to talk about strings, but as far as D is concerned, we are really talking about arrays (of code-units). And for arrays, the current D behaviour is self-consistent. If however, D supported a true string data type, then a great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. 
Imagine the problems we would have if integers were regarded as arrays of bits by the compiler!You get: test.d(8): function test.foo called with argument types: (int) matches both: test.foo(short) and: test.foo(long) which is correct IMO because 'int' can be promoted to both 'short' and 'long' with equal preference. ("that's the long and short of it" <g>)--- Note that string promotion should occur just like integer promotion does, eg: void foo(long i) {} foo(5); //calls foo(long) with no errorBut what happens when ... void foo(long i) {} void foo(short i) {} foo(5); //calls ???That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviour, which I have blithely accepted for years now - perhaps due to lack of knowledge when I first started programming, and now because I am used to it and it seems natural)void foo(wchar[] s) {} foo("test"); //should call foo(wchar[]) with no error this behaviour is current and should not change.Agreed. void foo(wchar[] s) {} void foo(char[] s) {} foo("test"); //should call ??? I'm now thinking that it should call the char[] signature without error.But in this case ... void foo(wchar[] s) {} void foo(dchar[] s) {} foo("test"); //should call an error.Agreed, just like the integer literal example above.If we had a generic string type we'd probably just code .... void foo(string s) {} foo("test"); // Calls the one function foo("test"d); // Also calls the one function D would convert to an appropriate UTF format silently before (and after calling).It's an interesting idea. I was thinking the same thing recently, why not have 1 super-type "string" and have it convert between the format required when asked eg. //writing strings void c_function_call(char *string) {} void os_function_call(wchar[] string) {} void write_to_file_in_specific_encoding(dchar[] string) {} string a = "test"; //"test" is stored in application defined default internal representation (more on this later) c_function_call(a.utf8); os_function_call(a.utf16); write_to_file_in_specific_encoding(a.utf32); normal_d_function(a); //reading strings void read_from_file_in_specific_encoding(inout dchar[]) {} string a; read_from_file_in_specific_encoding(a.utf32); or, perhaps we can go one step further and implicitly transcode where required, eg: c_function_call(a); os_function_call(a); write_to_file_in_specific_encoding(a); read_from_file_in_specific_encoding(a); The properties (Sean's idea, thanks Sean) utf8, utf16, and utf32 would be of type char[], wchar[] and dchar[] respectively. (so, these types remain) Slicing string would give characters as opposed to code units (parts of characters). I still believe the only times you care which encoding it is in, and/or should be transcoding, is on input and output, and for performance reasons you do not want it converting all over the place. To address performance concerns each application may want to define the default internal encoding of strings for performance reasons, and/or we could use the encoding specified on assignment/creation, eg. string a; //stored in application defined default (or char[] as that is D's general purpose default) string a = "test"w; //stored as wchar[] internally a.utf16 //does no transcoding a.utf32; a.utf8 //causes transcoding or, when you have nothing to assign a special syntax is used to specify the internal encoding //some options off the top of my head... string a = string.UTF16; string a!(wchar[]); //random though, can all this be achieved with a template? 
string a(UTF16); read_from_file_in_specific_encoding(a.utf32); the above would create an empty/non-existent (let's not go here yet <g>) utf16 string in memory, and transcode from the file, which is utf32, to utf16 for internal representation, then: a.utf16 //does no transcoding a.utf8 a.utf32 //causes transcoding Assignment of strings of different internal representation would cause transcoding. This should be rare, as most strings should be in the application-defined internal representation; it would naturally occur on input and output, where you cannot avoid it anyway. This idea has me quite excited; if no-one can poke large unsightly holes in it perhaps we could work on a draft spec for it? (i.e. post it to digitalmars.D and see what everyone thinks) Regan
Nov 20 2005
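To make the proposal above concrete, here is a minimal sketch of the suggested super-type written as a plain library struct in the D of the day, using std.utf for the conversions. The name String, the use of static opCall for construction, and the choice of char[] as the default internal encoding are illustrative assumptions, not part of the proposal itself.

    import std.utf;     // toUTF8, toUTF16, toUTF32
    import std.string;  // toStringz

    // Hypothetical super-type; char[] picked here as the internal encoding.
    struct String
    {
        char[] data;

        static String opCall(char[] s)
        {
            String r;
            r.data = s;
            return r;
        }

        char[]  utf8()  { return data;          }  // already UTF-8: no work
        wchar[] utf16() { return toUTF16(data); }  // transcode on demand
        dchar[] utf32() { return toUTF32(data); }  // transcode on demand
    }

    // The callees from the example above:
    void c_function_call(char* s) {}
    void os_function_call(wchar[] s) {}
    void write_to_file_in_specific_encoding(dchar[] s) {}

    void main()
    {
        String a = String("test");
        c_function_call(toStringz(a.utf8));          // transcoding happens only
        os_function_call(a.utf16);                   // at the boundaries where
        write_to_file_in_specific_encoding(a.utf32); // the callee demands a format
    }

As Regan notes, only the input/output boundaries pay for a conversion; any caller that is happy with the internal encoding costs nothing.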
On Mon, 21 Nov 2005 14:16:22 +1300, Regan Heath wrote: [snip]This idea has me quite excited; if no-one can poke large unsightly holes in it perhaps we could work on a draft spec for it? (i.e. post it to digitalmars.D and see what everyone thinks)Not only do great minds think alike, so do you and I! I'm starting to think that you (and your minion helpers) have hit upon a 'Great Idea(tm)'. -- Derek Parnell Melbourne, Australia 21/11/2005 12:57:40 PM
Nov 20 2005
"Regan Heath" <regan netwin.co.nz> wrote...On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek psych.ward> wrote:[snip]Aye!void foo(wchar[] s) {} void foo(char[] s) {} foo("test"); //should call ??? I'm now thinking that it should call the char[] signature without error.That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviourAye!But in this case ... void foo(wchar[] s) {} void foo(dchar[] s) {} foo("test"); //should call an error.Agreed, just like the integer literal example above.
Nov 20 2005
Derek Parnell wrote:On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:Derek Parnell wrote:Would you be surprised: Foo[10] foo = new Foo; for(ubyte i=0; i<10; i++) // Not short, int, or long, "save space" { foo[i] = whatever; // Gee, compiler silently casts to int! } He might either be stupid, uneducated, or then not coded since 1985. And it happens.We have problems with inout and out parameters. foo(inout wchar x) {} dchar[] y = "abc"; foo(y); In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ... auto wchar[] temp; temp = toUTF16(y); foo(temp); y = toUTF32(temp);However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?Both. And everywhere else (in case we forgot to name some situation).If the overloaded functions purport to take UTF (of any width at all), then it is assumed that they do _semantically_ the same thing. Thus, one has the right to sleep at night. The programmer shall not see any difference whichever is chosen: - if there's only one type, then there's no choice anyway. - if there's one that matches, then pick that, (not that it would be obligatory, but it's polite.) - if there are the two non-matching, then pick the one preferred by the compiler writer, or the OS vendor. If not, then just pick either one. - if there are no UTF versions, then it'd be okay to complain, at compile time.Yes, at that's what happens now.dchar y; foo("Some Test Data"); // Which one now?Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!I disagree. Without know what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.If we have automatic conversion and it choose one at random, there is no way of knowing that its doing the 'right' thing to the data we give it. In my opinion, its a coding error and the coder need to provide more information to the compiler.I want everyone to understand that it makes just as little difference as when the compiler optimizer chooses a datatype for variable i in this: for(ubyte i=0; i<256; i++) { // do stuff } Can you honestly say that it makes a difference which type i is? (Except signed byte, of course. And we're not talking about performance.) I wouldn't be surprised if DMD (haven't checked!) would sneak i to int instead of the explicitly asked-for ubyte, already in the default compile mode. And -release, and at -O probably should. (Again, haven't checked, and even if it does not do it, the issue is a matter of principle: would making it int make a difference in this example?)Suppose you're in a huge software project with D, and the customer has ordered it to do all arithmetic in long. After 1500000 lines it goes to the beta testers, and they report wierd behavior. Three weeks of searching, and the boss is raving around with an axe. One night the following code is found: import std.stdio; void main() { long myvar; ... myvar = int.max / 47; ... 300 lines myvar = scale(myvar); ... 500 lines } ... 
50000 lines later long scale(int v) { long tmp = 1000 * v; return tmp / 3; } Folks suspect the bug is here, but what is wrong? Does the compiler complain? Should it?(Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called.Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.If the coder had written foo("Some Test Data"w); then its pretty clear which function was intended.Except that my example above is dangerous, while with UTF it can't get dangerous. Hey, what sould the compiler complain if I write: char[] a = "\U00000041"c; (Do you think it currently complains? Saying what? Or doesn't it? And what do you say happens if one would get this currently compiled and run?)Would it be correct to say that the undecorated string literal can't possibly be done anything with so that the type of the receiver is not known? Apart from passing to overloaded functions (each of which does know "what it wants"), is there any situation where UTF is accepted, but the receiver does not itself know which it "wants", or even "prefers"? Should there be such cases? Could there?But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.
Nov 20 2005
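For readers wondering where Georg's scale() example actually bites, here is a condensed, commented version of it (the intervening thousands of lines omitted; the annotations are mine):

    long myvar;

    long scale(int v)              // (1) the parameter is int, not long
    {
        long tmp = 1000 * v;       // (2) 1000 * v is evaluated in int arithmetic,
                                   //     so it can overflow before it is ever
                                   //     widened to long
        return tmp / 3;
    }

    void main()
    {
        myvar = int.max / 47;      // roughly 45.7 million -- still fits an int
        myvar = scale(myvar);      // (3) the long argument is silently narrowed
                                   //     to int at the call (the compiler of the
                                   //     day accepted this, as the reply below
                                   //     confirms), and 1000 * v then overflows:
                                   //     "all arithmetic in long" quietly isn't
    }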
On Sun, 20 Nov 2005 17:02:04 +0200, Georg Wrede wrote:Derek Parnell wrote:Surprised about the two conversions? No, I just said that's what it would have to do, so no I wouldn't be surprised. I just said it would be a problem. In so far as the compiler would (currently) no warn coders about the performance hit until they profiled it, and even then it might not be obvious to some people.On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:Derek Parnell wrote:Would you be surprised:We have problems with inout and out parameters. foo(inout wchar x) {} dchar[] y = "abc"; foo(y); In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ... auto wchar[] temp; temp = toUTF16(y); foo(temp); y = toUTF32(temp);However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?Both. And everywhere else (in case we forgot to name some situation).Foo[10] foo = new Foo; for(ubyte i=0; i<10; i++) // Not short, int, or long, "save space" { foo[i] = whatever; // Gee, compiler silently casts to int! } He might either be stupid, uneducated, or then not coded since 1985. And it happens.What on earth has the above example got to do with double conversions? And converting from ubyte to int is not exactly a performance drain.Assumptions like that have a nasty habit of generating nightmares. It is *only* an assumption and not a decision based on actual knowledge.If the overloaded functions purport to take UTF (of any width at all), then it is assumed that they do _semantically_ the same thing. Thus, one has the right to sleep at night.Yes, at that's what happens now.dchar y; foo("Some Test Data"); // Which one now?Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!I disagree. Without know what the intention of the function is, one has no way of knowing which function to call. Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.The programmer shall not see any difference whichever is chosen: - if there's only one type, then there's no choice anyway.But is more than one.- if there's one that matches, then pick that, (not that it would be obligatory, but it's polite.)Sorry, no matches.- if there are the two non-matching, then pick the one preferred by the compiler writer, or the OS vendor. If not, then just pick either one.BANG! This is where we part company. My believe is that to assume that functions with the same name are going to do the same thing is a dangerous one and can lead to mistakes. Whereas you seem to be saying that this is a safe assumption to make.- if there are no UTF versions, then it'd be okay to complain, at compile time.No, but what's this got to do with the argument?If we have automatic conversion and it choose one at random, there is no way of knowing that its doing the 'right' thing to the data we give it. 
In my opinion, its a coding error and the coder need to provide more information to the compiler.I want everyone to understand that it makes just as little difference as when the compiler optimizer chooses a datatype for variable i in this: for(ubyte i=0; i<256; i++) { // do stuff } Can you honestly say that it makes a difference which type i is? (Except signed byte, of course. And we're not talking about performance.)I wouldn't be surprised if DMD (haven't checked!) would sneak i to int instead of the explicitly asked-for ubyte, already in the default compile mode. And -release, and at -O probably should. (Again, haven't checked, and even if it does not do it, the issue is a matter of principle: would making it int make a difference in this example?)Red Herring Alert!No it doesn't and yes it should.Suppose you're in a huge software project with D, and the customer has ordered it to do all arithmetic in long. After 1500000 lines it goes to the beta testers, and they report wierd behavior. Three weeks of searching, and the boss is raving around with an axe. One night the following code is found: import std.stdio; void main() { long myvar; ... myvar = int.max / 47; ... 300 lines myvar = scale(myvar); ... 500 lines } ... 50000 lines later long scale(int v) { long tmp = 1000 * v; return tmp / 3; } Folks suspect the bug is here, but what is wrong? Does the compiler complain? Should it?(Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called.Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?That is (I'm sorry, no offense), based on a misconception. Please see my other posts today, where I try to clear (among other things) this very issue.Assumptions can hurt too.If the coder had written foo("Some Test Data"w); then its pretty clear which function was intended.Except that my example above is dangerous, while with UTF it can't get dangerous.Hey, what sould the compiler complain if I write: char[] a = "\U00000041"c; (Do you think it currently complains? Saying what? Or doesn't it? And what do you say happens if one would get this currently compiled and run?)Of course not. Both 'a' and the literal are of the same data type.Again, I fail to see what this has to do with the issue. Let's call a halt to this discussion. I suspect that you and I will not agree about this function signature matching issue anytime soon. -- Derek Parnell Melbourne, Australia 21/11/2005 6:55:57 AMWould it be correct to say that the undecorated string literal can't possibly be done anything with so that the type of the receiver is not known? Apart from passing to overloaded functions (each of which does know "what it wants"), is there any situation where UTF is accepted, but the receiver does not itself know which it "wants", or even "prefers"? Should there be such cases? Could there?But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.
Nov 20 2005
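For reference, the decoration Derek mentions is exactly how a coder resolves the ambiguity under the current rules:

    void foo(char[] s)  {}
    void foo(wchar[] s) {}

    void main()
    {
        // foo("Some Test Data");   // rejected today: the literal matches both
        foo("Some Test Data"c);     // explicitly UTF-8  -> calls foo(char[])
        foo("Some Test Data"w);     // explicitly UTF-16 -> calls foo(wchar[])
    }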
Derek Parnell wrote:Let's call a halt to this discussion. I suspect that you and I will not agree about this function signature matching issue anytime soon.Whew, I was just starting to wonder what to do. :-) Maybe we'll save the others some headaches too. Besides, at this point, I guess nobody else reads this thread anyway. :-) But it was nice to learn that with some folks you really can disagree long and good, and still not start fighting. georg
Nov 20 2005
On Mon, 21 Nov 2005 00:23:39 +0200, Georg Wrede <georg.wrede nospam.org> wrote:Derek Parnell wrote:I'm interested in both your opinions on: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5612Let's call a halt to this discussion. I suspect that you and I will not agree about this function signature matching issue anytime soon.Whew, I was just starting to wonder what to do. :-)Maybe we'll save the others some headaches too. Besides, at this point, I guess nobody else reads this thread anyway. :-)Or they prefer to lurk. Or we scared them away.But it was nice to learn that with some folks you really can disagree long and good, and still not start fighting.It's how it's supposed to work :) The key, I believe, is to realise that it's not personal, it's a discussion/argument of opinion. Disagreeing with an opinion is not the same as disliking the person who holds that opinion. Of course this is only true when the participants do not make comments which can be taken as being directed at the person, as opposed to the points of the argument itself. This is harder than it sounds because the written word often does not convey your meaning as well as your face and voice could do in a face-to-face conversation. My 2c. Regan
Nov 20 2005
Georg Wrede wrote:If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.It should? Why?, what is the problem of using the toUTFxx functions? -- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 20 2005
Bruno Medeiros wrote:Georg Wrede wrote:Nothing wrong. But cast should not do the union thing. Of course, we could have the toUTFxxx and no cast at all for UTF strings, no problem. But definitely _not_ have the cast do the "union thing".If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.It should? Why?, what is the problem of using the toUTFxx functions?
Nov 20 2005
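A small example of the distinction Georg is drawing, assuming std.utf for the real conversion: casting between array types merely repaints the existing bytes, which is the "union thing", while toUTF16 actually transcodes.

    import std.utf;

    void main()
    {
        char[] s = "abcd";                  // four UTF-8 code units

        wchar[] converted = toUTF16(s);     // transcoded: the four characters
                                            // 'a' 'b' 'c' 'd' as wchars

        wchar[] painted = cast(wchar[]) s;  // the "union thing": the same four
                                            // bytes reinterpreted as two wchars,
                                            // which is not "abcd" at all
    }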
On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:Georg Wrede wrote:Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), ... ? -- Derek Parnell Melbourne, Australia 21/11/2005 6:48:48 AMIf somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.It should? Why?, what is the problem of using the toUTFxx functions?
Nov 20 2005
Derek Parnell wrote:On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:No, we don't. But the case is different: between primitive numbers the casts are usually (if not allways?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String enconding conversions on the other hand (as you surely are aware) are quite not trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations. -- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... unnatural."Georg Wrede wrote:Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.It should? Why?, what is the problem of using the toUTFxx functions?
Nov 21 2005
On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:Derek Parnell wrote:Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level' by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide? Is conversion from byte to real done in-line or via sub-routine call? I don't actually know, just asking. -- Derek Parnell Melbourne, Australia 21/11/2005 10:47:34 PMOn Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:No, we don't. But the case is different: between primitive numbers the casts are usually (if not allways?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String enconding conversions on the other hand (as you surely are aware) are quite not trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.Georg Wrede wrote:Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.It should? Why?, what is the problem of using the toUTFxx functions?
Nov 21 2005
Derek Parnell wrote:On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:I would think so. I'd define trivial as: "the assembly code doesn't have any loops".Derek Parnell wrote:Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level' by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide?On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:No, we don't. But the case is different: between primitive numbers the casts are usually (if not allways?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String enconding conversions on the other hand (as you surely are aware) are quite not trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.Georg Wrede wrote:Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".) A cast should do precisely what our toUTFxxx functions currently do.It should? Why?, what is the problem of using the toUTFxx functions?Is conversion from byte to real done in-line or via sub-routine call? I don't actually know, just asking.On x86, int -> real can be done with the FILD instruction. Or can be done without FPU, in a couple of instructions. short -> int is done with MOVSX ushort -> uint is done with MOVZX. HOWEVER -- I don't think this is really relevant. The real issue is about literals, which as Georg rightly said, could be stored in ANY format. Conversions from a literal to any type has ZERO runtime cost. I think that in a few respects, the existing situation for strings is BETTER than the situation for integers. I personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real. One intriguing possibility would be to have literals having NO type (or more accurately, an unassigned type). The type only being assigned when it is used. eg "abc" is of type: const __unassignedchar []. There are implicit conversions from __unassignedchar [] to char[], wchar[], and dchar[]. But there are none from char[] to wchar[]. Adding a suffix changes the type from __unassignedchar to char[], wchar[], or dchar[], preventing any implicit conversions. (__unassignedchar could also be called __stringliteral -- it's inaccessable, anyway). Similarly, an integral constant could be of type __integerliteral UNTIL it is assigned to something. At this point, a check is performed to see if the value can actually fit in the type. If not, (eg when an extended UTF char is assigned to a char), it's an error. Admittedly, it's more difficult to deal with when you have integers, and especially with reals, where no lossless conversion exists (because 1.0/3.0f + 1.0/5.0f is not the same as cast(float)(1.0/3.0L + 1.0/5.0L) -- the roundoff errors are different). There are some vaguaries -- what rounding mode is used when performing calculations on reals? This is implementation-defined in C and C++, would be nice if it were specified in D. UTF strings are not the only worm in this can of worms :-)
Nov 21 2005
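Don's roundoff remark, spelled out. Whether the float/double-typed version really is rounded to the declared precision at each step, or kept in extended precision on the x87, is precisely the implementation-defined behaviour he would like the spec to pin down, so the comparison below is left open rather than asserted:

    import std.stdio;

    void main()
    {
        // Typed as float/double: the intermediates may be rounded to their
        // declared precision...
        float x = 1.0/3.0f + 1.0/5.0f;

        // ...while this one is summed in real and rounded to float only once.
        float y = cast(float)(1.0L/3.0L + 1.0L/5.0L);

        // If the intermediates above really were rounded, x and y differ in
        // their low-order bits.
        writefln(x == y ? "identical" : "different");
    }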
In article <dlsj73$2fod$1 digitaldaemon.com>, Don Clugston says...I think that in a few respects, the existing situation for strings is BETTER than the situation for integers. I personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real.I agree with you.One intriguing possibility would be to have literals having NO type (or more accurately, an unassigned type). The type only being assigned when it is used. eg "abc" is of type: const __unassignedchar []. There are implicit conversions from __unassignedchar [] to char[], wchar[], and dchar[]. But there are none from char[] to wchar[].String literals already work like this. :) String literals without suffix are char[], but not "committed". String literals with a suffix are "committed" to their type. Check the frontend sources. StringExp::implicitConvTo(Type *t) allows conversion of non-committed string literals to {,w,d}char arrays and pointers. This is what makes this an error: Regards, /Oskar
Nov 21 2005
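Oskar's example snippet does not appear above; the "committed" rule he describes plays out like this (an illustration, not necessarily his exact code):

    void main()
    {
        wchar[] a = "abc";     // fine: an unsuffixed literal is not committed,
                               // so it is happy to become a wchar[]

        wchar[] b = "abc"c;    // error: the c suffix commits the literal to
                               // char[], and char[] does not implicitly
                               // convert to wchar[]
    }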
Don Clugston wrote:I personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real.Really? I tested this a few days ago and it seemed like literals larger than int.max were treated as longs. I'll mock up another test on my way to work. Sean
Nov 21 2005
In article <dlt2b9$6df$3 digitaldaemon.com>, Sean Kelly says...Don Clugston wrote:You are right, large integers are automatically treated as longs, but too large floating point literals are not automatically treated as real. Prints: int long double real /OskarI personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real.Really? I tested this a few days ago and it seemed like literals larger than int.max were treated as longs. I'll mock up another test on my way to work.
Nov 21 2005
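The test program itself does not appear above; something along these lines would print the four type names Oskar quotes (a reconstruction, not his original code):

    import std.stdio;

    void main()
    {
        writefln(typeid(typeof(2147483647)).toString);   // int
        writefln(typeid(typeof(2147483648)).toString);   // long: too big for int

        writefln(typeid(typeof(1.2345678901234567890)).toString);   // double: extra
                                                                     // digits do not
                                                                     // promote it
        writefln(typeid(typeof(1.2345678901234567890L)).toString);  // real: only the
                                                                     // L suffix does
    }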
Oskar Linde wrote:You are right, large integers are automatically treated as longs, but too large floating point literals are not automatically treated as real.This seems reasonable though, since it's really a matter of precision with floating-point numbers more so than representability. Sean
Nov 21 2005
Derek Parnell wrote:On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:A good question indeed. I was thinking something equivalent to what Don Clugston said: if the code run time depends on the object size, that is, is not constant bounded, then it's beyond the acceptable point. Another disqualifier is allocating memory on the heap. A string enconding conversion does both things.Derek Parnell wrote:Why? If documented, the user can be prepared. And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level' by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide?Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(), .... ?No, we don't. But the case is different: between primitive numbers the casts are usually (if not allways?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String enconding conversions on the other hand (as you surely are aware) are quite not trivial (both in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.Is conversion from byte to real done in-line or via sub-routine call? I don't actually know, just asking.I didn't know for sure the answer before Don replied, but I already suspected that it was merely an Assembly one-liner (i.e., one instruction only). Note: I think the most complex cast we have right now is a class object downcast, which, altough not universally constant bounded, it's still compile-time constant bounded. -- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 23 2005
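Bruno's closing remark refers to the runtime check that a cast between class references performs in D; a short example of why it is the most complex cast, yet still bounded:

    class Base {}
    class Derived : Base {}

    void main()
    {
        Base b = new Derived;
        Derived d = cast(Derived) b;    // runtime check succeeds: d is the object

        Base b2 = new Base;
        Derived d2 = cast(Derived) b2;  // check fails: d2 is null -- no exception,
                                        // no heap allocation, and the work done is
                                        // bounded by the (fixed) inheritance depth
    }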