digitalmars.D - toString vs. toUtf8
- Sean Kelly (57/57) Nov 19 2007 I was looking at converting Tango's use of toUtf8 to toString today and
- Steven Schveighoffer (6/16) Nov 19 2007 Can you give an example file for this problem? It would be easier to
- Sean Kelly (3/19) Nov 19 2007 tango.text.convert.Layout
- Steven Schveighoffer (12/30) Nov 19 2007 I can't say I see a problem.
- Sean Kelly (8/19) Nov 19 2007 There's no conflict, it's just more difficult to understand. Also, in
- Bill Baxter (24/43) Nov 19 2007 I you are right that the meanings of toString and toUtf8 are subtly
- Daniel Keep (16/16) Nov 19 2007 That's roughly what I've suggested before, except that I also suggested
- Kris (2/18) Nov 19 2007
- BCS (8/13) Nov 19 2007 why? What is to be gained by having tango us toString rather than toUTF*...
- Lars Ivar Igesund (8/23) Nov 19 2007 Ooh, that would be nice. Apparently toString is the only thing not up fo...
- Walter Bright (10/10) Nov 19 2007 Phobos (and D) has undergone some evolution in the thinking about
- Lars Ivar Igesund (11/23) Nov 19 2007 We certainly don't include all of the D community (of which I would say
- Gregor Richards (11/23) Nov 19 2007 I believe that this naming convention would be best in Tango (toString,
- Robert Fraser (5/30) Nov 19 2007 Agreed. It's also worth noting that toString as the name of a
- Sean Kelly (6/18) Nov 19 2007 This seems fair. It would reinforce the idea that strings really do use...
- Gregor Richards (6/27) Nov 19 2007 Worse looking than toUtf16? Would you prefer if int => int32, long =>
- Walter Bright (3/7) Nov 19 2007 Those get requested now and then, but I agree they are awful. They're a
- Jason House (13/21) Nov 19 2007 The first bullet on http://www.digitalmars.com/d/portability.html implie...
- Robert DaSilva (2/27) Nov 19 2007 Even on 64-bit systems int is 32-bit.
- Jason House (4/33) Nov 19 2007 Are you talking about what D does or what is most efficient on a 64 bit
- Sean Kelly (3/14) Nov 19 2007 You could always use tango.stdc.stdint.int_fast32_t ;-)
- Robert DaSilva (4/37) Nov 19 2007 C don't specify the sizes, but it does specify the sizes relative to
- renoX (6/31) Nov 19 2007 No!
- Jason House (5/18) Nov 22 2007 What would the portability issue be? If they use int and don't care abo...
- renoX (12/34) Nov 25 2007 Sure, it's the same as C, except that if you look to the real world,
- Sean Kelly (14/39) Nov 19 2007 Yes. I find the 'W' or 'D' in the middle of the name difficult to read....
- Kris (5/43) Nov 19 2007 Yes, it looks more akin to GoBbleDeGOOk that other options. I find such
- Regan Heath (13/46) Nov 20 2007 I agree, I think I'd prefer:
- Roberto Mariottini (4/5) Nov 20 2007 Are you saying that the toMBSz() function should return ubyte* not char*...
- Walter Bright (2/7) Nov 20 2007 Probably.
- Matti Niemenmaa (38/45) Nov 20 2007 At last! This is the way I've been thinking it should be for a long time...
- Regan Heath (12/37) Nov 20 2007 I think we should be encouraging people to convert this data to UTF-8
- Matti Niemenmaa (13/20) Nov 20 2007 This is an impossible task. Given a plaintext file, you cannot know what
- Regan Heath (25/45) Nov 20 2007 Yep, but the same thing may occur calling a D string function as it
- Matti Niemenmaa (25/50) Nov 20 2007 Which is why I think that unless you know it's UTF-8, you should use uby...
- Regan Heath (3/5) Nov 21 2007 To my mind byte = "signed interpretation of 8 bits".
- Julio César Carrascal Urquijo (31/62) Nov 20 2007 You can't assume that a function designed to work on an UTF-8 strings
- Regan Heath (13/15) Nov 21 2007 FYI: You probably already know this but I wanted to be sure, plus
- Matti Niemenmaa (45/89) Nov 21 2007 I am well aware of this. I chose strip as an example because it does wor...
- Regan Heath (24/42) Nov 21 2007 But, this behvaiour isn't guaranteed. In fact I would expect that in
- Matti Niemenmaa (26/48) Nov 21 2007 std.string.LS and std.string.PS are two examples of Unicode whitespace
- Regan Heath (12/26) Nov 21 2007 Agreed. I would tend to leave the std.string functions taking char[] so...
- Robert DaSilva (3/55) Nov 20 2007 Perhapses {,w,d}char should become typedef of u{byte,short,int} and drop
- Matti Niemenmaa (5/7) Nov 21 2007 There would still need to be special handling for char and string litera...
- Jarrett Billingsley (9/12) Nov 19 2007 Do you want to know my single overriding reason for wanting toString ins...
- Gregor Richards (2/4) Nov 19 2007 Hear hear.
- Kris (6/10) Nov 19 2007 With respect to all, you're perhaps not addressing Sean's deeper questio...
- Gregor Richards (4/19) Nov 19 2007 Confusing it's, post top don't.
- Julio César Carrascal Urquijo (14/23) Nov 20 2007 Actually, Jarrett's other arguments seemed more compelling to me.
- BCS (5/22) Nov 19 2007 shouldn't that be?
- Christopher Wright (8/10) Nov 19 2007 toUtf8 is ugly.
- Sean Kelly (21/35) Nov 19 2007 I tend to place a tremendous amount of value on consistency, because the...
- Bill Baxter (8/46) Nov 19 2007 My just formed opinion :-) is that any sort of toWstring/toDstring
- David B. Held (16/21) Nov 20 2007 I certainly don't qualify as someone who does a "lot" of i18n
- Oskar Linde (32/46) Nov 20 2007 IMHO, the consistent alternative is pretty clear:
- Sean Kelly (15/71) Nov 20 2007 It depends :-) I prefer the suggested toStringW and toStringD
- James Dennett (28/32) Dec 16 2007 A D-wide (at least optionally *enforced*) specification that
- Bill Baxter (21/83) Nov 19 2007 Does that even work? I would think there are some valid MBSz's that are...
- Sean Kelly (8/64) Nov 19 2007 It works because D performs no run-time verification that what's in a
- Kris (6/11) Nov 19 2007 Bill: actually, toString, toStringW and toStringD are more consistent wi...
- Bill Baxter (9/22) Nov 19 2007 How so?
- Lars Ivar Igesund (8/31) Nov 20 2007 Only if you have recognized wstring and dstring as good names for those
- Walter Bright (2/4) Nov 20 2007 They'd be consistent with wchar and dchar.
- Lars Ivar Igesund (7/12) Nov 20 2007 Right ... now I don't like those either ;)
- Walter Bright (2/10) Nov 20 2007 What can I say? !!
- Kris (15/25) Nov 20 2007 hehe
- Christopher Wright (8/36) Nov 20 2007 class String {
- Sean Kelly (6/16) Nov 20 2007 Tango already has a String class with toUtf8, toUtf16, and toUtf32
- Lars Ivar Igesund (7/26) Nov 20 2007 It is already renamed to Text.
- Sean Kelly (2/20) Nov 20 2007 Oops!
- Lars Ivar Igesund (7/21) Nov 20 2007 FWIW, this would be preferable to me too.
- Regan Heath (2/18) Nov 20 2007 +votes
- Jarrett Billingsley (8/20) Nov 20 2007 Now that I've seen toWString and toStringW, I'll have to say I do like t...
- Chad J (23/38) Nov 20 2007 This conversation caught my eye and I cringed at toWString and
- Bill Baxter (15/18) Nov 20 2007 1) On the question of toWString vs toWstring and consistency:
- Sean Kelly (7/28) Nov 20 2007 Good question. Probably toUInt, though I don't like it much :-) For
- Kris (5/21) Nov 20 2007 It was resolved by having a Float module and an Integer module, containi...
- Christopher Wright (5/16) Nov 20 2007 If it had a to uint function and a to int function and a to 'sint'
- Bill Baxter (5/22) Nov 20 2007 What does length have to do with whether or not the naming scheme is
- Christopher Wright (9/33) Nov 20 2007 Sorry, mistyped. If it were 'to uint' and 'to int', that would be rather...
I was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandary. Currently, Tango's use of toUtf8 as the member function for returning char strings is consistent with all use of string operations in Tango. Routines that return wchar strings are named toUtf16 whether they are members of the String class or whether they are intended to perform UTF conversions, and so on. Thus, the convention is consistent and pervasive.

What I discovered during a test conversion of Tango was that converting all uses of toUtf8 to toString /except/ those intended to perform UTF conversions reduced code clarity, and left me unsure as to which name I would actually use in a given situation. For example, there is quite a bit of code in the text and io packages which converts an arbitrary type to a char[] for output, etc. So by making this change I was left with some conversions using toString and others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx versions of these same functions. As this is template code, the choice between toString and toUtf8 in a given situation was unclear.

Given this, I decided to look to Phobos for a model to follow. What I found in Phobos was that it suffers from the same situation as I found Tango in during my test conversion. Routines that convert any type but a string to a char[] are named toString, while the string equivalent is named toUTF8. Given this, I surmised that the naming convention in D is that all strings are assumed to be Unicode, except when they're not. String literals are required to be Unicode, foreach assumes strings to be UTF encoded when performing its automatic conversions, and all of the toString functions in std.string assume UTF-8 as the output format. So why bother with the name toUTF8 in std.utf?

As near as I can tell, the reason for text conversion routines to be named differently is to simplify the use of routines which convert to another format. std.windows.charset, for example, has a routine called toMBSz, to distinguish from the toUTF8 routine. What I find significant about this is that it suggests that while the transport mechanism for strings is the same in each case (both routines return a char[], i.e. a string), the underlying encoding is different. Thus there seems a clear disconnect between the name of the transport mechanism (string), and routines that generate them.

With this in mind, I begin to question the point of having toString as the common name for routines that generate char strings. The encoding clearly matters in some instances and cannot be ignored, so ignoring it in others just seems to confuse things. I will admit that I am questioning the merit of changing Tango's toUtf8 routines to be named toString. Doing so seems to sacrifice both operational consistency and clarity in an attempt to maintain consistency with the name of the transport mechanism: string. And as I have said above, while strings in D are generally expected to be Unicode, they are clearly not always Unicode, as the existence of std.windows.charset can attest.

So I am left wondering whether someone can explain why toString is the preferred name for string-producing routines in D? I feel it is very important to establish a consistent naming mechanism for D, and as Phobos seems to be the model in this case I may well have no choice in the matter of toUtf8 vs. toString. But I would feel much better about the change if someone could provide a sound reason for doing so, since my first attempt at a conversion has left me somewhat worried about its long-term effect on code clarity.

As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.

Sean
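The naming split described in this post can be seen in a few lines. The following is only a sketch, and it assumes a present-day D 2 compiler with Phobos (std.conv.to and the std.utf functions), not the 2007 library versions under discussion; the helper names describe and roundTrip are made up for the illustration.

```d
import std.conv : to;
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical helper: "give me some text for this value".
// This family is named after the transport type (string).
string describe(T)(T value)
{
    return to!string(value);
}

// Hypothetical helper: "re-encode this text".
// This family is named after the encoding (UTF-8/16/32).
string roundTrip(string text)
{
    wstring w = toUTF16(text); // UTF-8  -> UTF-16
    dstring d = toUTF32(w);    // UTF-16 -> UTF-32
    return toUTF8(d);          // UTF-32 -> UTF-8
}

void main()
{
    assert(describe(42) == "42");
    assert(roundTrip("héllo") == "héllo");

    // Both families hand back the same transport type.
    static assert(is(typeof(describe(42)) == typeof(roundTrip(""))));
}
```

Both helpers return a string, yet one is named after the type and the other after the encoding, which is the inconsistency the post is asking about.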
Nov 19 2007
"Sean Kelly" wroteWhat I discovered during a test conversion of Tango was that converting all uses of toUtf8 to toString /except/ those intended to perfom UTF conversions reduced code clarity, and left me unsure as to which name I would actually use in a given situation. For example, there is quite a bit of code in the text and io packages which convert an arbitrary type to a char[] for output, etc. So by making this change I was left with some conversions using toString and others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx versions of these same functions. As this is template code, the choice between toString and toUtf8 in a given situation was unclear.Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d") -Steve
Nov 19 2007
Steven Schveighoffer wrote:
> "Sean Kelly" wrote
>> What I discovered during a test conversion of Tango was that converting all uses of toUtf8 to toString /except/ those intended to perform UTF conversions reduced code clarity, and left me unsure as to which name I would actually use in a given situation. For example, there is quite a bit of code in the text and io packages which convert an arbitrary type to a char[] for output, etc. So by making this change I was left with some conversions using toString and others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx versions of these same functions. As this is template code, the choice between toString and toUtf8 in a given situation was unclear.
> Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")

tango.text.convert.Layout

Sean
Nov 19 2007
"Sean Kelly" wroteSteven Schveighoffer wrote:I can't say I see a problem. I'd say use toUtf8 when doing a conversion from one type of encoded string to another (i.e. utf-16 to utf-8), and use toString when overriding Object's toString, OR when converting a native type (i.e. int, float, etc). For example tango.text.convert.Integer.toUtf8 should be toString. In the case of tango.text.convert.Layout, I don't see any overriding of Object.toUtf8? The Unicode.toUtf8 should be left alone since it is a conversion between utf encodings. In any case, Unicode.toUtf8 is a global function, and is not overriding Object.toUtf8, so there is no conflict there. -Steve"Sean Kelly" wrotetango.text.convert.LayoutWhat I discovered during a test conversion of Tango was that converting all uses of toUtf8 to toString /except/ those intended to perfom UTF conversions reduced code clarity, and left me unsure as to which name I would actually use in a given situation. For example, there is quite a bit of code in the text and io packages which convert an arbitrary type to a char[] for output, etc. So by making this change I was left with some conversions using toString and others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx versions of these same functions. As this is template code, the choice between toString and toUtf8 in a given situation was unclear.Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")
Nov 19 2007
Steven Schveighoffer wrote:
> I'd say use toUtf8 when doing a conversion from one type of encoded string to another (i.e. utf-16 to utf-8), and use toString when overriding Object's toString, OR when converting a native type (i.e. int, float, etc). For example tango.text.convert.Integer.toUtf8 should be toString. In the case of tango.text.convert.Layout, I don't see any overriding of Object.toUtf8? The Unicode.toUtf8 should be left alone since it is a conversion between utf encodings. In any case, Unicode.toUtf8 is a global function, and is not overriding Object.toUtf8, so there is no conflict there.

There's no conflict, it's just more difficult to understand. Also, in template code, having a consistent rule for overloaded functions can be a valuable asset. I found myself wanting to simply change everything to toString, toWString, and toDString rather than change only the Object member function as originally planned. And the conflict with other encodings worried me so I posted here.

Sean
Nov 19 2007
Sean Kelly wrote:
> Steven Schveighoffer wrote:
>> "Sean Kelly" wrote
>>> What I discovered during a test conversion of Tango was that converting all uses of toUtf8 to toString /except/ those intended to perform UTF conversions reduced code clarity, and left me unsure as to which name I would actually use in a given situation. For example, there is quite a bit of code in the text and io packages which convert an arbitrary type to a char[] for output, etc. So by making this change I was left with some conversions using toString and others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx versions of these same functions. As this is template code, the choice between toString and toUtf8 in a given situation was unclear.
>> Can you give an example file for this problem? It would be easier to understand your problem if I knew exactly what you were talking about. An actual example is fine, it doesn't need to be minimized (i.e. "take a look at tango/io/X.d")
> tango.text.convert.Layout

I think you are right that the meanings of toString and toUtf8 are subtly different. My take is that toString promises to produce some textual form of the input (and it happens to use the utf8 encoding). This transformation might be wildly lossy and non-reversible, as is the case with the default implementation of toString for classes, which just prints the class name.

toUtf8, on the other hand, promises to do a conversion. It's probably lossless, or nearly so, and since the encoding is mentioned specifically, it is probably a conversion between different string encodings.

The thing is, sometimes A is B. The best textual representation of a Utf32 string as Utf8 is going to be the Utf8 converted version of it. So in that case toString and toUtf8 happen to do the same thing.

So to me, the logical thing to do is to "alias toUtf8 toString;" in the cases where there's a converter that also suffices as a textual representation generator. That way everything that can be represented as text has a toString method, and things that deal with encoding conversions have toUtf blah methods.

So in that case I don't see any reason for toWString, toDString. toString generates your canonical "textual representation" for whatever it is. If you need that in a different encoding for whatever reason then you need to run an encoding converter on it.

--bb
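The alias suggestion in this post can be sketched as follows. This assumes present-day D 2 syntax (alias toString = toUtf8; is the modern spelling of the post's alias toUtf8 toString;) and uses a made-up type, Utf32Text, that belongs to neither Tango nor Phobos.

```d
import std.utf : toUTF8;

// Hypothetical type holding UTF-32 text; for illustration only.
struct Utf32Text
{
    dstring data;

    // Encoding conversion: a lossless re-encoding to UTF-8.
    string toUtf8()
    {
        return toUTF8(data);
    }

    // Here the converter also serves as the textual representation,
    // so toString is simply a second name for the same function.
    alias toString = toUtf8;
}

void main()
{
    auto t = Utf32Text("naïve"d);
    assert(t.toUtf8() == "naïve");
    assert(t.toString() == t.toUtf8());
}
```

A type whose textual form is lossy (for example one that only prints its class name) would keep a separate toString and offer no toUtf8 at all.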
Nov 19 2007
That's roughly what I've suggested before, except that I also suggested the following interface:

    interface UtfConversion
    {
        char[] toUtf8();
        wchar[] toUtf16();
        dchar[] toUtf32();
    }

This would allow all objects to have a distinct set of methods for lossless conversion to different encodings, whilst still preserving the "just give me something to throw at the user" toString method.

Incidentally, given that to(T)'s entire purpose is to do generalised *value-preserving* conversions, is this really a problem? Using a formatter will always give you something, whilst to!(charT[])(v) will always preserve the value of the conversion.

-- Daniel
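One way the proposed interface might sit next to a lossy toString is sketched here. It assumes D 2, where the return types become string, wstring and dstring rather than the char[], wchar[] and dchar[] of the post, and the class Name is invented for the example.

```d
import std.utf : toUTF16, toUTF32;

// The interface from the post, restated with D 2 string types (an assumption).
interface UtfConversion
{
    string  toUtf8();
    wstring toUtf16();
    dstring toUtf32();
}

// Hypothetical class: lossless conversions plus a lossy display form.
class Name : UtfConversion
{
    private string text;

    this(string text) { this.text = text; }

    // Value-preserving conversions to each encoding.
    string  toUtf8()  { return text; }
    wstring toUtf16() { return toUTF16(text); }
    dstring toUtf32() { return toUTF32(text); }

    // "Just give me something to throw at the user": may discard information.
    override string toString() { return "<Name>"; }
}

void main()
{
    auto n = new Name("Zoë");
    UtfConversion c = n;
    assert(c.toUtf8() == "Zoë");
    assert(c.toUtf16() == "Zoë"w);
    assert(c.toUtf32().length == 3); // three code points
    assert(n.toString() == "<Name>");
}
```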
Nov 19 2007
"Daniel Keep" <daniel.keep.lists gmail.com> wrote in >That's roughly what I've suggested before, except that I also suggested the following interface: interface UtfConversion { char[] toUtf8(); wchar[] toUtf16(); dchar[] toUtf32(); } This would allow all objects to have a distinct set of methods for lossless conversion to different encodings, whilst still preserving the "just give me something to throw at the user" toString method. Incidentally, given that to(T)'s entire purpose is to do generalised *value-preserving* conversions, is this really a problem? Using a formatter will always give you something, whilst to!(charT[])(v) will always preserve the value of the conversion.Tango already has this ....-- Daniel
Nov 19 2007
Reply to Sean,

> I was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandary.
> [...]
> Sean

Why? What is to be gained by having Tango use toString rather than toUTF*? IIRC there was a big thing, back when Tango started using toUTF8 in place of toString, about how using toUTF8 would solve a number of these issues. If it is part of the Phobos/Tango collaboration project, I think that going the other way would be better (add toUTF8 to Phobos's Object and an alias to make old things compile).
Nov 19 2007
BCS wrote:
> Reply to Sean,
>> I was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandary.
>> [...]
>> Sean
> Why? What is to be gained by having Tango use toString rather than toUTF*? IIRC there was a big thing, back when Tango started using toUTF8 in place of toString, about how using toUTF8 would solve a number of these issues. If it is part of the Phobos/Tango collaboration project, I think that going the other way would be better (add toUTF8 to Phobos's Object and an alias to make old things compile).

Ooh, that would be nice. Apparently toString is the only thing not up for discussion at all on Walter's end.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 19 2007
Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:

    char[] => string
    wchar[] => wstring
    dchar[] => dstring

These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.
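A small sketch of the rule in this post: text in a non-Unicode encoding travels as ubyte[] and only becomes a string once it has been converted to UTF-8. ISO-8859-1 (Latin-1) is chosen here purely as an example encoding, the helper latin1ToString is hypothetical, and the code assumes a present-day D 2 compiler.

```d
import std.utf : validate;

// Hypothetical helper: decode ISO-8859-1 (Latin-1) bytes into a UTF-8 string.
// Latin-1 maps every byte value onto the Unicode code point with the same number.
string latin1ToString(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}

void main()
{
    // "café" in Latin-1: 'é' is the single byte 0xE9,
    // which would be malformed if stored directly in a char[].
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];

    string s = latin1ToString(raw);
    assert(s == "café");
    assert(s.length == 5); // 'é' occupies two UTF-8 code units
    validate(s);           // throws a UTFException on malformed UTF-8
}
```

Keeping the raw bytes in ubyte[] means that functions expecting UTF-8 cannot be handed the unconverted data by accident, because the types differ.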
Nov 19 2007
Walter Bright wrote:
> Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:

That "we" certainly doesn't include all of the D community (of which I would say Tango is a large part, although some of Tango's users probably like your suggestions). Were these names ever really up for discussion?

>     char[] => string
>     wchar[] => wstring
>     dchar[] => dstring
>
> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.

This doesn't at all address the points Sean put forth on the naming of the functions returning various encodings, regardless of the types returned.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 19 2007
Walter Bright wrote:
> Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:
>
>     char[] => string
>     wchar[] => wstring
>     dchar[] => dstring
>
> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.

I believe that this naming convention would be best in Tango (toString, toWString, toDString). Naming them toUtf8, toUtf16, toUtf32 not only means that the coder has to understand what character encodings are (which would be nice but shouldn't be necessary), but that the familiar terminology "string" we take from literally every other language is lost.

If we have to define "strings" as being a bit more confined than "arrays of bytes which presumably have some sort of form", even better. Bytes encoding random-arsed character sets into WTF-17 don't need to be called "strings", they can be called WTF-17 arrays.

 - Gregor Richards
Nov 19 2007
Gregor Richards wrote:
> Walter Bright wrote:
>> Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:
>>
>>     char[] => string
>>     wchar[] => wstring
>>     dchar[] => dstring
>>
>> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.
> I believe that this naming convention would be best in Tango (toString, toWString, toDString). Naming them toUtf8, toUtf16, toUtf32 not only means that the coder has to understand what character encodings are (which would be nice but shouldn't be necessary), but that the familiar terminology "string" we take from literally every other language is lost. If we have to define "strings" as being a bit more confined than "arrays of bytes which presumably have some sort of form", even better. Bytes encoding random-arsed character sets into WTF-17 don't need to be called "strings", they can be called WTF-17 arrays.
>
>  - Gregor Richards

Agreed. It's also worth noting that toString as the name of a method/function has some precedent, so people familiar with Java, etc. will be able to get used to it right away, possibly without ever looking it up.
Nov 19 2007
Walter Bright wrote:
> Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:
>
>     char[] => string
>     wchar[] => wstring
>     dchar[] => dstring
>
> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.

This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)

Sean
Nov 19 2007
Sean Kelly wrote:
> Walter Bright wrote:
>> Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:
>>
>>     char[] => string
>>     wchar[] => wstring
>>     dchar[] => dstring
>>
>> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.
> This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)
>
> Sean

Worse looking than toUtf16?

Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.

 - Gregor Richards
Nov 19 2007
Gregor Richards wrote:
> Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.

Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.
Nov 19 2007
Walter Bright wrote:
> Gregor Richards wrote:
>> Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.
> Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.

The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue.

I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe:

    int - variable size
    int32 - fixed size
    int64 - fixed size

If that's done, the size of types becomes obvious when the programmer cares about them, and it may make size-sensitive code more obvious.
Nov 19 2007
Jason House wrote:
> Walter Bright wrote:
>> Gregor Richards wrote:
>>> Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.
>> Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.
> The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe:
>
>     int - variable size
>     int32 - fixed size
>     int64 - fixed size
>
> If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

Even on 64-bit systems int is 32-bit.
Nov 19 2007
Robert DaSilva wrote:
> Jason House wrote:
>> Walter Bright wrote:
>>> Gregor Richards wrote:
>>>> Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.
>>> Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.
>> The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe:
>>
>>     int - variable size
>>     int32 - fixed size
>>     int64 - fixed size
>>
>> If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.
> Even on 64-bit systems int is 32-bit.

Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.
Nov 19 2007
Jason House wrote:
> Robert DaSilva wrote:
>> Jason House wrote:
>>> If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.
>> Even on 64-bit systems int is 32-bit.
> Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.

You could always use tango.stdc.stdint.int_fast32_t ;-)

Sean
Nov 19 2007
Jason House wrote:
> Robert DaSilva wrote:
>> Jason House wrote:
>>> Walter Bright wrote:
>>>> Gregor Richards wrote:
>>>>> Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.
>>>> Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.
>>> The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe:
>>>
>>>     int - variable size
>>>     int32 - fixed size
>>>     int64 - fixed size
>>>
>>> If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.
>> Even on 64-bit systems int is 32-bit.
> Are you talking about what D does or what is most efficient on a 64 bit system? If 32-bit integers are less efficient, then it's a crime to make size-tolerant code use an inefficient size.

C doesn't specify the sizes, but it does specify the sizes relative to each other:

    sizeof(short) <= sizeof(int) && sizeof(int) <= sizeof(long)
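For contrast with the relative guarantee in C, the fixed widths of D's basic integer types can be checked at compile time. This is a sketch assuming a present-day D 2 toolchain, where the C stdint aliases live in core.stdc.stdint (the post above names the Tango module of the time, tango.stdc.stdint); pointerBits is a made-up helper.

```d
import core.stdc.stdint : int_fast32_t;

// D fixes the widths of its basic integer types on every platform.
static assert(byte.sizeof  == 1);
static assert(short.sizeof == 2);
static assert(int.sizeof   == 4);
static assert(long.sizeof  == 8);

// Types that do follow the platform: size_t matches the pointer width,
static assert(size_t.sizeof == (void*).sizeof);
// and int_fast32_t is at least 32 bits wide, possibly wider.
static assert(int_fast32_t.sizeof >= 4);

// Hypothetical helper: report the pointer width in bits.
size_t pointerBits()
{
    return (void*).sizeof * 8;
}

void main()
{
    assert(pointerBits() == size_t.sizeof * 8);
}
```

Code that wants "whatever is fastest" can therefore opt in through an alias such as int_fast32_t, while plain int keeps the same width everywhere.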
Nov 19 2007
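For readers comparing the two models discussed above, here is a minimal sketch in present-day D syntax of what "fixed size" means in practice. The aliases `int32` and `int64` are hypothetical names in the style proposed in the thread, not part of the language; the only type assumed to follow the target's word size is `size_t`.

```d
// D's basic integer types have the same width on every target,
// so these checks hold at compile time regardless of platform.
static assert(byte.sizeof  == 1);
static assert(short.sizeof == 2);
static assert(int.sizeof   == 4);
static assert(long.sizeof  == 8);

// size_t is the type that tracks the target's pointer width.
static assert(size_t.sizeof == (void*).sizeof);

// Hypothetical aliases in the style proposed in the thread;
// these names are illustrative only.
alias int32 = int;
alias int64 = long;

void main()
{
    int32 a = int.max;
    int64 b = a;   // widening from 32 to 64 bits is implicit
    b += 1;        // does not overflow in the 64-bit type
    assert(b == 2_147_483_648L);
}
```

In C, by contrast, only the ordering of the sizes is guaranteed, which is why equivalent checks there would have to be written per platform.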
Jason House wrote:
> Walter Bright wrote:
>> Gregor Richards wrote:
>>> Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.
>> Those get requested now and then, but I agree they are awful. They're a legacy from the C world where the sizes of basic types are unknown.
> The first bullet on http://www.digitalmars.com/d/portability.html implies some wiggle room on this issue. I really liked how D got rid of size ambiguity at first... all the way until I started developing on machines that were not 32 bit. When I don't care about the true size, I feel guilty using "int" all over the place because it is a fixed size. I'd love to see both a fixed and variable size option available. Maybe:
>   int - variable size
>   int32 - fixed size
>   int64 - fixed size
> If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.

No! Naturally programmers would use 'int' everywhere, and this would create portability issues all over again. var_int, int_be32 (bigger than or equal to 32 bits), or int_word would be OK though.

renoX
Nov 19 2007
renoX wrote:
> Jason House wrote:
>> I'd love to see both a fixed and variable size option available. Maybe:
>>   int - variable size
>>   int32 - fixed size
>>   int64 - fixed size
>> If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.
> No! Naturally programmers would use 'int' everywhere and this would create again portability issue.

What would the portability issue be? If they use int and don't care about the true size, it'll port fine.

> var_int, int_be32 (bigger or equal 32 bit), int_word: would be ok though.

I'm assuming programmers won't use long-winded type definitions if they can avoid it.
Nov 22 2007
Jason House wrote:
> renoX wrote:
>> Jason House wrote:
>>> I'd love to see both a fixed and variable size option available. Maybe:
>>>   int - variable size
>>>   int32 - fixed size
>>>   int64 - fixed size
>>> If that's done, the size of types become obvious when the programmer cares about them and may make size-sensitive code more obvious.
>> No! Naturally programmers would use 'int' everywhere and this would create again portability issue.
> What would the portability issue be? If they use int and don't care about the true size, it'll port fine.
>> var_int, int_be32 (bigger or equal 32 bit), int_word: would be ok though.
> I'm assuming programmers won't use long windes type definitions if they can avoid it.

Sure, it's the same as C, except that if you look at the real world, you'll see that there are many portability issues in C due to this. There are many not-very-good or overworked programmers who care only about their current target, so if integers with a varying size are the default, portability will be poor.

IMHO, that's a case of 'premature optimisation': providing machine-sized integers for optimisation is nice, but using them as the default sucks, especially since it's not obvious that they are always faster. 64-bit integers on a 64-bit CPU can be slower than 32-bit integers due to the increased memory and cache usage.

renoX
Nov 25 2007
Gregor Richards wrote:
> Sean Kelly wrote:
>> Walter Bright wrote:
>>> Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:
>>>   char[] => string
>>>   wchar[] => wstring
>>>   dchar[] => dstring
>>> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.
>> This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)
> Worse looking than toUtf16?

Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word. Something about the single capital letter in the middle of the word as the distinguishing characteristic, and the fact that the 'W' and 'D' do not correlate to anything meaningful in English. Didn't someone post recently that the mind is trained to recognize words by their first and last letter? I tihnk its smoehtnig lkie taht. With toUtf8, etc, I basically just see the trailing '8' and I know what it is. Trying to pick out a 'W' or 'D' in the middle of a word is much more difficult, particularly since it is next to another capital letter.

> Would you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.

No, but I feel that this is an invalid comparison. We are talking about function names concerning type transformations, not type names.

Sean
Nov 19 2007
"Sean Kelly" <sean f4.ca> wrote ...Gregor Richards wrote:Hear hear! :oSean Kelly wrote:Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word.Walter Bright wrote:Worse looking than toUtf16?Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do: char[] => string wchar[] => wstring dchar[] => dstring These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)Something about the single capital letter in the middle of the word as the distinguishing characteristic, and the fact that the 'W' and 'D' do not correlate to anything meaningful in English. Didn't someone post recently that the mind is trained to recognize words by their first and last letter? I tihnk its smoehtnig lkie taht. With toUtf8, etc, I basically just see the trailing '8' and I know what it is. Trying to pick out a 'W' or 'D' in the middle of a word is much more difficult, particularly since it is next to another capital letter.Yes, it looks more akin to GoBbleDeGOOk that other options. I find such things to be as distasteful as Walter finds toUtf8 <g>Good pointWould you prefer if int => int32, long => int64, short => int16, byte => int8, real => float80 (portability be damned), double => float64, float => float32? They'd certainly be more obvious, but I can tell you I'd go crazy.No, but I feel that this is an invalid comparison. We are talking about function names concerning type transformations, not type names.
Nov 19 2007
Sean Kelly wrote:
> Gregor Richards wrote:
>> Sean Kelly wrote:
>>> Walter Bright wrote:
>>>> Phobos (and D) has undergone some evolution in the thinking about unicode strings, and it certainly has a few anachronisms in its names. But I think we've evolved to the point where going forward, we know what to do:
>>>>   char[] => string
>>>>   wchar[] => wstring
>>>>   dchar[] => dstring
>>>> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.
>>> This seems fair. It would reinforce the idea that strings really do use a common encoding format, and that foreign encodings are relegated to a different form of transport. Now if only toWString didn't look so horrible :-)
>> Worse looking than toUtf16?
> Yes. I find the 'W' or 'D' in the middle of the name difficult to read. It literally hurts my eyes to look at that particular word. Something about the single capital letter in the middle of the word as the distinguishing characteristic, and the fact that the 'W' and 'D' do not correlate to anything meaningful in English. Didn't someone post recently that the mind is trained to recognize words by their first and last letter? I tihnk its smoehtnig lkie taht. With toUtf8, etc, I basically just see the trailing '8' and I know what it is. Trying to pick out a 'W' or 'D' in the middle of a word is much more difficult, particularly since it is next to another capital letter.

I agree, I think I'd prefer:

  toString
  toStringW
  toStringD

or

  toString
  toString16
  toString32

maybe with an alias for toString to toStringA, and/or toString8.

There is some precedent, as Unicode versions of Windows functions have a trailing W, i.e. CreateFileA, CreateFileW.

Regan
Nov 20 2007
Walter Bright wrote:
[...]
> Non-unicode encodings should use ubyte[], ushort[], etc.

Are you saying that the toMBSz() function should return ubyte* not char*?

Ciao
Nov 20 2007
Roberto Mariottini wrote:
> Walter Bright wrote:
> [...]
>> Non-unicode encodings should use ubyte[], ushort[], etc.
> Are you saying that the toMBSz() function should return ubyte* not char*?

Probably.
Nov 20 2007
Walter Bright wrote:char[] => string wchar[] => wstring dchar[] => dstring These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.At last! This is the way I've been thinking it should be for a long time. However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow. Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast. The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast. If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous. I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them. Here're the details of the solution to this problem that I've thought of: Make char, char*, char[], etc. all implicitly castable to the corresponding ubyte types, and equivalently for wchar/ushort and dchar/uint. 
Then, functions which require UTF-x can continue to use [dw]char while functions which work regardless of encoding (most functions in std.string) should use ubyte. This way, the functions transparently work for [dw]string whilst still working for non-UTF. To be precise, in the above, "work regardless of encoding" should be read as "works on more than one encoding": even a simple function like std.string.strip would have to be changed to work on EBCDIC, for instance. I would assume ASCII, especially given that D doesn't target machines older than relatively modern 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or something else" and it's up to the programmer to not call it on functions which require ASCII. I don't think this is a problem. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 20 2007
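As a rough illustration of the "castville" concern in the proposal above, the sketch below uses present-day Phobos, which postdates this thread. It assumes `std.string.representation` as the way to view a string's code units as `ubyte` without writing a cast; `countByte` is a hypothetical encoding-agnostic helper introduced only for this example.

```d
import std.string : representation;

// Hypothetical encoding-agnostic helper: counts bytes equal to `needle`.
// It makes no assumption about the encoding of `data`.
size_t countByte(const(ubyte)[] data, ubyte needle)
{
    size_t n = 0;
    foreach (b; data)
        if (b == needle)
            ++n;
    return n;
}

void main()
{
    string utf8 = "a,b,c";

    // representation() views the string's code units as ubyte without
    // copying, so no cast is needed when calling a byte-oriented function.
    assert(countByte(utf8.representation, 0x2C) == 2);   // 0x2C is ','

    // The reverse direction still needs an explicit cast, which marks the
    // spot where the programmer asserts that the bytes are valid UTF-8.
    immutable(ubyte)[] raw = [0x68, 0x69];
    string s = cast(string) raw;
    assert(s == "hi");
}
```

This is not the implicit conversion being proposed; it only shows one direction being cheap to express while the unsafe direction stays visible.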
Matti Niemenmaa wrote:
> Walter Bright wrote:
>>   char[] => string
>>   wchar[] => wstring
>>   dchar[] => dstring
>> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.
> At last! This is the way I've been thinking it should be for a long time. However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow. Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.

I think we should be encouraging people to convert this data to UTF-8 before calling any D string handling functions on it (those that accept w/d/char[]). Which implies all D string handling functions should only operate on UTF-8/16/32.

If they want to call a C function like those in std.c.<whatever> on it, it should just work as expected. Which implies std.c.<whatever> functions should accept ubyte* or void* or something, not char*.

> The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

w/d/char[] arrays are implicitly convertible to void[] (and void*?) so perhaps C functions should accept void* instead? I mean, void* means "pointer to something/anything"...

Regan
Nov 20 2007
Regan Heath wrote:
> I think we should be encouraging people to convert this data to UTF-8 before calling any D string handling functions on it (those that accept w/d/char[]). Which implies all D string handling functions should only operate on UTF-8/16/32.

This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.

> w/d/char[] arrays are implicitly convertible to void[] (and void*?) so perhaps C functions should accept void* instead? I mean, void* means "pointer to something/anything"...

void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned byte(s)", which is a different thing entirely.

To me, ubyte[] means either integers in the range 0-255 or "arbitrary data". void[] is more like "arbitrary memory": used for hacking around language restrictions or for extremely low-level stuff such as memory management.

Would you consider malloc as returning the same type of data which mbstrlen accepts?

--
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 20 2007
Matti Niemenmaa wrote:
> Regan Heath wrote:
>> I think we should be encouraging people to convert this data to UTF-8 before calling any D string handling functions on it (those that accept w/d/char[]). Which implies all D string handling functions should only operate on UTF-8/16/32.
> This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.

Yep, but the same thing may occur when calling a D string function, as it expects UTF-8 and may even convert to dchar[] internally (which would probably throw an invalid UTF exception). Worse, it might work in one version of the library and fail in another due to internal changes of that sort.

Meaning, the function cannot guarantee to operate on your 'could be any encoding' data. You'd be better off passing this data to the C function that does what you want. Convert input early and output late, I reckon.

>> w/d/char[] arrays are implicitly convertable to void[] (and void*?) so perhaps C functions should accept void* instead? I mean, void* means "pointer to something/anything"...
> void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned byte(s)", which is a different thing entirely. To me, ubyte[] means either integers in the range 0-255 or "arbitrary data". void[] is more like "arbitrary memory": used for hacking around language restrictions or for extremely low-level stuff such as memory management. Would you consider malloc as returning the same type of data which mbstrlen accepts?

Not the same type of data, but they could give/accept the same pointer.

    void *p = malloc(100);
    strcpy((char*)p, "test");
    printf("%d", mbstrlen(p));

Memory is memory; the only difference between char* and void* is that char* knows (thinks) it's pointing at a char.

What about other text encodings which do not have 8 bit sized 'character' pieces, like UCS-2 (but not because UCS-2 is a subset of UTF-16 and we can handle it as such). I'm not sure any exist, so this point may be invalid, but if one did exist then ubyte[] would not be the correct way to store it; perhaps ushort[] would.

Or.. we could use void[]/void* for all types of unknown data and be done with it. Using void* basically says "we don't know the type/format of the data but we assume the function receiving the data does".

Regan
Nov 20 2007
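The "convert input early" approach can be sketched as follows for the one case where the source encoding is actually known. This assumes the input really is ISO-8859-1, in which every byte value equals its Unicode code point; `latin1ToUtf8` is a hypothetical helper name, and the sketch does nothing to address the objection that the encoding of an arbitrary file cannot be detected.

```d
import std.utf : encode;

// Hypothetical helper: transcodes bytes known to be ISO-8859-1 into UTF-8.
// Each Latin-1 byte value is numerically equal to its Unicode code point,
// so the byte can be widened to dchar and then encoded.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    char[] result;
    foreach (b; bytes)
    {
        char[4] buf;
        immutable len = encode(buf, cast(dchar) b);
        result ~= buf[0 .. len];
    }
    return result.idup;
}

void main()
{
    // "Julio César" in ISO-8859-1: 'é' is the single byte 0xE9.
    immutable(ubyte)[] latin1 =
        [0x4A, 0x75, 0x6C, 0x69, 0x6F, 0x20, 0x43, 0xE9, 0x73, 0x61, 0x72];

    string utf8 = latin1ToUtf8(latin1);
    assert(utf8 == "Julio C\u00E9sar");
    assert(utf8.length == latin1.length + 1);   // 'é' takes two bytes in UTF-8
}
```

Once the data is in a `string`, functions that require valid UTF-8 can be called on it without casts.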
Regan Heath wrote:Matti Niemenmaa wrote:Which is why I think that unless you know it's UTF-8, you should use ubyte[]. Functions which expect UTF-8 would require char[], thus causing a type error.Regan Heath wrote:Yep, but the same thing may occur calling a D string function as it expects UTF-8 and may even convert to dchar[] internally (which would probably throw an invalid UTF exception).I think we should be encouraging people to convert this data to UTF-8 before calling any D string handling functions on it (those that accept w/d/char[]). Which implies all D string handling functions should only operate on UTF-8/16/32.This is an impossible task. Given a plaintext file, you cannot know what encoding it is in. If you assume an encoding and convert it to UTF-8 for internal use and then recode it back to that encoding for output, you may lose information.You'd be better of passing this data to the C function that does what you want.There's not always a C function that does what you want available. Both Phobos's and Tango's string processing capabilities are greater than the C standard library's even for plain ASCII. The point is to make it easy to use non-UTF strings when necessary, without having to resort to huge amounts of casts or writing your own functions with the correct type signatures.What about other text encodings which do not have 8 bit sized 'character' pieces, like UCS-2 (but not because UCS-2 is a subset of UTF-16 and we can handle it as such). I'm not sure any exist, so this point may be invalid, but if one did exist then ubyte[] would not be the correct way to store it, perhaps ushort[] would.Walter mentioned ushort[] in his post, as did I in mine.Or.. we could use void[]/void* for all types of unknown data and be done with it. Using void* basically says "we don't know the type/format of the data but we assume the function receiving the data does".I just think "void" means "typeless" or "I don't know the type". 
"ubyte" means something like "byte-oriented data" or "I don't care about the type". It all depends on your point of view, but I think it's nice to have a semantic difference between void and ubyte. The meaning of plain byte, on the other hand, eludes me, beyond just "integer from -128 to 127". The problem with using void to store data is also that the garbage collectors assume it may contain pointers, and thus scan it for uncollected memory. It may also be that if they find a valid pointer (small, but nonzero, probability) they do not free memory which should be released, thus retaining it as long as the data lives, which could be as long as the program runs. Hell, we /could/ use void[] to replace char[], byte[], and ubyte[], and why not the rest of the types, too. But this isn't asm. This is D! -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 20 2007
Matti Niemenmaa wrote:
> The meaning of plain byte, on the other hand, eludes me, beyond just "integer from -128 to 127".

To my mind byte = "signed interpretation of 8 bits".

Regan
Nov 21 2007
Matti Niemenmaa wrote:Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.You can't assume that a function designed to work on an UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.This is probably the actual problem: C string functions should accept ubyte* instead of char* because a ubyte doesn't have an implied encoding while char does.If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous.Unless you are referring to a conversion library like ICU, I don't understand your point on "encoding-independent functions". Phobos' string functions aren't "encoding-independent".I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them.If you've done that I fear you'll see lots of exceptions appearing in your string handling code once you deliver your program to any non-english speaking user.Here're the details of the solution to this problem that I've thought of: Make char, char*, char[], etc. all implicitly castable to the corresponding ubyte types, and equivalently for wchar/ushort and dchar/uint. 
Then, functions which require UTF-x can continue to use [dw]char while functions which work regardless of encoding (most functions in std.string) should use ubyte. This way, the functions transparently work for [dw]string whilst still working for non-UTF.Most function in std.string *require* UTF-8 or they'll blow up with a "Error: 4invalid UTF-8 sequence" message. Actually, I think the implicit casting would be useful for string literals: byte[] foo = "Julio César"; // In ISO-8859-1. But then I need some way to tell the compiler that the string is in ISO-8859-1. What I don't see is where does your proposal helps with the example you were giving. For example, if I try to uppercase foo I would get an exception: toupper(foo); // BOOM!To be precise, in the above, "work regardless of encoding" should be read as "works on more than one encoding": even a simple function like std.string.strip would have to be changed to work on EBCDIC, for instance. I would assume ASCII, especially given that D doesn't target machines older than relatively modern 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or something else" and it's up to the programmer to not call it on functions which require ASCII. I don't think this is a problem.I think this is unrealistic unless you want to change std.string to be something more like ICU. There are just too many (popular) encodings and variations in use today... and you'll have to support most of them once you start promising to "works on more than one encoding". Even Unicode has UCS which is the not-quite-UTF encoding used in Windows NT4 (yes, there are still lots of machines using NT4). -- Julio César Carrascal Urquijo http://jcesar.artelogico.com/
Nov 20 2007
Julio César Carrascal Urquijo wrote:
> Even Unicode has UCS which is the not-quite-UTF encoding used in Windows NT4 (yes, there are still lots of machines using NT4).

FYI: You probably already know this but I wanted to be sure, plus others might find it of interest.

http://en.wikipedia.org/wiki/UTF-16

UCS-2 is not quite UTF-16, but UCS-2 is a subset of UTF-16 ("upwards compatibility from UCS-2 to UTF-16"); it's essentially UTF-16 without the surrogate pairs.

So, in D you can generally* say:

    wchar[] data = cast(wchar[]) std.file.read("filename");

and it should work without throwing any invalid UTF errors.

* This may depend on whether it's UCS-2, UCS-2BE, or UCS-2LE. I'm not sure which format D's UTF-16 is in.

Regan
Nov 21 2007
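The relationship between the two encodings can be checked from the other direction: UTF-16 text fits in UCS-2 only if it contains no surrogate code units (0xD800 to 0xDFFF). The sketch below is a minimal illustration under that definition; `hasSurrogates` is a hypothetical helper name, and byte order is not considered.

```d
// Hypothetical helper: reports whether a UTF-16 buffer contains any
// surrogate code units, i.e. anything UCS-2 cannot represent.
bool hasSurrogates(const(wchar)[] units)
{
    // Iterating with a wchar loop variable visits raw code units,
    // not decoded code points.
    foreach (wchar u; units)
        if (u >= 0xD800 && u <= 0xDFFF)
            return true;
    return false;
}

void main()
{
    // Text inside the Basic Multilingual Plane is identical in both.
    assert(!hasSurrogates("plain BMP text"w));

    // U+1D11E (musical G clef) needs a surrogate pair in UTF-16.
    assert(hasSurrogates("\U0001D11E"w));
}
```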
Julio César Carrascal Urquijo wrote:Matti Niemenmaa wrote:I am well aware of this. I chose strip as an example because it does work on any encoding: it simply calls std.ctype.isspace on each char.Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.You can't assume that a function designed to work on an UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.This is probably the actual problem: C string functions should accept ubyte* instead of char* because a ubyte doesn't have an implied encoding while char does.Yes. But there are also many D string functions which would work on any encoding.Most are, actually, except for the fact that D character constants are always ASCII. Almost all the std.string functions will work for any "extended ASCII" encoding. And that's what I mean. Given that D doesn't target the kind of machines that use EBCDIC, I use "encoding-independent" to mean either "works on any encoding" or "works on any encoding with ASCII as the lower 128 values".If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous.Unless you are referring to a conversion library like ICU, I don't understand your point on "encoding-independent functions". Phobos' string functions aren't "encoding-independent".Trust me, I know what I'm doing. For instance, the integer conversion functions in std.conv only look for values in the range '0' to '9', ignoring all others. If the encoding has the digits in the same place as ASCII, it will work, regardless of what all the other bytes in the encoding are. If the encoding has the digits in a different place than ASCII, then it won't work, true. 
But I think you'll find that using EBCDIC or another non-ASCII-based encoding will confuse most of the programs you've got installed on your computer.I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them.If you've done that I fear you'll see lots of exceptions appearing in your string handling code once you deliver your program to any non-english speaking user.Most function in std.string *require* UTF-8 or they'll blow up with a "Error: 4invalid UTF-8 sequence" message.No, they do not. Some do, but not most. Of all the functions that take char[] or char* in std.string: Functions requiring UTF-8: 22 Functions not requiring UTF-8: 35Actually, I think the implicit casting would be useful for string literals: byte[] foo = "Julio César"; // In ISO-8859-1. But then I need some way to tell the compiler that the string is in ISO-8859-1. What I don't see is where does your proposal helps with the example you were giving. For example, if I try to uppercase foo I would get an exception: toupper(foo); // BOOM!True, you would, because std.string.toupper assumes UTF-8. Hence, its type should be string(string), which you couldn't call with byte[], since byte[] doesn't implicitly convert to char[]. But consider what happens now with char[]. The following program compiles, but blows up at runtime: import std.string; void main() { char[] foo = "Julio C\xe9sar"; toupper(foo); } An amendment to my proposal to correct this would be that hex strings, and any string which contains a byte sequence which is not valid UTF, would become ubyte/ushort/uint. Thus the above would fail with a type error because the type of the literal is ubyte[], and it cannot be assigned to a char[]. If the type of foo were ubyte[], calling toupper would fail with a type error. 
Thereby the only way to get the program above to compile, aside from changing the string literal to UTF-8, would be with a cast, which shows that there's something unsafe going on.I think this is unrealistic unless you want to change std.string to be something more like ICU. There are just too many (popular) encodings and variations in use today... and you'll have to support most of them once you start promising to "works on more than one encoding".By "works on more than one encoding" I meant "works for anything with ASCII as the lower 128 bytes". You'll find that covers the majority of encodings in common use today. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 21 2007
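The failure mode debated above, a char[] holding ISO-8859-1 bytes that only blows up at runtime, can be detected explicitly. This sketch uses `std.utf.validate` from present-day Phobos, which throws `UTFException` on malformed input; `isValidUtf8` is a hypothetical wrapper, and the byte arrays stand in for the "Julio C\xe9sar" literal from the post.

```d
import std.utf : validate, UTFException;

// Hypothetical wrapper: true if `bytes` form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // The cast only reinterprets the bytes; validate() does the check.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "Julio César" in ISO-8859-1: 'é' is the lone byte 0xE9, which in
    // UTF-8 would have to be followed by two continuation bytes.
    immutable(ubyte)[] latin1 =
        [0x4A, 0x75, 0x6C, 0x69, 0x6F, 0x20, 0x43, 0xE9, 0x73, 0x61, 0x72];

    // The same text in UTF-8: 'é' is the two bytes 0xC3 0xA9.
    immutable(ubyte)[] utf8 =
        [0x4A, 0x75, 0x6C, 0x69, 0x6F, 0x20, 0x43, 0xC3, 0xA9, 0x73, 0x61, 0x72];

    assert(!isValidUtf8(latin1));
    assert(isValidUtf8(utf8));
}
```

Keeping unvalidated data in `ubyte[]` and casting to `char[]` only after a check like this is one way to get the type-level separation the posts argue for.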
Matti Niemenmaa wrote:
> Julio César Carrascal Urquijo wrote:
>> Matti Niemenmaa wrote:
>>> Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying cast.
>> You can't assume that a function designed to work on an UTF-8 strings works with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with any other charset.
> I am well aware of this. I chose strip as an example because it does work on any encoding: it simply calls std.ctype.isspace on each char.

But this behaviour isn't guaranteed. In fact, I would expect that in future a library like iconv will be leveraged to determine if a character 'is a space', and it will assume the input data is UTF-8.

So, if your ASCII-based encoding has characters outside the ASCII range and they just happen to match a valid 'is a space' character from the UTF-8 set, then .. whoops.

Now, I don't have a canonical knowledge of character sets, so it may be that there are no space characters outside the ASCII range defined in UTF-8... (perhaps when you include surrogate pairs?) or, even if they exist, the chance of an ASCII-based character set using that value may be pretty small.

Who knows. All I'm saying is that if a function says it accepts char[] then it is saying "I accept valid UTF-8" and not "I accept any ASCII based character data", so all bets are off if you pass it anything other than UTF-8.

>> This is probably the actual problem: C string functions should accept ubyte* instead of char* because a ubyte doesn't have an implied encoding while char does.
> Yes. But there are also many D string functions which would work on any encoding.

At present. But that's not guaranteed and it may change in the future; in fact, I expect it to.

As far as I can see, the only guaranteed thing is that the C functions will not change and will continue to accept ASCII-based character sets without possible future gotchas.

So, if you must perform string manipulation on non-UTF data then you should either write your own functions, or use the C ones.

Regan
Nov 21 2007
Regan Heath wrote:But, this behvaiour isn't guaranteed. In fact I would expect that in future a library like iconv will be leveraged to determine if a character 'is a space' and it will assume the input data is UTF-8.You're right. See below.So, if your ASCII based encoding has characters outside the ASCII range and they just happen to match a valid 'is a space' character from the UTF-8 set, then .. whoops. Now, I don't have a canonical knowledge of character sets so it may be that there are no space characters outside the ASCII range defined in UTF-8... (perhaps when you include surogate pairs?) or, even if they exist the chance of an ASCII based character set using that value may be pretty small.std.string.LS and std.string.PS are two examples of Unicode whitespace characters. Strip, for some reason, does not strip them.Who knows, all I'm saying is that if a function says it accepts char[] then it is saying "I accept valid UTF-8" and not "I accept any ASCII based character data" so all bets are off if you pass it anything other than UTF-8.You are correct, which is exactly my point: char[] should mean UTF-8 whereas currently many functions use it to mean "text with single-byte characters". That std.string.strip uses char[] currently says nothing about whether it expects UTF-8 or not. Were the std.c package converted to use ubyte[] everywhere, there would be a clear distinction between UTF-8 and "anything". Then, as you say, one should interpret std.string.* as accepting only UTF-8.As far as I can see the only guaranteed thing is that the C functions will not change and will continue to accept ASCII based character sets without possible future gotchas. So, if you must perform string manipulation on non UTF data then you should either write your own functions, or use the C ones.Correct. 
The point is that storing non-UTF data in ubyte/ushort/uint is a difficult task because even the C functions take char (or wchar_t, which I think is wchar on Windows and dchar elsewhere) and thus the code quickly becomes castville. cast here, cast there, everywhere a cast cast - and for no good reason. Thus I believe, as per my original proposal, that library functions be converted to use ubyte[] where they are not meant to accept char[]. This may or may not mean changes in std.string - it's up to the Phobos maintainers to make the choice as to whether a function will ever require UTF-8, and whether to type it as taking char[] or ubyte[]. In any case, at least the C functions should take ubyte[]. The implicit casting from char-whatever to ubyte-whatever is useful when you want to call C functions with D strings. Once again the code would rapidly become castville if it would have to be done explicitly. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
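To make the "castville" complaint concrete, here is a minimal sketch in present-day D (D2, which postdates this thread). `latin1Length` is a hypothetical helper, not a Phobos or Tango function. It shows the explicit cast that is needed to hand non-UTF bytes to a C function whose binding is declared with `char*`, and, for the opposite direction, `std.string.representation`, which exposes a string's bytes without a cast.

```d
import core.stdc.string : strlen;
import std.string : representation;

/// Hypothetical helper: length of a NUL-terminated ISO-8859-1 buffer.
/// The cast exists only because the C binding is declared with char*.
size_t latin1Length(const(ubyte)[] buf)
{
    return strlen(cast(const(char)*) buf.ptr);
}

unittest
{
    // "café" in ISO-8859-1 plus a terminating NUL; the lone 0xE9 is not valid UTF-8.
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9, 0x00];
    assert(latin1Length(latin1) == 4);

    // Going from a UTF-8 string to raw bytes needs no cast.
    immutable(ubyte)[] raw = "abc".representation;
    immutable(ubyte)[] expected = [0x61, 0x62, 0x63];
    assert(raw == expected);
}
```

With the implicit char-to-ubyte conversion proposed above, the cast inside `latin1Length` would become unnecessary if the C bindings took `ubyte*`.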
Nov 21 2007
Matti Niemenmaa wrote:
> Regan Heath wrote:
>
> The point is that storing non-UTF data in ubyte/ushort/uint is a difficult task because even the C functions take char (or wchar_t, which I think is wchar on Windows and dchar elsewhere) and thus the code quickly becomes castville. cast here, cast there, everywhere a cast cast - and for no good reason.

Yeah, agreed 100%

> Thus I believe, as per my original proposal, that library functions be converted to use ubyte[] where they are not meant to accept char[]. This may or may not mean changes in std.string - it's up to the Phobos maintainers to make the choice as to whether a function will ever require UTF-8, and whether to type it as taking char[] or ubyte[]. In any case, at least the C functions should take ubyte[].

Agreed. I would tend to leave the std.string functions taking char[] so that when they finally step up and have complete UTF compatibility their signatures do not change. If we need some functions, like strip, as a stop gap for other encodings then I reckon we add them, perhaps to a different module, and we use ubyte* (or whatever) instead of char[] for the input parameter.

> The implicit casting from char-whatever to ubyte-whatever is useful when you want to call C functions with D strings. Once again the code would rapidly become castville if it would have to be done explicitly.

The only problem I have with implicit cast to ubyte-whatever is that I worry it will have an unexpected side effect somewhere... Perhaps I am being alarmist.

Regan
Nov 21 2007
Perhaps {,w,d}char should become typedefs of u{byte,short,int} and be dropped as keywords?

Matti Niemenmaa wrote:
> Walter Bright wrote:
>> char[] => string
>> wchar[] => wstring
>> dchar[] => dstring
>>
>> These are all unicode strings. Putting non-unicode encodings in them, even temporarily, should be discouraged. Non-unicode encodings should use ubyte[], ushort[], etc.
>
> At last! This is the way I've been thinking it should be for a long time.
>
> However, this requires a change to the language - make char/wchar/dchar types implicitly convertible to ubyte/ushort/uint - and a bunch of library changes - functions that don't require UTF should use ubyte/ushort/uint - in order to be practically usable. Details follow.
>
> Assume you have an ubyte[] named iso_8859_1_string which contains a string encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to work, you need to call "std.string.strip(*cast(char[]*)&iso_8859_1_string)" - note the annoying cast.
>
> The same thing applies the other way, of course - assume the C standard library accepts ubyte* instead of char* for all the C string functions. This is more correct than the current situation, as the C standard library is encoding-independent. Now, if you have a UTF-8 string which you wish to pass to a C string handling function, you need to do, for instance: "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.
>
> If encoding-independent functions accept only char, then it's the former case for _every_ call to a string function when you're dealing with non-UTF strings, which quickly becomes onerous. I actually tried this, but the code ended up so unreadable that I was forced to change it back, thus having arbitrarily-encoded bytes stored in char[], just for the convenience of being able to use string functions on them.
>
> Here are the details of the solution to this problem that I've thought of:
>
> Make char, char*, char[], etc. all implicitly castable to the corresponding ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions which require UTF-x can continue to use [dw]char while functions which work regardless of encoding (most functions in std.string) should use ubyte. This way, the functions transparently work for [dw]string whilst still working for non-UTF.
>
> To be precise, in the above, "work regardless of encoding" should be read as "works on more than one encoding": even a simple function like std.string.strip would have to be changed to work on EBCDIC, for instance. I would assume ASCII, especially given that D doesn't target machines older than relatively modern 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or something else" and it's up to the programmer to not call it on functions which require ASCII. I don't think this is a problem.
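The "ASCII as the common subset" idea can be sketched as a strip that works on raw bytes. This is an illustration in present-day D (D2), not the Phobos implementation; `stripAscii` is a hypothetical name, and the function assumes an ASCII-compatible encoding, so it would be wrong for EBCDIC, as noted above.

```d
/// Hypothetical encoding-agnostic strip: removes leading and trailing ASCII
/// whitespace from raw bytes. Assumes an ASCII-compatible encoding such as
/// ISO-8859-x or UTF-8.
inout(ubyte)[] stripAscii(inout(ubyte)[] s)
{
    static bool isAsciiSpace(ubyte b)
    {
        // space, or one of \t \n \v \f \r
        return b == ' ' || (b >= '\t' && b <= '\r');
    }

    while (s.length && isAsciiSpace(s[0]))
        s = s[1 .. $];
    while (s.length && isAsciiSpace(s[$ - 1]))
        s = s[0 .. $ - 1];
    return s;
}

unittest
{
    // " café\n" in ISO-8859-1: not valid UTF-8, but stripping still works.
    ubyte[] latin1 = [0x20, 0x63, 0x61, 0x66, 0xE9, 0x0A];
    ubyte[] expected = [0x63, 0x61, 0x66, 0xE9];
    assert(stripAscii(latin1) == expected);
}
```

Because the function only ever inspects bytes in the ASCII whitespace range, it never has to decode anything, which is what lets one signature serve several encodings.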
Nov 20 2007
Robert DaSilva wrote:
> Perhaps {,w,d}char should become typedefs of u{byte,short,int} and be dropped as keywords?

There would still need to be special handling for char and string literals, at least (since they're defined as UTF), but yes, this is a possibility.

--
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Nov 21 2007
"Sean Kelly" <sean f4.ca> wrote in message news:fhsts6$5nn$1 digitalmars.com...As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type. My other reasons include consistency (Java uses .toString, .Net uses .ToString, phobos uses .toString) and that "toUtf8" screams "I'm a string class and this method converts my encoding!" while "toString" says "convert this object, whatever it is, to some kind of string." votes += 8 for toString, toWString, and toDString.
Nov 19 2007
Jarrett Billingsley wrote:Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type.Hear hear.
Nov 19 2007
With respect to all, you're perhaps not addressing Sean's deeper questions? Instead, this seems like another bunch of "toUtf8! NO! U ... toString() dammit!" Which is kinda superficial at this point? "Gregor Richards" <Richards codu.org> wrote in message news:fht0qs$9rk$1 digitalmars.com...Jarrett Billingsley wrote:Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type.Hear hear.
Nov 19 2007
Kris wrote:With respect to all, you're perhaps not addressing Sean's deeper questions? Instead, this seems like another bunch of "toUtf8! NO! U ... toString() dammit!" Which is kinda superficial at this point? "Gregor Richards" <Richards codu.org> wrote in message news:fht0qs$9rk$1 digitalmars.com...Confusing it's, post top don't. Actually, we're not even addressing Sean's surficial questions. - Gregor RichardsJarrett Billingsley wrote:Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type.Hear hear.
Nov 19 2007
Gregor Richards wrote:
> Jarrett Billingsley wrote:
>> Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type.
> Hear hear.

Actually, Jarrett's other arguments seemed more compelling to me.

Jarrett Billingsley wrote:
> My other reasons include consistency (Java uses .toString, .Net uses .ToString, phobos uses .toString) and that "toUtf8" screams "I'm a string class and this method converts my encoding!" while "toString" says "convert this object, whatever it is, to some kind of string."

Stating intent seems more important to me than the stylistic issues between toString and toUtf8. I'm all for toString / toWString / toDString for a readable representation of a class and toUtf8 / 16 / 32 for converting encodings.

Also, toStringW seems more readable than toWString, but for me it's not a big deal which one the Tango developers choose.

--
Julio César Carrascal Urquijo
http://jcesar.artelogico.com/
Nov 20 2007
Reply to Jarrett,

> "Sean Kelly" <sean f4.ca> wrote in message news:fhsts6$5nn$1 digitalmars.com...
>> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.
> Do you want to know my single overriding reason for wanting toString instead of toUtf8? Because it's nicer-looking and easier to type. My other reasons include consistency (Java uses .toString, .Net uses .ToString, phobos uses .toString) and that "toUtf8" screams "I'm a string class and this method converts my encoding!" while "toString" says "convert this object, whatever it is, to some kind of string."
>
> votes += 8 for toString, toWString, and toDString.

shouldn't that be?

votes += 8 for toString
votes += 16 for toWString
votes += 32 for toDString.
Nov 19 2007
Sean Kelly wrote:
> I was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandary....

toUtf8 is ugly.

toString/toWString/toDString are opaque and ugly, hard to distinguish from each other.

toString, toStringW, toStringD? Still ugly.

toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type.

toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.
Nov 19 2007
Christopher Wright wrote:Sean Kelly wrote:I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest. In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32). I might counter that I don't see any reason to lose meaning where it is so easily provided, but on the other hand, I agree that new users are more likely to know what a function named toString does than were it named toUtf8. These two points are a wash in my opinion. The remaining concerns are less substantive. I find toWString and toDString difficult to read, but those feelings hold little more weight than "toUtf8 is ugly." I also feel that the term "string" is largely meaningless in programming. But I certainly couldn't win a debate with either point. I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions. SeanI was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandry....toUtf8 is ugly. toString/toWString/toDString are opaque and ugly, hard to distinguish from each other. toString, toStringW, toStringD? Still ugly. toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type. toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.
Nov 19 2007
Sean Kelly wrote:Christopher Wright wrote:My just formed opinion :-) is that any sort of toWstring/toDstring functions should be standalone things that only accept type "string" or "char" as input. Yes there will be some performance penalty in some cases, but I don't think that's significant enough to warrant creating lots of functions that do exactly the same thing, just with different encodings. --bbSean Kelly wrote:I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest. In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32). I might counter that I don't see any reason to lose meaning where it is so easily provided, but on the other hand, I agree that new users are more likely to know what a function named toString does than were it named toUtf8. These two points are a wash in my opinion. The remaining concerns are less substantive. I find toWString and toDString difficult to read, but those feelings hold little more weight than "toUtf8 is ugly." I also feel that the term "string" is largely meaningless in programming. But I certainly couldn't win a debate with either point. I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.I was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandry....toUtf8 is ugly. toString/toWString/toDString are opaque and ugly, hard to distinguish from each other. toString, toStringW, toStringD? Still ugly. toUtf, toUtf16, toUtf32? Slightly less clear, but easier to type. toString, toUtf16, toUtf32? 
Inconsistent, but readable, and it fits well with other conventions.
Nov 19 2007
Sean Kelly wrote:[...] I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.I certainly don't qualify as someone who does a "lot" of i18n programming, but I do some. Regardless, I would have to say that when I see a function called toUtfXX(), I think "Oh, that must convert a string from Latin-1 or something", rather than "Oh, that must give me the UTF-XX representation of an object". Perl is a bad example because it didn't get righteous UTF-8 support until 5.8, but whenever you see "utf8" or similar in a Perl program, it almost invariably involves an encoding/decoding operation. Perhaps it is worth noting that whenever you see "UTF-8" in Java, is most likely So it appears that the precedent is that for most other languages, when "UTF-8" is spelled out explicitly, it is usually in a transcoding context. I don't think toWString() is an ideal name, but it seems to have the right connotations to the naive programmer. Dave
Nov 20 2007
Sean Kelly wrote:
> Christopher Wright wrote:
>> toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.
> I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.

IMHO, the consistent alternative is pretty clear:

char -> string -> toString
wchar -> wstring -> toWString
dchar -> dstring -> toDString

The only problem seems to lie in the aesthetics of the camelCase convention, but doesn't consistency trump aesthetics?

> In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32).

I agree, but this is hardly a new suggestion. I think it has always been pretty clear that one should never store anything but UTF-encoded data in {,w,d}char[]s. Also, I have always felt Tango's toUtf{8,16,32} are a bit too explicitly named. Almost like using toSingleIEEE754 instead of toFloat.

> I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.

I have done quite a bit of text processing and handling of different encodings in D and while naming doesn't matter much as long as it is consistent, what I do is:

* use {,w,d}char strictly for UTF data (I have sometimes cheated here, mainly to be able to use certain std.string functions, but with a good templated string/array library (such as in Tango), that is not necessary)
* use unicode internally as much as possible, transcoding as early and late as possible.
* when there is a reason not to use UTF internally, use typedefs like "typedef char lat1", and keep unknown encodings as ubyte[]s.

Knowing that {,w,d}chars always contain UTF has never been a problem. The problems that do arise come instead from mistakenly using char rather than {,u}byte in C APIs, and from D's horrible behavior of by default crashing instead of recovering from UTF errors. A much better default behavior would be to simply substitute illegal UTF-units with a '?' and keep going. Having to remember to sanitize all untrusted unicode strings is a chore, and forgetting that at any point will lead to crashes in running code at inconvenient situations.

--
Oskar
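The "substitute a '?' and keep going" behaviour can be approximated in user code. The following is a minimal sketch in present-day D (D2), assuming `std.utf.decode` throws `UTFException` on malformed input; `replaceInvalid` is a hypothetical helper, not a Phobos or Tango function.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: returns a copy of `s` in which every code unit that
/// does not begin a well-formed UTF-8 sequence is replaced by '?'.
string replaceInvalid(const(char)[] s)
{
    char[] result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t next = i;               // work on a copy of the index
        try
        {
            decode(s, next);           // throws UTFException on malformed input
            result ~= s[i .. next];    // keep the well-formed sequence as-is
            i = next;
        }
        catch (UTFException)
        {
            result ~= '?';             // substitute and resynchronise
            i += 1;                    // skip exactly one offending code unit
        }
    }
    return result.idup;
}

unittest
{
    assert(replaceInvalid("a\xFFb") == "a?b");   // stray byte in the middle
    assert(replaceInvalid("héllo") == "héllo");  // valid text is untouched
}
```

Running untrusted input through such a function once, at the boundary, means later code can iterate over the string without having to guard every loop against an exception.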
Nov 20 2007
Oskar Linde wrote:Sean Kelly wrote:It depends :-) I prefer the suggested toStringW and toStringD convention. While it doesn't exactly match the returned type name in letter order, the same information is communicated and is done in what I feel is a more readable format. Also, if the words were placed in a larger list and then sorted, they would end up adjacent to one another.Christopher Wright wrote:IMHO, the consistent alternative is pretty clear: char -> string -> toString wchar -> wstring -> toWString dchar -> dstring -> toDString The only problem seems to lie in the aesthetics of the camelCase convention, but doesn't consistency trump aesthetics?toString, toUtf16, toUtf32? Inconsistent, but readable, and it fits well with other conventions.I tend to place a tremendous amount of value on consistency, because the more consistent an API is, the more likely my guesses about it are to be correct. In my opinion, that precludes using the option you suggest.Yup. But to me, this is different from a semi-official declaration to this effect. With the latter, the suggestion is more likely to be enforceable.In my opinion, Walter's suggestion that alternate encodings not be stored in strings is sufficient reason to not bother with the encoding format in the function name (ie. toUtf8/toUtf16/toUtf32).I agree, but this is hardly a new suggestion. I think it has always been pretty clear that one should never store anything but UTF-encoded data in {,w,d}char[]s.Also, I have always felt Tangos toUtf{8,16,32} are a bit too explicitly named. Almost like using toSingleIEEE754 instead of toFloat.Fair enough :-)Darnit, I forgot about the C APIs. I'll have to replace their use of char with char_t or c_char (the latter matches c_long but the former matches wchar_t).I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? 
I would love to hear some more practical concerns regarding the naming convention for these functions.I have done quite a bit of text processing and handling of different encodings in D and while naming doesn't matter much as long as it is consistent, what I do is: * use {,w,d}char strictly for UTF data (I have sometimes cheated here, mainly to be able to use certain std.string functions, but with a good templated string/array library (such as in Tango), that is not necessary) * use unicode internally as much as possible, transcoding as early and late as possible. * when there is a reason not to use UTF internally, use typedefs like "typedef char lat1", and keep unknown encodings as ubyte[]s. Knowing that {,w,d}chars always contain UTF has never been a problem. Problems arising are instead of mistakingly using char rather than {,u}byte in C APIs and D's horrible behavior of by default crashing instead of recovering from UTF errors.A much better default behavior would be to simply substitute illegal UTF-units with a '?' and keep going. Having to remember to sanitize all untrusted unicode strings is a chore, and forgetting that at any point will lead to crashes in running code at inconvenient situations.This is useful information. Thanks. Sean
Nov 20 2007
Sean Kelly wrote:I don't suppose there is anyone who does a lot of internationalization programming who can comment on the utility of one convention vs. the other? I would love to hear some more practical concerns regarding the naming convention for these functions.A D-wide (at least optionally *enforced*) specification that the various types of "character" arrays are really strings, not just arrays of the underlying storage types, would mean that no convention would be needed to convey the meaning, and the simpler name could be used safely (as the type system would imply the encoding). (I can't help but think that this is one more reason why string types should *not* be built-in arrays, even if they are known to the compiler, but I think my chances of persuading Walter that string==array is a mistake are three quarters of ten percent of none at all.) In the absence of a language-enforced/mandated encoding, it's up to the library to force programmers to consider these issues; in that case, names making the encoding clear (even names as ugly as toUtf8) are better than a more readable, more generic but less intention-conveying name like toString. Most of the code I see (in C, C++, Java and more) is far too sloppy about knowing which encoding is used for a given string. Unicode is now mature enough to make some sense for the default in programming languages. Ideally I'd make the encoding something akin to a template parameter, so that the compiler's type-checking could help out -- but I digress into language design (as is inevitable when high level facilities like strings are made part of the language rather than being "just" standard library features). -- James
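The "encoding as something akin to a template parameter" idea might look roughly like the following in present-day D (D2). This is only a sketch of the concept; `Encoding`, `Text`, `byteLength` and `latin1ToUtf8` are all hypothetical names, not part of Phobos or Tango.

```d
import std.string : representation;
import std.utf : encode;

enum Encoding { utf8, latin1 }

/// Hypothetical wrapper: the encoding is part of the type, so mixing
/// encodings becomes a compile-time error rather than a run-time surprise.
struct Text(Encoding enc)
{
    immutable(ubyte)[] bytes;
}

/// Accepts UTF-8 text only.
size_t byteLength(Text!(Encoding.utf8) t)
{
    return t.bytes.length;
}

/// Explicit transcoding step: each ISO-8859-1 byte maps to the code point
/// with the same numeric value.
Text!(Encoding.utf8) latin1ToUtf8(Text!(Encoding.latin1) t)
{
    char[] result;
    foreach (ubyte b; t.bytes)
        encode(result, cast(dchar) b);
    return Text!(Encoding.utf8)(result.idup.representation);
}

unittest
{
    auto cafe = Text!(Encoding.latin1)([0x63, 0x61, 0x66, 0xE9]);
    // Passing Latin-1 where UTF-8 is required does not compile.
    static assert(!__traits(compiles, byteLength(cafe)));
    // After transcoding, 'é' occupies two bytes.
    assert(byteLength(latin1ToUtf8(cafe)) == 5);
}
```

The cost of this design is that every boundary between encodings needs an explicit conversion call, which is exactly the discipline the post argues for.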
Dec 16 2007
Sean Kelly wrote:
> I was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandary. Currently, Tango's use of toUtf8 as the member function for returning char strings is consistent with all use of string operations in Tango. Routines that return wchar strings are named toUtf16 whether they are members of the String class or whether they are intended to perform UTF conversions, and so on. Thus, the convention is consistent and pervasive.
>
> What I discovered during a test conversion of Tango was that converting all uses of toUtf8 to toString /except/ those intended to perform UTF conversions reduced code clarity, and left me unsure as to which name I would actually use in a given situation. For example, there is quite a bit of code in the text and io packages which convert an arbitrary type to a char[] for output, etc. So by making this change I was left with some conversions using toString and others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx versions of these same functions. As this is template code, the choice between toString and toUtf8 in a given situation was unclear.
>
> Given this, I decided to look to Phobos for a model to follow. What I found in Phobos was that it suffers from the same situation as I found Tango in during my test conversion. Routines that convert any type but a string to a char[] are named toString, while the string equivalent is named toUTF8. Given this, I surmised that the naming convention in D is that all strings are assumed to be Unicode, except when they're not. String literals are required to be Unicode, foreach assumes strings to be UTF encoded when performing its automatic conversions, and all of the toString functions in std.string assume UTF-8 as the output format. So why bother with the name toUTF8 in std.utf?
>
> As near as I can tell, the reason for text conversion routines to be named differently is to simplify the use of routines which convert to another format. std.windows.charset, for example, has a routine called toMBSz, to distinguish from the toUTF8 routine. What I find significant about this is that it suggests that while the transport mechanism for strings is the same in each case (both routines return a char[], ie. a string),

Does that even work? I would think there are some valid MBSz's that are invalid UTF sequences, and so toMBSz would have to return byte[].

> the underlying encoding is different. Thus there seems a clear disconnect between the name of the transport mechanism (string), and routines that generate them. With this in mind, I begin to question the point of having toString as the common name for routines that generate char strings. The encoding clearly matters in some instances and cannot be ignored, so ignoring it in others just seems to confuse things.

As far as I'm concerned Utf8 is *the* encoding for text in D. Anything else is only for some special purpose like ease of manipulation (dstring for I18N text that needs fast searching / slicing) or interchange with external APIs (utf16 for working with windows).

> With this in mind, I will admit that I am questioning the merit of changing Tango's toUtf8 routines to be named toString. Doing so seems to sacrifice both operational consistency and clarity in an attempt to maintain consistency with the name of the transport mechanism: string. And as I have said above, while strings in D are generally expected to be Unicode, they are clearly not always Unicode, as the existence of std.windows.charset can attest.

I really think toMBSz should be returning byte[] and fromMBSz should be taking a byte*. The doc for types says char is unsigned 8 bit UTF-8. Period. And you get errors from the compiler if you try to initialize a string with something that's not valid UTF-8. So MBSz data has no business parading around dressed up as char[].

> So I am left wondering whether someone can explain why toString is the preferred name for string-producing routines in D? I feel it is very important to establish a consistent naming mechanism for D, and as Phobos seems to be the model in this case I may well have no choice in the matter of toUtf8 vs. toString. But I would feel much better about the change if someone could provide a sound reason for doing so, since my first attempt at a conversion has left me somewhat worried about its long-term effect on code clarity.
>
> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.

Since the tango convention is to treat acronyms as single words (the actual tango utf methods are called toUtf8, toUtf16 and toUtf32), it seems there's an argument for treating wstring and dstring as single entities too. So then it would be:

toString, toWstring, toDstring

Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.

--bb
Nov 19 2007
Bill Baxter wrote:Sean Kelly wrote:It works because D performs no run-time verification that what's in a char[] is actually Unicode. You could dump binary data in a string if you really wanted to.I was looking at converting Tango's use of toUtf8 to toString today and ran into a bit of a quandry. Currently, Tango's use of toUtf8 as the member function for returning char strings is consitent with all use of string operations in Tango. Routines that return wchar strings are named toUtf16 whether they are members of the String class or whether they are intended to perform UTF conversions, and so on. Thus, the convention is consitent and pervasive. What I discovered during a test conversion of Tango was that converting all uses of toUtf8 to toString /except/ those intended to perfom UTF conversions reduced code clarity, and left me unsure as to which name I would actually use in a given situation. For example, there is quite a bit of code in the text and io packages which convert an arbitrary type to a char[] for output, etc. So by making this change I was left with some conversions using toString and others using toUtf8, toUtf16, and toUtf32, not to mention the fromXxx versions of these same functions. As this is template code, the choice between toString and toUtf8 in a given situation was unclear. Given this, I decided to look to Phobos for model to follow. What I found in Phobos was that it suffers from the same situation as I found Tango in during my test conversion. Routines that convert any type by a string to a char[] are named toString, while the string equivalent is named toUTF8. Given this, I surmised that the naming convention in D is that all strings are assumed to be Unicode, except when they're not. String literals are required to be Unicode, foreach assumes strings to be UTF encoded when performing its automatic conversions, and all of the toString functions in std.string assume UTF-8 as the output format. So who bother with the name toUTF8 in std.utf? 
As near as I can tell, the reason for text conversion routines to be named differently is to simplify the use of routines which covert to another format. std.windows.charset, for example, has a routine called toMBSz, to distinguish from the toUTF8 routine. What I find significant about this is that it suggests that while the transport mechanism for strings is the same in each case (both routines return a char[], ie. a string),Does that even work? I would think there are some valid MBSz's that are invalid UTF sequences, and so toMBSz would have to return byte[].I really think toMBSz should be returning byte[] and fromMBSz should be taking a byte*. The doc for types says char is unsigned 8 bit UTF-8. Period. And you get errors from the compiler if you try to initialize a string with something that's not valid UTF-8. So MBSz data has no business parading around dressed up as char[].I think you're right about toMBSz.Since the tango convention is to treat acronyms as single words, (the actual tango utf methods are called toUtf8 toUtf16 and toUtf32) it seems there's an argument for treating wstring and dstring as single entities too. So then it would be: toString, toWstring, toDstring Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.Yeah I was thinking the same thing. It's certainly easier for me to read than the other form. Sean
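The point that nothing verifies the contents of a char[] at run time can be illustrated with a short sketch in present-day D (D2). It assumes `std.utf.validate` throws `UTFException` for malformed input; `isWellFormed` is a hypothetical wrapper name.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isWellFormed(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    // 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
    immutable(ubyte)[] raw = [0xC3, 0x28];

    // The cast compiles and runs without complaint: nothing inspects the bytes.
    string s = cast(string) raw;
    assert(s.length == 2);

    // Only an explicit check reveals the problem.
    assert(!isWellFormed(s));
    assert(isWellFormed("toMBSz"));
}
```

This is why a routine such as toMBSz can return its result typed as a string without anything failing until some later code tries to decode it.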
Nov 19 2007
"Sean Kelly" <sean f4.ca> wrote in message [snip]Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstringDon't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.
Nov 19 2007
Kris wrote:"Sean Kelly" <sean f4.ca> wrote in message [snip]How so? toString returns a string. toInt returns an int. toFloat returns a float. to??? returns a wstring. Seems whatever goes in the ??? place should include the letters "w-s-t-r-i-n-g" in that order. --bbBill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstringDon't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.
Nov 19 2007
Bill Baxter wrote:Kris wrote:Only if you have recognized wstring and dstring as good names for those aliases <g> -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango"Sean Kelly" <sean f4.ca> wrote in message [snip]How so? toString returns a string. toInt returns an int. toFloat returns a float. to??? returns a wstring. Seems whatever goes in the ??? place should include the letters "w-s-t-r-i-n-g" in that order.Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstringDon't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.
Nov 20 2007
Lars Ivar Igesund wrote:
> Only if you have recognized wstring and dstring as good names for those aliases <g>

They'd be consistent with wchar and dchar.
Nov 20 2007
Walter Bright wrote:
> Lars Ivar Igesund wrote:
>> Only if you have recognized wstring and dstring as good names for those aliases <g>
> They'd be consistent with wchar and dchar.

Right ... now I don't like those either ;)

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
Lars Ivar Igesund wrote:
> Walter Bright wrote:
>> Lars Ivar Igesund wrote:
>>> Only if you have recognized wstring and dstring as good names for those aliases <g>
>> They'd be consistent with wchar and dchar.
> Right ... now I don't like those either ;)

What can I say? !!
Nov 20 2007
"Walter Bright" <newshound1 digitalmars.com> wrote
> Lars Ivar Igesund wrote:
>> Walter Bright wrote:
>>> Lars Ivar Igesund wrote:
>>>> Only if you have recognized wstring and dstring as good names for those aliases <g>
>>> They'd be consistent with wchar and dchar.
>> Right ... now I don't like those either ;)
> What can I say? !!

hehe

Well, perhaps it's worth noting that all of these names are probably a cousin of "hungarian notation", since the name is being decorated with some kind of indicator of what it represents? The question perhaps should be - why is that?

If we speculate, for a moment, that the language supported overload on return type:

    char[] toString();
    wchar[] toString();
    dchar[] toString();

then, there would be no issue here. Right? However, we don't have overload-on-return-type, so it seems to me that the decorated names are a means to work around that. Does that seem logical?

Perhaps what we're seeing here, Walter, is a measure of distaste for the notion of decorated-names?
Nov 20 2007
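[Editorial sketch, not part of the thread.] The overload-on-return-type idea raised above can be approximated in current D by letting the caller name the result type as a template argument. The following is a minimal illustration that assumes modern D2 and Phobos' `std.conv.to`; the function name `describe` is hypothetical and belongs to neither Tango nor Phobos.

```d
import std.conv : to;

// Hypothetical helper: the caller picks the string type it wants, which
// stands in for the overload-on-return-type that the language lacks.
// With no template argument, the result defaults to string (UTF-8).
S describe(S = string)(int value)
{
    // to!S transcodes the literal and formats the integer as type S.
    return to!S("value=") ~ to!S(value);
}
```

A call then reads `describe(42)`, `describe!wstring(42)` or `describe!dstring(42)`. The encoding still appears at the call site, but as a type argument rather than as a decoration in the function name.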
Kris wrote:
> "Walter Bright" <newshound1 digitalmars.com> wrote
>> Lars Ivar Igesund wrote:
>>> Walter Bright wrote:
>>>> Lars Ivar Igesund wrote:
>>>>> Only if you have recognized wstring and dstring as good names for those aliases <g>
>>>> They'd be consistent with wchar and dchar.
>>> Right ... now I don't like those either ;)
>> What can I say? !!
> hehe
>
> Well, perhaps it's worth noting that all of these names are probably a cousin of "hungarian notation", since the name is being decorated with some kind of indicator of what it represents? The question perhaps should be - why is that?
>
> If we speculate, for a moment, that the language supported overload on return type:
>
>     char[] toString();
>     wchar[] toString();
>     dchar[] toString();
>
> then, there would be no issue here. Right? However, we don't have overload-on-return-type, so it seems to me that the decorated names are a means to work around that. Does that seem logical?
>
> Perhaps what we're seeing here, Walter, is a measure of distaste for the notion of decorated-names?

    class String
    {
        char[] opImplicitCast () {}
        wchar[] opImplicitCast () {}
        dchar[] opImplicitCast () {}
    }

    String toString () {}

How does that look?
Nov 20 2007
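[Editorial sketch, not part of the thread.] The `opImplicitCast` members in the post above are speculative. A rough approximation that does compile in current D uses a templated `opCast`, so the conversion is explicit rather than implicit. The type name `Utf8Text` is hypothetical and is not Tango's class; `std.conv.to` and `std.traits.isSomeString` are Phobos facilities.

```d
import std.conv : to;
import std.traits : isSomeString;

// Hypothetical wrapper type holding UTF-8 data.
struct Utf8Text
{
    string data;

    // cast(T) value is rewritten to value.opCast!T(), so one template
    // covers string, wstring and dstring. The constraint rejects
    // non-string target types.
    T opCast(T)()
        if (isSomeString!T)
    {
        return to!T(data);
    }
}
```

Usage would look like `auto w = cast(wstring) Utf8Text("abc");`. This keeps a single undecorated conversion point, at the cost of requiring a cast at each call site.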
Christopher Wright wrote:
>     class String
>     {
>         char[] opImplicitCast () {}
>         wchar[] opImplicitCast () {}
>         dchar[] opImplicitCast () {}
>     }
>
>     String toString () {}
>
> How does that look?

Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through.

Sean
Nov 20 2007
Sean Kelly wrote:
> Christopher Wright wrote:
>>     class String
>>     {
>>         char[] opImplicitCast () {}
>>         wchar[] opImplicitCast () {}
>>         dchar[] opImplicitCast () {}
>>     }
>>
>>     String toString () {}
>>
>> How does that look?
> Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through.
>
> Sean

It is already renamed to Text.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
Lars Ivar Igesund wrote:
> Sean Kelly wrote:
>> Christopher Wright wrote:
>>>     class String
>>>     {
>>>         char[] opImplicitCast () {}
>>>         wchar[] opImplicitCast () {}
>>>         dchar[] opImplicitCast () {}
>>>     }
>>>
>>>     String toString () {}
>>>
>>> How does that look?
>> Tango already has a String class with toUtf8, toUtf16, and toUtf32 member functions. This was one of our original objections to the idea of toString as a member function that must return a char[]. We will have to rename the class to something else if this change goes through.
> It is already renamed to Text.

Oops!
Nov 20 2007
Kris wrote:
> "Sean Kelly" <sean f4.ca> wrote in message [snip]
>>> Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.
>> Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.
> Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

FWIW, this would be preferable to me too.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Nov 20 2007
Lars Ivar Igesund wrote:
> Kris wrote:
>> "Sean Kelly" <sean f4.ca> wrote in message [snip]
>>>> Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.
>>> Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.
>> Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring
> FWIW, this would be preferable to me too.

+votes
Nov 20 2007
"Kris" <foo bar.com> wrote in message news:fhtru8$1no5$1 digitalmars.com...
> "Sean Kelly" <sean f4.ca> wrote in message [snip]
>>> Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.
>> Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.
> Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

Now that I've seen toWString and toStringW, I'll have to say I do like the toStringW/toStringD version better.

    // retract previous votes
    toWString.votes -= 8;
    toDString.votes -= 8;
    toStringW.votes += 334;
    toStringD.votes += 334;
Nov 20 2007
Kris wrote:
> "Sean Kelly" <sean f4.ca> wrote in message [snip]
>>> Don't know if that hurts your eyes less or not, but it seems more consistent with Tango's existing naming convention to me than toWString, etc.
>> Yeah I was thinking the same thing. It's certainly easier for me to read than the other form.
> Bill: actually, toString, toStringW and toStringD are more consistent with themselves, and with Tango convention. Even toString, toString16 and toString32 are significantly more style-consistent than toWString and toWstring

This conversation caught my eye and I cringed at toWString and toDString. toStringW and toStringD are acceptable though.

Sean made a brief argument from psychology earlier. It made me remember this thing:

Olny srmat poelpe can raed tihs. I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

Perhaps this is important for naming conventions in general? Any similarly named entities must differ at the beginning or end of the name. I'm not sure how deeply this affects existing APIs or if it causes problems ;)

It is also noteworthy that char, wchar, dchar are consistent with that naming constraint, but not with toStringW and toStringD. IMO the former matters more than the latter, simply because it is ingrained into our minds. Still, I am not entirely convinced that such a constraint is wise in general, though I do like its application here.
Nov 20 2007
Sean Kelly wrote:
> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.

1) On the question of toWString vs toWstring and consistency:

I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

2) On the question of toWString vs toStringW

It seems to be pretty well agreed in this thread that toWString is more consistent but toStringW is prettier. I could be wrong but I think usage pattern of these W and D variants of the functions will be bimodal: either very frequent or very infrequent. In the former case I'd probably want to make a simpler alias like 'wstr'. In the latter case I'd want it to be the most consistent thing possible to be easy to remember for the few times I use it.

--bb
Nov 20 2007
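[Editorial sketch, not part of the thread.] The "simpler alias like 'wstr'" mentioned above might look as follows in current D. It assumes Phobos' `std.utf.toUTF16` and `std.utf.toUTF32` rather than Tango's functions, and the alias names `wstr` and `dstr` are hypothetical.

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical short names for code that converts very frequently.
// The long, consistent names remain available for occasional use.
alias wstr = toUTF16;
alias dstr = toUTF32;
```

With these in scope, `wstr("text")` yields a wstring and `dstr("text")` a dstring, so frequent callers get brevity without changing the library's naming scheme.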
Bill Baxter wrote:
> Sean Kelly wrote:
>> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.
> 1) On the question of toWString vs toWstring and consistency:
>
> I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

Good question. Probably toUInt, though I don't like it much :-) For these conversion routines, I'll admit I find the idea that the type name should be repeated exactly, which suggests something like to_wstring, but I don't imagine anyone finds that appealing.

> 2) On the question of toWString vs toStringW
>
> It seems to be pretty well agreed in this thread that toWString is more consistent but toStringW is prettier. I could be wrong but I think usage pattern of these W and D variants of the functions will be bimodal: either very frequent or very infrequent. In the former case I'd probably want to make a simpler alias like 'wstr'. In the latter case I'd want it to be the most consistent thing possible to be easy to remember for the few times I use it.

Agreed.

Sean
Nov 20 2007
"Sean Kelly" <sean f4.ca> wrote in message news:fhvsaf$2k6t$1 digitalmars.com...
> Bill Baxter wrote:
>> Sean Kelly wrote:
>>> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.
>> 1) On the question of toWString vs toWstring and consistency:
>>
>> I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.
> Good question. Probably toUInt, though I don't like it much :-) For these conversion routines, I'll admit I find the idea that the type name should be repeated exactly, which suggests something like to_wstring, but I don't imagine anyone finds that appealing.

It was resolved by having a Float module and an Integer module, containing relevant parse/format methods. The toUtf/toString() family is the only one where the type is decorated in the name (hungarian style)
Nov 20 2007
Bill Baxter wrote:
> Sean Kelly wrote:
>> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.
> 1) On the question of toWString vs toWstring and consistency:
>
> I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.

If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily.

Also, 'int' is shorter than 'string'. Not a very good comparison.
Nov 20 2007
Christopher Wright wrote:
> Bill Baxter wrote:
>> Sean Kelly wrote:
>>> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.
>> 1) On the question of toWString vs toWstring and consistency:
>>
>> I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.
> If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily.

I don't understand you. Tell the difference between what?

> Also, 'int' is shorter than 'string'. Not a very good comparison.

What does length have to do with whether or not the naming scheme is consistent?

--bb
Nov 20 2007
Bill Baxter wrote:
> Christopher Wright wrote:
>> Bill Baxter wrote:
>>> Sean Kelly wrote:
>>>> As an alternative, I can only suggest that toUTF8, toUTF16, and toUTF32 be named toString, toWString, and toDString, respectively, and Unicode should be assumed as the standard encoding format in D.
>>> 1) On the question of toWString vs toWstring and consistency:
>>>
>>> I don't think there's any clear precedent for either in Tango right now, but my question is, if tango *had* a "to uint" function, what would it be named? toUInt or toUint? Whatever the answer to that is should be the same as the answer to how to name a "to wstring" function.
>> If it had a to uint function and a to int function and a to 'sint' function, what then? If it's only uint, then you can tell the difference quite easily.
> I don't understand you. Tell the difference between what?

Sorry, mistyped. If it were 'to uint' and 'to int', that would be rather clear. 'to uint' and 'to sint' would be less clear, since they're the same number of letters and would have the same capitalization pattern.

>> Also, 'int' is shorter than 'string'. Not a very good comparison.
> What does length have to do with whether or not the naming scheme is consistent?
>
> --bb

Readability. I'd rather sacrifice a bit of consistency -- I can memorize a *few* inconsistencies -- for readability, whose lack will cause more trouble in the future. With shorter identifiers, smaller differences are more noticeable, but 'toWString' is a relatively long identifier.
Nov 20 2007