digitalmars.D - TDPL reaches Thermopylae level
- Andrei Alexandrescu (2/2) Oct 25 2009 303 pages and counting!
- Walter Bright (2/3) Oct 25 2009 Come and get them!
- Jeremie Pelletier (2/5) Oct 26 2009 Soon the PI level, or at least 10 times PI!
- Bill Baxter (3/9) Oct 26 2009 A hundred even. ;-)
- Andrei Alexandrescu (8/17) Oct 26 2009 Coming along. I'm writing about strings and Unicode right now. I was
- Bill Baxter (7/28) Oct 26 2009 So a common way to convert wchar to char might then become ""~myWcharStr...
- Andrei Alexandrescu (10/35) Oct 26 2009 Well, I guess. In particular, to me it's not clear what type we should
- Chris Nicholson-Sauls (5/38) Oct 27 2009 My intuition would be to expect the same as adding an int to a byte: you...
- Denis Koroskin (9/47) Oct 27 2009 ubyte i = 42;
- Bill Baxter (9/67) Oct 27 2009 As Andrei said (and maybe you missed) "With ~=, it's much easier...".
- Andrei Alexandrescu (14/25) Oct 27 2009 Yah, I agree. The problem is, there's a big difference too: all
- Michel Fortin (8/12) Oct 27 2009 Seems the most intuitive option to me. Also, it makes "a ~= b"
- Bill Baxter (6/14) Oct 27 2009 And that kind of suggests to me that even a = b should work.
- Andrei Alexandrescu (8/23) Oct 27 2009 I agree. This one, however, will be very difficult to slide by Walter's
- Pelle Månsson (5/23) Oct 27 2009 int a;
- Bill Baxter (9/38) Oct 27 2009 I'm not sure what point you're trying to make, but wstring <-> string...
- Pelle Månsson (3/35) Oct 27 2009 They are?
- Bill Baxter (17/62) Oct 27 2009 They are all just different representations of Unicode. string, which...
- Pelle Månsson (2/59) Oct 27 2009 Thank you, that cleared things up for me :)
- Leandro Lucarella (12/31) Oct 27 2009 And here is a nice article about Unicode and encodings:
- Andrei Alexandrescu (4/27) Oct 27 2009 Damn guys, with these good explanations, nobody's going to use the one
- Leandro Lucarella (9/35) Oct 27 2009 :)
- Leandro Lucarella (11/38) Oct 29 2009 BTW, seeing the explanation about Unicode in your book, one wonders why
- Justin Johansson (9/49) Oct 27 2009 Though I'm sure Shannon would say that the number of bits of intrinsic i...
- Chris Nicholson-Sauls (13/63) Oct 29 2009 Granted LTR is common enough to be expectable and acceptable. To be per...
- Justin Johansson (3/17) Oct 29 2009 Your overall reply well put. On last point: agree; cheap hacks should b...
- Nick Sabalausky (10/15) Oct 29 2009 Given that just about anything outside of D (at least as far as I've see...
- Lars T. Kyllingstad (4/21) Oct 30 2009 I think this says it all:
- Andrei Alexandrescu (8/34) Oct 30 2009 Yep, there was a frenzy when UCS-2 came about: everybody thought two
- Justin Johansson (16/52) Oct 30 2009 "I personally think UTF-8 is a better overall design though."
- Andrei Alexandrescu (11/70) Oct 30 2009 Thanks for the pointers. One of the reasons for which I like the design
- Jeremie Pelletier (10/32) Oct 26 2009 I don't know if thats a good idea, its better when string encoding is
- Andrei Alexandrescu (10/45) Oct 26 2009 The beauty of it is that reallocation with ~ occurs anyway, and with ~=
- Jeremie Pelletier (5/58) Oct 26 2009 Good points, I didn't think of the separation between characters and
- Bill Baxter (12/77) Oct 26 2009 Yeh, me too. Saving an allocation is good. And I agree that having
303 pages and counting! Andrei
Oct 25 2009
Andrei Alexandrescu wrote:
> 303 pages and counting!
Come and get them!
Oct 25 2009
Andrei Alexandrescu wrote:
> 303 pages and counting! Andrei
Soon the PI level, or at least 10 times PI!
Oct 26 2009
On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
> Andrei Alexandrescu wrote:
>> 303 pages and counting! Andrei
> Soon the PI level, or at least 10 times PI!
A hundred even. ;-)
--bb
Oct 26 2009
Bill Baxter wrote:
> On Mon, Oct 26, 2009 at 8:47 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
>> Soon the PI level, or at least 10 times PI!
> A hundred even. ;-)
Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
Andrei
Oct 26 2009
On Mon, Oct 26, 2009 at 11:51 AM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd. Just using something like to!(char[])(myWcharString) seems less goofy to me. But that subjective reaction is all I have against it.
--bb
Oct 26 2009
Bill Baxter wrote:
> So a common way to convert wchar to char might then become ""~myWcharString? That seems kind of odd.
Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
> Just using something like to!(char[])(myWcharString) seems less goofy to me.
Problem is, an append + one transcoding requires two allocations. We could always define routines in std.string or std.utf:
    append(s, ws); // s ~= ws
but really it's quite unambiguous what ~= should do. A nod from the language is a nice touch.
Andrei
Oct 26 2009
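The trade-off Andrei describes can be sketched in present-day D. This is only an illustration, assuming Phobos' `std.conv.to`; the two function names are made up here and are not the `append` routine floated in the post. The first version transcodes into a temporary and then appends it; the second decodes the wstring one code point at a time and relies on the built-in ability to append a dchar to a string, so no intermediate string is built.

```d
import std.conv : to;

// Hypothetical helper: convert the whole wstring first, then append it.
string appendViaTemporary(string s, wstring ws)
{
    s ~= to!string(ws); // to!string builds a temporary UTF-8 copy of ws
    return s;
}

// Hypothetical helper: decode ws code point by code point and append each one.
// Appending a dchar to a string encodes it as UTF-8 on the way in.
string appendByCodePoint(string s, wstring ws)
{
    foreach (dchar c; ws)
        s ~= c;
    return s;
}

void main()
{
    assert(appendViaTemporary("hello ", "world"w) == "hello world");
    assert(appendByCodePoint("hello ", "world"w) == "hello world");
    // U+20AC (euro sign) is one UTF-16 code unit but three UTF-8 code units.
    assert(appendByCodePoint("", "\u20AC"w).length == 3);
}
```

How many allocations each version really performs depends on the runtime's array capacity handling, so treat the comments as a rough guide rather than a measurement.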
Andrei Alexandrescu wrote:
> Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; i.e., encode to the wider of the two types.
-- Chris Nicholson-Sauls
Oct 27 2009
On Tue, 27 Oct 2009 10:04:33 +0300, Chris Nicholson-Sauls <ibisbasenji gmail.com> wrote:
> My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types.
    ubyte i = 42;
    int j = 1;
    i += j; // still ubyte
same here:
    string a = "hello";
    wstring b = "world"w;
    a ~= b; // still string
Oct 27 2009
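The integer half of Denis's analogy can be checked directly. A small sketch (the helper name `addInto` is invented for illustration): the binary expression `i + j` is typed as int, while the compound assignment `i += j` keeps the left-hand side's type and narrows the result.

```d
// Hypothetical helper: compound assignment keeps the type of the left-hand side.
ubyte addInto(ubyte i, int j)
{
    // The binary expression is promoted to the wider type...
    static assert(is(typeof(i + j) == int));
    // ...but += stores the result back into a ubyte, truncating if needed.
    i += j;
    return i;
}

void main()
{
    assert(addInto(42, 1) == 43);
    assert(addInto(255, 1) == 0); // wraps around: the result is still a ubyte
}
```

This is the same split the thread is circling: `~=` has an obvious result type (the left-hand side), while the type of plain `~` is the open question.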
On Tue, Oct 27, 2009 at 4:37 AM, Denis Koroskin <2korden gmail.com> wrote:
> On Tue, 27 Oct 2009 10:04:33 +0300, Chris Nicholson-Sauls <ibisbasenji gmail.com> wrote:
>> My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types.
> ubyte i = 42; int j = 1; i += j; // still ubyte
> same here:
> string a = "hello"; wstring b = "world"w; a ~= b; // still string
As Andrei said (and maybe you missed) "With ~=, it's much easier...". The only question is about what "a ~ b" should do.
--bb
Oct 27 2009
Chris Nicholson-Sauls wrote:
> Andrei Alexandrescu wrote:
>> [snip]
>> Well, I guess. In particular, to me it's not clear what type we should assign to a concatenation between a string and a wstring. With ~=, it's much easier...
> My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types.
Yah, I agree. The problem is, there's a big difference too: all encodings are able to represent the same information, unlike numeric widths where there's a clear inclusion relationship. It could even be argued that in pure theory UTF-16 is the least general of the three (I dislike UTF-16 from an engineering standpoint; unlike UTF-8 which I think is brilliant, I find UTF-16 is forced and uninspired - the typical outcome of a committee.)
My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
Andrei
Oct 27 2009
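The rule proposed above (the result of lhs ~ rhs takes the string type of lhs) can be approximated as a library function. This is a sketch under assumptions: `concatLikeLhs` is a made-up name, it covers only the string ~ string case, and it leans on `std.conv.to` for the transcoding. It is not how the language or Phobos implements concatenation.

```d
import std.conv : to;

// Hypothetical helper: the result has the same string type as the left operand;
// the right operand is transcoded to match before concatenating.
S concatLikeLhs(S, R)(S lhs, R rhs)
{
    return lhs ~ to!S(rhs);
}

void main()
{
    string a = "hello ";
    wstring b = "world"w;

    auto r1 = concatLikeLhs(a, b);
    static assert(is(typeof(r1) == string));
    assert(r1 == "hello world");

    auto r2 = concatLikeLhs(b, a);
    static assert(is(typeof(r2) == wstring));
    assert(r2 == "worldhello "w);
}
```

With this rule, `lhs = concatLikeLhs(lhs, rhs)` always type-checks, which is the consistency with `~=` that the post is after.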
On 2009-10-27 09:07:06 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
> My current thought is to ascribe lhs ~ rhs the same type as lhs (thereby making ~ consistent with ~= by making lhs ~= rhs same as lhs = lhs ~ rhs) in case lhs is a string type. If lhs is a character type, the result type is obviously the same as rhs.
Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Oct 27 2009
On Tue, Oct 27, 2009 at 6:56 AM, Michel Fortin <michel.fortin michelf.com> wrote:
> Seems the most intuitive option to me. Also, it makes "a ~= b" equivalent to "a = a ~ b" which is always nice.
And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error.
--bb
Oct 27 2009
Bill Baxter wrote:
> And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error.
I agree. This one, however, will be very difficult to slide by Walter's watchful eye. He doesn't like hidden allocations, and a width adjustment does involve one.
Andrei
P.S. I got green light from my editor's marketing folks. Will release The Thermopylae Excerpt of TDPL today for free off my website. Stay tuned. It's a rough draft but I hope you will enjoy it.
Oct 27 2009
Bill Baxter wrote:
> And that kind of suggests to me that even a = b should work. It has many of the same characteristics as ~=. It's pretty unambiguous what you'd expect to happen if not an error.
    int a;
    float b = 2.1;
    a = b;
also unambiguous?
Oct 27 2009
On Tue, Oct 27, 2009 at 12:48 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:
> int a; float b = 2.1; a = b;
> also unambiguous?
I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float.
--bb
Oct 27 2009
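The contrast Bill draws can be made concrete. A minimal sketch, assuming `std.conv.to` for the transcoding (the helper name `roundTrips` is invented): valid UTF text survives a trip through the other two encodings unchanged, whereas a float forced into an int loses its fractional part.

```d
import std.conv : to;

// Hypothetical helper: does s survive a trip through UTF-16 and UTF-32 and back?
bool roundTrips(string s)
{
    return to!string(to!wstring(s)) == s
        && to!string(to!dstring(s)) == s;
}

void main()
{
    // ASCII, a 2-byte, a 3-byte and a 4-byte UTF-8 sequence.
    assert(roundTrips("a\u00E9\u20AC\U0001D11E"));

    // int <- float needs an explicit cast in D and is lossy.
    float b = 2.1;
    int a = cast(int) b;
    assert(a == 2);
    assert(cast(float) a != b);
}
```

The round trip holds for well-formed input; malformed UTF is a separate matter that this sketch does not cover.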
Bill Baxter wrote:
> I'm not sure what point you're trying to make, but wstring <-> string <-> dstring are all lossless conversions. That isn't the case with int and float.
They are? ...Then what is the point of wstring, dstring?
Oct 27 2009
On Tue, Oct 27, 2009 at 1:06 PM, Pelle Månsson <pelle.mansson gmail.com> wrote:
> They are? ...Then what is the point of wstring, dstring?
They are all just different representations of Unicode.
string, which is Unicode in UTF-8, is good because it's the least wasteful for mostly ASCII text. And has a nice ASCII backwards compatibility story.
dstring, which is Unicode in UTF-32, is good because you have one element = one character. So it's good for doing substring and other text manipulations.
wstring, which is UTF-16, is good because it lets you call Windows Unicode functions.
Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb
--bb
Oct 27 2009
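The size trade-offs listed above show up in the `.length` of each string type, which counts code units rather than characters. A small sketch (the helper `codeUnitCounts` is a made-up name, and `std.conv.to` is assumed for the conversions); strictly speaking, one dstring element is one code point.

```d
import std.conv : to;

// Hypothetical helper: code-unit counts of the same text as UTF-8, UTF-16, UTF-32.
size_t[3] codeUnitCounts(string s)
{
    return [s.length, to!wstring(s).length, to!dstring(s).length];
}

void main()
{
    // Pure ASCII: one code unit per character in every encoding.
    auto ascii = codeUnitCounts("hello");
    assert(ascii[0] == 5 && ascii[1] == 5 && ascii[2] == 5);

    // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1D11E (4 bytes).
    auto mixed = codeUnitCounts("a\u00E9\u20AC\U0001D11E");
    assert(mixed[0] == 10); // UTF-8 code units
    assert(mixed[1] == 5);  // UTF-16: the last code point needs two units
    assert(mixed[2] == 4);  // UTF-32: one unit per code point
}
```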
Bill Baxter wrote:
> They are all just different representations of Unicode.
> [snip]
Thank you, that cleared things up for me :)
Oct 27 2009
Bill Baxter, on October 27 at 13:12, you wrote to me:
> They are all just different representations of Unicode.
> [snip]
> Here's Daniel Keep's nice explanation: http://docs.google.com/View?docid=dtqh79k_1rbxfmb
And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html
--
Leandro Lucarella (AKA luca) http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
I have committed sins, I have done evil, I have been a victim of envy, selfishness, ambition, lies and frivolity, but I have always been an Argentine father who wants his son to succeed in life. -- Ricardo Vaporeso
Oct 27 2009
Leandro Lucarella wrote:
> And here is a nice article about Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html
Damn guys, with these good explanations, nobody's going to use the one in TDPL!
Andrei
Oct 27 2009
Andrei Alexandrescu, on October 27 at 19:32, you wrote to me:
> Damn guys, with these good explanations, nobody's going to use the one in TDPL!
:)
--
Leandro Lucarella (AKA luca) http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
We live in very contemporary times, Don Inodoro... -- Mendieta
Oct 27 2009
Andrei Alexandrescu, on October 27 at 19:32, you wrote to me:
> Damn guys, with these good explanations, nobody's going to use the one in TDPL!
BTW, seeing the explanation about Unicode in your book, one wonders why UTF-8, UTF-16 and UTF-32 character types are not simply called utf8, utf16 and utf32...
--
Leandro Lucarella (AKA luca) http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
Not even heaven loves me anymore, not even death visits me
Not even the sun warms me anymore, not even the wind caresses me
Oct 29 2009
Chris Nicholson-Sauls Wrote:
> My intuition would be to expect the same as adding an int to a byte: you get an int. Concatenating a string and a wstring should yield a wstring; ie, encode to the wider of the two types.
Though I'm sure Shannon would say that the number of bits of intrinsic information contained in the same sequence of Unicode codepoints is exactly the same whether it be encoded as a string or a wstring.
Accordingly my intuition is that some rule based upon left-to-right associativity would be more apt. You could then concatenate a wstring (on the rhs) to an empty string (on the lhs) to convert the wstring to a string or vice versa.
Cheers
Justin Johansson
Oct 27 2009
Justin Johansson wrote:
> Accordingly my intuition is that some rule based upon left-to-right associativity would be more apt. You could then concatenate a wstring (on the rhs) to an empty string (on the lhs) to convert the wstring to a string or vica versa.
Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
I would argue that string ~ wstring returning string is fine, but would suggest it be a warning for those like myself who might have first guessed it would "upscale to fit". Just so long as the foreach(dchar;string) trick is still around, char/string can cover an awful lot of ground.
All that said, though, I don't think I would ever use ""~wstring as a means of conversion. It just feels like "there wasn't any other way to do this, so here's a cheap hack" -- which just isn't the case.
-- Chris Nicholson-Sauls
Oct 29 2009
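The "foreach(dchar;string) trick" mentioned above refers to D's built-in decoding in foreach: asking for a dchar loop variable over a UTF-8 string yields whole code points instead of bytes. A minimal sketch (the helper name `countCodePoints` is made up for illustration):

```d
// Hypothetical helper: count code points by letting foreach decode the UTF-8.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s) // each iteration consumes one full UTF-8 sequence
        ++n;
    return n;
}

void main()
{
    string s = "a\u00E9\u20AC"; // 1 + 2 + 3 bytes in UTF-8
    assert(s.length == 6);              // .length counts code units
    assert(countCodePoints(s) == 3);    // the loop sees three code points
}
```

This is why plain string can stand in for dstring in many loops: the storage stays compact while iteration still works per code point.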
Chris Nicholson-Sauls Wrote:
> [snip]
> All that said, though, I don't think I would ever use ""~wstring as a means of conversion. It just feels like "there wasn't any other way to do this, so here's a cheap hack" -- which just isn't the case.
Your overall reply is well put. On the last point: agreed; cheap hacks should be avoided.
cheers, Justin
Oct 29 2009
"Chris Nicholson-Sauls" <ibisbasenji gmail.com> wrote in message news:hcctuf$140a$1 digitalmars.com...
> Granted LTR is common enough to be expectable and acceptable. To be perfectly honest, I don't believe I have *ever* even used wchar/wstring. Char/string gosh yes; dchar/dstring quite a bit as well, where I need the simplicity; but I've yet to feel much need for the "weirdo" middle child of UTF.
Given that just about anything outside of D (at least as far as I've seen) that attempts to use Unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that. For instance, my Goldie engine for GOLD currently uses wchar in a number of places because GOLD's .cfg format stores text in...well, presumably UTF-16 (I haven't tested to see if it's really UCS-2).
But yea, as long as you're not dealing with anything that's already in UTF-16 or that expects it, then it does seem to be somewhat questionable.
Oct 29 2009
Nick Sabalausky wrote:
> Given that just about anything outside of D (at least as far as I've seen) that attempts to use unicode does so with UTF-16 (or just uses UCS-2 and pretends that's UTF-16...), wchar and wstring are great for dealing with that.
> [snip]
I think this says it all:
http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments
-Lars :)
Oct 30 2009
Lars T. Kyllingstad wrote:
> I think this says it all: http://en.wikipedia.org/wiki/Utf-16#Use_in_major_operating_systems_and_environments

Yep, there was a frenzy when UCS-2 came about: everybody thought two bytes would be enough for everyone. So UCS-2 was widely adopted - who wouldn't love to have constant character width? Then the UTF-16 surrogate business came about, and the only logical step they could take was to migrate to UTF-16, which was upward compatible with UCS-2. I personally think UTF-8 is a better overall design though.

Andrei
Oct 30 2009
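The "surrogate business" mentioned above can be observed directly with D's three string types. The following is only an illustrative sketch: the sample character U+1D11E is a choice made for this example and does not come from the thread. It shows that a code point outside the 16-bit range occupies two wchar code units (a surrogate pair), which is exactly what UCS-2 could not represent.

```d
import std.conv : to;

void main()
{
    // A character inside the 16-bit range: one wchar code unit is enough.
    assert(to!wstring("A"d).length == 1);

    // U+1D11E lies outside the 16-bit range (sample value chosen for illustration).
    dstring clef = "\U0001D11E"d;
    wstring w = to!wstring(clef);   // transcode UTF-32 -> UTF-16
    string  s = to!string(clef);    // transcode UTF-32 -> UTF-8

    assert(clef.length == 1);       // one UTF-32 code unit
    assert(w.length == 2);          // a surrogate pair in UTF-16
    assert(w[0] == 0xD834);         // high surrogate
    assert(w[1] == 0xDD1E);         // low surrogate
    assert(s.length == 4);          // four bytes in UTF-8
}
```

Code that indexes a wstring as if every element were a whole character is therefore effectively treating it as UCS-2.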
Andrei Alexandrescu Wrote:
"I personally think UTF-8 is a better overall design though."

[...] recommending UTF-16 for Processing. http://unicode.org/notes/tn12/

The major claim in the TN is that Unicode is optimized for UTF-16. The rest of the argument looks like a VHS (everyone is using it, i.e. UTF-16) versus Beta argument. So who's right? My personal view is that whilst they are the *Unicode Consortium*, I have great difficulty in accepting UTF-16 as the one-and-holy encoding.

FWIW, there was a subthread during a discussion about the ordained features of programming languages on LtU a while back.

http://lambda-the-ultimate.org/node/3166#comment-46233
What Are The Resolved Debates in General Purpose Language Design?

It's a long discussion, so it's easier to search for UTF or Unicode on the page if you're interested.

cheers
Justin Johansson
Oct 30 2009
Justin Johansson wrote:
> [...] recommending UTF-16 for Processing. http://unicode.org/notes/tn12/
>
> The major claim in the TN is that Unicode is optimized for UTF-16. The rest of the argument looks like a VHS (everyone is using it, i.e. UTF-16) versus Beta argument. So who's right? My personal view is that whilst they are the *Unicode Consortium*, I have great difficulty in accepting UTF-16 as the one-and-holy encoding.
>
> FWIW, there was a subthread during a discussion about the ordained features of programming languages on LtU a while back.
>
> http://lambda-the-ultimate.org/node/3166#comment-46233
> What Are The Resolved Debates in General Purpose Language Design?

Thanks for the pointers. One of the reasons for which I like the design of UTF-8 is its generality: it's a variable-length code for any number of up to 31 bits. In contrast, UTF-16 relies on specific dead zones inside the assigned space. But the authors of the unicode.org article do make a few good points, such as there not being any invalid UTF-16 symbol. But then that can actually be seen as a strength of UTF-8 - binary files that happen to be valid UTF-8 are statistically so scarce that UTF-8 has a very solid method of checking whether a file is UTF-8 or something else.

Andrei
Oct 30 2009
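The point above about checking whether a file is UTF-8 can be sketched with Phobos' `std.utf.validate`, which throws a `UTFException` on malformed input. The helper name `looksLikeUtf8` and the sample byte sequences are assumptions made for this example; a real detector would probably also look at byte-order marks and other heuristics.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if `bytes` form a well-formed UTF-8 sequence.
bool looksLikeUtf8(const(ubyte)[] bytes)
{
    try
    {
        // Reinterpret the raw bytes as UTF-8 code units and check them.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    immutable ubyte[] utf8      = [0x68, 0xC3, 0xA9];                   // "h" followed by U+00E9
    immutable ubyte[] utf16le   = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00]; // BOM + "hi" in UTF-16LE
    immutable ubyte[] truncated = [0x68, 0xC3];                         // lead byte with no continuation

    assert(looksLikeUtf8(utf8));
    assert(!looksLikeUtf8(utf16le));
    assert(!looksLikeUtf8(truncated));
}
```

Pure ASCII input also passes this check, so a positive answer means "consistent with UTF-8", not proof of the author's intent.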
Andrei Alexandrescu wrote:
> Coming along. I'm writing about strings and Unicode right now. I was wondering what people think about allowing concatenation (with ~ and ~=) of strings of different character widths. The support library could do all of the transcoding. (I understand that concatenating an array of wchar or char with a dchar is already in bugzilla.)
>
> Andrei

I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are. I.e., if I know some routine will have to convert a UTF-16 parameter to UTF-8 to append it to a string, then I'll try to either make it output UTF-16 or input UTF-8. If it's implicit, it's much harder to find and optimize these cases. to!string() is easy enough to use anyway. But it could be good to add a range type that does this with multiple opAppend/opAppendAssign overloads.
Oct 26 2009
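The closing suggestion above, a dedicated type whose append operator accepts every string width, might look roughly like the sketch below. `Utf8Builder` is a hypothetical name, and the sketch uses `opOpAssign!"~"`, the operator-overloading form in current D, as a stand-in for the opAppendAssign overloads mentioned in the post. It keeps the transcoding explicit in one visible place instead of in the built-in `~=`.

```d
// Hypothetical wrapper (not part of Phobos): accumulates UTF-8 text and
// accepts char, wchar and dchar strings on the right-hand side of ~=.
struct Utf8Builder
{
    string data;

    ref Utf8Builder opOpAssign(string op, S)(S rhs)
        if (op == "~" &&
            (is(S : const(char)[]) || is(S : const(wchar)[]) || is(S : const(dchar)[])))
    {
        // foreach with a dchar loop variable decodes rhs one code point at a
        // time; appending a dchar to a string re-encodes it as UTF-8.
        foreach (dchar c; rhs)
            data ~= c;
        return this;
    }
}

void main()
{
    Utf8Builder b;
    b ~= "caf";            // UTF-8 input
    b ~= "\u00E9"w;        // UTF-16 input
    b ~= " \U0001D11E"d;   // UTF-32 input

    assert(b.data == "caf\u00E9 \U0001D11E");
    assert(b.data.length == 3 + 2 + 1 + 4);   // bytes, not characters
}
```

With this design the built-in arrays keep their strict typing, and only code that opts into the wrapper pays for (and can be searched for) implicit transcoding.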
Jeremie Pelletier wrote:
> I don't know if that's a good idea; it's better when string encoding is explicit so you know where your reallocations are.

The beauty of it is that reallocation with ~ occurs anyway, and with ~= it is imminent anyway, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.

> I.e., if I know some routine will have to convert a UTF-16 parameter to UTF-8 to append it to a string, then I'll try to either make it output UTF-16 or input UTF-8. If it's implicit, it's much harder to find and optimize these cases. to!string() is easy enough to use anyway. But it could be good to add a range type that does this with multiple opAppend/opAppendAssign overloads.

One problem with s ~= to!string(someDstring); is that it does two allocations instead of one.

Andrei
Oct 26 2009
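The two-allocation concern above can be made concrete by contrasting the explicit conversion with a loop that transcodes straight into the destination. This is a sketch only: the function names `appendViaTo` and `appendDirect` are made up for illustration, and how many allocations the direct version actually performs depends on the runtime's array capacity. The point is merely that it never builds an intermediate UTF-8 temporary.

```d
import std.conv : to;

// Explicit conversion: to!string builds a temporary UTF-8 string,
// then ~ allocates again to hold the concatenated result.
string appendViaTo(string s, dstring d)
{
    return s ~ to!string(d);
}

// Direct transcoding: each code point is encoded into the destination
// as it is appended, with no intermediate string.
string appendDirect(string s, dstring d)
{
    foreach (dchar c; d)
        s ~= c;   // appending a dchar to a string encodes it as UTF-8
    return s;
}

void main()
{
    dstring someDstring = "na\u00EFve \u20AC"d;
    string s = "text: ";

    assert(appendViaTo(s, someDstring) == appendDirect(s, someDstring));
    assert(appendDirect(s, someDstring) == "text: na\u00EFve \u20AC");
}
```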
Andrei Alexandrescu wrote:
> The beauty of it is that reallocation with ~ occurs anyway, and with ~= it is imminent anyway, regardless of the character width you're reallocating. Allowing concatenation of strings of different widths is a nice way of acknowledging at the language level that all character widths are encodings of abstract characters.
>
> One problem with s ~= to!string(someDstring); is that it does two allocations instead of one.

Good points, I didn't think of the separation between characters and encodings or the extra allocation from to. You have my vote for this feature then!

Jeremie
Oct 26 2009
On Mon, Oct 26, 2009 at 4:05 PM, Jeremie Pelletier <jeremiep gmail.com> wrote:
> Good points, I didn't think of the separation between characters and encodings or the extra allocation from to. You have my vote for this feature then!

Yeah, me too. Saving an allocation is good. And I agree that having ~= do a conversion is much more useful than just getting an error. It's one of those things you might try just hoping it will work, and it's always nice when something like that does just what you hope it will.

I guess the only other thing I could worry about is that in generic array code it might cause someone headaches that for some T[], T[] ~= S[] is legal and the length of the result is not the sum of the lengths of the inputs. But I can't think of any real situation where that would cause trouble.

--bb
Oct 26 2009
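The length concern raised above is easy to demonstrate with today's explicit conversion. The sample strings below are assumptions chosen for illustration. After transcoding, the result's length is measured in UTF-8 code units, so it is generally not the sum of the two input lengths.

```d
import std.conv : to;

void main()
{
    string  t = "abc";               // 3 UTF-8 code units
    dstring s = "\u00E9\u20AC"d;     // 2 UTF-32 code units (U+00E9, U+20AC)

    string r = t ~ to!string(s);     // explicit transcoding, then concatenation

    assert(t.length == 3);
    assert(s.length == 2);
    // U+00E9 takes 2 bytes and U+20AC takes 3 bytes in UTF-8.
    assert(r.length == 3 + 2 + 3);
    assert(r.length != t.length + s.length);
}
```

Generic code that preallocates `a.length + b.length` elements before concatenating would therefore need a special case if mixed-width `~=` were allowed.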