digitalmars.D - Ascii matters
- bearophile (26/26) Aug 22 2012 I need to manage Unicode text, but in many cases I have lot of
- Jonathan M Davis (6/8) Aug 22 2012 It could certainly be done. In fact, doing so would be incredibly trivia...
- bearophile (13/15) Aug 22 2012 The data I am processing is not generic octets, like 8 bits
- Jonathan M Davis (14/30) Aug 22 2012 Then just use ubyte[], and if you need char[] for printing out, then cas...
- Sean Kelly (7/13) Aug 22 2012 in ASCII, and for both practical and performance reasons in D I want to ...
- bearophile (6/9) Aug 22 2012 std.algorithm is not closed
- Don Clugston (2/9) Aug 23 2012 Which operations in std.algorithm over map 0-0x7F into higher characters...
- bearophile (6/8) Aug 23 2012 The first example I've shown:
- Jonathan M Davis (16/18) Aug 22 2012 Range-based functions will treat arrays of char or wchar as forward rang...
- bearophile (6/7) Aug 22 2012 I am just asking if there is interest in it, if people see
- Sean Kelly (23/41) Aug 22 2012 strings?
- bearophile (9/14) Aug 23 2012 What's unsafe in what I have presented? The constructor verifies
- Sean Kelly (11/19) Aug 23 2012 want this in Phobos because it seems like it could cause maintenance =
- bearophile (15/18) Aug 23 2012 The cast to ubyte[] doesn't perform a run-time test of the
I need to manage Unicode text, but in many cases I have a lot of 7-bit or 8-bit ASCII text to process, and this has led to this discussion. For some time now, thanks to Jonathan Davis, we have had an efficient translate() again:

http://d.puremagic.com/issues/show_bug.cgi?id=7515

The s2 array generated by this code is a dchar[] (if array() becomes pure you will probably be able to type s2 as dstring):

    string s = "test string"; // UTF-8, but also 7-bit ASCII
    dchar[] s2 = map!(x => x)(s).array(); // Uses the Id function

To produce a char[] (or a string, using assumeUnique), you are free to use a cast:

    auto s3 = map!(x => cast(char)x)(s).array();

But D casts are unsafe, and one thing I'm learning from Haskell is how important it is to give types to your code to prevent bugs. So maybe an AsciiString wrapper range (a subtype of string) can be invented for Phobos. Its constructor verifies that the input is 7-bit ASCII and its "front" method yields chars, so map.array() gives a char[]:

    astring a1 = "test string"; // enforced 7-bit ASCII
    char[] s4 = map!(x => x)(a1).array();

This makes some algorithms working on ASCII text cleaner and safer, avoiding the need for casts. Is creating something like this possible and appreciated for Phobos?

Bye,
bearophile
Aug 22 2012
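The wrapper proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the name `AsciiString` and its members are hypothetical and not part of Phobos, and the module `std.range.primitives` assumes a Phobos release newer than the one current at the time of this thread.

```d
import std.algorithm : all, map;
import std.array : array;
import std.exception : enforce;
import std.range.primitives : ElementType;
import std.string : representation;

/// Hypothetical wrapper (not in Phobos): a string verified to be 7-bit ASCII.
struct AsciiString
{
    private string data;

    /// Validates the input once; throws if any code unit is >= 0x80.
    this(string s)
    {
        enforce(s.representation.all!(c => c < 0x80),
                "input is not 7-bit ASCII");
        data = s;
    }

    // Range primitives that yield code units (char) instead of decoded dchar.
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property AsciiString save() { return this; }
    @property size_t length() const { return data.length; }

    /// Printing needs no cast back to a string type.
    string toString() const { return data; }
}

void main()
{
    auto a1 = AsciiString("test string");

    // The element type is char, so map + array yields char[] without a cast.
    static assert(is(ElementType!AsciiString == char));
    char[] s4 = a1.map!(x => x).array();
    assert(s4 == "test string");
}
```

The check happens once, in the constructor, so code that receives an `AsciiString` can rely on the invariant without re-validating.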
On Thursday, August 23, 2012 00:11:18 bearophile wrote:
> Is creating something like this possible and appreciated for Phobos?

It could certainly be done. In fact, doing so would be incredibly trivial. But given that you can use ubyte[] just fine and the fact that using ASCII really shouldn't be encouraged, I don't like the idea of adding such a range to Phobos. I don't know what the general consensus on that would be though.

- Jonathan M Davis
Aug 22 2012
Jonathan M Davis:

> But given that you can use ubyte[] just fine

The data I am processing is not generic octets, like 8 bits digitized by some old A/D converter; they are chars, and I expect to see strings when I print them :-)

> and the fact that using ASCII really shouldn't be encouraged,

For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

Bye,
bearophile
Aug 22 2012
On Thursday, August 23, 2012 02:07:52 bearophile wrote:
> Jonathan M Davis:
>> But given that you can use ubyte[] just fine
> The data I am processing is not generic octets, like 8 bits digitized by some old A/D converter, they are chars, and I expect to see strings when I print them :-)
>> and the fact that using ASCII really shouldn't be encouraged,
> For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

Then just use ubyte[], and if you need char[] for printing out, then cast it. And if you don't like the casting, you can even wrap it in a function.

    char[] fromASCII(ubyte[] str)
    {
        return cast(char[])str;
    }

Creating an ASCII range type will just encourage its use, when you should only be operating on ASCII when you really need it. Operating on ASCII is quite possible as it is and isn't even very hard. So, I really don't see much benefit in adding such a range, and the fact that it arguably would encourage bad behavior then makes it _undesirable_ rather than just not particularly beneficial.

- Jonathan M Davis
Aug 22 2012
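The ubyte[] route suggested above might look like the sketch below. It assumes `std.string.representation` is available to view a string as bytes without copying; the uppercase transformation is only a stand-in for whatever processing the data needs.

```d
import std.algorithm : map;
import std.array : array;
import std.ascii : toUpper;
import std.string : representation;

/// The helper from the post above: an unchecked, reinterpreting cast.
char[] fromASCII(ubyte[] str)
{
    return cast(char[]) str;
}

void main()
{
    string s = "test string";

    // representation() exposes the code units as immutable(ubyte)[],
    // so range algorithms see bytes instead of decoded dchars.
    ubyte[] upper = s.representation
                     .map!(b => cast(ubyte) toUpper(b))
                     .array();

    // Cast back only at the boundary where text is needed.
    char[] text = fromASCII(upper);
    assert(text == "TEST STRING");
}
```

Nothing here verifies that the bytes are actually ASCII; that responsibility stays with the caller, which is the trade-off debated in the rest of the thread.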
On Aug 22, 2012, at 5:07 PM, bearophile <bearophileHUGS lycos.com> wrote:
> Jonathan M Davis:
>> and the fact that using ASCII really shouldn't be encouraged,
> For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?
Aug 22 2012
Sean Kelly:

> I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?

std.algorithm is not closed (http://en.wikipedia.org/wiki/Closure_%28mathematics%29 ) on UTF-8; its operations lead to UTF-32.

Bye,
bearophile
Aug 22 2012
On 23/08/12 05:05, bearophile wrote:
> Sean Kelly:
>> I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?
> std.algorithm is not closed (http://en.wikipedia.org/wiki/Closure_%28mathematics%29 ) on UTF-8, its operations lead to UTF-32.

Which operations in std.algorithm over map 0-0x7F into higher characters?
Aug 23 2012
Don Clugston:

> Which operations in std.algorithm over map 0-0x7F into higher characters?

The first example I've shown:

    string s = "test string";
    dchar[] s2 = map!(x => x)(s).array(); // Uses the Id function

Bye,
bearophile
Aug 23 2012
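The distinction in this exchange is that the character values stay in the 0-0x7F range while the element type widens. A small check, assuming a Phobos version that still auto-decodes narrow strings and provides `std.range.primitives`:

```d
import std.algorithm : map;
import std.array : array;
import std.range.primitives : ElementType;

void main()
{
    string s = "test string";

    // Phobos ranges decode narrow strings, so the element type is dchar
    // even when every code unit is below 0x80.
    static assert(is(ElementType!string == dchar));

    auto s2 = s.map!(x => x).array();
    static assert(is(typeof(s2) == dchar[]));

    // The contents are unchanged; only the element width grew to 32 bits.
    assert(s2 == "test string"d);
}
```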
On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
> I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?

Range-based functions will treat arrays of char or wchar as forward ranges of dchar. Because of the variable length of their code points, they aren't considered to have length, be random access, or have slicing and will not generally work with range-based functions which require any of those operations (though some range-based functions do specialize on strings and use those operations where they can based on proper understanding of Unicode).

On the other hand, if you have a string that specifically holds ASCII and you know that it only holds ASCII, you know that you can safely use length, random access, and slicing as if each code unit were a full code point. But the range-based functions don't know that your string is guaranteed to be ASCII-only, so they continue to treat it as a range of dchar rather than char. The solution is to either create a wrapper range whose element type is char or to cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be added to Phobos.

- Jonathan M Davis
Aug 22 2012
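The range traits described above can be checked at compile time. The sketch below assumes a Phobos release with `std.range.primitives` and `std.utf.byCodeUnit`; the latter was added well after this thread and provides a char-element view without any ASCII validation, so it is shown as one possible wrapper rather than the one proposed here.

```d
import std.range.primitives : ElementType, hasLength, hasSlicing,
                              isRandomAccessRange;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    // As ranges, narrow strings are forward ranges of dchar without
    // length, random access, or slicing.
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    static assert(!isRandomAccessRange!string);
    static assert(!hasSlicing!string);

    // Viewing the same data as bytes restores all three capabilities.
    alias Bytes = typeof("abc".representation);
    static assert(is(Bytes == immutable(ubyte)[]));
    static assert(hasLength!Bytes && isRandomAccessRange!Bytes
                  && hasSlicing!Bytes);

    // byCodeUnit is a wrapper range whose element type is char.
    auto units = "abc".byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == immutable(char)));
    static assert(hasLength!(typeof(units))
                  && isRandomAccessRange!(typeof(units)));
    assert(units.length == 3 && units[1] == 'b');
}
```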
Jonathan M Davis:

> And Bearophile wants such a wrapper range to be added to Phobos.

I am just asking if there is interest in it, if people see something wrong in having it in Phobos. Surely I am not demanding it :-)

Bye,
bearophile
Aug 22 2012
On Aug 22, 2012, at 8:03 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
> On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
>> I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?
>
> Range-based functions will treat arrays of char or wchar as forward ranges of dchar. Because of the variable length of their code points, they aren't considered to have length, be random access, or have slicing and will not generally work with range-based functions which require any of those operations (though some range-based functions do specialize on strings and use those operations where they can based on proper understanding of Unicode).

Yeah. I understand why the range-based functions use dchar, but for my own use I generally want to work directly with a char string of UTF-8 so I can slice buffers. Typing these as uchar buffers isn't ideal, but it does work.

> On the other hand, if you have a string that specifically holds ASCII and you know that it only holds ASCII, you know that you can safely use length, random access, and slicing as if each code unit were a full code point. But the range-based functions don't know that your string is guaranteed to be ASCII-only, so they continue to treat it as a range of dchar rather than char. The solution is to either create a wrapper range whose element type is char or to cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be added to Phobos.

Gotcha. Despite it being something I'd use regularly, I wouldn't want this in Phobos because it seems like it could cause maintenance problems. I'd rather explicitly cast to ubyte as a way to flag that I was doing something potentially unsafe.
Aug 22 2012
Sean Kelly:

> Gotcha. Despite it being something I'd use regularly, I wouldn't want this in Phobos because it seems like it could cause maintenance problems. I'd rather explicitly cast to ubyte as a way to flag that I was doing something potentially unsafe.

What's unsafe in what I have presented? The constructor verifies every char to be in 7 bits, and then you use the new type safely. No casts, and no need to flag something as unsafe.

This usage of types to denote capabilities is quite common in functional languages, see articles I've recently linked here as:
http://tomasp.net/blog/type-first-development.aspx

Bye,
bearophile
Aug 23 2012
On Aug 23, 2012, at 4:25 AM, bearophile <bearophileHUGS lycos.com> wrote:
> Sean Kelly:
>> Gotcha. Despite it being something I'd use regularly, I wouldn't want this in Phobos because it seems like it could cause maintenance problems. I'd rather explicitly cast to ubyte as a way to flag that I was doing something potentially unsafe.
>
> What's unsafe in what I have presented? The constructor verifies every char to be in 7 bits, and then you use the new type safely. No casts, and no need to flag something as unsafe.
>
> This usage of types to denote capabilities is quite common in functional languages, see articles I've recently linked here as:
> http://tomasp.net/blog/type-first-development.aspx

So it throws an exception if there are non-ASCII characters in the range? Is this really better than just casting the input array to ubyte?
Aug 23 2012
Sean Kelly:

> So it throws an exception if there are non-ASCII characters in the range? Is this really better than just casting the input array to ubyte?

The cast to ubyte[] doesn't perform a run-time test of the validity of the input, so yeah, the exception is better. Your code is also able to catch and manage the exception (like asking the user for another valid input file).

If you carry around some type as "Astring", later you don't have to cast it back to char[] to print the data as a string (this discussion is about data that is naturally text, not about generic numerical octets).

An appropriate type statically encodes in your program that you are using an ASCII string. This makes your code more readable. A variable of the generic type ubyte[], by contrast, doesn't tell you a lot about its contents.

Bye,
bearophile
Aug 23 2012
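The difference between a bare cast and a validating conversion can be illustrated as below. The function name `toAsciiText` is hypothetical and used only for this sketch; it is not a Phobos function.

```d
import std.algorithm : all;
import std.exception : enforce;

/// Hypothetical checked counterpart of a bare cast: validate, then reinterpret.
string toAsciiText(immutable(ubyte)[] bytes)
{
    enforce(bytes.all!(b => b < 0x80), "input is not 7-bit ASCII");
    return cast(string) bytes;
}

void main()
{
    immutable(ubyte)[] good = [0x41, 0x43, 0x47, 0x54]; // "ACGT"
    immutable(ubyte)[] bad  = [0x41, 0xFF];             // 0xFF is not ASCII

    assert(toAsciiText(good) == "ACGT");

    // A bare cast compiles and runs without any check on the contents.
    string unchecked = cast(string) bad;
    assert(unchecked.length == 2);

    // The checked version reports the problem, and the caller can recover,
    // for example by asking for another input file.
    bool rejected = false;
    try
    {
        toAsciiText(bad);
    }
    catch (Exception e)
    {
        rejected = true;
        assert(e.msg == "input is not 7-bit ASCII");
    }
    assert(rejected);
}
```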