digitalmars.D.learn - isAsciiString in Phobos?
- Andrej Mitrovic (9/9) Oct 07 2013 If I want to transfer some string to a C function that expects
- Adam D. Ruppe (9/10) Oct 07 2013 If you want strict ASCII, it should be <= 127 rather than 255
- Andrej Mitrovic (12/20) Oct 07 2013 Thanks. I got some useful info from Jakob from IRC, and ended up with th...
- monarch_dodra (8/31) Oct 07 2013 You can use std.string.representation to do the cast for you, and
- Andrej Mitrovic (2/6) Oct 07 2013 Clever! So I think we should definitely try and push it to the library.
- monarch_dodra (63/72) Oct 07 2013 I wrote this:
I want to pass a string to a C function that expects an ASCII-only string. What can I use to verify that a D string contains no non-ASCII characters? I haven't seen anything in Phobos. I was thinking of using:

bool isAscii = mystring.all!(a => a <= 0xFF);

Is this safe? I'm wondering whether a code point can consist of two code units such as [C1][C2], where C2 may be in the range 0 - 0xFF. I don't know if that's possible (not a Unicode pro here).
Oct 07 2013
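A small sketch of what the question is probing, assuming a current Phobos where range algorithms such as all iterate a string by decoded code point. The helper names are hypothetical and exist only for illustration. In UTF-8, every code unit of a multi-byte sequence has its high bit set, so a byte-wise test against 0x7F is reliable, while a code-point test against 0xFF also accepts non-ASCII characters such as U+00E9.

```d
import std.algorithm : all;
import std.string : representation;

// Hypothetical helper: the check from the question. `all` walks the
// string by decoded code point (dchar), not by byte.
bool allCodePointsUpToFF(string s)
{
    return s.all!(c => c <= 0xFF);
}

// Hypothetical helper: the same idea applied to the raw UTF-8 code
// units, with the limit lowered to 0x7F.
bool allBytesUpTo7F(string s)
{
    return s.representation.all!(b => b <= 0x7F);
}

unittest
{
    string s = "\u00E9"; // U+00E9, stored as the two code units 0xC3 0xA9

    assert(s.length == 2);
    assert(s.representation[0] == 0xC3);
    assert(s.representation[1] == 0xA9);

    // Both code units of the sequence have the high bit set.
    assert(s.representation.all!(b => b >= 0x80));

    // The code-point test accepts the string although it is not ASCII.
    assert(allCodePointsUpToFF(s));

    // The byte-wise test rejects it.
    assert(!allBytesUpTo7F(s));
}
```

The unittest block can be run with dmd's -unittest and -main switches.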
On Monday, 7 October 2013 at 15:18:06 UTC, Andrej Mitrovic wrote:
> bool isAscii = mystring.all!(a => a <= 0xFF);

If you want strict ASCII, it should be <= 127 rather than 255, because values with the high bit set mean different things in different encodings (the first 256 Unicode code points, I think, match Latin-1 numerically, but that differs from Windows-1252 and the various non-English extended ASCIIs).

You could also convert UTF-8 to ASCII, sort of, by just stripping out any byte > 127, since in UTF-8 such bytes only occur as part of multibyte sequences.
Oct 07 2013
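The "strip every byte > 127" idea could look roughly like the sketch below. The function name stripNonAscii is hypothetical. Note that this drops non-ASCII characters entirely rather than transliterating them, which is why it only "sort of" converts to ASCII.

```d
import std.algorithm : filter;
import std.array : array;
import std.string : representation;

// Hypothetical helper: keep only the bytes below 0x80. Lead and
// continuation bytes of multi-byte UTF-8 sequences are all >= 0x80,
// so each non-ASCII character disappears as a whole.
string stripNonAscii(string s)
{
    return cast(string) s.representation.filter!(b => b < 0x80).array;
}

unittest
{
    assert(stripNonAscii("caf\u00E9!") == "caf!");
    assert(stripNonAscii("plain") == "plain");
    assert(stripNonAscii("") == "");
}
```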
On 10/7/13, Adam D. Ruppe <destructionator gmail.com> wrote:
> If you want strict ASCII, it should be <= 127 rather than 255 [...]

Thanks. I got some useful info from Jakob on IRC, and ended up with this:

bool isAsciiString(string input)
{
    auto data = cast(const(ubyte)[])input;
    return data.all!(a => a <= 0x7F);
}

The cast is needed to keep "all" from decoding the string. There's also isASCII in std.ascii, but it works on a single dchar, and I was looking for something that works on an entire string at once. The function above does the job for me. Should we put something like this in Phobos?
Oct 07 2013
On Monday, 7 October 2013 at 15:57:15 UTC, Andrej Mitrovic wrote:
> bool isAsciiString(string input)
> {
>     auto data = cast(const(ubyte)[])input;
>     return data.all!(a => a <= 0x7F);
> }
>
> The cast is needed to avoid decoding by the "all" function. Also there's isASCII that works on a dchar in std.ascii, but I was looking for something that works on entire strings at once.

You can use std.string.representation to do the cast for you, and you might as well just use isASCII anyway:

return input.representation().all!isASCII();

If we want even more efficiency, we could iterate over the string, interpreting it as a size_t[]. We mask each of its elements with 0x80808080 (32-bit) or 0x80808080_80808080 (64-bit), and if any of the masked elements is nonzero, then the string isn't ASCII.
Oct 07 2013
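Put together as a complete function, the representation-based suggestion might look like the sketch below. The name isAsciiString comes from the thread. The use of strlen is only a stand-in for "a C function that expects an ASCII-only string" and is an assumption made for this example.

```d
import core.stdc.string : strlen;
import std.algorithm : all;
import std.ascii : isASCII;
import std.string : representation, toStringz;

// representation() reinterprets the characters as ubytes, so `all`
// sees raw code units and performs no decoding.
bool isAsciiString(const(char)[] input)
{
    return input.representation.all!isASCII;
}

unittest
{
    assert(isAsciiString("hello"));
    assert(isAsciiString(""));
    assert(!isAsciiString("h\u00E9llo"));

    // Validate first, then hand a zero-terminated copy to C.
    string s = "hello";
    if (isAsciiString(s))
        assert(strlen(s.toStringz) == 5); // strlen stands in for the C API
}
```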
On 10/7/13, monarch_dodra <monarchdodra gmail.com> wrote:
> If we want even more efficiency, we could iterate on the string, interpreting it as a size_t[]. We mask each of its elements with 0x80808080/0x80808080_80808080, and if one of the resulting masked elements is not null, then the string isn't ASCII.

Clever! So I think we should definitely try and push it to the library.
Oct 07 2013
On Monday, 7 October 2013 at 16:23:12 UTC, Andrej Mitrovic wrote:
> Clever! So I think we should definitely try and push it to the library.

I wrote this. Only lightly tested.

//--------
bool isASCII(const(char[]) str)
{
    static if (size_t.sizeof == 8)
    {
        enum size = 8;
        enum size_t mask = 0x80808080_80808080;
        enum size_t alignMask = ~cast(size_t)0b111;
    }
    else
    {
        enum size = 4;
        enum size_t mask = 0x80808080;
        enum size_t alignMask = ~cast(size_t)0b11;
    }

    if (str.length < size)
    {
        foreach (c; str)
            if (c & 0x80)
                return false;
        return true;
    }

    immutable start = (cast(size_t)str.ptr & alignMask) + size;
    immutable end = cast(size_t)(str.ptr + str.length) & alignMask;

    //We start with the blocks, because that is faster,
    //and chances are the start is aligned anyways (so we check it later).
    for ( auto p = cast(size_t*)start ; p != cast(size_t*)end ; ++p )
        if (*p & mask)
            return false;

    //Then the trailing chars.
    for ( auto p = cast(char*)end ; p != str.ptr + str.length ; ++p )
        if (*p & 0x80)
            return false;

    //Finally, the first chars.
    for ( auto p = str.ptr ; p != cast(char*)start ; ++p )
        if (*p & 0x80)
            return false;

    return true;
}
//--------
assert( "hello".isASCII());
assert( "heellohelloellohelloellohelloellohellollohello".isASCII());
assert( "hellellohelloellohelloo"[3 .. $].isASCII());
assert(!"heéppellohelloellohelloellohelloellohelloellohellollo".isASCII());
assert(!"heppellohelloellohelloellohéelloellohelloellohellollo".isASCII());
assert(!"heppellohelloellohelloellohelloellohelloellohellolléo".isASCII());
//--------

What do you think? I have some doubts though:

1. Does x64 require qword alignment for size_t, or is dword enough?
2. Isn't there some built-in that'll give me the wanted alignment, instead of doing it by hand?
3. Are those casts 100% correct?
Oct 07 2013
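One way to sidestep the alignment doubts is to walk the string strictly front to back: single bytes until the pointer is aligned to size_t.sizeof, then whole words, then the leftover bytes. The sketch below is a variant of the posted idea, not the code from the post. The names isAsciiBlocks and isAsciiSimple are hypothetical, and the mask is computed from size_t.max so that one expression covers both 32-bit and 64-bit targets.

```d
import std.algorithm : all;
import std.string : representation;

// Hypothetical byte-wise reference used to cross-check the fast path.
bool isAsciiSimple(const(char)[] s)
{
    return s.representation.all!(b => b < 0x80);
}

// Hypothetical word-at-a-time variant: head bytes, aligned words, tail.
bool isAsciiBlocks(const(char)[] s)
{
    // 0x80 repeated in every byte of a size_t (0x0101...01 * 0x80).
    enum size_t mask = size_t.max / 0xFF * 0x80;

    auto p = s.ptr;
    const end = p + s.length;

    // Head: single bytes until p is aligned to size_t.sizeof.
    while (p != end && (cast(size_t) p & (size_t.sizeof - 1)) != 0)
    {
        if (*p & 0x80)
            return false;
        ++p;
    }

    // Body: aligned word loads; any high bit set means non-ASCII.
    while (cast(size_t)(end - p) >= size_t.sizeof)
    {
        if (*cast(const(size_t)*) p & mask)
            return false;
        p += size_t.sizeof;
    }

    // Tail: fewer than size_t.sizeof bytes remain.
    for (; p != end; ++p)
        if (*p & 0x80)
            return false;

    return true;
}

unittest
{
    // Try every start offset and length, planting one non-ASCII byte
    // at each position, and compare against the byte-wise reference.
    auto buf = new char[](64);
    foreach (start; 0 .. 9)
        foreach (len; 0 .. 40)
        {
            buf[] = 'a';
            auto s = buf[start .. start + len];
            assert(isAsciiBlocks(s));
            foreach (i; 0 .. len)
            {
                s[i] = cast(char) 0xC3;
                assert(!isAsciiBlocks(s));
                assert(isAsciiSimple(s) == isAsciiBlocks(s));
                s[i] = 'a';
            }
        }
}
```

Aligning to size_t.sizeof is at least as strict as the target's size_t alignment, so this variant does not depend on the answer to the first question. Whether it is actually faster than the byte-wise version would have to be measured.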