digitalmars.D - Regarding hex strings
- bearophile (20/20) Oct 17 2012 (Repost)
- H. S. Teoh (13/25) Oct 17 2012 [...]
- foobar (7/36) Oct 18 2012 IMO, this is a redundant feature that complicates the language
- monarch_dodra (9/15) Oct 18 2012 Have you actually ever written code that requires using code
- foobar (13/30) Oct 18 2012 I didn't try to compile it :) I just rewrote bearophile's example
- bearophile (17/24) Oct 18 2012 But this code:
- foobar (8/32) Oct 18 2012 How often large binary blobs are literally spelled in the source
- foobar (9/16) Oct 18 2012 This is especially a good reason to remove this feature as it
- monarch_dodra (3/20) Oct 18 2012 Yeah, that makes sense too. I'll try to toy around on my end and
- monarch_dodra (31/33) Oct 18 2012 That was actually relatively easy!
- monarch_dodra (62/64) Oct 18 2012 //----
- monarch_dodra (38/39) Oct 18 2012 With correct-er utf string support. In theory, non-ascii
- bearophile (7/8) Oct 18 2012 It must scale up to "real world" usages. Try it with a program
- monarch_dodra (15/23) Oct 18 2012 Hum... The compilation is pretty fast actually, about 1 second,
- Marco Leise (26/58) Oct 18 2012 Hehe, I assume most of the regulars know this: DMD used to
- Jonathan M Davis (12/15) Oct 18 2012 Yes, but it didn't use it for long, because it made performance worse, a...
- Marco Leise (13/24) Oct 18 2012 He called it a FUD? Without trying to sound too patronizing, most D
- Jonathan M Davis (9/29) Oct 18 2012 I don't think that he used quite that term, but his point was that I sho...
- monarch_dodra (8/30) Oct 20 2012 I should have read your post in more detail. I thought you were
- Nick Sabalausky (11/19) Oct 18 2012 Frequency isn't the issue. The issues are "*Is* it ever needed?" and
- foobar (11/34) Oct 19 2012 Any real-world use cases to support this claim? Does C++ have
- Nick Sabalausky (23/68) Oct 20 2012 I've used it. And Denis just posted an example of where it was used to
- Kagamin (4/11) Oct 18 2012 You should use unicode directly here, that's the whole point to
- Jonathan M Davis (6/19) Oct 18 2012 It's a nice feature, but there are plenty of cases where it makes more s...
- Kagamin (2/2) Oct 18 2012 Your keyboard doesn't have ready unicode values for all
- Jonathan M Davis (13/15) Oct 18 2012 So? That doesn't make it so that it's not valuable to be able to input t...
- Don Clugston (6/40) Oct 18 2012 That is not the same. Array literals are not the same as string
- foobar (8/61) Oct 18 2012 I don't see how that detail is relevant to this discussion as I
- Don Clugston (3/53) Oct 19 2012 That doesn't compile.
- foobar (3/8) Oct 19 2012 Come on, "assuming the code points are valid". It says so 4 lines
- Don Clugston (7/15) Oct 19 2012 It isn't the same.
- foobar (27/45) Oct 19 2012 Yes, the \u requires code points and not code-units for a
- foobar (8/56) Oct 19 2012 I just re-checked and to clarify string literals support _three_
- Nick Sabalausky (7/21) Oct 20 2012 Using x"..." doesn't prevent anyone from doing that:
- Denis Shelomovskij (11/17) Oct 20 2012 Maybe. Just an example of a real world code:
- foobar (4/22) Oct 20 2012 I personally find the former more readable but I guess there
- Nick Sabalausky (4/21) Oct 20 2012 Honestly, I can't imagine how anyone wouldn't find the latter vastly
- H. S. Teoh (28/50) Oct 20 2012 If you want vastly human readable, you want heredoc hex syntax,
- foobar (6/61) Oct 20 2012 Yeah, I like this. I'd prefer brackets over quotes but it not a
- foobar (2/72) Oct 20 2012 ** not a big deal
- Nick Sabalausky (16/68) Oct 20 2012 Can't you already just do this?:
- Dejan Lekic (4/18) Oct 22 2012 Having a heredoc syntax for hex-strings that produce ubyte[]
- H. S. Teoh (6/27) Oct 22 2012 What I meant was, a syntax similar to heredoc, not an actual heredoc,
- Nick Sabalausky (6/31) Oct 18 2012 Big +1
- bearophile (5/9) Oct 18 2012 I'd like an opinion on such topics from one of the D bosses
- monarch_dodra (11/25) Oct 18 2012 The conversion can't be done *implicitly*, but you can still get
- Dejan Lekic (4/25) Oct 22 2012 +1 on this one
- Simen Kjaeraas (5/9) Oct 22 2012 That syntax is already taken, though.
(Repost)

hex strings are useful, but I think they were invented in D1 when strings were convertible to char[]. But today they are an array of immutable UTF-8, so I think this default type is not so useful:

void main() {
    string data1 = x"A1 B2 C3 D4";              // OK
    immutable(ubyte)[] data2 = x"A1 B2 C3 D4";  // error
}

test.d(3): Error: cannot implicitly convert expression ("\xa1\xb2\xc3\xd4") of type string to ubyte[]

Generally I want to use hex strings to put binary data in a program, so usually it's a ubyte[] or uint[]. So I have to use something like:

auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup);

So maybe the following literals are more useful in D2:

ubyte[] data4 = x[A1 B2 C3 D4];
uint[]  data5 = x[A1 B2 C3 D4];
ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

Bye,
bearophile
Oct 17 2012
On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]
> hex strings are useful, but I think they were invented in D1 when
> strings were convertible to char[]. But today they are an array of
> immutable UTF-8, so I think this default type is not so useful:
>
> void main() {
>     string data1 = x"A1 B2 C3 D4";              // OK
>     immutable(ubyte)[] data2 = x"A1 B2 C3 D4";  // error
> }
>
> test.d(3): Error: cannot implicitly convert expression
> ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...]

Yeah, I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string.

T

--
Who told you to swim in Crocodile Lake without life insurance??
Oct 17 2012
On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
> Yeah, I think hex strings would be better as ubyte[] by default. More
> generally, though, I think *both* of the above lines should be equally
> accepted. [...]

IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use:

immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.
Oct 18 2012
On Thursday, 18 October 2012 at 08:58:57 UTC, foobar wrote:
> IMO, this is a redundant feature that complicates the language for no
> benefit and should be deprecated. strings already have an escape
> sequence for specifying code-points "\u" and for ubyte arrays you can
> simply use:
>
> immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];
>
> So basically this feature gains us nothing.

Have you actually ever written code that requires using code points? This feature is a *huge* convenience for when you do. Just compare:

string nihongo1 = x"e697a5 e69cac e8aa9e";
string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e];

BTW, your data2 doesn't compile.
Oct 18 2012
On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
> Have you actually ever written code that requires using code points?
> This feature is a *huge* convenience for when you do. Just compare:
>
> string nihongo1 = x"e697a5 e69cac e8aa9e";
> string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
> ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e];
>
> BTW, your data2 doesn't compile.

I didn't try to compile it :) I just rewrote bearophile's example with 0x prefixes.

How often do you actually need to write code-point _literals_ in your code? I'm not arguing that it isn't convenient. My question would rather be Andrei's "does it pull its own weight?", meaning: is the added complexity in the language, and having more than one way of doing something, worth that convenience?

Seems to me this is in the same ballpark as the built-in complex numbers. Sure it's nice to be able to write "4+5i" instead of "complex(4,5)", but how frequently do you actually ever need the _literals_ even in computation-heavy code?
Oct 18 2012
The docs say: http://dlang.org/lex.html

> Hex strings allow string literals to be created using hex data. The
> hex data need not form valid UTF characters.

But this code:

void main() {
    immutable ubyte[4] data = x"F9 04 C1 E2";
}

Gives me:

temp.d(2): Error: Outside Unicode code space

Are the docs correct?

--------------------------

foobar:
> Seems to me this is in the same ballpark as the built-in complex
> numbers. [...]

Compared to "oct!5151151511", one problem with code like this is that binary blobs are sometimes large, so supporting a x"" syntax is better:

immutable ubyte[4] data = hex!"F9 04 C1 E2";

Bye,
bearophile
Oct 18 2012
On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
> Compared to "oct!5151151511", one problem with code like this is that
> binary blobs are sometimes large, so supporting a x"" syntax is
> better:
>
> immutable ubyte[4] data = hex!"F9 04 C1 E2";

How often are large binary blobs literally spelled in the source code (as opposed to just being read from a file)?

In any case, I'm not opposed to such a utility library; in fact I think it's a rather good idea, and we already have a precedent with "oct!". I just don't think this belongs as a built-in feature in the language.
Oct 18 2012
On Thursday, 18 October 2012 at 10:11:14 UTC, foobar wrote:
> On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:
>> The docs say: http://dlang.org/lex.html
>>
>>> Hex strings allow string literals to be created using hex data. The
>>> hex data need not form valid UTF characters.

This is especially a good reason to remove this feature, as it breaks the principle of least surprise; I consider it a major bug, not a feature. I expect D's strings, which are by definition Unicode, to _only_ ever allow _valid_ Unicode. It makes no sense whatsoever to allow this nasty back-door. Other text encodings should either be stored and treated as binary data (ubyte[]) or, better yet, stored in their own types that will enforce those encodings' invariants.
Oct 18 2012
On Thursday, 18 October 2012 at 10:17:06 UTC, foobar wrote:
> This is especially a good reason to remove this feature, as it breaks
> the principle of least surprise; I consider it a major bug, not a
> feature. [...]

Yeah, that makes sense too. I'll try to toy around on my end and see if I can write a "hex".
Oct 18 2012
On Thursday, 18 October 2012 at 10:39:46 UTC, monarch_dodra wrote:
> Yeah, that makes sense too. I'll try to toy around on my end and see
> if I can write a "hex".

That was actually relatively easy! Here is some usecase:

//----
void main() {
    enum a = hex!"01 ff 7f";
    enum b = hex!0x01_ff_7f;
    ubyte[] c = hex!"0123456789abcdef";
    immutable(ubyte)[] bearophile1 = hex!"A1 B2 C3 D4";
    immutable(ubyte)[] bearophile2 = hex!0xA1_B2_C3_D4;
    a.writeln();
    b.writeln();
    c.writeln();
    bearophile1.writeln();
    bearophile2.writeln();
}
//----

And corresponding output:

//----
[1, 255, 127]
[1, 255, 127]
[1, 35, 69, 103, 137, 171, 205, 239]
[161, 178, 195, 212]
[161, 178, 195, 212]
//----

hex! was a very good idea actually, imo. I'll post my current impl in the next post.

That said, I don't know if I'd deprecate x"", as it serves a different role, as you have already pointed out, in that it *will* validate the code points.
Oct 18 2012
On Thursday, 18 October 2012 at 11:24:04 UTC, monarch_dodra wrote:
> hex! was a very good idea actually, imo. I'll post my current impl in
> the next post.

//----
import std.stdio;
import std.conv;
import std.ascii;

template hex(string s)
{
    enum hex = decode(s);
}
template hex(ulong ul)
{
    enum hex = decode(ul);
}

ubyte[] decode(string s)
{
    ubyte[] ret;
    size_t p;
    while(p < s.length)
    {
        while( s[p] == ' ' || s[p] == '_' )
        {
            ++p;
            if (p == s.length)
                assert(0, text("Premature end of string at index ", p, "."));
        }
        char c1 = s[p];
        if (!std.ascii.isHexDigit(c1))
            assert(0, text("Unexpected character ", c1, " at index ", p, "."));
        c1 = cast(char)std.ascii.toUpper(c1);
        ++p;
        if (p == s.length)
            assert(0, text("Premature end of string after ", c1, "."));
        char c2 = s[p];
        if (!std.ascii.isHexDigit(c2))
            assert(0, text("Unexpected character ", c2, " at index ", p, "."));
        c2 = cast(char)std.ascii.toUpper(c2);
        ++p;
        ubyte val;
        if('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}

ubyte[] decode(ulong ul)
{
    //NOTE: This is not efficient AT ALL (push front),
    //but it is CTFE, so we can live with it for now ^^
    //I'll optimize it if I try to push it.
    ubyte[] ret;
    while(ul)
    {
        ubyte t = ul%256;
        ret = t ~ ret;
        ul /= 256;
    }
    return ret;
}
//----

NOT a final version.
Oct 18 2012
On Thursday, 18 October 2012 at 11:26:13 UTC, monarch_dodra wrote:
> NOT a final version.

With correct-er utf string support. In theory, non-ascii characters are illegal, but it makes for safer code, and better diagnosis.

//----
ubyte[] decode(string s)
{
    ubyte[] ret;
    while(s.length)
    {
        while( s.front == ' ' || s.front == '_' )
        {
            s.popFront();
            if (!s.length)
                assert(0, text("Premature end of string."));
        }
        dchar c1 = s.front;
        if (!std.ascii.isHexDigit(c1))
            assert(0, text("Unexpected character ", c1, "."));
        c1 = std.ascii.toUpper(c1);
        s.popFront();
        if (!s.length)
            assert(0, text("Premature end of string after ", c1, "."));
        dchar c2 = s.front;
        if (!std.ascii.isHexDigit(c2))
            assert(0, text("Unexpected character ", c2, " after ", c1, "."));
        c2 = std.ascii.toUpper(c2);
        s.popFront();
        ubyte val;
        if('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}
//----
Oct 18 2012
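[Editorial aside: the pair-decoding logic above can be written more compactly. The sketch below is illustrative code, not from the thread; the names decodeHex and hex are my own, and the structure is an assumption about how one might condense monarch_dodra's decoder.]

```d
import std.ascii : isHexDigit;

// Compact variant of the hex-pair decoding discussed above: map each
// hex digit to its value, accumulate two digits into one byte.
ubyte[] decodeHex(string s)
{
    ubyte hexVal(char c)
    {
        assert(isHexDigit(c), "not a hex digit: " ~ c);
        if (c >= '0' && c <= '9') return cast(ubyte)(c - '0');
        if (c >= 'a' && c <= 'f') return cast(ubyte)(c - 'a' + 10);
        return cast(ubyte)(c - 'A' + 10);
    }

    ubyte[] ret;
    int have = 0;   // digits seen in the current pair
    ubyte cur = 0;  // accumulated value of the current byte
    foreach (c; s)
    {
        if (c == ' ' || c == '_') continue; // allowed separators
        cur = cast(ubyte)(cur * 16 + hexVal(c));
        if (++have == 2) { ret ~= cur; cur = 0; have = 0; }
    }
    assert(have == 0, "odd number of hex digits");
    return ret;
}

// Usable at compile time via CTFE, like the template in the thread.
enum hex(string s) = decodeHex(s);

void main()
{
    static assert(hex!"A1 B2 C3 D4" == [0xA1, 0xB2, 0xC3, 0xD4]);
    assert(decodeHex("01_ff_7f") == [1, 255, 127]);
}
```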
monarch_dodra:
> hex! was a very good idea actually, imo.

It must scale up to "real world" usages. Try it with a program composed of 3 modules, each one containing a 100 KB long string. Then try it with a program with two hundred medium-sized literals, and let's see compilation times and binary sizes.

Bye,
bearophile
Oct 18 2012
On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
> It must scale up to "real world" usages. Try it with a program
> composed of 3 modules, each one containing a 100 KB long string. Then
> try it with a program with two hundred medium-sized literals, and
> let's see compilation times and binary sizes.

Hum... The compilation is pretty fast actually, about 1 second, provided it doesn't choke. It works for strings up to a length of 400 lines, 80 chars per line, which results in approximately 16K of data. After that, I get a DMD out of memory error.

DMD memory usage spikes quite quickly. To compile those 400 lines (16K), I use 800MB of memory (!). If I reach about 1GB, then it crashes. I tried using a refAppender instead of ret~, but that changed nothing. Kind of weird it would use that much memory though...

Also, the memory doesn't get released. I can parse a 1x400 line string, but if I try to parse 3 of them, DMD will choke on the second one. :(
Oct 18 2012
Am Thu, 18 Oct 2012 16:31:57 +0200
schrieb "monarch_dodra" <monarchdodra gmail.com>:
> DMD memory usage spikes quite quickly. To compile those 400 lines
> (16K), I use 800MB of memory (!). [...] Also, the memory doesn't get
> released.

Hehe, I assume most of the regulars know this: DMD used to use a garbage collector that is disabled. Memory just isn't freed! Also it has copy-on-write semantics during CTFE:

int bug6498(int x)
{
    int n = 0;
    while (n < x)
        ++n;
    return n;
}
static assert(bug6498(10_000_000) == 10_000_000);

--> Fails with an 'out of memory' error.
http://d.puremagic.com/issues/show_bug.cgi?id=6498

So, as strange as it sounds, for now try not to write often or into large blocks. Using this knowledge I was sometimes able to bring down the memory consumption considerably by caching recurring concatenations of two strings or to!string calls. That said, appending single elements to an array may actually be better than using a fixed-sized one and having DMD duplicate it on every write. :p

Please remember to give Don a cookie when he manages to change the compiler to modify in-place where appropriate.

--
Marco
Oct 18 2012
On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
> Hehe, I assume most of the regulars know this: DMD used to use a
> garbage collector that is disabled.

Yes, but it didn't use it for long, because it made performance worse, and Walter didn't have the time to spend fixing it, so it was disabled. Presumably, someone will take the time to improve it at some point and then it will be re-enabled.

> Memory just isn't freed!

That was my understanding, but the last time that I said that, Brad Roberts said that it wasn't true, and that we should stop spreading that FUD, so I don't know what the exact situation is; but it sounds like if that was true in the past, it's not true now. Regardless, it's clear that dmd still uses too much memory in many cases, especially when code uses a lot of templates or CTFE.

- Jonathan M Davis
Oct 18 2012
Am Thu, 18 Oct 2012 21:03:01 -0700
schrieb Jonathan M Davis <jmdavisProg gmx.com>:
> That was my understanding, but the last time that I said that, Brad
> Roberts said that it wasn't true, and that we should stop spreading
> that FUD [...]

He called it FUD? Without trying to sound too patronizing, most D programmers would really only notice DMD's memory footprint when they use CTFE features. It is always Pegged, ctRegex, etc. that make the issue come up, never basic code. And preloading the Boehm collector showed that gigabytes of CTFE memory usage can still be brought down to a few hundred MB [citation needed]. I guess we can meet somewhere in the middle.

Btw. did I mix up Don and Brad in the last post? Who is working on the memory management?

--
Marco
Oct 18 2012
On Friday, October 19, 2012 07:29:46 Marco Leise wrote:
> He called it FUD?

I don't think that he used quite that term, but his point was that I shouldn't be saying that, because it wasn't true, and so I was spreading incorrect information (that and the fact that he was tired of people spreading that incorrect information, IIRC). I can't find the exact post at the moment though.

> I guess we can meet somewhere in the middle. Btw. did I mix up Don and
> Brad in the last post? Who is working on the memory management?

I don't think that you mixed anyone up. Don works primarily on CTFE. Brad works primarily on the auto tester and other infrastructure required for the dmd/Phobos folks to do what they do.

- Jonathan M Davis
Oct 18 2012
On Friday, 19 October 2012 at 03:14:54 UTC, Marco Leise wrote:
> So, as strange as it sounds, for now try not to write often or into
> large blocks. Using this knowledge I was sometimes able to bring down
> the memory consumption considerably by caching recurring
> concatenations of two strings or to!string calls. [...]

I should have read your post in more detail. I thought you were saying that allocations are never freed, but it is indeed more than that: every write allocates.

I just spent the last hour trying to "optimize" my code, only to realize that at its "simplest" (walk the string counting elements), I run out of memory :/ Can't do much more about it at this point.
Oct 20 2012
On Thu, 18 Oct 2012 12:11:13 +0200
"foobar" <foo bar.com> wrote:
> How often are large binary blobs literally spelled in the source code
> (as opposed to just being read from a file)?

Frequency isn't the issue. The issues are "*Is* it ever needed?" and "When it is needed, is it useful enough?" The answer to both is most certainly "yes". (Remember, D is supposed to be usable as a systems language; it's not merely a high-level-app-only language.)

Keep in mind, the question "Does it pull its own weight?" is for adding new features, not for going around gutting the language just because we can.

> In any case, I'm not opposed to such a utility library; in fact I
> think it's a rather good idea, and we already have a precedent with
> "oct!". I just don't think this belongs as a built-in feature in the
> language.

I think monarch_dodra's test proves that it definitely needs to be built-in.
Oct 18 2012
On Friday, 19 October 2012 at 00:14:18 UTC, Nick Sabalausky wrote:
> Frequency isn't the issue. The issues are "*Is* it ever needed?" and
> "When it is needed, is it useful enough?" The answer to both is most
> certainly "yes". (Remember, D is supposed to be usable as a systems
> language; it's not merely a high-level-app-only language.)

Any real-world use cases to support this claim? Does C++ have such a feature? My limited experience with kernels is that this feature is not needed. The solution we used for this was to define an extern symbol and load it with a linker script (the binary data was of course stored in separate files).

> Keep in mind, the question "Does it pull its own weight?" is for
> adding new features, not for going around gutting the language just
> because we can.

Ok, I grant you that, but remember that the whole thread started because the feature _doesn't_ work, so let's rephrase - is it worth the effort to fix this feature?

> I think monarch_dodra's test proves that it definitely needs to be
> built-in.

It proves that DMD has bugs that should be fixed, nothing more.
Oct 19 2012
On Fri, 19 Oct 2012 15:07:09 +0200
"foobar" <foo bar.com> wrote:
> Any real-world use cases to support this claim?

I've used it. And Denis just posted an example of where it was used to make code far more readable.

> Does C++ have such a feature?

It does not. As one consequence off the top of my head, including binary data into GBA homebrew became more of an awkward bloated mess than it needed to be.

> My limited experience with kernels is that this feature is not needed.

"I haven't needed it" isn't remotely sufficient to demonstrate that something doesn't "pull its own weight".

> The solution we used for this was to define an extern symbol and load
> it with a linker script (the binary data was of course stored in
> separate files).

Yuck! s/solution/workaround/

> Ok, I grant you that, but remember that the whole thread started
> because the feature _doesn't_ work, so let's rephrase - is it worth
> the effort to fix this feature?

The only bug is that it tries to validate it as UTF contrary to the spec. Making it *not* try to validate it sounds like a very minor effort. I think you're blowing it out of proportion. And yes, I think it's definitely worth it.

> It proves that DMD has bugs that should be fixed, nothing more.

Right, so let's jettison x"..." just because *someday* CTFE might become good enough that we can bring the feature back. How does that make any sense?

We already have it, and it basically works (aside from one fairly trivial issue). *When* CTFE is good enough to replace it, *then* we can have a sane debate about actually doing so. Until then, "Let's get rid of x"..." because it can be done in the library" is a pointless argument, because at least for now it's NOT TRUE.
Oct 20 2012
On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
> Have you actually ever written code that requires using code points?
> This feature is a *huge* convenience for when you do. Just compare:
>
> string nihongo1 = x"e697a5 e69cac e8aa9e";
> string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
> ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e];

You should use unicode directly here; that's the whole point of supporting it.

string nihongo = "日本語";
Oct 18 2012
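[Editorial aside: for reference, the code points for 日本語 are U+65E5, U+672C, and U+8A9E, and their UTF-8 encoding is exactly the nine bytes listed in the quoted ubyte[] spelling. A small illustrative check, not from the thread:]

```d
void main()
{
    // The direct literal and the \u-escaped form denote the same string:
    // 日 is U+65E5, 本 is U+672C, 語 is U+8A9E.
    assert("日本語" == "\u65E5\u672C\u8A9E");

    // As UTF-8, each of these code points encodes to three bytes,
    // matching the ubyte[] spelling quoted earlier in the thread.
    assert(cast(const(ubyte)[])"日本語" ==
           [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e]);
}
```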
On Thursday, October 18, 2012 15:56:50 Kagamin wrote:
> You should use unicode directly here; that's the whole point of
> supporting it.
>
> string nihongo = "日本語";

It's a nice feature, but there are plenty of cases where it makes more sense to use the unicode values rather than the characters themselves (e.g. your keyboard doesn't have the characters in question). It's valuable to be able to do it both ways.

- Jonathan M Davis
Oct 18 2012
Your keyboard doesn't have ready unicode values for all characters either.
Oct 18 2012
On Thursday, October 18, 2012 21:09:14 Kagamin wrote:
> Your keyboard doesn't have ready unicode values for all characters
> either.

So? That doesn't make it so that it's not valuable to be able to input the values in hexadecimal instead of as actual unicode characters. Heck, if you want a specific character, I wouldn't trust copying the characters anyway, because it's far too easy to have two characters which look really similar but are different (e.g. there are multiple types of angle brackets in unicode), whereas with the numbers you can be sure. And with some characters (e.g. unicode whitespace characters), it generally doesn't make sense to enter the characters directly.

Regardless, my point is that both approaches can be useful, so it's good to be able to do both. If you prefer to put the unicode characters in directly, then do that, but others may prefer the other way. Personally, I've done both.

- Jonathan M Davis
Oct 18 2012
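[Editorial aside: the lookalike-character point above is easy to demonstrate. The sketch below is illustrative code, not from the thread; it uses two of the angle brackets Jonathan alludes to.]

```d
void main()
{
    // U+2329 (LEFT-POINTING ANGLE BRACKET) and U+3008 (LEFT ANGLE
    // BRACKET) render nearly identically, yet are distinct code points;
    // spelling the value in hex removes any ambiguity.
    dchar a = '\u2329';
    dchar b = '\u3008';
    assert(a != b);

    // Likewise, a non-obvious whitespace character is clearer as an
    // escape than as a literal:
    dchar nbsp = '\u00A0'; // NO-BREAK SPACE
    assert(nbsp != ' ');
}
```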
On 18/10/12 10:58, foobar wrote:
> IMO, this is a redundant feature that complicates the language for no
> benefit and should be deprecated. strings already have an escape
> sequence for specifying code-points "\u" and for ubyte arrays you can
> simply use:
>
> immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];
>
> So basically this feature gains us nothing.

That is not the same. Array literals are not the same as string literals; they have an implicit .dup. See my recent thread on this issue (which unfortunately seems to have died without a resolution; people got hung up about trailing null characters without apparently noticing the more important issue of the dup).
Oct 18 2012
On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:On 18/10/12 10:58, foobar wrote:I don't see how that detail is relevant to this discussion as I was not arguing against string literals or array literals in general. We can still have both (assuming the code points are valid...): string foo = "\ua1\ub2\uc3"; // no .dup and: ubyte[3] goo = [0xa1, 0xb2, 0xc3]; // implicit .dupOn Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:That is not the same. Array literals are not the same as string literals, they have an implicit .dup. See my recent thread on this issue (which unfortunately seems have to died without a resolution, people got hung up about trailing null characters without apparently noticing the more important issue of the dup).On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote: [...]IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use: immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4]; So basically this feature gains us nothing.hex strings are useful, but I think they were invented in D1 when strings were convertible to char[]. But today they are an array of immutable UFT-8, so I think this default type is not so useful: void main() { string data1 = x"A1 B2 C3 D4"; // OK immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error } test.d(3): Error: cannot implicitly convert expression ("\xa1\xb2\xc3\xd4") of type string to ubyte[][...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string. T
Oct 18 2012
On 18/10/12 17:43, foobar wrote:On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:That doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4On 18/10/12 10:58, foobar wrote:I don't see how that detail is relevant to this discussion as I was not arguing against string literals or array literals in general. We can still have both (assuming the code points are valid...): string foo = "\ua1\ub2\uc3"; // no .dupOn Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:That is not the same. Array literals are not the same as string literals, they have an implicit .dup. See my recent thread on this issue (which unfortunately seems have to died without a resolution, people got hung up about trailing null characters without apparently noticing the more important issue of the dup).On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote: [...]IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use: immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4]; So basically this feature gains us nothing.hex strings are useful, but I think they were invented in D1 when strings were convertible to char[]. But today they are an array of immutable UFT-8, so I think this default type is not so useful: void main() { string data1 = x"A1 B2 C3 D4"; // OK immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error } test.d(3): Error: cannot implicitly convert expression ("\xa1\xb2\xc3\xd4") of type string to ubyte[][...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. 
If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string. T
Oct 19 2012
On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:Come on, "assuming the code points are valid". It says so 4 lines above!We can still have both (assuming the code points are valid...): string foo = "\ua1\ub2\uc3"; // no .dupThat doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
Oct 19 2012
On 19/10/12 16:07, foobar wrote:On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:It isn't the same. Hex strings are the raw bytes, eg UTF8 code units (ie, they include the high bits that indicate the length of each char). \u makes dchars. "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.Come on, "assuming the code points are valid". It says so 4 lines above!We can still have both (assuming the code points are valid...): string foo = "\ua1\ub2\uc3"; // no .dupThat doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
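[Editor's note: the byte-level difference is easy to verify; this sketch assumes the dmd behavior described in the thread, where x"..." yields a string of raw bytes.]

```d
void main()
{
    // "\u00A1" stores the UTF-8 encoding of code point U+00A1,
    // i.e. the two code units 0xC2 0xA1:
    string s = "\u00A1";
    assert(s.length == 2);
    assert(s[0] == 0xC2 && s[1] == 0xA1);

    // x"A1" is the single raw byte 0xA1, which on its own is not
    // a valid UTF-8 sequence:
    string h = x"A1";
    assert(h.length == 1);
    assert(h[0] == 0xA1);
}
```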
Oct 19 2012
On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:On 19/10/12 16:07, foobar wrote:Yes, the \u requires code points and not code-units for a specific UTF encoding, which you are correct in pointing out are four hex digits and not two. This is a very reasonable choice to prevent/reduce Unicode encoding errors. http://dlang.org/lex.html#HexString states: "Hex strings allow string literals to be created using hex data. The hex data need not form valid UTF characters." I _already_ said that I consider this a major semantic bug as it violates the principle of least surprise - the programmer's expectation that the D string types which are Unicode according to the spec to, well, actually contain _valid_ Unicode and _not_ arbitrary binary data. Given the above, the design of \u makes perfect sense for _strings_ - you can use _valid_ code-points (not code units) in hex form. For general purpose binary data (i.e. _not_ UTF encoded Unicode text) I also _already_ said IMO should be either stored as ubyte[] or better yet their own types that would ensure the correct invariants for the data type, be it audio, video, or just a different text encoding. In neither case the hex-string is relevant IMO. In the former it potentially violates the type's invariant and in the latter we already have array literals. Using a malformed _string_ to initialize ubyte[] IMO is simply less readable. How did that article call such features, "WAT"?On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:It isn't the same. Hex strings are the raw bytes, eg UTF8 code points. (ie, it includes the high bits that indicate the length of each char). \u makes dchars. "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.Come on, "assuming the code points are valid". It says so 4 lines above!We can still have both (assuming the code points are valid...): string foo = "\ua1\ub2\uc3"; // no .dupThat doesn't compile. 
Error: escape hex sequence has 2 hex digits instead of 4
Oct 19 2012
On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:I just re-checked and to clarify string literals support _three_ escape sequences: \x__ - a single byte \u____ - two bytes \U________ - four bytes So raw bytes _can_ be directly specified and I hope the compiler still verifies the string literal is valid Unicode.On 19/10/12 16:07, foobar wrote:Yes, the \u requires code points and not code-units for a specific UTF encoding, which you are correct in pointing out are four hex digits and not two. This is a very reasonable choice to prevent/reduce Unicode encoding errors. http://dlang.org/lex.html#HexString states: "Hex strings allow string literals to be created using hex data. The hex data need not form valid UTF characters." I _already_ said that I consider this a major semantic bug as it violates the principle of least surprise - the programmer's expectation that the D string types which are Unicode according to the spec to, well, actually contain _valid_ Unicode and _not_ arbitrary binary data. Given the above, the design of \u makes perfect sense for _strings_ - you can use _valid_ code-points (not code units) in hex form. For general purpose binary data (i.e. _not_ UTF encoded Unicode text) I also _already_ said IMO should be either stored as ubyte[] or better yet their own types that would ensure the correct invariants for the data type, be it audio, video, or just a different text encoding. In neither case the hex-string is relevant IMO. In the former it potentially violates the type's invariant and in the latter we already have array literals. Using a malformed _string_ to initialize ubyte[] IMO is simply less readable. How did that article call such features, "WAT"?On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:It isn't the same. Hex strings are the raw bytes, eg UTF8 code points. (ie, it includes the high bits that indicate the length of each char). \u makes dchars. 
"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.Come on, "assuming the code points are valid". It says so 4 lines above!We can still have both (assuming the code points are valid...): string foo = "\ua1\ub2\uc3"; // no .dupThat doesn't compile. Error: escape hex sequence has 2 hex digits instead of 4
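[Editor's note: the three escape forms listed above can be shown to agree when they denote the same character.]

```d
void main()
{
    // \x__       : one raw byte (a UTF-8 code unit)
    // \u____     : one code point, up to U+FFFF
    // \U________ : one code point, full Unicode range
    string a = "\xC2\xA1";    // the two UTF-8 bytes of U+00A1
    string b = "\u00A1";      // same character, 4-digit code point form
    string c = "\U000000A1";  // same character, 8-digit code point form
    assert(a == b);
    assert(b == c);
}
```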
Oct 19 2012
On Fri, 19 Oct 2012 20:46:06 +0200For general purpose binary data (i.e. _not_ UTF encoded Unicode text) I also _already_ said IMO should be either stored as ubyte[]Problem is, x"..." is FAR better syntax for that.or better yet their own types that would ensure the correct invariants for the data type, be it audio, video, or just a different text encoding.Using x"..." doesn't prevent anyone from doing that: auto a = SomeAudioType(x"...");In neither case the hex-string is relevant IMO. In the former it potentially violates the type's invariant and in the latter we already have array literals. Using a malformed _string_ to initialize ubyte[] IMO is simply less readable. How did that article call such features, "WAT"?The only thing ridiculous about x"..." is that somewhere along the lines it was decided that it must be a string instead of the arbitrary binary data that it *is*.
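[Editor's note: the `SomeAudioType(x"...")` idea — a dedicated type whose constructor enforces an invariant over raw bytes — might look like the sketch below. `Latin1Text` and its no-NUL invariant are made up purely for illustration.]

```d
// Hypothetical wrapper type: binary data with a checked invariant,
// initialized from the raw bytes of a hex string literal.
struct Latin1Text
{
    immutable(ubyte)[] bytes;

    this(immutable(ubyte)[] data)
    {
        foreach (b; data)
            assert(b != 0, "NUL byte not allowed in Latin1Text");
        bytes = data;
    }
}

void main()
{
    // x"..." yields a string; the cast reinterprets its bytes.
    auto t = Latin1Text(cast(immutable(ubyte)[]) x"A1 B2 C3 D4");
    assert(t.bytes.length == 4);
    assert(t.bytes[0] == 0xA1);
}
```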
Oct 20 2012
18.10.2012 12:58, foobar wrote:IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use: immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4]; So basically this feature gains us nothing.Maybe. Just an example of real-world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue. -- Денис В. Шеломовский Denis V. Shelomovskij
Oct 20 2012
On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:18.10.2012 12:58, foobar wrote:I personally find the former more readable but I guess there would always be someone to disagree. As they say, YMMV.IMO, this is a redundant feature that complicates the language for no benefit and should be deprecated. strings already have an escape sequence for specifying code-points "\u" and for ubyte arrays you can simply use: immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4]; So basically this feature gains us nothing.Maybe. Just an example of real-world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue.
Oct 20 2012
On Sat, 20 Oct 2012 14:59:27 +0200 "foobar" <foo bar.com> wrote:On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.Maybe. Just an example of real-world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue.I personally find the former more readable but I guess there would always be someone to disagree. As they say, YMMV.
Oct 20 2012
On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:On Sat, 20 Oct 2012 14:59:27 +0200 "foobar" <foo bar.com> wrote:If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END"; (I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain. Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic). T -- Без труда не выловишь и рыбку из пруда.On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.Maybe. Just an example of a real world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue.I personally find the former more readable but I guess there would always be someone to disagree. 
As they say, YMMV.
Oct 20 2012
On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:Yeah, I like this. I'd prefer brackets over quotes but it's not a big dig as the quotes in the above are not very noticeable. It should look distinct from textual strings. As you said, this could/should be implemented as a template. Vote++On Sat, 20 Oct 2012 14:59:27 +0200 "foobar" <foo bar.com> wrote:If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END"; (I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain. Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic). TOn Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.Maybe.
Just an example of a real world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue.I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Oct 20 2012
On Saturday, 20 October 2012 at 21:16:44 UTC, foobar wrote:On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:** not a big dealOn Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:Yeah, I like this. I'd prefer brackets over quotes but it not a big dig as the qoutes in the above are not very noticeable. It should look distinct from textual strings. As you said, this could/should be implemented as a template. Vote++On Sat, 20 Oct 2012 14:59:27 +0200 "foobar" <foo bar.com> wrote:If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END"; (I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain. Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic). TOn Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.Maybe. 
Just an example of a real world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue.I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Oct 20 2012
On Sat, 20 Oct 2012 14:05:21 -0700 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:Can't you already just do this?: auto blah = x" 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 "; I thought all string literals in D accepted embedded newlines?On Sat, 20 Oct 2012 14:59:27 +0200 "foobar" <foo bar.com> wrote:If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END"; (I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain. Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. 
Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic).On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.Maybe. Just an example of a real world code: Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110 vs Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130 By the way, current code isn't affected by the topic issue.I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
Oct 20 2012
If you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END";Having a heredoc syntax for hex-strings that produce ubyte[] arrays is confusing for people who would (naturally) expect a string from a heredoc string. It is not named hereDOC for no reason. :)
Oct 22 2012
On Mon, Oct 22, 2012 at 01:14:21PM +0200, Dejan Lekic wrote:What I meant was, a syntax similar to heredoc, not an actual heredoc, which would be a string. T -- Knowledge is that area of ignorance that we arrange and classify. -- Ambrose BierceIf you want vastly human readable, you want heredoc hex syntax, something like this: ubyte[] = x"<<END 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65 END";Having a heredoc syntax for hex-strings that produce ubyte[] arrays is confusing for people who would (naturally) expect a string from a heredoc string. It is not named hereDOC for no reason. :)
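[Editor's note: such a heredoc-like hex decoder is indeed implementable with CTFE. The sketch below is hypothetical — `hexBytes` and `decodeHex` are made-up names — though Phobos later gained `std.conv.hexString` in this spirit.]

```d
// Decode a hex string into bytes; non-hex characters (whitespace,
// separators) are skipped. Runs fine under CTFE.
ubyte[] decodeHex(string s)
{
    ubyte[] result;
    int nibbles = 0;
    ubyte cur = 0;
    foreach (ch; s)
    {
        int v;
        if (ch >= '0' && ch <= '9')      v = ch - '0';
        else if (ch >= 'a' && ch <= 'f') v = ch - 'a' + 10;
        else if (ch >= 'A' && ch <= 'F') v = ch - 'A' + 10;
        else continue;                   // skip whitespace/separators
        cur = cast(ubyte)((cur << 4) | v);
        if (++nibbles == 2) { result ~= cur; nibbles = 0; cur = 0; }
    }
    return result;
}

// Compile-time wrapper: hexBytes!"A1 B2" yields a ubyte[] constant.
template hexBytes(string s)
{
    enum ubyte[] hexBytes = decodeHex(s);
}

void main()
{
    enum data = hexBytes!"32 2b 32 3d 34";
    static assert(data.length == 5);
    static assert(data[0] == 0x32 && data[4] == 0x34);
}
```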
Oct 22 2012
On Wed, 17 Oct 2012 19:49:43 -0700 "H. S. Teoh" <hsteoh quickfur.ath.cx> wrote:On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote: [...]Big +1 Having the language expect x"..." to always be a string (let alone a *valid UTF* string) is just insane. It's just too damn useful for arbitrary binary data.hex strings are useful, but I think they were invented in D1 when strings were convertible to char[]. But today they are an array of immutable UFT-8, so I think this default type is not so useful: void main() { string data1 = x"A1 B2 C3 D4"; // OK immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error } test.d(3): Error: cannot implicitly convert expression ("\xa1\xb2\xc3\xd4") of type string to ubyte[][...] Yeah I think hex strings would be better as ubyte[] by default. More generally, though, I think *both* of the above lines should be equally accepted. If you write x"A1 B2 C3" in the context of initializing a string, then the compiler should infer the type of the literal as string, and if the same literal occurs in the context of, say, passing a ubyte[], then its type should be inferred as ubyte[], NOT string.
Oct 18 2012
Nick Sabalausky:Big +1 Having the language expect x"..." to always be a string (let alone a *valid UTF* string) is just insane. It's just too damn useful for arbitrary binary data.I'd like an opinion on such topics from one of the D bosses :-) Bye, bearophile
Oct 18 2012
On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:(Repost) hex strings are useful, but I think they were invented in D1 when strings were convertible to char[]. But today they are an array of immutable UTF-8, so I think this default type is not so useful: void main() { string data1 = x"A1 B2 C3 D4"; // OK immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error } test.d(3): Error: cannot implicitly convert expression ("\xa1\xb2\xc3\xd4") of type string to ubyte[] [SNIP] Bye, bearophileThe conversion can't be done *implicitly*, but you can still get your code to compile: //---- void main() { immutable(ubyte)[] data2 = cast(immutable(ubyte)[]) x"A1 B2 C3 D4"; // OK! } //---- It's a bit ugly, and I agree it should work natively, but it is a workaround.
Oct 18 2012
On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:(Repost) hex strings are useful, but I think they were invented in D1 when strings were convertible to char[]. But today they are an array of immutable UTF-8, so I think this default type is not so useful: void main() { string data1 = x"A1 B2 C3 D4"; // OK immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error } test.d(3): Error: cannot implicitly convert expression ("\xa1\xb2\xc3\xd4") of type string to ubyte[] Generally I want to use hex strings to put binary data in a program, so usually it's a ubyte[] or uint[]. So I have to use something like: auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup); So maybe the following literals are more useful in D2: ubyte[] data4 = x[A1 B2 C3 D4]; uint[] data5 = x[A1 B2 C3 D4]; ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4]; Bye, bearophile+1 on this one. I also like the x[ ... ] literal because it makes it obvious that we are dealing with an array.
Oct 22 2012
On 2012-10-18 02:45, bearophile <bearophileHUGS lycos.com> wrote:So maybe the following literals are more useful in D2: ubyte[] data4 = x[A1 B2 C3 D4]; uint[] data5 = x[A1 B2 C3 D4]; ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];That syntax is already taken, though. Still, I see no reason for x"..." not to return ubyte[]. -- Simen
Oct 22 2012