digitalmars.D.learn - string to character code hex string

reply bitwise <bitwise.pvt gmail.com> writes:
I need to convert a string of characters to a string of their hex 
representations.

"AAA" -> "414141"

This seems like something that would be in the std lib, but I 
can't find it.
Does it exist?

   Thanks
Sep 02 2017
parent reply bitwise <bitwise.pvt gmail.com> writes:
On Saturday, 2 September 2017 at 15:53:25 UTC, bitwise wrote:
 [...]
This seems to work well enough.

string toAsciiHex(string str)
{
    import std.array : appender;
    import std.format : format;

    auto ret = appender!string(null);
    ret.reserve(str.length * 2);

    foreach (c; str)
        ret.put(format!"%x"(c));

    return ret.data;
}
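A quick check against the example from the original post (a sketch assuming the function above):

void main()
{
    assert("AAA".toAsciiHex == "414141");
}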
Sep 02 2017
next sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 09/02/2017 09:23 AM, bitwise wrote:
 On Saturday, 2 September 2017 at 15:53:25 UTC, bitwise wrote:
 [...]
 This seems to work well enough.

 string toAsciiHex(string str)
 {
     import std.array : appender;
     import std.format : format;

     auto ret = appender!string(null);
     ret.reserve(str.length * 2);

     foreach (c; str)
         ret.put(format!"%x"(c));

     return ret.data;
 }
Lazy version, which the user can easily generate a string from by appending .array:

import std.stdio;

auto hexString(R)(R input) {
    import std.conv : text;
    import std.string : format;
    import std.algorithm : map, joiner;
    return input.map!(c => format("%02x", c)).joiner;
}

void main() {
    writeln("AAA".hexString);
}

To generate string:

    import std.range : array;
    writeln("AAA".hexString.array);

Ali
Sep 02 2017
parent reply lithium iodate <whatdoiknow doesntexist.net> writes:
On Saturday, 2 September 2017 at 16:52:17 UTC, Ali Çehreli wrote:
 On 09/02/2017 09:23 AM, bitwise wrote:
 On Saturday, 2 September 2017 at 15:53:25 UTC, bitwise wrote:
 [...]
 This seems to work well enough.

 string toAsciiHex(string str)
 {
     import std.array : appender;
     import std.format : format;

     auto ret = appender!string(null);
     ret.reserve(str.length * 2);

     foreach (c; str)
         ret.put(format!"%x"(c));

     return ret.data;
 }
 Lazy version, which the user can easily generate a string from by
 appending .array:

 import std.stdio;

 auto hexString(R)(R input) {
     import std.conv : text;
     import std.string : format;
     import std.algorithm : map, joiner;
     return input.map!(c => format("%02x", c)).joiner;
 }

 void main() {
     writeln("AAA".hexString);
 }

 To generate string:

     import std.range : array;
     writeln("AAA".hexString.array);

 Ali
Please correct me if I'm wrong, but I think this has issues regarding Unicode. "ö…" becomes "f62026", which, interpreted as UTF-8, is a control character ~ " &", so you either need to add padding or use .byCodeUnit so it becomes "c3b6e280a6" (correct UTF-8) instead.
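A minimal sketch of both results, for illustration (the values are the ones quoted above):

---
import std.algorithm : map, joiner;
import std.conv : text;
import std.string : format;
import std.utf : byCodeUnit;

void main()
{
    // Auto-decoding yields code points: 'ö' is U+00F6 and '…' is U+2026,
    // so their hex runs together with no way to recover the boundaries.
    assert("ö…".map!(c => format("%02x", c)).joiner.text == "f62026");

    // byCodeUnit yields the raw UTF-8 bytes, two hex digits each.
    assert("ö…".byCodeUnit.map!(c => format("%02x", c)).joiner.text
           == "c3b6e280a6");
}
---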
Sep 02 2017
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 09/02/2017 10:07 AM, lithium iodate wrote:

 Lazy version, which the user can easily generate a string from by
 appending .array:

 import std.stdio;

 auto hexString(R)(R input) {
     import std.conv : text;
     import std.string : format;
     import std.algorithm : map, joiner;
     return input.map!(c => format("%02x", c)).joiner;
 }

 void main() {
     writeln("AAA".hexString);
 }

 To generate string:

     import std.range : array;
     writeln("AAA".hexString.array);

 Ali
 Please correct me if I'm wrong, but I think this has issues regarding
 Unicode. "ö…" becomes "f62026", which, interpreted as UTF-8, is a
 control character ~ " &", so you either need to add padding or use
 .byCodeUnit so it becomes "c3b6e280a6" (correct UTF-8) instead.
You're right but I think there is no intention of interpreting the result as UTF-8. "f62026" is just to be used as "f62026", which can be converted byte-by-byte back to "ö…". That's how I understand the requirement anyway.

Ali
Sep 02 2017
next sibling parent bitwise <bitwise.pvt gmail.com> writes:
On Saturday, 2 September 2017 at 17:41:34 UTC, Ali Çehreli wrote:
 
 You're right but I think there is no intention of interpreting 
 the result as UTF-8. "f62026" is just to be used as "f62026", 
 which can be converted byte-by-byte back to "ö…". That's how 
 understand the requirement anyway.

 Ali
My intention is to compute the mangling of a D template function that takes a string as a template parameter, without having the symbol available. I think that means that converting each byte of the string to hex and tacking it on would suffice.
Sep 02 2017
prev sibling parent reply lithium iodate <whatdoiknow doesntexist.net> writes:
On Saturday, 2 September 2017 at 17:41:34 UTC, Ali Çehreli wrote:
 You're right but I think there is no intention of interpreting 
 the result as UTF-8. "f62026" is just to be used as "f62026", 
 which can be converted byte-by-byte back to "ö…". That's how 
 understand the requirement anyway.

 Ali
That is not possible, because you cannot know whether "f620" and "26" or "f6" and "2026" (or any other combination) should form a code point each. Additional padding to constant width (8 hex chars) is needed.
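A minimal sketch of that padded round trip (toHexPadded and fromHexPadded are made-up names; 8 hex digits are enough for any Unicode code point):

---
import std.algorithm : equal, map, joiner;
import std.conv : text, to;
import std.range : chunks;
import std.string : format;

// Sketch only: every code point becomes exactly 8 hex digits,
// so the boundaries are unambiguous on the way back.
auto toHexPadded(string s)
{
    return s.map!(c => format("%08x", c)).joiner;
}

auto fromHexPadded(R)(R input)
{
    return input.chunks(8).map!(ch => cast(dchar) ch.text.to!uint(16));
}

void main()
{
    assert("ö…".toHexPadded.text == "000000f600002026");
    assert("ö…".toHexPadded.fromHexPadded.equal("ö…"));
}
---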
Sep 02 2017
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 09/02/2017 11:02 AM, lithium iodate wrote:
 On Saturday, 2 September 2017 at 17:41:34 UTC, Ali Çehreli wrote:
 You're right but I think there is no intention of interpreting the
 result as UTF-8. "f62026" is just to be used as "f62026", which can be
 converted byte-by-byte back to "ö…". That's how understand the
 requirement anyway.

 Ali
 That is not possible, because you cannot know whether "f620" and "26"
 or "f6" and "2026" (or any other combination) should form a code point
 each. Additional padding to constant width (8 hex chars) is needed.
Ok, I see that I made a mistake but I still don't think the conversion is one way. If we can convert byte-by-byte, we should be able to convert back byte-by-byte, right? What I failed to ensure was to iterate by code units. The following is able to get the same string back:

import std.stdio;
import std.string;
import std.algorithm;
import std.range;
import std.utf;
import std.conv;

auto toHex(R)(R input) {
    // As Moritz Maxeiner says, this format is expensive
    return input.byCodeUnit.map!(c => format!"%02x"(c)).joiner;
}

int hexValue(C)(C c) {
    switch (c) {
    case '0': .. case '9':
        return c - '0';
    case 'a': .. case 'f':
        return c - 'a' + 10;
    default:
        assert(false);
    }
}

auto fromHex(R, Dst = char)(R input) {
    return input.chunks(2).map!((ch) {
            auto high = ch.front.hexValue * 16;
            ch.popFront();
            return high + ch.front.hexValue;
        }).map!(value => cast(Dst)value);
}

void main() {
    assert("AAA".toHex.fromHex.equal("AAA"));

    assert("ö…".toHex.fromHex.equal("ö…".byCodeUnit));
    // Alternative check:
    assert("ö…".toHex.fromHex.text.equal("ö…"));
}

Ali
Sep 02 2017
parent reply ag0aep6g <anonymous example.com> writes:
On 09/03/2017 01:39 AM, Ali Çehreli wrote:
 Ok, I see that I made a mistake but I still don't think the conversion 
 is one way. If we can convert byte-by-byte, we should be able to convert 
 back byte-by-byte, right?
You weren't converting byte-by-byte. You were only converting the significant bytes of the code points, throwing away leading zeroes.
 What I failed to ensure was to iterate by code 
 units.
A UTF-8 code unit is a byte, so "%02x" is enough, yes. But for UTF-16 and UTF-32 code units, it's not. You need to match the format width to the size of the code unit. Or maybe just convert everything to UTF-8 first. That also sidesteps any endianness issues.
 The following is able to get the same string back:
 
 import std.stdio;
 import std.string;
 import std.algorithm;
 import std.range;
 import std.utf;
 import std.conv;
 
 auto toHex(R)(R input) {
      // As Moritz Maxeiner says, this format is expensive
      return input.byCodeUnit.map!(c => format!"%02x"(c)).joiner;
 }
 
 int hexValue(C)(C c) {
      switch (c) {
      case '0': .. case '9':
          return c - '0';
      case 'a': .. case 'f':
          return c - 'a' + 10;
      default:
          assert(false);
      }
 }
 
 auto fromHex(R, Dst = char)(R input) {
      return input.chunks(2).map!((ch) {
              auto high = ch.front.hexValue * 16;
              ch.popFront();
              return high + ch.front.hexValue;
          }).map!(value => cast(Dst)value);
 }
 
 void main() {
      assert("AAA".toHex.fromHex.equal("AAA"));
 
      assert("ö…".toHex.fromHex.equal("ö…".byCodeUnit));
      // Alternative check:
      assert("ö…".toHex.fromHex.text.equal("ö…"));
 }
Still fails with UTF-16 and UTF-32 strings: ---- writeln("…"w.toHex.fromHex.text); /* prints " &" */ writeln("…"d.toHex.fromHex.text); /* prints " &" */ ----
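A minimal sketch of that width matching (toHexFixed is a made-up name; ElementEncodingType picks the code unit type, so the pad width follows its size):

----
import std.algorithm : map, joiner;
import std.range : ElementEncodingType;
import std.stdio : writeln;
import std.string : format;
import std.utf : byCodeUnit;

// Sketch only: pad every code unit to CU.sizeof * 2 hex digits
// (2 for char, 4 for wchar, 8 for dchar).
auto toHexFixed(R)(R input)
{
    alias CU = ElementEncodingType!R;
    enum fmt = format("%%0%sx", CU.sizeof * 2);
    return input.byCodeUnit.map!(c => format(fmt, c)).joiner;
}

void main()
{
    writeln("…".toHexFixed);  /* e280a6 (three UTF-8 code units) */
    writeln("…"w.toHexFixed); /* 2026 (one UTF-16 code unit) */
    writeln("…"d.toHexFixed); /* 00002026 (one UTF-32 code unit) */
}
----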
Sep 03 2017
parent Ali Çehreli <acehreli yahoo.com> writes:
On 09/03/2017 03:03 AM, ag0aep6g wrote:
 On 09/03/2017 01:39 AM, Ali Çehreli wrote:
 If we can convert byte-by-byte, we should be able to
 convert back byte-by-byte, right?
You weren't converting byte-by-byte.
In my mind I was! :o)
 Or maybe just convert everything to UTF-8 first. That also sidesteps any
 endianess issues.
Good point.
 Still fails with UTF-16 and UTF-32 strings:
I think I can make it work with a few more iterations but I'll leave it as an exercise for the author.

Ali
Sep 03 2017
prev sibling parent reply Moritz Maxeiner <moritz ucworks.org> writes:
On Saturday, 2 September 2017 at 16:23:57 UTC, bitwise wrote:
 On Saturday, 2 September 2017 at 15:53:25 UTC, bitwise wrote:
 [...]
 This seems to work well enough.

 string toAsciiHex(string str)
 {
     import std.array : appender;
     import std.format : format;

     auto ret = appender!string(null);
     ret.reserve(str.length * 2);

     foreach (c; str)
         ret.put(format!"%x"(c));

     return ret.data;
 }
Note: Each of those format calls is going to allocate a new string, followed by put copying that new string's content over into the appender, leaving you with Θ(str.length) tiny memory chunks that aren't used anymore for the GC to eventually collect. If this (unnecessary waste) is of concern to you (and from the fact that you used ret.reserve I assume it is), then the easy fix is to use `sformat` instead of `format`:

---
string toHex(string str)
{
    import std.format : sformat;
    import std.exception: assumeUnique;

    auto ret = new char[str.length * 2];
    size_t len;

    foreach (c; str)
    {
        auto slice = sformat!"%x"(ret[len..$], c);
        //auto slice = toHex(ret[len..$], c);
        assert (slice.length <= 2);
        len += slice.length;
    }

    return ret[0..len].assumeUnique;
}
---

If you want to cut out the format import entirely, notice the `auto slice = toHex...` line, which can be implemented like this (always returns two chars):

---
char[] toHex(char[] buf, char c)
{
    import std.ascii : lowerHexDigits;

    assert (buf.length >= 2);
    buf[0] = lowerHexDigits[(c & 0xF0) >> 4];
    buf[1] = lowerHexDigits[c & 0x0F];
    return buf[0..2];
}
---
Sep 02 2017
parent reply bitwise <bitwise.pvt gmail.com> writes:
On Saturday, 2 September 2017 at 17:45:30 UTC, Moritz Maxeiner 
wrote:
 
 If this (unnecessary waste) is of concern to you (and from the 
 fact that you used ret.reserve I assume it is), then the easy 
 fix is to use `sformat` instead of `format`:
Yes, thanks. I'm going to go with a variation of your approach:

private string toAsciiHex(string str)
{
    import std.ascii : lowerHexDigits;
    import std.exception: assumeUnique;

    auto ret = new char[str.length * 2];
    int i = 0;

    foreach(c; str) {
        ret[i++] = lowerHexDigits[(c >> 4) & 0xF];
        ret[i++] = lowerHexDigits[c & 0xF];
    }

    return ret.assumeUnique;
}

I'm not sure how the compiler would mangle UTF8, but I intend to use this on one specific function (actually the 100's of instantiations of it). It will be predictably named though.

Thanks!
Sep 02 2017
parent reply Moritz Maxeiner <moritz ucworks.org> writes:
On Saturday, 2 September 2017 at 18:07:51 UTC, bitwise wrote:
 On Saturday, 2 September 2017 at 17:45:30 UTC, Moritz Maxeiner 
 wrote:
 
 If this (unnecessary waste) is of concern to you (and from the 
 fact that you used ret.reserve I assume it is), then the easy 
 fix is to use `sformat` instead of `format`:
 Yes, thanks. I'm going to go with a variation of your approach:

 private string toAsciiHex(string str)
 {
     import std.ascii : lowerHexDigits;
     import std.exception: assumeUnique;

     auto ret = new char[str.length * 2];
     int i = 0;

     foreach(c; str) {
         ret[i++] = lowerHexDigits[(c >> 4) & 0xF];
         ret[i++] = lowerHexDigits[c & 0xF];
     }

     return ret.assumeUnique;
 }
If you never need the individual character function, that's probably the best in terms of readability, though with a decent compiler, that and the two-function version should result in the same machine code (except for the swapped bitshift and bitmask).
 I'm not sure how the compiler would mangle UTF8, but I intend 
 to use this on one specific function (actually the 100's of 
 instantiations of it).
In UTF8:

--- utfmangle.d ---
void fun_ༀ() {}
pragma(msg, fun_ༀ.mangleof);
-------------------

---
$ dmd -c utfmangle.d
_D6mangle7fun_ༀFZv
---

Only universal character names for identifiers are allowed, though, as per [1]

[1] https://dlang.org/spec/lex.html#identifiers
Sep 02 2017
next sibling parent reply bitwise <bitwise.pvt gmail.com> writes:
On Saturday, 2 September 2017 at 18:28:02 UTC, Moritz Maxeiner 
wrote:
 
 In UTF8:

 --- utfmangle.d ---
 void fun_ༀ() {}
 pragma(msg, fun_ༀ.mangleof);
 -------------------

 ---
 $ dmd -c utfmangle.d
 _D6mangle7fun_ༀFZv
 ---

 Only universal character names for identifiers are allowed, 
 though, as per [1]

 [1] https://dlang.org/spec/lex.html#identifiers
What I intend to do is this though:

void fun(string s)() {}
pragma(msg, fun!"ༀ".mangleof);

which gives:

_D7mainMod21__T3funVAyaa3_e0bc80Z3funFNaNbNiNfZv

where "e0bc80" is the 3 bytes of "ༀ".

The function will be internal to my library. The only thing provided from outside will be the string template argument, which is meant to represent a fully qualified type name.
Sep 02 2017
parent Moritz Maxeiner <moritz ucworks.org> writes:
On Saturday, 2 September 2017 at 20:02:37 UTC, bitwise wrote:
 On Saturday, 2 September 2017 at 18:28:02 UTC, Moritz Maxeiner 
 wrote:
 
 In UTF8:

 --- utfmangle.d ---
 void fun_ༀ() {}
 pragma(msg, fun_ༀ.mangleof);
 -------------------

 ---
 $ dmd -c utfmangle.d
 _D6mangle7fun_ༀFZv
 ---

 Only universal character names for identifiers are allowed, 
 though, as per [1]

 [1] https://dlang.org/spec/lex.html#identifiers
 What I intend to do is this though:

 void fun(string s)() {}
 pragma(msg, fun!"ༀ".mangleof);

 which gives:

 _D7mainMod21__T3funVAyaa3_e0bc80Z3funFNaNbNiNfZv

 where "e0bc80" is the 3 bytes of "ༀ".
Interesting, I wasn't aware of that (though after thinking about it, it does make sense, as identifiers can only contain visible characters, while a string could contain things such as control characters), thanks! That behaviour is defined here [1], btw (the line `CharWidth Number _ HexDigits`).

[1] https://dlang.org/spec/abi.html#Value
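For illustration, a minimal sketch that inverts that value encoding (decodeMangledString is a made-up name, not a library function; it assumes CharWidth 'a', i.e. one byte per code unit):

---
import std.conv : to;
import std.exception : assumeUnique;
import std.stdio : writeln;
import std.string : indexOf;

// Sketch only: decode the value part of a mangled string template
// argument, e.g. "a3_e0bc80" -> "ༀ".
string decodeMangledString(string m)
{
    assert(m.length > 2 && m[0] == 'a');
    auto sep = m.indexOf('_');
    auto len = m[1 .. sep].to!size_t; // Number: count of code units
    auto hex = m[sep + 1 .. $];       // HexDigits: two per code unit
    assert(hex.length == len * 2);

    auto buf = new char[len];
    foreach (i; 0 .. len)
        buf[i] = cast(char) hex[i * 2 .. i * 2 + 2].to!ubyte(16);
    return buf.assumeUnique;
}

void main()
{
    writeln(decodeMangledString("a3_e0bc80")); // prints "ༀ"
}
---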
Sep 02 2017
prev sibling parent bitwise <bitwise.pvt gmail.com> writes:
On Saturday, 2 September 2017 at 18:28:02 UTC, Moritz Maxeiner 
wrote:
 [...]
Code will eventually look something like the following. The point is to be able to retrieve the exported function at runtime only by knowing what the template arg would have been.

export extern(C) const(Reflection) dummy(string fqn)() { ... }

int main(string[] argv)
{
    enum ARG = "AAAAAA";
    auto hex = toAsciiHex(ARG);

    // original
    writeln(dummy!ARG.mangleof);

    // reconstructed at runtime
    auto remangled = dummy!"".mangleof;

    remangled = remangled.replaceFirst(
        "_D7mainMod17", "_D7mainMod" ~ (17 + hex.length).to!string);

    remangled = remangled.replaceFirst(
        "VAyaa0_", "VAyaa" ~ ARG.length.to!string ~ "_" ~ hex);

    writeln(remangled);
    return 0;
}
Sep 02 2017