digitalmars.D.learn - maketrans and translate
- bearophile (12/12) Feb 12 2012 In the online docs I've seen that std.string.maketrans() is (going to be...
-
Stewart Gordon
(5/7)
Feb 13 2012
- bearophile (19/21) Feb 13 2012 A new version of the code:
- Jonathan M Davis (4/34) Feb 13 2012 Do you have data to backup that there is a significant speed difference?...
- bearophile (73/75) Feb 13 2012 I have just written a small benchmark, it contains both the URL to the t...
- Jonathan M Davis (12/36) Feb 13 2012 I'll look into the best way to handle it. Those functions aren't being
- bearophile (5/7) Feb 15 2012 I have added the topic to Bugzilla to avoid its oblivion:
In the online docs I've seen that std.string.maketrans() is (going to be) deprecated. How do you adapt this code to the new regime? import std.stdio, std.string; void main() { char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ", "0111222333445566677788899901112223334455666777888999"); auto text2 = text.translate(tab, "\""); writeln(text2); } Bye, bearophile
Feb 12 2012
On 13/02/2012 03:04, bearophile wrote:In the online docs I've seen that std.string.maketrans() is (going to be) deprecated. How do you adapt this code to the new regime?<snip> Use an associative array for the translation table. Or write your own functions that work in the same way as maketrans/translate. Stewart.
Feb 13 2012
Stewart Gordon:Use an associative array for the translation table.A new version of the code: import std.stdio, std.string, std.range; void main() { char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; auto text3 = text.translate(aa, "\""); writeln(text3); }Or write your own functions that work in the same way as maketrans/translate.The code with the associative array is longer, and every char in the original string requires one or two associative array lookups. This is quite inefficient for ASCII strings. Instead of writing my own string functions, or copying them from an older Phobos, what about the idea of keeping both the old maketrans and translate functions too and to un-deprecate them (maybe moving them in std.ascii)? They are not UTF-safe, but I have large amounts of biological data text that is not Unicode. Bye, bearophile
Feb 13 2012
On Monday, February 13, 2012 08:09:06 bearophile wrote:Stewart Gordon:Do you have data to backup that there is a significant speed difference? We're trying to not have any ASCII-only string stuff. - Jonathan M DavisUse an associative array for the translation table.A new version of the code: import std.stdio, std.string, std.range; void main() { char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; auto text3 = text.translate(aa, "\""); writeln(text3); }Or write your own functions that work in the same way as maketrans/translate.The code with the associative array is longer, and every char in the original string requires one or two associative array lookups. This is quite inefficient for ASCII strings. Instead of writing my own string functions, or copying them from an older Phobos, what about the idea of keeping both the old maketrans and translate functions too and to un-deprecate them (maybe moving them in std.ascii)? They are not UTF-safe, but I have large amounts of biological data text that is not Unicode.
Feb 13 2012
Jonathan M Davis:Do you have data to backup that there is a significant speed difference?I have just written a small benchmark, it contains both the URL to the test data and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me. version3 uses the old translate. version4 uses the new translate. version1InPlace is a reference point, it works in-place. version2InPlace is similar, uses a linear scan for the chars to remove, and it seems a bit slower than version1InPlace (probably because the small deltab array fits well in the L1 cache). If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string). import std.stdio, std.string, std.file, std.exception; char[] version1InPlace(char[] text) { immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ"; immutable t2 = "0111222333445566677788899901112223334455666777888999"; immutable delchars = "\""; char[256] transtab; foreach (i, ref char c; transtab) c = cast(char)i; foreach (i, char c; t1) transtab[c] = t2[i]; bool[256] deltab = false; foreach (char c; delchars) deltab[c] = true; int count; foreach (char c; text) if (!deltab[c]) { text[count] = transtab[c]; count++; } return text[0 .. count]; } char[] version2InPlace(char[] text) { immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ"; immutable t2 = "0111222333445566677788899901112223334455666777888999"; immutable delchars = "\""; char[256] transtab; foreach (i, ref char c; transtab) c = cast(char)i; foreach (i, char c; t1) transtab[c] = t2[i]; int count; outer: foreach (char c; text) { foreach (char d; delchars) if (c == d) continue outer; text[count] = transtab[c]; count++; } return text[0 .. count]; } string version3(char[] text) { auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ", "0111222333445566677788899901112223334455666777888999"); return text.translate(tab, "\""); } string version4(string text) { dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; return text.translate(aa, "\""); } void main() { // http://patriot.net/~bmcgin/kjv12.txt auto txt = cast(char[])read("kjv12.txt"); assert(version1InPlace(txt.dup) == version2InPlace(txt.dup)); assert(version1InPlace(txt.dup) == version3(txt.dup)); assert(version1InPlace(txt.dup) == version4(txt.idup)); static if (1) version1InPlace(txt); // 0.16 seconds static if (0) version2InPlace(txt); // 0.18 seconds static if (0) version3(txt); // 0.20 seconds static if (0) version4(assumeUnique(txt)); // 0.97 seconds }We're trying to not have any ASCII-only string stuff.Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good. Bye, bearophile
Feb 13 2012
On Monday, February 13, 2012 14:08:16 bearophile wrote:Jonathan M Davis:I'll look into the best way to handle it. Those functions aren't being deprecated this release regardless.Do you have data to backup that there is a significant speed difference?I have just written a small benchmark, it contains both the URL to the test data and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me. version3 uses the old translate. version4 uses the new translate. version1InPlace is a reference point, it works in-place. version2InPlace is similar, uses a linear scan for the chars to remove, and it seems a bit slower than version1InPlace (probably because the small deltab array fits well in the L1 cache). If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string).In most cases, it should make no difference. You could use 1, 2, 3, and 4 as easily as you could use A, T, C, and G. Most algorithms don't care one whit whether you're dealing with characters or numbers. The only problem that I see is cases where there's an algorithm which would normally only be done with characters and not numbers and which therefore is in std.string and not std.array or std.algorithm - and translate falls into that category. Regardless, in general, Phobos is not going to have ASCII-specific string processing. - Jonathan M DavisWe're trying to not have any ASCII-only string stuff.Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good.
Feb 13 2012
Jonathan M Davis:I'll look into the best way to handle it. Those functions aren't being deprecated this release regardless.I have added the topic to Bugzilla to avoid its oblivion: http://d.puremagic.com/issues/show_bug.cgi?id=7515 Bye, bearophile
Feb 15 2012