www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 7515] New: The new std.string.translate is slow for ASCII text

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515

           Summary: The new std.string.translate is slow for ASCII text
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc



std.string.maketrans() is going to be deprecated. This is a demo program, it's
meant to work on a good amount of pure 7-bit ASCII text.

The conversion table is the same as the one in this page:
http://digitalmars.com/d/1.0/lisp-java-d.html


import std.stdio, std.string;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    auto tab =
maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ",
                        
"0111222333445566677788899901112223334455666777888999");
    auto text2 = text.translate(tab, "\"");
    writeln(text2);
}


Using the new std.string.maketrans the code becomes something like:


import std.stdio, std.string, std.range;
void main() {
    char[] text = "dssdadsdasdas".dup; // lots of MBs of pure 7 bit ASCII text
    dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6',
    'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9',
    'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7',
    'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8',
    'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3',
    'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2',
    'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9'];
    auto text3 = text.translate(aa, "\"");
    writeln(text3);
}


The code with the associative array is _uglier_, _longer_ to write and read,
and every char in the original string requires one or two associative array
lookups. This is quite inefficient for sizeable ASCII strings.



Jonathan M Davis has asked me a benchmark:

 Do you have data to backup that there is a significant speed difference?
So I have written the following small benchmark, it contains both the URL to the test data (bible) and the timings I am seeing on a slow PC. If your PC is faster feel free to use it on more data (creating a larger input file on disk, concatenating many copies of the same test data). If you see bugs in the benchmark please tell me. Notes: - version3 uses the old translate; - version4 uses the new translate; - version1InPlace is a reference point, it works in-place; - version2InPlace is similar, uses a linear scan for the chars to remove, and it seems a bit slower than version1InPlace (probably because the small deltab array fits well in the L1 cache). If both the benchmark and the timings are correct, the old translate seems almost 5 times faster than the new one (and this is not strange, the new translate performs two associative array lookups for each char of the input string). import std.stdio, std.string, std.file, std.exception; char[] version1InPlace(char[] text) { immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ"; immutable t2 = "0111222333445566677788899901112223334455666777888999"; immutable delchars = "\""; char[256] transtab; foreach (i, ref char c; transtab) c = cast(char)i; foreach (i, char c; t1) transtab[c] = t2[i]; bool[256] deltab = false; foreach (char c; delchars) deltab[c] = true; int count; foreach (char c; text) if (!deltab[c]) { text[count] = transtab[c]; count++; } return text[0 .. count]; } char[] version2InPlace(char[] text) { immutable t1 = "ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ"; immutable t2 = "0111222333445566677788899901112223334455666777888999"; immutable delchars = "\""; char[256] transtab; foreach (i, ref char c; transtab) c = cast(char)i; foreach (i, char c; t1) transtab[c] = t2[i]; int count; outer: foreach (char c; text) { foreach (char d; delchars) if (c == d) continue outer; text[count] = transtab[c]; count++; } return text[0 .. count]; } string version3(char[] text) { auto tab = maketrans("ejnqrwxdsyftamcivbkulopghzEJNQRWXDSYFTAMCIVBKULOPGHZ", "0111222333445566677788899901112223334455666777888999"); return text.translate(tab, "\""); } string version4(string text) { dchar[dchar] aa = ['A':'5', 'a':'5', 'B':'7', 'b':'7', 'C':'6', 'c':'6', 'D':'3', 'd':'3', 'E':'0', 'e':'0', 'F':'4', 'f':'4', 'G':'9', 'g':'9', 'H':'9', 'h':'9', 'I':'6', 'i':'6', 'J':'1', 'j':'1', 'K':'7', 'k':'7', 'L':'8', 'l':'8', 'M':'5', 'm':'5', 'N':'1', 'n':'1', 'O':'8', 'o':'8', 'P':'8', 'p':'8', 'Q':'1', 'q':'1', 'R':'2', 'r':'2', 'S':'3', 's':'3', 'T':'4', 't':'4', 'U':'7', 'u':'7', 'V':'6', 'v':'6', 'W':'2', 'w':'2', 'X':'2', 'x':'2', 'Y':'3', 'y':'3', 'Z':'9', 'z':'9']; return text.translate(aa, "\""); } void main() { // http://patriot.net/~bmcgin/kjv12.txt auto txt = cast(char[])read("kjv12.txt"); assert(version1InPlace(txt.dup) == version2InPlace(txt.dup)); assert(version1InPlace(txt.dup) == version3(txt.dup)); assert(version1InPlace(txt.dup) == version4(txt.idup)); static if (1) version1InPlace(txt); // 0.16 seconds static if (0) version2InPlace(txt); // 0.18 seconds static if (0) version3(txt); // 0.20 seconds static if (0) version4(assumeUnique(txt)); // 0.97 seconds } Jonathan M Davis:
 We're trying to not have any ASCII-only string stuff.
Around there is a large amount of data that is essentially text, but it's ASCII, it's not Unicode. Like biological data, text data coming out of instruments, etc. Handling it as binary data is not good. A possibility is to keep both the old maketrans and translate functions, and un-deprecate them (maybe moving them in std.ascii?). Other ideas are welcome. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 15 2012
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515


Jonathan M Davis <jmdavisProg gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg gmx.com



PST ---
https://github.com/D-Programming-Language/phobos/pull/478

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 06 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515





 https://github.com/D-Programming-Language/phobos/pull/478
Thank you. Your patch seems nice. Two suggestions: 1) If they don't already say it, then I suggest to add in the ddoc of the functions that the translation arrays have a length 128. This for both old D programmers and Python programmers. 2) This code in translate(): + bool[128] remTable; + + remTable[] = false; I suggest to write it like this, that's shorter and avoids a double initialization: + bool[128] remTable = false; -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




PST ---
 1) If they don't already say it, then I suggest to add in the ddoc of the
 functions that the translation arrays have a length 128. This for both old D
 programmers and Python programmers.
I don't know. It seems to me that that's an implementation detail. makeTrans doesn't even return a static array. It returns a dynamic one. So, as long as the array passed to translate comes from makeTrans (as it should), it's complete non-issue. If anything, I'd argue for creating a struct (e.g. TransTable) which just held the dynamic array without giving access to it so that no one _can_ screw with it (since screwing with it is just going to break code). It seems to me that the only reason that makeTrans exists in the first place rather than just putting it in translate is so that you can reuse the translation table. No one should be messing with it or passing anything else to that version of translate. So, why does the length of the array matter to the user of makeTrans or translate? And actually, I'm _really_ tempted to just go and change it so that it returns a struct which you can't mess with and is only created by makeTrans. That completely resolves any possible issues caused by someone passing the wrong array to translate. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515






 It seems to me that the only reason that makeTrans exists in the first place
 rather than just putting it in translate is so that you can reuse the
 translation table. No one should be messing with it or passing anything else to
 that version of translate.
As several other string functions, the two functions we are talking about here come from Python2: http://docs.python.org/library/string.html#string-functions http://docs.python.org/library/string.html#string.translate The Python docs tell the size of the table too: string.translate(s, table[, deletechars]) Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed. In Python some times I have built the translation table manually. If makeTrans returns an opaque or privately defined struct this is not possible. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 06 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




PST ---
And _why_ would you ever need to build it manually? That's what makeTrans is
for.

The docs would have to explain exactly how the array is laid out for it to be
at all reasonable for the user to be screwing with it, and right now, they
definitely treat it as completely opaque. And without a really good reason why
the user would be screwing with the translation table on their own, I would
definitely argue that having it be opaque is a huge benefit, since it stops a
whole possible set of bugs caused by translate being passed an array which
doesn't conform to what it requires.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 06 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




I have thought some more about this and I have changed my mind: there is no
point to artificially restrict the usefulness of this function. So I suggest:

1) To not use a opaque struct, but leave the simple translation array.
2) To use a translation array of 256 items. So translate is usable for an
ubyte[] array input too (with a cast of the ubyte[] to char[] in the function
call if necessary).

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 11 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




PDT ---
If you want a truly generic mapping function which operates on ubyte and
creates an array which holds all of the possible values of that type, then it
makes no sense to be putting it in std.string. It should be more generic and be
in std.array (though given that it wouldn't really make sense to do what
makeTrans and translate are doing with anything larger than a ubyte, I question
how good an idea that would lbe).

If what you care about is strings, then there is no point in having an array
greater than 128 elements long, because there aren't any ASCII characters
greater than that. Having 256 is pointless and wasteful, because there won't be
any characters that large. Not to mention, the _whole_ reason that you were
complaining about this in the first place was because the unicode-aware version
of translate was slower than using maketrans/trnslate when dealing with pure
ASCII, and using 128 elements is going to be faster.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 11 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515






Thank you for your answers.

 If you want a truly generic mapping function which operates on ubyte and
 creates an array which holds all of the possible values of that type, then it
 makes no sense to be putting it in std.string. It should be more generic and be
 in std.array (though given that it wouldn't really make sense to do what
 makeTrans and translate are doing with anything larger than a ubyte, I question
 how good an idea that would lbe).
I have not asked for a truly generic function for std.array. So this problem doesn't happen.
 If what you care about is strings, then there is no point in having an array
 greater than 128 elements long, because there aren't any ASCII characters
 greater than that. Having 256 is pointless and wasteful, because there won't be
 any characters that large. Not to mention, the _whole_ reason that you were
 complaining about this in the first place was because the unicode-aware version
 of translate was slower than using maketrans/trnslate when dealing with pure
 ASCII, and using 128 elements is going to be faster.
Now I have had to use translate() for a different job: I have a good amount of 8-bit ASCII text in my language to process (http://en.wikipedia.org/wiki/Extended_ASCII ). It's not Unicode, and I don't think I can convert it to Unicode. I have to use translate() on this text. A 128-elements translate() isn't enough. The "waste" caused by using a 256 items array instead of a 128 items array is probably not significant. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Mar 11 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




PDT ---
Okay. It seems to me that this is getting a bit ridiculous. Phobos doesn't
support Extendend ASCII _anywhere_. Its support of ASCII-specific stuff is
already minimal. So, to set up a string function to take a string which is
_completely_ invalid as far as the rest of Phobos is concerned seems really off
to me. All you have to do with an ASCII-only version is make sure that you
don't pass it unicode characters. But with an Extended ASCII version, all of a
sudden you have invalid unicode sequences. The only support that Phobos gives
for non-unicode and non-ASCII encodings is std.encoding, and almost all of the
ASCII-specific stuff is in std.ascii.

The _only_ reason that I see to support this is the fact that the
implementation that we've had happens to.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 11 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




PDT ---
The fact that the version that's going away deals with a 256 element array and
therefore just so happens to work with Extended ASCII may be enough to justify
just making the new one do that as well, but I honestly, don't see Extended
ASCII as much of an argument when none of the other functions support it and
are likely to blow up if you use it. The general idea with Phobos functions is
that you will convert any non-unicode strings that you have to unicode before
processing them, and then when you're done, if you want a different encoding,
you convert it before writing it out to file or whatever you're doing with it.
It's really not the intention that Phobos in general will support non-unicode
encodings. Dealing with unicode is already complicated enough.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Mar 11 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




PDT ---
Okay. After some rethinking the issue, I've closed the old pull request and
created a new one:

https://github.com/D-Programming-Language/phobos/pull/609

It retains the ability to handle 256 code units rather than the 128 that ASCII
actually uses. What finally convinced me was the fact that if it takes the full
256 code units, then it could operate on full-on UTF-8 strings, replacing the
ASCII characters and leaving the unicode ones alone.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
May 28 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515






 It retains the ability to handle 256 code units rather than the 128 that ASCII
 actually uses.
Thank you :-) Please take also a look at Issue 8141 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
May 28 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515




Commits pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/08acc16b830b35572e6ea004d936a7b2522801b8
Fix issue 7515

I created an adjusted version of translate which is ASCII-only and
renamed maketrans to makeTrans with some minor changes. Instead of
deprecating the old translate and maketrans, maketrans is now a
deprecated alias of makeTrans, and the new ASCII-only translate should
be compatible with the old one.

https://github.com/D-Programming-Language/phobos/commit/94c06fb2469c0d98be842438b749e61571b42fe3


Fix issue 7515

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 10 2012
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7515


Jonathan M Davis <jmdavisProg gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 10 2012