digitalmars.D - Suggestion to improve phobos std.string.icmp function

Chad J (48/48) Jul 17 2006 I am modifying gphobos to support WinCE, and it just so happens that

Lionello Lunesu (8/8) Jul 18 2006 Hi,

Jarrett Billingsley (3/5) Jul 18 2006 In D, character types are always unsigned.

Chad J (6/17) Jul 18 2006 Yeah I was wondering about that. I know the characters I was testing on...

James Pelcis (5/10) Jul 18 2006 The spec does say that they are unsigned. There are probably other

Lionello Lunesu (6/13) Jul 18 2006 :-.

Chad J (14/25) Jul 18 2006 I took a look at std.uni.toUniLower. It looks susceptible to the same

Chad J <gamerChad _spamIsBad_gmail.com> writes:

I am modifying gphobos to support WinCE, and it just so happens that 
memicmp (used in std.string.icmp) is not available with the cegcc/newlib 
libraries that I have at my disposal for WinCE.  Then I noticed that 
memicmp is only used on Win32 platforms.  Unix platforms (linux in dmd 
phobos) seem to have a perfectly good fully cross-platform alternative 
written in D.  So my first suggestion is this: ditch the C dependency 
and what seems like an unnecessary version statement.

Then I started thinking, the memicmp thing might be an optimization or 
something if memicmp can outperform the D alternative.  So I decided to 
test it out.  I also thought it would be fun to try and write a version 
that is even more optimized than either of the two before, and I think 
I've succeeded.

The source file should be attached.  I compiled with dmd and build using 
the line 'build main.d -clean -O -inline -release -ofmain'.  I've tried 
without -release and it does change things (for me); not sure if Phobos 
is usually compiled with -release.

This is what my main.d gave me on the slower of the two computers I 
tested on (AMD Athlon XP 2600+ with WinXP):

The following tests involve a worst-case scenario where strings that 
have highly divergent capitalization are compared.
739947 - current Phobos on Win32 systems using memicmp function.
962118 - current Phobos on *nix systems using D implementation.
482975 - possible new implementation.
Begin tests where the strings have identical capitalization already.
410813 - current Phobos on Win32 systems using memicmp function.
410238 - current Phobos on *nix systems using D implementation.
238644 - possible new implementation.

This is what my main.d gave me on the other computer I tested on (AMD 
Turion 64 MT40 with WinXP):

The following tests involve a worst-case scenario where strings that 
have highly divergent capitalization are compared.
825528 - current Phobos on Win32 systems using memicmp function.
678905 - current Phobos on *nix systems using D implementation.
442500 - possible new implementation.
Begin tests where the strings have identical capitalization already.
355304 - current Phobos on Win32 systems using memicmp function.
369101 - current Phobos on *nix systems using D implementation.
309204 - possible new implementation.

I ran it a few more times on each system and the numbers stay pretty 
consistant.  The downside of the new implementation is that it takes an 
extra 256 bytes at runtime for a table lookup, and the table is 
generated at starttime.  Anyhow, I suggest/propose that this new routine 
(icmpNew in the source file, would be renamed icmp) replace icmp in 
std.string.

Also, the table it uses might be able to speedup other functions there 
(like tolower! hehe) but I haven't looked into that, and I probably 
won't unless someone asks me to or I run into another missing dependency 
in gphobos.

Jul 17 2006

Lionello Lunesu <lio lunesu.remove.com> writes:

Hi,

Beware that using a char to index into an array can cause access 
violation if the char's negative. You'll get [-1] and such. cast it to a 
byte.

Otherwise, seems like a good optimization. Either your code, or some 
more complicate function using the tolower from std.uni. Will be tricky 
to optimize that one.

L.

Jul 18 2006

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Lionello Lunesu" <lio lunesu.remove.com> wrote in message 
news:e9isbu$11ta$1 digitaldaemon.com...

 Beware that using a char to index into an array can cause access violation 
 if the char's negative. You'll get [-1] and such. cast it to a byte.

In D, character types are always unsigned.

Jul 18 2006

Chad J <gamerChad _spamIsBad_gmail.com> writes:

Jarrett Billingsley wrote:
 "Lionello Lunesu" <lio lunesu.remove.com> wrote in message 
 news:e9isbu$11ta$1 digitaldaemon.com...
 
 
Beware that using a char to index into an array can cause access violation 
if the char's negative. You'll get [-1] and such. cast it to a byte.

 
 
 In D, character types are always unsigned. 
 
 

Yeah I was wondering about that.  I know the characters I was testing on 
would all be negative if they were signed.  I worry though, if this 
can't be relied on then they should all be casted to ubytes, just to 
make sure.  Otherwise, if the spec says somewhere that chars are 
unsigned (I couldn't find it), then it's probably best to leave it.

Jul 18 2006

James Pelcis <jpelcis gmail.com> writes:

The spec does say that they are unsigned.  There are probably other 
places, but one place is in 
http://www.digitalmars.com/d/interfaceToC.html under "Data Type 
Compatibility."

Chad J wrote:
 Yeah I was wondering about that.  I know the characters I was testing on 
 would all be negative if they were signed.  I worry though, if this 
 can't be relied on then they should all be casted to ubytes, just to 
 make sure.  Otherwise, if the spec says somewhere that chars are 
 unsigned (I couldn't find it), then it's probably best to leave it.

Jul 18 2006

Lionello Lunesu <lio lunesu.remove.com> writes:

Jarrett Billingsley wrote:
 "Lionello Lunesu" <lio lunesu.remove.com> wrote in message 
 news:e9isbu$11ta$1 digitaldaemon.com...
 
 Beware that using a char to index into an array can cause access violation 
 if the char's negative. You'll get [-1] and such. cast it to a byte.

 
 In D, character types are always unsigned. 

:-.

That's what you get, mixing D and C++ on a daily basis. I should have 
known: I have the same problems with Romanian and Italian, mixing words, 
causing confusion.

L.

Jul 18 2006

Chad J <gamerChad _spamIsBad_gmail.com> writes:

Lionello Lunesu wrote:
 Hi,
 
 Beware that using a char to index into an array can cause access 
 violation if the char's negative. You'll get [-1] and such. cast it to a 
 byte.
 
 Otherwise, seems like a good optimization. Either your code, or some 
 more complicate function using the tolower from std.uni. Will be tricky 
 to optimize that one.
 
 L.

I took a look at std.uni.toUniLower.  It looks susceptible to the same 
optimization.  All but 2 if those range checks could be eliminated with 
a 2732 byte long table (an array of ushorts, going out to 0x0556). 
Getting rid of those last two ranges seems harder as it would require a 
much bigger table, well at least the last one (0xFF21 <= x <= 0xFF3A) 
would require a big table to include it.  There would need to be another 
2830 byte long table to handle toUniUpper as well, if that's desired. 
I'd use the current code in a the table generator that would be called 
at starttime (no bloat on the exe).  Just gotta know if the memory loss 
from the tables is acceptable.

Anyhow, I'm not going to write it and benchmark it now because I really 
should be plugging away at my arm-wince-pe port of D.  I'm getting 
distracted as usual :)

Jul 18 2006

D Programming

C/C++ Programming

Other

digitalmars.D - Suggestion to improve phobos std.string.icmp function