digitalmars.D - Shouldn't Phobos have a non-case-sensitive find() and rfind()?

David L. Davis <SpottedTiger yahoo.com> writes:
Shouldn't there be case-insensitive versions of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the functions even more useful, unless of course we
already have this and I've somehow missed it while looking through the D docs.

Below is some sample code I've written and tested, which shows how this could be useful.
I've indented the code with tabs this time, so hopefully it all appears readable
in my posted message. <*crosses-fingers*> ;)

import std.string;

int main()
{
    char[] sStr = "ApO 123355 PO Box 23, Waterpool Street Portland, Texas";

    printf("Case Insensitive ifind and irfind tests\n");
    printf("Test String sStr=%.*s\n\n", sStr);

    // Forward searches, starting at different positions
    printf("Default = 0, ifind 'PO' in sStr, result=%d\n", ifind( sStr, "PO" ) );
    printf("StartPos= 2, ifind 'PO' in sStr, result=%d\n", ifind( sStr, "PO", 2 ) );
    printf("StartPos=15, ifind 'PO' in sStr, result=%d\n", ifind( sStr, "PO", 15 ) );
    printf("StartPos=33, ifind 'PO' in sStr, result=%d\n\n", ifind( sStr, "PO", 33 ) );

    // Reverse searches, ending at different positions
    printf("Default = sStr.length - 1, irfind 'PO' in sStr, result=%d\n", irfind( sStr, "PO" ) );
    printf("EndPos= 2, irfind 'PO' in sStr, result=%d\n", irfind( sStr, "PO", 2 ) );
    printf("EndPos=15, irfind 'PO' in sStr, result=%d\n", irfind( sStr, "PO", 15 ) );
    printf("EndPos=33, irfind 'PO' in sStr, result=%d\n", irfind( sStr, "PO", 33 ) );

    return 0;

} // end-function int main()

// Case insensitive version of std.string.find
int ifind
(
    in char[] sStr,
    in char[] sSubStr
)
{
    return ifind( sStr, sSubStr, 0 );

} // end-function int ifind( char[], char[] )

// Case insensitive version of std.string.find
// with an optional "String Start Position" parameter.
int ifind
(
    in char[] sStr,
    in char[] sSubStr,
    in int    iStartPos
)
{
    int iRtnVal;

    // Out of bounds: return not found. The negative check comes first,
    // because comparing a negative int with the unsigned length would wrap.
    if ( iStartPos < 0 || iStartPos >= sStr.length ) return -1;

    // Search the lowercased tail of the string, from iStartPos onwards.
    iRtnVal = find( tolower( sStr[ iStartPos .. sStr.length ] ), tolower( sSubStr ) );

    // Shift a hit back so it is an index into the whole string.
    return ( iRtnVal != -1 ) ? iStartPos + iRtnVal : -1;

} // end-function int ifind( char[], char[], int )

// Case insensitive version of std.string.rfind
int irfind
(
    in char[] sStr,
    in char[] sSubStr
)
{
    return irfind( sStr, sSubStr, sStr.length - 1 );

} // end-function int irfind( char[], char[] )

// Case insensitive version of std.string.rfind
// with an optional "String End Position" parameter
int irfind
(
    in char[] sStr,
    in char[] sSubStr,
    in int    iEndPos
)
{
    // Out of bounds: return not found. The negative check comes first,
    // because comparing a negative int with the unsigned length would wrap.
    if ( iEndPos < 0 || iEndPos >= sStr.length ) return -1;

    // Search the lowercased head of the string, up to and including iEndPos.
    return rfind( tolower( sStr[ 0 .. iEndPos + 1 ] ), tolower( sSubStr ) );

} // end-function int irfind( char[], char[], int )
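For readers on a current D compiler, the following is a minimal sketch of the same idea. It assumes today's Phobos (D2), which postdates this thread, where std.string.indexOf and std.string.lastIndexOf accept a CaseSensitive flag and indexOf also accepts a start index. The names ifind and irfind are kept from the post above; the wrappers are illustrative only and are not part of Phobos.

```d
import std.stdio : writeln;
import std.string : indexOf, lastIndexOf, CaseSensitive;

// Hypothetical wrapper: case-insensitive search starting at index `start`.
// Returns -1 when nothing is found or `start` is past the end.
ptrdiff_t ifind(const(char)[] s, const(char)[] sub, size_t start = 0)
{
    return indexOf(s, sub, start, CaseSensitive.no);
}

// Hypothetical wrapper: case-insensitive search for the last occurrence.
ptrdiff_t irfind(const(char)[] s, const(char)[] sub)
{
    return lastIndexOf(s, sub, CaseSensitive.no);
}

void main()
{
    auto s = "ApO 123355 PO Box 23, Waterpool Street Portland, Texas";
    writeln(ifind(s, "PO"));     // first match, searching from the start
    writeln(ifind(s, "PO", 2));  // first match at or after index 2
    writeln(irfind(s, "PO"));    // last match in the whole string
}
```

Unlike the tolower()-based version, this does not allocate lowercased copies of both strings for every search.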
Jun 03 2004
Sean Kelly <sean f4.ca> writes:
In article <c9ntu5$kch$1 digitaldaemon.com>, David L. Davis says...
Shouldn't there be case-insensitive versions of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the functions even more useful, unless of course we
already have this and I've somehow missed it while looking through the D docs.

How about allowing the user to pass a comparison delegate? Case means different things in different languages.

Sean
Jun 03 2004
David L. Davis <SpottedTiger yahoo.com> writes:
In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...
In article <c9ntu5$kch$1 digitaldaemon.com>, David L. Davis says...
Shouldn't there be case-insensitive versions of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the functions even more useful, unless of course we
already have this and I've somehow missed it while looking through the D docs.
How about allowing the user to pass a comparison delegate? Case means different things in different languages. Sean

Sean: I hate to ask you this, but could you explain this a bit more? I'm not sure I follow you. Thxs in advance. :)
Jun 03 2004
"Ivan Senji" <ivan.senji public.srce.hr> writes:
"David L. Davis" <SpottedTiger yahoo.com> wrote in message
news:c9o1k8$pp2$1 digitaldaemon.com...
In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...
In article <c9ntu5$kch$1 digitaldaemon.com>, David L. Davis says...
Shouldn't there be case-insensitive versions of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the functions even more useful, unless of course we
already have this and I've somehow missed it while looking through the D docs.
How about allowing the user to pass a comparison delegate? Case means different things
in different languages.

Sean
Sean: I hate to ask you this, but could you explain this a bit more? I'm not sure I follow you. Thxs in advance. :)

The problem is that a universal case-insensitive function can't be written, because many languages have special letters (like čćžšđ; I don't know if you will see these correctly).

So the idea would be something like:

findCaseSensitive(char[] str, char[] search, bool delegate(char,char) comparefunc);

I'm not sure about the prototype, but the idea is to give the find function a comparefunc which decides whether two characters are considered the same or not.
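The delegate idea can be sketched roughly as follows, assuming a current D2 compiler. The name findWith and its exact signature are hypothetical (the post itself is unsure about the prototype), and the search is a naive scan rather than anything tuned.

```d
import std.stdio : writeln;
static import std.ascii;

// Hypothetical helper: naive substring search in which the caller-supplied
// `same` delegate decides whether two chars count as equal.
// Returns the index of the first match, or -1 if there is none.
ptrdiff_t findWith(const(char)[] str, const(char)[] search,
                   bool delegate(char, char) same)
{
    if (search.length > str.length) return -1;
    foreach (i; 0 .. str.length - search.length + 1)
    {
        size_t j = 0;
        while (j < search.length && same(str[i + j], search[j])) ++j;
        if (j == search.length) return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    // An ASCII-only notion of "same letter"; other rules can be swapped in.
    auto pos = findWith("PO Box 23", "box",
        (char a, char b) => std.ascii.toLower(a) == std.ascii.toLower(b));
    writeln(pos);
}
```

The search routine itself knows nothing about case; all of the language-specific policy lives in the delegate.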
Jun 03 2004
Oskar Linde <d98-oliRE.MO.VE nada.kth.se> writes:
Ivan Senji wrote:

 The problem is that a universal case-insensitive function can't be written,
 because many languages have special letters (like čćžšđ; I don't know if you
 will see these correctly).

 So the idea would be something like:

 findCaseSensitive(char[] str, char[] search, bool delegate(char,char) comparefunc);

 I'm not sure about the prototype, but the idea is to give the find function
 a comparefunc which decides whether two characters are considered the same or not.

Since a D char[] is considered UTF-8 encoded, a delegate(char,char) won't be enough. Sure, std.string.icmp(char[], char[]) only considers English ASCII, but that behavior seems broken, as char[]s really are supposed to be UTF-8. std.string.cmp(char[], char[]) is broken in that respect too (it uses memcmp()).

However, some general kind of locale handling is needed even for UTF-8. Some letters have different ordering in different languages. One solution is to use something like the C library's setlocale() and change the comparison functions correspondingly. That way, findCaseSensitive could be defined as findCaseSensitive(char[] str, char[] search) together with a global (hidden) locale state. I'm not sure how thread-safe the C library locale is when multiple locales are in use.

A different solution would be to use a String class template to keep track of the locale a string is represented in.

The best solution is probably to use delegate comparison functions taking dchars or whatever, and also to provide cmp and icmp versions taking such delegates.

/Oskar
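A rough sketch of the dchar-based variant, assuming a current D2 compiler and its std.uni module (which did not exist when this was written). The names codePoints and findWithD are hypothetical. The sketch compares one code point at a time, so it does not address locale rules or one-to-many case mappings.

```d
import std.stdio : writeln;
static import std.uni;

// Decode UTF-8 into code points; a foreach with a dchar loop variable
// performs the decoding.
dchar[] codePoints(const(char)[] s)
{
    dchar[] result;
    foreach (dchar c; s) result ~= c;
    return result;
}

// Hypothetical helper: like a delegate-based find, but the delegate sees
// whole code points. Returns a code point index (not a byte offset), or -1.
ptrdiff_t findWithD(const(char)[] str, const(char)[] search,
                    bool delegate(dchar, dchar) same)
{
    auto hay = codePoints(str);
    auto needle = codePoints(search);
    if (needle.length > hay.length) return -1;
    foreach (i; 0 .. hay.length - needle.length + 1)
    {
        size_t j = 0;
        while (j < needle.length && same(hay[i + j], needle[j])) ++j;
        if (j == needle.length) return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    // "une école" searched for "ÉCOLE", written with escapes to stay
    // independent of the source file's encoding.
    auto pos = findWithD("une \u00E9cole", "\u00C9COLE",
        (dchar a, dchar b) => std.uni.toLower(a) == std.uni.toLower(b));
    writeln(pos);
}
```

With a delegate(char,char), the two bytes that encode "é" would be compared separately, which is the problem described above.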
Jun 03 2004
Sean Kelly <sean f4.ca> writes:
In article <c9o1k8$pp2$1 digitaldaemon.com>, David L. Davis says...
In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...

How about allowing the user to pass a comparison delegate?  Case means different things
in different languages.
Sean: I hate to ask you this, but could you explain this a bit more? I'm not sure I follow you. Thxs in advance. :)

Some languages don't have upper and lowercase letters. And many others don't convert properly using the default routines, even if the ASCII character set contains all the appropriate symbols. So tolower(x)==tolower(y) may yield the incorrect result if the string contains characters beyond the usual 52 ASCII English values.

I'd like to assume that a D string is a sequence of characters, Unicode or otherwise, and I think it would be a mistake to provide methods that don't work properly outside of ASCII English. While I'm not much of an expert on localization, I do think that the library should be designed with localization in mind. For a more thorough explanation, Scott Meyers discusses the problem in one of his "Effective C++" books, the second one IIRC.

Sean
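To make the tolower(x)==tolower(y) concern concrete, here is a small sketch assuming a current D2 compiler, where std.ascii.toLower leaves anything outside ASCII unchanged and std.uni.toLower applies the Unicode simple case mapping. The helper names sameAscii and sameUnicode are hypothetical.

```d
import std.stdio : writeln;
static import std.ascii;
static import std.uni;

// Hypothetical helper: equality after ASCII-only lowercasing.
bool sameAscii(dchar a, dchar b)
{
    return std.ascii.toLower(a) == std.ascii.toLower(b);
}

// Hypothetical helper: equality after Unicode simple lowercasing.
bool sameUnicode(dchar a, dchar b)
{
    return std.uni.toLower(a) == std.uni.toLower(b);
}

void main()
{
    writeln(sameAscii('A', 'a'));             // both are ASCII letters
    writeln(sameAscii('\u00C9', '\u00E9'));   // É and é, outside ASCII
    writeln(sameUnicode('\u00C9', '\u00E9')); // same pair, Unicode mapping
}
```

Even the Unicode version is locale-neutral, so it would not cover the language-specific cases mentioned above.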
Jun 03 2004
David L. Davis <SpottedTiger yahoo.com> writes:
In article <c9o7nn$1360$1 digitaldaemon.com>, Sean Kelly says...
In article <c9o1k8$pp2$1 digitaldaemon.com>, David L. Davis says...
In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...

How about allowing the user to pass a comparison delegate?  Case means different things
in different languages.
Sean: I hate to ask you this, but could you explain this a bit more? I'm not sure I follow you. Thxs in advance. :)
Some languages don't have upper and lowercase letters. And many others don't convert properly using the default routines, even if the ASCII character set contains all the appropriate symbols. So tolower(x)==tolower(y) may yield the incorrect result if the string contains characters beyond the usual 52 ASCII English values.

I'd like to assume that a D string is a sequence of characters, Unicode or otherwise, and I think it would be a mistake to provide methods that don't work properly outside of ASCII English. While I'm not much of an expert on localization, I do think that the library should be designed with localization in mind. For a more thorough explanation, Scott Meyers discusses the problem in one of his "Effective C++" books, the second one IIRC.

Sean

Sean: Correct me if I'm wrong, but looking at the online HTML documentation for Phobos std.string, it looks like std.string mainly handles "ASCII English", given the way the consts are defined. I've also been playing around with these functions for a few weeks, converting my VB6 ProperCase() to a D propercase() function, and toupper() and tolower() use the consts below to do their work. If a string is, say, all lower-case and it's passed into tolower(), the function will do nothing, since none of the characters in the string are upper-case according to the uppercase const. It also works the other way around: if an all upper-case string is passed into toupper(), it will use the lowercase const.

const char[] lowercase;   "abcdefghijklmnopqrstuvwxyz"
const char[] uppercase;   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
const char[] letters;     "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
const char[] whitespace;  " \t\v\r\n\f"

Off the subject just a little bit: why are the consts named in lowercase characters, and not in the standard C/C++ way, where all consts are named in uppercase characters?
Jun 03 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...
How about allowing the user to pass a comparison delegate?  Case means different things
in different languages.

Sean
The Unicode Standard defines the uppercasing, lowercasing and titlecasing of all Unicode characters. There are regional variations only for Lithuanian, Turkish and Azeri; otherwise it's a world standard. Casing rules can be found in the document http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt, with explanatory text on the Unicode web site.

One interesting feature of Unicode casing is that sometimes one character becomes two characters. For instance, SpecialCasing.txt shows that the German eszett character (Unicode codepoint 0x00DF) uppercases to the two-character sequence "SS".

Perhaps even more useful would be the Unicode Collation Algorithm. Collation, as opposed to casing, is definitely regional, so here you do need to specify the manner in which you need to do your collating.

But anyway, the algorithms are all there. All we need is for someone to implement them.

Arcane Jill
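As an illustration of why a one-character-in, one-character-out interface cannot express this, here is a minimal sketch assuming a current D2 compiler. The function upperWithSpecialCasing is hypothetical and hard-codes only the eszett rule mentioned above; a real implementation would be driven by the full SpecialCasing.txt data.

```d
import std.stdio : writeln;
static import std.uni;

// Hypothetical helper: uppercases a UTF-8 string, special-casing only
// U+00DF (eszett), which expands to the two characters "SS".
string upperWithSpecialCasing(string s)
{
    string result;
    foreach (dchar c; s)           // decodes UTF-8 into code points
    {
        if (c == '\u00DF')
            result ~= "SS";        // one code point in, two out
        else
            result ~= std.uni.toUpper(c); // simple one-to-one mapping
    }
    return result;
}

void main()
{
    // "straße", written with an escape for the eszett.
    writeln(upperWithSpecialCasing("stra\u00DFe"));
}
```

Because the result can contain more characters than the input, casing has to operate on strings rather than on individual characters.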
Jun 03 2004
"KTC" <me here.com> writes:
 Below is some sample code I've written and tested, which shows how this could be useful.
 I've indented the code with tabs this time, so hopefully it all appears readable
 in my posted message. <*crosses-fingers*> ;)

Tabs don't show up in some newsreaders, like OE. I'm afraid you need to use spaces... (I can't actually comment on anything to do with D, as I haven't looked at a single line of D code yet because of university work and my lack of programming knowledge :S)
Jun 03 2004