digitalmars.D - Shouldn't Phobos have a non-case sensitive find() and rfind()

David L. Davis <SpottedTiger yahoo.com> writes:

Shouldn't there be a non-case sensitive version of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the function(s) even more useful, unless of course we
already have this and I've missed it somehow looking thru the "D" docs.

Below I wrote some sample code I've tested, which sort of shows its usefulness.
I've indented the code this time with tabs, so hopefully it all appears readable
in my posted message. <*crosses-fingers*> ;)

import std.string;

int main()
{
    char[] sStr = "ApO 123355 PO Box 23, Waterpool Street Portland, Texas";

    printf("Case Insensitive ifind and irfind tests\n");
    printf("Test String sStr=%.*s\n\n", sStr);

    // Forward searches from the default and from explicit start positions
    printf("Default = 0, ifind \'PO\' in sStr, result=%d\n", ifind( sStr, "PO" ) );
    printf("StartPos= 2, ifind \'PO\' in sStr, result=%d\n", ifind( sStr, "PO", 2 ) );
    printf("StartPos=15, ifind \'PO\' in sStr, result=%d\n", ifind( sStr, "PO", 15 ) );
    printf("StartPos=33, ifind \'PO\' in sStr, result=%d\n\n", ifind( sStr, "PO", 33 ) );

    // Reverse searches from the default and from explicit end positions
    printf("Default = sStr.length - 1, irfind \'PO\' in sStr, result=%d\n", irfind( sStr, "PO" ) );
    printf("StartPos= 2, irfind \'PO\' in sStr, result=%d\n", irfind( sStr, "PO", 2 ) );
    printf("StartPos=15, irfind \'PO\' in sStr, result=%d\n", irfind( sStr, "PO", 15 ) );
    printf("StartPos=33, irfind \'PO\' in sStr, result=%d\n", irfind( sStr, "PO", 33 ) );

    return 0;

} // end-function int main( void )

// Case insensitive version of std.string.find
int ifind
(
    in char[] sStr,
    in char[] sSubStr
)
{
    return ifind( sStr, sSubStr, 0 );

} // end-function int ifind( char[], char[] )

// Case insensitive version of std.string.find
// with an optional "String Start Position" parameter.
int ifind
(
    in char[] sStr,
    in char[] sSubStr,
    in int    iStartPos
)
{
    char[] sTmpStr;
    int    iRtnVal;

    // Out of Boundary return not found.
    // The negative test comes first: sStr.length is unsigned, so
    // "sStr.length - 1" would wrap around for an empty string.
    if ( iStartPos < 0 ) return -1;
    if ( iStartPos >= sStr.length ) return -1;

    // Lower-case only the part of the string that is searched
    sTmpStr = tolower( sStr[ iStartPos .. sStr.length ] );

    iRtnVal = find( sTmpStr, tolower( sSubStr ) );

    // find() returns an index relative to the slice, so shift it back
    if ( iRtnVal != -1 )
        return iStartPos + iRtnVal;
    else
        return -1;
    // end-if

} // end-function int ifind( char[],char[], int )

// Case insensitive version of std.string.rfind
int irfind
(
    in char[] sStr,
    in char[] sSubStr
)
{
    return irfind( sStr, sSubStr, sStr.length - 1 );

} // end-function int irfind( char[], char[] )

// Case insensitive version of std.string.rfind
// with an optional "String End Position" parameter
int irfind
(
    in char[] sStr,
    in char[] sSubStr,
    in int    iEndPos
)
{
    char[] sTmpStr;

    // If "Out of Boundary" return not found.
    // The negative test comes first: sStr.length is unsigned, so
    // "sStr.length - 1" would wrap around for an empty string.
    if ( iEndPos < 0 ) return -1;
    if ( iEndPos >= sStr.length ) return -1;

    // Lower-case only the part of the string that is searched
    sTmpStr = tolower( sStr[ 0 .. iEndPos + 1 ] );

    return rfind( sTmpStr, tolower( sSubStr ) );

} // end-function int irfind( char[],char[], int )
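
For comparison, here is a minimal sketch that assumes a present-day D compiler and Phobos rather than the 2004 library the code above was written for. In current Phobos, std.string.indexOf and std.string.lastIndexOf accept a CaseSensitive flag, and indexOf also takes an optional start index, which covers much of what ifind and irfind do. The test string is the one from the post; the rest is only illustrative.

```d
import std.string : CaseSensitive, indexOf, lastIndexOf;

unittest
{
    string s = "ApO 123355 PO Box 23, Waterpool Street Portland, Texas";

    // Case-sensitive search only sees the exact "PO".
    assert(indexOf(s, "PO") == 11);

    // Case-insensitive forward search, optionally from a start index.
    // The result is always an index into the whole string.
    assert(indexOf(s, "PO", CaseSensitive.no) == 1);
    assert(indexOf(s, "PO", 2, CaseSensitive.no) == 11);
    assert(indexOf(s, "PO", 15, CaseSensitive.no) == 27);
    assert(indexOf(s, "PO", 33, CaseSensitive.no) == 39);

    // Case-insensitive search from the end of the string.
    assert(lastIndexOf(s, "PO", CaseSensitive.no) == 39);
}
```

The checks live in a unittest block, so something like "dmd -unittest -main -run file.d" should be enough to try them.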

Jun 03 2004

Sean Kelly <sean f4.ca> writes:

In article <c9ntu5$kch$1 digitaldaemon.com>, David L. Davis says...
Shouldn't there be a non-case sensitive version of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the function(s) even more useful, unless of course we
already have this and I've missed it somehow looking thru the "D" docs.

How about allowing the user to pass a comparison delegate?  Case means different
things in different languages.

Sean

Jun 03 2004

David L. Davis <SpottedTiger yahoo.com> writes:

In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...
In article <c9ntu5$kch$1 digitaldaemon.com>, David L. Davis says...
Shouldn't there be a non-case sensitive version of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the function(s) even more useful, unless of course we
already have this and I've missed it somehow looking thru the "D" docs.

How about allowing the user to pass a comparison delegate?  Case means different
things in different languages.

Sean

Sean: I hate to ask you this, but could you explain this a bit more. I'm not
sure I follow you. Thxs in advance. :)

Jun 03 2004

"Ivan Senji" <ivan.senji public.srce.hr> writes:

"David L. Davis" <SpottedTiger yahoo.com> wrote in message
news:c9o1k8$pp2$1 digitaldaemon.com...
 In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...
In article <c9ntu5$kch$1 digitaldaemon.com>, David L. Davis says...
Shouldn't there be a non-case sensitive version of find (ifind) and rfind
(irfind) in the Phobos std.string library? An additional position
parameter would also make the function(s) even more useful, unless of course we
already have this and I've missed it somehow looking thru the "D" docs.

How about allowing the user to pass a comparison delegate?  Case means different
things in different languages.

Sean

 Sean: I hate to ask you this, but could you explain this a bit more. I'm not
 sure I follow you. Thxs in advance. :)

The problem is that a universal case-insensitive function can't be written
because many languages have special letters (like čćžšđ, I don't know if you
will see these correctly).

So the idea would be something like:

findCaseSensitive(char[] str, char[] search, bool delegate(char,char) comparefunc);

I'm not sure about the prototype, but the idea is to give the find function
a comparefunc which decides if two characters are considered the same or not.
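
A rough sketch of that idea, written for a present-day D compiler rather than the 2004 one. The name findWith, its parameters and the -1 "not found" convention are all made up for illustration and are not part of Phobos; the ASCII-only comparison in the test is just one possible delegate.

```d
import std.ascii : toLower;

// Hypothetical helper, not a Phobos function: returns the index of the first
// match of `search` in `str`, or -1, using `same` to compare characters.
ptrdiff_t findWith(const(char)[] str, const(char)[] search,
                   bool delegate(char, char) same)
{
    if (search.length > str.length)
        return -1;

    foreach (i; 0 .. str.length - search.length + 1)
    {
        // Count how many characters match starting at position i.
        size_t j = 0;
        while (j < search.length && same(str[i + j], search[j]))
            ++j;
        if (j == search.length)
            return i;
    }
    return -1;
}

unittest
{
    // An ASCII-only notion of "same letter"; other rules can be swapped in.
    bool delegate(char, char) asciiFold =
        (char a, char b) => toLower(a) == toLower(b);

    assert(findWith("PO Box 23", "box", asciiFold) == 3);
    assert(findWith("PO Box 23", "xyz", asciiFold) == -1);
}
```

The caller decides what "same" means, so the search loop itself never needs to know about case.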

Jun 03 2004

Oskar Linde <d98-oliRE.MO.VE nada.kth.se> writes:

Ivan Senji wrote:

 The problem is that a universal case-insensitive function can't be written
 because many languages have special letters (like čćžšđ, I don't know if you
 will see these correctly).
 
 So the idea would be something like:
 
 findCaseSensitive(char[] str, char[] search, bool delegate(char,char) comparefunc);
 
 I'm not sure about the prototype, but the idea is to give the find function
 a comparefunc which decides if two characters are considered the same or not.

Since a D char[] is considered UTF-8 encoded, a delegate(char,char)
won't be enough. Sure, std.string.icmp(char[], char[]) only considers
English ASCII, but that behavior seems broken as char[]s really are
supposed to be UTF-8. std.string.cmp(char[], char[]) is broken in
that respect too (it uses memcmp()). However, some general kind of
locale handling is needed even for UTF-8, since some letters have different
orderings in different languages.

One solution is to use something like the C library's setlocale() and
change the comparison functions correspondingly. That way,
findCaseSensitive could be defined as
findCaseSensitive(char[] str, char[] search)
together with a global (hidden) locale state. I'm not sure how
thread-safe the C library locale is when multiple
locales are in use.  A different solution would be a String class
template that keeps track of the locale a string is represented in.

The best solution is probably to use delegate comparison functions
taking dchars or whatever, and also to make cmp and icmp versions taking such.
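
As a sketch of that last option, assuming a present-day D compiler and Phobos: the search below decodes UTF-8 with std.utf.decode and hands whole code points to the delegate. The name findByCodePoint and its byte-offset return value are invented for illustration, and the std.uni.toLower comparison in the test is only one possible rule.

```d
import std.uni : toLower;
import std.utf : decode, stride;

// Hypothetical helper: like a find() with a comparison delegate, but the
// delegate sees whole code points (dchar) instead of UTF-8 code units.
// Returns the byte offset of the first match in `str`, or -1.
ptrdiff_t findByCodePoint(string str, string search,
                          bool delegate(dchar, dchar) same)
{
    size_t start = 0;
    while (true)
    {
        size_t i = start, j = 0;
        bool matched = true;
        while (j < search.length)
        {
            // decode() reads one code point and advances the index.
            if (i >= str.length || !same(decode(str, i), decode(search, j)))
            {
                matched = false;
                break;
            }
        }
        if (matched)
            return start;
        if (start >= str.length)
            return -1;
        start += stride(str, start); // step to the next code point
    }
}

unittest
{
    bool delegate(dchar, dchar) fold =
        (dchar a, dchar b) => toLower(a) == toLower(b);

    // "Ä" and "ä" take two bytes each in UTF-8, so a per-char ASCII
    // comparison cannot pair them up; comparing decoded dchars can.
    assert(findByCodePoint("Größe ÄPFEL", "äpfel", fold) == 8);
    assert(findByCodePoint("Größe ÄPFEL", "birne", fold) == -1);
}
```

This still compares one code point against one code point, so it does not address locale-specific ordering or mappings that change the number of characters.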

/Oskar

Jun 03 2004

Sean Kelly <sean f4.ca> writes:

In article <c9o1k8$pp2$1 digitaldaemon.com>, David L. Davis says...
In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...

How about allowing the user to pass a comparison delegate?  Case means different
things in different languages.

Sean: I hate to ask you this, but could you explain this a bit more. I'm not
sure I follow you. Thxs in advance. :)

Some languages don't have upper and lowercase letters.  And many others don't
convert properly using the default routines, even if the ASCII character set
contains all the appropriate symbols.  So tolower(x)==tolower(y) may yield an
incorrect result if the string contains characters beyond the usual 52 ASCII
English letters.  I'd like to assume that a D string is a sequence of characters,
Unicode or otherwise, and I think it would be a mistake to provide methods that
don't work properly outside of ASCII English.  While I'm not much of an expert
on localization, I do think that the library should be designed with
localization in mind.

For a more thorough explanation, Scott Meyers discusses the problem in one of
his "Effective C++" books, the second one IIRC.

Sean

Jun 03 2004

David L. Davis <SpottedTiger yahoo.com> writes:

In article <c9o7nn$1360$1 digitaldaemon.com>, Sean Kelly says...
In article <c9o1k8$pp2$1 digitaldaemon.com>, David L. Davis says...
In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...

How about allowing the user to pass a comparison delegate?  Case means different
things in different languages.

Sean: I hate to ask you this, but could you explain this a bit more. I'm not
sure I follow you. Thxs in advance. :)

Some languages don't have upper and lowercase letters.  And many others don't
convert properly using the default routines, even if the ASCII character set
contains all the appropriate symbols.  So tolower(x)==tolower(y) may yield an
incorrect result if the string contains characters beyond the usual 52 ASCII
English letters.  I'd like to assume that a D string is a sequence of characters,
Unicode or otherwise, and I think it would be a mistake to provide methods that
don't work properly outside of ASCII English.  While I'm not much of an expert
on localization, I do think that the library should be designed with
localization in mind.

For a more thorough explanation, Scott Meyers discusses the problem in one of
his "Effective C++" books, the second one IIRC.

Sean

Sean: Correct me if I'm wrong, but looking at the Phobos std.string html online
information, it looks like std.string mainly handles "ASCII English", given
the way the consts are defined. Plus I've been playing around with these
functions for a few weeks, converting my VB6 ProperCase() to a "D" propercase()
function...and toupper() and tolower() use the below consts to do their work. If
a string is, say, all lower-case and it's passed into the tolower() function,
it will do nothing since none of the characters in the string are upper-case
based off the uppercase const. It also works the other way around: if an
all upper-case string is passed into toupper(), it will use the lowercase const
and likewise find nothing to change.

const char[] lowercase; 
"abcdefghijklmnopqrstuvwxyz" 

const char[] uppercase; 
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" 

const char[] letters; 
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 

const char[] whitespace; 
" \t\v\r\n\f" 

Off the subject just a little bit, why are the consts named in lowercase
characters, and not in the usual "C/C++" way where all constants are
named in uppercase characters?
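
The ASCII-only behaviour described above can be sketched roughly like this, assuming a present-day D compiler. asciiToLower is a made-up name and this is not the actual std.string source; it only shows why a routine keyed to the 26 letters of the uppercase const leaves everything else untouched.

```d
// Hypothetical ASCII-only lowercasing, in the spirit of the const-driven
// behaviour described above; it is not the actual std.string source.
char[] asciiToLower(const(char)[] s)
{
    char[] result = s.dup;
    foreach (ref c; result)
    {
        // Only the 26 letters listed in `uppercase` are touched.
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';
    }
    return result;
}

unittest
{
    assert(asciiToLower("Waterpool Street") == "waterpool street");
    // Input that is already lower-case comes back unchanged.
    assert(asciiToLower("po box") == "po box");
    // The bytes of a non-ASCII letter such as "Ä" are left alone.
    assert(asciiToLower("ÄB") == "Äb");
}
```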

Jun 03 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <c9nuts$ls2$1 digitaldaemon.com>, Sean Kelly says...
How about allowing the user to pass a comparison delegate?  Case means different
things in different languages.

Sean

The Unicode Standard defines the uppercasing, lowercasing and titlecasing of all
Unicode characters. There are regional variations only for Lithuanian, Turkish
and Azeri, otherwise it's a world standard. Casing rules can be found in the
document http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt, with
explanatory text on the Unicode web site.

One interesting feature of Unicode casing is that sometimes one character
becomes two characters. For instance, from SpecialCasing.txt:

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

This shows that the German eszett character (Unicode codepoint 0x00DF)
uppercases to the two-character sequence "SS".
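
To make the one-to-two mapping concrete, here is a small sketch assuming a present-day D compiler and Phobos. upperWithSpecialCasing is a hypothetical helper with a single hard-coded rule for the eszett; a real implementation would read every entry from SpecialCasing.txt instead.

```d
import std.array : appender;
import std.uni : toUpper;

// Hypothetical helper: per-code-point uppercasing plus one hand-written
// special case. A real implementation would load every rule from
// SpecialCasing.txt instead of hard-coding a single entry.
string upperWithSpecialCasing(string s)
{
    auto result = appender!string();
    foreach (dchar c; s)
    {
        if (c == '\u00DF')          // eszett expands to two characters
            result.put("SS");
        else
            result.put(toUpper(c)); // one-to-one mapping for the rest
    }
    return result.data;
}

unittest
{
    import std.utf : count;

    assert(upperWithSpecialCasing("straße") == "STRASSE");
    // The number of code points grows from 6 to 7.
    assert(count("straße") == 6);
    assert(count(upperWithSpecialCasing("straße")) == 7);
}
```

Because the result can be longer than the input, an in-place, character-by-character conversion is not enough for this kind of rule.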

Perhaps even more useful would be the Unicode Collation algorithm. Collation -
as opposed to casing - is definitely regional, so here you do need to specify
the manner in which you need to do your collating.

But anyway, the algorithms are all there. All we need is for someone to
implement them.

Arcane Jill

Jun 03 2004

"KTC" <me here.com> writes:

 Below I wrote some sample code I've tested, which sort of shows its usefulness.
 I've indented the code this time with tabs, so hopefully it all appears readable
 in my posted message. <*crosses-fingers*> ;)

Tabs don't show up in some newsreaders like OE. I'm afraid you need to use
spaces...

(Can't actually comment on anything to do with D as I haven't actually looked
at a single line of D code yet coz of university work and my lack of
knowledge of programming :S)

Jun 03 2004
