digitalmars.D - arrays and strings

Berin Loritsch <bloritsch d-haven.org> writes:
All this talk about unicode made it clear that using a straight array
may not be the right tool for string handling.  Sure the most common
operations can be done on an array (concatenation, sub-arrays, etc.).
However, if we are to assume any kind of encoding support other than
ASCII, it is simply not safe unless we are talking about "dchar" arrays.

For example, logically speaking I may want to get the second and third
characters of this string (UTF8): 彼は来る (only four characters).  It
is the Japanese text for "kyo kimasu" (he comes).  I'm into martial
arts, so I can't get away from the Japanese language (it is tied to
what I study)--even though I can't really speak a lick.

Now, tell me what I would get in a UTF8 environment:

char[] kyokimasu = "彼は来る";
char[] test = kyokimasu[1..3];

assert("は来" == test);

I guarantee you the assertion would fail.  Why?  Because plain array
slicing indexes bytes and does not take multibyte encoding into account.
Each of these characters takes three bytes in UTF8, so I will get only
part of the first character's encoding.
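
To make the byte arithmetic concrete, here is a minimal sketch in present-day D. It is an illustration under assumptions, not code from the thread: the `string` type and the `std.utf` helpers `count` and `validate` are taken from current Phobos, and the offsets 3 and 9 follow from each of the four characters occupying three UTF-8 code units.

```d
// Build and run with: dmd -unittest -main -run example.d
import std.exception : assertThrown;
import std.utf : count, validate, UTFException;

unittest
{
    string s = "彼は来る";

    assert(s.length == 12);  // array length counts UTF-8 code units (bytes)
    assert(count(s) == 4);   // count() decodes and counts code points

    // Code units 1 and 2 are the two continuation bytes of the first
    // character, so this slice is not valid UTF-8 on its own.
    assertThrown!UTFException(validate(s[1 .. 3]));

    // The second and third characters sit at code-unit offsets 3 up to 9.
    assert(s[3 .. 9] == "は来");
}
```

The slice the original example wanted is therefore `[3 .. 9]`, which only works if something translates character positions into code-unit offsets first.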

Any UTF aware system would either need to build this knowledge into the
language (bad idea IMO), or have a string type take care of that info
for you.  Things are a bit better with wchar[] (I'm not sure, but I
think the above will pass)--but there are still some cases where one
character needs more than one wchar (surrogate pairs).

Not to mention the UTF8 string listed above would be 12 bytes long,
more than the 8 bytes of the wchar[] version.
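
A short sketch of the size comparison, again in present-day D and only as an illustration. The `wstring`/`dstring` types and the G clef character U+1D11E are choices made for this example, not something from the thread. It shows that index slicing happens to work for this particular wchar[] text, and where it stops working.

```d
// Build and run with: dmd -unittest -main -run example.d
unittest
{
    wstring w = "彼は来る"w;
    assert(w.length == 4);                 // four UTF-16 code units
    assert(w.length * wchar.sizeof == 8);  // eight bytes in memory
    assert(w[1 .. 3] == "は来"w);          // index slicing works for this text

    // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane
    // and needs a surrogate pair, so index and character diverge again.
    wstring clef = "\U0001D11E"w;
    assert(clef.length == 2);

    // UTF-32 uses one code unit per code point, at four bytes each.
    dstring d = "彼は来る"d;
    assert(d.length == 4);
    assert(d.length * dchar.sizeof == 16);
}
```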

The only way to make it work seamlessly is to have a string class that
would make the proper adjustments.  Of course this would also affect
the speed demons here.

I think having something generally useful for internationalization is
very important, or we shoot ourselves in the foot ("we want D to succeed,
as long as you speak English" does not make sense).  General purpose i18n
and l10n are not easy to do by any stretch--but I think it is generally
agreed that they would have to be done in libraries.

I just don't think we can rely on D's native (up to now) way of dealing
with String manipulation.
Aug 31 2004
Sebastian Beschke <s.beschke gmx.de> writes:
Berin Loritsch wrote:
 For example, logically speaking I may want to get the second and third
 characters of this string (UTF8): 彼は来る (only four characters).  It
 is the Japanese text for "kyo kimasu" (he comes).  I'm into martial
 arts, so I can't get away from the Japanese language (it is tied to
 what I study)--even though I can't really speak a lick.
Not really. From what my limited Japanese abilities tell me, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.

-Sebastian
Aug 31 2004
Sebastian Beschke <s.beschke gmx.de> writes:
Sebastian Beschke wrote:
 don't think the kanji 彼 can be pronounced "kyo", if you look at this 
 page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D
Whoops, the link doesn't work. Nevermind.
Aug 31 2004
Berin Loritsch <bloritsch d-haven.org> writes:
Blasted electronic translators...
Aug 31 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
This is what dchar[] is for. With dchar[], array indexing === character
indexing.
A couple of helper functions in std.string
 char[] slice(char[] str, int a, int b); // slice characters a to b, not index a to b
 wchar[] slice(wchar[] str, int a, int b);
would also be nice for those cases when one doesn't want to convert to
dchar[]. Maybe such functions are already in Phobos somewhere? I haven't
looked too hard.
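
As a rough illustration of the dchar[] route: the sketch below converts to UTF-32, slices by plain index, and converts back. It assumes the `toUTF32` and `toUTF8` functions from current `std.utf` and the `string`/`dstring` types of present-day D. The conversion copies the whole string, which is the cost the proposed slice() helpers would avoid.

```d
// Build and run with: dmd -unittest -main -run example.d
import std.utf : toUTF32, toUTF8;

unittest
{
    string s = "彼は来る";

    // One dchar per code point, so array index equals character index.
    dstring d = toUTF32(s);
    assert(d.length == 4);
    assert(d[1 .. 3] == "は来"d);

    // Convert the slice back when UTF-8 is needed again.
    assert(toUTF8(d[1 .. 3]) == "は来");
}
```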
Aug 31 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
Actually, now that I think about it, another way to slice from character a to
b is to have a function that returns the index of the nth character:
 int character(char[] str, int n);
and then slicing is
 str[character(str, a) .. character(str, b)];
That is probably better than special slicing functions.
Aug 31 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
OK - enough replying to myself, I know, I know. Here's the code implementing
what I'm talking about:

import std.utf;

// Return the index in str of the first code unit of the n-th character
// (counting from 0). Each decode() call advances i past one whole character.
size_t character(char[] str, size_t n) {
  size_t i = 0;
  while (n--) {
    decode(str,i);
  }
  return i;
}

// Same for UTF-16: decode() steps over a surrogate pair as one character.
size_t character(wchar[] str, size_t n) {
  size_t i = 0;
  while (n--) {
    decode(str,i);
  }
  return i;
}
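
For comparison, current Phobos ships a function with the same job as character(): `std.utf.toUTFindex` maps a code-point position to a code-unit index. The sketch below is an illustration in present-day D rather than part of the thread, and the test string containing U+1D11E is an assumption chosen to exercise a surrogate pair.

```d
// Build and run with: dmd -unittest -main -run example.d
import std.utf : toUTFindex;

unittest
{
    string s = "彼は来る";

    // Translate character positions 1 and 3 into code-unit offsets.
    immutable first = toUTFindex(s, 1);
    immutable last = toUTFindex(s, 3);
    assert(first == 3);
    assert(last == 9);
    assert(s[first .. last] == "は来");

    // UTF-16: the leading character is a surrogate pair (two code units),
    // so the second character starts at index 2.
    wstring w = "\U0001D11Eは来"w;
    assert(toUTFindex(w, 1) == 2);
}
```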
Aug 31 2004
Nick <Nick_member pathlink.com> writes:
In article <ch26je$sq4$1 digitaldaemon.com>, Ben Hinkle says...
 actually now that I think about it another way to slice from character a to
 b is to have a function that returns the index of the nth character:
  int character(char[] str, int n);
 and then slicing is
  str[character(str, a) .. character(str, b)];
 That is probably better than special slicing functions.
It's more flexible, but it is slightly slower. The two calls to character() will
each parse the string from the start, while a splice() function could do it in
one run.

Nick
Aug 31 2004
Nick <Nick_member pathlink.com> writes:
In article <ch2i6t$13ma$1 digitaldaemon.com>, Nick says...
 It's more flexible, but it is slightly slower. The two calls to character() will
 parse the string once each, while a splice() function could do it in one run.
Err, that should be slice() :-)

Nick
Aug 31 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
"Nick" <Nick_member pathlink.com> wrote in message
news:ch2i6t$13ma$1 digitaldaemon.com...
 It's more flexible, but it is slightly slower. The two calls to
 character() will parse the string once each, while a splice() function
 could do it in one run.

Good point. Plus it is less typing. So here's version 2:

import std.utf;

size_t character(char[] str, size_t n, size_t i = 0) {
  while (n--) {
    decode(str,i);
  }
  return i;
}

size_t character(wchar[] str, size_t n, size_t i = 0) {
  while (n--) {
    decode(str,i);
  }
  return i;
}

char[] slice(char[] str, size_t a, size_t b) {
  size_t ai = character(str,a);
  size_t bi = character(str,b-a,ai);
  return str[ai .. bi];
}

wchar[] slice(wchar[] str, size_t a, size_t b) {
  size_t ai = character(str,a);
  size_t bi = character(str,b-a,ai);
  return str[ai .. bi];
}
Aug 31 2004
Berin Loritsch <bloritsch d-haven.org> writes:
Considering the code is not as straightforward as I am used to:
what the character() method is doing is decoding the string one
character at a time, starting from the passed-in index.  The index (i)
is only used to resume where you may have left off.  Ok.  So we have a
little optimization here so that we don't double-decode something...

It seemed a bit odd to me to do the b-a subtraction in the slice
method, but then I realized what you were doing (resuming from the
last point).
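
The resume logic can be traced with a small sketch. This is an illustration in present-day D, not the posted code verbatim: the parameter is typed `const(char)[]` (an assumption made here so that a string literal can be passed), and the offsets in the comments follow from each character in the sample being three UTF-8 code units long.

```d
// Build and run with: dmd -unittest -main -run example.d
import std.utf : decode;

// Version 2 of character() from the thread, with const added to the parameter.
size_t character(const(char)[] str, size_t n, size_t i = 0)
{
    while (n--)
        decode(str, i);   // advances i past one whole character
    return i;
}

unittest
{
    string s = "彼は来る";

    // What slice(s, 1, 3) does internally:
    size_t ai = character(s, 1);          // skip 1 character from offset 0
    size_t bi = character(s, 3 - 1, ai);  // skip b-a = 2 more, resuming at ai
    assert(ai == 3);
    assert(bi == 9);

    // Resuming gives the same end offset as counting from the start again.
    assert(bi == character(s, 3));
    assert(s[ai .. bi] == "は来");
}
```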

Of course this also assumes that someone didn't put in bad data
like:

slice(mystr, 5, 4);

Not to mention you could genericise the functions since they are
identical except for the element type of the array.

I suppose that is why the C++ string class is templated (so you can
use wchar instead of char).

The decode method would actually be different though based on the
type.

Aug 31 2004
Regan Heath <regan netwin.co.nz> writes:
On Tue, 31 Aug 2004 17:10:50 -0400, Berin Loritsch <bloritsch d-haven.org> 
wrote:
 Considering the code is not as straight forward as I am used to,
 what the character() method is doing is decoding the string byte
 by byte using the passed in index.  The index (i) is only used
 to resume where you may have left off.  Ok.  So we have a little
 optimization here so that we don't double-decode something...
Clever optimisation.
 It seemed a bit odd to me to do the b-a subtraction in the slice
 method, but then I realized what you were doing (resuming from the
 last point).
Yeah.. it took me a while too.
 Of course this also assumes that someone didn't put in bad data
 like:

 slice(mystr, 5, 4);
A perfect opportunity for DbC, e.g.

char[] slice(char[] str, size_t a, size_t b)
in {
  assert(b > a); // b >= a?
}
body {
  size_t ai = character(str,a);
  size_t bi = character(str,b-a,ai);
  return str[ai .. bi];
}
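
In present-day D the same precondition can be written as an expression contract. The sketch below is an illustration, not the thread's code: it uses `std.utf.toUTFindex` from current Phobos in place of character(), and it picks `a <= b` for the open question in the comment, on the assumption that an empty result should be allowed just as the built-in `str[2 .. 2]` is.

```d
// Build and run with: dmd -unittest -main -run example.d
import std.utf : toUTFindex;

// Slice by character position; the contract rejects reversed bounds.
const(char)[] slice(const(char)[] str, size_t a, size_t b)
in (a <= b, "slice: start must not come after end")
{
    immutable ai = toUTFindex(str, a);
    // Resume from ai so the first a characters are not scanned twice.
    immutable bi = ai + toUTFindex(str[ai .. $], b - a);
    return str[ai .. bi];
}

unittest
{
    assert(slice("彼は来る", 1, 3) == "は来");
    assert(slice("彼は来る", 2, 2) == "");  // a == b yields an empty slice
}
```

Contracts are compiled out in release builds, so this documents and checks the precondition during development rather than validating untrusted input.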
 Not to mention you could genericise the functions since they are
 identical except for the element type of the array.
Yep.

template character(Type : Type[])
{
  size_t character(Type[] str, size_t n, size_t i = 0) {
    while (n--) {
      decode(str,i);
    }
    return i;
  }
}

template slice(Type : Type[])
{
  Type[] slice(Type[] str, size_t a, size_t b)
  in {
    assert(b > a); // b >= a?
  }
  body {
    size_t ai = character(str,a);
    size_t bi = character(str,b-a,ai);
    return str[ai .. bi];
  }
}

or something like that.
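
A generic version in present-day D might look like the sketch below. It is an illustration under assumptions rather than the thread's code: it relies on `std.utf.toUTFindex` and `std.traits.isSomeChar` from current Phobos, and on the compiler deducing the character type from the argument, which current D compilers do for function templates.

```d
// Build and run with: dmd -unittest -main -run example.d
import std.traits : isSomeChar;
import std.utf : toUTFindex;

// One template covers char, wchar and dchar arrays of any mutability.
C[] slice(C)(C[] str, size_t a, size_t b)
if (isSomeChar!C)
in (a <= b)
{
    immutable ai = toUTFindex(str, a);
    immutable bi = ai + toUTFindex(str[ai .. $], b - a);
    return str[ai .. bi];
}

unittest
{
    // The element type is deduced at each call site.
    assert(slice("彼は来る", 1, 3) == "は来");
    assert(slice("彼は来る"w, 1, 3) == "は来"w);
    assert(slice("彼は来る"d, 1, 3) == "は来"d);
}
```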
 I suppose that is why C++ string object is templated (so you can
 use wchar instead of char).
Probably.
 The decode method would actually be different though based on the
 type.
True.

Regan
Aug 31 2004
Regan Heath <regan netwin.co.nz> writes:
This sort of useful code should go into the standard library, the 
'phoenix' (or whatever we call it) library should include this..

Aug 31 2004
Nick <Nick_member pathlink.com> writes:
In article <opsdmdnblz5a2sq9 digitalmars.com>, Regan Heath says...
[...]

template slice(Type : Type[])
{
   Type[] slice(Type[] str, size_t a, size_t b)
   in {
     assert(b > a); // b >= a?
   }
   body {
     size_t ai = character(str,a);
     size_t bi = character(str,b-a,ai);
     return str[ai .. bi];
   }
}

or something like that.

 I suppose that is why C++ string object is templated (so you can
 use wchar instead of char).
Nice. Except now you have to add a !(char[]) for every slice operation, since D
doesn't auto detect types :-( A workaround could be something like:

template slice_template(Type: Type[]) {...}

alias slice_template!(char[]) slice;
alias slice_template!(wchar[]) slice;

Nick
Sep 01 2004
Regan Heath <regan netwin.co.nz> writes:
On Wed, 1 Sep 2004 12:50:28 +0000 (UTC), Nick <Nick_member pathlink.com> 
wrote:
 Nice. Except now you have to add a !(char[]) for every slice operation,
 since D doesn't auto detect types :-( A workaround could be something like:

 template slice_template(Type: Type[]) {...}

 alias slice_template!(char[]) slice;
 alias slice_template!(wchar[]) slice;

Does that work? (I haven't tried it, but I'd expect the second to over-rule the
first?)

The other option is to then write wrapper functions, e.g.

char[] slice(char[] str, size_t a, size_t b)
{
  return slice!(char[])(str,a,b);
}

wchar[] slice(wchar[] str, size_t a, size_t b)
{
  return slice!(wchar[])(str,a,b);
}

dchar[] slice(dchar[] str, size_t a, size_t b)
{
  return slice!(dchar[])(str,a,b);
}

Regan
Sep 01 2004
Sean Kelly <sean f4.ca> writes:
In article <opsdn8ouhn5a2sq9 digitalmars.com>, Regan Heath says...
On Wed, 1 Sep 2004 12:50:28 +0000 (UTC), Nick <Nick_member pathlink.com> 
wrote:

 alias slice_template!(char[]) slice;
 alias slice_template!(wchar[]) slice;
Does that work? (I haven't tried it, but I'd expect the second to over-rule the first?)
Yes, it works because the prototypes are different. I used this trick at some point in my std.stream rewrite, though I think I tossed all the template code before I posted the version that's available now.

Sean
Sep 01 2004
Nick <Nick_member pathlink.com> writes:
In article <opsdn8ouhn5a2sq9 digitalmars.com>, Regan Heath says...
On Wed, 1 Sep 2004 12:50:28 +0000 (UTC), Nick <Nick_member pathlink.com> 
wrote:
 alias slice_template!(char[]) slice;
 alias slice_template!(wchar[]) slice;
Does that work? (I haven't tried it, but I'd expect the second to over-rule the first?)
Yep, it works. The second does not over-rule the first, it over-*loads* it, meaning slice() is subject to normal function overloading rules. I use this on almost all my templates, I find it makes the code less rough on the eyes and means less typing as well.

Nick
Sep 02 2004
"Walter" <newshound digitalmars.com> writes:
Nice work! Can I add it to std.string? Or should it go in std.utf?
Aug 31 2004
Ben Hinkle <bhinkle4 juno.com> writes:
Walter wrote:

 Nice work! Can I add it to std.string? Or should it go in std.utf?
Cool, thanks. I think most people would look in std.string, since the purpose of the operations is to index and slice strings - the encoding is somewhat secondary.
Sep 01 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ch24jt$rs0$1 digitaldaemon.com>, Berin Loritsch says...

I think having something generally useful for internationalization is
very important, or we shoot ourselves in the foot (we want D to succeed,
as long as you speak English does not make sense).  General purpose i18n
and l10n is not easy to do by any stretch--but I think it is generally
agreed that it would have to be done in libraries.
ICU has the class UnicodeString to encapsulate strings, as well as the abstract class CharacterIterator for iterating over characters, with concrete implementations UCharCharacterIterator and StringCharacterIterator. It also has a lot more besides. Check out the API guide at http://oss.software.ibm.com/icu/apiref/classes.html. All of this will be a part of D (yes, via a library) in the not-too-distant future.
I just don't think we can rely on D's native (up to now) way of dealing
with String manipulation.
That's why I'm wrapping ICU as we speak.

Arcane Jill
Sep 01 2004