digitalmars.D - arrays and strings
- Berin Loritsch (34/34) Aug 31 2004 All this talk about unicode made it clear that using a straight array
- Sebastian Beschke (9/14) Aug 31 2004 Not really. From what my limited Japanese abilities, this should
- Sebastian Beschke (2/4) Aug 31 2004 Whoops, the link doesn't work. Nevermind.
- Berin Loritsch (2/21) Aug 31 2004 Blasted electronic translators...
- Ben Hinkle (11/45) Aug 31 2004 This is what dchar[] is for. With dchar[] array indexing === character
- Ben Hinkle (8/63) Aug 31 2004 actually now that I think about it another way to slice from character a...
- Ben Hinkle (26/92) Aug 31 2004 OK - enough replying to myself, I know, I know. Here's the code implemen...
- Nick (4/10) Aug 31 2004 It's more flexible, but it is slightly slower. The two calls to characte...
- Nick (4/6) Aug 31 2004 ^^^^^^
- Ben Hinkle (29/40) Aug 31 2004 to
- Berin Loritsch (18/47) Aug 31 2004 Considering the code is not as straight forward as I am used to,
- Regan Heath (41/88) Aug 31 2004 Clever optimisation.
- Regan Heath (5/93) Aug 31 2004 This sort of useful code should go into the standard library, the
- Nick (9/25) Sep 01 2004 Nice. Except now you have to add a !(char[]) for every slice operation, ...
- Regan Heath (20/48) Sep 01 2004 Does that work? (I haven't tried it, but I'd expect the second to
- Sean Kelly (5/11) Sep 01 2004 Yes, it works because the prototypes are different. I used this trick a...
- Nick (6/13) Sep 02 2004 Yep, it works. The second does not over-rule the first, it over-*loads* ...
- Walter (1/1) Aug 31 2004 Nice work! Can I add it to std.string? Or should it go in std.utf?
- Ben Hinkle (4/5) Sep 01 2004 cool, thanks. I think most people would look in std.string since the tar...
- Arcane Jill (10/17) Sep 01 2004 ICU has the class UnicodeString to encapsulate strings, as well as the a...
All this talk about Unicode made it clear that using a straight array may not be the right tool for string handling. Sure, the most common operations can be done on an array (concatenation, sub-arrays, etc.). However, if we are to assume any kind of encoding support other than ASCII, it is simply not safe unless we are talking about "dchar" arrays.

For example, logically speaking I may want to get the second and third characters of this string (UTF8): 彼は来る (only four characters). It is the Japanese text for "kyo kimasu" (he comes). I'm into martial arts, so I can't get away from the Japanese language (it is tied to what I study)--even though I can't really speak a lick.

Now, tell me what I would get in a UTF8 environment:

    char[] kyokimasu = "彼は来る";
    char[] test = kyokimasu[1..3];
    assert(test == "は来");

I guarantee you the assertion would fail. Why? Because strict array slicing does not take multibyte encoding into account. Essentially I will get only part of the first character's encoding. Any UTF-aware system would either need to build this knowledge into the language (bad idea IMO), or have a string type take care of that info for you.

Things are a bit better with wchar[] (I'm not sure, but I think the above will pass)--but there are still some cases of multibyte encoding. Not to mention that the UTF8 string above is longer than the 8 bytes the wchar[] version takes.

The only way to make it work seamlessly is to have a string class that would make the proper adjustments. Of course this would also affect the speed demons here.

I think having something generally useful for internationalization is very important, or we shoot ourselves in the foot ("we want D to succeed, as long as you speak English" does not make sense). General purpose i18n and l10n is not easy to do by any stretch--but I think it is generally agreed that it would have to be done in libraries. I just don't think we can rely on D's native (up to now) way of dealing with String manipulation.
Aug 31 2004
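The failure described in the post above can be reproduced with a short sketch. It assumes a current D2 compiler and Phobos, where the thread's `char[]` literal type is spelled `string`; `isValidUtf8` is a hypothetical helper written for this illustration on top of `std.utf.validate`, not something from the thread.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // Each of these four characters occupies three bytes in UTF-8.
    string s = "彼は来る";
    assert(s.length == 12);

    // Slicing by array index cuts the first character apart:
    // the result is two continuation bytes, not two characters.
    string broken = s[1 .. 3];
    assert(broken.length == 2);
    assert(!isValidUtf8(broken));

    // In UTF-16 each of these characters is a single code unit,
    // so the same slice happens to work.
    wstring w = "彼は来る"w;
    assert(w.length == 4);
    assert(w[1 .. 3] == "は来"w);

    // A character outside the Basic Multilingual Plane still needs
    // two UTF-16 code units (a surrogate pair).
    wstring clef = "\U0001D11E"w;
    assert(clef.length == 2);
}
```

So the wchar[] version passes for this particular text, but only because none of its characters needs a surrogate pair.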
Berin Loritsch wrote:
> For example, logically speaking I may want to get the second and third characters of this string (UTF8): 彼は来る (only four characters). It is the Japanese text for "kyo kimasu" (he comes). I'm into martial arts, so I can't get away from the Japanese language (it is tied to what I study)--even though I can't really speak a lick.

Not really. From what my limited Japanese abilities tell me, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.

-Sebastian
Aug 31 2004
Sebastian Beschke wrote:
> don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D

Whoops, the link doesn't work. Never mind.
Aug 31 2004
Sebastian Beschke wrote:
> Not really. From what my limited Japanese abilities tell me, this should actually be "kare wa kuru", which means the same thing (he comes). I don't think the kanji 彼 can be pronounced "kyo", if you look at this page: http://www.csse.monash.edu.au/cgi-bin/cgiwrap/jwb/wwwjdic?1D
>
> Of course this is nitpicking (I'm sorry :D ) and doesn't make your point invalid. I agree that a string should be somewhat more "intelligent" than an array.

Blasted electronic translators...
Aug 31 2004
This is what dchar[] is for. With dchar[], array indexing === character indexing. A couple of helper functions in std.string

    char[] slice(char[] str, int a, int b);   // slice characters a to b, not index a to b
    wchar[] slice(wchar[] str, int a, int b);

would also be nice for those cases when one doesn't want to convert to dchar[]. Maybe such functions are already in Phobos somewhere? I haven't looked too hard.
Aug 31 2004
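The dchar[] route mentioned above can be sketched as a round trip through UTF-32. This assumes current D2 Phobos, where `std.utf.toUTF32` and `std.utf.toUTF8` do the conversions; `sliceViaUtf32` is a hypothetical name chosen for the illustration.

```d
import std.utf : toUTF32, toUTF8;

// Hypothetical helper: slice characters a (inclusive) to b (exclusive)
// by converting to UTF-32, where one array element is one code point.
string sliceViaUtf32(string s, size_t a, size_t b)
{
    dstring d = toUTF32(s);   // allocates a new array, one dchar per code point
    return toUTF8(d[a .. b]); // convert the slice back to UTF-8
}

void main()
{
    assert(sliceViaUtf32("彼は来る", 1, 3) == "は来");
}
```

The cost is two conversions and two allocations per slice, which is the motivation for helper functions that work on the UTF-8 data directly.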
Actually, now that I think about it, another way to slice from character a to b is to have a function that returns the index of the nth character:

    int character(char[] str, int n);

and then slicing is

    str[character(str, a) .. character(str, b)];

That is probably better than special slicing functions.
Aug 31 2004
OK - enough replying to myself, I know, I know. Here's the code implementing what I'm talking about:

    import std.utf;

    // Returns the array index at which character number n of str starts.
    size_t character(char[] str, size_t n)
    {
        size_t i = 0;
        while (n--) {
            decode(str,i); // advances i past one whole character
        }
        return i;
    }

    // Same for UTF-16 strings.
    size_t character(wchar[] str, size_t n)
    {
        size_t i = 0;
        while (n--) {
            decode(str,i);
        }
        return i;
    }
Aug 31 2004
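For readers on a current compiler, the same idea might look like the sketch below. It assumes D2 Phobos, where `std.utf.decode` takes the string and a `ref size_t` index and advances that index past one code point; the 2004 code above uses the older `char[]` spelling.

```d
import std.utf : decode;

// Index of the code unit at which character number n starts.
// n must not exceed the number of characters in str.
size_t character(string str, size_t n)
{
    size_t i = 0;
    while (n--)
        decode(str, i); // moves i to the start of the next character
    return i;
}

void main()
{
    string s = "彼は来る";
    assert(character(s, 0) == 0);
    assert(character(s, 1) == 3); // the first character is three bytes long
    assert(s[character(s, 1) .. character(s, 3)] == "は来");
}
```

Note that each call walks the string from the beginning, so the slice expression decodes the first character twice.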
In article <ch26je$sq4$1 digitaldaemon.com>, Ben Hinkle says...
> actually now that I think about it another way to slice from character a to b is to have a function that returns the index of the nth character:
>     int character(char[] str, int n);
> and then slicing is
>     str[character(str, a) .. character(str, b)];
> That is probably better than special slicing functions.

It's more flexible, but it is slightly slower. The two calls to character() will parse the string once each, while a splice() function could do it in one run.

Nick
Aug 31 2004
In article <ch2i6t$13ma$1 digitaldaemon.com>, Nick says...
> It's more flexible, but it is slightly slower. The two calls to character() will
> parse the string once each, while a splice() function could do it in one run.
                                      ^^^^^^
Err, that should be slice() :-)

Nick
Aug 31 2004
"Nick" <Nick_member pathlink.com> wrote in message news:ch2i6t$13ma$1 digitaldaemon.com...
> It's more flexible, but it is slightly slower. The two calls to character() will parse the string once each, while a splice() function could do it in one run.

Good point. Plus it is less typing. So here's version 2:

    import std.utf;

    // Advances n characters from array index i and returns the new index.
    size_t character(char[] str, size_t n, size_t i = 0)
    {
        while (n--) {
            decode(str,i);
        }
        return i;
    }

    size_t character(wchar[] str, size_t n, size_t i = 0)
    {
        while (n--) {
            decode(str,i);
        }
        return i;
    }

    // Slices characters a to b, decoding each character only once.
    char[] slice(char[] str, size_t a, size_t b)
    {
        size_t ai = character(str,a);
        size_t bi = character(str,b-a,ai); // resume from ai
        return str[ai .. bi];
    }

    wchar[] slice(wchar[] str, size_t a, size_t b)
    {
        size_t ai = character(str,a);
        size_t bi = character(str,b-a,ai);
        return str[ai .. bi];
    }
Aug 31 2004
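A rough D2 rendering of version 2, for comparison. It assumes current Phobos and the `string` type; the function names follow the post above, and the behaviour for out-of-range arguments is deliberately left as in the original.

```d
import std.utf : decode;

// Advance n characters starting from code-unit index i; return the new index.
size_t character(string str, size_t n, size_t i = 0)
{
    while (n--)
        decode(str, i);
    return i;
}

// Characters a (inclusive) to b (exclusive), in a single pass over the string.
string slice(string str, size_t a, size_t b)
{
    immutable ai = character(str, a);
    immutable bi = character(str, b - a, ai); // resume where the first call stopped
    return str[ai .. bi];
}

void main()
{
    assert(slice("彼は来る", 1, 3) == "は来");
    assert(slice("彼は来る", 0, 4) == "彼は来る");
}
```

The third parameter is what removes the double decoding: the second call starts at `ai` and only has to skip `b - a` further characters.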
Considering the code is not as straightforward as I am used to, what the character() method is doing is decoding the string character by character, starting at the passed-in index. The index (i) is only used to resume where you may have left off. OK. So we have a little optimization here so that we don't double-decode something...

It seemed a bit odd to me to do the b-a subtraction in the slice method, but then I realized what you were doing (resuming from the last point). Of course this also assumes that someone didn't put in bad data like:

    slice(mystr, 5, 4);

Not to mention you could genericise the functions since they are identical except for the element type of the array. I suppose that is why C++ string object is templated (so you can use wchar instead of char). The decode method would actually be different though based on the type.
Aug 31 2004
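The bad-data case raised above is worth spelling out: `size_t` is unsigned, so `b - a` with `a > b` wraps to a huge count instead of going negative. The sketch below shows the wraparound and one possible guard using `std.exception.enforce`; `advance` and `checkedSlice` are hypothetical names, and throwing an exception is only one of several reasonable policies.

```d
import std.exception : assertThrown, enforce;
import std.utf : decode;

// Skip n characters starting at code-unit index i, refusing to run off the end.
size_t advance(string str, size_t n, size_t i)
{
    foreach (_; 0 .. n)
    {
        enforce(i < str.length, "range runs past the end of the string");
        decode(str, i);
    }
    return i;
}

// Character slice that rejects reversed or out-of-range arguments.
string checkedSlice(string str, size_t a, size_t b)
{
    enforce(a <= b, "start is after end");
    immutable ai = advance(str, a, 0);
    immutable bi = advance(str, b - a, ai);
    return str[ai .. bi];
}

void main()
{
    // Unsigned arithmetic: 4 - 5 wraps around to the largest size_t value.
    size_t a = 5, b = 4;
    assert(b - a == size_t.max);

    assert(checkedSlice("彼は来る", 1, 3) == "は来");
    assertThrown(checkedSlice("彼は来る", 5, 4)); // reversed range
    assertThrown(checkedSlice("彼は来る", 2, 9)); // past the end
}
```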
On Tue, 31 Aug 2004 17:10:50 -0400, Berin Loritsch <bloritsch d-haven.org> wrote:
> Considering the code is not as straightforward as I am used to, what the character() method is doing is decoding the string character by character, starting at the passed-in index. The index (i) is only used to resume where you may have left off. OK. So we have a little optimization here so that we don't double-decode something...

Clever optimisation.

> It seemed a bit odd to me to do the b-a subtraction in the slice method, but then I realized what you were doing (resuming from the last point).

Yeah.. it took me a while too.

> Of course this also assumes that someone didn't put in bad data like:
>     slice(mystr, 5, 4);

A perfect opportunity for DbC, e.g.

    char[] slice(char[] str, size_t a, size_t b)
    in {
        assert(b > a); // b >= a?
    }
    body {
        size_t ai = character(str,a);
        size_t bi = character(str,b-a,ai);
        return str[ai .. bi];
    }

> Not to mention you could genericise the functions since they are identical except for the element type of the array.

Yep.

    template character(Type : Type[])
    {
        size_t character(Type[] str, size_t n, size_t i = 0)
        {
            while (n--) {
                decode(str,i);
            }
            return i;
        }
    }

    template slice(Type : Type[])
    {
        Type[] slice(Type[] str, size_t a, size_t b)
        in {
            assert(b > a); // b >= a?
        }
        body {
            size_t ai = character(str,a);
            size_t bi = character(str,b-a,ai);
            return str[ai .. bi];
        }
    }

or something like that.

> I suppose that is why C++ string object is templated (so you can use wchar instead of char).

Probably.

> The decode method would actually be different though based on the type.

True.

Regan
Aug 31 2004
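On the "b >= a?" question in the contract above: allowing `a == b` simply yields an empty slice, which matches what built-in array slicing does. A sketch in current D2 syntax follows, assuming a compiler recent enough to accept `do` in place of the older `body` keyword. Contracts are normally compiled out of release builds, so this documents and checks the precondition during development rather than validating untrusted input.

```d
import std.utf : decode;

// Character slice with the precondition stated as an in-contract.
string slice(string str, size_t a, size_t b)
in
{
    assert(a <= b, "slice: start must not exceed end");
}
do
{
    size_t i = 0;
    foreach (_; 0 .. a)
        decode(str, i); // skip the first a characters
    immutable ai = i;
    foreach (_; a .. b)
        decode(str, i); // skip the b - a characters being sliced
    return str[ai .. i];
}

void main()
{
    assert(slice("彼は来る", 1, 3) == "は来");
    assert(slice("彼は来る", 2, 2) == ""); // equal bounds give an empty slice
}
```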
This sort of useful code should go into the standard library; the 'phoenix' (or whatever we call it) library should include this.
Aug 31 2004
In article <opsdmdnblz5a2sq9 digitalmars.com>, Regan Heath says...
> [...]
>     template slice(Type : Type[])
>     {
>         Type[] slice(Type[] str, size_t a, size_t b)
>         in {
>             assert(b > a); // b >= a?
>         }
>         body {
>             size_t ai = character(str,a);
>             size_t bi = character(str,b-a,ai);
>             return str[ai .. bi];
>         }
>     }
> or something like that.

Nice. Except now you have to add a !(char[]) for every slice operation, since D doesn't auto detect types :-(

A workaround could be something like:

    template slice_template(Type: Type[]) {...}
    alias slice_template!(char[]) slice;
    alias slice_template!(wchar[]) slice;

Nick
Sep 01 2004
On Wed, 1 Sep 2004 12:50:28 +0000 (UTC), Nick <Nick_member pathlink.com> wrote:
> Nice. Except now you have to add a !(char[]) for every slice operation, since D doesn't auto detect types :-(
>
> A workaround could be something like:
>     template slice_template(Type: Type[]) {...}
>     alias slice_template!(char[]) slice;
>     alias slice_template!(wchar[]) slice;

Does that work? (I haven't tried it, but I'd expect the second to over-rule the first?)

The other option is to then write wrapper functions, e.g.

    char[] slice(char[] str, size_t a, size_t b)
    {
        return slice!(char[])(str,a,b);
    }

    wchar[] slice(wchar[] str, size_t a, size_t b)
    {
        return slice!(wchar[])(str,a,b);
    }

    dchar[] slice(dchar[] str, size_t a, size_t b)
    {
        return slice!(dchar[])(str,a,b);
    }

Regan
Sep 01 2004
In article <opsdn8ouhn5a2sq9 digitalmars.com>, Regan Heath says...
> On Wed, 1 Sep 2004 12:50:28 +0000 (UTC), Nick <Nick_member pathlink.com> wrote:
>>     alias slice_template!(char[]) slice;
>>     alias slice_template!(wchar[]) slice;
>
> Does that work? (I haven't tried it, but I'd expect the second to over-rule the first?)

Yes, it works because the prototypes are different. I used this trick at some point in my std.stream rewrite, though I think I tossed all the template code before I posted the version that's available now.

Sean
Sep 01 2004
In article <opsdn8ouhn5a2sq9 digitalmars.com>, Regan Heath says...
> On Wed, 1 Sep 2004 12:50:28 +0000 (UTC), Nick <Nick_member pathlink.com> wrote:
>>     alias slice_template!(char[]) slice;
>>     alias slice_template!(wchar[]) slice;
>
> Does that work? (I haven't tried it, but I'd expect the second to over-rule the first?)

Yep, it works. The second does not over-rule the first, it over-*loads* it, meaning slice() is subject to normal function overloading rules. I use this on almost all my templates; I find it makes the code less rough on the eyes and means less typing as well.

Nick
Sep 02 2004
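The alias trick above works around the lack of type deduction in the compiler of the time. Current D2 compilers deduce function template arguments from the call, so a single generic function can cover all three string types without explicit `!(char[])`. A sketch under that assumption, using `std.traits.isSomeChar` as the constraint:

```d
import std.traits : isSomeChar;
import std.utf : decode;

// One generic character slice for char, wchar and dchar arrays.
// C is deduced from the argument, including its immutable qualifier.
C[] slice(C)(C[] str, size_t a, size_t b)
if (isSomeChar!C)
{
    size_t i = 0;
    foreach (_; 0 .. a)
        decode(str, i); // std.utf.decode picks the right decoder for the element type
    immutable ai = i;
    foreach (_; a .. b)
        decode(str, i);
    return str[ai .. i];
}

void main()
{
    assert(slice("彼は来る", 1, 3) == "は来");    // UTF-8
    assert(slice("彼は来る"w, 1, 3) == "は来"w);  // UTF-16
    assert(slice("彼は来る"d, 1, 3) == "は来"d);  // UTF-32
}
```

This also settles the earlier remark that the decode step differs by type: the overload resolution happens inside `decode`, so the slicing logic itself stays identical.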
Nice work! Can I add it to std.string? Or should it go in std.utf?
Aug 31 2004
Walter wrote:
> Nice work! Can I add it to std.string? Or should it go in std.utf?

Cool, thanks. I think most people would look in std.string, since the purpose of the operations is to index and slice strings - the encoding is somewhat secondary.
Sep 01 2004
In article <ch24jt$rs0$1 digitaldaemon.com>, Berin Loritsch says...
> I think having something generally useful for internationalization is very important, or we shoot ourselves in the foot ("we want D to succeed, as long as you speak English" does not make sense). General purpose i18n and l10n is not easy to do by any stretch--but I think it is generally agreed that it would have to be done in libraries.

ICU has the class UnicodeString to encapsulate strings, as well as the abstract class CharacterIterator for iterating over characters, with concrete implementations UCharCharacterIterator and StringCharacterIterator. It also has a lot more besides. Check out the API guide at http://oss.software.ibm.com/icu/apiref/classes.html. All of this will be a part of D (yes, via a library) in the not-too-distant future.

> I just don't think we can rely on D's native (up to now) way of dealing with String manipulation.

That's why I'm wrapping ICU as we speak.

Arcane Jill
Sep 01 2004