digitalmars.D.learn - Get Character At?
- okibi (2/2) Apr 24 2007 Is there a getCharAt() function for D?
- Derek Parnell (7/8) Apr 24 2007 Get a character from what? A string, a file, a console screen, ... ?
- okibi (6/17) Apr 24 2007 Such as this:
- Tomas Lindquist Olsen (4/27) Apr 24 2007 Why not just do:
- okibi (2/34) Apr 24 2007 Because it isn't working for me. That was what I was trying to do seeing...
- BCS (2/5) Apr 24 2007 How about a little more code. What I've seen so far should work.
- Tomas Lindquist Olsen (9/20) Apr 24 2007 import std.stdio;
- okibi (2/25) Apr 24 2007 That fixed the problem, thanks!
- Clay Smith (2/31) Apr 24 2007 text[5] will return the sixth element in the array.
- Tomas Lindquist Olsen (2/4) Apr 24 2007 He never said anything about getCharAt starting at one...
- Clay Smith (4/27) Apr 24 2007 Just use
- Derek Parnell (34/57) Apr 24 2007 Because char[] represents a UTF-8 encoded unicode string, to get the Nth
- Chris Nicholson-Sauls (7/68) Apr 24 2007 Which is why I tend to try and bite the bullet and just use dchar[] for ...
- Daniel Keep (19/87) Apr 24 2007 I was going to post a link to my old Text In D article[1], but I guess
- Derek Parnell (71/74) Apr 25 2007 It seems that your routine is about 3 times slower than the one I had
- Frits van Bommel (25/68) Apr 25 2007 How is it unclear? Postfix-increment clearly means that the value before...
- Daniel Keep (15/95) Apr 25 2007 Yoikes! I'm rather amazed that the "simple" foreach method is that much
- Derek Parnell (25/42) Apr 25 2007 Yes, I know what it is supposed to do, but when written as it is, it can
- Frits van Bommel (39/50) Apr 25 2007 I was just mentioning that you seemed to be over-complicating the code,
On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
> Is there a getCharAt() function for D?

Get a character from what? A string, a file, a console screen, ... ?

--
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Apr 24 2007
Derek Parnell Wrote:
> On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
>> Is there a getCharAt() function for D?
>
> Get a character from what? A string, a file, a console screen, ... ?

Such as this:

    char[] text = "This is a test sentence.";
    int loc = 5;
    char num5 = text.getCharAt(loc);

Something along those lines.
Apr 24 2007
okibi wrote:
> Such as this:
>
>     char[] text = "This is a test sentence.";
>     int loc = 5;
>     char num5 = text.getCharAt(loc);
>
> Something along those lines.

Why not just do:

    char[] text = "some text";
    char num5 = text[5];
Apr 24 2007
Tomas Lindquist Olsen Wrote:
> Why not just do:
>
>     char[] text = "some text";
>     char num5 = text[5];

Because it isn't working for me. That was what I was trying to do, seeing as char[] is simply an array of characters. However, it's returning an int and not a char.
Apr 24 2007
okibi wrote:
> Because it isn't working for me. That was what I was trying to do, seeing
> as char[] is simply an array of characters. However, it's returning an int
> and not a char.

How about a little more code? What I've seen so far should work.
Apr 24 2007
okibi wrote:
>> Why not just do:
>>
>>     char[] text = "some text";
>>     char num5 = text[5];
>
> Because it isn't working for me. That was what I was trying to do, seeing
> as char[] is simply an array of characters. However, it's returning an int
> and not a char.

    import std.stdio;

    void main()
    {
        char[] text = "this is a sentence";
        int loc = 5;
        writefln("%s", typeid(typeof(text[loc])));
    }

This prints 'char' as expected...
Apr 24 2007
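The thread never says what okibi's original code looked like, so the following is only a guess at one way an int can appear. It is a minimal sketch assuming a current D2 compiler (the thread itself predates D2, hence the `.dup`): indexing yields a `char`, but arithmetic on that `char` is promoted to `int`.

```d
import std.stdio : writeln;

void main()
{
    // D2 string literals are immutable, so copy to get a mutable char[].
    char[] text = "this is a sentence".dup;
    size_t loc = 5;

    // Indexing a char[] yields a single char (one UTF-8 code unit).
    static assert(is(typeof(text[loc]) == char));
    // Arithmetic on a char is promoted to int, one possible source of confusion.
    static assert(is(typeof(text[loc] + 1) == int));

    writeln(text[loc]);     // prints the character: i
    writeln(text[loc] + 0); // prints the promoted value: 105
}
```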
Tomas Lindquist Olsen Wrote:
>     import std.stdio;
>
>     void main()
>     {
>         char[] text = "this is a sentence";
>         int loc = 5;
>         writefln("%s", typeid(typeof(text[loc])));
>     }
>
> This prints 'char' as expected...

That fixed the problem, thanks!
Apr 24 2007
Tomas Lindquist Olsen wrote:
> Why not just do:
>
>     char[] text = "some text";
>     char num5 = text[5];

text[5] will return the sixth element in the array.
Apr 24 2007
Clay Smith wrote:
> text[5] will return the sixth element in the array.

He never said anything about getCharAt starting at one...
Apr 24 2007
okibi wrote:
> Such as this:
>
>     char[] text = "This is a test sentence.";
>     int loc = 5;
>     char num5 = text.getCharAt(loc);
>
> Something along those lines.

Just use

    char num5 = text[loc-1];

?
Apr 24 2007
On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:
> Such as this:
>
>     char[] text = "This is a test sentence.";
>     int loc = 5;
>     char num5 = text.getCharAt(loc);
>
> Something along those lines.

Because char[] represents a UTF-8 encoded Unicode string, to get the Nth character (first character is at position 1), try this ...

    import std.stdio;
    import std.utf;

    T getCharAt(T)(T pText, uint pPos)
    {
        size_t lUTF_Index;
        uint lStride;

        // Firstly, find out where the character starts in the string.
        lUTF_Index = std.utf.toUTFindex(pText, pPos-1);

        // Then find out its width (in bytes)
        lStride = std.utf.stride(pText, lUTF_Index);

        // Return the character encoded in UTF format.
        return pText[lUTF_Index .. lUTF_Index + lStride];
    }

    void main()
    {
        char[] text = "a\ua034bcdef";
        uint loc = 4;
        writefln("%s", getCharAt(text, loc)); // shows "c"
        writefln("%s", text[loc-1]); // correctly fails
    }

If you just use 'text[loc]', you may not get the correct character, and you actually only get a UTF code point fragment anyway.

Remember that char[] is not an array of characters. It is an array of UTF-8 code point fragments (each 1 byte wide), and a UTF-8 encoded character (code point) can have from 1 to 4 fragments.
Apr 24 2007
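For readers on a current compiler, the approach in the post above can be sketched roughly as follows. This assumes D2 Phobos, where `std.utf.toUTFindex` and `std.utf.stride` still exist; `charSliceAt` is a hypothetical name, it counts from zero rather than one, and it does no bounds checking, so a position past the end will raise an error.

```d
import std.stdio : writeln;
import std.utf : stride, toUTFindex;

// Hypothetical helper (not part of Phobos): returns the n-th code point of a
// UTF-8 string as a slice of the original text, counting from zero.
string charSliceAt(string text, size_t n)
{
    // Byte offset at which the n-th code point starts.
    immutable start = toUTFindex(text, n);
    // Number of UTF-8 code units (bytes) that this code point occupies.
    immutable width = stride(text, start);
    return text[start .. start + width];
}

void main()
{
    string text = "a\ua034bcdef";

    writeln(charSliceAt(text, 3));        // c
    writeln(charSliceAt(text, 1).length); // 3: U+A034 takes three bytes in UTF-8
    writeln(cast(ubyte) text[3]);         // 180: a continuation byte, not a character
}
```

The last line shows why plain indexing is misleading here: byte 3 is the tail end of the three-byte sequence for U+A034.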
Derek Parnell wrote:
> Remember that char[] is not an array of characters. It is an array of
> UTF-8 code point fragments (each 1 byte wide), and a UTF-8 encoded
> character (code point) can have from 1 to 4 fragments.

Which is why I tend to try and bite the bullet and just use dchar[] for general purpose things. I only use char[] in cases where I know it's "safe" to do so (that is, cases where I know what the input will be, and know it will be within the single-byte character range).

That said, it's a darn good thing Phobos has std.utf and Tango has tango.utils.Utf, otherwise we'd often be in a pickle. (Avoiding potential tango.io joke.)

-- Chris Nicholson-Sauls
Apr 24 2007
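The dchar[] approach described above can be illustrated with a small sketch, again assuming a current D2 compiler, where `std.conv.to!dstring` performs the UTF-8 to UTF-32 conversion. Once converted, each array element is one whole code point, so plain indexing is safe, at the cost of four bytes per element.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string utf8 = "a\ua034bcdef";

    // Decode once into UTF-32; every element is now a complete code point.
    dstring utf32 = utf8.to!dstring;

    assert(utf8.length == 9);  // code units: 6 ASCII bytes + 3 bytes for U+A034
    assert(utf32.length == 7); // code points
    assert(utf32[3] == 'c');   // zero-based indexing now lands on a character

    writeln(utf32[3]);
}
```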
Derek Parnell wrote:
> Because char[] represents a UTF-8 encoded Unicode string, to get the Nth
> character (first character is at position 1), try this ...
>
>     T getCharAt(T)(T pText, uint pPos)

I was going to post a link to my old Text In D article[1], but I guess that'd be redundant now :P

Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:

    dchar nthCharacter(char[] string, int n)
    {
        int curChar = 0;
        foreach( dchar cp ; string )
            if( curChar++ == n )
                return cp;
        return dchar.init;
    }

I'm curious since I don't want to recommend a slow solution if I can help it :)

    -- Daniel

[1] http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

--
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}
http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP http://hackerkey.com/
Apr 24 2007
On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
> Incidentally, I don't suppose you know anything about the relative
> performance of your method up there ^^ and the one in my article down
> here vv:

It seems that your routine is about 3 times slower than the one I had shown. Here is my test program ...

I modified your routine slightly because the idiom "if (x++ == n)" is a dangerous one, as it is unclear if 'x' gets incremented before or after the comparison. I changed it to be more clear.

I also changed my routine to output a dchar rather than a char[] and to test for invalid position input.

//-----------------------------
import std.perf;
import std.stdio;
import std.utf;

dchar getCharAt(T)(T pText, int pPos)
{
    size_t lUTF_Index;
    uint lStride;

    if (pPos < 0 || pPos >= pText.length)
        return dchar.init;

    // Firstly, find out where the character starts in the string.
    lUTF_Index = std.utf.toUTFindex(pText, pPos);

    // Then find out its width (in bytes)
    lStride = std.utf.stride(pText, lUTF_Index);

    // Return the character encoded in UTF format.
    return std.utf.toUTF32(
        pText[lUTF_Index .. lUTF_Index + lStride])[0];
}

dchar nthCharacter(T)(T string, int n)
{
    int curChar = 0;
    foreach( dchar cp ; string )
    {
        if( curChar == n )
            return cp;
        curChar++;
    }
    return dchar.init;
}

void main()
{
    char[] text =
        "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg1"
        "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg2"
        "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg3"
        "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg4"
        "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg5"
        "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg6"
        ;

    // Test must locate the last character.
    int loc = std.utf.toUTF32(text).length-1;
    assert(getCharAt(text, loc) == '6');
    assert(nthCharacter(text, loc) == '6');

    PerformanceCounter counter = new PerformanceCounter();

    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {
        getCharAt(text, loc);
    }
    counter.stop();
    writefln("Derek Parnell: %10d", counter.microseconds());

    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {
        nthCharacter(text, loc);
    }
    counter.stop();
    writefln("  Daniel Keep: %10d", counter.microseconds());
}
//-----------------------------

On my machine (Intel Core 2 6600 2.40GHz, 2GB RAM) I got this result ...

c:\temp>test
Derek Parnell:    7939664
  Daniel Keep:   26683373
Apr 25 2007
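The `std.perf` module and the `volatile` statement used in the program above are not available in current D2, so a comparable measurement today might look like the sketch below. It assumes `std.datetime.stopwatch.benchmark` from D2 Phobos; `byIndex` and `byForeach` are hypothetical names, the input string and iteration count are arbitrary, and the timings will of course depend on the machine and compiler flags.

```d
import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.range : walkLength;
import std.stdio : writefln;
import std.utf : decode, toUTFindex;

// Skip to the n-th code point by stride, then decode only that one.
dchar byIndex(string s, size_t n)
{
    size_t i = toUTFindex(s, n);
    return decode(s, i);
}

// Decode every code point from the start until the n-th is reached.
dchar byForeach(string s, size_t n)
{
    size_t current = 0;
    foreach (dchar cp; s)
    {
        if (current == n)
            return cp;
        ++current;
    }
    return dchar.init;
}

void main()
{
    string text = "a\ua034bcdef".replicate(24);
    immutable size_t last = text.walkLength - 1; // position of the final code point
    assert(byIndex(text, last) == 'f');
    assert(byForeach(text, last) == 'f');

    // Accumulate results so the calls cannot be discarded as dead code.
    uint sink;
    void runIndex()   { sink += byIndex(text, last); }
    void runForeach() { sink += byForeach(text, last); }

    auto times = benchmark!(runIndex, runForeach)(100_000);
    writefln("toUTFindex + decode: %10d us", times[0].total!"usecs");
    writefln("foreach:             %10d us", times[1].total!"usecs");
    writefln("(checksum %s)", sink);
}
```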
Derek Parnell wrote:
> I modified your routine slightly because the idiom "if (x++ == n)" is a
> dangerous one, as it is unclear if 'x' gets incremented before or after
> the comparison. I changed it to be more clear.

How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).

>     // Then find out its width (in bytes)
>     lStride = std.utf.stride(pText, lUTF_Index);
>
>     // Return the character encoded in UTF format.
>     return std.utf.toUTF32(
>         pText[lUTF_Index .. lUTF_Index + lStride])[0];

I think you can change these last two statements to just:
---
return pText.decode(lUTF_Index);
---
(that's std.utf.decode, just to be clear)
That changes the index variable passed, but that doesn't matter here.

> On my machine (Intel Core 2 6600 2.40GHz, 2GB RAM) I got this result ...
>
> c:\temp>test
> Derek Parnell:    7939664
>   Daniel Keep:   26683373

With mine added: (and obviously on _my_ machine)
---
urxae urxae:~/tmp$ dmd -O -release -inline -run test.d
   Derek Parnell:   17693368
     Daniel Keep:   54037341
Frits van Bommel:   12045495
urxae urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
   Derek Parnell:   19567337
     Daniel Keep:   26750383
Frits van Bommel:   14332419
---
(My machine & compilers: AMD Sempron 3200+, 1GB RAM, 64-bit Ubuntu 6.10, running DMD 1.013 and GDC 0.23/x86_64)

So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.
Apr 25 2007
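The simplification suggested above, written out as a complete function, might look like this under D2. It is only a sketch: `codePointAt` is a hypothetical name, it is zero-based, and it leaves out the range check from the benchmark version.

```d
import std.stdio : writeln;
import std.utf : decode, toUTFindex;

// Hypothetical helper: locate the n-th code point (zero-based), then let
// std.utf.decode work out its width and return it as a dchar.
dchar codePointAt(string text, size_t n)
{
    // decode takes the index by reference and advances it past the code
    // point it reads; that side effect is harmless for a local variable.
    size_t index = toUTFindex(text, n);
    return decode(text, index);
}

void main()
{
    string text = "a\ua034bcdef";
    writeln(codePointAt(text, 3));                    // c
    writeln(cast(uint) codePointAt(text, 1) == 0xA034); // true
}
```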
Frits van Bommel wrote:
> With mine added: (and obviously on _my_ machine)
> ---
> urxae urxae:~/tmp$ dmd -O -release -inline -run test.d
>    Derek Parnell:   17693368
>      Daniel Keep:   54037341
> Frits van Bommel:   12045495
> urxae urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
>    Derek Parnell:   19567337
>      Daniel Keep:   26750383
> Frits van Bommel:   14332419
> ---
> So my version is even faster (about 30%), at least on my machine. And IMHO
> it's also more readable. No need to know what "stride" is, for example.

Yoikes! I'm rather amazed that the "simple" foreach method is that much slower. I'll add the faster version to the article as soon as I get the chance.

Thanks, guys.

    -- Daniel
Apr 25 2007
On Wed, 25 Apr 2007 15:52:45 +0200, Frits van Bommel wrote:
> How is it unclear? Postfix-increment clearly means that the value before
> incrementation is returned (and thus compared to n in that expression).

Yes, I know what it is supposed to do, but when written as it is, it can either be mistakenly thought that the variable gets incremented before the comparison, or it requires that extra bit of thinking to 'see' the process flow. For that reason, I prefer to either have ++ written as its own statement or write it so the casual reader can explicitly see the process flow.

For example, in the original code by Daniel, I was unsure as to whether he was using a 0-based index or a 1-based index, as I had done in my example. The code he supplied assumed a 0-based index if the ++ worked as you describe, but it assumed a 1-based index if it worked the other way. As my example was 1-based, and I assumed that Daniel knew how to use ++ correctly, I figured he had thus changed my definition of the Position parameter. But the point is, because it was not absolutely clear what Daniel's *intention* was, I decided to code it so the intention was more clear.

> I think you can change these last two statements to just:
> ...
> So my version is even faster (about 30%), at least on my machine. And IMHO
> it's also more readable. No need to know what "stride" is, for example.

Well, if we were really into a pissing contest, we'd both remove the calls to library routines and code it inline, in assembler etc ... but that was not the point. Daniel's code is another example of 'foreach' not producing the best machine code to solve the problem at hand.
Apr 25 2007
Derek Parnell wrote:
> Well, if we were really into a pissing contest, we'd both remove the calls
> to library routines and code it inline, in assembler etc ... but that was
> not the point.

I was just mentioning that you seemed to be over-complicating the code, and as a side-benefit the simpler code was faster as well.

> Daniel's code is another example of 'foreach' not producing the best
> machine code to solve the problem at hand.

Well to be fair, I don't think that's purely the fault of 'foreach' implementation problems in this case. 'foreach' is doing genuinely more work in this case. Specifically, the foreach loop is decoding all characters up to the one it returns, while the getCharAt() variants only actually decode the character asked for, using no more than the stride of the preceding ones.

What the foreach version does is therefore more like the following:
-----
dchar nthCharacter2(T)(T string, int n)
{
    int curChar = 0;
    for(size_t index = 0 ; index < string.length ; string.decode(index))
    {
        if( curChar == n )
            return string.decode(index); // return _next_ char
        curChar++;
    }
    return dchar.init;
}
-----
Which is also on the slow side. (Though on DMD this version is still faster than the 'foreach' version :( )

The results with this added as well:
=====
urxae urxae:~/tmp$ dmd -O -release -inline -run test.d
   Derek Parnell:   14416041
Frits van Bommel:    9803830
     Daniel Keep:   37386228
      for-decode:   33767606
urxae urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
   Derek Parnell:   17267995
Frits van Bommel:   11836242
     Daniel Keep:   21390295
      for-decode:   25339226
=====
("for-decode" is the code above)
Apr 25 2007
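The distinction drawn in the last post, skipping by stride versus decoding every character, can be made concrete with a sketch like the following. It assumes D2 Phobos; `nthBySkipping` and `nthByDecoding` are hypothetical names, both are zero-based, and both return `dchar.init` when the position is past the end of valid UTF-8 input.

```d
import std.stdio : writeln;
import std.utf : decode, stride;

// Skips the first n code points by reading only each lead byte (stride),
// then decodes exactly one code point.
dchar nthBySkipping(string text, size_t n)
{
    size_t index = 0;
    for (size_t skipped = 0; skipped < n; ++skipped)
    {
        if (index >= text.length)
            return dchar.init;
        index += stride(text, index); // width in bytes, no decoding
    }
    if (index >= text.length)
        return dchar.init;
    return decode(text, index);
}

// Fully decodes every code point up to and including the n-th, which is
// roughly the work a foreach over dchar has to do.
dchar nthByDecoding(string text, size_t n)
{
    size_t current = 0;
    foreach (dchar cp; text)
    {
        if (current == n)
            return cp;
        ++current;
    }
    return dchar.init;
}

void main()
{
    string text = "a\ua034bcdef";
    writeln(nthBySkipping(text, 3)); // c
    writeln(nthByDecoding(text, 3)); // c
}
```

Both functions return the same result; the difference is only in how much work is done on the code points that precede the requested one.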