digitalmars.D.learn - toUTFz and WinAPI GetTextExtentPoint32W
- Andre (11/11) Sep 20 2011 Hi,
- Trass3r (1/5) Sep 20 2011 toUTFz returns a wchar*, not a wchar[].
- Andre (7/13) Sep 20 2011 I am not familiar with pointers. I know I have to
- Timon Gehr (8/21) Sep 20 2011 Are you sure that the call requires the string to be null terminated? I
- Trass3r (2/9) Sep 20 2011 It doesn't need to be null-terminated for that function.
- Timon Gehr (5/15) Sep 20 2011 It has to be copied anyway, so there is no real difference. I just did
- Timon Gehr (7/31) Sep 20 2011 sry, should have read:
- Andre (4/39) Sep 20 2011 thanks a lot for your help.
- Andrej Mitrovic (10/10) Sep 20 2011 Don't use length, use std.utf.count, ala:
- Jonathan M Davis (6/11) Sep 20 2011 Or std.range.walkLength. I don't know why we really have std.utf.count. ...
- Andrej Mitrovic (7/14) Sep 20 2011 I don't think having better-named aliases is a bad thing. Although now
- Andrej Mitrovic (5/5) Sep 20 2011 One other thing, count can only take an array which seems too
- Jonathan M Davis (18/39) Sep 20 2011 We specifically avoid having aliases in Phobos simply for having alterna...
- travert phare.normalesup.org (Christophe) (30/44) Sep 20 2011 std.utf.count has on advantage: someone looking for the function will
- Timon Gehr (4/48) Sep 20 2011 Very good point, you might want to file an enhancement request. It would...
- travert phare.normalesup.org (Christophe) (5/31) Sep 20 2011 I would be glad to do so, but I am quite new here, so I don't know how
- Timon Gehr (4/33) Sep 21 2011 http://d.puremagic.com/issues/
- Dmitry Olshansky (9/65) Sep 21 2011 Actually, I don't buy it. I guess the reason it's faster is that it
- Timon Gehr (5/70) Sep 21 2011 Most of these could be caught by a final check. I think having the
- travert phare.normalesup.org (Christophe) (65/71) Sep 21 2011 Why should it ? The documentation of std.utf.count says the string must
- Dmitry Olshansky (19/88) Sep 21 2011 Yeah, a brain malfunction on my part.
- zeljkog (9/19) Sep 21 2011 Here is a more readable and a bit faster version on dmd windows:
- travert phare.normalesup.org (Christophe Travert) (16/25) Sep 21 2011 Nice. It is better with gdc linux 64bits too. I wanted to avoid
- zeljkog (2/5) Sep 21 2011 It is not compiled in as conditional jump.
- Andrej Mitrovic (7/10) Sep 20 2011 And function names have to be useful to library users. walkLength is
- Jonathan M Davis (12/24) Sep 20 2011 In this case, if there's a problem it's not how generic the function is,...
Hi,

I want something like:

    bool test(HDC dc, string str, int len, SIZE* s)
    {
        wchar[] wstr = toUTFz!(wchar*)str;
        GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);
        ...

I can't get the wchar[] part to work. I am struggling with the pointer-to-array conversion. Could you give some advice?

Kind regards
Andre
Sep 20 2011
> bool test(HDC dc, string str, int len, SIZE* s)
> {
>     wchar[] wstr = toUTFz!(wchar*)str;
>     GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);

toUTFz returns a wchar*, not a wchar[].
Sep 20 2011
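To illustrate the point about types, here is a minimal sketch (not from the thread) showing that toUTFz yields a bare pointer to null-terminated UTF-16 data, so the result has no .ptr or .length to read. The sample string is arbitrary.

```d
import std.utf : toUTFz;

void main()
{
    string str = "hello";

    // toUTFz returns a pointer, not a slice: there is no .length or .ptr.
    const(wchar)* p = toUTFz!(const(wchar)*)(str);

    // The data is null-terminated, so the terminator sits right after
    // the five UTF-16 code units of "hello".
    assert(p[0] == 'h');
    assert(p[4] == 'o');
    assert(p[5] == '\0');
}
```

If the callee also needs a length, it has to come from somewhere else, for example from a wstring built with std.conv.to.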
On Tue, 20 Sep 2011 19:27:03 +0200, Trass3r wrote:
>> bool test(HDC dc, string str, int len, SIZE* s)
>> {
>>     wchar[] wstr = toUTFz!(wchar*)str;
>>     GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);
>
> toUTFz returns a wchar*, not a wchar[].

I am not familiar with pointers. I know I have to call toUTFz and pass the pointer and the length from its result to the WinAPI function. Do you have any suggestions on how to make this API call?

Kind regards
Andre
Sep 20 2011
On 09/20/2011 08:07 PM, Andre wrote:
> I am not familiar with pointers. I know I have to call toUTFz! and fill pointer value and length value of the WinAPI from the result. Do you have any suggestions how to achieve this API call?

Are you sure that the call requires the string to be null terminated? I do not know that winapi function, but this might work:

    bool test(HDC dc, string str, SIZE* s)
    {
        auto wstr = to!(wchar[])str;
        GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
        ...
Sep 20 2011
> Are you sure that the call requires the string to be null terminated? I do not know that winapi function, but this might work:
>
>     bool test(HDC dc, string str, SIZE* s)
>     {
>         auto wstr = to!(wchar[])str;
>         GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
>         ...

It doesn't need to be null-terminated for that function. Shouldn't you use to!wstring though?!
Sep 20 2011
On 09/20/2011 08:34 PM, Trass3r wrote:
> It doesn't need to be null-terminated for that function. Shouldn't you use to!wstring though?!

It has to be copied anyway, so there is no real difference. I just did not know the signature of that function, and if it had been missing the const, wstring would not have worked. But if there is a const, wstring is indeed superior because it is shorter and clearer.
Sep 20 2011
On 09/20/2011 08:24 PM, Timon Gehr wrote:
> Are you sure that the call requires the string to be null terminated? I do not know that winapi function, but this might work:
>
>     bool test(HDC dc, string str, SIZE* s)
>     {
>         auto wstr = to!(wchar[])str;
>         GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
>         ...

Sorry, that should have read:

    bool test(HDC dc, string str, SIZE* s)
    {
        auto wstr = to!(wchar[])(str);
        GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
        ...
Sep 20 2011
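The to!wstring variant suggested above can be sketched without any Windows headers. This is an illustration only: countUnits is a hypothetical stand-in for a WinAPI-style function that takes a pointer plus an int length, and the sample string is made up.

```d
import std.conv : to;

// Hypothetical stand-in (not a real API) for a function that takes a
// pointer and an explicit length in UTF-16 code units.
int countUnits(const(wchar)* p, int len)
{
    int n = 0;
    foreach (i; 0 .. len)
        if (p[i] != 0)
            ++n;
    return n;
}

void main()
{
    string str = "h\u00E9llo";          // "héllo"
    wstring wstr = to!wstring(str);      // converted copy, UTF-16

    // A slice carries both pieces the callee needs. The length is a
    // size_t, so it is cast for an int parameter.
    assert(wstr.length == 5);
    assert(countUnits(wstr.ptr, cast(int) wstr.length) == 5);
}
```

Because wstring is immutable(wchar)[], its .ptr converts implicitly to const(wchar)*, which is why the const in the callee's signature matters.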
On Tue, 20 Sep 2011 20:44:40 +0200, Timon Gehr wrote:
> sry, should have read:
>
>     bool test(HDC dc, string str, SIZE* s)
>     {
>         auto wstr = to!(wchar[])(str);
>         GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
>         ...

Thanks a lot for your help.

Kind regards
Andre
Sep 20 2011
Don't use length, use std.utf.count, a la:

    import std.utf;
    alias toUTFz!(const(wchar)*, string) toUTF16z;
    GetTextExtentPoint32W(dc, str.toUTF16z, std.utf.count(str), s);

I like to keep that alias for my code since I was already using it beforehand. I'm pretty sure (ok, maybe 80% sure) that GetTextExtentPoint32W asks for the count of characters and not code units. The WinAPI docs are a bit fuzzy when it comes to these things: some functions take the character count, others the code-unit count. I used this function in a D port of a Neatpad project a while ago.
Sep 20 2011
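The distinction between code units and code points that this post turns on can be checked directly. A small sketch with a made-up sample string containing a 1-byte, a 2-byte and a 4-byte UTF-8 sequence; it does not settle which count GetTextExtentPoint32W expects, it only shows that the three numbers differ.

```d
import std.conv : to;
import std.range : walkLength;
import std.utf : count;

void main()
{
    // 'a' (U+0061), 'é' (U+00E9) and a musical G clef (U+1D11E).
    string s = "a\u00E9\U0001D11E";
    wstring w = s.to!wstring;

    assert(s.length == 7);      // UTF-8 code units: 1 + 2 + 4
    assert(w.length == 4);      // UTF-16 code units: 1 + 1 + 2 (surrogate pair)
    assert(count(s) == 3);      // code points
    assert(walkLength(s) == 3); // code points as well
}
```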
On Tuesday, September 20, 2011 14:27 Andrej Mitrovic wrote:
> Don't use length, use std.utf.count, ala:
>
>     import std.utf;
>     alias toUTFz!(const(wchar)*, string) toUTF16z;
>     GetTextExtentPoint32W(dc, str.toUTF16z, std.utf.count(str), s);

Or std.range.walkLength. I don't really know why we have std.utf.count. It just calls walkLength anyway. I suspect that it's a function that predates walkLength and was made to use walkLength after walkLength was introduced. But it's kind of pointless now.

- Jonathan M Davis
Sep 20 2011
On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
> Or std.range.walkLength. I don't know why we really have std.utf.count. I just calls walkLength anyway. I suspect that it's a function that predates walkLength and was made to use walkLength after walkLength was introduced. But it's kind of pointless now.

I don't think having better-named aliases is a bad thing. Although now I see it's not just an alias but a function.

What exactly is the "static if (E.sizeof < 4)" in there for, btw? When would the element type exceed 4 bytes while still passing the isSomeChar contract, and then why not stop compilation at that point instead of returning "s.length"?
Sep 20 2011
One other thing, count can only take an array which seems too restrictive since walkLength can take any range at all. So maybe count should be just an alias to walkLength or it should possibly be removed (I'm against fully removing it because I already use it in code and I think the name does make sense).
Sep 20 2011
On Tuesday, September 20, 2011 14:43 Andrej Mitrovic wrote:
> I don't think having better-named aliases is a bad thing. Although now I'm seeing it's not just an alias but a function.

We specifically avoid having aliases in Phobos simply for having alternate function names. Aliases need to actually be useful, or they shouldn't be there.

> What exactly is the "static if (E.sizeof < 4)" in there for btw? When would the element type exceed 4 bytes while still passing the isSomeChar contract, and then why not stop compilation at that point instead of return "s.length"?

The static if is there to special-case narrow strings. It's unnecessary (though it does eliminate a function call when -inline isn't used). It would have been necessary prior to count just forwarding to walkLength, but it isn't now.

> One other thing, count can only take an array which seems too restrictive since walkLength can take any range at all. So maybe count should be just an alias to walkLength or it should possibly be removed (I'm against fully removing it because I already use it in code and I think the name does make sense).

I don't know if we're going to remove std.utf.count or not, but it _is_ the kind of thing that we've been removing. It doesn't add any real value. It's just another function which does exactly the same thing as walkLength except that it's restricted to strings, and we don't generally like having pointless aliases around (or pointless function wrappers, which amounts to pretty much the same thing).

So, it wouldn't surprise me at all if it goes away, but if/when it does, it'll go through the proper deprecation cycle rather than just being removed, so if/when we do that, it's not like your code would immediately break.

- Jonathan M Davis
Sep 20 2011
"Jonathan M Davis", in message (digitalmars.D.learn:29637), wrote:
> On Tuesday, September 20, 2011 14:43 Andrej Mitrovic wrote:
>> On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
>>> Or std.range.walkLength. I don't know why we really have std.utf.count. I just calls walkLength anyway. I suspect that it's a function that predates walkLength and was made to use walkLength after walkLength was introduced. But it's kind of pointless now.
>>
>> I don't think having better-named aliases is a bad thing. Although now I'm seeing it's not just an alias but a function.

std.utf.count has one advantage: someone looking for the function will find it. The programmer might not look in std.range to find a function about UTF strings, and even if he did, the documentation of walkLength does not say that it works with (narrow) strings the way it does. To know you can use walkLength, you must know:
- that popFront works differently on strings,
- that hasLength is not true for strings,
- what walkLength is.

So yes, you experienced programmers don't need std.utf.count, but newbies do.

Last point: walkLength is not optimized for strings. std.utf.count should be. This short implementation of count was 3 to 8 times faster than walkLength in a simple benchmark:

    size_t myCount(string text)
    {
        size_t n = text.length;
        for (uint i=0; i<text.length; ++i)
        {
            auto s = text[i]>>6;
            n -= (s>>1) - ((s+1)>>2);
        }
        return n;
    }

(compiled with gdc on 64 bits; the sample text was the introduction of the French Wikipedia UTF-8 article down to the table of contents - http://fr.wikipedia.org/wiki/UTF-8 ). The reason is that the loop can be unrolled by the compiler.
Sep 20 2011
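The arithmetic inside myCount is easier to trust once the per-byte term is tabulated. The sketch below isolates it in a helper named term (a name introduced here for illustration) and checks over all 256 byte values that it is 1 exactly for UTF-8 continuation bytes (bit pattern 10xxxxxx) and 0 otherwise, which is why subtracting it from text.length leaves the number of code points in a valid string.

```d
// Per-byte correction term from myCount. With s = b >> 6:
//   s = 0 or 1 (ASCII)        -> 0 - 0 = 0
//   s = 2      (continuation) -> 1 - 0 = 1
//   s = 3      (lead byte)    -> 1 - 1 = 0
int term(ubyte b)
{
    int s = b >> 6;
    return (s >> 1) - ((s + 1) >> 2);
}

void main()
{
    foreach (b; 0 .. 256)
    {
        bool continuation = (b & 0xC0) == 0x80;
        assert(term(cast(ubyte) b) == (continuation ? 1 : 0));
    }
}
```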
On 09/21/2011 01:57 AM, Christophe wrote:
> Last point: WalkLength is not optimized for strings. std.utf.count should be. This short implementation of count was 3 to 8 times faster than walkLength is a simple benchmark:
>
>     size_t myCount(string text)
>     {
>         size_t n = text.length;
>         for (uint i=0; i<text.length; ++i)
>         {
>             auto s = text[i]>>6;
>             n -= (s>>1) - ((s+1)>>2);
>         }
>         return n;
>     }
>
> (compiled with gdc on 64 bits, the sample text was the introduction of french wikipedia UTF-8 article down to the sommaire - http://fr.wikipedia.org/wiki/UTF-8 ). The reason is that the loop can be unrolled by the compiler.

Very good point, you might want to file an enhancement request. It would make the functionality different enough to prevent count from being removed: walkLength throws on an invalid UTF sequence.
Sep 20 2011
Timon Gehr, in message (digitalmars.D.learn:29641), wrote:
> Very good point, you might want to file an enhancement request. It would make the functionality different enough to prevent count from being removed: walkLength throws on an invalid UTF sequence.

I would be glad to do so, but I am quite new here, so I don't know how to. A little pointer could help.

--
Christophe
Sep 20 2011
On 09/21/2011 02:15 AM, Christophe wrote:
> I would be glad to do so, but I am quite new here, so I don't know how to. A little pointer could help.

http://d.puremagic.com/issues/

You can tick 'Severity: enhancement request'. It would probably be best if it threw when the final result is larger than text.length, though.
Sep 21 2011
On 21.09.2011 4:04, Timon Gehr wrote:
> Very good point, you might want to file an enhancement request. It would make the functionality different enough to prevent count from being removed: walkLength throws on an invalid UTF sequence.

Actually, I don't buy it. I guess the reason it's faster is that it doesn't check if the codepoint is valid. In fact you can easily get ridiculous overflowed "negative" lengths. Maybe we can put it here as an unsafe and fast version though.

Also check std.utf.stride to see if you can get it better, it's the beast behind narrow string popFront.

--
Dmitry Olshansky
Sep 21 2011
On 09/21/2011 12:37 PM, Dmitry Olshansky wrote:
> Actually, I don't buy it. I guess the reason it's faster is that it doesn't check if the codepoint is valid. In fact you can easily get ridiculous overflowed "negative" lengths.

Most of these could be caught by a final check. I think having the option of a version that is so much faster would be nice. Chances are pretty high that code actually manipulating the string will throw eventually if it is invalid.

> Maybe we can put it here as unsafe and fast version though. Also check std.utf.stride to see if you can get it better, it's the beast behind narrow string popFront.
Sep 21 2011
> Actually, I don't buy it. I guess the reason it's faster is that it doesn't check if the codepoint is valid.

Why should it? The documentation of std.utf.count says the string must be validly encoded, not that it will enforce that it is. Checking that a string is valid every time you use it would be very expensive. Actually, std.range.walkLength does not check that the sequence is valid. See this test:

    import std.stdio, std.range;

    void main()
    {
        string text = "aléluyah";
        char[] text2 = text.dup;
        text2[3] = 'a';
        writeln(walkLength(text2)); // outputs: 8
        writeln(text2); // outputs: al\303aluyah
    }

There is probably a way to check that a UTF sequence is valid with an unrollable loop.

> In fact you can easily get ridiculous overflowed "negative" lengths. Maybe we can put it here as unsafe and fast version though.

Unless I am mistaken, the minimum length myCount can return is 0 even if the string is invalid.

> Also check std.utf.stride to see if you can get it better, it's the beast behind narrow string popFront.

stride does not do much checking. It can even return 5 or 6, which is not possible for a valid UTF-8 string! The equivalent of myCount for stride would be:

    size_t myStride(char c)
    {
        // optional:
        // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5))
        //     throw new UtfException("Not the start of the UTF-8 sequence");
        return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
    }

which I compared to:

    size_t utfLikeStride(char c)
    {
        // optional:
        // immutable result = UTF8stride[c];
        // if (result == 0xFF)
        //     throw new UtfException("Not the start of the UTF-8 sequence");
        // return result;
        return UTF8stride[c];
    }

One table lookup is replaced by some byte arithmetic in myStride. I also took only one char as input, since stride only looks at the i-th character. Actually, if the stride signature is kept as "uint stride(char[] s, int i)", I did not find any change with -O3.

Average times for "a lot" of calls (compiled with gcc, tested with -O3 and a homogeneous distribution of "valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4'):
- myStride, no throws: 1112ms
- utfLikeStride, no throws: 1433ms
- utfLikeStride, throws: 1868ms (the current implementation)
- myStride, throws: 8269ms

Removing the throws from utfLikeStride makes it about 25% faster. Removing the throws from myStride makes it about 7 times faster. With -O0, myStride is less than 10% slower than utfLikeStride (no throws).

In conclusion, the fastest implementation is myStride without throws, and it beats the current implementation by about 40%. Changing std.utf.stride may be desirable. As I said earlier, the throws do not enforce the validity of the string. Really checking the validity of the string would cost much more, which may not be desirable, so why bother checking at all?

A more serious benchmark could justify changing std.utf.stride. The improvement could be even better in a real situation, because the lookup table of utfLikeStride may not always be at hand - this really depends on what the compiler does.

In any case, this may not improve walkLength by more than a few percent.

--
Christophe

now I'll go back to my real work...
Sep 21 2011
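As a sanity check on the arithmetic in myStride, the sketch below compares it against std.utf.stride on the lead byte of a 1-, 2-, 3- and 4-byte sequence. The sample string is made up, only valid input is covered, and no timing is attempted.

```d
import std.utf : stride;

// myStride as posted in the thread, without the optional check.
size_t myStride(char c)
{
    return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
}

void main()
{
    // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1D11E (4 bytes).
    string s = "a\u00E9\u20AC\U0001D11E";

    size_t i = 0;
    while (i < s.length)
    {
        // Both must agree on the length of the sequence starting at i.
        assert(myStride(s[i]) == stride(s, i));
        i += stride(s, i);
    }
    assert(i == s.length); // 1 + 2 + 3 + 4 = 10 bytes
}
```

The three shifted terms each contribute 1 once the top two, three or four bits of the byte are all set, which is how the result climbs from 1 to 4.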
On 21.09.2011 18:47, Christophe wrote:
> Why should it ? The documentation of std.utf.count says the string must be validly encoded, not that it will enforce that it is. Checking a string is valid everytime you use it would be very expensive. Actually, std.range.walkLength does not check the sequence is valid.

Ouch, the checking is apparently very loose.

> Unless I am mistaken, the minimum length myCount can return is 0 even if the string is invalid.

Yeah, a brain malfunction on my part.

> stride does not make much checking. It can even return 5 or 6, which is not possible for a valid utf-8 string !

I wonder what impact, if any, changing 0xff to 0x00 in the implementation of utfLikeStride would have. It should amount to cmp vs test, not sure if it matters much.

> In conclusion, the fastest implementation is myStride without throws, and it beats the current implementation by about 40%. Changing std.utf.stride may be desirable. As I said earlier, the throws do not enforce the validity of the string. Really checking the validity of the string would cost much more, which may not be desirable, so why bother checking at all?

The truth is I'd checked this in the past (though I used some bsr black magic) and if I kept the check in place the end result was always slower than the current one. But since the check is not very accurate anyway, maybe it can be replaced. It's problematic if some code happens to depend on it. (Given the doc, it should not.)

> A more serious benchmark could justify to change std.utf.stride. The improvement could be even better in real situation, because the lookup table of utfLikeStride may not be always at hand - this actually really depends on what the compiler does.

Yes and no, I think it would be hard to find an app that bottlenecks on traversing UTF; on decoding, maybe. Generally, if you make a lot of calls to stride it's in cache; if not, it doesn't matter much(?). Though I'd prefer the non-tabulated version.

> In any case, this may not improve walkLength by more than a few percents.

Then specializing walkLength to do your unrollable version seems like a good idea.

--
Dmitry Olshansky
Sep 21 2011
On 21.09.2011 01:57, Christophe wrote:
> size_t myCount(string text)
> {
>     size_t n = text.length;
>     for (uint i=0; i<text.length; ++i)
>     {
>         auto s = text[i]>>6;
>         n -= (s>>1) - ((s+1)>>2);
>     }
>     return n;
> }

Here is a more readable and a bit faster version on dmd windows:

    size_t utfCount(string text)
    {
        size_t n = 0;
        for (uint i=0; i<text.length; ++i)
            n += ((text[i]>>6)^0b10)? 1: 0;
        return n;
    }
Sep 21 2011
> Here is a more readable and a bit faster version on dmd windows:
>
>     size_t utfCount(string text)
>     {
>         size_t n = 0;
>         for (uint i=0; i<text.length; ++i)
>             n += ((text[i]>>6)^0b10)? 1: 0;
>         return n;
>     }

Nice. It is better with gdc on 64-bit Linux too. I wanted to avoid conditional expressions like ?: but it's actually slightly faster that way. And now people can't tell it is dangerous because it could return a fuzzy number.

Even faster, though less readable:

    size_t utfLength(string text)
    {
        size_t n=0;
        for (size_t i=0; i<text.length; ++i)
            n += (((text[i]>>6)^0b10) != 0);
        return n;
    }

Let's see how we can boost std.utf.stride that way...

--
Christophe
Sep 21 2011
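For valid UTF-8, counting every byte that is not a continuation byte should agree with walkLength. A minimal sketch of that check, reusing the utfCount idea from the posts above with a foreach over code units; the test strings are made up, and nothing here measures speed.

```d
import std.range : walkLength;

// Counts every byte whose top two bits are not 10, i.e. every byte that
// starts a code point in a valid UTF-8 string.
size_t utfCount(string text)
{
    size_t n = 0;
    foreach (char c; text)      // iterate code units, no decoding
        n += ((c >> 6) ^ 0b10) ? 1 : 0;
    return n;
}

void main()
{
    string text = "al\u00E9luyah";          // "aléluyah": 9 bytes, 8 code points
    assert(text.length == 9);
    assert(utfCount(text) == 8);
    assert(utfCount(text) == walkLength(text));
    assert(utfCount("") == 0);
}
```

Unlike myCount, this form starts from zero and only adds, so it cannot underflow on malformed input; it simply returns a count that no longer means "code points".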
On 21.09.2011 19:12, Christophe Travert wrote:
> Nice. It is better with gdc linux 64bits too. I wanted to avoid conditional expressions like ?: but it's actually slightly faster that way.

It is not compiled as a conditional jump.
Sep 21 2011
On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
> We specifically avoid having aliases in Phobos simply for having alternate function names. Aliases need to actually be useful, or they shouldn't be there.

And function names have to be useful to library users. walkLength is an awful name for something that returns the character count. If you ask a GUI developer to look for a function that creates a rectangle path, you can be sure he'll start looking for Rectangle or DrawRect or something similar, and not "ClosedShapePointN!4" or something that generic.
Sep 20 2011
On Tuesday, September 20, 2011 15:10 Andrej Mitrovic wrote:
> And function names have to be useful to library users. walkLength is an awful name for something that returns the character count. If you ask a GUI developer to look for a function that creates a rectangle path, you can be sure he'll start looking for Rectangle or DrawRect or something similar, and not "ClosedShapePointN!4" or something that generic.

In this case, if there's a problem it's not how generic the function is, it's the name walkLength. There's nothing special about strings which makes the name count better for them than it is for other ranges. The function is returning the number of elements in the range - be they code points or integers or whatever. The name walkLength works just as well for strings as it does for anything else. So, if there's a problem it's that the name walkLength isn't necessarily all that great. Strings aren't so special that they merit their own function name for the same functionality. So, if count stays, it's simply because it's been around for a while, not because it's inherently better to have a separate count function.

- Jonathan M Davis
Sep 20 2011