digitalmars.D - New UTF-8 stride function
- Dmitry Olshansky (75/75) May 26 2013 If there is anything that come out of UTF-8 discussion is that I decided...
- Kiith-Sa (14/89) May 26 2013 WRT to the worse Linux64 case:
- Dmitry Olshansky (8/19) May 27 2013 Just tried it. Now I at least see that in 32bit my version is faster,
- Vladimir Panteleev (15/20) May 26 2013 For such cases, I found Agner's benchmarking utilities to be very
- Kiith-Sa (8/8) May 26 2013 You don't need any instrumentation for perf and can get similar
- Vladimir Panteleev (5/13) May 26 2013 Benchmarking and profiling are slightly different tasks. I've
- Dmitry Olshansky (7/25) May 27 2013 Yes, Agner is da man. Just hoped I could postpone this but... welcome to...
- Juan Manuel Cabo (42/46) May 26 2013 Ok, I just tested on my old trusty linux 64bit (Ubuntu 12.04).
- Juan Manuel Cabo (25/25) May 26 2013 And these are the results for the same linux 64bit system but
- Dmitry Olshansky (6/31) May 27 2013 This is mostly in agreement with what I have on my 4-core AMD Phenom.
- Martin Nowak (8/21) May 27 2013 https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-...
- Dmitry Olshansky (11/26) May 27 2013 Cool, I'm not alone in this :)
- Martin Nowak (4/8) May 27 2013 This will not detect 0xFF as invalid UTF-8 sequence.
- Dmitry Olshansky (13/21) May 28 2013 First of all there is a minor bug in std.utf in a sense that it accepts
If there is anything that come out of UTF-8 discussion is that I decided to dust off my experimental implementation of UTF-8 stride function. Just for fun. The key difference vs std is in handling non-ASCII case. I'm replacing bsr intrinsic with a what I call an "in-register lookup table" (neat stuff that is a piece of cake, thx to CTFE). See unittest/benchmark here: https://gist.github.com/blackwhale/5653927 I'm running tests against wiki titles dumps. For me the results are mixed but in 2 of 3 _builds_ my version consistently wins (sometimes by as much as 50%). 1. build is win32 dmd with -release -O -inline -noboundscheck my version is consistently faster (results below) 2. On linux x64 with the same config my version is consistently slower 3. LDC on linux x64 with ldc2 -O3 -d-noboundscheck my version is again faster with large margin It's the kind of thing that is tremendously hard to measure accurately since it depends on the workload, architecture and the time spent is very small. So don't take it by word I'm almost certain that something is amiss (compiler switches and whatnot). Thus I encourage curious folks to measure/analyze it and report back (don't forget to include your processor model). The unbeatable advantage is however that my version doesn't require bsr/lzcount instruction :) BTW do ARM/PowerPC have any analog of it? Test files I used: https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/ruwiki-latest-all-titles-in-ns0 Some dumps of my test runs win32 runs (time taken in usec): fast_stride ruwiki-latest-all-titles-in-ns0 stride 313756 myStride 229650 myStride 235091 stride 312563 fast_stride enwiki-latest-all-titles-in-ns0 stride 346577 myStride 279915 myStride 278684 stride 348902 fast_stride enwiki-latest-all-titles-in-ns0 stride 345866 myStride 280902 myStride 279780 stride 345653 fast_stride arwiki-latest-all-titles-in-ns0 stride 46548 myStride 33840 myStride 34959 stride 46342 fast_stride dewiki-latest-all-titles-in-ns0 stride 79715 myStride 64719 myStride 64672 stride 79848 dmd linux 64 runs ./fast_stride enwiki-latest-all-titles-in-ns0 stride 377258 myStride 630367 myStride 633262 stride 378523 ./fast_stride arwiki-latest-all-titles-in-ns0 stride 33924 myStride 38807 myStride 47708 stride 40160 ./fast_stride arwiki-latest-all-titles-in-ns0 stride 35110 myStride 39750 myStride 49942 stride 33597 -- Dmitry Olshansky
May 26 2013
WRT to the worse Linux64 case: I recommend infinite-cycling it and testing in perf top. (If you're on Ubuntu/derivative or maybe Debian, just type "perf top", it will tell you what package to install, and once installed, "perf top" again, while the benchmark is running) You'll get a precise real-time line-wise (with ability to drill down to ASM) profile (like "top", but for functions). With some command-line options (google "linux perf"), you can also look at cache misses, branch mispredictions, and so on. Compare that with the original version and you might find why it's slower. (Don't have time to test anything right now) On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:If there is anything that come out of UTF-8 discussion is that I decided to dust off my experimental implementation of UTF-8 stride function. Just for fun. The key difference vs std is in handling non-ASCII case. I'm replacing bsr intrinsic with a what I call an "in-register lookup table" (neat stuff that is a piece of cake, thx to CTFE). See unittest/benchmark here: https://gist.github.com/blackwhale/5653927 I'm running tests against wiki titles dumps. For me the results are mixed but in 2 of 3 _builds_ my version consistently wins (sometimes by as much as 50%). 1. build is win32 dmd with -release -O -inline -noboundscheck my version is consistently faster (results below) 2. On linux x64 with the same config my version is consistently slower 3. LDC on linux x64 with ldc2 -O3 -d-noboundscheck my version is again faster with large margin It's the kind of thing that is tremendously hard to measure accurately since it depends on the workload, architecture and the time spent is very small. So don't take it by word I'm almost certain that something is amiss (compiler switches and whatnot). Thus I encourage curious folks to measure/analyze it and report back (don't forget to include your processor model). The unbeatable advantage is however that my version doesn't require bsr/lzcount instruction :) BTW do ARM/PowerPC have any analog of it? Test files I used: https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0 https://github.com/blackwhale/gsoc-bench-2012/blob/master/ruwiki-latest-all-titles-in-ns0 Some dumps of my test runs win32 runs (time taken in usec): fast_stride ruwiki-latest-all-titles-in-ns0 stride 313756 myStride 229650 myStride 235091 stride 312563 fast_stride enwiki-latest-all-titles-in-ns0 stride 346577 myStride 279915 myStride 278684 stride 348902 fast_stride enwiki-latest-all-titles-in-ns0 stride 345866 myStride 280902 myStride 279780 stride 345653 fast_stride arwiki-latest-all-titles-in-ns0 stride 46548 myStride 33840 myStride 34959 stride 46342 fast_stride dewiki-latest-all-titles-in-ns0 stride 79715 myStride 64719 myStride 64672 stride 79848 dmd linux 64 runs ./fast_stride enwiki-latest-all-titles-in-ns0 stride 377258 myStride 630367 myStride 633262 stride 378523 ./fast_stride arwiki-latest-all-titles-in-ns0 stride 33924 myStride 38807 myStride 47708 stride 40160 ./fast_stride arwiki-latest-all-titles-in-ns0 stride 35110 myStride 39750 myStride 49942 stride 33597
May 26 2013
27-May-2013 01:04, Kiith-Sa пишет:WRT to the worse Linux64 case: I recommend infinite-cycling it and testing in perf top.(If you're on Ubuntu/derivative or maybe Debian, just type "perf top", it will tell you what package to install, and once installed, "perf top" again, while the benchmark is running) You'll get a precise real-time line-wise (with ability to drill down to ASM) profile (like "top", but for functions). With some command-line options (google "linux perf"), you can also look at cache misses, branch mispredictions, and so on. Compare that with the original version and you might find why it's slower. (Don't have time to test anything right now)Just tried it. Now I at least see that in 32bit my version is faster, whereas on 64bit it isn't (that is on DMD). One curiosity is that the code for ASCII case is the same yet even on English text the difference is about the same. Another one is that both function are not even partially inlined. -- Dmitry Olshansky
May 27 2013
On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:It's the kind of thing that is tremendously hard to measure accurately since it depends on the workload, architecture and the time spent is very small. So don't take it by word I'm almost certain that something is amiss (compiler switches and whatnot).For such cases, I found Agner's benchmarking utilities to be very useful. They print exact CPU statistics, such as numbers of micro-ops, cache misses, mispredicted branches, etc. I've used them very successfully when tuning my appender implementation. To use them with D, I modified his C++ program to load a DLL and call a function, taking the DLL and function names from the command line. Original program: http://www.agner.org/optimize/testp.zip My patch (to load a DLL): http://dump.thecybershadow.net/5f55e8be5f8cd38ad60f218957ef24bb/PMCTestB.diff Usage example (sort of): https://github.com/CyberShadow/DAppenderResearch/blob/master/go-dll.bat Hope this helps :)
May 26 2013
You don't need any instrumentation for perf and can get similar output (even in real time). (Don't intend to start a profiler war, but recommend looking at perf before messings with DLLs and the like) (although perf is is Linux-only) I think the Windows version of AMD CodeAnalyst might have similar features, again without instrumentation, but never tried it (CodeAnalyst is gratis). Intel VTune definitely does, but it's not gratis.
May 26 2013
On Sunday, 26 May 2013 at 21:19:12 UTC, Kiith-Sa wrote:You don't need any instrumentation for perf and can get similar output (even in real time). (Don't intend to start a profiler war, but recommend looking at perf before messings with DLLs and the like) (although perf is is Linux-only) I think the Windows version of AMD CodeAnalyst might have similar features, again without instrumentation, but never tried it (CodeAnalyst is gratis). Intel VTune definitely does, but it's not gratis.Benchmarking and profiling are slightly different tasks. I've used AMD CodeAnalyst as well (although you need an AMD processor to use all of its features), but no profiler beats running one batch file and instantly getting a number to compare against.
May 26 2013
27-May-2013 01:13, Vladimir Panteleev пишет:On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:Yes, Agner is da man. Just hoped I could postpone this but... welcome to the micro-optimization world I guess.It's the kind of thing that is tremendously hard to measure accurately since it depends on the workload, architecture and the time spent is very small. So don't take it by word I'm almost certain that something is amiss (compiler switches and whatnot).For such cases, I found Agner's benchmarking utilities to be very useful. They print exact CPU statistics, such as numbers of micro-ops, cache misses, mispredicted branches, etc. I've used them very successfully when tuning my appender implementation.To use them with D, I modified his C++ program to load a DLL and call a function, taking the DLL and function names from the command line. Original program: http://www.agner.org/optimize/testp.zip My patch (to load a DLL): http://dump.thecybershadow.net/5f55e8be5f8cd38ad60f218957ef24bb/PMCTestB.diff Usage example (sort of): https://github.com/CyberShadow/DAppenderResearch/blob/master/go-dll.bat Hope this helps :)Thanks, I'll try it out. But jezz I have win8 so I'd start with linux version :) -- Dmitry Olshansky
May 27 2013
On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:[..] Thus I encourage curious folks to measure/analyze it and report back (don't forget to include your processor model). [..]Ok, I just tested on my old trusty linux 64bit (Ubuntu 12.04). I had to download those *wiki* files from this url: https://github.com/blackwhale/gsoc-bench-2012/archive/master.zip because github gave an error trying to download the ones that are more than 10Mb. I ran the tests multiple times before copying them here. And here are the results: $ uname -a UTC 2013 x86_64 x86_64 x86_64 GNU/Linux $ grep 'model.name\|cpu.MHz' /proc/cpuinfo model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ cpu MHz : 2109.518 model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ cpu MHz : 2109.518 $ grep MemTotal /proc/meminfo MemTotal: 4049780 kB $ dmd -O -inline -release -noboundscheck fast_stride.d $ for a in *wiki*; do echo; echo $a: ; ./fast_stride $a; done arwiki-latest-all-titles-in-ns0: stride 67681 myStride 55908 myStride 66026 stride 66328 dewiki-latest-all-titles-in-ns0: stride 155071 myStride 195449 myStride 196586 stride 154627 enwiki-latest-all-titles-in-ns0: stride 688482 myStride 879950 myStride 879451 stride 689087 ruwiki-latest-all-titles-in-ns0: stride 449133 myStride 364808 myStride 512485 stride 448841 --jm
May 26 2013
And these are the results for the same linux 64bit system but compiling with -m32: $ dmd -m32 -O -inline -release -noboundscheck fast_stride.d $ for a in *wiki*; do echo ; echo $a: ; ./fast_stride $a; done arwiki-latest-all-titles-in-ns0: stride 89362 myStride 49974 myStride 51140 stride 88308 dewiki-latest-all-titles-in-ns0: stride 138381 myStride 116971 myStride 116662 stride 139681 enwiki-latest-all-titles-in-ns0: stride 584787 myStride 490681 myStride 490909 stride 584694 ruwiki-latest-all-titles-in-ns0: stride 585372 myStride 333905 myStride 341274 stride 585050 --jm
May 26 2013
27-May-2013 01:50, Juan Manuel Cabo пишет:And these are the results for the same linux 64bit system but compiling with -m32:This is mostly in agreement with what I have on my 4-core AMD Phenom. About the same on Core i5-3550 at work. Looks like I indeed need 'clock for clock' analysis to draw any conclusion.$ dmd -m32 -O -inline -release -noboundscheck fast_stride.d $ for a in *wiki*; do echo ; echo $a: ; ./fast_stride $a; done arwiki-latest-all-titles-in-ns0: stride 89362 myStride 49974 myStride 51140 stride 88308 dewiki-latest-all-titles-in-ns0: stride 138381 myStride 116971 myStride 116662 stride 139681 enwiki-latest-all-titles-in-ns0: stride 584787 myStride 490681 myStride 490909 stride 584694 ruwiki-latest-all-titles-in-ns0: stride 585372 myStride 333905 myStride 341274 stride 585050 --jm-- Dmitry Olshansky
May 27 2013
On 05/26/2013 10:49 PM, Dmitry Olshansky wrote:If there is anything that come out of UTF-8 discussion is that I decided to dust off my experimental implementation of UTF-8 stride function. Just for fun. The key difference vs std is in handling non-ASCII case. I'm replacing bsr intrinsic with a what I call an "in-register lookup table" (neat stuff that is a piece of cake, thx to CTFE). See unittest/benchmark here: https://gist.github.com/blackwhale/5653927Looks promising.Test files I used:https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-all-titles-in-ns0https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0https://github.com/blackwhale/gsoc-bench-2012/blob/master/ruwiki-latest-all-titles-in-ns0These are huge and most likely the performance is limited by the memory bandwith.
May 27 2013
27-May-2013 23:21, Martin Nowak пишет:On 05/26/2013 10:49 PM, Dmitry Olshansky wrote: > If there is anything that come out of UTF-8 discussion is that I decided > to dust off my experimental implementation of UTF-8 stride function. > Just for fun. > > The key difference vs std is in handling non-ASCII case. > I'm replacing bsr intrinsic with a what I call an "in-register lookup > table" (neat stuff that is a piece of cake, thx to CTFE). > > See unittest/benchmark here: > https://gist.github.com/blackwhale/5653927 > Looks promising.Cool, I'm not alone in this :) The only definitive results so far is that it takes less cycles on 32 bit. For me AMD CodeAnalyst confirms this is literally in cycles of up to 33% less with smaller samples in a loop. ASCII-only case seems to stay more or less the same (at least cycle-wise but not in time...) saving my sanity.These are huge and most likely the performance is limited by the memory bandwith.That could be it. I'll be making measurement on smaller samples of said files and spin on them. More tests to come tomorrow. -- Dmitry Olshansky
May 27 2013
On 05/27/2013 09:21 PM, Martin Nowak wrote:> See unittest/benchmark here: > https://gist.github.com/blackwhale/5653927 > Looks promising.This will not detect 0xFF as invalid UTF-8 sequence. For sequences with 5 or 6 bytes, that aren't used for unicode, it will return a stride of 4.
May 27 2013
28-May-2013 00:42, Martin Nowak пишет:On 05/27/2013 09:21 PM, Martin Nowak wrote:First of all there is a minor bug in std.utf in a sense that it accepts sequences of 5 and 6 bytes. They are simply explicitly not defined per Unicode standard and should throw invalid UTF as well. OK I just need to consider the next bit making the whole mask 4bits wide. Thus I need 16 slots in a register. 64bit version will fit just fine in a register 4*16 = 64. 32bit version will have to go with packing 2bits per slot and doing +1 afterwards. Here is an updated version that I'm testing again: https://github.com/blackwhale/gsoc-bench-2012/blob/master/fast_stride.d -- Dmitry Olshansky> See unittest/benchmark here: > https://gist.github.com/blackwhale/5653927 > Looks promising.This will not detect 0xFF as invalid UTF-8 sequence. For sequences with 5 or 6 bytes, that aren't used for unicode, it will return a stride of 4.
May 28 2013