digitalmars.D - regex direct support for sse4 intrinsics
- Jay Norwood (17/17) Mar 26 2012 The sse4 capabilities include a range mode of string matching,
- Dmitry Olshansky (21/38) Mar 26 2012 Nice idea.
- Timon Gehr (2/7) Mar 26 2012 Language-enforced tail call optimization would probably work too.
- Dmitry Olshansky (13/21) Mar 27 2012 Indeed it would.
- Dmitry Olshansky (31/52) Mar 27 2012 Come to think of it, I like my small feature more then full computed
- bearophile (5/10) Mar 27 2012 I have used computed gotos in GCC-C to implement some quite efficient fi...
- Martin Nowak (10/14) Mar 27 2012 What did you had in mind?
- Martin Nowak (1/2) Mar 27 2012 http://dlang.org/statement.html#GotoStatement
- bearophile (17/23) Mar 27 2012 Something like this:
- kraybourne (2/12) Mar 27 2012 +1, this would be sweet
- Martin Nowak (1/4) Mar 27 2012 Yeah, that would work with the extension.
- Timon Gehr (2/5) Mar 27 2012 Because labels reside in their own namespace.
- bearophile (6/12) Mar 27 2012 I see. That's probably true in D too.
- bearophile (5/14) Mar 27 2012 You also need to support &Label, that currently is "Error:
- Martin Nowak (4/8) Mar 27 2012 That doesn't require a syntax addition.
- Tove (14/26) Mar 27 2012 While I am in favor of all enhancements which improve low-level
- bearophile (23/28) Mar 27 2012 You have also to take in account the differences between your
The sse4 capabilities include a range mode of string matching, that lets you match characters in a 16 byte string based on a 16 byte set of start and stop character ranges. See the _SIDD_CMP_RANGES mode in the table. For example, the pattern in some of our examples for finding the start of a word is a-zA-Z, and for other characters in the word a-zA-Z0-9. Either of these patterns could be tested for match on a 16 byte input in a single operation in the sse4 engine. http://msdn.microsoft.com/en-us/library/bb531465.aspx Looking at the msft intrinsics, it seems like the D ones could be more efficient and elegant looking using D slices, since they are passing the string and length of string as separate parameters. It would be good if the D regex processing could detect simple range match patterns and use the sse4 extensions when available. There is also an article from intel where they demo use of these instructions for xml parsing. http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/
Mar 26 2012
On 26.03.2012 22:14, Jay Norwood wrote:The sse4 capabilities include a range mode of string matching, that lets you match characters in a 16 byte string based on a 16 byte set of start and stop character ranges. See the _SIDD_CMP_RANGES mode in the table. For example, the pattern in some of our examples for finding the start of a word is a-zA-Z, and for other characters in the word a-zA-Z0-9. Either of these patterns could be tested for match on a 16 byte input in a single operation in the sse4 engine. http://msdn.microsoft.com/en-us/library/bb531465.aspxNice idea. Though I can assure you that the biggest problem is not in comparing things at all. The most time is spent accessing various lookup tables, saving/loading state, updating match location and so on. Especially so because of Unicode. Also add an overhead of UTF decoding into the mix (that I plan to avoid). To put it bluntly: R-T version spends around 20% of time doing actual matching at best (averaged, depends on pattern), half of this could be avoided by optimizing "framework" code. C-T version spends around 35% on on matching, but still the main inefficiency in the "framework" part of code. C-T matcher admittedly has very simple framework. Speaking more of run-time version of regex, it is essentially running a VM that executes instructions that do various kinds of match-this, match-that. The VM dispatch code is quite slow, the optimal _threaded_ code requires either doing it in _assembly_ or _computed_ goto in the language. The VM _dispatch_ takes up to 30% of time in the default matcher.Looking at the msft intrinsics, it seems like the D ones could be more efficient and elegant looking using D slices, since they are passing the string and length of string as separate parameters. It would be good if the D regex processing could detect simple range match patterns and use the sse4 extensions when available. There is also an article from intel where they demo use of these instructions for xml parsing. http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/Worth looking into. -- Dmitry Olshansky
Mar 26 2012
On 03/26/2012 09:10 PM, Dmitry Olshansky wrote:Speaking more of run-time version of regex, it is essentially running a VM that executes instructions that do various kinds of match-this, match-that. The VM dispatch code is quite slow, the optimal _threaded_ code requires either doing it in _assembly_ or _computed_ goto in the language.Language-enforced tail call optimization would probably work too.
Mar 26 2012
On 27.03.2012 1:07, Timon Gehr wrote:On 03/26/2012 09:10 PM, Dmitry Olshansky wrote:Indeed it would. I recall being close to that observation, when I pictured a feature like guaranteed non-return function call, a-la: void foo() { ... goto fn(); //can safely overwrite foo's stack frame //never gets here } kind of unix exec for functions. -- Dmitry OlshanskySpeaking more of run-time version of regex, it is essentially running a VM that executes instructions that do various kinds of match-this, match-that. The VM dispatch code is quite slow, the optimal _threaded_ code requires either doing it in _assembly_ or _computed_ goto in the language.Language-enforced tail call optimization would probably work too.
Mar 27 2012
On 27.03.2012 16:24, Dmitry Olshansky wrote:On 27.03.2012 1:07, Timon Gehr wrote:Come to think of it, I like my small feature more then full computed gotos. Because if properly implemented it doesn't violate memory safety. Namely add to the language new statement: goto Call-expr; To designate a switch-to call, or forced tail call. That must be semantically equivalent to: 1. Save arguments on stack. 2. Move them over current stackframe, overwriting it in the process. 3. Use jump instead of call, step 2 should place params as per ABI for function call. Steps 1-2 should be combined into one and optimized as necessary. Rationale: 1. It allows implementing threaded VM interpreter in safe D code, that's quite a feat. (VM is the main use case for computed gotos btw, but they are not safe) 2. More then that it provides the ability for programmer to make assumptions about code that relies on tail-call optimization. Previously such code would be either very fragile or only usable in release mode. 3. Avoids problematics of providing pointer type for labels. I can imagine it may be useful for parallel work stealing framework, and threading in general. Problem: C-style function call convention, namely the caller responsible for restoring the stack. Ways to solve: a) require callee cleanup convention for functions that are goto'ed. From what I see on D ABI page it works for extern(D) on win32 only(?):On 03/26/2012 09:10 PM, Dmitry Olshansky wrote:Indeed it would. I recall being close to that observation, when I pictured a feature like guaranteed non-return function call, a-la: void foo() { ... goto fn(); //can safely overwrite foo's stack frame //never gets here } kind of unix exec for functions.Speaking more of run-time version of regex, it is essentially running a VM that executes instructions that do various kinds of match-this, match-that. The VM dispatch code is quite slow, the optimal _threaded_ code requires either doing it in _assembly_ or _computed_ goto in the language.Language-enforced tail call optimization would probably work too.The callee cleans the stack.b) provide some kind of hack to restore original stack pointer on eventual return, so that caller correctly cleans it up. -- Dmitry Olshansky
Mar 27 2012
Dmitry Olshansky:Speaking more of run-time version of regex, it is essentially running a VM that executes instructions that do various kinds of match-this, match-that. The VM dispatch code is quite slow, the optimal _threaded_ code requires either doing it in _assembly_ or _computed_ goto in the language. The VM _dispatch_ takes up to 30% of time in the default matcher.I have used computed gotos in GCC-C to implement some quite efficient finite state machines to be used in computational biology. I've seen 20%+ speedups compared to my alternative switch-based implementation. So I'd like computed gotos in D too. Both GCC and LLVM back-ends support computed gotos (despite the asm produced by LLVM on them is not as good as GCC one). If people feel the desire to add compiler-specific computed gotos to D, they will risk adding them with a different syntax on each present and future compiler. Even if dmd doesn't support computed gotos, defining a standard syntax (like it was done for years for vector operations) is a way to avoid that balkanization, and it will help LDC and GDC developers add this feature to their compilers. Bye, bearophile
Mar 27 2012
Both GCC and LLVM back-ends support computed gotos (despite the asm produced by LLVM on them is not as good as GCC one). If people feel the desire to add compiler-specific computed gotos to D, they will risk adding them with a different syntax on each present and future compiler.What did you had in mind? The following would only require a minor syntax change. auto codeaddr = &Label; goto codeaddr + 0x10; GotoStatement: goto Identifier ; goto Expression ; // NEW goto default ; goto case ; goto case Expression ;
Mar 27 2012
GotoStatement:http://dlang.org/statement.html#GotoStatement
Mar 27 2012
Martin Nowak:What did you had in mind? The following would only require a minor syntax change. auto codeaddr = &Label; goto codeaddr + 0x10;Something like this: Label1: //... Label2: //... Label3: //... enum void*[3] targs = [&Label1, &Label2, &Label3]; int i = 2; // run-time value goto targs[i]; See: http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html (I'd like to know why GCC uses &&Label instead of &Label). Bye, bearophile
Mar 27 2012
On 3/27/12 13:37 , bearophile wrote:Something like this: Label1: //... Label2: //... Label3: //... enum void*[3] targs = [&Label1,&Label2,&Label3]; int i = 2; // run-time value goto targs[i];+1, this would be sweet
Mar 27 2012
enum void*[3] targs = [&Label1, &Label2, &Label3]; int i = 2; // run-time value goto targs[i];Yeah, that would work with the extension.
Mar 27 2012
On 03/27/2012 01:37 PM, bearophile wrote:See: http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html (I'd like to know why GCC uses&&Label instead of&Label).Because labels reside in their own namespace.
Mar 27 2012
Timon Gehr:On 03/27/2012 01:37 PM, bearophile wrote:I see. That's probably true in D too. But if possible I think the syntax with a single & will be enough/better in D. Bye, bearophileSee: http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html (I'd like to know why GCC uses&&Label instead of&Label).Because labels reside in their own namespace.
Mar 27 2012
Martin Nowak:The following would only require a minor syntax change. auto codeaddr = &Label; goto codeaddr + 0x10; GotoStatement: goto Identifier ; goto Expression ; // NEW goto default ; goto case ; goto case Expression ;You also need to support &Label, that currently is "Error: undefined identifier Label". Bye, bearophile
Mar 27 2012
On Tue, 27 Mar 2012 17:01:51 +0200, bearophile <You also need to support &Label, that currently is "Error: undefined identifier Label". Bye, bearophileThat doesn't require a syntax addition. One would only need to lookup the label and detect naming collisions with other declarations.
Mar 27 2012
On Tuesday, 27 March 2012 at 09:51:07 UTC, bearophile wrote:Dmitry Olshansky:While I am in favor of all enhancements which improve low-level access, I'm very surprised by your findings by computed gotos... the compiler I am most used to(rvct for arm)... seems proficient in emitting jump table instructions(TBB, TBH) for thumb2... but based on your findings I will definitely re-check the generated asm. Could it be that the compiler "heuristics" simply is less than optimal... and an alternative would be to force a specific implementation with a pragma? or the recent annotation syntax... pragma(switch, "jumptable") pragma(switch, "binary-search-tree") it would have the benefit of not having to re-factor the code and one could easily benchmark which solution is the fastest for a different inputs...Speaking more of run-time version of regex, it is essentially running a VM that executes instructions that do various kinds of match-this, match-that. The VM dispatch code is quite slow, the optimal _threaded_ code requires either doing it in _assembly_ or _computed_ goto in the language. The VM _dispatch_ takes up to 30% of time in the default matcher.I have used computed gotos in GCC-C to implement some quite efficient finite state machines to be used in computational biology. I've seen 20%+ speedups compared to my alternative switch-based implementation. So I'd like computed gotos in D too.
Mar 27 2012
Tove:I'm very surprised by your findings by computed gotos... the compiler I am most used to(rvct for arm)... seems proficient in emitting jump table instructions(TBB, TBH) for thumb2... but based on your findings I will definitely re-check the generated asm.You have also to take in account the differences between your code (and its usage) and my code that you haven't seen. I have used a state machines driven by char arrays in some complex ways. The state of the machine was encoded by the CPU program counter itself, with jumps around using normal gotos inside switches or computed gotos. If you want to see a minimal and self-contained but common use case for computed gotos, take a look at the solutions for the ICFP 2006 Programming Contest: http://schani.wordpress.com/2006/07/24/the-icfp-programming-contest-2006/ That links to: http://www.complang.tuwien.ac.at/schani/blog/icfp2006/icfp2006.tar.gz This was one of the fastest problem solutions (with D compile-time code generation skills plus a higher level encoding of the whole machine, such C++ program probably becomes quite shorter): http://codepad.org/5rTbNaQ1 The implementation of small interpreters, small virtual machines and finite state machines are common enough usages for a system language. Bye, bearophile
Mar 27 2012