www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - regex direct support for sse4 intrinsics

reply "Jay Norwood" <jayn prismnet.com> writes:
The sse4 capabilities include a range mode of string matching,
that lets you match characters in a 16 byte string based on a 16
byte set of start and stop character ranges.  See the
_SIDD_CMP_RANGES mode in the table.


For example, the pattern in some of our examples for finding the
start of a word is a-zA-Z, and for other characters in the word
a-zA-Z0-9.  Either of these patterns could be tested for match on
a 16 byte input in a single operation in the sse4 engine.

http://msdn.microsoft.com/en-us/library/bb531465.aspx

Looking at the msft intrinsics, it seems like the D ones could be
more efficient and elegant looking using D slices, since they are
passing the string and length of string as separate parameters.

It would be good if the D regex processing could detect simple
range match patterns and use the sse4 extensions when available.

There is also an article from intel where they demo use of these
instructions for xml parsing.

http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/
Mar 26 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 26.03.2012 22:14, Jay Norwood wrote:
 The sse4 capabilities include a range mode of string matching,
 that lets you match characters in a 16 byte string based on a 16
 byte set of start and stop character ranges. See the
 _SIDD_CMP_RANGES mode in the table.


 For example, the pattern in some of our examples for finding the
 start of a word is a-zA-Z, and for other characters in the word
 a-zA-Z0-9. Either of these patterns could be tested for match on
 a 16 byte input in a single operation in the sse4 engine.

 http://msdn.microsoft.com/en-us/library/bb531465.aspx
Nice idea. Though I can assure you that the biggest problem is not in comparing things at all. The most time is spent accessing various lookup tables, saving/loading state, updating match location and so on. Especially so because of Unicode. Also add an overhead of UTF decoding into the mix (that I plan to avoid). To put it bluntly: R-T version spends around 20% of time doing actual matching at best (averaged, depends on pattern), half of this could be avoided by optimizing "framework" code. C-T version spends around 35% on on matching, but still the main inefficiency in the "framework" part of code. C-T matcher admittedly has very simple framework. Speaking more of run-time version of regex, it is essentially running a VM that executes instructions that do various kinds of match-this, match-that. The VM dispatch code is quite slow, the optimal _threaded_ code requires either doing it in _assembly_ or _computed_ goto in the language. The VM _dispatch_ takes up to 30% of time in the default matcher.
 Looking at the msft intrinsics, it seems like the D ones could be
 more efficient and elegant looking using D slices, since they are
 passing the string and length of string as separate parameters.

 It would be good if the D regex processing could detect simple
 range match patterns and use the sse4 extensions when available.

 There is also an article from intel where they demo use of these
 instructions for xml parsing.

 http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/
Worth looking into. -- Dmitry Olshansky
Mar 26 2012
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03/26/2012 09:10 PM, Dmitry Olshansky wrote:
 Speaking more of run-time version of regex, it is essentially running a
 VM that executes instructions that do various kinds of match-this,
 match-that. The VM dispatch code is quite slow, the optimal _threaded_
 code requires either doing it in _assembly_ or _computed_ goto in the
 language.
Language-enforced tail call optimization would probably work too.
Mar 26 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 27.03.2012 1:07, Timon Gehr wrote:
 On 03/26/2012 09:10 PM, Dmitry Olshansky wrote:
 Speaking more of run-time version of regex, it is essentially running a
 VM that executes instructions that do various kinds of match-this,
 match-that. The VM dispatch code is quite slow, the optimal _threaded_
 code requires either doing it in _assembly_ or _computed_ goto in the
 language.
Language-enforced tail call optimization would probably work too.
Indeed it would. I recall being close to that observation, when I pictured a feature like guaranteed non-return function call, a-la: void foo() { ... goto fn(); //can safely overwrite foo's stack frame //never gets here } kind of unix exec for functions. -- Dmitry Olshansky
Mar 27 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 27.03.2012 16:24, Dmitry Olshansky wrote:
 On 27.03.2012 1:07, Timon Gehr wrote:
 On 03/26/2012 09:10 PM, Dmitry Olshansky wrote:
 Speaking more of run-time version of regex, it is essentially running a
 VM that executes instructions that do various kinds of match-this,
 match-that. The VM dispatch code is quite slow, the optimal _threaded_
 code requires either doing it in _assembly_ or _computed_ goto in the
 language.
Language-enforced tail call optimization would probably work too.
Indeed it would. I recall being close to that observation, when I pictured a feature like guaranteed non-return function call, a-la: void foo() { ... goto fn(); //can safely overwrite foo's stack frame //never gets here } kind of unix exec for functions.
Come to think of it, I like my small feature more then full computed gotos. Because if properly implemented it doesn't violate memory safety. Namely add to the language new statement: goto Call-expr; To designate a switch-to call, or forced tail call. That must be semantically equivalent to: 1. Save arguments on stack. 2. Move them over current stackframe, overwriting it in the process. 3. Use jump instead of call, step 2 should place params as per ABI for function call. Steps 1-2 should be combined into one and optimized as necessary. Rationale: 1. It allows implementing threaded VM interpreter in safe D code, that's quite a feat. (VM is the main use case for computed gotos btw, but they are not safe) 2. More then that it provides the ability for programmer to make assumptions about code that relies on tail-call optimization. Previously such code would be either very fragile or only usable in release mode. 3. Avoids problematics of providing pointer type for labels. I can imagine it may be useful for parallel work stealing framework, and threading in general. Problem: C-style function call convention, namely the caller responsible for restoring the stack. Ways to solve: a) require callee cleanup convention for functions that are goto'ed. From what I see on D ABI page it works for extern(D) on win32 only(?):
 The callee cleans the stack.
b) provide some kind of hack to restore original stack pointer on eventual return, so that caller correctly cleans it up. -- Dmitry Olshansky
Mar 27 2012
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Dmitry Olshansky:

 Speaking more of run-time version of regex, it is essentially running a 
 VM that executes instructions that do various kinds of match-this, 
 match-that. The VM dispatch code is quite slow, the optimal _threaded_ 
 code requires either doing it in _assembly_ or _computed_ goto in the 
 language. The VM _dispatch_ takes up to 30% of time in the default matcher.
I have used computed gotos in GCC-C to implement some quite efficient finite state machines to be used in computational biology. I've seen 20%+ speedups compared to my alternative switch-based implementation. So I'd like computed gotos in D too. Both GCC and LLVM back-ends support computed gotos (despite the asm produced by LLVM on them is not as good as GCC one). If people feel the desire to add compiler-specific computed gotos to D, they will risk adding them with a different syntax on each present and future compiler. Even if dmd doesn't support computed gotos, defining a standard syntax (like it was done for years for vector operations) is a way to avoid that balkanization, and it will help LDC and GDC developers add this feature to their compilers. Bye, bearophile
Mar 27 2012
next sibling parent reply "Martin Nowak" <dawg dawgfoto.de> writes:
 Both GCC and LLVM back-ends support computed gotos (despite the asm  
 produced by LLVM on them is not as good as GCC one). If people feel the  
 desire to add compiler-specific computed gotos to D, they will risk  
 adding them with a different syntax on each present and future compiler.
What did you had in mind? The following would only require a minor syntax change. auto codeaddr = &Label; goto codeaddr + 0x10; GotoStatement: goto Identifier ; goto Expression ; // NEW goto default ; goto case ; goto case Expression ;
Mar 27 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
 GotoStatement:
http://dlang.org/statement.html#GotoStatement
Mar 27 2012
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Martin Nowak:

 What did you had in mind?
 
 The following would only require a minor syntax change.
 
 auto codeaddr = &Label;
 goto codeaddr + 0x10;
Something like this: Label1: //... Label2: //... Label3: //... enum void*[3] targs = [&Label1, &Label2, &Label3]; int i = 2; // run-time value goto targs[i]; See: http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html (I'd like to know why GCC uses &&Label instead of &Label). Bye, bearophile
Mar 27 2012
next sibling parent kraybourne <stdin kraybourne.com> writes:
On 3/27/12 13:37 , bearophile wrote:

 Something like this:

 Label1:
 //...
 Label2:
 //...
 Label3:
 //...

 enum void*[3] targs  = [&Label1,&Label2,&Label3];
 int i = 2; // run-time value
 goto targs[i];
+1, this would be sweet
Mar 27 2012
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
 enum void*[3] targs  = [&Label1, &Label2, &Label3];
 int i = 2; // run-time value
 goto targs[i];
Yeah, that would work with the extension.
Mar 27 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03/27/2012 01:37 PM, bearophile wrote:
 See:
 http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html
 (I'd like to know why GCC uses&&Label instead of&Label).
Because labels reside in their own namespace.
Mar 27 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Timon Gehr:

 On 03/27/2012 01:37 PM, bearophile wrote:
 See:
 http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html
 (I'd like to know why GCC uses&&Label instead of&Label).
Because labels reside in their own namespace.
I see. That's probably true in D too. But if possible I think the syntax with a single & will be enough/better in D. Bye, bearophile
Mar 27 2012
prev sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Martin Nowak:

 The following would only require a minor syntax change.

 auto codeaddr = &Label;
 goto codeaddr + 0x10;

 GotoStatement:
     goto Identifier ;
     goto Expression ; // NEW
     goto default ;
     goto case ;
     goto case Expression ;
You also need to support &Label, that currently is "Error: undefined identifier Label". Bye, bearophile
Mar 27 2012
parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 27 Mar 2012 17:01:51 +0200, bearophile <
 You also need to support &Label, that currently is "Error:
 undefined identifier Label".

 Bye,
 bearophile
That doesn't require a syntax addition. One would only need to lookup the label and detect naming collisions with other declarations.
Mar 27 2012
prev sibling parent reply "Tove" <tove fransson.se> writes:
On Tuesday, 27 March 2012 at 09:51:07 UTC, bearophile wrote:
 Dmitry Olshansky:

 Speaking more of run-time version of regex, it is essentially 
 running a VM that executes instructions that do various kinds 
 of match-this, match-that. The VM dispatch code is quite slow, 
 the optimal _threaded_ code requires either doing it in 
 _assembly_ or _computed_ goto in the language. The VM 
 _dispatch_ takes up to 30% of time in the default matcher.
I have used computed gotos in GCC-C to implement some quite efficient finite state machines to be used in computational biology. I've seen 20%+ speedups compared to my alternative switch-based implementation. So I'd like computed gotos in D too.
While I am in favor of all enhancements which improve low-level access, I'm very surprised by your findings by computed gotos... the compiler I am most used to(rvct for arm)... seems proficient in emitting jump table instructions(TBB, TBH) for thumb2... but based on your findings I will definitely re-check the generated asm. Could it be that the compiler "heuristics" simply is less than optimal... and an alternative would be to force a specific implementation with a pragma? or the recent annotation syntax... pragma(switch, "jumptable") pragma(switch, "binary-search-tree") it would have the benefit of not having to re-factor the code and one could easily benchmark which solution is the fastest for a different inputs...
Mar 27 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Tove:

 I'm very surprised by your findings by computed gotos... the 
 compiler I am most used to(rvct for arm)... seems proficient in 
 emitting jump table instructions(TBB, TBH) for thumb2... but 
 based on your findings I will definitely re-check the generated 
 asm.
You have also to take in account the differences between your code (and its usage) and my code that you haven't seen. I have used a state machines driven by char arrays in some complex ways. The state of the machine was encoded by the CPU program counter itself, with jumps around using normal gotos inside switches or computed gotos. If you want to see a minimal and self-contained but common use case for computed gotos, take a look at the solutions for the ICFP 2006 Programming Contest: http://schani.wordpress.com/2006/07/24/the-icfp-programming-contest-2006/ That links to: http://www.complang.tuwien.ac.at/schani/blog/icfp2006/icfp2006.tar.gz This was one of the fastest problem solutions (with D compile-time code generation skills plus a higher level encoding of the whole machine, such C++ program probably becomes quite shorter): http://codepad.org/5rTbNaQ1 The implementation of small interpreters, small virtual machines and finite state machines are common enough usages for a system language. Bye, bearophile
Mar 27 2012