www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - try find the fastest way to convert a group string into index?

reply learnfirst1 <learnfirst1 gmail.com> writes:
The use case is translate http header key into enum.

this is the current code : https://run.dlang.io/is/lpw29w

In this fake code I only list a few headers,  it should be more.  
but less then 128 and only include the a few codes.

how to make this more fast like with one loop and get the results.

I am think of use PCRE compile for this, but not sure how.

It should also save the non default headers in to a new array, so 
when search non default headers will reduce the total size.

I am plan to write a more fast http server with D betterC compare 
to vibed.
Sep 16 2018
next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Sunday, 16 September 2018 at 10:04:09 UTC, learnfirst1 wrote:
 how to make this more fast like with one loop and get the 
 results.
This is a more general problem than any specific programming language; you may want to look into perfect hashing: https://en.wikipedia.org/wiki/Perfect_hash_function There exists software to generate perfect hash functions, e.g. GNU gperf: https://www.gnu.org/software/gperf/ It looks like Basile wrote one for D: https://github.com/BBasile/IsItThere If you don't want to go that far, you can generate a string switch using string mixins: void main() { static immutable string[] keywords = ["foo", "bar", "baz"]; enum Keyword { foo, bar, baz } Keyword parseKeyword(string s) { switch (s) { mixin({ string code; foreach (keyword; keywords) code ~= `case "` ~ keyword ~ `": return Keyword.` ~ keyword ~ `;`; return code; }()); default: throw new Exception("No such keyword!"); } } assert(parseKeyword("bar") == Keyword.bar); }
Sep 16 2018
parent reply learnfirst1 <learnfirst1 gmail.com> writes:
On Sunday, 16 September 2018 at 10:14:24 UTC, Vladimir Panteleev 
wrote:
 On Sunday, 16 September 2018 at 10:04:09 UTC, learnfirst1 wrote:
 how to make this more fast like with one loop and get the 
 results.
thanks for reply, minimal perfect hashing seems good for this task, I will do more test for confirm. I try to storage the value into a static array, so when diff middleware try access the key it will reused the array offset without need to run hash on key string. (all middleware can access the key by static enum name). with Perfect Hash Function, I can translate the hash number into static enum and use them. The code also need provide string get(string key) method, in this case the header key may not on the static preset arrays, in this case it should search a smaller array (save on the parse process). with this solution I has to run hash for each of request headers , with less than 127 items (maybe around 16 at max) I am not sure it is fast then direct string compare. for example: foreach(int header_index, ref header; new_request.headers) { int hash = mpfh(header.key); auto enum_index = static_eum[ hash ]; new_request.header[enum_index] = header_index ; } the simple static switch is not the best way as I know here. with compiled PCRE match it will "Context-Type", "Context-Encoding" in once ( and maybe there is some SSE4 to pass 16 byte for each loop).
 Diederik de Groot
please ignore the Header.data assign part. which just for the demo. in the real app it will only parse once and storage in pre -alloc request/response object.
Sep 16 2018
parent reply learnfirst1 <learnfirst1 gmail.com> writes:
On Sunday, 16 September 2018 at 11:30:11 UTC, learnfirst1 wrote:
 On Sunday, 16 September 2018 at 10:14:24 UTC, Vladimir 
 Panteleev wrote:
It has to be case Case Insensitive, so before I run mpfh for each new request headers, I need to convert the keys into low or up case. this value can be storage on a stack local tmp cache. and I has to run it on the get(string key) method again.
Sep 16 2018
parent learnfirst1 <learnfirst1 gmail.com> writes:
On Sunday, 16 September 2018 at 11:51:24 UTC, learnfirst1 wrote:
 On Sunday, 16 September 2018 at 11:30:11 UTC, learnfirst1 wrote:
 On Sunday, 16 September 2018 at 10:14:24 UTC, Vladimir 
 Panteleev wrote:
I will test pcre solution vs mpfc for benchmark. the pcre is easy to deal with low/up case. The PCRE: /^(?:Accept(*:0)|Accept\-Charset(*:1)|Accept\-Encoding(*:2)|Accept\-Language(*:3)|Access\-Control\-Allow\-Credentials(*:4)|Access\-Control\-Allow\-Origin(*:5)|Access\-Control\-Allow\-Methods(*:6)|Access\-Control\-Allow\-Headers(*:7)|Access\-Control\-Max\-Age(*:8)|Access\-Control\-Expose\-Headers(*:9)|Access\-Control\-Request\-Method(*:10)|Access\-Control\-Request\-Headers(*:11)|Age(*:12)|Allow(*:13)|Authorization(*:14)|Cache\-Control(*:15)|Connection(*:16)|Content\-Encoding(*:17)|Content\-Language(*:18)|Content\-Length(*:19)|Content\-Location(*:20)|Content\-Range(*:21)|Content\-Security\-Policy(*:22)|Content\-Type(*:23)|Cookie(*:24)|Date(*:25)|ETag(*:26)|Expect(*:27)|Expires(*:28)|Host(*:29)|If\-Match(*:30)|If\-Modified\-Since(*:31)|If\-None\-Match(*:32)|If\-Range(*:33)|If\-Unmodified\-Since(*:34)|Last\-Modified(*:35)|Location(*:36)|Origin(*:37)|Pragma(*:38)|Proxy\-Authenticate(*:39)|Proxy\-Authorization(*:40)|Range(*:41)|Referer(*:42)|Retry\-After(*:43)|Sec\-Websocket\-Extensions(*:44)|Sec\-Websocket\-Key(*:45)|Sec\-Websocket\-Origin(*:46)|Sec\-Websocket\-Protocol(*:47)|Sec\-Websocket\-Version(*:48)|Server(*:49)|Set\-Cookie(*:50)|Strict\-Transport\-Security(*:51)|Transfer\-Encoding(*:52)|Upgrade(*:53)|User\-Agent(*:54)|Vary(*:55)|Via(*:56)|WWW\-Authenticate(*:57))$/i
Sep 16 2018
prev sibling parent Diederik de Groot <ddegroot talon.nl> writes:
Homework ?

I think it would also be better to only walk/match the 
Header.data and header_keys only once (using the generated switch 
statement that Vladimir suggested). So your Headers struct should 
have a parse() function, which will match all of the headers to 
the keys and store them locally in a 'matches[Type] = string' 
array (aka map). Then the get() function can simply  return 
matches[type] without any work.

Other improvements/Bonus Points:
- Because the parse() function is only called/initialize once, 
you could make the matches array immutable/const if.
- Your Type enum and header_keys contain the same data, and if 
you are going to use Vlafimir's suggestion of using a mixin, you 
can stringify the enum names an use them instead. This way you 
don't have to keep anything  in sync.
- If these Headers are coming from a client over the internet, it 
might be wise to set and check some (arbitrary) bounds for 
header-length and content. You are currently blindly copying it 
and expecting it is 'clean/correct'.
- Why copy the data to Header.data, instead of just providing it 
by ref to the get()/parse() function. You are copying it to then 
later indexing into Header.data using the map(). Why not 
copy/move piecemeal and assign them to an indexed array after 
inspecting/matching them. That  could make for cleaner code and 
nicer API. (With two caveats: we could loose the original headers 
if we don't save them somewhere else and unmatched/unknown Header 
Types will not be copied).
Sep 16 2018