www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Making regex replace CTFE by removing malloc

reply "Pierre Krafft" <kpierre+dlang outlook.com> writes:
We have CTFE regex wich creates an optimized regex engine at 
compile time. I would want to expand on that and make the 
generated machine CTFE as well. I think the only thing that is 
stopping this from working is the use of malloc as seen at 
https://github.com/D-Programming-Language/phobos/blob/cb044b02aa3abd0bddc5e48a91faebfd146cab3e/std/regex/package.d#L565
What can replace malloc that can run on compile time and won't 
make it slower at run time?

Here is an example of what I would want to work:

unittest {
     enum hello_world = camelCaseToUnderscore("helloWorld");
     assertEqual(hello_world, "hello_world");
}

private string camelCaseToUnderscore(string input){
     import std.regex;
     auto ctr = ctRegex!(`([a-z])([A-Z])`);
     auto replaceString = "$1_$2";
     return replaceAll(input, ctr, replaceString).toLower();
}
Apr 02 2015
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Thu, 02 Apr 2015 17:14:24 +0000, Pierre Krafft wrote:

 What can replace malloc that can run on compile time and won't make it
 slower at run time?
this is actually two questions, so i'll answer to two questions. 1. What can replace malloc that can run on compile time? new ubyte[](size) 2. ...and won't make it slower at run time? but we can still use malloc in runtime! `if (_ctfe)` allows us to select=20 the necessary code branch. p.s. i don't think that this is the only problem, though. but i never=20 read "std.regexp" source. it's bad, 'cause i want to make it work with=20 any range, not only with strings. this will allow to run regexp on=20 anything -- and open way to my rbtree-based document system.=
Apr 02 2015
next sibling parent "Pierre Krafft" <kpierre+dlang outlook.com> writes:
On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:
 On Thu, 02 Apr 2015 17:14:24 +0000, Pierre Krafft wrote:

 What can replace malloc that can run on compile time and won't 
 make it
 slower at run time?
this is actually two questions, so i'll answer to two questions. 1. What can replace malloc that can run on compile time? new ubyte[](size) 2. ...and won't make it slower at run time? but we can still use malloc in runtime! `if (_ctfe)` allows us to select the necessary code branch. p.s. i don't think that this is the only problem, though. but i never read "std.regexp" source. it's bad, 'cause i want to make it work with any range, not only with strings. this will allow to run regexp on anything -- and open way to my rbtree-based document system.
Thanks! I'll take a look and see if I can make it work.
Apr 03 2015
prev sibling next sibling parent reply "Pierre Krafft" <kpierre+dlang outlook.com> writes:
On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:
 On Thu, 02 Apr 2015 17:14:24 +0000, Pierre Krafft wrote:

 What can replace malloc that can run on compile time and won't 
 make it
 slower at run time?
this is actually two questions, so i'll answer to two questions. 1. What can replace malloc that can run on compile time? new ubyte[](size) 2. ...and won't make it slower at run time? but we can still use malloc in runtime! `if (_ctfe)` allows us to select the necessary code branch. p.s. i don't think that this is the only problem, though. but i never read "std.regexp" source. it's bad, 'cause i want to make it work with any range, not only with strings. this will allow to run regexp on anything -- and open way to my rbtree-based document system.
It seems like I have treaded into something which is outside my knowledge domain. The malloc is indeed one of the least problems with that code. The code makes use of completely unsafe code with pointer casts that are disallowed in CTFE code. If someone knows how to replace the pointer casts that would probably be everything needed. Otherwise I think it would need a rewrite to make it use safe D. The new code would probably be slower so for most people it would be a step back. A solution would be to have different code for CTFE and runtime but that seems unmaintainable and subpar.
Apr 03 2015
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Fri, 03 Apr 2015 15:04:38 +0000, Pierre Krafft wrote:

 It seems like I have treaded into something which is outside my
 knowledge domain. The malloc is indeed one of the least problems with
 that code. The code makes use of completely unsafe code with pointer
 casts that are disallowed in CTFE code. If someone knows how to replace
 the pointer casts that would probably be everything needed. Otherwise I
 think it would need a rewrite to make it use safe D. The new code would
 probably be slower so for most people it would be a step back. A
 solution would be to have different code for CTFE and runtime but that
 seems unmaintainable and subpar.
i believe that pointer casts are there for speed reasons. or maybe the=20 engine just didn't designed for CTFE in the first place. anyway, you can ask your questions here or in "D.learn". even if you will=20 not be able to convert the current engine to CTFE, you may be able to use=20 a subset of it as a base for separate CTFE regexp engine. i believe that=20 it is possible to write a moderately good Thompson-based code, which=20 covers alot of needs.=
Apr 05 2015
parent reply "Pierre Krafft" <kpierre+dlang outlook.com> writes:
I got some help from the IRC, poked the code a bit and have some 
details I can share. The code uses a reference counted memory 
pool, that's the part I linked. It's possible to avoid the malloc 
but that's not the hard part. For me it looks like memory pools 
aren't compatible with CTFE. This is because CTFE disallows most 
pointer casts.
Allocations and frees have fortunately been moved to common 
functions so it should be possible to change the allocation code 
without too much hassle. I would prefer to have the same code for 
runtime and CTFE, but for performance reasons that might not be 
possible if CTFE has a too limited feature set.
I think a good idea would be to extract the memory pool code out 
of the regex module. It's not a core part of the problem and 
could be reused elsewhere. Having something named memory pool 
would also make the code clearer. It might be impossible to 
implement memory pools at CTFE so  we could hide that detail away 
by simulating a memory pool (at CTFE) until it becomes possible.
The code in general is quite clear, but low level and more about 
how to do it than what to do. This is not the D-style I know and 
love, but it might be needed for performance. Therefore I 
probably won't be the one that starts the implementation of this, 
but I will try to help if someone more experienced in this kind 
of code takes action.
Apr 06 2015
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
thank you.

and i read the code a little, and found that matching engine using stream-
like interface to work with data, so it wouldn't be very hard to use=20
ranges instead of strings. and for "real" regexps (those without=20
backtracking) range seems to doesn't even require random access.=
Apr 07 2015
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Apr-2015 02:48, Pierre Krafft wrote:
 I got some help from the IRC, poked the code a bit and have some details
 I can share. The code uses a reference counted memory pool, that's the
 part I linked. It's possible to avoid the malloc but that's not the hard
 part. For me it looks like memory pools aren't compatible with CTFE.
 This is because CTFE disallows most pointer casts.
Just noticed this thread, I've been detached from D lately. Cool idea, but it wasn't considered in its design, so memchr/malloc and other nice optimized C primitives are all around the engine sources. Indeed there is a fair amount of custom memory allocation. One engine uses a freelist and the other (+ compile-time generated one) uses segmented stack. On top of that both engines use adhoc ref-counting for internal data structures. Initially I delayed memory allocation problem until we get memory allocators. Current memory allocation scheme is further obscured by trying hard to allocate just once and reusing/splitting the initial chunk.
 Allocations and frees have fortunately been moved to common functions so
 it should be possible to change the allocation code without too much
 hassle. I would prefer to have the same code for runtime and CTFE, but
 for performance reasons that might not be possible if CTFE has a too
 limited feature set.
I learned the hard way that CTFE-able code and high-performance code look very much unlike each other. Unless everything is new-ed and not deleted we can't have the same code paths. A fair amount of __ctfe is in order.
 I think a good idea would be to extract the memory pool code out of the
 regex module. It's not a core part of the problem and could be reused
 elsewhere. Having something named memory pool would also make the code
 clearer. It might be impossible to implement memory pools at CTFE so  we
 could hide that detail away by simulating a memory pool (at CTFE) until
 it becomes possible.
Would be nice to have I guess, std.regex though uses somewhat specialized intrusive free-list so it's too niche for generic solution.
 The code in general is quite clear, but low level and more about how to
 do it than what to do. This is not the D-style I know and love, but it
 might be needed for performance.
Yeah, sadly ranges and algorithms were underused. Reasons were multiple, for parser it had more to do with CTFE-ablity, for engines more of performance concerns.
 Therefore I probably won't be the one
 that starts the implementation of this, but I will try to help if
 someone more experienced in this kind of code takes action.
I might pull this off on occasion. Compared to compilation of regex itself it might even run in sensible amount of time. Anyhow I'm curious what use cases you have in mind for running regexes at CTFE. -- Dmitry Olshansky
Apr 10 2015
prev sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:
 p.s. i don't think that this is the only problem, though. but i 
 never
 read "std.regexp" source. it's bad, 'cause i want to make it 
 work with
 any range, not only with strings. this will allow to run regexp 
 on
 anything -- and open way to my rbtree-based document system.
I've wanted this too. I really hope we can figure this out in the future, it would be another completely unique feature of D's already excellent regex engine.
Apr 03 2015
parent ketmar <ketmar ketmar.no-ip.org> writes:
On Sat, 04 Apr 2015 01:50:00 +0000, Jakob Ovrum wrote:

 On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:
 p.s. i don't think that this is the only problem, though. but i never
 read "std.regexp" source. it's bad, 'cause i want to make it work with
 any range, not only with strings. this will allow to run regexp on
 anything -- and open way to my rbtree-based document system.
=20 I've wanted this too. I really hope we can figure this out in the future, it would be another completely unique feature of D's already excellent regex engine.
i hope to do some work on that in this month. but don't hold your breath,=20 i didn't even read the code yet, so i don't know how hard it will be.=
Apr 05 2015