digitalmars.D - Making regex replace CTFE by removing malloc
- Pierre Krafft (18/18) Apr 02 2015 We have CTFE regex wich creates an optimized regex engine at
- ketmar (11/13) Apr 02 2015 this is actually two questions, so i'll answer to two questions.
- Pierre Krafft (3/21) Apr 03 2015 Thanks!
- Pierre Krafft (11/29) Apr 03 2015 It seems like I have treaded into something which is outside my
- ketmar (8/17) Apr 05 2015 i believe that pointer casts are there for speed reasons. or maybe the=2...
- Pierre Krafft (23/23) Apr 06 2015 I got some help from the IRC, poked the code a bit and have some
- ketmar (5/5) Apr 07 2015 thank you.
- Dmitry Olshansky (27/49) Apr 10 2015 Just noticed this thread, I've been detached from D lately. Cool idea,
- Jakob Ovrum (4/11) Apr 03 2015 I've wanted this too. I really hope we can figure this out in the
- ketmar (3/12) Apr 05 2015 i hope to do some work on that in this month. but don't hold your breath...
We have CTFE regex wich creates an optimized regex engine at compile time. I would want to expand on that and make the generated machine CTFE as well. I think the only thing that is stopping this from working is the use of malloc as seen at https://github.com/D-Programming-Language/phobos/blob/cb044b02aa3abd0bddc5e48a91faebfd146cab3e/std/regex/package.d#L565 What can replace malloc that can run on compile time and won't make it slower at run time? Here is an example of what I would want to work: unittest { enum hello_world = camelCaseToUnderscore("helloWorld"); assertEqual(hello_world, "hello_world"); } private string camelCaseToUnderscore(string input){ import std.regex; auto ctr = ctRegex!(`([a-z])([A-Z])`); auto replaceString = "$1_$2"; return replaceAll(input, ctr, replaceString).toLower(); }
Apr 02 2015
On Thu, 02 Apr 2015 17:14:24 +0000, Pierre Krafft wrote:What can replace malloc that can run on compile time and won't make it slower at run time?this is actually two questions, so i'll answer to two questions. 1. What can replace malloc that can run on compile time? new ubyte[](size) 2. ...and won't make it slower at run time? but we can still use malloc in runtime! `if (_ctfe)` allows us to select=20 the necessary code branch. p.s. i don't think that this is the only problem, though. but i never=20 read "std.regexp" source. it's bad, 'cause i want to make it work with=20 any range, not only with strings. this will allow to run regexp on=20 anything -- and open way to my rbtree-based document system.=
Apr 02 2015
On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:On Thu, 02 Apr 2015 17:14:24 +0000, Pierre Krafft wrote:Thanks! I'll take a look and see if I can make it work.What can replace malloc that can run on compile time and won't make it slower at run time?this is actually two questions, so i'll answer to two questions. 1. What can replace malloc that can run on compile time? new ubyte[](size) 2. ...and won't make it slower at run time? but we can still use malloc in runtime! `if (_ctfe)` allows us to select the necessary code branch. p.s. i don't think that this is the only problem, though. but i never read "std.regexp" source. it's bad, 'cause i want to make it work with any range, not only with strings. this will allow to run regexp on anything -- and open way to my rbtree-based document system.
Apr 03 2015
On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:On Thu, 02 Apr 2015 17:14:24 +0000, Pierre Krafft wrote:It seems like I have treaded into something which is outside my knowledge domain. The malloc is indeed one of the least problems with that code. The code makes use of completely unsafe code with pointer casts that are disallowed in CTFE code. If someone knows how to replace the pointer casts that would probably be everything needed. Otherwise I think it would need a rewrite to make it use safe D. The new code would probably be slower so for most people it would be a step back. A solution would be to have different code for CTFE and runtime but that seems unmaintainable and subpar.What can replace malloc that can run on compile time and won't make it slower at run time?this is actually two questions, so i'll answer to two questions. 1. What can replace malloc that can run on compile time? new ubyte[](size) 2. ...and won't make it slower at run time? but we can still use malloc in runtime! `if (_ctfe)` allows us to select the necessary code branch. p.s. i don't think that this is the only problem, though. but i never read "std.regexp" source. it's bad, 'cause i want to make it work with any range, not only with strings. this will allow to run regexp on anything -- and open way to my rbtree-based document system.
Apr 03 2015
On Fri, 03 Apr 2015 15:04:38 +0000, Pierre Krafft wrote:It seems like I have treaded into something which is outside my knowledge domain. The malloc is indeed one of the least problems with that code. The code makes use of completely unsafe code with pointer casts that are disallowed in CTFE code. If someone knows how to replace the pointer casts that would probably be everything needed. Otherwise I think it would need a rewrite to make it use safe D. The new code would probably be slower so for most people it would be a step back. A solution would be to have different code for CTFE and runtime but that seems unmaintainable and subpar.i believe that pointer casts are there for speed reasons. or maybe the=20 engine just didn't designed for CTFE in the first place. anyway, you can ask your questions here or in "D.learn". even if you will=20 not be able to convert the current engine to CTFE, you may be able to use=20 a subset of it as a base for separate CTFE regexp engine. i believe that=20 it is possible to write a moderately good Thompson-based code, which=20 covers alot of needs.=
Apr 05 2015
I got some help from the IRC, poked the code a bit and have some details I can share. The code uses a reference counted memory pool, that's the part I linked. It's possible to avoid the malloc but that's not the hard part. For me it looks like memory pools aren't compatible with CTFE. This is because CTFE disallows most pointer casts. Allocations and frees have fortunately been moved to common functions so it should be possible to change the allocation code without too much hassle. I would prefer to have the same code for runtime and CTFE, but for performance reasons that might not be possible if CTFE has a too limited feature set. I think a good idea would be to extract the memory pool code out of the regex module. It's not a core part of the problem and could be reused elsewhere. Having something named memory pool would also make the code clearer. It might be impossible to implement memory pools at CTFE so we could hide that detail away by simulating a memory pool (at CTFE) until it becomes possible. The code in general is quite clear, but low level and more about how to do it than what to do. This is not the D-style I know and love, but it might be needed for performance. Therefore I probably won't be the one that starts the implementation of this, but I will try to help if someone more experienced in this kind of code takes action.
Apr 06 2015
thank you. and i read the code a little, and found that matching engine using stream- like interface to work with data, so it wouldn't be very hard to use=20 ranges instead of strings. and for "real" regexps (those without=20 backtracking) range seems to doesn't even require random access.=
Apr 07 2015
On 07-Apr-2015 02:48, Pierre Krafft wrote:I got some help from the IRC, poked the code a bit and have some details I can share. The code uses a reference counted memory pool, that's the part I linked. It's possible to avoid the malloc but that's not the hard part. For me it looks like memory pools aren't compatible with CTFE. This is because CTFE disallows most pointer casts.Just noticed this thread, I've been detached from D lately. Cool idea, but it wasn't considered in its design, so memchr/malloc and other nice optimized C primitives are all around the engine sources. Indeed there is a fair amount of custom memory allocation. One engine uses a freelist and the other (+ compile-time generated one) uses segmented stack. On top of that both engines use adhoc ref-counting for internal data structures. Initially I delayed memory allocation problem until we get memory allocators. Current memory allocation scheme is further obscured by trying hard to allocate just once and reusing/splitting the initial chunk.Allocations and frees have fortunately been moved to common functions so it should be possible to change the allocation code without too much hassle. I would prefer to have the same code for runtime and CTFE, but for performance reasons that might not be possible if CTFE has a too limited feature set.I learned the hard way that CTFE-able code and high-performance code look very much unlike each other. Unless everything is new-ed and not deleted we can't have the same code paths. A fair amount of __ctfe is in order.I think a good idea would be to extract the memory pool code out of the regex module. It's not a core part of the problem and could be reused elsewhere. Having something named memory pool would also make the code clearer. It might be impossible to implement memory pools at CTFE so we could hide that detail away by simulating a memory pool (at CTFE) until it becomes possible.Would be nice to have I guess, std.regex though uses somewhat specialized intrusive free-list so it's too niche for generic solution.The code in general is quite clear, but low level and more about how to do it than what to do. This is not the D-style I know and love, but it might be needed for performance.Yeah, sadly ranges and algorithms were underused. Reasons were multiple, for parser it had more to do with CTFE-ablity, for engines more of performance concerns.Therefore I probably won't be the one that starts the implementation of this, but I will try to help if someone more experienced in this kind of code takes action.I might pull this off on occasion. Compared to compilation of regex itself it might even run in sensible amount of time. Anyhow I'm curious what use cases you have in mind for running regexes at CTFE. -- Dmitry Olshansky
Apr 10 2015
On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:p.s. i don't think that this is the only problem, though. but i never read "std.regexp" source. it's bad, 'cause i want to make it work with any range, not only with strings. this will allow to run regexp on anything -- and open way to my rbtree-based document system.I've wanted this too. I really hope we can figure this out in the future, it would be another completely unique feature of D's already excellent regex engine.
Apr 03 2015
On Sat, 04 Apr 2015 01:50:00 +0000, Jakob Ovrum wrote:On Friday, 3 April 2015 at 03:58:33 UTC, ketmar wrote:i hope to do some work on that in this month. but don't hold your breath,=20 i didn't even read the code yet, so i don't know how hard it will be.=p.s. i don't think that this is the only problem, though. but i never read "std.regexp" source. it's bad, 'cause i want to make it work with any range, not only with strings. this will allow to run regexp on anything -- and open way to my rbtree-based document system.=20 I've wanted this too. I really hope we can figure this out in the future, it would be another completely unique feature of D's already excellent regex engine.
Apr 05 2015