digitalmars.D - Dynamic Code in D
- Michael Wilson (130/130) Jan 12 2008 I'm responsible for an existing Java application that is the core
- Robert Fraser (9/16) Jan 12 2008 Wow; that's a cool project and as much of it as you're willing to
- bearophile (14/20) Jan 12 2008 D is a very nice language, and it looks quite nice compared to Java, but...
- Michael Wilson (117/163) Jan 12 2008 I know; Java was awful when I first started using it in 1997 (not by
- bearophile (110/117) Jan 12 2008 Current DMD D is able to inline free functions and more. But Java code i...
- Neal Alexander (4/7) Jan 12 2008 I was under the impression this was easy using XADD / CMPXCHG with the
- Sean Kelly (7/16) Jan 14 2008 It's also very slow compared to some of the optimized algorithms out
- CptJack (27/72) Jan 12 2008 He meant "Common Lisp programmers". "CLisp" is just a single
- bearophile (6/13) Jan 12 2008 I like some sides of Lisp (and you have seen I too have suggest it as an...
- Michael Wilson (20/31) Jan 12 2008 I could write a long response to this but it would be off topic for
- Paolo Invernizzi (5/16) Jan 12 2008 You can also not directly target bytecodes: what I mean its that you can...
- Walter Bright (2/6) Jan 13 2008 I'm certainly willing to accept help with this!
- Sean Kelly (6/17) Jan 13 2008 Right now, D uses OS API routines for synchronization. Tango also has a...
- Paolo Invernizzi (4/8) Jan 12 2008 I guess the best approach it's LLVM...
- CptJack (2/2) Jan 12 2008 I'm not even close to being an expert on the internal D compiler mechani...
- BLS (13/13) Jan 13 2008 Hi Michael,
- Walter Bright (22/148) Jan 13 2008 I don't know what the right answer is for your application, so I'll just...
- Tom S (12/21) Jan 13 2008 For what it's worth, you don't need to use DDL on Linux. .so is just
- Rory Mcguire (7/7) Mar 27 2008 checkout kong on dsource.org. www.dsource.org/projects/kong
I'm responsible for an existing Java application that is the core product of my company. However it's essentially a productised prototype and needs a rewrite, or at least some serious refactoring, sometime this year. I am considering doing this in D; I like the language design and there are potential performance and maintainability benefits. As it turns out, this may entail creating a new D library. Thus I'd like to ask the following questions; 1) Is this practical at all? 2) Is there a better way to do this? 3) Would anyone other than us want to use this library if we made it (i.e. is it worth open-sourcing)? The system carries out data mining tasks using algorithms generated by a genetic programming engine. The original prototype (developed before I joined the company) did this by constructing a Lisp-style function tree out of Java objects, then running that in an interpreter (i.e. it used ECJ). Unsurprisingly this was hideously slow. The current version that I designed works by compiling the function tree to Java bytecode (as static methods in a new class), calling an in-memory classloader on the output (allowing Java to link it), then allowing Hotspot to translate it to native machine code. However to get good performance on large datasets, I had to implement the terminal functions (i.e. the inner loops) in C, using GCC's SSE intrinsics. The dynamic Java code calls this via the Java Native Interface (specifically, a class full of static native methods). However JNI has a significant call overhead, which is eating up a good fraction of the performance boost, as well as creating additional development/maintenance hassle. This system is running on a couple of beefy 16 core Opteron servers and it still takes hours to run on large datasets. The major problem with implementing in D is the lack of runtime compilation and the AFAIK subpar support for dynamic linking. I'll get to that in a bit. The other potential problems are; 1. 
Support for AMD64 on Linux is absolutely essential; these datasets are several gigabytes in size and need to be loaded into memory. Plus there is a nontrivial amount of 64-bit integer manipulation in the Java code. Linux obviously because what sane person would waste money on Windows Server licenses for compute servers? I haven't been able to track down the latest estimate for when DMD will support AMD64, but at least GDC does now. Is this stable enough for production use? 2. High-performance synchronization. As of 1.6, Java's monitor implementation is quite impressive; the biased locking mechanism means that atomic operations are only required for a small fraction of monitor acquisitions. I've been trying to find out if D's implementation uses the same mechanism. This is probably only good for a few percent of performance, but on 16-way NUMA locking performance can make a difference even when it isn't in the inner loops. 3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics yet (i.e. "target builtins")? This post suggests that it does; http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=D.gnu&artnum=2622 but I can't see anything in the documentation about them. If not I'll have to continue linking to the existing C code, which eliminates a significant benefit of going to D (simplifying the code base - but at least the JNI call & pinning overhead will be gone). Ideally std.intrinsic would have these (presumably that's how DMD is going to include them, if and when it does - as far as I can tell, DMD is still doing even basic double-precision operations with slow x87 code rather than SSE code, unless http://www.digitalmars.com/d/archives/digitalmars/D/Optimizations_of_DMD_49840.html is out of date). Ok, on to the serious problem, dynamic code support. The GP engine is outputting a function tree structure. I can trivially output this as D or Java source. I can output it as Java bytecode with a bit more effort. 
I really don't want to have to compile it to x86 machine code (and link it against the terminal functions) myself, particularly given that the research guys keep elaborating the algorithm. Here's what I've considered so far; 1) Output as C source plus a JNI stub in bytecode. Invoke gcc on it to build a dynamic library. Load the class in Java. This would remove most of the JNI overhead (because the Java/C crossover would be pushed 4 to 10 places up the call graph). However I'd have to keep invoking GCC as a separate process, though I could cut down the overhead of this somewhat by putting it and the dynamic files on a ramdisk. I'm unsure what the performance would be like under Linux of constantly linking new shared libraries, and I'm not sure if the Sun JVM is capable of unloading native code along with unloading classes (though this memory leak is probably insignificant over the lifetime of the process). This is the solution I will be using if I can't find a good D solution. 2) Port to D. Change the GP system to target MiniD bytecode. I haven't examined MiniD in detail yet, but I get the impression that this will be marginally harder than targeting Java bytecode. Alternatively I could target MiniD source, which would be simpler for me, at some additional compilation cost, or I could dig into the MiniD compiler source and see if it has an AST I could target. The latter would be ideal if the code is stable, but it probably isn't. MiniD is purely interpreted, so performance of the dynamic code will almost certainly suck badly compared to Hotspot. On the plus side, 'linking' cost is effectively zero. The inner loops will still have to be in either D (if it has decent SSE intrinsics now) or C (if it doesn't); at first glance MiniD native invocation overhead looks better than JNI (no need to pin for one thing, since the GC is non-copying) but still nontrivial. 3) Port to D. Change the GP system to target D source. Batch code updates together. 
On each update, recompile the codebase using gdc running in a ramdisk. Start a new process using the new code. Keep all the persistent data in a shared memory segment. Get the new process to map in the shared segment, then kill the old process. Processing can continue from where it left off using the updated code. AFAIK this /should/ work, but I don't know what the overheads are going to be like. Having to batch together the code updates is mildly inconvenient from the algorithm design side. I'm not sure what, if any, incremental compilation features GDC has; perhaps I could patch it to stay in memory and stream new classes to it over a pipe. 4) Port to D. Change the GP system to target D source. Batch code updates together. On each update, link the new classes in with DDL. The major problem with this is that AFAIK DDL isn't working on Linux yet (at all, never mind on AMD64) and is still beta quality code even on Windows. In addition to that, I'm not clear if it can ever recover the memory for unused code, though as I mentioned for (1) this leak isn't likely to be a problem over the lifetime of the process. Plus the GDC invocation overhead as for (3). and finally... 5) Write a new 'D Dynamic Code Library'. My best guess for how this should work is this; a) This library will statically link to GDC and the GCC back end. b) On first invocation, look for a metadata file. If it isn't present or is out of date, recompile the app from source and keep the relevant GDC state in memory. Otherwise, deserialise GDC state from the metadata file. c) Supply functions that accept D source and possibly one or more of the GCC intermediate representations (i.e. AST, CFG, RTL). These would call through to GDC/GCC, intercept the output and write it into the code page, then perform runtime linking as necessary. This is probably a lot of work, but I am unsure exactly how much. d) Open source this library in case anyone else wants to use it. Any thoughts on any of the above? 
Bear in mind that I have a mild bias towards using D even if it is some extra work to get things going, but I am certainly not going to accept worse overall performance than the Java version. Michael Wilson Systems Developer Purple Frog Text Ltd
Jan 12 2008
Wow; that's a cool project and as much of it as you're willing to open-source would be awesome, even if it mainly serves as an educational tool. Dynamic code support has been something D has been lacking for a while. I don't know the answer to most of your questions, but... Michael Wilson wrote:1. Support for AMD64 on Linux is absolutely essential; these datasets are several gigabytes in size and need to be loaded into memory. Plus there is a nontrivial amount of 64-bit integer manipulation in the Java code. Linux obviously because what sane person would waste money on Windows Server licenses for compute servers? I haven't been able to track down the latest estimate for when DMD will support AMD64, but at least GDC does now. Is this stable enough for production use?AFAIK, the code generator on GDC is more mature/stable than DMD. It uses the GCC backend, so if you consider GCC's AMD64 codegen stable, then GDC's will follow. Anyway, good luck figuring all this out!
Jan 12 2008
Your set of problems is quite interesting. I am not much experienced in D yet, but I can say some of the things I think. There are a lot of people here that know much more than me on such matters (but I have implemented many evolution-based programs); they will spot errors among the things I have said here. D is a young language, it has a lot of rough corners, so you have to forget the high level of polish that Java systems have today. I like D, and hopefully more and more people will start using it, but I think today it's not much fit to replace "serious" Java programs.I am considering doing this in D; I like the language design and there are potential performance and maintainability benefits.<D is a very nice language, and it looks quite nice compared to Java, but note that for most of the small (1000-10000 line) programs I have translated from Java to D, the Java 6 version was 2-10 times faster. HotSpot inlines methods, and it has a really good GC, while DMD currently seems not to inline them and its GC is snail-slow compared to the refined HotSpot one (not the server one). Those two things alone make D code much slower if you code in the usual highly OOP style Java code has.The system carries out data mining tasks using algorithms generated by a genetic programming engine. The original prototype (developed before I joined the company) did this by constructing a Lisp-style function tree out of Java objects, then running that in an interpreter (i.e. it used ECJ). Unsurprisingly this was hideously slow.<For that purpose CommonLisp looks like a good language (besides Java itself). It's fast enough if used well, and it compiles the functions on the fly. There are some disadvantages, like it's not easy to find CLisp programmers. How much time does a single fitness computation take on average? If it's really little time (less than 30-40 seconds) then compiling/interfacing timings become important. 
In that situation you need a fast compiler, etc.The dynamic Java code calls this via the Java Native Interface (specifically, a class full of static native methods). However JNI has a significant call overhead, which is eating up a good fraction of the performance boost, as well as creating additional development/maintenance hassle. This system is running on a couple of beefy 16 core Opteron servers and it still takes hours to run on large datasets.<Often the bigger gains in such evolutionary programs come from "improving" the search space, adding more heuristics, etc. Not from going from Java to D.2. High-performance synchronization. As of 1.6, Java's monitor implementation is quite impressive;<I don't know the answer, but in most things D is far from being tuned and
3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics yet<Nope, I think.1. Output as C source plus a JNI stub in bytecode. Invoke gcc on it to build a dynamic library.<If the running time of that fitness function is very little you may need to find something faster than GCC, like TinyCC (but I think that's not your situation, and most probably your fitness functions are small, so they are very quick to compile). Look for other alternative languages/solutions too, like CLisp, Java TinyCC - compiled code, Python+Psyco with Cython compiled code on the fly, etc. I think you can answer most of your questions about D doing some small benchmarks for 1-2 days only. Maybe you will like the language but not its performance for your purposes :-) Bye, bearophile
Jan 12 2008
bearophile wrote:D is a young language, it has lot of rough corners, so you have to forget the high level of polish that Java systems have today.I know; Java was awful when I first started using it in 1997 (not by choice, at the time), but these days it's actually very good for large reasonably-long-running applications. However while the runtime system is very good and the standard library is decent, the language itself is kinda gnarly and obviously restrictive compared to C (thus the use of JNI in this app).I like D, and hopefully more and more people will start using it, but I think today it's not much fit to replace "serious" Java programs.For this app, there are three types of server, the web cluster, the DB cluster and the compute cluster. The web cluster has all the interface and application logic, and that code is definitely staying in Java (+Javascript on the client side). All the persistent data stays on the file/DB servers, in Postgres and some big binary files. The compute servers are the only ones I am considering moving to D. If it wasn't for the SSE I probably wouldn't, but that JNI overhead is seriously annoying. Plus I can do low-level things such as NUMA optimisation from D much more easily than I can from Java.HotSpot inlines methods, and it has a really good GC, while DMD currently seem to not inline them and its GC is snail-slow compared to the refined HotSpot (not the server one) one.I'm not concerned about GC since there isn't much temporary object usage; the bulk of the data is in a relatively small number of huge arrays (which I may well go to manual memory management for in D anyway). No method inlining is a little more serious, but then I'll have to use GDC rather than DMD for the 64-bit support, and presumably GDC inlines methods ok? 
If not I could make the GP output stage do it, the call depth and function size is low enough that it shouldn't be difficult.For that purpose CommonLisp looks like a good language (besides Java itself). It's fast enough if used well, and it compiles the functions on the fly. There are some disadvantages, like it's not easy to find CLisp programmers.It's even harder to find D developers, but it's easy for Java developers to learn (and if I used it on a production system, other developers at the company would have to as well). Lisp, not so much. I guarantee that if I said 'I want to reimplement in Lisp' the answer would be no, while I received provisional approval for using D on the basis that 'it's very similar to Java, our developers can learn it easily'. Aside from that the other problems with using Lisp in this app are; 1) /Much/ more work to port the sections of code I want to reuse from the old system across. 2) I don't know of any CL implementations with SSE intrinsics. So unless someone else knows one I'd be stuck linking to C code again. It's possible that there's a CL implementation with less overhead for linking dynamic code to external C libraries than JNI has, but if so I don't know of it. 3) AFAIK similar issues with trying to do NUMA optimisation as Java and worse synchronisation performance (though again, there may be some super-optimal CL implementation out there I don't know about).How much time does a single fitness computation take on average? If it's really little time (less than 30-40 seconds) then compiling/interfacing timings become important.This depends on the dataset size. For a typical dataset, on each pass the GP engine generates a few thousand classifiers based on a few hundred (dynamic) component functions. Computing the fitness of those takes somewhere between 10 seconds and a minute, on a server with eight Opteron 8218s. 
This is heavily optimised to reuse intermediate values where possible, which is where the moderate amount of thread synchronisation comes from (and why one big server outperforms lots of small servers of the same price and more aggregate compute performance - though we're working on that). Each run makes several hundred of these passes. So the compiler is invoked a few times a minute on the equivalent of a couple of thousand lines of code. Compiler speed shouldn't be a major factor.Often the bigger gains in such evolutionary programs come from "improving" the search space, adding more heuristics, etc. Not from going from Java to D.Absolutely, but I'm not responsible for that (though I certainly make suggestions). My job is to make the algorithms the research team comes up with run on commercial datasets (my team as a whole is responsible for 'productisation').
That's something I could potentially help with once I'm decently familiar with the language, if the GDC developers are accepting patches. Otherwise, I could just implement my own synchronization, via inline assembly (since this doesn't have to be multi-platform) or C.2. High-performance synchronization. As of 1.6, Java's monitor implementation is quite impressive;<I don't know the answer, but in most things D is far from being tuned
Clearly they were scheduled to go in last year (from the post I linked to); has there been a disruption to GDC development recently?3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics yet<Nope, I think.
If the running time of that fitness function is very little you may need to find something faster than GCC, like TinyCC (but I think that's not your situation, and most probably your fitness functions are small, so they are very quick to compile).Interesting suggestion but unfortunately TinyCC does not (AFAIK) target AMD64 - also a problem for a lot of CL implementations. 
I'll have a look around for similar projects that do though.Look for other alternative languages/solutions too, like CLisp, Java TinyCC compiled code, Python+Psyco with Cython compiled code on the fly, etc.I am doing so; D is one of them. :) The very first thing I tried was Excelsior JET, a mixed mode compiler for Java (static where possible, JIT for new code loaded at runtime) that I'd had good results with in the past. Unfortunately performance dropped significantly, even when I 'cheated', saved a log of all GP functions generated in a run and supplied them to JET to statically compile for the benchmark run. As you noted, Hotspot is really quite good these days.I think you can answer most of your questions about D doing some small benchmarks for 1-2 days only. Maybe you will like the language but not its performance for your purposes :-)Yep; I'll probably start with the 'multiple processes passing data via shared memory' solution, as that sounds like the simplest to implement. If and when I've got that working as a proof of concept, I'll take a look at LLVM. Paolo Invernizzi wrote:I guess the best approach it's LLVM... Take a look at http://www.llvm.org Cheers, PaoloThanks! That does look really interesting. I had a quick look at GNU Lightning but it's almost entirely unoptimising, and thus probably unusable. LLVM looks much better at first glance and I should be able to target LLVM bytecode directly, but I'll have to investigate the feasibility of linking against D. Googling, it looks like there have been a few attempts to use LLVM and D together before, but mostly in the sense of using it as a third static compiler (e.g. http://www.dsource.org/projects/llvmdc). I will have to do some more research and see if there is anything I could get involved with. Needless to say, if I can get my company to contribute engineering time to some interesting open source development, I will. 
:) Robert Fraser wrote;Wow; that's a cool project and as much of it as you're willing to open-source would be awesome, even if it mainly serves as an educational tool.I'm pretty sure the core algorithms will never be open sourced, but something like a dynamic code library definitely would be OSed, and I'm hopeful that the whole GP engine might eventually be OSed. CptJack wrote:Everything you need to do is provided by Common Lisp, and I don't mean "CLisp", which is a byte-code interpreted CL implementation. Contrary to what most people assume, almost all CL implementations compile functions on-the-fly to machine code. That's raw machine object code, not byte-code, compiled and loaded dynamically while the system is still running. This sounds exactly like what you want.That sounds exactly like what the Java version is already doing. The only obvious benefit with Lisp is a slight simplification of the GP engine output stage. As I've noted, Hotspot is very well optimised these days, I'm not aware of a CL implementation that does better and indeed most perform significantly worse. D would have the advantage of being able to link to C with effectively zero overhead, do manual memory allocation and synchronization where beneficial (i.e. NUMA-optimised memory allocation that still plays well with the GC) and be able to use inline assembly for SSE code. CL doesn't have any of those advantages AFAIK, plus as I mentioned earlier it's (much) harder to get new developers for and (significantly) harder to train existing developers in. I'm actually genuinely surprised that the research team chose to develop the original algorithms in Java rather than Lisp; it would have been a good fit with that task, plus they're ex-academics and thus inherently more likely to use Lisp. But I don't think that it's a good choice for the production system, particularly given that the current code is in Java. Michael Wilson Systems Developer Purple Frog Text Ltd
Jan 12 2008
Michael Wilson:No method inlining is a little more serious, but then I'll have to use GDC rather than DMD for the 64-bit support, and presumably GDC inlines methods ok? If not I could make the GP output stage do it, the call depth and function size is low enough that it shouldn't be difficult.<Current DMD D is able to inline free functions and more. But Java code is often full of getters/setters, they don't seem to be inlined by DMD, so they may slow down code translated from Java to D. To put some "facts" behind my words you can perform a quick benchmark:

import std.stdio: put = writef, putr = writefln;
import d.time: clock;

struct S1 { uint s; }
void inc_sA(ref S1 s1) { s1.s++; }
void inc_sA(S1* s1) { s1.s++; }

struct S2 { uint s; void inc() { this.s++; } }

class C1 { uint s; void inc() { this.s++; } }
final class C2 { uint s; final void inc() { this.s++; } }
class C3a : C1 { void inc2() { this.s++; } }
final class C3b : C1 { final void inc2() { this.s++; } }

void main() {
    const uint N = 300_000_000;

    S1 s1;
    auto t = clock();
    for (uint i; i < N; ++i) s1.s++;
    putr(clock() - t, " ", s1.s);

    s1.s = 0;
    t = clock();
    for (uint i; i < N; ++i) inc_sA(s1);
    putr(clock() - t, " ", s1.s);

    s1.s = 0;
    t = clock();
    for (uint i; i < N; ++i) inc_sA(&s1); // inlined
    putr(clock() - t, " ", s1.s);

    S2 s2a;
    t = clock();
    for (uint i; i < N; ++i) s2a.inc; // inlined
    putr(clock() - t, " ", s2a.s);

    auto c1a = new C1;
    t = clock();
    for (uint i; i < N; ++i) c1a.inc;
    putr(clock() - t, " ", c1a.s);

    auto c2a = new C2;
    t = clock();
    for (uint i; i < N; ++i) c2a.inc;
    putr(clock() - t, " ", c2a.s);

    auto c3a = new C3a;
    t = clock();
    for (uint i; i < N; ++i) c3a.inc2;
    putr(clock() - t, " ", c3a.s);

    auto c3b = new C3b;
    t = clock();
    for (uint i; i < N; ++i) c3b.inc2;
    putr(clock() - t, " ", c3b.s);

    c3a = new C3a;
    t = clock();
    for (uint i; i < N; ++i) c3a.inc;
    putr(clock() - t, " ", c3a.s);
}

Timings, N = 300_000_000, DMD 1.025:
  1) -O -release
  2) -O -release -inline

  1)     2)
  2.63   2.63
  4.29   4.30
  3.70   2.50 *
  4.30   2.69 *
  4.91   4.91
  4.32   4.29
  4.90   4.90
  4.28   4.29
  4.94   4.92

Someone can run that on GDC too.'it's very similar to Java, our developers can learn it easily'.<Sometimes D can be seen as a superset of Java, but it has other things that you need time to learn, like all the juicy template tricks :-)Aside from that the other problems with using Lisp in this app are;<About them you can post questions on a Lisp newsgroup.Otherwise, I could just implement my own synchronization, via inline assembly (since this doesn't have to be multi-platform) or C.<You are a bold person :-)Interesting suggestion but unfortunately TinyCC does not (AFAIK) target AMD64 - also a problem for a lot of CL implementations. I'll have a look around for similar projects that do though.<From the things you have said I think you don't need TinyCC at all.As you noted, Hotspot is really quite good these days.<Behind HotSpot there are quite more developers than behind DMD (but I suspect Walter is better than most of them :-) ).LLVM looks much better at first glance and I should be able to target LLVM bytecode directly, but I'll have to investigate the feasibility of linking against D.<LLVM seems a nice solution. And you can just use C/C++ with LLVM instead of D... Bye, bearophile
Jan 12 2008
I was under the impression this was easy using XADD / CMPXCHG with the LOCK prefix. How to yield / wait may be a different story, but how many possible general purpose options could there be?Otherwise, I could just implement my own synchronization, via inline assembly (since this doesn't have to be multi-platform) or C.<You are a bold person :-)
Jan 12 2008
Neal Alexander wrote:It's also very slow compared to some of the optimized algorithms out there. But putting inline asm inside functions is pretty easy either way.I was under the impression this was easy using XADD / CMPXCHG with the LOCK prefix.Otherwise, I could just implement my own synchronization, via inline assembly (since this doesn't have to be multi-platform) or C.<You are a bold person :-)How to yield / wait may be a different story, but how many possible general purpose options could there be?There are established, optimized algorithms for locking on x86, so implementing one by example shouldn't be difficult. I don't have any links offhand though. comp.programming.threads would be a good start. Sean
Jan 14 2008
Michael Wilson Wrote:He meant "Common Lisp programmers". "CLisp" is just a single implementation.For that purpose CommonLisp looks like a good language (besides Java itself). It's fast enough if used well, and it compiles the functions on the fly. There are some disadvantages, like it's not easy to find CLisp programmers.It's even harder to find D developers, but it's easy for Java developers to learn (and if I used it on a production system, other developers at the company would have to as well). Lisp, not so much. I guarantee that if I said 'I want to reimplement in Lisp' the answer would be no, while I received provisional approval for using D on the basis that 'it's very similar to Java, our developers can learn it easily'.That's a valid argument. However, I would like to point you to this very interesting essay before dismissing Lisp entirely for that reason: http://www.paulgraham.com/avg.htmlAside from that the other problems with using Lisp in this app are; 1) /Much/ more work to port the sections of code I want to reuse from the old system across.Once you begin to do some serious programming in Lisp, and know it as well as you do your current language of choice, I think you will find that it is /much/ faster to program in than any other language. At first you will be fighting the syntax and the parens, then one day you will just "get it" and your world will change. It has so many unique features that you will wonder how you ever programmed without them (in fact, almost every "exciting new feature" in language X was implemented in Lisp 40 years ago).2) I don't know of any CL implementations with SSE intrinsics. So unless someone else knows one I'd be stuck linking to C code again. It's possible that there's a CL implementation with less overhead for linking dynamic code to external C libraries than JNI has, but if so I don't know of it.As for SSE support, I know SBCL has SSE support, and therefore I believe CMUCL also does (since SBCL is forked from CMUCL). 
ECL is completely based around linking dynamic code, and is highly optimized to do so.3) AFAIK similar issues with trying to do NUMA optimisation as Java and worse synchronisation performance (though again, there may be some super-optimal CL implementation out there I don't know about).Again, take a better look at SBCL and CMUCL.CptJack wrote: > Everything you need to do is provided by Common Lisp, and I don't mean > "CLisp", which is a byte-code interpreted CL implementation. Contrary > to what most people assume, almost all CL implementations compile > functions on-the-fly to machine code. That's raw machine object code, > not byte-code, compiled and loaded dynamically while the system is > still running. This sounds exactly like what you want. That sounds exactly like what the Java version is already doing.This is my point exactly. You've had to re-implement Lisp to get Java to do what you want, a-la Greenspun's Tenth Rule.The only obvious benefit with Lisp is a slight simplification of the GP engine output stage. As I've noted, Hotspot is very well optimised these days, I'm not aware of a CL implementation that does better and indeed most perform significantly worse.http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=sbcl&lang2=java They look pretty similar to me in this admittedly fairly worthless comparison.D would have the advantage of being able to link to C with effectively zero overhead, do manual memory allocation and synchronization where beneficial (i.e. NUMA-optimised memory allocation that still plays well with the GC) and be able to use inline assembly for SSE code. CL doesn't have any of those advantagesNot true. I think it's just a matter of common misconceptions about Lisp and its implementations. Most people think it's slow and interpreted. It's neither. Movitz for example is one Common Lisp implementation designed to be as "on-the-metal" as you want including inline assembly. 
Some of this movitz code has been ported to other implementations as well.plus as I mentioned earlier it's (much) harder to get new developers for and (significantly) harder to train existing developers in.As I said, that is indeed a valid argument. However, some of this is because of perceived hurdles that really don't exist (or are at least smaller than they at first appear). The only reason I'm still making arguments in favor of Common Lisp for your project is because what you describe is basically a Common Lisp implementation hacked together from Java and C. Why not just do away with the hacks?
Jan 12 2008
CptJack:
> Once you begin to do some serious programming in Lisp, and know it as
> well as you do your current language of choice, I think you will find
> that it is /much/ faster to program in than any other language.

I like some sides of Lisp (and you have seen I too have suggested it as an option for this project), but I think you can't prove that. For example I don't think you can write serious programs in Lisp faster than in Python... ;-) And to know CommonLisp as well as Java/D you may need years. If you know Java you can learn D rather well in less than 4-6 months. I think only C++ requires more learning time than CommonLisp :-)

>> That sounds exactly like what the Java version is already doing.
> This is my point exactly. You've had to re-implement Lisp to get Java to
> do what you want, a la Greenspun's Tenth Rule.

But if that's already work done, then there may be no need to write it again in Lisp :-)

Bye,
bearophile
Jan 12 2008
CptJack wrote:
> Once you begin to do some serious programming in Lisp, and know it as
> well as you do your current language of choice, I think you will find
> that it is /much/ faster to program in than any other language. At first
> you will be fighting the syntax and the parens, then one day you will
> just "get it" and your world will change. It has so many unique features
> that you will wonder how you ever programmed without them (in fact,
> almost every "exciting new feature" in language X was implemented in
> Lisp 40 years ago).

I could write a long response to this but it would be off topic for this newsgroup. Rather I will say only that I have written a moderate amount of Lisp code and I have had many, many encounters with 'keen' Lisp advocates. I can't recall /ever/ winning a 'which language to use' debate with /any/ of them, for any real or proposed application. At some point in becoming fond of the language a critical threshold is reached, and from then on that person believes that no other (high-level) language has a genuine reason to exist. I think I can understand where this notion comes from (I have had some experience of language and compiler design), but I certainly don't agree with it. In any case I've found such debates rather unproductive.

> The only reason I'm still making arguments in favor of Common Lisp for
> your project is because what you describe is basically a Common Lisp
> implementation hacked together from Java and C.

You may believe this, probably based on a mental model that takes 'all other languages are a nasty subset of Lisp' as an axiom, but it is not the case in this (or for that matter, most) instance. In fact I'm curious why you're here at all, as this also implies that D can never be more than a nasty subset of Lisp.

Michael Wilson
Systems Developer
Purple Frog Text Ltd
Jan 12 2008
You can also avoid targeting bytecode entirely: what I mean is that you can directly, in your program, construct and link (with a veeeery good optimizer) the genetic functions, without having to emit/compile/link anything bytecode-related. The JIT is your friend. That can be a very good solution, especially if the JITted functions are non-trivial ones and can benefit a lot from optimizations.

I've done something similar some time ago, writing some wrapper C code around the interesting LLVM C++ API (but now I think there's an effort to produce an LLVM C API), using it in a genetic D program that instantiated a lot of LLVM functions (for a generation) in an LLVM module, and then executed them over the D individuals. I aborted the effort for lack of time, and a not-so-big improvement over a more classical, only-D-based solution.

Cheers, Paolo

Michael Wilson Wrote:
> Thanks! That does look really interesting. I had a quick look at GNU
> Lightning but it's almost entirely unoptimising, and thus probably
> unusable. LLVM looks much better at first glance and I should be able to
> target LLVM bytecode directly, but I'll have to investigate the
> feasibility of linking against D. Googling, it looks like there have
> been a few attempts to use LLVM and D together before, but mostly in the
> sense of using it as a third static compiler (e.g.
> http://www.dsource.org/projects/llvmdc). I will have to do some more
> research and see if there is anything I could get involved with.
> Needless to say, if I can get my company to contribute engineering time
> to some interesting open source development, I will. :)
Jan 12 2008
Michael Wilson wrote:
> That's something I could potentially help with once I'm decently
> familiar with the language, if the GDC developers are accepting patches.
> Otherwise, I could just implement my own synchronization, via inline
> assembly (since this doesn't have to be multi-platform) or C.

I'm certainly willing to accept help with this!
Jan 13 2008
Michael Wilson wrote:
> bearophile wrote:
>>> 2. High-performance synchronization. As of 1.6, Java's monitor
>>> implementation is quite impressive;
>> I don't know the answer, but in most things D is far from being tuned
> That's something I could potentially help with once I'm decently
> familiar with the language, if the GDC developers are accepting patches.
> Otherwise, I could just implement my own synchronization, via inline
> assembly (since this doesn't have to be multi-platform) or C.

Right now, D uses OS API routines for synchronization. Tango also has a Mutex class design that integrates with the 'synchronized' statement, and it would be pretty easy to create a futex or whatever that does the same thing. This may be the way to go for special-purpose locking.

Sean
Jan 13 2008
I guess the best approach is LLVM... Take a look at http://www.llvm.org

Cheers, Paolo

Michael Wilson Wrote:
> Any thoughts on any of the above? Bear in mind that I have a mild bias
> towards using D even if it is some extra work to get things going, but I
> am certainly not going to accept worse overall performance than the Java
> version.
Jan 12 2008
I'm not even close to being an expert on the internal D compiler mechanisms to know if it satisfies your requirements. However, reading your post I kept thinking one thing: Everything you need to do is provided by Common Lisp, and I don't mean "CLisp", which is a byte-code interpreted CL implementation. Contrary to what most people assume, almost all CL implementations compile functions on-the-fly to machine code. That's raw machine object code, not byte-code, compiled and loaded dynamically while the system is still running. This sounds exactly like what you want. CMUCL and SBCL are the two most mature open source implementations that do so on linux, although ECL does as well and it's getting more mature all the time. In fact, it sounds like your current system is a perfect example of Greenspun's Tenth Rule.
Jan 12 2008
Hi Michael,
A pragmatic approach: D-newLISP-C.
I guess you'll find the following newLISP examples interesting:
- embed binary machine code into newLISP source
- distributed computing: a map/reduce style example of calculating and reducing word counts from different documents (see source code comments)
You can embed newLISP as an .so, or run it as an inetd or xinetd service, etc. A Tru64 Unix binary is available (otherwise you have to deal with the source).
Just in case, the example link: http://newlisp.org/index.cgi?Tips_and_Tricks
HTH Bjoern
Jan 13 2008
I don't know what the right answer is for your application, so I'll just try and answer where I can.

Michael Wilson wrote:
> The system carries out data mining tasks using algorithms generated by a
> genetic programming engine. The original prototype (developed before I
> joined the company) did this by constructing a Lisp-style function tree
> out of Java objects, then running that in an interpreter (i.e. it used
> ECJ). Unsurprisingly this was hideously slow. The current version that I
> designed works by compiling the function tree to Java bytecode (as
> static methods in a new class), calling an in-memory classloader on the
> output (allowing Java to link it), then allowing Hotspot to translate it
> to native machine code. However to get good performance on large
> datasets, I had to implement the terminal functions (i.e. the inner
> loops) in C, using GCC's SSE intrinsics. The dynamic Java code calls
> this via the Java Native Interface (specifically, a class full of static
> native methods). However JNI has a significant call overhead, which is
> eating up a good fraction of the performance boost, as well as creating
> additional development/maintenance hassle. This system is running on a
> couple of beefy 16 core Opteron servers and it still takes hours to run
> on large datasets.

C code can be called directly from D without a JNI type interface.

> The major problem with implementing in D is the lack of runtime
> compilation and the AFAIK subpar support for dynamic linking. I'll get
> to that in a bit. The other potential problems are;
> 1. Support for AMD64 on Linux is absolutely essential; these datasets
> are several gigabytes in size and need to be loaded into memory. Plus
> there is a nontrivial amount of 64-bit integer manipulation in the Java
> code. Linux obviously because what sane person would waste money on
> Windows Server licenses for compute servers? I haven't been able to
> track down the latest estimate for when DMD will support AMD64, but at
> least GDC does now. Is this stable enough for production use?

64 bit DMD is not on the horizon yet, so your only option there is GDC. If it isn't stable enough, I suspect that your needs will drive it to stability.

> 2. High-performance synchronization. As of 1.6, Java's monitor
> implementation is quite impressive; the biased locking mechanism means
> that atomic operations are only required for a small fraction of monitor
> acquisitions. I've been trying to find out if D's implementation uses
> the same mechanism. This is probably only good for a few percent of
> performance, but on 16-way NUMA locking performance can make a
> difference even when it isn't in the inner loops.

D's locking mechanism is controlled by library code, so it is adaptable by the user. Currently, it is rather basic.

> 3. SSE intrinsics. Does GDC have the equivalent of GCC SSE intrinsics
> yet (i.e. "target builtins")? This post suggests that it does;
> http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=D.gnu&artnum=2622
> but I can't see anything in the documentation about them. If not I'll
> have to continue linking to the existing C code, which eliminates a
> significant benefit of going to D (simplifying the code base - but at
> least the JNI call & pinning overhead will be gone). Ideally
> std.intrinsic would have these (presumably that's how DMD is going to
> include them, if and when it does - as far as I can tell, DMD is still
> doing even basic double-precision operations with slow x87 code rather
> than SSE code, unless
> http://www.digitalmars.com/d/archives/digitalmars/D/Optimizations_of_DMD_49840.html
> is out of date).

DMD supports the SSE code via the inline assembler only. I don't know if GDC does. In any case, it would be simple to just link to C code that does it.

> Ok, on to the serious problem, dynamic code support. The GP engine is
> outputting a function tree structure. I can trivially output this as D
> or Java source. I can output it as Java bytecode with a bit more effort.
> I really don't want to have to compile it to x86 machine code (and link
> it against the terminal functions) myself, particularly given that the
> research guys keep elaborating the algorithm. Here's what I've
> considered so far;
> 1. Output as C source plus a JNI stub in bytecode. Invoke gcc on it to
> build a dynamic library. Load the class in Java. This would remove most
> of the JNI overhead (because the Java/C crossover would be pushed 4 to
> 10 places up the call graph). However I'd have to keep invoking GCC as a
> separate process, though I could cut down the overhead of this somewhat
> by putting it and the dynamic files on a ramdisk. I'm unsure what the
> performance would be like under Linux of constantly linking new shared
> libraries, and I'm not sure if the Sun JVM is capable of unloading
> native code along with unloading classes (though this memory leak is
> probably insignificant over the lifetime of the process). This is the
> solution I will be using if I can't find a good D solution.
> 2. Port to D. Change the GP system to target MiniD bytecode. I haven't
> examined MiniD in detail yet, but I get the impression that this will be
> marginally harder than targeting Java bytecode. Alternatively I could
> target MiniD source, which would be simpler for me, at some additional
> compilation cost, or I could dig into the MiniD compiler source and see
> if it has an AST I could target. The latter would be ideal if the code
> is stable, but it probably isn't. MiniD is purely interpreted, so
> performance of the dynamic code will almost certainly suck badly
> compared to Hotspot. On the plus side, 'linking' cost is effectively
> zero. The inner loops will still have to be in either D (if it has
> decent SSE intrinsics now) or C (if it doesn't); at first glance MiniD
> native invocation overhead looks better than JNI (no need to pin for one
> thing, since the GC is non-copying) but still nontrivial.

If the linker overhead is too big, it's possible to do the 'linking' yourself by directly reading the .o file, stuffing it in memory, and executing it.

> 3) Port to D. Change the GP system to target D source. Batch code
> updates together. On each update, recompile the codebase using gdc
> running in a ramdisk. Start a new process using the new code. Keep all
> the persistent data in a shared memory segment. Get the new process to
> map in the shared segment, then kill the old process. Processing can
> continue from where it left off using the updated code. AFAIK this
> /should/ work, but I don't know what the overheads are going to be like.
> Having to batch together the code updates is mildly inconvenient from
> the algorithm design side. I'm not sure what if any incremental
> compilation features GDC has; perhaps I could patch it to stay in memory
> and stream new classes to it over a pipe.

This looks like your best shot with D. Like you say, it is dependent on how fast GDC can be run.

> 4) Port to D. Change the GP system to target D source. Batch code
> updates together. On each update, link the new classes in with DDL. The
> major problem with this is that AFAIK DDL isn't working on Linux yet (at
> all, never mind on AMD64) and is still beta quality code even on
> Windows. In addition to that, I'm not clear if it can ever recover the
> memory for unused code, though as I mentioned for (1) this leak isn't
> likely to be a problem over the lifetime of the process. Plus the GDC
> invocation overhead as for (3).
> and finally...

D DLLs work on Windows. DMD supports writing pic code on Linux, I've just never done the work to build a shared library using it.

> 6) Write a new 'D Dynamic Code Library'. My best guess for how this
> should work is this;
> a) This library will statically link to GDC and the GCC back end.
> b) On first invocation, look for a metadata file. If it isn't present or
> is out of date, recompile the app from source and keep the relevant GDC
> state in memory. Otherwise, deserialise GDC state from the metadata file.
> c) Supply functions that accept D source and possibly one or more of the
> GCC intermediate representations (i.e. AST, CFG, RTL). These would call
> through to GDC/GCC, intercept the output and write it into the code
> page, then perform runtime linking as necessary. This is probably a lot
> of work, but I am unsure exactly how much.
> d) Open source this library in case anyone else wants to use it.

Wow.

> Any thoughts on any of the above? Bear in mind that I have a mild bias
> towards using D even if it is some extra work to get things going, but I
> am certainly not going to accept worse overall performance than the Java
> version.

I think that any D path you take with this will be of a large, sustained benefit to the D community.
Jan 13 2008
Michael Wilson wrote:
> 4) Port to D. Change the GP system to target D source. Batch code
> updates together. On each update, link the new classes in with DDL. The
> major problem with this is that AFAIK DDL isn't working on Linux yet (at
> all, never mind on AMD64) and is still beta quality code even on
> Windows. In addition to that, I'm not clear if it can ever recover the
> memory for unused code, though as I mentioned for (1) this leak isn't
> likely to be a problem over the lifetime of the process. Plus the GDC
> invocation overhead as for (3).

For what it's worth, you don't need to use DDL on Linux. .so is just fine there (at least with GDC). DDL was pretty stable when implemented with Phobos; I have used it in a mid-scale project (a first-person shooter engine+game) for plugins and rendering kernels (in a micro-kernel-like architecture). DDL got recently ported to Tango. I haven't done much testing yet, but it appears to be working just fine.

Good luck with your project! :)

--
Tomasz Stachowiak
http://h3.team0xf.com/
h3/h3r3tic on #D freenode
Jan 13 2008
Check out Kong on dsource.org: www.dsource.org/projects/kong
It gives you the ability to modify which function is called when a library function is called. So you could have a library that you load at program initialization just to make the symbols available; then, each time you generate new D code, you compile it to a shared library, load it, and modify the original functions to point to the newly loaded library's symbols.
-Rory
Mar 27 2008