digitalmars.D - Few ideas to reduce template bloat
- bearophile (24/24) Mar 25 2010 There are some ways to reduce the amount of binary code produced by temp...
- Norbert Nemec (24/27) Mar 25 2010 Indeed, this is the very fundamental distinction between templates (C++)...
- bearophile (4/5) Mar 25 2010 Just to be sure you have understood: the solution number 5 is for a stat...
- sebastien.f (5/5) Mar 25 2010 This problem reminds me of a recent paper written by Bjarne Stroustrup (...
- bearophile (4/6) Mar 25 2010 It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one...
- Andrei Alexandrescu (3/10) Mar 25 2010 That's a very, very nice hat tip from the authors.
- Walter Bright (2/13) Mar 25 2010 Yes!
- Steve Teale (3/16) Mar 25 2010 Do we have a balanced tree implementation in Phobos these days?
- bearophile (3/3) Mar 25 2010 After reading that paper I can say that D compilers can start doing the ...
- Bernard Helyer (2/3) Mar 25 2010 Microsoft Visual D? =P
- Walter Bright (3/7) Mar 26 2010 Because of separate compilation, that can't be done with the compiler in...
- bearophile (4/6) Mar 26 2010 LDC is able to do it already! I don't need 100% general solutions.
- Walter Bright (6/11) Mar 26 2010 Are you sure LDC isn't doing it at the link step?
- bearophile (10/13) Mar 26 2010 This is the post where I have explained it:
- Norbert Nemec (13/17) Mar 26 2010 Of course. In that sense, Java, C#, C++ and D are all in the same boat.
- Walter Bright (12/15) Mar 28 2010 With enough indirection, template instances can all be done with one fun...
- Walter Bright (2/2) Mar 28 2010 Oh, one more thing. A template system can be used to create a user-defin...
- bearophile (13/18) Mar 28 2010 My idea 5 was not to remove templates from D, and not even to add normal...
- bearophile (245/245) Mar 26 2010 I have asked to llvm devs, it seems llvm already has an optimization pas...
There are some ways to reduce the amount of binary code produced by templates in a language like D. 1) There are few situations where the types are different but the asm instructions inside the function are the same, like uint and int, but this situation probably is not common enough. 2) Even in function templates that get instantiated to a different asm there can be parts that are shared. If the shared part is at the end of the function the compiler can use a jump to use a single function exit point and remove some code duplication. I don't think this is a common situation. 3) There are more situations where a template type is one of several classes (OOP), this can produce the same duplicated asm. In this case the compiler can keep only one version in the final binary. This seems common enough. run-time. D can of course do the same if it uses the LLVM backend, but let's ignore this possibility for now. 5) Java generics don't cause code bloat because it removes the types at runtime and uses boxed values. This can be type-safe, but it's less flexible than C++/D templates, and the boxing-unboxing decreases performance. On the other hand programming practice shows that in most programs not all functions have to be top performance. So a compromise can be invented (I have never found this idea elsewhere). An attribute like boxed can be used to mark what function template arguments are boxed (and don't multiply the template instances). So for example this template function can be called with double or int values, so it gets intantiated in four versions: T2 foo(T1, T2)(T1 x, T2 y) {...} Instantiation 1: foo(1, 2) Instantiation 2: foo(1, 2.0) Instantiation 3: foo(1.0, 2) Instantiation 4: foo(1.0, 2.0) If foo() is not performance-critical, then the same function template with one boxed type, and with the same inputs, gets instantiated just two times, halving its combinatorial explosion: boxed(T2) foo(T1, boxed T2)(T1 x, T2 y) {...} Instantiation 1: foo(1, 2) and foo(1, 2.0) Instantiation 2: foo(1.0, 2) and foo(1.0, 2.0) The first argument is like a normal template argument and created different instantiations, the second argument gets automatically boxed-unboxed, so it's like a single type. Inside this version of the function foo() instructions like: static if typeof(T2 == int) {... are not allowed by the compiler (despite they are formally correct) because instances of T2 are objects and the compiler reminds you that you have to use a run time cast: if (cast(Integer)T2 !is null) {... There are ways to reduce code bloat with idioms used by the programmer, but that's good for C++ programmers. Programmers in a more modern language prefer something safer done by the compiler :-) Bye, bearophile
Mar 25 2010
bearophile wrote:run-time. D can of course do the same if it uses the LLVM backend, but let's ignore this possibility for now. 5) Java generics don't cause code bloat because it removes the types at runtime and uses boxed values. This can be type-safe, but it's less flexible than C++/D templates, and the boxing-unboxing decreases performance. On the other hand programming practice shows that in most programs not all functions have to be top performance. So a compromise can be invented (I have never found this idea elsewhere). An attribute like boxed can be used to mark what function template arguments are boxed (and don't multiply the template instances).Indeed, this is the very fundamental distinction between templates (C++) Templates: * have to be instantiated at compile time * usable for meta-programming * cannot be compiled or type-checked on its own --> not type-safe (i.e. if you don't set the constraints correctly, the instantiation may fail with an error message deep inside the template code) Generics: * Every template parameter must follow a clear interface. * generic code can be compiled and fully type-checked on its own * instantiation not needed at all but possible as optimization -> use boxed values without instantiation -> optionally instantiate at run-time/link-time/compile-time (analogous to function inlining) Though both concepts have a large overlap, it is difficult to bridge the gap without adding significant complexity. Some experimental languages try to get the best of both ways by smearing or even giving up the distinction between compile-time and run-time. Though these attempts are very promising for the future, they are clearly beyond the pragmatic goals of D. D is clearly designed as language that can be compiled to a static binary.
Mar 25 2010
Norbert Nemec:D is clearly designed as language that can be compiled to a static binary.<Just to be sure you have understood: the solution number 5 is for a statically compiled language. It doesn't need a JIT. You just need normal objects, a virtual table, etc. Bye, bearophile
Mar 25 2010
This problem reminds me of a recent paper written by Bjarne Stroustrup (and others) describing and evaluating a technique named “generalized hoisting”. It’s applied to the C++ STL iterators and removes lot of code bloat and increase performances surprisingly. I found it a great read (as often with Bjarne.S) I thought you may find it interesting too. It says the ideas apply to the D language too. Minimizing Dependencies within Generic Classes for Faster and Smaller Programs. http://www.research.att.com/~bs/SCARY.pdf
Mar 25 2010
sebastien.f:I found it a great read (as often with Bjarne.S) I thought you may find it interesting too. It says the ideas apply to the D language too.It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-) Bye, bearophile
Mar 25 2010
On 03/25/2010 03:17 PM, bearophile wrote:sebastien.f:That's a very, very nice hat tip from the authors. AndreiI found it a great read (as often with Bjarne.S) I thought you may find it interesting too. It says the ideas apply to the D language too.It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-)
Mar 25 2010
Andrei Alexandrescu wrote:On 03/25/2010 03:17 PM, bearophile wrote:Yes!sebastien.f:That's a very, very nice hat tip from the authors.I found it a great read (as often with Bjarne.S) I thought you may find it interesting too. It says the ideas apply to the D language too.It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-)
Mar 25 2010
On Thu, 25 Mar 2010 16:48:14 -0500, Andrei Alexandrescu wrote:On 03/25/2010 03:17 PM, bearophile wrote:Do we have a balanced tree implementation in Phobos these days? Stevesebastien.f:That's a very, very nice hat tip from the authors. AndreiI found it a great read (as often with Bjarne.S) I thought you may find it interesting too. It says the ideas apply to the D language too.It seems a nice paper, thank you, I'll read it. Bjarne.S is the last one in the list of its authors. It's nice to see a paper written by such people that mentions D three times or more :-)
Mar 25 2010
After reading that paper I can say that D compilers can start doing the simpler thing: recognizing instantiations of a function template that are compiled to the same asm code, and keep only one of them. The Microsoft compiler seems to do that already. Bye, bearophile
Mar 25 2010
On 26/03/10 10:18, bearophile wrote:The Microsoft compiler seems to do that already.Microsoft Visual D? =P
Mar 25 2010
bearophile wrote:After reading that paper I can say that D compilers can start doing the simpler thing: recognizing instantiations of a function template that are compiled to the same asm code, and keep only one of them. The Microsoft compiler seems to do that already.Because of separate compilation, that can't be done with the compiler in the general case. The linker can, though.
Mar 26 2010
Walter Bright:Because of separate compilation, that can't be done with the compiler in the general case. The linker can, though.LDC is able to do it already! I don't need 100% general solutions. Bye, bearophile
Mar 26 2010
bearophile wrote:Walter Bright:Are you sure LDC isn't doing it at the link step? Regardless, if you implement a half solution in the compiler, and then a full solution in the linker, you've doubled your effort for the same result. This may be a practical solution, however, if you desire to work with any linker, including ones you have no control over.Because of separate compilation, that can't be done with the compiler in the general case. The linker can, though.LDC is able to do it already! I don't need 100% general solutions.
Mar 26 2010
Walter Bright:Are you sure LDC isn't doing it at the link step?This is the post where I have explained it: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=108275 As you can see from the link too llvm source code at the top of my post: https://llvm.org/svn/llvm-project/llvm/tags/RELEASE_25/lib/Transforms/IPO/MergeFunctions.cpp that's a LLVM optimization pass, and ldc is spitting out already optimized asm, so it's not something done by the linker.Regardless, if you implement a half solution in the compiler, and then a full solution in the linker, you've doubled your effort for the same result.Right. Is dmd linker able to do that? The Gold linker (http://en.wikipedia.org/wiki/Gold_%28linker%29 ) that I think in future ldc will use may be able to do that. But you are right, if D modules are compiled separately the compiler can't recognize many cases of duplications. Bye, bearophile
Mar 26 2010
bearophile wrote:Norbert Nemec:My comment was geared towards languages that experiment with multi-stage compilation (or "specialization"). What I wanted to say: statically compiled languages typically distinguish between generics (which "exist" at run time as objects) and templates (which exist at compile time only). Most languages offer only one of both. Once you start merging both concepts you quickly end up with smearing the boundary between run-time and compile-time and depend on JIT technology or multi-stage compilation. D may of course use these techniques for optimization, but the language should not depend on them.D is clearly designed as language that can be compiled to a static binary.<Just to be sure you have understood: the solution number 5 is for a statically compiled language. It doesn't need a JIT. You just need normal objects, a virtual table, etc.
Mar 26 2010
bearophile wrote:Just to be sure you have understood: the solution number 5 is for a statically compiled language. It doesn't need a JIT. You just need normal objects, a virtual table, etc.With enough indirection, template instances can all be done with one function regardless of the actual argument types. What is lost with that approach, however, are all the benefits of inlining, constant folding, specialization, etc. In addition, CPUs tend to do a much poorer job with indirection than they do with inline code. D's templates are also more powerful than generics in that they can be used to templatize data and symbols, not just functions and classes. Generic types can also be more bloated because they have to carry around the indirect references, while templated types can specialize them away. For example, a templated hash-map can often eliminate storing the hash, while a generic one will have to keep it regardless.
Mar 28 2010
Oh, one more thing. A template system can be used to create a user-defined generic system, but a generic system cannot be used to create a template system.
Mar 28 2010
Walter Bright:With enough indirection, template instances can all be done with one function regardless of the actual argument types. What is lost with that approach, however, are all the benefits of inlining, constant folding, specialization, etcMy idea 5 was not to remove templates from D, and not even to add normal Java-like generics to D. And I know templates are usually faster and more flexible. My idea was to create something mixed, where you can choose what input arguments are boxed as in Java, and what ones are not boxed as in D/C++. So the purpose was not to replace normal D templates, but to add flexibility. Because I think there are templates that don't need max performance, in such case using a mixed template-generic (or full generic) can give a "good enough" performance while keeping the final binary smaller. Reducing binary size is good because it decreases the pressure on the code L1 cache, and this speeds up the program a little. So it's like a cross between a template and a generic. This is a normal D template: T1 foo(T1, T2, T3)(T1 x, T2, y, T3 z) {...} This is a full java-style generic, that will have only one implementation in the binary, x, y and z will be boxed inside foo: generic T1 foo( generic T1, generic T2, generic T3)(T1 x, T2, y, T3 z) {...} And this is something mixed, here T1 and T3 will be boxed but T2 will not, T2 is like a D template argument, so the compiler will instantiate as many different T2 it will find: generic T1 foo( generic T1, T2, generic T3)(T1 x, T2, y, T3 z) {...}Generic types can also be more bloated because they have to carry around the indirect references, while templated types can specialize them away.<You are right, as always in life it's a matter of compromises and balances :-) If you instantiate a hybrid template like foo( generic T1, T2, generic T3) like with ten different T2 types, you can usually produce more code than a single (probably bigger) 100% generic like generic T1 foo( generic T1, generic T2, generic T3) :-) Bye, bearophile
Mar 28 2010
I have asked to llvm devs, it seems llvm already has an optimization pass that removes duplicated functions. But it's disabled on default, you can use it with LDC with "-mergefunc". Its implementation, probably written by nlewycky: https://llvm.org/svn/llvm-project/llvm/tags/RELEASE_25/lib/Transforms/IPO/MergeFunctions.cpp -------------------- I have done few tests in D1 with LDC, the first one, here -mergefunc (and no other optimizations) leaves a single function: import tango.stdc.stdio: printf; import tango.stdc.stdlib: atoi; int factorial1(int X) { if (X == 0) return 1; return X*factorial1(X-1); } int factorial2(int X) { if (X == 0) return 1; return X*factorial2(X-1); } void main(char[][] args) { printf("%d\n", factorial1(atoi(args[1].ptr))); printf("%d\n", factorial2(atoi(args[1].ptr))); } -------------------- Second test I've done: import tango.stdc.stdio: printf; import tango.stdc.stdlib: atoi; T factorial(T)(T x) { if (x == 0) return 1; else return x * factorial(x - 1); } void main(char[][] args) { printf("%d\n", factorial(atoi(args[1].ptr))); printf("%d\n", factorial(cast(uint)atoi(args[1].ptr))); } The resulting asm with -mergefunc shows some duplication left (a person says this is a bug that can be fixed, I have to talk with another developer): __unnamed_1: subl $4, %esp movl %eax, (%esp) testl %eax, %eax jne .LBB2_2 movl $1, %eax addl $4, %esp ret .LBB2_2: movl (%esp), %eax decl %eax call factorial!uint imull (%esp), %eax addl $4, %esp ret factorial!int: subl $4, %esp call __unnamed_1 addl $4, %esp ret factorial!uint: subl $4, %esp call __unnamed_1 addl $4, %esp ret _Dmain: subl $28, %esp movl 36(%esp), %eax movl %eax, 20(%esp) movl 32(%esp), %eax movl %eax, 16(%esp) cmpl $1, 16(%esp) jbe .LBB1_3 movl 20(%esp), %eax movl 12(%eax), %eax movl %eax, (%esp) call atoi call factorial!int movl %eax, 4(%esp) movl $.str, (%esp) call printf cmpl $1, 16(%esp) jbe .LBB1_4 movl 20(%esp), %eax movl 12(%eax), %eax movl %eax, (%esp) call atoi call factorial!uint movl %eax, 4(%esp) movl $.str2, (%esp) call printf xorl %eax, %eax addl $28, %esp ret $8 .LBB1_3: movl .modulefilename+4, %eax movl %eax, 4(%esp) movl .modulefilename, %eax movl %eax, (%esp) movl $12, 8(%esp) call _d_array_bounds .LBB1_4: movl .modulefilename+4, %eax movl %eax, 4(%esp) movl .modulefilename, %eax movl %eax, (%esp) movl $13, 8(%esp) call _d_array_bounds ---------------------------- My third experiment: import tango.stdc.stdio: printf; import tango.stdc.stdlib: atoi; int factorial1(int X) { if (X == 0) return 1; return X*factorial1(X-1); } int factorial2(int X) { if (X == 0) return 1; return X*factorial2(X-1); } int caller(int function(int) func, int x) { return func(x); } void main() { printf("%d\n", caller(&factorial1, atoi("10"))); printf("%d\n", caller(&factorial2, atoi("10"))); } Its asm without mergefunc: _D5temp310factorial1FiZi: subl $4, %esp movl %eax, (%esp) movl (%esp), %eax cmpl $0, %eax jne .LBB1_2 movl $1, %eax addl $4, %esp ret .LBB1_2: movl (%esp), %eax subl $1, %eax call _D5temp310factorial1FiZi movl %eax, %ecx movl (%esp), %eax imull %ecx, %eax addl $4, %esp ret _D5temp310factorial2FiZi: subl $4, %esp movl %eax, (%esp) movl (%esp), %eax cmpl $0, %eax jne .LBB2_2 movl $1, %eax addl $4, %esp ret .LBB2_2: movl (%esp), %eax subl $1, %eax call _D5temp310factorial2FiZi movl %eax, %ecx movl (%esp), %eax imull %ecx, %eax addl $4, %esp ret _D5temp36callerFPFiZiiZi: subl $12, %esp movl 16(%esp), %ecx movl %ecx, 8(%esp) movl %eax, 4(%esp) movl 8(%esp), %ecx movl 4(%esp), %eax call *%ecx addl $12, %esp ret $4 _Dmain: subl $12, %esp leal .str1, %eax movl %eax, (%esp) call atoi movl $_D5temp310factorial1FiZi, (%esp) call _D5temp36callerFPFiZiiZi subl $4, %esp movl $.str, (%esp) movl %eax, 4(%esp) call printf leal .str3, %eax movl %eax, (%esp) call atoi movl $_D5temp310factorial2FiZi, (%esp) call _D5temp36callerFPFiZiiZi subl $4, %esp movl $.str2, (%esp) movl %eax, 4(%esp) call printf xorl %eax, %eax addl $12, %esp ret $8 Its asm with mergefunc: _D5temp310factorial1FiZi: subl $4, %esp movl %eax, (%esp) testl %eax, %eax jne .LBB1_2 movl $1, %eax addl $4, %esp ret .LBB1_2: movl (%esp), %eax decl %eax call _D5temp310factorial1FiZi imull (%esp), %eax addl $4, %esp ret _D5temp310factorial2FiZi: subl $4, %esp call _D5temp310factorial1FiZi addl $4, %esp ret _D5temp36callerFPFiZiiZi: subl $12, %esp movl 16(%esp), %ecx movl %ecx, 8(%esp) movl %eax, 4(%esp) call *8(%esp) addl $12, %esp ret $4 .size _D5temp36callerFPFiZiiZi, .-_D5temp36callerFPFiZiiZi _Dmain: subl $12, %esp movl $.str1, (%esp) call atoi movl $_D5temp310factorial1FiZi, (%esp) call _D5temp36callerFPFiZiiZi subl $4, %esp movl %eax, 4(%esp) movl $.str, (%esp) call printf movl $.str3, (%esp) call atoi movl $_D5temp310factorial2FiZi, (%esp) call _D5temp36callerFPFiZiiZi subl $4, %esp movl %eax, 4(%esp) movl $.str2, (%esp) call printf xorl %eax, %eax addl $12, %esp ret $8 llvm is not able to replace all function pointers factorial2 with the ones to factorial1, so turns factorial2 into a stump that just calls the factorial1. In C/D you can compare function pointers so in this case LLVM has done what it can. But can't here llvm replace factorial2 with just a jump instruction? I am not expert enough yet to answer this. If someone has an answer for me I'd like to hear it :-) Bye, bearophile
Mar 26 2010