digitalmars.D - Multi-architecture binaries
- Jascha Wetzel (14/14) May 01 2007 A thought that came up in the VM discussion...
- Chad J (24/157) May 01 2007 I've thought about this myself, and really like the idea. In the VM
- Anders F Björklund (13/20) May 01 2007 On a totally unrelated note we are using GDC to build Universal Binaries
- Lutger (9/9) May 01 2007 I've seen some games with multiple executables compiled for different
- Jascha Wetzel (20/29) May 02 2007 yeah, the size issue isn't that important. it's actually more about not
- janderson (16/34) May 01 2007 This is fine when you have a small sub-set of target architectures,
- Jascha Wetzel (5/44) May 02 2007 the granularity isn't as fine as it could be, of course. but the effort
- janderson (2/46) May 02 2007
- Jascha Wetzel (6/6) May 02 2007 here is a much simpler version that works with templates. what is boils
- Don Clugston (27/34) May 02 2007 A pragma would only be required as a size optimisation. Probably not
- janderson (10/51) May 02 2007 That may be the case. Also if the code is only called once, it would
- Pragma (22/63) May 02 2007 I was thinking about this. What would be nice is if D had reflection an...
- Jascha Wetzel (18/62) May 02 2007 i'm not sure what you mean. i thought of something like this:
A thought that came up in the VM discussion... Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code. Ideally we wouldn't have to write additional code either. The compiler could emit code for multiple targets on a per-function basis (e.g. with the target architecture mangled into the function name). The runtime would check at startup which version to use and "link" the appropriate function. Here is a small proof-of-concept implementation of this detection and linking mechanism. Comments?
May 01 2007
I've thought about this myself, and really like the idea. In the VM discussion Don mentioned benchmarking different codepaths to find which one works best on the current CPU, then linking the best one in. This makes a lot of sense to me, since CPUs seem to have different performance characteristics, even regardless of instruction set differences. I was once benchmarking an algorithm on my notebook computer with a more modern processor, and my desktop computer with an older processor. The algo ran faster on the notebook of course, but branching had an especially reduced cost. That is, branching on the more modern processor was less expensive relative to other instructions than it was on the previous processor. This was with the same D binary on both of them. That is the sort of stuff that I think the JITCs want to leverage, but I have to wonder if using a strategy like this and covering enough permutations of costly algorithms would give exactly the same benefit, with a massively reduced startup time for applications. Of course, it would also be nice to be able to turn it off, because it will cost SOME startup time as well as executable size, which are not worthwhile costs for some apps like simple command line apps that need to be snappy and small. It would rock for games though ;) I really can't wait to see D's performance some day when/if it gets cool tricks like this, low-d vector primitives, array operations, etc. Jascha Wetzel wrote:A thought that came up in the VM discussion... Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code. Ideally we wouldn't have to write additional code either. The compiler could emit code for multiple targets on a per-function basis (e.g.
with the target architecture mangled into the function name). The runtime would check at startup which version to use and "link" the appropriate function. Here is a small proof-of-concept implementation of this detection and linking mechanism. Comments?

------------------------------------------------------------------------

import std.cpuid;
import std.stdio;

//-----------------------------------------------------------------------------
// This code goes into the runtime library

const uint CPU_NO_EXTENSION = 0,
           CPU_MMX  = 1,
           CPU_SSE  = 2,
           CPU_SSE2 = 4,
           CPU_SSE3 = 8;

/******************************************************************************
    A function pointer with a bitmask for its required extensions
******************************************************************************/
struct MultiTargetVariant
{
    static MultiTargetVariant opCall(uint ext, void* func)
    {
        MultiTargetVariant mtv;
        mtv.ext  = ext;
        mtv.func = func;
        return mtv;
    }

    uint  ext;
    void* func;
}

/******************************************************************************
    Chooses the first matching MTV and saves its FP to the dummy entry in the VTBL
******************************************************************************/
void LinkMultiTarget(ClassInfo ci, void* dummy_ptr, MultiTargetVariant[] multi_target_variants)
{
    uint extensions;
    if ( mmx )  extensions |= CPU_MMX;
    if ( sse )  extensions |= CPU_SSE;
    if ( sse2 ) extensions |= CPU_SSE2;
    if ( sse3 ) extensions |= CPU_SSE3;

    foreach ( i, inout vp; ci.vtbl )
    {
        if ( vp is dummy_ptr )
        {
            foreach ( variant; multi_target_variants )
            {
                if ( (variant.ext & extensions) == variant.ext )
                {
                    vp = variant.func;
                    break;
                }
            }
            assert(vp !is dummy_ptr);
            break;
        }
    }
}

//-----------------------------------------------------------------------------
// This is application code

/******************************************************************************
    A class with a multi-target function
******************************************************************************/
class MyMultiTargetClass
{
    // The following 3 functions could be generated automatically by the compiler
    // with different targets enabled. For example, when we have language support for
    // vector operations, the compiler could generate multiple versions for different
    // SIMD extensions. Then there would be only one extension-independent implementation.
    char[] multi_target_sse2()    { return "using SSE2"; }
    char[] multi_target_sse_mmx() { return "using SSE and MMX"; }
    char[] multi_target_noext()   { return "using no extension"; }

    // The following code could be generated by the compiler if there are
    // multi-target functions
    char[] multi_target() { return null; }

    static this()
    {
        MultiTargetVariant[] variants = [
            MultiTargetVariant(CPU_SSE2,         &multi_target_sse2),
            MultiTargetVariant(CPU_SSE|CPU_MMX,  &multi_target_sse_mmx),
            MultiTargetVariant(CPU_NO_EXTENSION, &multi_target_noext)
        ];
        LinkMultiTarget(this.classinfo, &multi_target, variants);
    }
}

/******************************************************************************
    Finally, the usage is completely opaque and there is no runtime overhead
    besides the detection at startup.
******************************************************************************/
void main()
{
    MyMultiTargetClass t = new MyMultiTargetClass;
    writefln("%s", t.multi_target);
}
May 01 2007
Jascha Wetzel wrote:A thought that came up in the VM discussion... Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code.On a totally unrelated note we are using GDC to build Universal Binaries for Mac OS X, that is: objects with both i386 (=i686) and ppc (=powerpc) code. They are however twice as big as when building for just one arch. $ file hello hello: Mach-O universal binary with 2 architectures hello (for architecture ppc): Mach-O executable ppc hello (for architecture i386): Mach-O executable i386 The GCC driver automatically runs two compilation steps and lipos them, so it's pretty straight-forward to use (unrelated to vector ops, though) gdc -isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch ppc -arch i386 ... It only does one variant for each architecture, so no help for a "JIT". --anders
May 01 2007
I've seen some games with multiple executables compiled for different architectures. Since the executable size is dwarfed by resources, this is no problem for these kinds of applications. How much of a negative impact would your suggested approach have on compiler optimizations? (inlining and that sort of thing) Another thing, what are the benefits of the compiler doing this over libraries? On a related note it may be worth mentioning liboil which implements exactly this in a library: http://liboil.freedesktop.org/wiki/
May 01 2007
Lutger wrote:I've seen some games with multiple executables compiled for different architectures. Since the executable size is dwarfed by resources, this is no problem for these kind of applications.yeah, the size issue isn't that important. it's actually more about not doing anything but adding a compiler switch to get multiple versions.How much of a negative impact would your suggested approach have on compiler optimizations? (inlining and that sort of thing)the smallest unit for this approach would be a non-inlined function. any function that gets inlined within the multi-arch function would be compiled with the appropriate target as well. all intraprocedural optimizations work as usual. only optimizations that change the calling convention are affected. those have to be equal for all versions of the function because the caller never knows which version it calls. in the example where only virtual functions are supported this is not an issue, since virtual functions have that requirement anyway. for static functions this has to be ensured explicitly.Another thing, what are the benefits of the compiler doing this over libraries?using libraries means that you have to at least compile multiple versions of each library and have code that loads the appropriate version. with compiler support it's a lot more convenient and less error prone, since you do not have to write any additional code or have more complex build scripts.On a related note it may be worth mentioning liboil which implements exactly this in a library: http://liboil.freedesktop.org/wiki/yep, it has the same goal and looking at the source shows how much work it is. of course that's also because everything is manually optimized.
May 02 2007
Jascha Wetzel wrote:A thought that came up in the VM discussion... Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code. Ideally we wouldn't have to write additional code either. The compiler could emit code for multiple targets on a per-function basis (e.g. with the target architecture mangled into the function name). The runtime would check at startup which version to use and "link" the appropriate function. Here is a small proof-of-concept implementation of this detection and linking mechanism. Comments?This is fine when you have a small sub-set of target architectures, however if you want to be really optimal it needs to be optimized for the target architecture. Michael Abrash tried this for Pixomatic, however the size of the executable grew too large (it's an exponential thing because you want to avoid branching, so things must be inlined). http://www.ddj.com/184405765 http://www.ddj.com/184405807 http://www.ddj.com/184405848 I'm not saying it's not a good start, however I think the compiler would need to perform some sort of compression and optimize the function for the architecture (even the order of instructions can make a huge difference to efficiency) at startup. I guess that's kinda a JITC, however I guess it could be a load of tiny code segments that are pre-built and rearranged and added together just before build. (Kinda like Pixomatic) -Joel
May 01 2007
the granularity isn't as fine as it could be, of course. but the effort to make it happen is pretty small and it's better than compiling the whole program multiple times and switching manually. it's not a replacement for JITC or methods like Abrash's welding. janderson wrote:Jascha Wetzel wrote:A thought that came up in the VM discussion... Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code. Ideally we wouldn't have to write additional code either. The compiler could emit code for multiple targets on a per-function basis (e.g. with the target architecure mangled into the function name). The runtime would check at startup, which version will be used and "link" the appropriate function. Here is a small proof-of-concept implementation of this detection and linking mechanism. Comments?This is fine when you have a small sub-set of target architectures, however if you want to be really optimal it needs to be optimized for the target architecture. Michael Abrash tried this for Pixomatic however the size of the executable grow to large (its an exponential thing because you want to avoid branching so things must be inlined). http://www.ddj.com/184405765 http://www.ddj.com/184405807 http://www.ddj.com/184405848 I'm not saying its not a good start however I think the compiler would need to perform some sort of compression and optimize the function for the architecture (even the order of instructions can make a huge different to efficiency) at startup. I guess that's a kinda JITC however I guess it could be a load of tiny code segments that are pre-built and rearranged and added together just before build. (Kinda like pixomatic) -Joel
May 02 2007
Jascha Wetzel wrote:the granularity isn't as fine as it could be, of course. but the effort to make it happen is pretty small and it's better than compiling the whole program multiple times and switching manually. it's not a replacement for JITC or methods like Abrash's welding.I agree, it's a good start.janderson wrote:Jascha Wetzel wrote:A thought that came up in the VM discussion... Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code. Ideally we wouldn't have to write additional code either. The compiler could emit code for multiple targets on a per-function basis (e.g. with the target architecure mangled into the function name). The runtime would check at startup, which version will be used and "link" the appropriate function. Here is a small proof-of-concept implementation of this detection and linking mechanism. Comments?This is fine when you have a small sub-set of target architectures, however if you want to be really optimal it needs to be optimized for the target architecture. Michael Abrash tried this for Pixomatic however the size of the executable grow to large (its an exponential thing because you want to avoid branching so things must be inlined). http://www.ddj.com/184405765 http://www.ddj.com/184405807 http://www.ddj.com/184405848 I'm not saying its not a good start however I think the compiler would need to perform some sort of compression and optimize the function for the architecture (even the order of instructions can make a huge different to efficiency) at startup. I guess that's a kinda JITC however I guess it could be a load of tiny code segments that are pre-built and rearranged and added together just before build. (Kinda like pixomatic) -Joel
May 02 2007
here is a much simpler version that works with templates. what it boils down to is choosing one template instance at startup that will replace a function pointer. now the only compiler support required would be a pragma or similar to select the target architecture. this could also be used to manage multiple versions of BLADE code.
May 02 2007
Jascha Wetzel wrote:here is a much simpler version that works with templates. what it boils down to is choosing one template instance at startup that will replace a function pointer. now the only compiler support required would be a pragma or similar to select the target architecture.A pragma would only be required as a size optimisation. Probably not worth worrying about (We have enough version information already).this could also be used to manage multiple versions of BLADE code.It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!) Interestingly DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX. When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the fixup list. That way, you even get a direct function call, instead of an indirect one. I wonder if it's possible to pop ESP off the stack, and write back into the code that called you, without the operating system triggering a security alert -- in that case, the function you call could be a little thunk, something like: asm { naked; mov eax, CPU_TYPE; mov eax, FUNCPOINTERS[eax]; mov ecx, [esp-4]; // get the return address mov [ecx-4], eax; // patch the call address, so this thunk never gets called again. jmp [eax]; } But I think a modern OS would go nuts if you try this? (It's been a long time since I wrote self modifying code).
May 02 2007
Don Clugston wrote:Jascha Wetzel wrote:That may be the case. Also if the code is only called once, it would cause a huge cache miss that would last for many nanoseconds. If this happens a lot, the code would keep spiking all over the place (for the first few seconds of the app and then when you hit code that hasn't been used before). A better approach would be to figure them out in large batches, perhaps at the module level. That way you get fewer cache misses. Nice idea though. -Joelhere is a much simpler version that works with templates. what it boils down to is choosing one template instance at startup that will replace a function pointer. now the only compiler support required would be a pragma or similar to select the target architecture.A pragma would only be required as a size optimisation. Probably not worth worrying about (We have enough version information already).this could also be used to manage multiple versions of BLADE code.It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!) Interestingly DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX. When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the fixup list. That way, you even get a direct function call, instead of an indirect one. I wonder if it's possible to pop ESP off the stack, and write back into the code that called you, without the operating system triggering a security alert -- in that case, the function you call could be a little thunk, something like: asm { naked; mov eax, CPU_TYPE; mov eax, FUNCPOINTERS[eax]; mov ecx, [esp-4]; // get the return address mov [ecx-4], eax; // patch the call address, so this thunk never gets called again. jmp [eax]; } But I think a modern OS would go nuts if you try this? 
(It's been a long time since I wrote self modifying code).
May 02 2007
Don Clugston wrote:Jascha Wetzel wrote:I was thinking about this. What would be nice is if D had reflection annotations/attributes to flag methods and functions, rather than kludging more information into the symbol name. pragma(attr,CPUOptionFixup(CPU.SSE2,&myFunction)) void myFunction_SSE2(){ /* do something with SSE2 */ } void myFunction(){ /* use vanilla code here */ } During or after link time, you just walk the reflection metadata and patch things up as appropriate. Now DDL already has a metadata capability via .ddl wrapper support - your imagination is the limit on how that is done post-build (a D front-end comes to mind). Once I get this next revision of DDL out, it should be possible to publish DDL metadata directly from a module via a static hashmap, instead of relying on a post-build process. Either way, it's just a matter of walking that metadata as it's exposed from each DynamicModule and DynamicLibrary, and patching the symbol tables during runtime linking.here is a much simpler version that works with templates. what it boils down to is choosing one template instance at startup that will replace a function pointer. now the only compiler support required would be a pragma or similar to select the target architecture.A pragma would only be required as a size optimisation. Probably not worth worrying about (We have enough version information already).this could also be used to manage multiple versions of BLADE code.It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!) Interestingly DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX. When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the fixup list. 
That way, you even get a direct function call, instead of an indirect one.I wonder if it's possible to pop ESP off the stack, and write back into the code that called you, without the operating system triggering a security alert -- in that case, the function you call could be a little thunk, something like: asm { naked; mov eax, CPU_TYPE; mov eax, FUNCPOINTERS[eax]; mov ecx, [esp-4]; // get the return address mov [ecx-4], eax; // patch the call address, so this thunk never gets called again. jmp [eax]; } But I think a modern OS would go nuts if you try this? (It's been a long time since I wrote self modifying code).If I'm not mistaken, this should be doable thanks to D adopting a *very* flat memory model (at least on win32). All the segment registers have the same base address in memory. So just as long as you read/write against ES/DS/FS/GS/SS and read/call against CS, you should be good to go. IIRC, Windows does provide some stronger code-segment write protection (I forget what it was actually called), but it has to be enabled explicitly. At some point in the future, Don, I'd like to pick your brain about using trampolines like this for DDL. I'd like to see cross-OS binaries become possible by thunking the exception-handling mechanisms between *nix and Win32 at link time, but I'm not sure how to pull that off just yet. -- - EricAnderton at yahoo
May 02 2007
It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!)i'm not sure what you mean. i thought of something like this:

void foo(uint arch)()
{
    auto p = Vec!(arch)([3.5, 1.1, 3.8]);
    auto r = Vec!(arch)([17.0f, 28.25, 1]);
    p *= dot(p,r);
}

the template parameter to Vec could choose the target used by BLADE (x87 or SSE for example). the result is a class with multiple instances of foo (since all desired instances appear in the static c'tor). the static c'tor chooses one of the instances (depending on hardware availability or benchmarks) and copies its address to the init-data in the classinfo. every time the class is instantiated, the function-pointer will automatically be initialized with the chosen pointer - no self-modifying code necessary. instead of changing the init-data, we can also modify the VTBL in the classinfo (that's what the first version of this example did). Don Clugston wrote:Jascha Wetzel wrote:here is a much simpler version that works with templates. what it boils down to is choosing one template instance at startup that will replace a function pointer. now the only compiler support required would be a pragma or similar to select the target architecture.A pragma would only be required as a size optimisation. Probably not worth worrying about (We have enough version information already).this could also be used to manage multiple versions of BLADE code.It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!) Interestingly DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX. When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the fixup list. 
That way, you even get a direct function call, instead of an indirect one. I wonder if it's possible to pop ESP off the stack, and write back into the code that called you, without the operating system triggering a security alert -- in that case, the function you call could be a little thunk, something like: asm { naked; mov eax, CPU_TYPE; mov eax, FUNCPOINTERS[eax]; mov ecx, [esp-4]; // get the return address mov [ecx-4], eax; // patch the call address, so this thunk never gets called again. jmp [eax]; } But I think a modern OS would go nuts if you try this? (It's been a long time since I wrote self modifying code).
May 02 2007