digitalmars.D - Test for array literal arguments?
- bearophile (491/491) Jun 05 2012 Warning: this post contains some partially uncooked ideas.
- Peter Alexander (25/31) Jun 05 2012 I was also thinking about this idea today. I was writing a small
- bearophile (20/30) Jun 05 2012 The general solution is named "Partial compilation", it's a mess
- Peter Alexander (29/36) Jun 06 2012 Thanks for the link. The paper is interesting.
Warning: this post contains some partially uncooked ideas.
I hope Walter will read this post :-)
First of all the little problem. I'd like:
[1, 3, 7].canFind(x)
To be compiled with an in-lined:
x == 1 || x == 3 || x == 7
(A very optimizing D back-end is able to inline the array
creation, see that the array length is known at compile-time, to
unroll the search loop according to that length, doing what I am
asking here. Maybe LDC2 with link-time optimization is able to do
it. DMD is not able to do it. Having a very opitimizing back-end
is good, but languages like C and Go (and Java as a
counter-example) show that not _requiring_ a very optimizing
back-end is good for a language).
------------------------------
This is a starting point for the discussion, it's a modified and
simplified implementation of canFind() that also contains an
optimization for short fixed-sized arrays:
import std.stdio: writeln;
import std.traits: ForeachType, isStaticArray;
import std.string: xformat, join;
bool canFind(Range, T)(Range items, T item)
if (is(ForeachType!Range == T)) {
static if (isStaticArray!Range && Range.length < 5) {
static if (Range.length == 0) {
return false;
} else {
static string genEq(string seqName, string itemName, int
len)
/*pure nothrow*/ {
string[] result;
foreach (i; 0 .. len)
result ~= xformat("%s[%d] == %s", seqName, i, itemName);
return result.join(" || ");
}
return mixin(genEq("items", "item", Range.length));
}
} else {
foreach (x; items)
if (x == item)
return true;
return false;
}
}
int main() {
int x = 3; // run-time value
int[3] a = [1, 3, 7];
return a.canFind(x);
//assert([1, 3, 7].canFind(x));
}
The asm of the relevant functions (dmd 2.060alpha, 32 bit, -O
-release -inline):
_D4test18__T7canFindTG3iTiZ7canFindFG3iiZb13genExpressionFAyaAyaiZAya
L0: sub ESP,0Ch
push EBX
xor EBX,EBX
push ESI
mov ESI,EAX
test ESI,ESI
mov dword ptr 8[ESP],0
mov dword ptr 0Ch[ESP],0
jle L59
L1D: push dword ptr FLAT:_DATA[014h]
push dword ptr FLAT:_DATA[010h]
push dword ptr 02Ch[ESP]
push dword ptr 02Ch[ESP]
push EBX
push dword ptr 030h[ESP]
push dword ptr 030h[ESP]
call near ptr
_D3std6string24__T7xformatTaTAyaTiTAyaZ7xformatFxAaAyaiAyaZAya
push EDX
mov EDX,offset FLAT:_D13TypeInfo_AAya6__initZ
push EAX
lea ECX,010h[ESP]
push ECX
push EDX
call near ptr __d_arrayappendcT
inc EBX
add ESP,010h
cmp EBX,ESI
jl L1D
L59: push dword ptr 0Ch[ESP]
push dword ptr 0Ch[ESP]
push dword ptr FLAT:_DATA[024h]
push dword ptr FLAT:_DATA[020h]
call near ptr
_D3std5array22__T8joinImplTAAyaTAyaZ8joinImplFAAyaAyaZAya
pop ESI
pop EBX
add ESP,0Ch
ret 010h
_D4test18__T7canFindTG3iTiZ7canFindFG3iiZb
mov EDX,EAX
cmp 4[ESP],EDX
je L18
cmp 8[ESP],EDX
je L18
cmp 0Ch[ESP],EDX
je L18
xor EAX,EAX
jmp short L1D
L18: mov EAX,1
L1D: ret 0Ch
__Dmain comdat
L0: sub ESP,0Ch
mov EAX,offset FLAT:_D12TypeInfo_xAi6__initZ
push EBX
push ESI
push 0Ch
push 3
push EAX
call near ptr __d_arrayliteralTX
add ESP,8
mov EBX,EAX
mov dword ptr [EAX],1
mov ECX,3
mov EDX,EBX
push EBX
lea ESI,010h[ESP]
mov 4[EBX],ECX
mov dword ptr 8[EBX],7
push ESI
call near ptr _memcpy
add ESP,0Ch
mov EAX,3
push dword ptr 010h[ESP]
push dword ptr 010h[ESP]
push dword ptr 010h[ESP]
call near ptr
_D4test18__T7canFindTG3iTiZ7canFindFG3iiZb
and EAX,0FFh
pop ESI
pop EBX
add ESP,0Ch
ret
That asm shows two problems, that I can't fully explain:
- The very existance of a canFind instance with an inline xformat
call (!);
- The missed inlining of the canFind instance that performs just
the 3 equals (_D4test18__T7canFindTG3iTiZ7canFindFG3iiZb ).
But for this discussion I don't care about those two problems.
I'd like this too to be computed using the mixin(genEq):
int main() {
int x = 3; // run-time value
assert([1, 3, 7].canFind(x));
}
That doesn't happen because [1, 3, 7] is a int[] literal, so
inside canFind() isStaticArray!Range is false.
On the other hand currently this works, because despite [1, 3, 7]
being a int[] literal, is also implicitly castable to int[3]:
import std.string: xformat, join;
bool canFind(int[3] items, int item) {
static string genEq(string seqName, string itemName, int len) {
string[] result;
foreach (i; 0 .. len)
result ~= xformat("%s[%d] == %s", seqName, i, itemName);
return result.join(" || ");
}
return mixin(genEq("items", "item", items.length));
}
int main() {
int x = 3; // run-time value
return [1, 3, 7].canFind(x);
}
But this doesn't compile:
import std.string: xformat, join;
bool canFind(size_t N)(int[N] items, int item) {
static string genEq(string seqName, string itemName, int len) {
string[] result;
foreach (i; 0 .. len)
result ~= xformat("%s[%d] == %s", seqName, i, itemName);
return result.join(" || ");
}
return mixin(genEq("items", "item", items.length));
}
int main() {
int x = 3; // run-time value
return [1, 3, 7].canFind(x);
}
----------------------------
So is it possible to modify D to be able to create something like
this that returns true conservatively (so it returns false when
it's an unknown type of array) if a template function argument is
an array literal (or enum of array)?
__traits(is_array_literal, Range)
That probably doesn't suffice, but it's an idea.
I'd really like to know if an argument is a array/string literal,
and in such case be able to see it as a fixed-sized one if I
choose so.
An alternative is just to allow to write two template functions
like this (present at the same time):
bool canFind(size_t N, T)( literal T[N] items, T item) if (N < 5)
{
static if (N == 0) {
return false;
} else {
static string genEq(string seqName, string itemName, int len)
{
string[] result;
foreach (i; 0 .. len)
result ~= xformat("%s[%d] == %s", seqName, i, itemName);
return result.join(" || ");
}
return mixin(genEq("items", "item", items.length));
}
}
bool canFind(Range)(Range items, T item)
if (is(ForeachType!Range == T)) {
foreach (x; items)
if (x == item)
return true;
return false;
}
Where " literal T[N] items" means that canFind function template
overload is used even when you call it with a (dynamic) array
literal (if N isn't too much high, according to its template
constraint).
Suggestions for better ideas are welcome.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This is the second part of this post, about a related topic.
A related more general desire is to know if a template function
argument is known at compile-time, and in this case being able to
use it at compile time, if I want so (in this case the D compiler
instantiates a new template instance of the function, of course).
Something like this for me is very good because it would allow D
programmers to do a many things that today are not possible, like
running some test at compile-time on the inputs, optimizing the
function on the base of some arguments (a manually implemented
poor's man partial compilation. I think this is often enough),
and more.
The classic function used to explain partial compilation is the
integer exponent power function (I use printf() to get a much
simpler and shorter asm):
import core.stdc.stdio: printf;
T iPow(T)(T base, size_t exp) {
T result = 1;
for (; exp > 0; exp >>= 1) {
if (exp & 1)
result *= base;
base *= base;
}
return result;
}
void main() {
printf("%d", iPow(5, 3));
}
Its asm:
_D5test011__T4iPowTiZ4iPowFikZi comdat
push EBX
mov ECX,EAX
mov EDX,8[ESP]
push ESI
mov EBX,1
test ECX,ECX
je L2D
L11: test ECX,1
je L20
mov EAX,EDX
imul EAX,EBX
mov EBX,EAX
L20: mov ESI,EDX
imul ESI,EDX
shr ECX,1
mov EDX,ESI
test ECX,ECX
jne L11
L2D: pop ESI
mov EAX,EBX
pop EBX
ret 4
__Dmain comdat
L0: push EAX
mov EAX,3
push 5
call near ptr _D5test011__T4iPowTiZ4iPowFikZi
mov ECX,offset FLAT:_DATA
push EAX
push ECX
call near ptr _printf
add ESP,8
xor EAX,EAX
pop ECX
ret
Often the exp is a literal or it's known at compile-time. In this
case in D you are free to use a different function template:
import core.stdc.stdio: printf;
T iPow(size_t exp, T)(T base) {
T result = 1;
for (int e = exp; e > 0; e >>= 1) {
if (e & 1)
result *= base;
base *= base;
}
return result;
}
void main() {
printf("%d", iPow!(3)(5));
}
Its asm:
_D4test14__T4iPowVi3TiZ4iPowFiZi
push EBX
mov EDX,EAX
mov EBX,1
push ESI
mov ECX,3
LE: test ECX,1
je L1D
mov EAX,EDX
imul EAX,EBX
mov EBX,EAX
L1D: mov ESI,EDX
imul ESI,EDX
sar ECX,1
mov EDX,ESI
test ECX,ECX
jg LE
pop ESI
mov EAX,EBX
pop EBX
ret
__Dmain comdat
L0: push EAX
mov EAX,5
call near ptr _D4test14__T4iPowVi3TiZ4iPowFiZi
mov ECX,offset FLAT:_DATA
push EAX
push ECX
call near ptr _printf
add ESP,8
xor EAX,EAX
pop ECX
ret
With DMD to see an improvement you need a static foreach, now
it's fully inlined:
import core.stdc.stdio: printf;
import std.typetuple: TypeTuple;
private template _iPowHelper(size_t exp) {
static if (exp == 0)
alias TypeTuple!() _iPowHelper;
else
alias TypeTuple!(exp, _iPowHelper!(exp >> 1)) _iPowHelper;
}
T iPow(size_t exp, T)(T base) {
// Unfortunately I can't nest template _iPowHelper here.
T result = 1;
// Optionally add a static if here to not use _iPowHelper
// when exp is too much large
/*static*/ foreach (e; _iPowHelper!exp) {
if (e & 1)
result *= base;
base *= base;
}
return result;
}
void main() {
printf("%lf", iPow!(3)(5));
}
Its asm:
_D5test214__T4iPowVi3TiZ4iPowFiZi
mov ECX,EAX
mov EDX,EAX
imul EAX,ECX
mov ECX,EAX
imul EAX,EDX
mov EDX,EAX
ret
__Dmain comdat
L0: push EAX
mov EAX,offset FLAT:_DATA
push 07Dh
push EAX
call near ptr _printf
add ESP,8
xor EAX,EAX
pop ECX
ret
This variant of the last program shows that the code is really
working as desired:
import core.stdc.stdio: printf;
import std.typetuple: TypeTuple;
private template _iPowHelper(size_t exp) {
static if (exp == 0)
alias TypeTuple!() _iPowHelper;
else
alias TypeTuple!(exp, _iPowHelper!(exp >> 1)) _iPowHelper;
}
T iPow(size_t exp, T)(T base) {
// Unfortunately I can't nest template _iPowHelper here.
T result = 1;
// Optionally add a static if here to not use _iPowHelper
// when exp is too much large
/*static*/ foreach (e; _iPowHelper!exp) {
if (e & 1)
result *= base;
base *= base;
}
return result;
}
void main(string[] args) {
double x = args.length + 4.0;
printf("%lf", iPow!(3)(x));
}
Its asm, the main shows no calls to iPow, no loops and just two
fmul:
_D6test2d14__T4iPowVi3TdZ4iPowFdZd comdat
sub ESP,0Ch
fld qword ptr 010h[ESP]
fld qword ptr 010h[ESP]
fmul qword ptr 010h[ESP]
fxch ST1
fstp qword ptr [ESP]
fstp qword ptr 010h[ESP]
fld qword ptr 010h[ESP]
fmul qword ptr [ESP]
fstp qword ptr [ESP]
fld qword ptr [ESP]
add ESP,0Ch
ret 8
__Dmain comdat
L0: sub ESP,024h
mov EAX,028h[ESP]
xor ECX,ECX
mov [ESP],EAX
mov EDX,offset FLAT:_DATA[010h]
mov 4[ESP],ECX
fild long64 ptr [ESP]
fadd qword ptr FLAT:_DATA[00h]
fst qword ptr 8[ESP]
fld qword ptr 8[ESP]
fld qword ptr 8[ESP]
fxch ST2
fstp qword ptr 010h[ESP]
fstp qword ptr 018h[ESP]
fmul qword ptr 010h[ESP]
fstp qword ptr 010h[ESP]
fld qword ptr 010h[ESP]
fmul qword ptr 018h[ESP]
fstp qword ptr 018h[ESP]
fld qword ptr 018h[ESP]
sub ESP,8
fstp qword ptr [ESP]
push EDX
call near ptr _printf
add ESP,0Ch
add ESP,024h
xor EAX,EAX
ret
So we are back to an idea Walter has tried and refused few years
ago, of compile-time known arguments with "static". The small
difference here is (I think) that both iPow templates are allowed
to exist in the code at the same time, and the iPow overload with
"static" is preferred by the argument is known at compile-time:
import core.stdc.stdio: printf;
import std.typetuple: TypeTuple;
private template _iPowHelper(size_t exp) {
static if (exp == 0)
alias TypeTuple!() _iPowHelper;
else
alias TypeTuple!(exp, _iPowHelper!(exp >> 1)) _iPowHelper;
}
T iPow(T)(T base, static size_t exp) {
// Unfortunately I can't nest template _iPowHelper here.
T result = 1;
/*static*/ foreach (e; _iPowHelper!exp) {
if (e & 1)
result *= base;
base *= base;
}
return result;
}
T iPow(T)(T base, size_t exp) {
T result = 1;
for (; exp > 0; exp >>= 1) {
if (exp & 1)
result *= base;
base *= base;
}
return result;
}
void main() {
printf("%d", iPow(5, 3)); // calls first iPow overload!
}
Note that this is not what I was asking for in the first half of
this post, because I didn't want to know the contents of "items"
at compile-time, only its length (and currently arrays can't be
template arguments, despite strings can, even dstrings, that
aren't very different from a uint[], it's a small mystery):
bool canFind(T)(static T[] items, T item) {...}
Bye,
bearophile
Jun 05 2012
On Tuesday, 5 June 2012 at 23:48:48 UTC, bearophile wrote:So we are back to an idea Walter has tried and refused few years ago, of compile-time known arguments with "static". The small difference here is (I think) that both iPow templates are allowed to exist in the code at the same time, and the iPow overload with "static" is preferred by the argument is known at compile-time:I was also thinking about this idea today. I was writing a small math function (Van der Corput sequence generator) with the signature vdc(int n, int b) and noticed that the code could be faster when b == 2 (the most common case) because divides can be turned into shifts and mods turned to bitwise AND etc. You could duplicate the function to take the static args as template args, but that's ugly. I also came to the conclusion of using 'static' as a parameter "storage class" to specify that the parameter is known at compile-time. One problem with this approach is that it only solves some cases and cannot work in general. It also has other implications: - Code with 1 or more optimised versions will require extra maintenance/testing. - Makes code more difficult to reason about (can be difficult to tell which is version is called). - Adds more rules for overload resolution. However, the biggest problem with this proposal (in my opinion) is that it is unnecessary. I care deeply about performance, but tiny optimisations like this are simply not important 99% of the time. When they are important, just write a specific optimised version and use that. Yes, you lose generality, but special needs call for special cases. Let's not complicate the language and bloat the codebase further for questionable gain.
Jun 05 2012
Peter Alexander:One problem with this approach is that it only solves some cases and cannot work in general.The general solution is named "Partial compilation", it's a mess and probably you don't want it in the DMD compiler (despite it seems LLVM is getting able to do it a bit). Yet lot of people are studying partial compilation for 20+ years or more, because it's very interesting and potentially useful.- Adds more rules for overload resolution.This needs to be studied. But keep in mind that Walter has already tried and refused that idea of "static" arguments. So you can't assume it's an easy thing to implement. Here we are discussing just about the second part of my post. The title of my post refers to just the first half of it.However, the biggest problem with this proposal (in my opinion) is that it is unnecessary. I care deeply about performance, but tiny optimisations like this are simply not important 99% of the time. When they are important, just write a specific optimised version and use that. Yes, you lose generality, but special needs call for special cases. Let's not complicate the language and bloat the codebase further for questionable gain.Writing specialized versions without any language help is not nice, and I think the gain is significant, it's not just tiny optimizations. My D programs contain lot of stuff known at compile-time. I think such simple poor's man hand-made version of partial compilation is able to do things like (done by true partial compilation): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.5469&rep=rep1&type=pdf Bye, bearophile
Jun 05 2012
On Wednesday, 6 June 2012 at 01:44:24 UTC, bearophile wrote:Writing specialized versions without any language help is not nice, and I think the gain is significant, it's not just tiny optimizations. My D programs contain lot of stuff known at compile-time. I think such simple poor's man hand-made version of partial compilation is able to do things like (done by true partial compilation): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.5469&rep=rep1&type=pdfThanks for the link. The paper is interesting. One interesting point to note is that they bloat the executable massively by specialising (30x larger in a few cases). Fortunately for them, their program is small enough to fit in the I-cache anyway, so it has no effect on performance, however, in a large program that size blowup would cause much more I-cache capacity misses, which would then negatively affect runtime. This effect has already been seen in the games industry with std::sort. As you probably know, std::sort specialises the sort routine for the types being sorted, so you end up with N specialised sort routines for N different types and each one adds a few KB to the executable. If I want to sort a small array (maybe just 8 integers or so) then using std::sort is often worse than qsort because std::sort takes up more of the I-cache. It is now common in games to use qsort (or a small type-safe equivalent) instead of std::sort for small sorts to reduce bloat and I-cache pollution. See slide 14-15 of this DICE presentation: http://www.slideshare.net/DICEStudio/executable-bloat-how-it-happens-and-how-we-can-ght-it I'm sure your D programs do contain a lot of stuff known at compile-time, but I'm also sure that in any non-trivial program, 95% of your code is not performance sensitive and would be better to be small than fast. The sample ray-tracer is not representative of a real program. I hold my position that this would be counterproductive 95% of the time. In the 5% that is highly performance sensitive, we can use metaprogramming techniques (in D, or using external codegen) to make a workable solution.
Jun 06 2012








"Peter Alexander" <peter.alexander.au gmail.com>