digitalmars.D - dmd command line scripting experiments and observations

reply Witold Baryluk <witold.baryluk gmail.com> writes:
For a very long time I have been using bash, grep, sed, awk, 
usual suspects on Unix, as they are super quick to type, 
incremental, etc. Once complexity is too big I usually switch to 
Python (decades ago it might have been Perl or PHP).

I often will embed small snippets of grep or awk in some other 
tools that just need to do something with some text files. For 
example do some pre-processing for plotting in Gnuplot.

I even wrote my own line-column processing "language", called 
`kolumny`, over a decade ago, to help with similar tasks. And 
while it does work well, I rarely use it (once a year these days, 
sadly), because it is not really a full language.

Yesterday I needed to do some simple processing before 
plotting in gnuplot:


```gnuplot
set ylabel "locking rate [M/s]"
plot "<grep ^mx1 foo.txt" using 3:($3*$4/$9/1e6) title "RWMutex", \
      "<grep ^mx2 foo.txt" using 3:($3*$4/$9/1e6) title "drwMutex"

```

where a file `foo.txt` has things like this:

```
mx1 32 1 10000000 0.0001 1 100 100 0.552091302 552.091302ms
mx1 32 1 10000000 0.0001 1 100 100 0.552518653 552.518653ms
mx1 32 1 10000000 0.0001 1 100 100 0.562133796 562.133796ms
...
mx2 32 1 10000000 0.0001 1 100 100 0.613519317 613.519317ms
mx2 32 1 10000000 0.0001 1 100 100 0.602255619 602.255619ms
...
mx1 32 2 10000000 0.0001 1 100 100 1.489152483 1.489152483s
mx1 32 2 10000000 0.0001 1 100 100 1.469110205 1.469110205s
...

...
mx2 32 64 10000000 0.0001 1 100 100 8.84282034 8.84282034s
```

Ok, so my gnuplot script works, but now I have a lot of points 
for each x.

I would like to take the max throughput (lowest time, in column 9), 
and only use that. Or maybe the median. (Definitely not the average 
tho).

And I didn't feel like doing this in `awk`.

So I started exploring rdmd a bit:

First or second attempt:

`rdmd --eval='float[][int] g; foreach (line; 
stdin.byLine.filter!(x=>x.matchFirst("^mx1"))) { auto a = 
line.split; auto c=a[2].to!int; auto rate=c * a[3].to!float / 
a[8].to!float; g[c] ~= rate; } foreach (c, values; g) { 
writeln(c, " ", values.reduce!max); }' < foo.txt`

Removed redundant `()` to make code shorter.

```
45 2.29471e+07
26 2.25617e+07
52 2.26505e+07
43 2.30352e+07
17 2.32184e+07
34 2.33697e+07
60 2.26649e+07
61 2.25918e+07
...
```

Ok. "Works"

That is not good for a few reasons.

1) Still kind of long.
2) Cannot easily be embedded into a gnuplot script, because it uses 
both `'` and `"`.
3) I group by `c` using a map (associative array), which 
means the output is unordered when printed. If I switch to plotting 
with lines instead of the default points, I want ascending order, 
otherwise the plot will be a chaos of lines. This could be fixed by 
piping the output to `sort -n -k 1`, but that a) is less efficient, b) 
makes things even longer. The obvious way would be to remember the 
previous `c` and aggregate on the fly: faster, ordered by design 
(because the input is ordered), and less memory usage.
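
As a point of comparison, the streaming aggregation from point 3 can be sketched in Python (the language the `dm` wrapper at the end of this post is written in) rather than D. This is only a sketch: it assumes that rows with the same value in column 3 are adjacent in the input, and `best_rates` and its `stat` parameter are names made up for this illustration.

```python
import itertools
import statistics


def best_rates(lines, prefix="mx1", stat=max):
    """Yield (c, stat(rates)) for each run of adjacent rows sharing column 3.

    Assumes rows with the same column 3 are contiguous, so only one group is
    held in memory at a time and the output order follows the input order.
    """
    rows = (line.split() for line in lines if line.startswith(prefix))
    for c, group in itertools.groupby(rows, key=lambda a: int(a[2])):
        # rate = c * column 4 / column 9 (1-based, as in the gnuplot script)
        rates = [c * float(a[3]) / float(a[8]) for a in group]
        yield c, stat(rates)


# Made-up sample rows in the same layout as foo.txt.
sample = [
    "mx1 32 1 10000000 0.0001 1 100 100 0.5 500ms",
    "mx1 32 1 10000000 0.0001 1 100 100 0.25 250ms",
    "mx2 32 1 10000000 0.0001 1 100 100 0.5 500ms",
    "mx1 32 2 10000000 0.0001 1 100 100 1.0 1s",
]
for c, rate in best_rates(sample):
    print(c, rate)
```

Passing `stat=statistics.median` switches from the maximum to the median without touching the loop. Because the last group is flushed by `groupby` itself, no extra print is needed after the loop.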

Next attempt (not fully correct), trying to rectify a few things 
incrementally, not shooting for the perfect solution yet, just 
exploring a bit more:

`rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; 
stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto 
c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if 
(prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} 
prev_c=c;max_rate=max(max_rate,rate);}' < foo.txt`


```
0 0
1 1.81999e+07
2 1.3897e+07
3 1.68113e+07
4 1.77501e+07
5 1.77466e+07
6 2.00162e+07
7 2.00754e+07
8 2.24083e+07
9 2.43998e+07
...
63 2.24421e+07
```

Some progress, but not quite there (obviously). We do not output 
a line for 64, because the check for `prev_c != c` is only inside the 
loop; we need another `writeln` after the loop.

Let's fix this then.

`rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; 
stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto 
c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if 
(prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} 
prev_c=c;max_rate=max(max_rate,rate);} writeln(prev_c, `` ``, 
max_rate); max_rate=0;' < foo.txt`


A bit hairy, but it does the job. (It still prints the initial `0 0` 
line, but that is easy to fix with something like 
`if (prev_c != c && prev_c)`.)

Let's reimplement it in awk, for an unfair comparison:

`awk 'BEGIN{prev_c = 0; max_rate=0.0;} /^mx1/{ c=$3; 
rate=c*$4/$9; if (prev_c != c) { print prev_c, max_rate; 
max_rate=0;} prev_c=c;if(rate>max_rate)max_rate=rate;} END{print 
prev_c,  max_rate;}' < foo.txt`


Quite a bit shorter.

There are things that would be hard to do in D, but are still possible.

`auto x = ...`, replace with `x:=...` (like in Go). This could be 
done with a simple preprocessor (even just a `sed -E -e 
's/([a-zA-Z0-9_]+) *:=/auto \1=/g'` before passing to `gdmd`).
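
A minimal sketch of that preprocessing step in Python, using the same idea as the `sed` expression. The function name `expand_short_decls` is hypothetical, and the identifier pattern is the slightly stricter one (no leading digit) that the `dm` script at the end of this post uses.

```python
import re

# Identifier, optional spaces, then ":=".
DECL = re.compile(r"([A-Za-z_][A-Za-z0-9_]*) *:=")


def expand_short_decls(code):
    """Rewrite Go-style `x := expr` into D's `auto x = expr`."""
    return DECL.sub(r"auto \1 =", code)


print(expand_short_decls("prev_c := 0; c := a[2];"))

# A purely textual rewrite cannot tell code from string literals, so this
# literal gets rewritten too. That is one limitation of the regexp approach.
print(expand_short_decls('writeln("a := b");'))
```

For one-liners this is probably acceptable, but anything that needs to leave string literals or comments alone would need a real tokenizer.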

`/regexp/{}  /regexp/{}`, and `foreach (a......)`, replace with 
an abstraction that does this for us.

Should be possible to implement, probably with API like this:

```d
each(   // implicitly on stdin.byLine()
     "^mx1",
     (a, m) => {    // a is just line split on whitespaces,
                    // m is regexp match groups (optional)
        c := a[2].to!int;
        ...
     },
     ...,  // more matchers.
     ...,  // All matching matchers are executed in order, not just the first one.
     ...,  // A delegate with no preceding matcher is equivalent to ".*" matching.
     ...);

```

We can accept both `void` delegates, or ones returning `int`, 
i.e. if we want to do something like loop `break`. But in 
scripting, instead of `break` in main loop, you will usually just 
exit whole script. So not super useful. (`continue` works by just 
returning from void delegate, so not a concern).

A more advanced `each` could allow multiple predicates, multiple 
regexps, and possibly some conditions (`&&`, `||`). One could invent a 
mini DSL for this, or use operator overloading (maybe, 
as not all operators are overloadable in D; e.g. overloading 
comparison operators is very problematic in D: it was possible in 
D1, but not in D2).

We can also add the original full line (unsplit) as the first 
element of `a`, so `a[0]` is just like awk `$0` (whole line), 
and `a[1]` is just like `$1` (columns, with the first one being `$1`).
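
To make the intended semantics concrete, here is a rough Python sketch of such an `each` (not D, and not the prototype from the `dm` script). The argument convention, the `lines` keyword and the use of `None` for the match of an unguarded handler are all assumptions made for this illustration.

```python
import re
import sys


def each(*args, lines=None):
    """Run handlers over input lines, awk-style.

    Arguments are read left to right: a string is a regexp guarding the next
    callable; a callable with no preceding string runs on every line. Every
    matching handler runs, in order. A handler receives `a`, where a[0] is
    the whole line and a[1:] are the whitespace-separated columns, and the
    match object `m` (None for unguarded handlers).
    """
    rules, pending = [], None
    for arg in args:
        if callable(arg):
            rules.append((re.compile(pending) if pending else None, arg))
            pending = None
        else:
            pending = arg
    for line in (sys.stdin if lines is None else lines):
        line = line.rstrip("\n")
        a = [line] + line.split()
        for regex, handler in rules:
            m = regex.search(line) if regex else None
            if regex is None or m:
                handler(a, m)


seen = []
each(
    "^mx1", lambda a, m: seen.append(("mx1", a[3])),
    lambda a, m: seen.append(("any", a[0])),
    lines=["mx1 32 1 10", "mx2 32 2 10"],
)
print(seen)
```

Here the first line triggers both handlers and the second line only the unguarded one, which is the "all matching matchers run, in order" behaviour described above.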

Note: We do not want to put this `each` implicitly into a runner 
script, because often we want to do things before it. This could 
be done with something like `--begin`, and `--end`, but is more 
verbose. Plus `--begin` and `--end`, would make it harder to port 
command line code to file based script.


On the other front, `to!int`, we can do better too. One option is to 
provide short helper functions for common type conversions like 
`to!int` and `to!float`.

So instead of:

```d
        c := a[2].to!int;
```

we do

```d
        c := a[2].INT;
        rate:=c * a[3].F32 / a[8].F32;
```

OK, how about making `each` smarter: instead of just splitting the 
input line into columns of strings (`string[]`), it puts 
each value into a custom library type that provides dynamic 
typing. Something like `DynamicTypeValue[]`, with operator 
overloading for arithmetic and comparison, and a `toString` function.

```d
        c := a[2];
        rate := c * a[3] / a[8];
```


Surely possible.
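
For reference, this is roughly the behaviour being asked for, sketched in Python, where dynamic typing makes it easy. The class is a hypothetical stand-in for `DynamicTypeValue`, not a port of the D prototype, and it only covers `*`, `/` and comparisons.

```python
import functools


@functools.total_ordering
class DT:
    """A string-backed value that behaves as a number when it parses as one."""

    def __init__(self, text):
        self.text = str(text)

    def number(self):
        # Raises ValueError for non-numeric text: dynamic, but still strong typing.
        return float(self.text)

    @staticmethod
    def _num(other):
        return other.number() if isinstance(other, DT) else float(other)

    def __mul__(self, other):
        return DT(self.number() * DT._num(other))

    __rmul__ = __mul__

    def __truediv__(self, other):
        return DT(self.number() / DT._num(other))

    def __rtruediv__(self, other):
        return DT(DT._num(other) / self.number())

    def __eq__(self, other):
        return self.number() == DT._num(other)

    def __lt__(self, other):
        return self.number() < DT._num(other)

    def __str__(self):
        return self.text


a = [DT(x) for x in "mx1 32 2 10000000 0.0001 1 100 100 0.5 500ms".split()]
c = a[2]
rate = c * a[3] / a[8]
print(c, rate)
print(max(0.0, rate))  # mixing a plain float and a DT works via __lt__/__gt__
```

Note that defining `__eq__` without `__hash__` makes this class unhashable, so it cannot be used as a dictionary key as written.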

Let's also add an awk-like print (similar to Python's `print`), which 
puts a space between each argument for us, and for good measure, 
let's use the old PHP `echo` name, to save one extra character.

How this would look:

`./dm 'prev_c := 0;max_rate:=0.0; each("^mx1", (DT[] a){ c:=a[2]; 
rate:=c * a[3] / a[8]; if (prev_c != c) { echo(prev_c, max_rate); 
max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);}); 
echo(prev_c, max_rate);' ./foo.txt`


That looks pretty nice. Not optimal, but not too bad. Only 14 
more characters than awk (203 bytes, vs 189).


Note: I do not quite have a full solution to `DynamicTypeValue`, 
(missing hashing support, so it can be used as a key in 
associative array), but prototype is kind of working.

Unfortunately it is not quite working, even with some tries:


```
$ ./dm ....
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): 
Error: cannot implicitly convert expression `c` of type `DT` to 
`int`
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): 
Error: cannot implicitly convert expression `rate` of type `DT` 
to `double`
Failed: ["/usr/bin/dmd", "-d", "-v", "-o-", 
"/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d", 
"-I/tmp/.rdmd-1000"]
...
```

This boils down to:

```
int prev_c = 0;
prev_c = DT("1");
```

not compiling. I defined `opCast`, but this is only used for explicit 
casts.

If I were able to opt in to semi-implicit casts for my type, that 
would work perfectly.



There was also a small issue with `max`, 
`std.algorithm.comparison.max` complains a bit about comparing 
`DT` and `double`:

```
/tmp/.rdmd-1000/eval.EA89F8F1475E6A614DCFA85E8098FEFF.d(122): 
Error: none of the overloads of template 
`std.algorithm.comparison.max` are callable using argument types 
`!()(double, DT)`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1644):        
Candidates are: `max(T...)(T args)`
   with `T = (double, DT)`
   whose parameters have the following constraints:
   `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
`    T.length >= 2
   > !is(CommonType!T == void)
`  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1681):         
                `max(T, U)(T a, U b)`
   with `T = double,
        U = DT`
   whose parameters have the following constraints:
   `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
`  > is(T == U)
   - is(typeof(a < b))
`  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
   Tip: not satisfied constraints are marked with `>`
```

Fair enough, I could provide my own `max` and `min`, and possibly 
a few more functions (e.g. functions like std.math.sqrt, abs, etc.), 
to operate easily on DT. It is hard to do this fully transparently for 
everything, but it should be possible to cover at least everything 
that `awk` has too.

Doing the `$1` -> `a[0]` translation is trivial using some regular 
expressions. It could save 2 characters, but that is not a lot.


In summary:

So, in pure form, the D language and rdmd are usable, but rather 
verbose (mostly due to `auto`, long function names like writeln, 
and the extra arguments they require for putting spaces between 
arguments). But still usable. The script I wrote would probably be 
close to the limit of what would be acceptable, which is not 
great, because the example script does very little.

With some hacks, preprocessing, and an extra library type and 
functions, it is possible to make usage way easier, code way 
shorter, and very comparable to awk. (I didn't test other 
functions like open, or operating on files, but it should not 
be too dissimilar.)

Some operator overloading facilities of the D programming language 
are lacking to fully make it usable tho.

The inability to opt in to implicit `opCast` conversions makes 
it impossible to develop a fully dynamic and easy-to-use solution.



What do you think?



For reference, `dm` script

```python


import os
import re
import subprocess
import sys


code = sys.argv[1]
filenames = sys.argv[2:]

header = """
struct DT {
   string x_;
   this(string x) { x_ = x; }
   this(float x) { x_ = to!string(x); }
   this(int x) { x_ = to!string(x); }
   // string toString() const { return to!string(x_); }
   string toString() const { return x_; }
   bool can(T)() const {
     try { to!T(x_); } catch { return false; } return true;
   }
   bool numeric() const { return can!double(); }
   double number() const { return to!double(x_); }
   auto opBinary(string op)(const ref DT other) const {
     if (numeric() && other.numeric()) {
       const n = number();
       const m = other.number();
       return DT(to!string(mixin("n " ~ op ~ " m")));
     }
     throw new Exception("cannot perform " ~ op ~ " on string");
   }
   auto opBinary(string op, Other)(const ref Other other) const {
     if (numeric()) {
       // static assert(is(other : float, double, int, uint));
       // TODO(baryluk): We could maybe support adding string too.
       // Not super useful tho.
       // I want dynamic typing, but still to be strong typing.
       // Not weak like PHP or JavaScript.
       return DT(to!string(mixin("number() " ~ op ~ " other")));
     }
     // We could possibly allow number + string, and string +
     // string, and string * int
     throw new Exception("cannot perform " ~ op ~ " on string");
   }

   // opUnary, -, ~
   // negation, ! - i.e. !c, where c is a string representing an
   // integer: !c will be true if c == "0".
   // todo support some bool?

   int opCmp(const ref const(DT) other) const {
     if (numeric() && other.numeric()) {
       const n = number();
       const m = other.number();
       return (n > m) - (n < m);
     }
     if (!numeric() && !other.numeric()) {
       return cmp(x_, other.x_);  // three-way string comparison
     }
     throw new Exception("cannot compare string with other");
   }
   int opCmp(Other)(const ref Other other) const {
     // static if (is(Other: int, float, ...));
     if (numeric()) {
       const n = number();
       return (n > other) - (n < other);  // Quick hack
     }
     static if (is(Other == string)) {
       return cmp(x_, other);  // three-way string comparison
     } else {
       throw new Exception("cannot compare string with other");
     }
   }
   bool opEquals(const ref DT other) const {
     return this.opCmp(other) == 0;
   }
   bool opEquals(Other)(const ref Other other) const {
     return this.opCmp(other) == 0;
   }

   // Truthiness of the value; this is also what `!value` uses.
   bool opCast(T)() const if (is(T == bool)) {
     if (numeric()) {
       return number() != 0;
     }
     return x_.length != 0;
   }
   auto opCast(T)() const if (is(T == string)) {
     return x_;
   }
   // Any other target type (int, double, ...) goes through the numeric value.
   T opCast(T)() const if (!is(T == bool) && !is(T == string)) {
     return to!T(number());
   }

   auto opAssign(const ref DT other) {
     x_ = other.x_;
     return this;
   }
   auto opAssign(Other)(const ref Other other) {
     x_ = to!string(other);
     return this;
   }
}
void echo(T...)(T args) {
   foreach (arg; args[0..$-1]) {
       write(arg);
       write(' ');
   }
   writeln(args[$-1]);
}
void each(D)(string re, D dg) {  // just an initial prototype
   foreach (line; stdin.byLine) {
     if (line.matchFirst(re)) {
       // Split into columns and wrap each one; byLine reuses its buffer, so copy with idup.
       dg(line.split.map!(x => DT(x.idup)).array);
     }
   }
}
"""

code = re.sub(r"([a-zA-Z_][a-zA-Z0-9_]*) *:=", r" auto \1=", code)



with subprocess.Popen(["rdmd", f"--eval={header}{code}"], 
stdin=subprocess.PIPE, text=True) as p:
     for filename in filenames:
         with open(filename) as f:
             for line in f:
                 p.stdin.write(line)

print(p)
```
Dec 25 2023
next sibling parent reply Witold Baryluk <witold.baryluk gmail.com> writes:
On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
...
 Inability to opt-in to allow implicit opCast casting are making 
 it not possible to develop fully dynamic and easy to use 
 solution.
Was thinking a bit after posting, and maybe there is some hope:

So, one of the possibly limited hacks would be to force a `double` return type on `opBinary` when used with arithmetic operators.

Instead of

```d
auto opBinary(string op)(const ref DT other) const {
  if (numeric() && other.numeric()) {
    const n = number();
    const m = other.number();
    return DT(to!string(mixin("n " ~ op ~ " m")));
  }
  throw new Exception("cannot perform " ~ op ~ " on string");
}
```

we do

```d
double opBinary(string op)(const ref DT other) const
if (op == "+" || op == "-" || op == "*" || op == "/" || op == "^^" ||
    op == "|" || op == "&" || op == "^") {
  if (numeric() && other.numeric()) {
    const n = number();
    const m = other.number();
    return mixin("n " ~ op ~ " m");
  }
  throw new Exception("cannot perform " ~ op ~ " on string");
}

string opBinary(string op)(const ref DT other) const if (op == "~") {
  if (!numeric() && !other.numeric()) {
    return mixin("x_ " ~ op ~ " other.x_");
  }
  throw new Exception("cannot perform " ~ op ~ " on non-string");
}

... // more overloads
// ...
```

This is quite limited in general tho. What if I also want to support more types than just double, and do it efficiently (cdouble, BigInt, other custom types)?
Dec 25 2023
parent reply Sergey <kornburn yandex.ru> writes:
On Monday, 25 December 2023 at 12:16:46 UTC, Witold Baryluk wrote:
 On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk 
 wrote:
 This is quite limit tho in general. What if I want to also 
 support things more types than just double, and do it 
 efficiently (cdouble, BigInt, other custom types).
Maybe if you are able to change spaces with tabs (create tsv file) this tool will help you https://github.com/eBay/tsv-utils
Dec 25 2023
parent reply Witold Baryluk <witold.baryluk gmail.com> writes:
On Monday, 25 December 2023 at 12:29:19 UTC, Sergey wrote:
 On Monday, 25 December 2023 at 12:16:46 UTC, Witold Baryluk 
 wrote:
 On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk 
 wrote:
 This is quite limit tho in general. What if I want to also 
 support things more types than just double, and do it 
 efficiently (cdouble, BigInt, other custom types).
Maybe if you are able to change spaces with tabs (create tsv file) this tool will help you https://github.com/eBay/tsv-utils
I am able to change spaces to tabs, but I do not want to. I strongly prefer spaces. As I mentioned before, I have a custom tool called `kolumny` that does what tsv-utils does, and way more. I want something more generic. Also, the reason for my post is not to solve my particular problem (in case you missed the point of the post), but to discuss language design and making D useful in a wider area of applications.
Dec 25 2023
parent reply Sergey <kornburn yandex.ru> writes:
On Monday, 25 December 2023 at 12:45:50 UTC, Witold Baryluk wrote:
 post), but about language design, and make D useful in wider 
 area of applications.
I didn't get the point of the post, yeah. Why not use a template type to support not only double but any T?
Dec 25 2023
parent reply Witold Baryluk <witold.baryluk gmail.com> writes:
On Monday, 25 December 2023 at 15:50:52 UTC, Sergey wrote:
 On Monday, 25 December 2023 at 12:45:50 UTC, Witold Baryluk 
 wrote:
 post), but about language design, and make D useful in wider 
 area of applications.
I didn't get the point of the post, yeah. Why not use a template type to support not only double but any T?
It just doesn't work.

```
struct DT {
  // ...
}

auto c = 0;  // or 0.0, doesn't matter
c = DT("1");
```

`c` will be inferred to be `int` or `double`. There is no way to convince the D compiler to call some operator to do the conversion. There is no `opAssignRight`. I do not see how templates help.

As a hack, I can do:

```
c := 0;
c = DT("1")
```

and instead of converting `varname := ` to `auto varname = `, do `DT varname = `. Then I could probably do something about it. But I think sometimes you want to force a type, or have an empty, default-initialized variable like `string c;`. Otherwise it does not look like D, and is very hacky in general.
Dec 25 2023
parent Witold Baryluk <witold.baryluk gmail.com> writes:
Actually I do not think `DT varname = ...` will work. It will 
only work for a limited number of types, those with value semantics. It 
will not work for reference types and other non-trivial types 
(e.g. from libraries, Phobos, arrays, etc). For command line 
scripting I would still want to be able to use `auto` (via the `:=` 
to `auto` translation) for them instead.
Dec 25 2023
prev sibling parent Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:

 For a very long time I have been using bash, grep, sed, awk, 
 usual suspects on Unix, as they are super quick to type, 
 incremental, etc. Once complexity is too big I usually switch to 
 Python (decades ago it might have been Perl or PHP).
I'm actually using Ruby instead of sed, awk and friends for this kind of task. Python is whitespace sensitive and that's the reason why I don't like it in general. But Ruby is essentially a modernized Perl with very expressive syntax.

Your example with spaces removed and single character variables:

`awk 'BEGIN{p=0;m=0.0;}/^mx1/{c=$3;r=c*$4/$9;if(p!=c){print p,m;m=0;}p=c;if(r>m)m=r;}END{print p,m;}' < foo.txt`

`ruby -e'g={0=>0};while l=gets;l.scan(/^mx1/){a=l.split;c=a[2].to_i;r=c*a[3].to_f/a[8].to_f;g[c]=[g[c]||0,r].max}end;g.each{puts"%d %g"%_1}' < foo.txt`

The following variant works with both Ruby and Crystal:

`crystal eval 'g={0=>0.0};while l=gets;l.scan(/^mx1/){a=l.split;c=a[2].to_i;rate=c*a[3].to_f/a[8].to_f;g[c]||=0.0;g[c]=[g[c],rate].max}end;g.each{|v|puts "%d %g"%v}' < foo.txt`

D is not the best language for very terse one-liner code golfing. And it doesn't look like many people are interested in adding special syntax sugar tailored for this.
Jan 27