digitalmars.D - dmd command line scripting experiments and observations
- Witold Baryluk (386/386) Dec 25 2023 For a very long time I have been using bash, grep, sed, awk,
- Witold Baryluk (43/46) Dec 25 2023 Was thinking a bit after posting, and maybe there is some hope:
- Sergey (3/8) Dec 25 2023 Maybe if you are able to change spaces with tabs (create tsv
- Witold Baryluk (9/18) Dec 25 2023 I am able to change space to tabs. But I do not want. I strongly
- Sergey (3/5) Dec 25 2023 I didn’t take the post yeah.
- Witold Baryluk (23/29) Dec 25 2023 It just doesn't work.
- Witold Baryluk (6/6) Dec 25 2023 Actually I do not think `DT varname = ...` will work. It will
- Siarhei Siamashka (16/20) Jan 27 2024 I'm actually using Ruby instead of sed, awk and friends for this
For a very long time I have been using bash, grep, sed, awk, the usual suspects on Unix, as they are super quick to type, incremental, etc. Once the complexity is too big I usually switch to Python (decades ago it might have been Perl or PHP).

I often embed small snippets of grep or awk in other tools that just need to do something with some text files, for example some pre-processing for plotting in Gnuplot. I even wrote my own line-column processing "language", called `kolumny`, over a decade ago, to help with similar tasks. And while it does work well, I rarely use it (once a year these days, sadly), because it is not really a full language.

Yesterday I needed to do some simple processing before plotting in gnuplot:

```gnuplot
set ylabel "locking rate [M/s]"
plot "<grep ^mx1 foo.txt" using 3:($3*$4/$9/1e6) title "RWMutex", \
     "<grep ^mx2 foo.txt" using 3:($3*$4/$9/1e6) title "drwMutex"
```

where the file `foo.txt` has things like this:

```
mx1 32 1 10000000 0.0001 1 100 100 0.552091302 552.091302ms
mx1 32 1 10000000 0.0001 1 100 100 0.552518653 552.518653ms
mx1 32 1 10000000 0.0001 1 100 100 0.562133796 562.133796ms
...
mx2 32 1 10000000 0.0001 1 100 100 0.613519317 613.519317ms
mx2 32 1 10000000 0.0001 1 100 100 0.602255619 602.255619ms
...
mx1 32 2 10000000 0.0001 1 100 100 1.489152483 1.489152483s
mx1 32 2 10000000 0.0001 1 100 100 1.469110205 1.469110205s
...
...
mx2 32 64 10000000 0.0001 1 100 100 8.84282034 8.84282034s
```

Ok, so my gnuplot script works, but now I have a lot of points for each x. I would like to take the max throughput (lowest time, in column 9), and only use that. Or maybe the median. (Definitely not the average tho).
And I didn't feel like doing this in `awk`. So I started exploring rdmd a bit.

First or second attempt:

`rdmd --eval='float[][int] g; foreach (line; stdin.byLine.filter!(x=>x.matchFirst("^mx1"))) { auto a = line.split; auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; g[c] ~= rate; } foreach (c, values; g) { writeln(c, " ", values.reduce!max); }' < foo.txt`

Removed redundant `()` to make the code shorter.

```
45 2.29471e+07
26 2.25617e+07
52 2.26505e+07
43 2.30352e+07
17 2.32184e+07
34 2.33697e+07
60 2.26649e+07
61 2.25918e+07
...
```

Ok. "Works". That is not good for a few reasons:

1) Still kind of long.
2) Cannot easily embed into a gnuplot script, because of the usage of both `'` and `"`.
3) I group by `c` using a map (associative array), but that means that during printing the output will be unordered. If I switch to plotting using lines instead of the default points, I want ascending order, otherwise the plot will be a chaos of lines. This could be fixed by piping the output to `sort -n -k 1`, but that a) is less efficient, b) makes things even longer.

The obvious way would be to remember the previous `c`, and aggregate on the fly. Faster, ordered by design (because the input is ordered), less memory usage.

Next attempt (not fully correct), trying to rectify a few things incrementally, not shooting for the perfect solution yet, just exploring a bit more:

`rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if (prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);}' < foo.txt`

```
0 0
1 1.81999e+07
2 1.3897e+07
3 1.68113e+07
4 1.77501e+07
5 1.77466e+07
6 2.00162e+07
7 2.00754e+07
8 2.24083e+07
9 2.43998e+07
...
63 2.24421e+07
```

Some progress, but not quite there (obviously). We do not output a line for 64, because the check for `prev_c!=c` is only inside the loop; we need another `writeln` after the loop. Let's fix this then.
`rdmd --eval='auto prev_c = 0; auto max_rate=0.0; foreach (a; stdin.byLine.filter!(x=>x.matchFirst(``^mx1``)).map!split) { auto c=a[2].to!int; auto rate=c * a[3].to!float / a[8].to!float; if (prev_c != c) { writeln(prev_c, `` ``, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);} writeln(prev_c, `` ``, max_rate); max_rate=0;' < foo.txt`

A bit hairy, but it does the job. (It still prints 0, but that is easy to fix with something like `if (prev_c != c && prev_c)`.)

Let's reimplement it in awk, for an unfair comparison:

`awk 'BEGIN{prev_c = 0; max_rate=0.0;} /^mx1/{ c=$3; rate=c*$4/$9; if (prev_c != c) { print prev_c, max_rate; max_rate=0;} prev_c=c;if(rate>max_rate)max_rate=rate;} END{print prev_c, max_rate;}' < foo.txt`

Quite a bit shorter.

There are things that would be hard to do in D, but still possible.

`auto x = ...`, replace with `x:=...` (like in Go). This could be done with a simple preprocessor (even just a `sed -E -e 's/([a-zA-Z0-9_]+) *:=/auto \1=/g'`) before passing to `gdmd`.

`/regexp/{} /regexp/{}`, and `foreach (a......)`, replace with an abstraction that does this for us. It should be possible to implement, probably with an API like this:

```d
each(  // implicitly on stdin.byLine()
  "^mx1", (a, m) => {
    // a is just line split on whitespaces,
    // m is regexp match groups (optional)
    c := a[2].to!int;
    ...
  },
  ...,  // more matchers.
  ...,  // All matching matchers are executed in order, not just the first one.
  ...,  // delegate with no preceding matcher, is equivalent to ".*" matching.
  ...);
```

We can accept both `void` delegates and ones returning `int`, e.g. if we want to do something like a loop `break`. But in scripting, instead of `break` in the main loop, you will usually just exit the whole script. So not super useful. (`continue` works by just returning from a void delegate, so not a concern.)

A more advanced `each` could allow multiple predicates, multiple regexps, and possibly some conditions (`&&`, `||`).
We can invent a mini DSL for this, or use operator overloading (maybe, as not all operators are overloadable in D; e.g. overloading comparison operators is very problematic in D. It was possible in D1, but not in D2).

We can also add the original full line (unsplit) as the first element of `a`, so `a[0]` is just like awk's `$0` (whole line), and `a[1]` is just like `$1` (columns, with the first one being `$1`).

Note: We do not want to put this `each` implicitly into a runner script, because often we want to do things before it. This could be done with something like `--begin` and `--end`, but that is more verbose. Plus `--begin` and `--end` would make it harder to port command line code to a file-based script.

On the other front, `to!int`, we can do better too. We could provide helper functions for common type conversions like to!int, to!float. So instead of:

```d
c := a[2].to!int;
```

we do

```d
c := a[2].INT;
rate:=c * a[3].F32 / a[8].F32;
```

Ok, how about making `each` smarter: it does not just split the input line into columns of strings (`string[]`), but instead puts each value into a custom library type that provides dynamic typing. Something like `DynamicTypeValue[]`, but with operator overloading for arithmetic, comparison and toString functions.

```d
c := a[2];
rate := c * a[3] / a[8];
```

Surely possible.

Let's also add an awk-like print (similar to Python's `print`), which puts a space between each argument for us, and for good measure, let's use the old PHP `echo` construct, to save one extra character.

How this would look:

`./dm 'prev_c := 0;max_rate:=0.0; each("^mx1", (DT[] a){ c:=a[2]; rate:=c * a[3] / a[8]; if (prev_c != c) { echo(prev_c, max_rate); max_rate=0;} prev_c=c;max_rate=max(max_rate,rate);}); echo(prev_c, max_rate);' ./foo.txt`

That looks pretty nice. Not optimal, but not too bad. Only 14 more characters than awk (203 bytes vs. 189).
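A minimal sketch of the `each` and `echo` helpers in plain D, without the `:=` preprocessing and without the dynamic type, might look like the following. The signatures are assumptions made for illustration: here `each` takes the line range explicitly (instead of reading stdin implicitly), supports a single matcher, and passes only the split columns to the delegate.

```d
import std.algorithm.comparison : max;
import std.array : split;
import std.conv : to;
import std.regex : matchFirst, regex;
import std.stdio : stdin, write, writeln;

// Print all arguments separated by single spaces, like awk's `print a, b`.
void echo(T...)(T args)
{
    foreach (i, arg; args)
    {
        if (i > 0)
            write(' ');
        write(arg);
    }
    writeln();
}

// Hypothetical helper: call `dg` with the whitespace-split columns of
// every line in `lines` that matches the regular expression `re`.
void each(R, D)(R lines, string re, D dg)
{
    auto rx = regex(re); // compile the pattern once
    foreach (line; lines)
    {
        if (!line.matchFirst(rx).empty)
            dg(line.split);
    }
}

void main()
{
    int prevC = 0;
    double maxRate = 0.0;

    // Input is assumed to be ordered by column 3, as in foo.txt.
    each(stdin.byLine, "^mx1", (char[][] a) {
        const c = a[2].to!int;
        const rate = c * a[3].to!double / a[8].to!double;
        if (prevC != c && prevC != 0)
        {
            echo(prevC, maxRate);
            maxRate = 0;
        }
        prevC = c;
        maxRate = max(maxRate, rate);
    });

    if (prevC != 0)
        echo(prevC, maxRate);
}
```

Saved as a file (the name is arbitrary), this could be run as `rdmd script.d < foo.txt`. Note that `byLine` reuses its buffer, so the columns are only valid inside the delegate call.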
Note: I do not quite have a full solution for `DynamicTypeValue` (it is missing hashing support, which it needs to be usable as a key in an associative array), but the prototype is kind of working.

Unfortunately it is not quite working, even with some tries:

```
$ ./dm ....
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): Error: cannot implicitly convert expression `c` of type `DT` to `int`
/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d(122): Error: cannot implicitly convert expression `rate` of type `DT` to `double`
Failed: ["/usr/bin/dmd", "-d", "-v", "-o-", "/tmp/.rdmd-1000/eval.75B7D99A106E2F3F190C7D5398C1A329.d", "-I/tmp/.rdmd-1000"]
...
```

This boils down to:

```
int prev_c = 0;
prev_c = DT("1");
```

not compiling. I defined `opCast`, but this is only for explicit casts. If I were able to allow semi-implicit casts for my type, that would work perfectly.

There was also a small issue with `max`: `std.algorithm.comparison.max` complains a bit about comparing `DT` and `double`:

```
/tmp/.rdmd-1000/eval.EA89F8F1475E6A614DCFA85E8098FEFF.d(122): Error: none of the overloads of template `std.algorithm.comparison.max` are callable using argument types `!()(double, DT)`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1644):        Candidates are: `max(T...)(T args)`
  with `T = (double, DT)`
  whose parameters have the following constraints:
  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
  `  T.length >= 2
  > !is(CommonType!T == void)
  `  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
/usr/include/dmd/phobos/std/algorithm/comparison.d(1681):                        `max(T, U)(T a, U b)`
  with `T = double, U = DT`
  whose parameters have the following constraints:
  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
  `  > is(T == U)
  - is(typeof(a < b))
  `  `~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
  Tip: not satisfied constraints are marked with `>`
```

Fair enough, I could provide my own `max` and `min`, and possibly a few more functions (e.g. functions like std.math.sqrt, abs, etc.) to operate easily on DT. It is hard to do this fully transparently for everything, but it should be possible to cover at least everything that `awk` has.

Doing the `$1` -> `a[0]` translation is trivial using some regular expressions. It could save 2 characters, but that is not a lot.

In summary:

So, in pure form, the D language and rdmd are usable, but rather verbose (mostly due to `auto`, long function names like writeln, and the extra arguments they require for putting spaces between arguments). But still usable. The script I wrote would probably be close to the limit of what would be acceptable, which is not great, because the example script does very little.

With some hacks, preprocessing, and an extra library type and functions, it is possible to make usage way easier, the code way shorter, and very comparable to awk. (I didn't test other functions like open and operating on files, but it should not be too dissimilar.)

Some operator overloading facilities of the D programming language are lacking to make it fully usable tho. The inability to opt in to implicit opCast casting makes it impossible to develop a fully dynamic and easy to use solution.

What do you think?
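The conversion limitation described above can be reduced to a few lines. This is a sketch with a hypothetical, stripped-down `DT` (only the string field and an `opCast` for `int` and `double`), not the prototype from the `dm` script:

```d
import std.conv : to;

// Hypothetical stand-in for DT, reduced to the conversion in question.
struct DT
{
    string x_;

    // opCast is only consulted for explicit cast(T) expressions.
    T opCast(T)() const if (is(T == int) || is(T == double))
    {
        return x_.to!T;
    }
}

void main()
{
    auto v = DT("1");

    // The explicit cast goes through opCast and works.
    int prevC = cast(int) v;
    assert(prevC == 1);

    // The implicit form from the error message is rejected at compile time.
    static assert(!__traits(compiles, { int p = 0; p = DT("1"); }));
}
```

The `static assert` documents the failing case without breaking the build, so the snippet compiles as a whole.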
For reference, the `dm` script:

```python
import os
import re
import subprocess
import sys

code = sys.argv[1]
filenames = sys.argv[2:]

header = """
struct DT {
  string x_;
  this(string x) {
    x_ = x;
  }
  this(float x) {
    x_ = to!string(x);
  }
  this(int x) {
    x_ = to!string(x);
  }
  // string toString() const { return to!string(x_); }
  string toString() const { return x_; }

  bool can(T)() const {
    try {
      to!T(x_);
    } catch {
      return false;
    }
    return true;
  }
  bool numeric() const {
    return can!double();
  }
  double number() const {
    return to!double(x_);
  }

  auto opBinary(string op)(const ref DT other) const {
    if (numeric() && other.numeric()) {
      const n = number();
      const m = other.number();
      return DT(to!string(mixin("n " ~ op ~ " m")));
    }
    throw new Exception("cannot perform " ~ op ~ " on string");
  }
  auto opBinary(string op, Other)(const ref Other other) const {
    if (numeric()) {
      // static assert(is(other : float, double, int, uint));
      // TODO(baryluk): We could maybe support adding string too. Not super useful tho.
      // I want dynamic typing, but still to be strong typing. Not weak like PHP or JavaScript.
      return DT(to!string(mixin("number() " ~ op ~ " other")));
    }
    // We could possibly allow number + string, and string + string, and string * int
    throw new Exception("cannot perform " ~ op ~ " on string");
  }

  // opUnary, -, ~
  // negation, ! - i.e. !c, where c is string representing integer, then we for !c we if c == "0", it will be true.
  // todo support some bool?

  int opCmp(const ref const(DT) other) const {
    if (numeric() && other.numeric()) {
      const n = number();
      const m = other.number();
      return (n > m) - (n < m);
    }
    if (!numeric() && !other.numeric()) {
      return x_ < other.x_;
    }
    throw new Exception("cannot compare string with other");
  }
  int opCmp(Other)(const ref Other other) const {
    // static if (is(Other: int, float, ...));
    if (numeric()) {
      const n = number();
      return (n > other) - (n < other);  // Quick hack
    }
    static if (is(Other == string)) {
      return x_ < other;
    } else {
      throw new Exception("cannot compare string with other");
    }
  }

  bool opEquals(const ref DT other) const {
    return this.opCmp(other) == 0;
  }
  bool opEquals(Other)(const ref Other other) const {
    return this.opCmp(other) == 0;
  }

  // This also handled !value
  bool opCast(T)() const if (is(T == bool)) {
    if (numeric()) {
      return !number();
    }
    return !x_;
  }
  auto opCast(T)() const if (is(T == string)) {
    return x_;
  }
  auto opCast(T)() const {
    // if T is numeric, i.e. int, double
    pragma(msg, "casting to", T);
    return x_.number();
  }

  auto opAssign(const ref DT other) {
    x_ = other.x_;
    return this;
  }
  auto opAssign(Other)(const ref Other other) {
    x_ = to!string(other);
    return this;
  }
}

void echo(T...)(T args) {
  foreach (arg; args[0..$-1]) {
    write(arg);
    write(' ');
  }
  writeln(args[$-1]);
}

void each(D)(string re, D dg) {
  // just an initial prototype
  foreach (line; stdin.byLine) {
    if (line.matchFirst(re)) {
      dg(line.map!split().map!(x=>new DT(x))());
    }
  }
}
"""

code = re.sub(r"([a-zA-Z_][a-zA-Z0-9_]*) *:=", r" auto \1=", code)

with subprocess.Popen(["rdmd", f"--eval={header}{code}"], stdin=subprocess.PIPE, text=True) as p:
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                p.stdin.write(line)
print(p)
```
Dec 25 2023
On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
> ...Inability to opt-in to allow implicit opCast casting are making it not possible to develop fully dynamic and easy to use solution.

Was thinking a bit after posting, and maybe there is some hope:

So, one of the possibly limited hacks would be to force a `double` return type on `opBinary` when used with arithmetic operators.

Instead of

```d
auto opBinary(string op)(const ref DT other) const {
  if (numeric() && other.numeric()) {
    const n = number();
    const m = other.number();
    return DT(to!string(mixin("n " ~ op ~ " m")));
  }
  throw new Exception("cannot perform " ~ op ~ " on string");
}
```

we do

```d
double opBinary(string op)(const ref DT other) const
if (op == "+" || op == "-" || op == "*" || op == "/" ||
    op == "^^" || op == "|" || op == "&" || op == "^") {
  if (numeric() && other.numeric()) {
    const n = number();
    const m = other.number();
    return mixin("n " ~ op ~ " m");
  }
  throw new Exception("cannot perform " ~ op ~ " on string");
}

string opBinary(string op)(const ref DT other) const
if (op == "~") {
  if (!numeric() && !other.numeric()) {
    return mixin("x_ " ~ op ~ " other.x_");
  }
  throw new Exception("cannot perform " ~ op ~ " on non-string");
}

... // more overloads
// ...
```

This is quite limiting tho in general. What if I also want to support more types than just double, and do it efficiently (cdouble, BigInt, other custom types)?
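As a rough, self-contained illustration of the `double`-returning `opBinary` idea, here is a sketch using a cut-down `DT`. The `isArithmetic` helper, the by-value parameters and the `opBinaryRight` overload are assumptions added for this example, and only `+`, `-`, `*`, `/` are covered:

```d
import std.algorithm.comparison : max;
import std.conv : to;

// True for the operators this sketch forwards to double arithmetic.
enum isArithmetic(string op) = op == "+" || op == "-" || op == "*" || op == "/";

struct DT
{
    string x_;

    double number() const { return x_.to!double; }

    // DT op DT and DT op <built-in number> both yield a plain double.
    double opBinary(string op, T)(const T other) const
        if (isArithmetic!op && (is(T == DT) || is(T : double)))
    {
        static if (is(T == DT))
            const double rhs = other.number();
        else
            const double rhs = other;
        return mixin("number() " ~ op ~ " rhs");
    }

    // <built-in number> op DT, needed for chains like c * a[3] / a[8],
    // where the left operand is already a double.
    double opBinaryRight(string op)(double lhs) const
        if (isArithmetic!op)
    {
        return mixin("lhs " ~ op ~ " number()");
    }
}

void main()
{
    auto c = DT("2");
    auto n = DT("10000000");
    auto t = DT("0.5");

    double maxRate = 0.0;
    double rate = c * n / t;      // (c * n) is a double, then double / DT
    maxRate = max(maxRate, rate); // both arguments are double now
    assert(maxRate == 4e7);
}
```

Because every arithmetic result is a `double`, `std.algorithm.comparison.max` no longer sees a `DT` argument in this usage. The price is the one already mentioned: the result type is fixed.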
Dec 25 2023
On Monday, 25 December 2023 at 12:16:46 UTC, Witold Baryluk wrote:
> On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
>
> This is quite limit tho in general. What if I want to also support things more types than just double, and do it efficiently (cdouble, BigInt, other custom types).

Maybe if you are able to change spaces with tabs (create tsv file) this tool will help you https://github.com/eBay/tsv-utils
Dec 25 2023
On Monday, 25 December 2023 at 12:29:19 UTC, Sergey wrote:
> On Monday, 25 December 2023 at 12:16:46 UTC, Witold Baryluk wrote:
>> On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
>>
>> This is quite limit tho in general. What if I want to also support things more types than just double, and do it efficiently (cdouble, BigInt, other custom types).
>
> Maybe if you are able to change spaces with tabs (create tsv file) this tool will help you https://github.com/eBay/tsv-utils

I am able to change spaces to tabs. But I do not want to. I strongly prefer spaces. As I mentioned before, I have a custom tool called `kolumny` that does what tsv-utils does, and way more. I want something more generic. Also, the reason for my post is not to solve my problem in particular (in case you missed the point of the post), but to discuss language design and making D useful in a wider area of applications.
Dec 25 2023
On Monday, 25 December 2023 at 12:45:50 UTC, Witold Baryluk wrote:
> post), but about language design, and make D useful in wider area of applications.

I didn't get the point of the post, yeah. Why not use a template type to support not only double but any T?
Dec 25 2023
On Monday, 25 December 2023 at 15:50:52 UTC, Sergey wrote:
> On Monday, 25 December 2023 at 12:45:50 UTC, Witold Baryluk wrote:
>> post), but about language design, and make D useful in wider area of applications.
>
> I didn’t take the post yeah. Why not use templates type to support not only double but any T?

It just doesn't work.

```
struct DT {
  // ...
}

auto c = 0;  // or 0.0, doesn't matter
c = DT("1");
```

`c` will be inferred to be `int` or `double`. There is no way to convince the D compiler to call some operator to do a conversion. There is no `opAssignRight`. I do not see how templates help.

As a hack, I can do:

```
c := 0;
c = DT("1")
```

and instead of converting `varname := ` to `auto varname = `, convert it to `DT varname = `. Then I could probably do something about it. But I think sometimes you want to force a type, or have an empty, default-initialized variable like "string c;". Otherwise it does not look like D, and is very hacky in general.
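For illustration, the difference between the `auto varname = ` and `DT varname = ` translations can be sketched as follows. The `DT` here is a hypothetical minimal version with only constructors, two `opAssign` overloads and `toString`; it is not the full prototype:

```d
import std.conv : to;

// Hypothetical minimal DT: stores every value as its string form.
struct DT
{
    string x_;

    this(int x)    { x_ = x.to!string; }
    this(double x) { x_ = x.to!string; }
    this(string x) { x_ = x; }

    // Allow later assignments of plain numbers, e.g. max_rate = 0.
    void opAssign(int x)    { x_ = x.to!string; }
    void opAssign(double x) { x_ = x.to!string; }

    string toString() const { return x_; }
}

void main()
{
    // `auto c = 0;` fixes the type to int, so a DT cannot be stored in it later.
    static assert(!__traits(compiles, { auto c = 0; c = DT("1"); }));

    // Declaring the variable as DT from the start keeps later assignments legal.
    DT c = 0;     // calls this(int)
    c = DT("1");  // plain struct assignment
    assert(c.toString == "1");

    c = 2.5;      // calls opAssign(double)
    assert(c.toString == "2.5");
}
```

This only covers the value-type case; variables that should hold arrays, class references or library types would still need `auto`.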
Dec 25 2023
Actually I do not think `DT varname = ...` will work. It will only work for a limited number of types with value semantics. It will not work for reference types and other non-trivial types (e.g. from libraries, Phobos, arrays, etc.). For those, in command line scripting I would still want to be able to use `auto` (via the `:=` to `auto` translation) instead.
Dec 25 2023
On Monday, 25 December 2023 at 11:59:35 UTC, Witold Baryluk wrote:
> For a very long time I have been using bash, grep, sed, awk, usual suspects on Unix, as they are super quick to type, incremental, etc. Once complexity is to big I usually switch to Python (decades ago it might have been Perl or PHP).

I'm actually using Ruby instead of sed, awk and friends for this kind of task. Python is whitespace sensitive and that's the reason why I don't like it in general. But Ruby is essentially a modernized Perl with very expressive syntax.

Your example with spaces removed and single character variables:

`awk 'BEGIN{p=0;m=0.0;}/^mx1/{c=$3;r=c*$4/$9;if(p!=c){print p,m;m=0;}p=c;if(r>m)m=r;}END{print p,m;}' < foo.txt`

`ruby -e'g={0=>0};while l=gets;l.scan(/^mx1/){a=l.split;c=a[2].to_i;r=c*a[3].to_f/a[8].to_f;g[c]=[g[c]||0,r].max}end;g.each{puts"%d %g"%_1}' < foo.txt`

The following variant works with both Ruby and Crystal:

`crystal eval 'g={0=>0.0};while l=gets;l.scan(/^mx1/){a=l.split;c=a[2].to_i;rate=c*a[3].to_f/a[8].to_f;g[c]||=0.0;g[c]=[g[c],rate].max}end;g.each{|v|puts "%d %g"%v}' < foo.txt`

D is not the best language for very terse one-liner code golfing. And it doesn't look like many people are interested in adding special syntax sugar tailored for this.
Jan 27 2024