
digitalmars.D.learn - Looking for a Code Review of a Bioinformatics POC

reply duck_tape <sstadick gmail.com> writes:
Hi! I'm new to dlang but loving it so far! One of my favorite 
first things to implement in a new language is an interval 
library. In this case I want to submit to a benchmark repo: 
https://github.com/lh3/biofast

If anyone is willing to take a look and give some feedback I'd be 
very appreciative! Specifically, if you have any performance 
improvement ideas: https://github.com/sstadick/dgranges/pull/1

Currently my D version is a few seconds slower than the Crystal 
version, putting it very solidly in third place overall. I'm not 
really sure where it's falling behind Crystal, since `-release` 
removes bounds checking. I have not looked at the assembly 
between the two, but I suspect that Crystal inlines the callback 
and D does not.

I also think there is room for improvement in the IO, as I'm just 
using the defaults.
Jun 11 2020
next sibling parent reply CraigDillabaugh <craig.dillabaugh gmail.com> writes:
On Thursday, 11 June 2020 at 16:13:34 UTC, duck_tape wrote:
 Hi! I'm new to dlang but loving it so far! One of my favorite 
 first things to implement in a new language is an interval 
 library. In this case I want to submit to a benchmark repo: 
 https://github.com/lh3/biofast

 I also think there is room for improvement in the IO, as I'm 
 just using the defaults.
Are you building with DMD or with LDC/GDC?
Jun 11 2020
parent duck_tape <sstadick gmail.com> writes:
On Thursday, 11 June 2020 at 17:25:13 UTC, CraigDillabaugh wrote:
 Are you building with DMD or with LDC/GDC?
I'm building with LDC. I haven't pulled up a linux box to test drive gdc yet. `ldc2 -O -release`
Jun 11 2020
prev sibling next sibling parent tastyminerals <tastyminerals gmail.com> writes:
On Thursday, 11 June 2020 at 16:13:34 UTC, duck_tape wrote:
 Hi! I'm new to dlang but loving it so far! One of my favorite 
 first things to implement in a new language is an interval 
 library. In this case I want to submit to a benchmark repo: 
 https://github.com/lh3/biofast

 If anyone is willing to take a look and give some feedback I'd
 be very appreciative! Specifically, if you have any performance
 improvement ideas: https://github.com/sstadick/dgranges/pull/1

 Currently my D version is a few seconds slower than the Crystal
 version, putting it very solidly in third place overall. I'm not
 really sure where it's falling behind Crystal, since `-release`
 removes bounds checking. I have not looked at the assembly
 between the two, but I suspect that Crystal inlines the
 callback and D does not.

 I also think there is room for improvement in the IO, as I'm 
 just using the defaults.
Move as much code as possible to compile time. Don't allocate inside loops. Keep GC collections away from performance-critical parts with the GC.disable switch. Also, `"dflags-ldc": ["-mcpu=native"]` in dub.json might give you some edge.
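For instance, a minimal sketch of the GC.disable pattern (the loop body is just a placeholder):

	import core.memory : GC;

	void main()
	{
		GC.disable();             // no collection cycles during the hot loop
		scope(exit) GC.enable();

		foreach (i; 0 .. 1_000_000)
		{
			// performance-critical work here
		}

		GC.collect(); // optionally collect at a convenient point
	}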
Jun 11 2020
prev sibling next sibling parent reply tastyminerals <tastyminerals gmail.com> writes:
On Thursday, 11 June 2020 at 16:13:34 UTC, duck_tape wrote:
 Hi! I'm new to dlang but loving it so far! One of my favorite 
 first things to implement in a new language is an interval 
 library. In this case I want to submit to a benchmark repo: 
 https://github.com/lh3/biofast

 If anyone is willing to take a look and give some feedback I'd
 be very appreciative! Specifically, if you have any performance
 improvement ideas: https://github.com/sstadick/dgranges/pull/1

 Currently my D version is a few seconds slower than the Crystal
 version, putting it very solidly in third place overall. I'm not
 really sure where it's falling behind Crystal, since `-release`
 removes bounds checking. I have not looked at the assembly
 between the two, but I suspect that Crystal inlines the
 callback and D does not.

 I also think there is room for improvement in the IO, as I'm 
 just using the defaults.
Add to your dub.json the following:

"""
"buildTypes": {
    "release": {
        "buildOptions": [ "releaseMode", "inline", "optimize" ],
        "dflags": [ "-boundscheck=off" ]
    }
}
"""

Then build with:

dub build --compiler=ldc2 --build=release

Mir Slices instead of standard D arrays are faster. Although looking at your code I don't see where you can plug them in. Just keep it in mind.
Jun 11 2020
next sibling parent reply duck_tape <sstadick gmail.com> writes:
On Thursday, 11 June 2020 at 20:24:37 UTC, tastyminerals wrote:
 Mir Slices instead of standard D arrays are faster. Athough 
 looking at your code I don't see where you can plug them in. 
 Just keep in mind.
Thanks for taking a look! What is it about Mir Slices that makes them faster? I hadn't seen the Mir package before but it looks very useful and intriguing.
Jun 11 2020
parent reply tastyminerals <tastyminerals gmail.com> writes:
On Thursday, 11 June 2020 at 21:54:31 UTC, duck_tape wrote:
 On Thursday, 11 June 2020 at 20:24:37 UTC, tastyminerals wrote:
 Mir Slices instead of standard D arrays are faster. Athough 
 looking at your code I don't see where you can plug them in. 
 Just keep in mind.
Thanks for taking a look! What is it about Mir Slices that makes them faster? I hadn't seen the Mir package before but it looks very useful and intriguing.
Mir is fine-tuned for LLVM, pointer magic and SIMD optimizations.
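For a taste of the API, a tiny sketch (assumes the mir-algorithm dub package; the numbers are made up):

	import mir.ndslice : sliced;

	void main()
	{
		// sliced wraps an existing array as an n-dimensional view
		// without copying.
		auto a = [1, 2, 3, 4, 5, 6].sliced(2, 3);
		auto row = a[1]; // zero-cost row view
	}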
Jun 11 2020
parent duck_tape <sstadick gmail.com> writes:
On Thursday, 11 June 2020 at 22:53:52 UTC, tastyminerals wrote:
 Mir is fine-tuned for LLVM, pointer magic and SIMD 
 optimizations.
I'll have to give that a shot for the biofast version of this. There are other ways of doing this same thing that could very well benefit from Mir.
Jun 11 2020
prev sibling parent duck_tape <sstadick gmail.com> writes:
On Thursday, 11 June 2020 at 20:24:37 UTC, tastyminerals wrote:
 Mir Slices instead of standard D arrays are faster. Athough 
 looking at your code I don't see where you can plug them in. 
 Just keep in mind.
I just started following links, and what a sweet blog! Your reason for getting into D is exactly the same as mine.
Jun 11 2020
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jun 11, 2020 at 04:13:34PM +0000, duck_tape via Digitalmars-d-learn
wrote:
[...]
 Currently my D version is a few seconds slower than the Crystal
 version, putting it very solidly in third place overall. I'm not really
 sure where it's falling behind Crystal, since `-release` removes bounds
 checking. I have not looked at the assembly between the two, but I
 suspect that Crystal inlines the callback and D does not.
To encourage inlining, you could make it an alias parameter instead of a delegate, something like this:

	void overlap(alias cb)(SType start, SType stop) { ... }
	...
	bed[chr].overlap!callback(st0, en0);

This doesn't guarantee inlining, though. And no guarantee it will actually improve performance.
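To make the idea concrete, here's a self-contained toy (the linear scan and the free-function form are stand-ins, not your actual IITree):

	import std.stdio : writeln;

	struct IITree
	{
		int[] starts, stops;
	}

	// 'cb' is a compile-time alias parameter: each callback gets its
	// own instantiation of overlap, which the optimizer can inline.
	void overlap(alias cb)(ref IITree tree, int start, int stop)
	{
		foreach (i; 0 .. tree.starts.length)
			if (tree.starts[i] < stop && start < tree.stops[i])
				cb(tree.starts[i], tree.stops[i]);
	}

	void main()
	{
		auto tree = IITree([1, 5, 10], [4, 9, 15]);
		// UFCS call; prints the intervals overlapping [3, 11).
		tree.overlap!((s, e) => writeln(s, "\t", e))(3, 11);
	}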
 I also think there is room for improvement in the IO, as I'm just
 using the defaults.
I wouldn't spend too much time optimizing I/O without profiling it first, to check that it's actually a bottleneck. If I/O turns out to be a real bottleneck, you could try using std.mmfile.MmFile to mmap the input directly into the program's address space, which should give you a speed boost.
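For instance, a minimal sketch ("input.bed" and the line handling are placeholders):

	import std.algorithm.iteration : splitter;
	import std.mmfile : MmFile;

	void main()
	{
		// The OS pages the file in on demand; the slices point
		// straight into the mapping, so nothing is copied here.
		scope mm = new MmFile("input.bed");
		auto data = cast(const(char)[]) mm[];
		foreach (line; data.splitter('\n'))
		{
			// process the line ...
		}
	}

T

-- 
An imaginary friend squared is a real enemy.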
Jun 11 2020
parent reply duck_tape <sstadick gmail.com> writes:
On Thursday, 11 June 2020 at 22:19:27 UTC, H. S. Teoh wrote:
 To encourage inlining, you could make it an alias parameter 
 instead of a delegate, something like this:

 	void overlap(alias cb)(SType start, SType stop) { ... }
 	...
 	bed[chr].overlap!callback(st0, en0);
I don't think LDC can handle that yet. I get an error saying

```
source/app.d(72,7): Error: function app.main.overlap!(callback).overlap requires a dual-context, which is not yet supported by LDC
```

and I see an open ticket for it on the LDC project.
Jun 11 2020
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jun 11, 2020 at 10:41:12PM +0000, duck_tape via Digitalmars-d-learn
wrote:
 On Thursday, 11 June 2020 at 22:19:27 UTC, H. S. Teoh wrote:
 To encourage inlining, you could make it an alias parameter instead
 of a delegate, something like this:
 
 	void overlap(alias cb)(SType start, SType stop) { ... }
 	...
 	bed[chr].overlap!callback(st0, en0);
 
 I don't think LDC can handle that yet. I get an error saying
 ```
 source/app.d(72,7): Error: function app.main.overlap!(callback).overlap
 requires a dual-context, which is not yet supported by LDC
 ```
 and I see an open ticket for it on the LDC project.
Oh right. :-(  In any case, I'm a little skeptical whether this is the performance bottleneck anyway. But one simple thing to try is to add 'scope' to the callback parameter, which could potentially save you a GC allocation. I'm not 100% certain this will make a difference, but since it's such an easy change it's worth a shot.
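Something along these lines (a sketch against a made-up overlap signature, not your actual one):

	// 'scope' promises the delegate won't outlive the call, so the
	// compiler may keep its closure on the caller's stack instead of
	// allocating it on the GC heap.
	void overlap(int start, int stop, scope void delegate(int, int) cb)
	{
		cb(start, stop); // stand-in body
	}

	void main()
	{
		int hits;
		overlap(3, 11, (s, e) { ++hits; });
	}

T

-- 
Philosophy: how to make a career out of daydreaming.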
Jun 11 2020
parent reply duck_tape <sstadick gmail.com> writes:
On Thursday, 11 June 2020 at 22:57:55 UTC, H. S. Teoh wrote:

 But one simple thing to try is to add 'scope' to the callback 
 parameter, which could potentially save you a GC allocation. 
 I'm not 100% certain this will make a difference, but since 
 it's such an easy change it's worth a shot.
I will give that a shot! Also of interest, the profiler results on a full run-through do show file writing and int parsing as the 2nd and 3rd most time-consuming activities:

```
  Num          Tree        Func        Per
  Calls        Time        Time        Call

 8942869       46473       44660          0  void app.IITree!(int, bool).IITree.overlap(int, int, void delegate(app.IITree!(int, bool).IITree.Interval))
 8942869       33065        9656          0  safe void std.stdio.File.write!(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char).write(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char)
 20273052      10024        9569          0  pure safe int std.conv.parse!(int, char[]).parse(ref char[])
 1             128571       8894       8894  _Dmain
 80485821      6539         6539          0  nothrow nogc trusted ulong std.stdio.trustedFwrite!(char).trustedFwrite(shared(core.stdc.stdio.__sFILE)*, const(char[]))
 17885738      8606         3808          0  safe void std.conv.toTextRange!(int, std.stdio.File.LockingTextWriter).toTextRange(int, std.stdio.File.LockingTextWriter)
 30409578      3751         3751          0  pure nothrow nogc trusted char[] std.algorithm.searching.find!("a == b", char[], char).find(char[], char).trustedMemchr(ref char[], ref char)
 10136528      3300         3274          0  ulong std.stdio.File.readln!(char).readln(ref char[], dchar)
 30409578      13151        3047          0  pure safe char[] app.next!(std.algorithm.iteration.splitter!("a == b", char[], char).splitter(char[], char).Result).next(ref std.algorithm.iteration.splitter!("a == b", char[], char).splitter(char[], char).Result)
 30409578      8964         2605          0  pure property safe char[] std.algorithm.iteration.splitter!("a == b", char[], char).splitter(char[], char).Result.front()
 30409578      6289         2471          0  pure safe char[] std.algorithm.searching.find!("a == b", char[], char).find(char[], char)
```
Jun 11 2020
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jun 11, 2020 at 11:02:21PM +0000, duck_tape via Digitalmars-d-learn
wrote:
[...]
 I will give that a shot! Also of interest, the profiler results on a
 full runthrough do show file writing and int parsing as the 2nd and
 3rd most time consuming activities:
 
 ```
   Num          Tree        Func        Per
   Calls        Time        Time        Call

  8942869       46473       44660          0  void app.IITree!(int, bool).IITree.overlap(int, int, void delegate(app.IITree!(int, bool).IITree.Interval))
  8942869       33065        9656          0  safe void std.stdio.File.write!(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char).write(char[], immutable(char)[], char[], immutable(char)[], char[], immutable(char)[], int, immutable(char)[], int, char)
  20273052      10024        9569          0  pure safe int std.conv.parse!(int, char[]).parse(ref char[])
 ```
Hmm, looks like it's not so much input that's slow, but *output*. In fact, it looks pretty bad, taking almost as much time as overlap() does in total!

This makes me think that writing your own output buffer could be worthwhile. Here's a quick-n-dirty way of doing that:

	import std.array : appender;
	auto filebuf = appender!(char[]);
	...
	// Replace every call to writeln with this:
	put(filebuf, text(... /* arguments go here */ ..., "\n"));

	...

	// At the end of the main loop:
	enum bufLimit = 0x1000; // whatever the limit you want
	if (filebuf.data.length > bufLimit) {
		write(filebuf.data); // flush output data
		stdout.flush;
		filebuf.clear;
	}

This is just a rough sketch for an initial test, of course. For a truly optimized output buffer I'd write a container struct with methods for managing the appending and flushing of output. But this is just to get an idea of whether it actually improves performance before investing more effort into going in this direction.
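For reference, a rough sketch of what such a container struct might look like (hypothetical; the name and sizes are made up):

	import std.stdio : stdout;

	struct OutputBuffer
	{
		char[] buf;
		size_t len;

		this(size_t capacity) { buf = new char[capacity]; }

		void put(const(char)[] s)
		{
			// Flush when the item won't fit in the remaining space.
			if (len + s.length > buf.length)
				flush();
			if (s.length > buf.length)
				stdout.rawWrite(s); // oversized item: bypass the buffer
			else
			{
				buf[len .. len + s.length] = s[];
				len += s.length;
			}
		}

		void flush()
		{
			if (len)
				stdout.rawWrite(buf[0 .. len]);
			len = 0;
		}
	}

Remember to call flush() once at the end of the program.

T

-- 
He who sacrifices functionality for ease of use, loses both and deserves neither. -- Slashdotter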
Jun 11 2020
parent reply duck_tape <sstadick gmail.com> writes:
On Thursday, 11 June 2020 at 23:45:31 UTC, H. S. Teoh wrote:
 Hmm, looks like it's not so much input that's slow, but 
 *output*. In fact, it looks pretty bad, taking almost as much 
 time as overlap() does in total!

 This makes me think that writing your own output buffer could 
 be worthwhile.  Here's a quick-n-dirty way of doing that:

 	import std.array : appender;
 	auto filebuf = appender!(char[]);
 	...
 	// Replace every call to writeln with this:
 	put(filebuf, text(... /* arguments go here */ ..., "\n"));

 	...

 	// At the end of the main loop:
 	enum bufLimit = 0x1000; // whatever the limit you want
 	if (filebuf.data.length > bufLimit) {
 		write(filebuf.data); // flush output data
 		stdout.flush;
 		filebuf.clear;
 	}

 This is just a rough sketch for an initial test, of course.  
 For a truly optimized output buffer I'd write a container 
 struct with methods for managing the appending and flushing of 
 output. But this is just to get an idea of whether it actually 
 improves performance before investing more effort into going in 
 this direction.


 T
I'll play with that a bit tomorrow! I saw a nice implementation in eBay's tsv-utils that I may need to look closer at.

Someone else suggested that stdout flushes per line by default. I dug around the stdlib but couldn't confirm that. I also played around with setvbuf but it didn't seem to change anything. Have you run into that before / know if stdout is flushing every newline? I'm not above opening '/dev/stdout' as a file if that writes faster.
Jun 11 2020
parent reply Jon Degenhardt <jond noreply.com> writes:
On Friday, 12 June 2020 at 00:58:34 UTC, duck_tape wrote:
 On Thursday, 11 June 2020 at 23:45:31 UTC, H. S. Teoh wrote:
 Hmm, looks like it's not so much input that's slow, but 
 *output*. In fact, it looks pretty bad, taking almost as much 
 time as overlap() does in total!

 [snip...]
 I'll play with that a bit tomorrow! I saw a nice implementation in
 eBay's tsv-utils that I may need to look closer at. Someone else
 suggested that stdout flushes per line by default. I dug around the
 stdlib but couldn't confirm that. I also played around with setvbuf
 but it didn't seem to change anything. Have you run into that before /
 know if stdout is flushing every newline? I'm not above opening
 '/dev/stdout' as a file if that writes faster.
I put some comparative benchmarks in https://github.com/jondegenhardt/dcat-perf. It compares input and output using standard Phobos facilities (File.byLine, File.write), iopipe (https://github.com/schveiguy/iopipe), and the tsv-utils buffered input and buffered output facilities.

I haven't spent much time on results presentation; I know it's not that easy to read and interpret the results. Brief summary: on files with short lines, buffering results in dramatic throughput improvements over the standard Phobos facilities. This is true for both input and output, though likely for different reasons. For input, iopipe is the fastest available. The tsv-utils buffered facilities are materially faster than Phobos for both input and output, but not as fast as iopipe for input. Combining iopipe for input with the tsv-utils BufferedOutputRange for output works pretty well.

For files with long lines, both iopipe and the tsv-utils bufferedByLine are materially faster than Phobos File.byLine when reading. For writing there wasn't much difference from Phobos File.write.

A note on File.byLine - I've had many opportunities to compare Phobos File.byLine to facilities in other programming languages, and it is not bad at all. But it is beatable.

About memory-mapped files - the benchmarks don't include a comparison against mmfile. They certainly make sense as a comparison point.

--Jon
Jun 11 2020
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jun 12, 2020 at 03:32:48AM +0000, Jon Degenhardt via
Digitalmars-d-learn wrote:
[...]
 I haven't spent much time on results presentation, I know it's not
 that easy to read and interpret the results. Brief summary - On files
 with short lines buffering will result in dramatic throughput
 improvements over the standard phobos facilities. This is true for
 both input and output, though likely for different reasons. For input
 iopipe is the fastest available. tsv-utils buffered facilities are
 materially faster than phobos for both input and output, but not as
 fast as iopipe for input. Combining iopipe for input with tsv-utils
 BufferedOutputRange for output works pretty well.
 
 For files with long lines both iopipe and tsv-utils BufferedByLine are
 materially faster than Phobos File.byLine when reading. For writing
 there wasn't much difference from Phobos File.write.
Interesting. Based on the OP's posted profile data, I got the impression that input wasn't a big deal, but output was. I wonder why.
 A note on File.byLine - I've had many opportunities to compare Phobos
 File.byLine to facilities in other programming languages, and it is
 not bad at all. But it is beatable.
I glanced over the implementation of byLine. It appears to be the unhappy compromise of trying to be 100% correct, cover all possible UTF encodings, and all possible types of input streams (on-disk file vs. interactive console). It does UTF decoding and resizing of arrays, and a lot of other frilly little squirrelly things. In fact I'm dismayed at how hairy it is, considering the conceptual simplicity of the task!

Given this, it will definitely be much faster to load large chunks of the file at a time into a buffer, and scan in-memory for linebreaks. I wouldn't bother with decoding at all; I'd just precompute the byte sequence of the linebreaks for whatever encoding the file is expected to be in, and just scan for that byte pattern and return slices to the data.
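A minimal sketch of that approach (assumes UTF-8 input with plain '\n' terminators; the buffer size is arbitrary):

	import std.stdio : File, stdin;

	// Read large chunks and hand out slices into the buffer; only
	// lines that straddle a chunk boundary get copied.
	void eachLine(File f, scope void delegate(const(char)[]) sink)
	{
		auto buf = new ubyte[64 * 1024];
		char[] carry;
		foreach (ubyte[] chunk; f.byChunk(buf))
		{
			auto text = cast(char[]) chunk;
			size_t start = 0;
			foreach (i, c; text)
			{
				if (c != '\n') continue;
				if (carry.length)
				{
					carry ~= text[start .. i];
					sink(carry);
					carry = null;
				}
				else
					sink(text[start .. i]);
				start = i + 1;
			}
			if (start < text.length)
				carry ~= text[start .. $];
		}
		if (carry.length)
			sink(carry); // final line without a trailing newline
	}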
 About Memory Mapped Files - The benchmarks don't include compare
 against mmfile. They certainly make sense as a comparison point.
[...]

I'd definitely seriously consider using std.mmfile if I/O is determined to be a significant bottleneck. Letting the OS page in the file on-demand for you, instead of copying buffers across the C file API boundary, is definitely going to be a lot faster. Plus it will greatly simplify the code: you could just arbitrarily scan and slice over the file data without needing to manually manage buffers on your own, so your code will be much simpler and conducive for the compiler to squeeze the last bit of speed juice out of.

I'd definitely avoid stdio.byLine if input was determined to be a bottleneck: decoding characters from file data just to find linebreaks seems to me to be a slow way of doing things.

Having said all of that, though: usually in non-trivial programs reading input is the least of your worries, so this kind of micro-optimization is probably unwarranted except for very niche cases, micro-benchmarks, and other such toy programs where the cost of I/O constitutes a significant chunk of running time. But knowing what byLine does under the hood is definitely interesting information for me to keep in mind, the next time I write an input-heavy program.

(I'm reminded of that one time when, as a little diversion, I decided to see if I could beat GNU wc at counting lines in a file. It was not easy to beat, since wc is optimized to next year and back, but eventually a combination of std.mmfile and std.parallelism to scan large chunks of the file simultaneously managed to beat wc by a good margin. In the meantime, though, I also discovered that a file of very short lines triggers poor performance out of wc, whereas a file of very long lines triggers the best performance -- because glibc's memchr appears to be optimized for a micro-benchmark geared towards scanning arrays with only rare occurrences of the sought character, while typical text files exhibit much more frequent matches (shorter lines). When fed a file of very short lines, the overhead of the hyper-optimized code added up significantly, whereas with sufficiently long lines the benefit far outweighed the overhead. Optimization is a tricky beast: always make sure to measure and optimize for your actual use case rather than making your code look good on some artificial micro-benchmark, else your code may look good on the benchmark but actually perform poorly on real-world data.)
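The line-counting trick was essentially this (a from-memory sketch, not the original code; the chunk size is arbitrary):

	import core.atomic : atomicOp;
	import std.algorithm.searching : count;
	import std.mmfile : MmFile;
	import std.parallelism : parallel;
	import std.range : chunks;

	size_t countLines(string path)
	{
		scope mm = new MmFile(path);
		auto data = cast(const(ubyte)[]) mm[];
		shared size_t total;
		// Scan 4 MiB windows of the mapping on all cores.
		foreach (window; parallel(data.chunks(4 * 1024 * 1024)))
			atomicOp!"+="(total, window.count('\n'));
		return total;
	}

T

-- 
May you live all the days of your life. -- Jonathan Swift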
Jun 11 2020
parent reply Jon Degenhardt <jond noreply.com> writes:
On Friday, 12 June 2020 at 06:20:59 UTC, H. S. Teoh wrote:
 I glanced over the implementation of byLine.  It appears to be 
 the unhappy compromise of trying to be 100% correct, cover all 
 possible UTF encodings, and all possible types of input streams 
 (on-disk file vs. interactive console).  It does UTF decoding 
 and resizing of arrays, and a lot of other frilly little 
 squirrelly things.  In fact I'm dismayed at how hairy it is, 
 considering the conceptual simplicity of the task!

 Given this, it will definitely be much faster to load in large 
 chunks of the file at a time into a buffer, and scanning 
 in-memory for linebreaks. I wouldn't bother with decoding at 
 all; I'd just precompute the byte sequence of the linebreaks 
 for whatever encoding the file is expected to be in, and just 
 scan for that byte pattern and return slices to the data.
This is basically what bufferedByLine in tsv-utils does. See: https://github.com/eBay/tsv-utils/blob/master/common/src/tsv_utils/common/utils.d#L793

tsv-utils has the advantage of only needing to support utf-8 files with Unix newlines, so the code is simpler. (Windows newlines are detected; this occurs separately from bufferedByLine.) But as you describe, support for a wider variety of input cases could be done without sacrificing basic performance. iopipe provides much more generic support, and it is quite fast.
 Having said all of that, though: usually in non-trivial 
 programs reading input is the least of your worries, so this 
 kind of micro-optimization is probably unwarranted except for 
 very niche cases and for micro-benchmarks and other such toy 
 programs where the cost of I/O constitutes a significant chunk 
 of running times.  But knowing what byLine does under the hood 
 is definitely interesting information for me to keep in mind, 
 the next time I write an input-heavy program.
tsv-utils tools saw performance gains of 10-40% by moving from File.byLine to bufferedByLine, depending on tool and type of file (narrow or wide). Gains of 5-20% were obtained by switching from File.write to BufferedOutputRange, with some special cases improving by 50%.

tsv-utils tools aren't micro-benchmarks, but they are not typical apps either. Most of the tools go into a tight loop of some kind, running a transformation on the input and writing to the output. Performance is a real benefit to these tools, as they get run on reasonably large data sets.
Jun 12 2020
parent reply duck_tape <sstadick gmail.com> writes:
On Friday, 12 June 2020 at 07:25:09 UTC, Jon Degenhardt wrote:
 tsv-utils has the advantage of only needing to support utf-8 
 files with Unix newlines, so the code is simpler. (Windows 
 newlines are detected, this occurs separately from 
 bufferedByLine.) But as you describe, support for a wider 
 variety of input cases could be done without sacrificing basic 
 performance. iopipe provides much more generic support, and it 
 is quite fast.
I will have to look into iopipe for sure. All this info is great.

For this particular benchmark the goal is just to show off some 'high-level' languages and how close to C they can get. If I can avoid going way into the weeds writing my own output methods, that's more in the spirit of things. However, I do intend to be using D for bioinformatics, which is incredibly IO-intensive, so much of this will be put to good use.

For speedups from getting my hands dirty:
- Does writef and company flush on every line? I still haven't found the source of this.
- It looks like I could use {f}printf if I really wanted to: https://forum.dlang.org/post/hzcjbanvkxgohkbvjnkv forum.dlang.org

It's particularly interesting what is said about short lines doing worse, because these lines are pretty short, usually less than 20 characters.
Jun 12 2020
parent reply duck_tape <sstadick gmail.com> writes:
On Friday, 12 June 2020 at 12:02:19 UTC, duck_tape wrote:
 For speedups from getting my hands dirty:
 - Does writef and company flush on every line? I still haven't 
 found the source of this.
 - It looks like I could use {f}printf if I really wanted to: 
 https://forum.dlang.org/post/hzcjbanvkxgohkbvjnkv forum.dlang.org
Switching to using `core.stdc.stdio.printf` shaved off nearly two seconds (11->9)!

Once I wrap this up for submission to biofast I will play with memory mapping / iopipe / tsv-utils buffered writers. Sambamba is also doing some non-standard tweaks to its outputting as well.

I'm still convinced that stdout is flushing by line.
Jun 12 2020
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jun 12, 2020 at 12:11:44PM +0000, duck_tape via Digitalmars-d-learn
wrote:
 On Friday, 12 June 2020 at 12:02:19 UTC, duck_tape wrote:
 For speedups with getting my hands dirty:
 - Does writef and company flush on every line? I still haven't found
 the source of this.
writef, et al, ultimately goes through LockingTextWriter in std.stdio.File:

	https://github.com/dlang/phobos/blob/master/std/stdio.d#L2890

Looks like it's doing some Unicode manipulation and writing character by character -- a pretty slow proposition IMO! It was done this way for Unicode-correctness, AFAICT, but if you already know the final form your output is going to take, directly calling fwrite(), or the D wrapper File.rawWrite(), will probably give you a significant performance boost.
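For example (a trivial sketch; the output line is made up):

	import std.stdio : stdout;

	void main()
	{
		// rawWrite bypasses the Unicode-aware text writer and hands
		// the bytes straight to the underlying fwrite.
		stdout.rawWrite("chr1\t100\t200\n");
	}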
 - It looks like I could use {f}printf if I really wanted to:
 https://forum.dlang.org/post/hzcjbanvkxgohkbvjnkv forum.dlang.org
Be aware that D strings, other than string literals, are generally NOT null-terminated, so you need to call toStringz before calling fprintf, otherwise you might be in for a nasty surprise. :-P  Other than that, calling C from D is pretty easy:

	import std.string : toStringz;

	extern(C) int printf(const(char)*, ...);

	void myDCode(string data)
	{
		printf("%s\n", data.toStringz); // calls C printf
	}
 On Friday, 12 June 2020 at 12:02:19 UTC, duck_tape wrote:
 
 Switching to using `core.stdc.stdio.printf` shaved off nearly two
 seconds (11->9)!
 
 Once I wrap this up for submission to biofast I will play with
 memory mapping / iopipe / tsv-utils buffered writers. Sambamba is also
 doing some non-standard tweaks to its outputting as well.
 
 I'm still convinced that stdout is flushing by line.
It seems likely, if you're outputting to a terminal. Otherwise, it's likely the performance slowdown is caused by Unicode manipulation code inside LockingTextWriter.

On that note, somebody should get to the bottom of this and submit a PR to Phobos with a fast-track path for the (IMO very common) case where the string can just be fwrite'd straight into the output. AFAICT, all the extra baggage currently in LockingTextWriter is mainly to deal with the case where the OS expects a (slightly) different encoding for text than is internally represented, e.g., classic 0x0D 0x0A DOS line endings (which I hear are obsolete these days, so even that case may not be as common as it used to be), or converting UTF-16 output to UTF-8 or vice versa. I'm skeptical whether this is the common case these days, so having a fast path for UTF-8 -> UTF-8 (i.e., just fwrite the whole thing straight to the file) would be a good improvement for D.

T

-- 
Nobody is perfect. I am Nobody. -- pepoluan, GKC forum
Jun 12 2020