digitalmars.D - [OT] Application case study comparing Java, Go, and C++

Jon Degenhardt <jond noreply.com> writes:
This paper may be of interest to people here:

"A comparison of three programming languages for a full-fledged 
next-generation sequencing tool", P.Costanza, C.Herzeel, 
W.Verachtert
https://doi.org/10.1101/558056

The paper compares implementations of a tool operating on SAM/BAM 
files (bioinformatics) from a performance perspective. The focus 
is on comparing the GC schemes used in Go and Java with reference 
counting in C++. The GC-based implementations were materially 
faster.

I'm not familiar with the authors or the implementations, so I 
cannot say how well the implementations were done. However, it 
appears to be a useful case study, and the authors do provide a 
fair bit of analysis in the paper.

There's a reddit thread also: 
https://www.reddit.com/r/programming/comments/avsfc6/performance_comparison_of_go_c_and_java_for/
Feb 28 2019
Seb <seb wilzba.ch> writes:
On Thursday, 28 February 2019 at 20:48:01 UTC, Jon Degenhardt 
wrote:
 [...]
I wouldn't give much weight to this paper. It hasn't been peer reviewed, and I doubt it would pass peer review. A quick example: "It [their tool] can be used as a drop-in replacement for many operations implemented by SAMtools [...]". Yet no performance comparison was done against samtools (nor against any other tools except their own implementations). I find this pretty shocking, because the entire purpose of the paper is performance...

For reference, samtools is the de facto standard for a reason (yes, it's old and written in C). Though, to be fair, sambamba (written in D) is faster than the C "standard" implementation: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765878
Feb 28 2019
Jon Degenhardt <jond noreply.com> writes:
On Thursday, 28 February 2019 at 22:58:54 UTC, Seb wrote:
 On Thursday, 28 February 2019 at 20:48:01 UTC, Jon Degenhardt 
 wrote:
 [...]
They do have benchmark comparisons against GATK 4 in another paper: "elPrep 4: A multithreaded framework for sequence analysis" https://doi.org/10.1371/journal.pone.0209523

I'm not so familiar with these tool sets. How does GATK 4 stack up against other tools? From the paper it looks like many of the performance gains over GATK 4 resulted from architecture and algorithm changes, so that benchmark may not be valid for comparing C++/Go/Java or GC vs. reference counting.
Feb 28 2019
Pjotr Prins <pjotr.public12 thebird.nl> writes:
On Thursday, 28 February 2019 at 23:50:44 UTC, Jon Degenhardt 
wrote:
 On Thursday, 28 February 2019 at 22:58:54 UTC, Seb wrote:
 On Thursday, 28 February 2019 at 20:48:01 UTC, Jon Degenhardt 
 wrote:
 [...]
As the co-author of sambamba, and having a pretty good understanding of samtools, I call BS on the mentioned Go/C++/Java comparison paper. It is all about the implementation, i.e., the programmer. Saying that Go is faster than C++ makes no sense to me (go figure). Maybe the C++ implementation should have used a ring buffer like Sambamba does in D (Artem did the smart thing).

One reason I like chess is that it is an honest comparison of skill. Have two people play and you can tell quickly who is superior. In computing we don't have such an easy framework. You can compare tools, i.e., implementations, but turning that into a language comparison is bound to be flawed. The problem with that comparison paper is the way they wrote it up.
Mar 01 2019
Jon Degenhardt <jond noreply.com> writes:
On Friday, 1 March 2019 at 12:46:12 UTC, Pjotr Prins wrote:
 On Thursday, 28 February 2019 at 23:50:44 UTC, Jon Degenhardt 
 wrote:
 On Thursday, 28 February 2019 at 22:58:54 UTC, Seb wrote:
 On Thursday, 28 February 2019 at 20:48:01 UTC, Jon Degenhardt 
 wrote:
 [...]
Thanks for the feedback (both Seb and Pjotr). It's too bad the paper doesn't provide more meaningful value, as application-level comparisons of alternative programming environments are quite rare. Application-level benchmarks are useful in conjunction with the micro-benchmarks that are more the norm, and in my view they are the more important of the two.

But if the work isn't well founded, or at least can't be shown to be well founded, then it's not useful. If there were a number of similar results, it might be seen as contributing evidence. As a single work it would always need to be viewed skeptically, but if people who have expertise in the application area don't find it worthy, well...

--Jon
Mar 01 2019