
digitalmars.D - They wrote the fastest parallelized BAM parser in D

reply "george" <georgkam gmail.com> writes:
http://bioinformatics.oxfordjournals.org/content/early/2015/02/18/bioinformatics.btv098.full.pdf+html

and a feature
http://google-opensource.blogspot.nl/2015/03/gsoc-project-sambamba-published-in.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+GoogleOpenSourceBlog+(Google+Open+Source+Blog)


D may hold a sweet spot in bioinformatics, where you often require 
quick turnaround (productivity), raw speed, and agility.
Mar 29 2015
next sibling parent "Laeeth Isharc" <nospamlaeeth nospam.laeeth.com> writes:
On Monday, 30 March 2015 at 06:50:19 UTC, george wrote:
 http://bioinformatics.oxfordjournals.org/content/early/2015/02/18/bioinformatics.btv098.full.pdf+html

 and a feature
 http://google-opensource.blogspot.nl/2015/03/gsoc-project-sambamba-published-in.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+GoogleOpenSourceBlog+(Google+Open+Source+Blog)


 D may hold a sweet spot in bioinformatics where you often 
 require quick turnaround (productivity) , raw speed and agility.
Thanks. Added to the Python section of the wiki here: http://wiki.dlang.org/Coming_From/Python

But we should also create anchors for guides to different use domains for D: finance, bioinformatics, etc. Enterprise users often like to know they are not the first.
Mar 30 2015
prev sibling next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Monday, 30 March 2015 at 06:50:19 UTC, george wrote:
 http://bioinformatics.oxfordjournals.org/content/early/2015/02/18/bioinformatics.btv098.full.pdf+html

 and a feature
 http://google-opensource.blogspot.nl/2015/03/gsoc-project-sambamba-published-in.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+GoogleOpenSourceBlog+(Google+Open+Source+Blog)


 D may hold a sweet spot in bioinformatics where you often 
 require quick turnaround (productivity) , raw speed and agility.
.NET actually already has a foothold in bioinformatics, especially in user-facing software and steering of reading equipment and robots.

visualization) use cases.

-- Paulo
Mar 30 2015
parent reply "george" <georgkam gmail.com> writes:
 .NET actually already has a foothold in bioinformatics, 
 specially in user facing software and steering of reading 
 equipments and robots.


 visualization) use cases.

 --
 Paulo
Though when it comes to open source bioinformatics projects, Perl and Python have a large foothold among most bioinformaticians. Most utilities that require speed are written in C and C++ (BLAST, HMMER, SAMtools, etc.).

I think D stands a good chance as a language of choice for bioinformatics projects.

George
Mar 30 2015
next sibling parent reply Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Mon, 2015-03-30 at 18:04 +0000, george via Digitalmars-d wrote:
 .NET actually already has a foothold in bioinformatics,
 specially in user facing software and steering of reading
 equipments and robots.

 visualization) use cases.

 --
 Paulo
Paulo, Can you send me some pointers to this stuff?

 Though when it comes to open source bioinformatics projects, Perl
 and Python have a large foothold
 among most most bioinformaticians. Most utilities that require
 speed are often written in C and C++ (BLAST, HMMER, SAMTOOLS etc).

 I think D stands a good chance as a language of choice for
 bioinformatics projects.

 George
My "prejudice", based on training people in Python and C++ over the last few years, is that Python and C++ have a very strong position in the bioinformatics community, with the use of IPython (now becoming Jupyter) increasing and solidifying the Python position. D's position is quite weak here because one of the important things is visualising data, something SciPy/Matplotlib are very good at. D has no real play in this arena and so there is no way (currently) of creating a foothold. Sad, but…
--
Russel.
Dr Russel Winder       t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road     m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK    w: www.russel.org.uk  skype: russel_winder
Mar 30 2015
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/30/15 11:23 AM, Russel Winder via Digitalmars-d wrote:
 On Mon, 2015-03-30 at 18:04 +0000, george via Digitalmars-d wrote:
 .NET actually already has a foothold in bioinformatics,
 specially in user facing software and steering of reading
 equipments and robots.


 visualization) use cases.

 --
 Paulo
Paulo, Can you send me some pointers to this stuff?
 Though when it comes to open source bioinformatics projects, Perl
 and Python have a large foothold
 among most most bioinformaticians. Most utilities that require
 speed are often written in C and C++ (BLAST, HMMER, SAMTOOLS etc).

 I think D stands a good chance as a language of choice for
 bioinformatics projects.

 George
My "prejudice", based on training people in Python and C++ over the last few years, is that Python and C++ have a very strong position in the bioinformatics community, with the use of IPython (now becoming Jupyter) increasing and solidifying the Python position. D's position is quite weak here because one of the important things is visualising data, something SciPy/Matplotlib are very good at. D has no real play in this arena and so there is no way (currently) of creating a foothold. Sad, but…
... incongruent with the recently-published bioinformatics paper. -- Andrei
Mar 30 2015
prev sibling next sibling parent reply "Laeeth Isharc" <nospamlaeeth nospam.laeeth.com> writes:
 My "prejudice", based on training people in Python and C++ over 
 the last few years, is that Python and C++ have a very strong 
 position in the bioinformatics community, with the use of 
 IPython (now becoming Jupyter) increasing and solidifying the 
 Python position.
It's just possible there is a selection effect ;) Plus the future may not be like the past.
 D's position is quite weak here because one of the important 
 things is visualising data, something SciPy/Matplotlib are very 
 good at. D has no real play in this arena and so there is no 
 way (currently) of
 creating a foothold. Sad, but…
You're right about the lack of visualization being a shame. I have been thinking about porting Bokeh bindings to D. There isn't much to it on the server side - all you need to do is build up the object model and translate it to JSON - but I don't have time right now to do it all myself. https://github.com/bokeh/bokeh

I did port the MathGL C API to D, although I haven't tested it beyond the simplest example yet. The C++ bindings aren't much work to add, although even the C API is not so ugly. http://mathgl.sourceforge.net/doc_en/Main.html
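For the curious, a minimal sketch of that server-side idea, using only std.json from Phobos. The Line type and field names are hypothetical placeholders, not the real Bokeh object model:

import std.json, std.stdio;

struct Line { double[] x, y; string colour = "navy"; }

JSONValue toModel(Line l)
{
    // build up the object model as plain JSON, in whatever shape
    // the browser-side client expects
    JSONValue j = ["type": "line"];
    j["x"]      = l.x;
    j["y"]      = l.y;
    j["colour"] = l.colour;
    return j;
}

void main()
{
    auto line = Line([1.0, 2.0, 3.0], [2.0, 4.0, 8.0]);
    writeln(toModel(line).toString);  // this JSON is what the JS client would consume
}

Most of the binding work would be deciding how to mirror the object hierarchy, not the serialisation itself.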
Mar 30 2015
parent reply "CraigDillabaugh" <craig.dillabaugh gmail.com> writes:
On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:

clip
 You're right about the lack of visualization being a shame. I 
 have been thinking about porting Bokeh bindings to D.  There 
 isn't much too it on the server side - all you need to do is 
 build up the object model and translate it to JSON - but I have 
 not time right now to do it all myself.
clip

A comment on the visualization thing. Is this really a big issue? Data processing (D's strong point) and visualization are different tasks, and presumably as long as outputs are written to standard file types (i.e. NetCDF, HDF5, or other domain-specific formats) then existing visualization tools should be usable.

I did some image processing work with D and didn't find the lack of specific D tools for visualization a big issue.

There is some advantage to being able to perform visualization tasks in the same language as you do the data processing work, but I wouldn't think this would be a major obstacle.
Mar 30 2015
next sibling parent "george" <georgkam gmail.com> writes:
 I did some image processing work with D and didn't find the 
 lack of specific D tools for visualization a big issue.

 There is some advantage to being able to perform visualization 
 tasks in the same lanaguage as you do the data processing work, 
 but I wouldn't this this would be a major obstacle.
I personally prefer the model where I create a tool that takes some input and provides output in a suitable format that I can load into a proper statistical environment (R or Julia) for visualisation and manipulation. Therefore I would rather write a tool that performs a single task optimally and pipes its output to a different tool for the next task. This way the tools can be combined into flexible pipelines:

raw data -> clean -> QC -> to format Y -> to format X -> tool A -> tool B -> visualize

George
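As a rough sketch of what one such single-purpose stage could look like in D - reading tab-separated records on stdin, keeping those that pass a quality cutoff, and passing them on unchanged to the next tool (the column layout and threshold are hypothetical):

import std.algorithm : splitter;
import std.array : array;
import std.conv : to;
import std.stdio;

void main()
{
    // filter stage in a pipeline: read records from stdin, keep the ones
    // whose (hypothetical) quality column meets the cutoff, write to stdout
    foreach (line; stdin.byLine)
    {
        auto fields = line.splitter('\t').array;
        if (fields.length > 2 && fields[2].to!double >= 30.0)
            writeln(line);
    }
}

Used as something like "zcat raw.tsv.gz | ./qcfilter | ./toformatY > ready.tsv" (tool names hypothetical), each stage stays small and the visualisation step can live in R or Julia.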
Mar 30 2015
prev sibling parent reply "lobo" <swamplobo gmail.com> writes:
On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh wrote:
 On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:

 clip
 You're right about the lack of visualization being a shame. I 
 have been thinking about porting Bokeh bindings to D.  There 
 isn't much too it on the server side - all you need to do is 
 build up the object model and translate it to JSON - but I 
 have not time right now to do it all myself.
clip A comment on the visualization thing. Is this really a big issue?
[snip]

Yes, of course. Why do you think Python + SciPy/NumPy has such a foothold in the scientific community? Visualisation is an important part of the data processing pipeline. It's also why Matlab is so useful for those lucky enough to work for a company that can afford it.

bye,
lobo
Mar 30 2015
parent reply "Craig Dillabaugh" <craig.dillabaugh gmail.com> writes:
On Monday, 30 March 2015 at 22:55:37 UTC, lobo wrote:
 On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh wrote:
 On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:

 clip
 You're right about the lack of visualization being a shame. I 
 have been thinking about porting Bokeh bindings to D.  There 
 isn't much too it on the server side - all you need to do is 
 build up the object model and translate it to JSON - but I 
 have not time right now to do it all myself.
clip A comment on the visualization thing. Is this really a big issue?
[snip] Yes of course, why do you think Pyhton + sciPy/Numpy has such a foothold in the scientific community. Visualisation is an important part of data processing pipeline. It's also why Matlab is so useful for those lucky enough to work for a company that can afford it. bye, lobo
My point wasn't that visualization isn't important; it is that in most scientific computing it is very easy (and sensible) to separate the processing and visualization aspects, so the lack of D visualization tools should not hinder its value as a data processing tool. For example, Hadoop is immensely popular for data processing, but it includes no visualization tools. That is a slightly different domain, I understand, but there are similarities.

In short, nice D visualization tools would certainly be helpful, but I don't think their absence should be a showstopper.
Mar 30 2015
parent reply "Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:
On Tuesday, 31 March 2015 at 02:31:58 UTC, Craig Dillabaugh wrote:
 On Monday, 30 March 2015 at 22:55:37 UTC, lobo wrote:
 On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh 
 wrote:
 On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc wrote:

 clip
 You're right about the lack of visualization being a shame. 
 I have been thinking about porting Bokeh bindings to D.  
 There isn't much too it on the server side - all you need to 
 do is build up the object model and translate it to JSON - 
 but I have not time right now to do it all myself.
clip A comment on the visualization thing. Is this really a big issue?
[snip] Yes of course, why do you think Pyhton + sciPy/Numpy has such a foothold in the scientific community. Visualisation is an important part of data processing pipeline. It's also why Matlab is so useful for those lucky enough to work for a company that can afford it. bye, lobo
My point wasn't that visualization isn't important, it is that in most scientific computing it is very easy (and sensible) to separate the processing and visualization aspects. So lack of D visualization tools should not hinder its value as a data processing tool. For example, Hadoop is immensely popular for data processing, but it includes no visualization tools. That is a slightly different domain I understand, but there are similarities. So in short, if there were nice D visualization tools that would certainly be helpful, but I don't think is should be a show stopper.
Yes, I tried to pick my words carefully. It is not a disaster, as someone seemed to imply, but it would be nice to have visualization, particularly for interactive exploration of data. One is back to Walter's quote about the two-language combination being an indicator that something is lacking.
Mar 30 2015
parent reply "Andrew Brown" <aabrown24 hotmail.com> writes:
Visualisation is certainly not behind Python's success in 
bioinformatics, which predates IPython. If you look through 
journals, very few of the figures are done in Python (and none at 
all in Julia). It succeeded because it allows you to hack your 
way through massive text files, and it's not Perl.

One problem with using D instead of C or C++ for projects like 
this is that these projects are a few people developing software 
for many users, who are often working on very old clusters 
where they don't have admin rights. Getting an executable file to 
work for them is not trivial. Programs like samtools solve this 
by expecting people to compile them themselves, knowing they can 
rely on gcc being installed. But none of these clusters have a D 
compiler handy.

At my university, out-of-the-box executables for ldc don't run, 
gdc executables don't link with libc, and dmd sometimes 
shouts that it can't find dmd.conf. And this is a fairly up-to-date 
and well-administered cluster; I know quite a few institutions 
still on CentOS 5. Now, I can work to fix these problems for 
myself, but I can't expect a user to spend 3 hours compiling llvm, 
then ldc and various libraries, to use my software rather than 
just look for the C/C++ equivalent.

Yesterday I was asked if I'd rewrite my code in C++ to solve this 
problem - not really an option, as I don't know C++. I guess this 
is a fairly niche issue; D.learn kindly pointed me in the 
direction of VMs, which I think will solve most of my problems. 
The sambamba authors seem to be sharing Docker images (congrats on 
the paper, by the way!). But I think it is a factor to be considered 
when using D: disseminating software is trickier than with C/C++.

On Tuesday, 31 March 2015 at 03:30:09 UTC, Laeeth Isharc wrote:
 On Tuesday, 31 March 2015 at 02:31:58 UTC, Craig Dillabaugh 
 wrote:
 On Monday, 30 March 2015 at 22:55:37 UTC, lobo wrote:
 On Monday, 30 March 2015 at 20:25:33 UTC, CraigDillabaugh 
 wrote:
 On Monday, 30 March 2015 at 20:09:35 UTC, Laeeth Isharc 
 wrote:

 clip
 You're right about the lack of visualization being a shame. 
 I have been thinking about porting Bokeh bindings to D.  
 There isn't much too it on the server side - all you need 
 to do is build up the object model and translate it to JSON 
 - but I have not time right now to do it all myself.
clip A comment on the visualization thing. Is this really a big issue?
[snip] Yes of course, why do you think Pyhton + sciPy/Numpy has such a foothold in the scientific community. Visualisation is an important part of data processing pipeline. It's also why Matlab is so useful for those lucky enough to work for a company that can afford it. bye, lobo
My point wasn't that visualization isn't important, it is that in most scientific computing it is very easy (and sensible) to separate the processing and visualization aspects. So lack of D visualization tools should not hinder its value as a data processing tool. For example, Hadoop is immensely popular for data processing, but it includes no visualization tools. That is a slightly different domain I understand, but there are similarities. So in short, if there were nice D visualization tools that would certainly be helpful, but I don't think is should be a show stopper.
Yes, I tried to pick my words carefully. It is not a disaster, as a someone seemed to imply, but it would be nice to have visualization, particularly for interactive exploration of data. One is back to Walter's quote about the two language combination being an indicator that something is lacking.
Mar 31 2015
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 31 March 2015 at 08:09:00 UTC, Andrew Brown wrote:
 Visualisation is certainly not behind python's success in 
 bioinformatics, which predates ipython. If you look through 
 journals, very few of the figures are done in python (and none 
 at all in julia). It succeeded because it allows you to hack 
 your way through massive text files and it's not perl.

 One problem with using D instead of C or C++ for projects like 
 this, is that these projects are a few people developing 
 software for many users, who are working on frequently very old 
 clusters where they don't have admin rights. Getting an 
 executable file to work for them is not trivial. Programs like 
 samtools solve this by expecting people to compile it 
 themselves, knowing they can rely on gcc to be installed. But 
 none of these clusters have a D compiler handy.

 On my university, out of the box executables for ldc don't run, 
 gdc executable files don't link with libc, and dmd sometimes 
 shouts it can't find dmd.conf. And this is a fairly up to date 
 and well administered cluster, I know quite a few instituions 
 still on centOS 5. Now, I can work to fix these problems for 
 myself, but I can't expect a user spend 3 hours compiling llvm, 
 then ldc and various libraries to use my software, rather than 
 just look for the C/C++ equivalent.

 Yesterday I was asked if I'd rewrite my code in C++ to solve 
 this problem, not really an option as I don't know C++. I guess 
 this is a fairly niche issue, D Learn kindly pointed me in the 
 direction of VMs which I think will solve most of my problems. 
 The sambabamba authors seem to be sharing dockers (congrat on 
 the paper by the way!). But I think it is a factor to be 
 considered when using D: disseminating software is trickier 
 than with C/C++.
Building LDC and its dependencies isn't that difficult, but it was still a pain to have to do that just to compile my code for the cluster. There needs to be some sort of bootstrap script, downloads included, to go from a bare-bones C++ toolchain to a working D compiler. Or even just some executables online compiled against an ancient glibc.
Mar 31 2015
prev sibling next sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Monday, 30 March 2015 at 18:23:31 UTC, Russel Winder wrote:
 On Mon, 2015-03-30 at 18:04 +0000, george via Digitalmars-d 
 wrote:
 .NET actually already has a foothold in bioinformatics, 
 specially in user facing software and steering of reading 
 equipments and robots.
 

 visualization) use cases.
 
 --
 Paulo
Paulo, Can you send me some pointers to this stuff?
Sure, just sent to your email. -- Paulo
Mar 30 2015
prev sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Monday, 30 March 2015 at 18:23:31 UTC, Russel Winder wrote:
 On Mon, 2015-03-30 at 18:04 +0000, george via Digitalmars-d 
 wrote:
 .NET actually already has a foothold in bioinformatics, 
 specially in user facing software and steering of reading 
 equipments and robots.
 

 visualization) use cases.
 
 --
 Paulo
Paulo, Can you send me some pointers to this stuff?
 
 Though when it comes to open source bioinformatics projects, 
 Perl and Python have a large foothold
 among most most bioinformaticians. Most utilities that require 
 speed are often written in C and C++ (BLAST, HMMER, SAMTOOLS 
 etc).
 
 I think D stands a good chance as a language of choice for 
 bioinformatics projects.
 
 George
My "prejudice", based on training people in Python and C++ over the last few years, is that Python and C++ have a very strong position in the bioinformatics community, with the use of IPython (now becoming Jupyter) increasing and solidifying the Python position. D's position is quite weak here because one of the important things is visualising data, something SciPy/Matplotlib are very good at. D has no real play in this arena and so there is no way (currently) of creating a foothold. Sad, but…
As Andrew Brown pointed out, visualization is not behind Python's success. Its success lies in the fact that it's a language you can hack away in easily. Almost everybody who has to do some data processing (most researchers do these days) and has limited or no experience with programming will opt for Python: it's easy (at first!), well documented, and everyone else uses it. However, the initial euphoria of being able to automatically rename files and extract value X from file Y soon gives way to frustration when it comes to performance.

The paper shows well that in a world where data processing is of utmost importance, and we're talking about huge sets of data, languages like Python don't cut it anymore. Two things are happening at the moment: on the one hand, people still use Python for various reasons (see above and hundreds of posts on this forum); on the other, there is growing discontent among researchers, scientists and engineers as regards performance, simply because the data sets are becoming bigger every day and the algorithms are getting more and more refined. Sooner or later people will have to find new ways, out of sheer necessity.

Don't forget that "the state of the art" can change very quickly in IT, and the name of the game is anticipating new developments rather than taking snapshots of the current state of the art and framing them. D really has a lot to offer for data processing, and I wouldn't rule out that more and more programmers will turn to it for this task.
Mar 31 2015
parent reply "Laeeth Isharc" <nospamlaeeth nospam.laeeth.com> writes:
 As Andrew Brown pointed out, visualization is not behind 
 Pythons success. Its success lies in the fact that it's a 
 language you can hack away in easily.
Sounds right. I am not in the camp that says it is a killer for D. It would just be nice to have at least a passable solution for visualization, and some way of making it interactive (the REPL might be one route). The problem with separating the processes completely and just piping the output from D code that does the heavy lifting to a Python or Julia front end is that it may make it more painful to play with and explore the data. My interests are finance more than science, so that may lead to a different set of needs.

Finishing MathGL and writing D bindings for Bokeh (take a look - it is pretty cool, particularly being able to use the browser as the client, acknowledging that it is a tradeoff) is not so much work. But some help on Bokeh particularly would be nice, as I fear picking one way of implementing the object structure and later finding it was a mistake.
 the initial euphoria of being able to automatically rename 
 files and extract value X from file Y soon gives way to 
 frustration when it comes to performance.
Yep.
 The paper shows well that in a world where data processing is 
 of utmost importance, and we're talking about huge sets of 
 data, languages like Python don't cut it anymore.
I could not agree more, and I do think the intersection of these two trends creates a tremendous opportunity for D. It's also commonsensical to look at notable successes - and I hope it is not just my biases that lead me to think many of these are in just this kind of application. Data sets keep getting larger (but not necessarily more information-rich in dollar terms), and Moore's Law/memory speed and latency are not keeping pace. This is exactly the kind of change that creeps up on you, because not much changes in a few months (which is the kind of horizon many of us tend to think in).

People ask "what is D's edge?", but my personal perception is "where is the competition for D?" in this area. It has to be native code/JIT, and I refuse to learn Java; it also should be plastic and lend itself to rapid iteration.
 at the same time there's growing discontent among researchers, 
 scientists and engineers as regards performance, simply because 
 the data sets are becoming bigger and bigger every day and the 
 algorithms are getting more and more refined. Sooner or later 
 people will have to find new ways, out of sheer necessity.
Upvote. I would love to see any references you have on this - not because it's not rather obvious to me, but because it is helpful when talking to other people.
 Don't forget that "the state of the art" can change very 
 quickly in IT and the name of the game is anticipating new 
 developments rather than taking snapshots of the current state 
 of the art and frame them. D really has a lot to offer for data 
 processing and I wouldn't rule it out that more and more 
 programmers will turn to it for this task.
I fully agree. If we started a section on use cases, would you be able to write a page or two on D's advantages in data processing?
Mar 31 2015
next sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Tuesday, 31 March 2015 at 11:04:50 UTC, Laeeth Isharc wrote:
 As Andrew Brown pointed out, visualization is not behind 
 Pythons success. Its success lies in the fact that it's a 
 language you can hack away in easily.
Sounds right. I am not in the camp that says it is a killer for D. It would just be nice to have both at least a passable solution for visualization, and some way of making it interactive. (The REPL might be one route). The problem with separating the processes completely and just piping the output from D code that does the heavy lifting to a python or julia front end is it may make it more painful to play with and explore the data. My interests are finance more than science, so that may lead to a different set of needs. Finishing mathgl and writing D bindings for bokeh (take a look - it is pretty cool, particularly to be able to use the browser as client, acknowledging that it is a tradeoff) is not so much work. But some help on bokeh particularly would be nice, as I fear picking one way of implementing the object structure and later finding it is a mistake.
 the initial euphoria of being able to automatically rename 
 files and extract value X from file Y soon gives way to 
 frustration when it comes to performance.
Yep.
 The paper shows well that in a world where data processing is 
 of utmost importance, and we're talking about huge sets of 
 data, languages like Python don't cut it anymore.
I could not agree more, and I do think the intersection of two trends creates tremendous opportunity for D. It's also commonsensical to look at notable successes - and I hope it is not just my biases that lead me to think many of these are in just this kind of application. Data sets keep getting larger (but not necessarily more information rich in dollar terms), and Moore's Law/memory speed+latency is not keeping pace. This is exactly the kind of change that creeps up on you because not much changes in a few months (which is the kind of horizon many of us tend to think in). People say "what is D's edge", but my personal perception is "where is the competition for D" in this area. It has to be native code/JIT, and I refuse to learn Java; it also should be plastic and lend itself to rapid iteration.
 at the same time there's growing discontent among researchers, 
 scientists and engineers as regards performance, simply 
 because the data sets are becoming bigger and bigger every day 
 and the algorithms are getting more and more refined. Sooner 
 or later people will have to find new ways, out of sheer 
 necessity.
upvote. I would love to see any references you have on this - not because it's not rather obvious to me, but because it is helpful when talking to other people.
The article that gave rise to this thread is a good reference. I came from a slightly different angle: I looked for alternatives to Python because I needed:

1. fast native execution (real time)
2. easy interfacing to C
3. cross-platform development

(Modern convenience, templates, ranges etc. were bonuses I discovered bit by bit.)

As regards algorithms and data processing, most people in research use Matlab (proprietary) and Python. However, in my field they're useless when it comes to building data-driven systems (fast analysis, retraining of the machine based on (slight) modifications) and putting computationally heavy algorithms into real-world applications. Proof of concept is all it usually amounts to. So D has a real chance here, because of

1. native code
2. modern convenience
3. templates, structs, mixins, ranges, std.algorithm etc.
4. interfacing to C libs (a tiny interop sketch follows below)
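On point 4, a minimal sketch of how low the barrier is - declaring a C function from libc directly and calling it, no binding generator involved:

// calling into a plain C library from D: declare the symbol and use it
extern (C) size_t strlen(const(char)* s);

void main()
{
    import std.stdio : writeln;
    // D string literals are zero-terminated, so they can be passed straight to C
    writeln(strlen("sambamba"));  // prints 8
}

For a whole library one would typically translate the header once (or reuse existing bindings) and link against the C library as usual.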
 Don't forget that "the state of the art" can change very 
 quickly in IT and the name of the game is anticipating new 
 developments rather than taking snapshots of the current state 
 of the art and frame them. D really has a lot to offer for 
 data processing and I wouldn't rule it out that more and more 
 programmers will turn to it for this task.
I fully agree. If we started a section on use cases, would you be able to write a page or two on D's advantages in data processing?
I think that Dicebot et al would have good examples.
Mar 31 2015
parent "Chris" <wendlec tcd.ie> writes:
On Tuesday, 31 March 2015 at 13:31:33 UTC, Chris wrote:
 On Tuesday, 31 March 2015 at 11:04:50 UTC, Laeeth Isharc wrote:
 As Andrew Brown pointed out, visualization is not behind 
 Pythons success. Its success lies in the fact that it's a 
 language you can hack away in easily.
Sounds right. I am not in the camp that says it is a killer for D. It would just be nice to have both at least a passable solution for visualization, and some way of making it interactive. (The REPL might be one route). The problem with separating the processes completely and just piping the output from D code that does the heavy lifting to a python or julia front end is it may make it more painful to play with and explore the data. My interests are finance more than science, so that may lead to a different set of needs. Finishing mathgl and writing D bindings for bokeh (take a look - it is pretty cool, particularly to be able to use the browser as client, acknowledging that it is a tradeoff) is not so much work. But some help on bokeh particularly would be nice, as I fear picking one way of implementing the object structure and later finding it is a mistake.
 the initial euphoria of being able to automatically rename 
 files and extract value X from file Y soon gives way to 
 frustration when it comes to performance.
Yep.
 The paper shows well that in a world where data processing is 
 of utmost importance, and we're talking about huge sets of 
 data, languages like Python don't cut it anymore.
I could not agree more, and I do think the intersection of two trends creates tremendous opportunity for D. It's also commonsensical to look at notable successes - and I hope it is not just my biases that lead me to think many of these are in just this kind of application. Data sets keep getting larger (but not necessarily more information rich in dollar terms), and Moore's Law/memory speed+latency is not keeping pace. This is exactly the kind of change that creeps up on you because not much changes in a few months (which is the kind of horizon many of us tend to think in). People say "what is D's edge", but my personal perception is "where is the competition for D" in this area. It has to be native code/JIT, and I refuse to learn Java; it also should be plastic and lend itself to rapid iteration.
 at the same time there's growing discontent among 
 researchers, scientists and engineers as regards performance, 
 simply because the data sets are becoming bigger and bigger 
 every day and the algorithms are getting more and more 
 refined. Sooner or later people will have to find new ways, 
 out of sheer necessity.
upvote. I would love to see any references you have on this - not because it's not rather obvious to me, but because it is helpful when talking to other people.
The article that gave rise to this thread is a good reference. I came from a slightly different angle, I looked for alternatives to Python, because I needed: 1. fast native execution (real time) 2. easy interfacing to C 3. cross-platform development (Modern convenience, templates, ranges etc. were bonuses I discovered bit by bit) As regards algorithms and data processing, most people in research use Matlab (proprietary) and Python. However, in my field they're useless when it comes to building data-driven systems (fast analysis, retraining of machine based on (slight) modifications), and putting computationally heavy algorithms into real world applications. Proof of concept is all it amounts to, usually. So D has a real chance here, because of 1. native code 2. modern convenience 3. templates, structs, mixins, ranges, std.algorithm etcetc. 4. interfacing to C libs
 Don't forget that "the state of the art" can change very 
 quickly in IT and the name of the game is anticipating new 
 developments rather than taking snapshots of the current 
 state of the art and frame them. D really has a lot to offer 
 for data processing and I wouldn't rule it out that more and 
 more programmers will turn to it for this task.
I fully agree. If we started a section on use cases, would you be able to write a page or two on D's advantages in data processing?
I think that Dicebot et al would have good examples.
It'd be nice if we had a dedicated data-analysis section and/or library. I'm almost sure that people working with massive amounts of data would find it by googling "efficient data analysis" or something like that. Facebook probably has a wealth of data analysis examples/techniques, too.
Mar 31 2015
prev sibling parent "Paulo Pinto" <pjmlp progtools.org> writes:
On Tuesday, 31 March 2015 at 11:04:50 UTC, Laeeth Isharc wrote:
 As Andrew Brown pointed out, visualization is not behind 
 Pythons success. Its success lies in the fact that it's a 
 language you can hack away in easily.
Sounds right. I am not in the camp that says it is a killer for D. It would just be nice to have both at least a passable solution for visualization, and some way of making it interactive. (The REPL might be one route). The problem with separating the processes completely and just piping the output from D code that does the heavy lifting to a python or julia front end is it may make it more painful to play with and explore the data. My interests are finance more than science, so that may lead to a different set of needs. Finishing mathgl and writing D bindings for bokeh (take a look - it is pretty cool, particularly to be able to use the browser as client, acknowledging that it is a tradeoff) is not so much work. But some help on bokeh particularly would be nice, as I fear picking one way of implementing the object structure and later finding it is a mistake.
 the initial euphoria of being able to automatically rename 
 files and extract value X from file Y soon gives way to 
 frustration when it comes to performance.
Yep.
 The paper shows well that in a world where data processing is 
 of utmost importance, and we're talking about huge sets of 
 data, languages like Python don't cut it anymore.
I could not agree more, and I do think the intersection of two trends creates tremendous opportunity for D. It's also commonsensical to look at notable successes - and I hope it is not just my biases that lead me to think many of these are in just this kind of application. Data sets keep getting larger (but not necessarily more information rich in dollar terms), and Moore's Law/memory speed+latency is not keeping pace. This is exactly the kind of change that creeps up on you because not much changes in a few months (which is the kind of horizon many of us tend to think in). People say "what is D's edge", but my personal perception is "where is the competition for D" in this area. It has to be native code/JIT, and I refuse to learn Java; it also should be plastic and lend itself to rapid iteration.
It is in the JVM and .NET ecosystems. Both have AOT compilers available, are able to chew through data on GPGPUs, and offer SIMD libraries. This is why there is such a strong focus on value types and better C interop planned for Java 10, as its use for data analysis has been growing. In HPF, companies prefer to live with JVM workarounds for the current limitations rather than go out and hire a few C++ developers, given the amount of money saved in salaries.

--
Paulo
Mar 31 2015
prev sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Monday, 30 March 2015 at 18:04:58 UTC, george wrote:
 .NET actually already has a foothold in bioinformatics, 
 specially in user facing software and steering of reading 
 equipments and robots.


 visualization) use cases.

 --
 Paulo
Though when it comes to open source bioinformatics projects, Perl and Python have a large foothold among most most bioinformaticians. Most utilities that require speed are often written in C and C++ (BLAST, HMMER, SAMTOOLS etc). I think D stands a good chance as a language of choice for bioinformatics projects. George
Yes, on the server side and in UNIX-based research. However, I have learned in recent years that Windows-based systems are also used a lot, especially in controlling robots and doing the first processing steps and visualization. At least in commercial research.

--
Paulo
Mar 30 2015
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 30 March 2015 at 20:28:11 UTC, Paulo Pinto wrote:
 On Monday, 30 March 2015 at 18:04:58 UTC, george wrote:
 .NET actually already has a foothold in bioinformatics, 
 specially in user facing software and steering of reading 
 equipments and robots.


 visualization) use cases.

 --
 Paulo
Though when it comes to open source bioinformatics projects, Perl and Python have a large foothold among most most bioinformaticians. Most utilities that require speed are often written in C and C++ (BLAST, HMMER, SAMTOOLS etc). I think D stands a good chance as a language of choice for bioinformatics projects. George
Yes on the server side and UNIX based research. However, I have learned in the last years that Windows based systems are also used a lot, specially in controlling robots and doing the first processing steps and visualization. At least in commercial research. -- Paulo
Yes, to the benefit of literally no-one. To be fair, it's not a problem of the operating system, just that special-purpose GUI programmes for scientific work always seem to be utterly dreadful.

"Hey, we need to record some time series and show a spectrum on the fly." "OK, great, let's commission a closed-source Windows GUI application with its own proprietary file format; sure, it'll crash once a day and have scientifically important parameters hard-coded and undocumented, but at least you can point and click!"

It seems to be true across the board in government research facilities, pharmaceutical companies, most of academia and so on... Enormous piles of proprietary vomit being propped up by an endless stream of disinterested and semi-incompetent programmers, steadily digging their way to job security.
Mar 31 2015
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/29/15 11:50 PM, george wrote:
 http://bioinformatics.oxfordjournals.org/content/early/2015/02/18/bioinformatics.btv098.full.pdf+html


 and a feature
 http://google-opensource.blogspot.nl/2015/03/gsoc-project-sambamba-published-in.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+GoogleOpenSourceBlog+(Google+Open+Source+Blog)



 D may hold a sweet spot in bioinformatics where you often require quick
 turnaround (productivity) , raw speed and agility.
Nice! Went to post it on reddit; it was already there: http://www.reddit.com/r/programming/comments/30tvlf/d_in_bioinformatics_gsoc_project_sambamba/

More:
https://news.ycombinator.com/newest
https://twitter.com/D_Programming/status/582603844355424257
https://www.facebook.com/dlang.org/posts/1041963349150679

Andrei
Mar 30 2015