
digitalmars.D - Big Data Ecosystem

Eduard Staniloiu <edi33416 gmail.com> writes:
Cheers, everybody!

I was wondering what is the current state of affairs of the D 
ecosystem with respect to Big Data: are there any libraries out 
there? If so, which?

Thank you,
Edi
Jul 09 2019
Andre Pany <andre s-e-a-p.de> writes:
On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
 Cheers, everybody!

 I was wondering what is the current state of affairs of the D 
 ecosystem with respect to Big Data: are there any libraries out 
 there? If so, which?

 Thank you,
 Edi
Big data is a broad topic :). You can approach it with dedicated software like Spark or Kafka, with cloud storage services like AWS S3, or with well-known databases like Postgres.

For Kafka there is a Deimos binding for librdkafka available here: https://github.com/DlangApache/librdkafka. There is also a native D implementation, but it is unfortunately no longer maintained: https://github.com/tamediadigital/kafka-d.

For AWS services, I prefer the AWS client executable. It accepts JSON input and also outputs JSON (see the sketch at the end of this post). From the official AWS service metadata files (https://github.com/aws/aws-sdk-js/tree/master/apis) you can easily generate D structs and classes. It almost feels like the real AWS SDK that is available for e.g. Python, Java, and C++. For AWS S3 there is also a native D implementation based on vibe.d.

For Postgres you can e.g. use this great library: https://github.com/adamdruppe/arsd/blob/master/postgres.d. One way or another, Big Data scenarios need HTTP clients and servers, and here too the arsd library has some lightweight components.

The current GSoC project on dataframes is also an important part of Big Data.

What I currently really miss is the possibility to read/write Parquet files.

Kind regards
Andre
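A minimal sketch of the JSON round-trip described above: shell out to the AWS CLI from D and parse its reply with Phobos. It assumes the aws executable is installed and configured; the bucket name is hypothetical and error handling is kept minimal.

import std.process : execute;
import std.json : parseJSON;
import std.stdio : writeln;

void main()
{
    // Ask the AWS CLI for the objects in a (hypothetical) bucket;
    // --output json makes the reply machine-readable.
    auto result = execute(["aws", "s3api", "list-objects-v2",
                           "--bucket", "my-example-bucket",
                           "--output", "json"]);
    if (result.status != 0)
    {
        writeln("aws cli failed: ", result.output);
        return;
    }

    // Parse the JSON reply and print each object's key and size.
    auto json = parseJSON(result.output);
    foreach (obj; json["Contents"].array)
        writeln(obj["Key"].str, " (", obj["Size"].integer, " bytes)");
}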
Jul 09 2019
bachmeier <no spam.net> writes:
On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:

 What I currently really miss is the possibility to read/write 
 Parquet files.
For the record, this *is* something that can be done because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files so I can't comment on the details.
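As a rough illustration of the D-calling-R route, here is a minimal sketch assuming bachmeier's embedr library (https://github.com/bachmeil/embedr) plus an R installation with the arrow package; the file name is hypothetical:

import embedr.r;   // evalRQ evaluates R code from inside a D program

void main()
{
    // The heavy lifting happens in R; D just orchestrates.
    evalRQ(`library(arrow)`);
    evalRQ(`df <- read_parquet("data.parquet")`);  // hypothetical file
    evalRQ(`print(dim(df))`);                      // rows/columns sanity check
}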
Jul 11 2019
jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 11 July 2019 at 18:12:15 UTC, bachmeier wrote:
 On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:

 What I currently really miss is the possibility to read/write 
 Parquet files.
For the record, this *is* something that can be done because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files so I can't comment on the details.
In something like two minutes of googling, I found that Apache Arrow [1] has C bindings [2] for parquet's C++ read/write utilities. I know nothing about Parquet files, but I imagine this would be faster than calling the R packages. [1] https://github.com/apache/arrow [2] https://github.com/apache/arrow/tree/master/c_glib/parquet-glib
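If one wanted to try those C bindings from D without a binding generator, a hand-written extern(C) sketch might look like the following. The signatures are my reading of the parquet-glib/arrow-glib headers; verify them (or generate real bindings with dpp) before relying on this. "data.parquet" is a hypothetical input file, and you would link against parquet-glib, arrow-glib, and glib-2.0.

import std.stdio : writeln;
import std.string : fromStringz;

extern (C)
{
    // Minimal GLib error type (GQuark domain, code, message).
    struct GError { uint domain; int code; char* message; }
    void g_error_free(GError* error);

    // Opaque handles from parquet-glib / arrow-glib.
    struct GParquetArrowFileReader;
    struct GArrowTable;

    GParquetArrowFileReader* gparquet_arrow_file_reader_new_path(
        const(char)* path, GError** error);
    GArrowTable* gparquet_arrow_file_reader_read_table(
        GParquetArrowFileReader* reader, GError** error);
    char* garrow_table_to_string(GArrowTable* table, GError** error);
}

void main()
{
    GError* err = null;

    auto reader = gparquet_arrow_file_reader_new_path("data.parquet", &err);
    if (reader is null)
    {
        writeln("open failed: ", err.message.fromStringz);
        g_error_free(err);
        return;
    }

    auto table = gparquet_arrow_file_reader_read_table(reader, &err);
    if (table is null)
    {
        writeln("read failed: ", err.message.fromStringz);
        g_error_free(err);
        return;
    }

    // Dump the whole table as text; fine for a smoke test.
    writeln(garrow_table_to_string(table, &err).fromStringz);
}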
Jul 11 2019
Andre Pany <andre s-e-a-p.de> writes:
On Thursday, 11 July 2019 at 20:00:19 UTC, jmh530 wrote:
 On Thursday, 11 July 2019 at 18:12:15 UTC, bachmeier wrote:
 On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:

 What I currently really miss is the possibility to read/write 
 Parquet files.
For the record, this *is* something that can be done because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files so I can't comment on the details.
In something like two minutes of googling, I found that Apache Arrow [1] has C bindings [2] for parquet's C++ read/write utilities. I know nothing about Parquet files, but I imagine this would be faster than calling the R packages. [1] https://github.com/apache/arrow [2] https://github.com/apache/arrow/tree/master/c_glib/parquet-glib
Thanks. The benefit of Parquet in contrast to e.g. HDF5 is the file size: a 500 MB CSV comes out at roughly 300 MB as HDF5 and 180 MB as Parquet. File size matters when you need to read from and write to e.g. AWS S3.

Kind regards
Andre
Jul 11 2019
bioinfornatics <bioinfornatics fedoraproject.org> writes:
On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
 Cheers, everybody!

 I was wondering what is the current state of affairs of the D 
 ecosystem with respect to Big Data: are there any libraries out 
 there? If so, which?

 Thank you,
 Edi
Dear all,

To be fair, if you need something that is ready to use, go to Scala and Java via Spark, deeplearning4j, and others. Otherwise, you are welcome to demonstrate to the world the power of D in this field.

Best regards
Jul 10 2019
Les De Ridder <les lesderid.net> writes:
On Wednesday, 10 July 2019 at 21:56:19 UTC, bioinfornatics wrote:
 On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
 Cheers, everybody!

 I was wondering what is the current state of affairs of the D 
 ecosystem with respect to Big Data: are there any libraries 
 out there? If so, which?

 Thank you,
 Edi
 Dear all, To be fair, if you need something that is ready to use, go to Scala and Java via Spark, deeplearning4j, and others.
In my experience, the performance of Spark in particular leaves much to be desired when you don't have a large Hadoop cluster.
Jul 11 2019
Laeeth Isharc <laeeth kaleidic.io> writes:
On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
 Cheers, everybody!

 I was wondering what is the current state of affairs of the D 
 ecosystem with respect to Big Data: are there any libraries out 
 there? If so, which?

 Thank you,
 Edi
Weka.io of course have the world's fastest file system, and I understand ML at scale is one hot market for them. It's simple to get going from what I saw, and it's not expensive in the scheme of things. I don't really understand myself why you would use cloud in many cases, but it does work on the cloud if you want.

I guess you know mir and Lubeck. There's LDA tucked away there in case you need it. James Thompson's lightning talk was quite interesting: sometimes doing things efficiently can reduce the need for all the complexity of some of the standard approaches.

I don't know if you consider Postgres part of big data solutions, but with TimescaleDB maybe. You can quite easily write Foreign Data Wrappers in D to integrate with other data sources, and you can also write server-side functions. I have done maybe half the work for that but didn't get time to finish yet. DPP more or less works for the Postgres headers.

Joyent have an interesting approach to working on big data the UNIX way. They have an object store called Manta that allows you to run code on the same node as the data (stored using ZFS). One could do something similar in D. I wanted to get comfortable with SmartOS, but I don't think it's ready for us today. However, one could do something similar home-rolled with ZFS and Linux containers. I wrapped libzfscore and lxd - alpha quality right now. Not sure if I pushed the latest versions to GitHub yet.

For syncing stuff across a WAN between regions, TCP doesn't have great throughput. You can either strap together a bunch of connections or use something on top of UDP to make it reliable. We found UDT-D gave us 300x faster file transfers between London and HK. It's up at GitHub, though it's not very polished code.
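A toy sketch of the "bunch of connections" idea, using only Phobos: split a file into chunks and push each chunk over its own TCP connection in parallel. The host, port, and framing are hypothetical, partial sends are ignored for brevity, and a real receiver would need matching reassembly logic.

import std.parallelism : parallel;
import std.range : iota;
import std.socket : TcpSocket, InternetAddress;
import std.stdio : File;
import std.bitmanip : nativeToBigEndian;

void sendParallel(string path, string host, ushort port, int nConns)
{
    immutable size = File(path).size;               // total bytes to send
    immutable chunk = (size + nConns - 1) / nConns; // bytes per connection

    // One TCP connection per chunk, pushed concurrently.
    foreach (i; parallel(iota(nConns)))
    {
        immutable offset = i * chunk;
        if (offset < size)
        {
            immutable len = (offset + chunk > size) ? size - offset : chunk;

            auto f = File(path);                    // separate handle per task
            f.seek(offset);
            auto buf = f.rawRead(new ubyte[cast(size_t) len]);

            auto sock = new TcpSocket(new InternetAddress(host, port));
            scope (exit) sock.close();

            // Hypothetical framing: 8-byte big-endian offset, then the data.
            auto header = nativeToBigEndian(cast(ulong) offset);
            sock.send(header[]);
            sock.send(buf);
        }
    }
}

void main()
{
    // Hypothetical endpoint; the receiver must speak the same framing.
    sendParallel("big.dat", "example.com", 9000, 8);
}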
Jul 12 2019