digitalmars.D - Big Data Ecosystem
- Eduard Staniloiu (6/6) Jul 09 2019 Cheers, everybody!
- Andre Pany (26/32) Jul 09 2019 Big data is a broad topic:), you can achieve it with specific
- bachmeier (7/9) Jul 11 2019 For the record, this *is* something that can be done because
- jmh530 (8/17) Jul 11 2019 In something like two minutes of googling, I found that Apache
- Andre Pany (8/27) Jul 11 2019 Thanks. The benefit of Parquet in contrast to e.g hdf5 is the
- bioinfornatics (7/13) Jul 10 2019 Dear,
- Les De Ridder (3/15) Jul 11 2019 In my experience, the performance of Spark in particular leaves
- Laeeth Isharc (30/36) Jul 12 2019 Weka.io of course have the world's fastest file system and I
Cheers, everybody!

I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?

Thank you,
Edi
Jul 09 2019
On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
> Cheers, everybody!
>
> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>
> Thank you,
> Edi

Big data is a broad topic :). You can approach it with specific software like Spark or Kafka, with cloud storage services like AWS S3, or with well-known databases like Postgres.

For Kafka there is a Deimos binding for librdkafka available here: https://github.com/DlangApache/librdkafka. There is also a native D implementation, but it is unfortunately no longer maintained: https://github.com/tamediadigital/kafka-d.

For AWS services, I prefer the AWS CLI executable. It accepts JSON input and also outputs JSON. From the official AWS service metadata files you can easily create D structs and classes (https://github.com/aws/aws-sdk-js/tree/master/apis). It almost feels like the real AWS SDK available e.g. for Python, Java, or C++. For AWS S3 there is also a native D implementation based on vibe.d.

For Postgres you can e.g. use this great library: https://github.com/adamdruppe/arsd/blob/master/postgres.d.

One way or another, Big Data scenarios need HTTP clients and servers; here, too, the ARSD library has some lightweight components. The current GSoC project on dataframes is also an important part of Big Data.

What I currently really miss is the ability to read/write Parquet files.

Kind regards
Andre
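Not from Andre's post, but a minimal sketch of the CLI-as-SDK approach he describes: drive the aws executable from D with std.process and parse its JSON output with std.json. The bucket name is hypothetical, and the sketch assumes the CLI is installed and configured.

// Minimal sketch: call the AWS CLI from D and parse its JSON output.
// Assumes the `aws` executable is on PATH and credentials are configured.
import std.process : execute;
import std.json : parseJSON;
import std.stdio : writeln;

void main()
{
    // List objects in a (hypothetical) bucket; s3api calls return JSON.
    auto result = execute(["aws", "s3api", "list-objects-v2",
                           "--bucket", "my-example-bucket",
                           "--output", "json"]);
    if (result.status != 0)
    {
        writeln("aws cli failed: ", result.output);
        return;
    }

    auto json = parseJSON(result.output);
    // "Contents" holds the object entries in list-objects-v2 output.
    foreach (obj; json["Contents"].array)
        writeln(obj["Key"].str, "  ", obj["Size"].integer, " bytes");
}

The same pattern, with struct definitions generated from the service metadata files and filled from/serialized to JSON, extends to the other AWS services mentioned above.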
Jul 09 2019
On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:
> What I currently really miss is the ability to read/write Parquet files.

For the record, this *is* something that can be done, because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files, so I can't comment on the details.
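Not part of the post above, but one concrete shape the R route could take: shell out to Rscript, let an R package convert Parquet to CSV, and read the CSV from D. The sketch assumes the R `arrow` package (sparklyr would additionally need a Spark session), and the file names are hypothetical.

// Sketch: use R (via Rscript) as a Parquet bridge, then read the CSV in D.
// Assumes Rscript is installed and the R `arrow` package is available.
import std.process : execute;
import std.csv : csvReader;
import std.file : readText;
import std.stdio : writeln;

void main()
{
    // R one-liner: read Parquet, dump to CSV (hypothetical file names).
    enum rCode = `write.csv(arrow::read_parquet("data.parquet"), "data.csv", row.names = FALSE)`;
    auto r = execute(["Rscript", "-e", rCode]);
    if (r.status != 0)
    {
        writeln("Rscript failed: ", r.output);
        return;
    }

    foreach (record; csvReader!(string[string])(readText("data.csv"), null))
        writeln(record);
}

An embedded-R bridge that keeps the data in memory, instead of round-tripping through CSV, would be the less wasteful version of the same idea.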
Jul 11 2019
On Thursday, 11 July 2019 at 18:12:15 UTC, bachmeier wrote:
> On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:
>> What I currently really miss is the ability to read/write Parquet files.
>
> For the record, this *is* something that can be done, because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files, so I can't comment on the details.

In something like two minutes of googling, I found that Apache Arrow [1] has C bindings [2] for Parquet's C++ read/write utilities. I know nothing about Parquet files, but I imagine this would be faster than calling the R packages.

[1] https://github.com/apache/arrow
[2] https://github.com/apache/arrow/tree/master/c_glib/parquet-glib
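Not from the post above, but to give a feel for the C-bindings route: one could declare a couple of extern(C) prototypes against parquet-glib by hand, or generate them with a tool such as dpp. The type and function names below are assumptions based on the Arrow GLib documentation and need to be checked against the installed headers; linking against parquet-glib/arrow-glib is required to actually run it.

// Hand-written binding sketch against parquet-glib.
// NOTE: the names below are assumed, not verified -- check the C headers.
extern (C)
{
    struct GError;                   // opaque GLib error type
    struct GParquetArrowFileReader;  // opaque reader handle (assumed name)
    struct GArrowTable;              // opaque Arrow table handle (assumed name)

    GParquetArrowFileReader* gparquet_arrow_file_reader_new_path(
        const(char)* path, GError** error);
    GArrowTable* gparquet_arrow_file_reader_read_table(
        GParquetArrowFileReader* reader, GError** error);
}

void main()
{
    import std.stdio : writeln;
    import std.string : toStringz;

    GError* err;
    // Hypothetical file name.
    auto reader = gparquet_arrow_file_reader_new_path("data.parquet".toStringz, &err);
    if (reader is null)
    {
        writeln("failed to open Parquet file");
        return;
    }
    auto table = gparquet_arrow_file_reader_read_table(reader, &err);
    writeln(table !is null ? "read a table" : "read failed");
}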
Jul 11 2019
On Thursday, 11 July 2019 at 20:00:19 UTC, jmh530 wrote:
> On Thursday, 11 July 2019 at 18:12:15 UTC, bachmeier wrote:
>> On Tuesday, 9 July 2019 at 21:16:03 UTC, Andre Pany wrote:
>>> What I currently really miss is the ability to read/write Parquet files.
>>
>> For the record, this *is* something that can be done, because there are R packages (like sparklyr) that do it, and that means you can do it from D as well. Now maybe you mean you want an interface written in D, but the functionality is nonetheless easily available to D programs. I've never worked with Parquet files, so I can't comment on the details.
>
> In something like two minutes of googling, I found that Apache Arrow [1] has C bindings [2] for Parquet's C++ read/write utilities. I know nothing about Parquet files, but I imagine this would be faster than calling the R packages.
>
> [1] https://github.com/apache/arrow
> [2] https://github.com/apache/arrow/tree/master/c_glib/parquet-glib

Thanks. The benefit of Parquet in contrast to e.g. HDF5 is the file size: a 500 MB CSV has a size of 300 MB as HDF5 and 180 MB as Parquet. The file size matters when you need to read from and write to e.g. AWS S3.

Kind regards
Andre
Jul 11 2019
On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
> Cheers, everybody!
>
> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>
> Thank you,
> Edi

Dear,

To be fair, if you need something ready to use, go to Scala and Java through Spark, deeplearning4j and others. Otherwise, you are welcome to demonstrate to the world the power of D in this field.

Best regards
Jul 10 2019
On Wednesday, 10 July 2019 at 21:56:19 UTC, bioinfornatics wrote:
> On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
>> Cheers, everybody!
>>
>> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>>
>> Thank you,
>> Edi
>
> Dear,
>
> To be fair, if you need something ready to use, go to Scala and Java through Spark, deeplearning4j and others.

In my experience, the performance of Spark in particular leaves much to be desired when you don't have a large Hadoop cluster.
Jul 11 2019
On Tuesday, 9 July 2019 at 16:58:56 UTC, Eduard Staniloiu wrote:
> Cheers, everybody!
>
> I was wondering what is the current state of affairs of the D ecosystem with respect to Big Data: are there any libraries out there? If so, which?
>
> Thank you,
> Edi

Weka.io of course have the world's fastest file system, and I understand ML at scale is one hot market for them. It's simple to get going from what I saw, and it's not expensive in the scheme of things. I don't really understand myself why you would use cloud in many cases, but it does work on the cloud if you want.

I guess you know mir and Lubeck. There's LDA tucked away there in case you need it. James Thompson's lightning talk was quite interesting - sometimes doing things efficiently can reduce the need for all the complexity of some of the standard approaches.

I don't know if you consider Postgres part of big data solutions, but with TimescaleDB maybe. You can quite easily write Foreign Data Wrappers in D to integrate with other data sources, and you can also write server-side functions. I have done maybe half the work for that but didn't get time to finish yet. DPP more or less works for the Postgres headers.

Joyent have an interesting approach to working on big data the UNIX way. They have an object store called Manta that allows you to run code on the same node as the data (stored using ZFS). One could do something similar in D. I wanted to get comfortable with SmartOS, but I don't think it's ready for us today. However, one could do something similar home-rolled with ZFS and Linux containers. I wrapped libzfscore and lxd - alpha quality right now. Not sure if I pushed the latest versions to GitHub yet.

For syncing stuff across a WAN between regions, TCP doesn't have great throughput. You can either strap together a bunch of connections or use something on top of UDP to make it reliable. We found UDT-D gave us 300x faster file transfers between London and HK. It's up at GitHub, though not very polished code.
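None of the following is from the post above, but as a toy illustration of the "strap together a bunch of connections" idea: fetch a large object as parallel byte ranges (here by shelling out to curl with -r) and stitch the parts back together. URL, object size and file names are hypothetical, and a reliable-UDP transport like UDT of course attacks the problem at a lower level.

// Toy sketch: multiplex a WAN transfer across several connections by
// downloading parallel byte ranges and concatenating them in order.
import std.parallelism : parallel;
import std.process : execute;
import std.range : iota;
import std.file : read, write, append, remove;
import std.format : format;
import std.stdio : writeln;

void main()
{
    enum url = "https://example.com/big-object.bin";  // hypothetical
    enum totalSize = 1_000_000_000L;  // assume the object size is known up front
    enum parts = 8;
    immutable chunk = (totalSize + parts - 1) / parts;

    // Fetch each byte range on its own connection.
    foreach (i; parallel(iota(parts)))
    {
        immutable start = i * chunk;
        immutable end = (start + chunk < totalSize ? start + chunk : totalSize) - 1;
        auto r = execute(["curl", "-s", "-r", format("%d-%d", start, end),
                          "-o", format("part%d", i), url]);
        if (r.status != 0)
            writeln("range ", i, " failed");
    }

    // Stitch the parts back together in order.
    write("big-object.bin", "");
    foreach (i; 0 .. parts)
    {
        append("big-object.bin", read(format("part%d", i)));
        remove(format("part%d", i));
    }
}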
Jul 12 2019