digitalmars.D.announce - Updates to the tsv-utils toolkit

Jon Degenhardt <jond noreply.com> writes:
It's not quite a year since the open-sourcing of eBay's tsv 
utilities. Since then there have been a number of additions and 
updates, and the tools form a more complete package. The tools 
assist with manipulation of tabular data files common in machine 
learning and data mining environments. They work alongside 
traditional Unix command line tools like 'cut' and 'sort'. They 
also fit well with data mining and stats packages like R and 
Pandas.

The tools include filtering, slicing, joins and other 
manipulation, sampling, and statistical calculations. If you find 
yourself working with large data files from a Unix shell, you may 
like these tools.

Speed matters when processing large data files, and these tools 
are fast. I've published new benchmarks comparing the tools to 
similar tools written in several native compiled programming 
languages. The tools are the fastest on five of the six 
benchmarks run, generally by significant margins. It's a good 
result for the D programming language. The benchmarks may be of 
interest regardless of your interest in the tools themselves.

Repository: https://github.com/eBay/tsv-utils-dlang
Performance benchmarks: 
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md

--Jon
Feb 22 2017
Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt 
wrote:
 Speed matters when processing large data files, and these tools 
 are fast. I've published new benchmarks comparing the tools to 
 similar tools written in several native compiled programming 
 languages. The tools are the fastest on five of the six 
 benchmarks run, generally by significant margins. It's a good 
 result for the D programming language.
Great news!
 The specialty toolkits have been anonymized in the tables below.
 The purpose of these benchmarks is to gauge performance of the D
 tools, not make comparisons between other toolkits.
You're no fun ;)
Feb 22 2017
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 22 February 2017 at 18:43:57 UTC, Jack Stouffer 
wrote:
 On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt 
 wrote:
 Speed matters when processing large data files, and these 
 tools are fast. I've published new benchmarks comparing the 
 tools to similar tools written in several native compiled 
 programming languages. The tools are the fastest on five of 
 the six benchmarks run, generally by significant margins. It's 
 a good result for the D programming language.
 Great news!
Agreed, an outstanding result. I had not anticipated the deltas.
 The specialty toolkits have been anonymized in the tables below.
 The purpose of these benchmarks is to gauge performance of the D
 tools, not make comparisons between other toolkits.
 You're no fun ;)
Yeah, I know. Not my style.
Feb 22 2017
bpr <brogoff gmail.com> writes:
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt 
wrote:
...snip...
 Repository: https://github.com/eBay/tsv-utils-dlang
 Performance benchmarks: 
 https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md

 --Jon
This is very nice code, and a good result for D. I'll study this 
carefully. So much of data analysis is reading/transforming files...

I wish you didn't anonymize the specialty toolkits. I think I 
understand why you chose to do so, but it makes the comparison less 
valuable. Still, great work! Looking forward to a blog post.
Feb 22 2017
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 22 February 2017 at 21:07:43 UTC, bpr wrote:
 On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt 
 wrote:
 ...snip...
 Repository: https://github.com/eBay/tsv-utils-dlang
 Performance benchmarks: 
 https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md

 --Jon
 This is very nice code, and a good result for D. I'll study this 
 carefully. So much of data analysis is reading/transforming files...
 ...snip...
Thanks! Both for the feedback and for any evaluation you might do. 
Any insights or thoughts you may have would be quite welcome.

--Jon
Feb 22 2017
Joakim <dlang joakim.fea.st> writes:
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt 
wrote:
 It's not quite a year since the open-sourcing of eBay's tsv 
 utilities. Since then there have been a number of additions and 
 updates, and the tools form a more complete package. The tools 
 assist with manipulation of tabular data files common in 
 machine learning and data mining environments. They work 
 alongside traditional Unix command line tools like 'cut', and 
 'sort'. They also fit well with data mining and stats packages 
 like R and Pandas.

 [...]
Nice writeup. Is somebody posting this to Reddit, or will that be done with a future blog post?
Feb 22 2017
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 22 February 2017 at 18:12:50 UTC, Jon Degenhardt 
wrote:
 It's not quite a year since the open-sourcing of eBay's tsv 
 utilities. Since then there have been a number of additions and 
 updates, and the tools form a more complete package. The tools 
 assist with manipulation of tabular data files common in 
 machine learning and data mining environments. They work 
 alongside traditional Unix command line tools like 'cut', and 
 'sort'. They also fit well with data mining and stats packages 
 like R and Pandas.

 The tools include filtering, slicing, joins and other 
 manipulation, sampling, and statistical calculations. If you 
 find yourself working with large data files from a unix shell, 
 you may like these tools.

 Speed matters when processing large data files, and these tools 
 are fast. I've published new benchmarks comparing the tools to 
 similar tools written in several native compiled programming 
 languages. The tools are the fastest on five of the six 
 benchmarks run, generally by significant margins. It's a good 
 result for the D programming language. The benchmarks may be of 
 interest regardless of your interest in the tools themselves.

 Repository: https://github.com/eBay/tsv-utils-dlang
 Performance benchmarks: 
 https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md

 --Jon
One more update: Schveiguy helped identify the performance 
bottleneck in the csv2tsv tool, and the tools are now the fastest on 
all six benchmarks. The benchmarks have been updated (and reformatted 
a bit). Summary table here: 
https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md#top-four-in-each-benchmark
Mar 04 2017